Einat Orr, PhD

Last updated on May 23, 2024

Someone in the organization owns the data lake or data lakehouse. The owner's title may change, but the task at hand remains the same. Whether you're on a DataOps, Data Engineering, or MLOps team, that ownership means optimizing storage costs is now part of the job.

It’s your job to keep the lake clean and clear while minimizing storage costs. Fortunately, if you adopt the right approach, those goals can be achieved simultaneously.

In this article, we will review the common reasons behind a dirty and oversized data lake (often referred to as a data swamp) and how to avoid it using the right technology.

My lake is a swamp: what went wrong?

Organizations strive to make the most of their data. To do so, they encourage many stakeholders to access and rely on it for their needs. 

This approach to data management is referred to as data democratization. In enterprises, it comes with governance to ensure compliance with standards such as GDPR.

While this approach may maximize the value of the data, it takes its toll on storage consumption for the following reasons:

1. Duplication for isolation

Data practitioners want to make changes to data without affecting others while developing, experimenting with, and testing pipelines or models. In the absence of other tools, the only options left are copying the data or copying a sample of it. Both approaches scatter duplicates of datasets and samples across the lake, and they are a main building block of the swamp.

2. Duplication for reproducibility

It is a well-known good practice in research and engineering to make sure you can reproduce the results of a computation whose insights you share with others. In many business verticals it is more than a good practice: regulation may require reproducibility for as long as seven years back. Organizations often address this by saving the inputs and outputs of a data product for every delivery. Without the right tools, this usually means saving the same data many times, since the same dataset may feed several data products' insights.

3. Retention policies

A data retention policy is a component of an organization's overall data management strategy. Since data accumulates rapidly, it is critical that enterprises determine how long they must retain certain data.

An organization should only keep data for as long as necessary, whether that's five months or seven years. Retaining data for longer than necessary consumes extra storage space and increases expenses. Managing a safe data retention policy is not an easy task, and the many "How I deleted production data" stories hint at the fire drills it entails. A good retention policy accurately represents business needs and is easily derived from them: access to the data you need, at the performance you need, with minimal cost. When retention isn't handled well, it adds to the creation of a swamp.

4. Suboptimal use of storage tiers

A good retention strategy includes the use of storage tiers. The major cloud providers offer storage tiers whose cost depends on how frequently the data is accessed. Using tiering optimally is a critical part of controlling storage costs.

In some cases the cloud provider's tiering automation is good enough; in other cases it is better to implement your own optimization based on your business logic.
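
To make this concrete, here is a minimal sketch of provider-side tiering automation: an S3 lifecycle rule applied with boto3. The bucket name, prefix, and day thresholds are assumptions chosen for the example; adjust them to your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; the thresholds are illustrative only.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-landing-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    # Move objects to Infrequent Access after 30 days...
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # ...and to Glacier after 90 days.
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Delete objects the business no longer needs after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```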

Reduce data storage costs by using data version control

Working in Isolation

Building and maintaining multiple ETLs requires teams to develop and test new and existing pipelines on a regular basis.

To test ETL pipelines properly on the entire data lake, most data engineers create their own copy of the entire lake and test on it. Done regularly, this practice multiplies data storage usage, which is entirely unnecessary.

The solution is a data versioning tool with branching capabilities, which makes it possible to create a development and testing environment without copying the data itself.

Such a solution can significantly reduce data storage costs and increase data engineering efficiency by eliminating the need to copy, maintain, and dispose of multiple data lake clones.

lakeFS is an open-source solution that provides Git-like operations on data lakes. Using lakeFS, data engineering teams can instantly create a development and testing environment without copying anything, as lakeFS uses metadata (pointers to the data) rather than copies of the data itself.

[Figure: Creating a development and testing environment on a lakeFS branch. Source: lakeFS]

By creating an isolated branch of the entire data lake, without copying it, data engineers can safely develop, test, and maintain high-quality, well-tested ETLs with no impact on storage costs, reducing spend on data storage by up to 80%.
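
As a minimal sketch of this workflow, the snippet below creates an isolated test branch with the high-level lakeFS Python SDK (the lakefs package). The repository, branch, and object names are assumptions for the example; check the SDK documentation for the exact method signatures in the version you use.

```python
import lakefs

# Hypothetical repository name; assumes lakeFS credentials are already
# configured (e.g. via environment variables or ~/.lakectl.yaml).
repo = lakefs.repository("example-data-lake")

# Branch off main: only metadata (pointers) is created, no objects are copied.
test_env = repo.branch("etl-test-env").create(source_reference="main")

# Run the ETL against the isolated branch; writes land on the branch
# and never touch main.
test_env.object("curated/daily_report.parquet").upload(data=b"...report bytes...")
test_env.commit(message="Test run of the daily report ETL")

# If validation passes, promote the changes; otherwise simply delete the branch.
test_env.merge_into(repo.branch("main"))
```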

Achieving Reproducibility

Many organizations need to regularly keep multiple copies of their data to track the changes that happen throughout its lifecycle. Whenever there is an error or inconsistency in the data, being able to time-travel between these copies helps a great deal in identifying when the error was introduced and reproducing what caused it.

However, keeping versions of the data by copying it carries a significant storage cost and requires deleting, managing, and maintaining those copies regularly.

Keeping track of what has changed in the data over time is very important, but there is a better way to achieve it: keep as many versions of the data as needed instead of numerous physical copies.

The lakeFS open-source project can help here, as it allows users to keep multiple versions of the data without copying anything. This is an easy way to maintain and track a history of the data's changes over time.

On top of that, lakeFS provides tagging, a way to give a meaningful name to a specific version of the data, so you get the value of a tracked version history without creating and maintaining multiple copies of the data.
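
Here is a minimal sketch of that pattern with the lakeFS Python SDK: commit a delivery, tag the resulting version, and later read the inputs exactly as they were. The repository, tag, and path names are assumptions for the example.

```python
import lakefs

repo = lakefs.repository("example-data-lake")
main = repo.branch("main")

# Commit the state of the lake at delivery time; the commit pins every
# object exactly as it was at that moment.
main.commit(message="Monthly revenue report delivery, 2024-05")

# Give that version a meaningful, stable name by tagging main's current head.
repo.tag("revenue-report-2024-05").create("main")

# Later (even years later), read the inputs exactly as they were:
delivery = repo.ref("revenue-report-2024-05")
for obj in delivery.objects(prefix="input/"):
    print(obj.path)
```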

Optimizing Data Retention

lakeFS supports lifecycle-based retention configuration, as is customary with cloud providers, but its strength is in purging objects in the data lake that are no longer referenced by any live commit or branch. This makes it easy to control retention with a commit lifecycle policy and the right branching strategy. No more endless digging through datasets: retention is optimized by the business logic already embedded in the organization's branching and committing strategy. Read more about lakeFS GC.
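
As an illustration, garbage collection rules boil down to a short retention policy per branch. The sketch below expresses such a policy as a Python dict mirroring the GC rules JSON; the day counts and branch names are assumptions, and the rules would be applied through lakectl or the lakeFS API.

```python
# lakeFS garbage collection rules: commits older than the retention period
# that are no longer referenced become eligible for hard deletion.
gc_rules = {
    # Fallback for branches without an explicit rule.
    "default_retention_days": 21,
    "branches": [
        # Keep production history longer...
        {"branch_id": "main", "retention_days": 90},
        # ...and clean up short-lived test branches aggressively.
        {"branch_id": "etl-test-env", "retention_days": 7},
    ],
}
```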

Summary

Reducing data storage costs in the cloud is always a good practice. In times of economic uncertainty, the ability to save money is key to ensuring the longevity and resilience of organizations.

Applying data-specific cost reduction methods can be a one-time engineering investment that can prove itself even in the short term, and undoubtedly in the long run.
