lakeFS Community
Nir Ozeri

August 28, 2023

In today’s data-driven world, managing vast amounts of data efficiently is crucial for organizations of all sizes. As data lakes and object storage systems grow in popularity, the need for robust data versioning and governance solutions grows with them.

Data retention and storage optimization have become increasingly complex tasks, even more so when dealing with versioned data: versions can quickly accumulate, leading to data redundancy and unnecessary storage consumption.

In addition, compliance with regulatory requirements (such as GDPR, FINRA, and HIPAA) requires organizations to dispose of data in a manner consistent with privacy rights and data retention policies. lakeFS is designed to provide version control while deduplicating storage, so using it is a natural first step toward a storage-optimized data lake.
One of lakeFS’ standout features, the Garbage Collector, addresses both storage optimization and regulatory compliance.

lakeFS’ Garbage Collector (GC) has undergone several design iterations during its lifecycle, and recently went through a major overhaul that consolidated garbage collection of committed and uncommitted data into a single process.

In this article, we will discuss the current GC capabilities, explain how we handle GC at scale, and cover the main differences between self-managed GC and our Cloud GC solution.

Garbage Collection in lakeFS

Garbage Collection (GC) can be difficult for big data lakes: it can be both time- and resource-consuming, and practically impossible to scale if done incorrectly.

Introducing versioned data into the equation makes this an even more challenging task.
To tackle these challenges, lakeFS’ GC employs the following mechanisms:

  1. Commit Graph and Reference Tracking – Tracking data references in repositories and branches, the Garbage Collector detects data which is no longer referenced and marks it for deletion according to the retention policy. This ensures that data versions persist as long as they are tied to an existing branch or tag. This safeguard prevents inadvertent deletion of data that may still be in use, maintaining data integrity and continuity in downstream processes.
  2. Granular Data Retention Policies – GC allows for the implementation of granular data retention policies. These policies dictate the retention and removal of data versions based on business rules, compliance requirements, and usage patterns, enabling organizations to strike a balance between data availability and storage cost-efficiency (see the example policy after this list).
  3. Mark and Sweep functionality – GC uses a mark and sweep strategy, allowing users to have more control over the process, split the workload, and enable auditing.

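To make this concrete, here is a sketch of what a repository’s garbage collection rules might look like, following the JSON rule format described in the lakeFS documentation (the branch names and retention periods are purely illustrative): deleted objects on main are kept for 28 days, objects on dev for 7 days, and every other branch falls back to a 14-day default.

```json
{
  "default_retention_days": 14,
  "branches": [
    { "branch_id": "main", "retention_days": 28 },
    { "branch_id": "dev", "retention_days": 7 }
  ]
}
```
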
lakeFS GC distinguishes between two types of data:

  1. Committed Data – data that was committed to a branch in lakeFS
    With committed data, the decision whether or not to delete the data is determined by the retention policy, which can be defined at the repository and branch scope.
  2. Uncommitted Data – data that was uploaded to lakeFS but was not committed.
    For uncommitted data, GC will list all data that was uploaded to the repository, was not committed, and is no longer referenced by any branch’s staging area. This can occur in the following cases (the first is sketched in the example after this list):
    • An object is uploaded and subsequently deleted without previously committing it.
    • A reset was performed on a branch, causing it to drop some or all of the staging data.
    • A branch containing uncommitted data was deleted.

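As an illustration of the first scenario, the sketch below uses boto3 against lakeFS’ S3-compatible gateway; the endpoint, credentials, repository, and object path are placeholders, not values from this article. The object is uploaded and then deleted without ever being committed, so the physical copy in the object store ends up unreferenced and becomes a candidate for uncommitted garbage collection.

```python
import boto3

# Placeholders: point these at your own lakeFS installation and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",        # lakeFS S3 gateway endpoint (assumed)
    aws_access_key_id="<LAKEFS_ACCESS_KEY_ID>",
    aws_secret_access_key="<LAKEFS_SECRET_ACCESS_KEY>",
)

# Upload an object to the "main" branch of the "example-repo" repository.
s3.put_object(Bucket="example-repo", Key="main/tmp/scratch.parquet", Body=b"...")

# Delete it again without ever committing the branch.
s3.delete_object(Bucket="example-repo", Key="main/tmp/scratch.parquet")

# The underlying object is now unreferenced by any commit or staging area,
# so the uncommitted GC pass can mark it and later sweep it.
```
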
Garbage Collection (GC) At Scale

Under the hood, lakeFS’ GC is an Apache Spark application that runs as a job at the lakeFS repository level. It leverages Spark’s partitioning capabilities to divide the workload in a manner that optimally scans through the object store and the lakeFS metadata. To support and leverage these partitioning capabilities, we’ve modified lakeFS to partition the data in the underlying storage into prefixes. Follow this link for a deeper dive into this topic.
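
For reference, launching the GC Spark client yourself looks roughly like the sketch below. The class name, package version, endpoint, credentials, jar path, and region are placeholders based on the general pattern in the lakeFS documentation; check the docs for the exact values matching your Spark client version. The job takes the repository name as an argument, reads its metadata and retention rules through the lakeFS API, and scans the underlying storage namespace.

```sh
spark-submit --class io.treeverse.gc.GarbageCollection \
  --packages org.apache.hadoop:hadoop-aws:3.2.1 \
  -c spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
  -c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
  -c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
  -c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
  -c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
  <path-to-lakefs-spark-client-assembly.jar> \
  example-repo <storage-region>
```

Mark-only and sweep-only runs, which the mark and sweep split described above enables, are controlled through additional Spark configuration options documented in the lakeFS GC reference.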

As with most Spark applications, there’s a tradeoff between cost and efficiency. As a rule of thumb, allocating more resources means faster execution of the job, but it also means higher costs.
There’s a fine art to balancing cost and efficiency, and it requires mastering the Spark ecosystem as well as an in-depth understanding of the context (in our case, the inner workings of GC, lakeFS, and even the properties of the specific repository we are running GC on).
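
The knobs themselves are the standard Spark resource settings. For example, a run sized roughly like the execution described below might be launched with something along these lines on YARN (the numbers are purely illustrative, not a recommendation):

```sh
spark-submit \
  --num-executors 100 \
  --executor-cores 4 \
  --executor-memory 8g \
  ...
```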

lakeFS’ GC is battle tested in production and has been successfully executed on data lakes containing over 60 million objects.
In the example below, you can see a successful execution of the Garbage Collector (mark + sweep) running on a repository with ~60M objects.

In this example, we were running 100 Executors, and the total execution time was 46 minutes!

lakeFS Garbage Collection at scale

So what’s the maximal data lake size lakeFS GC can support, you might ask? To be honest, we don’t quite know. What we can say is that, theoretically, the process is bounded only by the resources provided and by time. You can probably get a rough estimate from the example above, and as the data lakes and repositories in lakeFS continue to grow, we will learn more about the limits – and, of course, improve the process as needed. One thing worth mentioning: when we talk about the size of a data lake, we are not talking about the capacity it takes up in the underlying storage but about the number of objects in it. The GC process will work the same for a data lake with 10 million objects, whether it takes up 10 PB or 5 MB of space.

As already mentioned, an efficient GC process – one that does not over-allocate resources, yet still finishes in a reasonable time – requires a good understanding of the GC internals. Understanding the run environment is also important for optimal execution, whether running from a Databricks cluster, AWS EMR, or a local environment. These configurations can become quite tricky and require some expertise with these services. And of course, let’s not forget about Spark…
If reading this paragraph gave you the kind of headache usually reserved for DevOps engineers 🙂 – fear not! We might have the remedy for you!

lakeFS Cloud – A Managed Garbage Collection Solution

lakeFS Cloud users enjoy a Managed Garbage Collection Service as part of their cloud installation. This service is completely managed by lakeFS and comes with the following perks:

  1. No operational overhead
  2. SLA for object deletion
  3. 24hr support from our team
  4. “Spark who?”

When creating a new installation, enable the Managed Garbage Collection feature.

Set up lakeFS managed garbage collection

All that’s left is to configure the retention rules for your repositories and branches, and let the Managed GC take care of the rest…

Summary

lakeFS’ Garbage Collector automates the process of identifying and removing redundant data versions. It optimizes storage utilization, improves data access performance and enables data governance.

Whether it’s minimizing storage costs, ensuring compliance, or enhancing data access speed, lakeFS’ Garbage Collection feature empowers organizations to stay on top of their data.

lakeFS Cloud’s Managed GC takes this feature to the next level by eliminating the operational, cost, and maintenance overhead.

To learn more about lakeFS and data garbage collection, follow these links:

  1. lakeFS Garbage Collection
  2. lakeFS Cloud Managed GC
  3. Uncommitted Garbage Collection and lakeFS data structure
  4. Data Version Control: What Is It and How Does It Work?
  5. Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects
