Yoni Augarten, Author

July 26, 2021
Guy Hardonag, Co-Creator

“I can remember everything. That’s my curse, young man. It’s the greatest curse that’s ever been inflicted on the human race: memory.”

— Jedediah Leland, Citizen Kane (1941)

lakeFS makes data corruption easy to avoid and fix by allowing you to travel back in time to any state of your data. This capability has an implication: by default, all objects are saved in the backing storage forever.

However, sometimes you may want to hard-delete your objects, namely delete them from the underlying storage. Reasons for this include cost-reduction and compliance with privacy policies.

Taking Inspiration: The Git filter-branch Command

In Git, the filter-branch command is provided as a means of rewriting history. It can be used, for instance, to remove a set of files from all past commits. For every commit, it goes over the corresponding file tree and reconstructs it without the removed files.

The documentation starts off with a handful of safety and performance warnings, attesting not only to the power this command brings, but also to the potential damage it can cause. The two major downsides of using this command are:

  1. Altering the entire commit lineage causes the hash of every commit to change. This means that the published history is changed. If you’ve ever collaborated on a codebase, or watched an HBO show – you know that trying to change the past is a bad idea.
  2. It doesn’t scale (the documentation describes it as “glacially slow”).
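The first downside can be illustrated with a toy model. The sketch below is not Git's real object format (which also hashes the author, message, timestamps, and a tree object); it only assumes that each commit hash depends on the parent's hash and the commit's file list. The names `commit_hash`, `build_history`, and the file names are made up for the example.

```python
import hashlib

def commit_hash(parent_hash, tree):
    """Toy commit hash: depends on the parent's hash and the sorted file list."""
    payload = parent_hash + "|" + ",".join(sorted(tree))
    return hashlib.sha1(payload.encode()).hexdigest()

def build_history(trees):
    """Hash a linear history, oldest commit first."""
    hashes, parent = [], ""
    for tree in trees:
        parent = commit_hash(parent, tree)
        hashes.append(parent)
    return hashes

# Three commits; secret.csv is added in the first and deleted in the last.
history = [
    ["secret.csv"],
    ["secret.csv", "a.txt"],
    ["a.txt"],
]

original = build_history(history)

# Rewrite every commit's tree without the removed file, as filter-branch would.
rewritten = build_history(
    [[f for f in tree if f != "secret.csv"] for tree in history]
)

for old, new in zip(original, rewritten):
    print(old[:8], "->", new[:8])
```

Note that the last commit has the same file list before and after the rewrite, yet its hash still changes, because the hash of its parent changed.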

In the domain of source code management, holding on to every deleted file is usually no big deal. When working with data, a retention solution matters far more: retaining billions of objects, many of them no longer needed, can have a considerable impact on your storage expenses. You may also have a legal obligation to dispose of personal information due to data protection regulations. That’s why we had to provide a more convenient and safer way to remove historical data, which does not have the above drawbacks.

Garbage Collection in lakeFS

The first step to get started with lakeFS Garbage Collection is to define a retention policy. This is a set of branch-specific rules that dictate how long objects stay accessible after they have been deleted. Remember, deleting an object in lakeFS only means that it is deleted in the branch’s head, but it stays accessible through previous commits. Let’s take a look at an example retention policy in lakeFS.

{
 "default_retention_days": 21,
 "branches": [
   {"branch_id":  "main", "retention_days":  28},
   {"branch_id":  "dev", "retention_days":  7}
 ]
}

In this example, objects are retained for 21 days after deletion by default. However, if they are present in the main branch, they are retained for 28 days. Objects present in the dev branch (but not in any other branch) are retained for 7 days after they are deleted.
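To make the rules concrete, here is a minimal Python sketch of how such a policy could be resolved. It is not lakeFS code: the helper names `retention_days` and `effective_retention_days` are hypothetical, and the second one assumes that an object present in several branches is kept for the longest applicable period, which is one reading of the description above.

```python
# The example policy from above, as a Python dict.
POLICY = {
    "default_retention_days": 21,
    "branches": [
        {"branch_id": "main", "retention_days": 28},
        {"branch_id": "dev", "retention_days": 7},
    ],
}

def retention_days(policy, branch_id):
    """Return the retention period for one branch, falling back to the default."""
    for rule in policy["branches"]:
        if rule["branch_id"] == branch_id:
            return rule["retention_days"]
    return policy["default_retention_days"]

def effective_retention_days(policy, branch_ids):
    """Assumption: an object present in several branches keeps the longest period."""
    return max(retention_days(policy, b) for b in branch_ids)

print(retention_days(POLICY, "main"))      # branch with an explicit rule
print(retention_days(POLICY, "feature"))   # no rule, so the default applies
print(effective_retention_days(POLICY, ["dev", "main"]))
```

Here `feature` is a made-up branch name with no rule of its own, so it falls back to `default_retention_days`.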

After defining the policy, it’s time to run the retention job. This job is a Spark program that needs to be run periodically. It finds all objects whose deletion is older than the applicable retention period and removes them from the storage.

The solution to problem #1 above comes thanks to the separation of data and metadata in lakeFS. lakeFS organizes its metadata in sorted files, called ranges. Our Spark program goes over these ranges and finds objects that can be deleted according to the retention policy. Though the underlying objects are deleted from the storage and no longer accessible, the published history is not rewritten and commit hashes remain intact. 

To find the delete candidates in reasonable time, we again took advantage of the lakeFS metadata structure. The algorithm we came up with divides the repository’s commits into two sets: active and expired. After obtaining those two sets, we subtract the ranges of the active commits from the ranges of the expired commits to come up with the list of objects that can be removed. Thanks to the power of Spark, this part takes a couple of minutes, even for huge repositories.
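The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual Spark job: the commit IDs, range IDs, and object names are invented, and the in-memory dictionaries stand in for the range files lakeFS keeps in storage.

```python
# Hypothetical metadata: each commit references a set of ranges,
# and each range lists the object addresses it contains.
COMMIT_RANGES = {
    "c1": {"r1", "r2"},
    "c2": {"r1", "r3"},
    "c3": {"r3", "r4"},
}
RANGE_OBJECTS = {
    "r1": {"obj-a", "obj-b"},
    "r2": {"obj-c"},
    "r3": {"obj-d"},
    "r4": {"obj-e"},
}

def ranges_of(commits):
    """Union of all ranges referenced by the given commits."""
    result = set()
    for commit in commits:
        result |= COMMIT_RANGES[commit]
    return result

def objects_of(ranges):
    """Union of all objects listed in the given ranges."""
    result = set()
    for range_id in ranges:
        result |= RANGE_OBJECTS[range_id]
    return result

def delete_candidates(active_commits, expired_commits):
    """Objects referenced only by expired commits."""
    active_ranges = ranges_of(active_commits)
    expired_ranges = ranges_of(expired_commits)
    # Ranges shared with active commits are dropped by ID alone,
    # so their contents never have to be read on the expired side.
    expired_only = expired_ranges - active_ranges
    # Guard against an object that also appears in some active range.
    return objects_of(expired_only) - objects_of(active_ranges)

print(sorted(delete_candidates({"c3"}, {"c1", "c2"})))
```

Subtracting at the range level first is what keeps the work small: commits that share most of their data also share most of their ranges, and those cancel out before any individual object is examined.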

This feature was prioritized and implemented thanks to valuable feedback from our users. It marks a new maturity level for the lakeFS project, and we hope you’ll find it beneficial.

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.
