lakeFS Community
Oz Katz, Author

Oz Katz is the CTO and Co-founder of lakeFS, an...

Published on June 11, 2024

You need data version control capabilities for your data lake but you’re not sure what safeguards lakeFS has in place? If this sounds familiar, this article is for you.

Keep reading to explore the potential risks you may be facing while using a third-party data versioning solution like lakeFS and get a deep dive into all the security measures lakeFS has in place to protect your data.

What are the potential risks while using lakeFS?

Metadata corruption

Metadata corruption happens when system views, processes, and functions are corrupted as a result of issues such as a blackout, virus, hacker attack, hardware failure, failed upgrade, inadequate disk space, shutdown issues, or other causes.

Service outages

Any cloud service, Amazon S3 included, can experience an outage. These don’t happen often, but the recent major S3 outage in the United States disrupted services like Slack, GitHub, Giphy, and many others. While extremely unusual, an outage like that may impact your data lake operations.

API failures

What happens if you’re in the middle of a commit or merge operation when lakeFS writes to an S3 bucket, and S3 starts exhibiting API failures? This is a risk you need to consider when using a third-party solution for anything related to your S3 bucket.

Measures lakeFS takes to ensure data security and safety

Commit and merge operations are atomic

With lakeFS, every commit and merge operation is atomic: it either completes fully or does not take effect at all. It can never complete only partially.

What about branching? A branch can only move from one complete commit to another.

Metadata describes what a commit contains

In lakeFS, all metadata that describes what a commit contains is stored in the object store itself. So if you’re using S3, that metadata gets S3’s eleven nines (99.999999999%) of durability.

Storage namespaces

The lakeFS server doesn’t typically write data objects to the underlying object store; that task is delegated to lakeFS clients that leverage pre-signed URLs.

However, the server does tell clients where to write objects within its managed storage namespace. In doing so, it intentionally spreads data files across many S3 partitions to reduce the likelihood of throttling by the object store.
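To make the idea of spreading writes concrete, here is a minimal Python sketch. It is not the key layout lakeFS actually uses: the function name `new_object_key`, the `data/` prefix, and the two-character fan-out are all assumptions chosen for illustration. It only shows how a server could hand out object keys whose leading characters vary, so that writes do not pile up under a single key prefix.

```python
import secrets


def new_object_key(storage_namespace: str) -> str:
    """Return a hypothetical object key inside the storage namespace.

    The random token makes each key unique, and its first two hex
    characters become a fan-out prefix so keys are spread across
    256 distinct prefixes (an illustrative choice, not lakeFS's).
    """
    token = secrets.token_hex(16)  # 32 random hex characters
    return f"{storage_namespace.rstrip('/')}/data/{token[:2]}/{token}"


if __name__ == "__main__":
    # Hypothetical namespace; prints a key such as .../data/ab/ab12...
    print(new_object_key("s3://example-bucket/repo/"))
```

In a setup like this, the client would receive such a key (or a pre-signed URL for it) from the server and upload the object directly to the object store.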

Putting the data security guarantees into practice: step-by-step guide

This is what the flow enhanced by lakeFS security guarantees looks like:

  1. The client issues a commit API call to the server with a repository, branch ID, message, and optional metadata.
  2. The server takes note of the current commit ID that the branch is currently pointing to.
  3. The server creates a new staging token, sealing the current one for the branch to ensure new writes are excluded. From here on out, we have a set of changes to commit that are ensured to be immutable.
  4. All the sealed staging tokens are then serialized to the object store, making up a tree of RocksDB-compatible SSTables.
  5. A commit record is written, pointing to the root of that tree on the object store.
  6. Once that’s done, the branch pointer is modified to point to the commit we created. This is an atomic compare-and-swap operation: the new commit takes effect only if the current commit ID is still the one observed in the second step.
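The steps above can be sketched as a small in-memory model. This is a simplified illustration, not lakeFS code: the `ToyServer` class, its dictionaries, and the key names are hypothetical, a Python dict stands in for the object store, and a lock stands in for the KV store's atomic compare-and-swap.

```python
import threading
import uuid


class ToyServer:
    """In-memory stand-in for the server. All names are hypothetical."""

    def __init__(self):
        self._lock = threading.Lock()        # guards the branch pointer
        self.branch_head = {"main": "c0"}    # branch ID -> commit ID
        self.staging_token = {"main": "t0"}  # branch ID -> open staging token
        self.object_store = {}               # stands in for S3
        self.commits = {"c0": {"parents": [], "root": None, "message": "initial"}}

    def compare_and_swap(self, branch_id, expected, new):
        """Move the branch only if it still points at `expected`."""
        with self._lock:
            if self.branch_head[branch_id] != expected:
                return False
            self.branch_head[branch_id] = new
            return True

    def commit(self, branch_id, message, metadata=None):
        # Step 2: note the commit the branch currently points to.
        observed = self.branch_head[branch_id]
        # Step 3: seal the current staging token; later writes use a new one.
        sealed = self.staging_token[branch_id]
        self.staging_token[branch_id] = uuid.uuid4().hex
        # Step 4: serialize the sealed changes to the object store.
        root_key = f"trees/{uuid.uuid4().hex}"
        self.object_store[root_key] = {"sealed_tokens": [sealed]}
        # Step 5: write a commit record pointing to the root of that tree.
        commit_id = uuid.uuid4().hex
        self.commits[commit_id] = {
            "parents": [observed],
            "root": root_key,
            "message": message,
            "metadata": metadata or {},
        }
        # Step 6: move the branch pointer only if nothing changed since step 2.
        if not self.compare_and_swap(branch_id, observed, commit_id):
            raise RuntimeError("branch moved since step 2; commit not applied")
        return commit_id
```

Note that everything written in steps 4 and 5 is invisible to readers of the branch until the compare-and-swap in step 6 succeeds, which is what makes the commit appear atomic in this model.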

What happens if failure occurs at Step 6?

A failure at Step 6 can happen for one of two reasons:

Generic error writing to the lakeFS backing KV store

In this case, the server either retries the KV write operation or gives up. If it gives up, the commit operation fails and the branch still points to the existing commit.

Compare-and-swap predicate failure

This means someone “beat us to it” – another commit/merge has successfully finished before ours did. In this case, you can restart the flow at the second step. This ensures atomicity and also that parent-child relationships are properly maintained. Just like in Git, each commit points to its parents.
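A minimal Python sketch of that restart logic follows. The names `BranchStore`, `PredicateFailed`, and `commit_with_retry`, as well as the `max_attempts` limit, are hypothetical and only illustrate the pattern; they are not lakeFS APIs. Any other error raised by the store (the generic KV write failure case) simply propagates, leaving the branch on its existing commit.

```python
class PredicateFailed(Exception):
    """Raised when the branch no longer points at the expected commit."""


class BranchStore:
    """Toy single-branch KV store with a compare-and-swap primitive."""

    def __init__(self, head):
        self.head = head

    def get(self):
        return self.head

    def compare_and_swap(self, expected, new):
        if self.head != expected:
            raise PredicateFailed(f"expected {expected}, found {self.head}")
        self.head = new


def commit_with_retry(store, build_commit, max_attempts=5):
    """Retry the commit flow from the second step on predicate failure."""
    for _ in range(max_attempts):
        observed = store.get()                      # step 2: read current head
        new_commit = build_commit(parent=observed)  # steps 3-5: build the commit
        try:
            store.compare_and_swap(observed, new_commit)  # step 6
            return new_commit
        except PredicateFailed:
            continue  # someone else committed first; start over at step 2
    raise RuntimeError("commit failed: branch kept moving")
```

Because each attempt re-reads the branch head before building the commit, a commit that eventually succeeds always has the latest head as its parent.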

Wrap up

lakeFS is prepared for occurrences such as sudden S3 downtime or API failures, thanks to a number of guarantees built into the branching, committing, and merging flow.

To learn more about lakeFS security measures, see the lakeFS Security Reference.
