Iddo Avneri

Published on September 22, 2024

Once you start using lakeFS, the files in your object store take on a new representation: their names and paths no longer look the same as before.

This article provides a high-level overview of the lakeFS file representation and how it supports data versioning.

How Data Versioning Works in lakeFS

The lakeFS data version control system allows the following Git-like operations:

  • Branch – A consistent copy of the repository, isolated from other branches and their changes. Creating a branch is a metadata operation that doesn’t duplicate data.
  • Commit – An immutable checkpoint that provides a complete snapshot of the repository.
  • Merge – Atomically updates one branch with the changes from another.
  • Revert – Restores a repository to a prior commit.
  • Tag – A pointer to a single immutable commit, with a meaningful name.
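As a rough illustration of how these operations look outside the UI, here is a minimal sketch using the high-level lakeFS Python SDK (the lakefs package). The repository and branch names are made up, and exact method signatures may vary between SDK versions:

```python
import lakefs

# Assumes a running lakeFS server with credentials configured via
# environment variables or ~/.lakectl.yaml; all names here are examples.
repo = lakefs.repository("example-repo")

# Branch: a metadata-only operation that doesn't duplicate data.
experiment = repo.branch("experiment").create(source_reference="main")

# Stage a change and commit it: an immutable snapshot of the branch.
experiment.object("datasets/reviews.parquet").upload(data=b"...parquet bytes...")
experiment.commit(message="Add product reviews dataset")

# Merge: atomically bring the experiment branch's changes into main.
experiment.merge_into(repo.branch("main"))
```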

When you use lakeFS, the files in your object store form a new structure; other systems, such as Apache Iceberg, similarly create their own metadata structure.

File Storage Formats

For Commits: SSTable

Commits are stored as RocksDB-compatible SSTables. Three reasons made SSTables the storage format of choice:

  1. SSTables offer extremely high read throughput on modern hardware. Using commits representing a 200 million object repository (modeled after the S3 inventory of one of our design partners), we achieved close to 500k random GetObject calls per second. This provides a very high throughput/cost ratio, as high as can be achieved on public clouds.
  2. It’s a well-known storage format, which makes it straightforward to produce and consume. Keeping it on the object store makes it accessible to data engineering tools for analysis and distributed computation, reducing the silo effect of an operational database.
  3. The SSTable format supports delta encoding for keys, making them very space-efficient for data lakes where many keys share the same common prefixes.

Each lakeFS commit is represented as a set of non-overlapping SSTables that make up the entire keyspace of a repository at that commit.
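To see why delta encoding of keys matters for data-lake paths, here is a toy Python sketch (an illustration of the idea, not lakeFS’s actual SSTable code) that stores each key as the length of the prefix it shares with the previous key plus the remaining suffix:

```python
def delta_encode(sorted_keys):
    """Encode each key as (shared_prefix_len, suffix) relative to the previous key."""
    encoded, prev = [], ""
    for key in sorted_keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded

keys = [
    "collections/product-reviews/part-0000.parquet",
    "collections/product-reviews/part-0001.parquet",
    "collections/product-reviews/part-0002.parquet",
]
for shared, suffix in delta_encode(keys):
    print(shared, suffix)

# Only the first key is stored in full; each following key stores just the
# few characters that differ, which is why long shared prefixes compress so well.
```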

For Metadata: Graveler 

lakeFS metadata is encoded into Graveler, a format that offers a standard way to encode content-addressable Key/Value pairs. 

Requirements for the storage format

lakeFS has additional requirements for the storage format:

Being space and time-efficient when creating a commit – 

Assuming a commit changes a single object out of a billion, there’s no need to write a full snapshot of the entire repository. Ideally, users should be able to reuse some data files that haven’t changed to make the commit operations (in both space and time) proportional to the size of the difference as opposed to the total repository size.

Allowing an efficient diff between commits – 

A diff should run in time proportional to the size of the difference between the commits, not to their absolute sizes.

To support these requirements, lakeFS is based on a 2-layer Merkle tree that consists of:

  • A set of leaf nodes (“Range”) addressed by their content address, and
  • A “Meta Range,” which is a special range containing all ranges, representing an entire consistent view of the keyspace.
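The following toy Python sketch (a conceptual model, not lakeFS’s Graveler implementation) shows how a commit built from content-addressed ranges and a metarange can reuse unchanged ranges, so a commit that changes a single object writes an amount of data proportional to the change rather than to the repository size:

```python
import hashlib
import json

def address(payload) -> str:
    """Content address: hash of the serialized payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]

def build_commit(entries, range_size=2, store=None):
    """Split the sorted keyspace into ranges, store each by content address,
    and return the metarange address representing the whole commit."""
    store = store if store is not None else {}
    keys = sorted(entries)
    ranges = []
    for i in range(0, len(keys), range_size):
        chunk = {k: entries[k] for k in keys[i:i + range_size]}
        addr = address(chunk)
        store[addr] = chunk       # unchanged chunks hash to the same address
        ranges.append(addr)
    meta_addr = address(ranges)
    store[meta_addr] = ranges     # the metarange lists all range addresses
    return meta_addr, store

entries = {f"data/file-{i}.parquet": f"v1-{i}" for i in range(6)}
commit1, store = build_commit(entries)

entries["data/file-5.parquet"] = "v2-5"              # change a single object
commit2, store = build_commit(entries, store=store)

# Only the range containing the changed key (plus a new metarange) was added;
# every other range is shared between the two commits, and a diff only needs
# to compare the ranges whose addresses differ.
print(len(store))
```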

Representing References And Uncommitted Metadata

lakeFS always saves committed and uncommitted object data in your object store’s storage namespace. However, lakeFS object metadata may be stored in a key-value or object store.

Uncommitted (or “staged”) metadata, unlike committed metadata, is malleable and written frequently. The same is true of references (“refs”): branches are pointers to an underlying commit and are updated on every commit or merge operation.

Both of these types of metadata are not only mutable but also require strong consistency guarantees and fault tolerance: if we can’t read the current pointer of the main branch, a large portion of the system is effectively down.

Fortunately, this metadata is also much smaller than the committed metadata.

References and uncommitted metadata are currently stored in a key-value store to provide these consistency guarantees. See the list of supported databases for more details.
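As a conceptual sketch (not lakeFS’s actual key-value schema), a branch reference can be modeled as a tiny record mapping a branch name to a commit ID, advanced with a compare-and-swap so concurrent commits can’t silently overwrite each other:

```python
import threading

class RefStore:
    """Toy key-value store for branch references with atomic compare-and-swap."""
    def __init__(self):
        self._refs = {}
        self._lock = threading.Lock()

    def get(self, branch: str):
        return self._refs.get(branch)

    def compare_and_swap(self, branch: str, expected, new_commit: str) -> bool:
        with self._lock:
            if self._refs.get(branch) != expected:
                return False          # someone else moved the branch first
            self._refs[branch] = new_commit
            return True

refs = RefStore()
refs.compare_and_swap("main", None, "commit-aaa")

# A commit reads the current head, writes the new commit's metadata, then
# atomically advances the branch pointer only if the head hasn't changed.
head = refs.get("main")
assert refs.compare_and_swap("main", head, "commit-bbb")
```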

Understanding lakeFS File Representation

Let’s take a look at the files in the object store once they are managed with lakeFS.

Creating a repository

The first step is to create a new repository via the web interface:

Create a lakeFS repository

This creates an “empty” lakeFS repository understand-lakefs-repo sitting on an S3 bucket my-lakefs-managed-bucket:

Getting started with lakeFS

If you look at the cloud provider’s side, you’ll see a single file created at this time (this example is from AWS):

Cloud provider

The dummy file is created to check the permissions of the AWS role used by lakeFS to write into the bucket. 
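If you prefer code over the web UI, the same repository can be created with the high-level lakeFS Python SDK. This is a sketch using the repository and bucket names from this walkthrough; the exact create() signature may vary between SDK versions:

```python
import lakefs

# Create the repository on top of the existing S3 bucket that will serve
# as its storage namespace (assumes lakeFS credentials are configured).
repo = lakefs.Repository("understand-lakefs-repo").create(
    storage_namespace="s3://my-lakefs-managed-bucket"
)
print(repo)
```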

Importing files

Importing doesn’t copy files into the lakeFS-managed bucket; lakeFS only records metadata that points to the objects in their original location. This means that if you change the data in the original bucket, lakeFS can no longer manage the metadata for it.

Let’s import data from a bucket s3://my-original-data, which contains a single directory product-reviews with two Parquet files:

Import data from s3

lakeFS never copies the files during an import like this one; the objects stay in their original location.
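The same import can be run from code. Here is a hedged sketch using the high-level Python SDK’s import manager; the import_data, prefix, and run method names follow the documented pattern but may differ between SDK versions:

```python
import lakefs

branch = lakefs.repository("understand-lakefs-repo").branch("main")

# Register the existing objects without copying them into the managed bucket.
importer = branch.import_data(commit_message="Import product reviews")
importer.prefix("s3://my-original-data/product-reviews/", destination="product-reviews/")
importer.run()   # blocks until the import (a metadata-only operation) completes
```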

Writing new objects to the repository

Going forward, you may want to add new objects to the repository. These will be written to the lakeFS-managed bucket.

You can upload the files using the web interface:

Upload object using web interface

The files are named differently once uploaded via lakeFS. The references to these files are maintained in the range and metarange files for the different commits. 
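New objects can also be written through the lakeFS S3 gateway using any S3 client, with the repository as the bucket and the branch as the first element of the key. Here is a minimal boto3 sketch; the endpoint URL, credentials, and file name are placeholders for your own installation:

```python
import boto3

# Point the S3 client at the lakeFS S3 gateway instead of AWS
# (endpoint and keys below are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

# Bucket = repository, key = <branch>/<path within the repository>.
with open("2024-09.parquet", "rb") as f:
    s3.put_object(
        Bucket="understand-lakefs-repo",
        Key="main/product-reviews/2024-09.parquet",
        Body=f,
    )
```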

You’ve uploaded the files but haven’t committed the changes yet. You can do that via the UI:

Commit changes via lakeFS UI

Once committed, new files are added under the _lakefs directory:

Committed files

Wrap up

lakeFS stores object data in a data directory and the range and metarange files under the _lakefs directory on the object store, associating them with commits. Since lakeFS keeps all of these files in your managed bucket, whether in the public cloud, a private cloud, or on-premises, there are numerous ways to achieve high availability. Your object store holds the data, range, and metarange files regardless of where lakeFS itself runs.
