Introduction

In our recent version of lakeFS, we switched to base metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but for object stores rather than filesystems, and with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits much of the reasoning behind it. This is (hopefully) the first in a series of posts that dive deeply into the details and rationale behind it. It discusses why the graveler 2-level storage design is useful for the immutable key-value store needed, and how we benchmarked in order to choose its parameters to get good caching of data from S3. Future posts will discuss why we specifically chose the RocksDB SSTable file format and its implementation using the Pebble SSTable library from CockroachDB.

The new design for committed storage, in brief:

A commit combines some commit metadata with a list of all its stored named objects.
This list is stored in an immutable key-value format termed graveler.
A graveler key-value store is an SSTable mapping keys (lakeFS pathnames) to values (object S3 URLs with some per-object metadata).
This SSTable is split across multiple S3 objects, each holding a range — an SSTable of contiguous key-value pairs.
Each range has an ID, and the list of ranges is stored in a metarange — an SSTable of IDs of range, keyed by that last key in each range.

In essence this is a B+Tree of height 2.

We think this is a great top-level design! It…

transfers a successful architecture for solving a similar problem (git) to our domain;
arranges standard building blocks, such as SSTables and object-store objects, in well-understood patterns;
uses immutability to allow for reproducibility and concurrency.

But it has many knobs, and it leaves many low-level decisions open.

Outline of graveler

We store committed data in graveler. A commit in both Git and lakeFS is a binding of specific versions of the objects (files) in that commit with some additional commit metadata – such as parent commit IDs, commit message, author, user metadata. Actual versions of objects – the “data” – are stored on underlying object storage (typically S3); for our purposes here with an arbitrary path. graveler is our system that maps object paths in a specific revision to the object paths on underlying storage.

This example shows graveler holding 2 revisions. File “a/file” changes between revisions, file “a/nother” is deleted, file “be/good” is renamed to “bat/man”, and file “be/tter” remains unchanged. The object names on underlying storage are immaterial to graveler. Additionally, because lakeFS is a versioned object store, it has no direct concept of “directories”: files are attached to paths. Splitting on separator characters (typically “/”) occurs at a higher level; graveler need merely provide an efficient “seek until >=” operation.

Just as in Git, commits never change. So graveler is an immutable key-value store, that behaves as a large SSTable. However, the design of graveler faces some different forces from that of Git:

graveler must be able to hold many more paths. A repository with merely millions of paths is small.
graveler uses a remote underlying storage. Remote files are fetched using the tiered filing system described in a previous blog post, “Tiers in the Cloud: How lakeFS caches immutable data on local-disk”. But access to any object that is not cached locally is slow, and random access is only an option after successfully retrieving the entire object.

This leads to different implementation choices.

Metadata object sizes

A lakeFS repository uses the underlying storage for two different tasks:

Storing user data objects. The number and size of these objects is controlled by the user, and an efficient workload already controls these parameters to allow efficient operation on the underlying storage.
Storing metadata objects. lakeFS controls the number and size of these objects, and must do so efficiently.

lakeFS has multiple typical access patterns to committed data:

Fully sequential (rare): When delivering a full listing of objects, or when computing a diff or merge of very different commits.
Sequential with random skips: When listing objects with paths matching a particular prefix, or when computing a diff or merge of rather similar commits.
Random but local skips: A typical “big data” application will access particular subsets of paths, and access paths within each subset in roughly lexicographical order.
Random: When accessing a single file.

So SSTables are a good fit for committed data in lakeFS, just as it is for unchanging data in many other object stores. Indeed, Git stores its files in tree objects — essentially per-directory SSTables.

Picking “good” object sizes requires benchmarking. We use the very comprehensive S3 benchmark numbers from github.com/dvassallo/s3-benchmark.

Time to download a file from an object store can roughly be modelled as t_initial + size/bandwidth: a roughly fixed cost to start downloading, followed by time to download all remaining bytes. Here are some of dvassallo’s numbers for times achieved for 10 threads running on c5.4xlarge instances. t_initial is roughly the time to request.

Instance type	Payload (KiB)	Threads	Rate (Mib/s)	Request time (msec)		Response time (msec)
Instance type	Payload (KiB)	Threads	Rate (Mib/s)	(p90)	(p99)	(p90)	(p99)
c5.18xlarge	128	10	84.477	16.4	46.9	18.2	48.2
c5.18xlarge	256	10	156.162	16.7	50.7	20.4	53.7
c5.18xlarge	512	10	241.555	18.2	48.3	24.7	102.7
c5.18xlarge	1024	10	443.879	16.6	50.8	25.7	82.9
c5.18xlarge	2048	10	714.105	15.8	43.1	34	73.3
c5.18xlarge	4096	10	815.543	16.5	70.3	55	116.2
c5.18xlarge	8192	10	893.252	15.3	39.2	89.2	136.1
c5.18xlarge	16384	10	922.756	15.4	56.1	171.8	203.4
c5.18xlarge	32768	10	926.44	14.8	53.8	347.2	394.3

A note on benchmarking S3
S3 timing is unfortunately not generally repeatable.  These numbers represent the lowest latency I could measure over 3 regions (us-west-2, us-east-1, eu-central-1) over a few days.  Other measured latencies were around 18-19 msec for 90th percentile of time to request.

In all cases time to request is barely affected by object size At object sizes below 512 KiB it is comparable to the time to receive the entire object. So object sizes below 512 KiB are always less effective for our access patterns: taking into account that they perform more accesses, they require strictly more time to fetch metadata on any number of files.

Large objects are less effective for all access patterns that access only parts of the repository, they potentially fetch more data than needed. At the same time, graveler design allows us to re-use shared objects between SSTables for different commits. In order to do this we need those objects to be shared; this is easier to do with smaller objects: An object needs to be rewritten whenever it changes. So splitting an SSTable between more objects makes each object more re-usable.

In light of this, we designed graveler to use object sizes of 1-10 MiB.

Lightweight B+Trees

A repository can contain hundreds of millions of objects at a single commit. Git overloads the directory structure of the files in the repository in order to determine tree objects. Each tree object holds a single directory, linking to additional tree objects for subdirectories.

lakeFS data repositories are somewhat different from Git source code repositories. Firstly, they tend to have orders of magnitude more objects – even billions of objects. But performance of directory scans is much less important in data lakes than performance of sequential scans, and directory structures can be very deep (8 or more levels are common). This allows us to separate concerns differently.

Graveler stores a sorted list of key-value pairs in multiple consecutive files. “Leaf” files are termed ranges. These hold the actual values: path to object on underlying object store and additional per-object metadata. We try to keep leaf files to have sizes in the range on 1-10 MiB. The different ranges hold non-overlapping ranges of keys. The list of references to range objects is stored in a separate file termed a metarange. It holds the range files in order of their keys, indexed by last key.

So the graveler SSTable is stored pretty much as a B+Tree with 2 levels: a root (metarange) and leaf nodes (ranges). Internal nodes are identified by digests of their contents, making this a Merkle tree. Nodes with the same contents receive identical identifiers, allowing for re-using nodes between similar trees as well as constant-time comparisons.

Only 2 levels?

We do not currently limit the size of metarange files. A typical object entry is <512 bytes, so a 10 MiB range file can hold 20K object entries. Given that we have more control over names of range file objects, a 10 MiB metarange file can hold rather more than 20K object entries, for a total of 400M object entries.
What happens if the metarange file is too large? We just keep it large. It turns out that we do not lose much! Almost all of the nodes in any B+Tree are near the leaves. The massive branching-factor of our nodes means that we would add at most 1 level if we split the metarange file: even splitting the root into just 100 entries would give a graveler file capable of holding 40G object entries. The gains from adding intermediate nodes are not much: a minimal increase in caching and dedupe efficiency (almost all of the cache and dedupe are for ranges, not metaranges), in exchange for no reduction in latency (all accesses would now require 3 steps rather than 2 as before).

A note about Git packfiles
As mentioned above, Git holds many more tree objects per file than we hold range objects.  It reduces the number of files that it needs to access by collecting small objects into “pack files”.  This significantly reduces the total size of repository metadata, and allows it to regain much of the added cost of small objects because it opens fewer files.

In lakeFS, the size of repository metadata is negligible compared to the size of actual user data.  Splitting the SSTable by directory structures would increase the time for many sequential scans: systems that use object stores are generally not designed to use directories efficiently.  We decided to skip this added implementation complexity for now (but may revisit it in future).

Metadata deduped

We store range and metarange objects on underlying storage using content-addressing: the pathname is based on a digest of the file contents. This naturally dedupes these objects in storage.

Range files can change between revisions, so deduping improves with decreasing range file size. Multiple close commits can typically share range files. E.g., if a repository uses object names that hold a date component, distant dates typically change very little between commits. Recall that the size of metadata is small compared to size of data. So deduping by sharing ranges is not so much about saving storage as it is about efficiency:

Shared range files are more efficient because they need only be fetched once into the cache.
Shared range files can allow vastly more efficient iteration. For example:
1. When writing a commit, if no files have changed in a range file then that file need not even be processed, let alone written. Regardless of the number of files in the tree, a commit that touches a single file requires reading 2 files and writing 2 files: the old and new metarange files, and old and new versions of the range file holding that changed file.
2. When diffing commits, only the 2 metarange files and all changed range files need to be read. All other range files share digests and are skipped. (For 3-way diffs, 3 metarange files and all changed range files need to be read.)
3. Similarly for merging: only input metaranges and changed range files need to be read or written.

Smaller range files naturally provide for more deduping and increase efficiency, up to a limit where the cost of fetching more files outweighs the need to fetch fewer of them.

What next?

In upcoming episodes about Graveler we shall cover how best to split ranges to increase reuse, how we selected a specific format (RocksDB) library (PebbleDB) for SSTables, and more implementation details of efficient iteration over Graveler.

The code for the committed data store lives in the graveler/committed subdirectory of lakeFS. The fastest way to get started with lakeFS is by going to lakeFS quick start.

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.

Concrete Graveler: Committing Data to Pebble SSTables

Introduction

Outline of graveler

Metadata object sizes

Lightweight B+Trees

Only 2 levels?

Metadata deduped

What next?

About lakeFS

To learn more...

Read Related Articles.

What is Metadata Tracking? Types, Tools & Best Practices

Iceberg Time Travel: Snapshots, Rollbacks & Data Version Control

Iceberg Branching Best Practices for Reliable Data Operations

Concrete Graveler: Committing Data to Pebble SSTables

Introduction

Outline of graveler

Metadata object sizes

Lightweight B+Trees

Only 2 levels?

Metadata deduped

What next?

About lakeFS

To learn more...

Read Related Articles.

What is Metadata Tracking? Types, Tools & Best Practices

Iceberg Time Travel: Snapshots, Rollbacks & Data Version Control

Iceberg Branching Best Practices for Reliable Data Operations

Related articles

How to Build Infrastructure for AI-Ready Data That Supports Scalable AI Workloads

Iceberg Time Travel: Snapshots, Rollbacks & Data Version Control

AI Center of Excellence: How to Build Reliable & Reproducible AI Systems

Pick up the Slack with lakeFS