Welcome to another episode “Concrete Graveler”, our deep-dive into the implementation of Graveler, the committed object storage for lakeFS.
Graveler is our versioned object store, inspired by Git. It is designed to store orders of magnitude more objects than Git does. The last episode focused on how we store a single commit — a snapshot of files — in Graveler. But a lakeFS repository holds more than a single commit, it stores a full history of files. We could obliviously store each commit separately. But reusing storage from other commits is better.
Graveler metaranges and ranges: a brief recap
The Graveler structure is described in the Introduction to “Concrete Graveler: Committing Data to Pebble SSTables”. It explains how the list of the object metadata in a commit is sorted by key (the object path) and stored in consecutive ranges on the object store; a metarange stores the list of ranges. This structure allows efficient support for the common patterns in filesystem access: batch and semi-local random access.
What commits do to metaranges
A commit writes a new metarange file, along with any new range files created for the commit. Reading a commit requires accessing these files. Graveler uses immutable files, so they are easy to cache (see Tiers in the Cloud: How lakeFS caches immutable data on local-disk for details).
So the effectiveness of the cache is directly determined by our ability to reuse files between commits, in any relationship between them. Most data in a single commit in Graveler is held in the ranges files. A commit typically needs to create a new list of files. So it will have to write at least some new ranges files, in addition to a new metarange file. The more “hot” branches and commits share the same ranges files, the better our caching becomes.
A solution in Git
Git side-steps this issue in two ways: it directly represents the directory structure, and its backing store is a local filesystem. Git represents every directory in a separate “tree object” — essentially each directory corresponds to a range in Graveler terms. When a directory changes, its entire tree object is rewritten. So a commit of a single file change requires writing new tree objects for that directory and for all of its parent directories.
This makes sense in Git. A typical repository has a local filesystem backing store, making writing multiple files is not much more expensive than writing a single file. By contrast lakeFS uses a remote backing store, where writing a new object is as expensive as writing over 4 MiB of data! So Graveler tries to use larger range files, and nest them to depth 2: a metarange object, pointing at range objects, each pointing as user file objects.
Getting a break
Taking a step back, Graveler needs to write a sequence of sorted key-value pairs into multiple files. It may break the sequence wherever required. Call the separation between consecutive ranges a break. So the region between two breaks determines a range. If two metaranges share a pair of consecutive breaks and the contents of the objects between those two breaks, then that range is reused. The problem is to find a “good” sequence of breaks. And the only functional requirement is this:
Functional requirement for ranges: Write sorted key-value pairs consecutively into a sequence of files.
Any implementation that promises this will work. So we can select an implementation by estimating the performance that it will allow. We can also change parameters or upgrade an implementation after-the-fact without backfilling old data without affecting correctness.
We would like our implementation to perform well in practice. So it should:
- Generate many reused ranges.
- Avoid generating very large ranges: they increase the cost of random access, because the entire range may need to be fetched in order to read a single object. (We do not currently read partial files: while doing so is certainly possible, we would still want to read around 10 MiB of data on every read. So the added complexity of adding an avoidable caching layer is undesirable, compared to controlling the size of ranges, reading them in full, and using existing OS paging to access those files.
- Avoid generating many small parts: they increase the cost of batch scans, because the time to fetch the first few MiB is almost constant.
Additionally, parameters will ideally all be tunable, implying that they correspond to real-world (measurable) numbers.
Statelessness / stationarity
While not necessary, we did opt for a simple stateless solution. This solution is oblivious to the current space of pre-existing ranges. A stateless solution is a great start: it requires no complex logic to manage mutable “current” state, is simple to reason about, and can offer measurable parameters.
We further chose a one-pass write-only solution: we write new commits directly into a range file, reuse an existing range from a parent commit whenever possible, and for changed ranges can decide after each entry whether or not to break off into a new range. (Parent commits are available during a commit or a merge, so reusing those requires no state.)
Adding a parameter of the max range size in bytes is enough to give a working solution — split a range after its sizes goes past the max range size. This alone is enough to reuse many ranges during regular operation; see the first section of “dynamics”, below, for an example.
Another useful family of such solutions uses two parameters: min and max range sizes, in bytes. It writes ranges consecutively, never breaking a range before it reaches min_size bytes and never allowing a range to grow by more than one entry beyond max_size bytes. Between these two sizes, it can use any heuristic that will (we hope) cause ranges to be reused. The desire for statelessness means that only the last few entries (a fixed number) can be used to decide on a break — and we choose to base all decisions based on the last entry written.
An easy way to select a break is to hash the key of the last entry written to get a known average number of entries. These breaks depend only on keys. So, unless the minimal size interferes, two metaranges will always share breaks — and will share a range file if their contents between the breaks are shared.
Dynamics: how does this behave?
As an example, consider a repository (a lake) organized under two prefixes, input/ and output/. Data enters the system through branch ingest. An input process adds files to this branch under a sorted prefix input/YYYY/MM/DD/hh:mm/, and once an hour these files are committed and eventually merged into main. In this case just reusing the existing ranges during a commit is enough to give excellent reuse.
When committing at, say, 2021-04-26T03:00:00 we expect normally to see no files earlier than 2021-04-26T02:00:00. So all ranges whose max key is before input/2021/04/26/02:00 do not change — they will be reused. If there are newly added keys after 02:00 and before 03:00, that range might be rewritten — so every key-value pair should end up in only 1 or 2 range files.
With this branch and commit behaviour, merging does not need to create any new ranges, either. So in total we still expect to see only one or two ranges for every hour. Some of these ranges will be small — if and when this becomes an issue, we will have to add an (offline) packer for range files.
Hash-based breaks can make a difference for merges between branches that add mostly the same files with the same contents. For instance, suppose we create nearly the same files on 2 different branches in multiple commits, but the commits on the branches add different files each time. Now we merge them. If we break only based on max range sizes, no ranges will match; the merge will need to generate almost all new ranges. But with hash-based breaks ranges will tend to end at the same key. Range endpoints on the 2 branches will often coincide — so some ranges will be shared.
Picking parameter values
min_size and max_size have simple units — bytes. Hashing is determined by a hash function — any function will do, with possible preference to a repo-specific random (or universal) hashing function. And if we only need to control average range size then min_size=0 is a reasonable choice.
The remaining parameter — “raggedness”, or average number of entries beyond min_size — has understandable units, but is harder to tune without knowing the average size of an entry in a range.
Assuming hash function behaviour is indistinguishable from random, expect the size of ranges to follow a geometric distribution trimmed to be between min_size and max_size.
In practice we ship with min_size=0, max_size=20MiB, raggedness=50_000 (entries). E.g. if entries have average size 400 bytes (paths of around 300 bytes, depending on various prefix lengths), then around 65% of ranges will be cut before they reach 20MiB.
The current implementation is simple and understandable. In future we may tweak the heuristic, possibly by learning properties of the keyspace. Further ahead, abandoning one-pass processing offers many possibilities for provably optimal strategies — current splitting is a greedy approximation of some optimization problem.
This framework for ‘Concrete Graveler’ offers many knobs for improvement. However it is entirely possible that the current solution is good enough for practical purposes.