There are several tools in the data version control space, all looking to solve similar problems. Two of the leaders are lakeFS and DVC. In this post, I am going to give an overview of how each has been designed so as to provide a basis for understanding their relative abilities to scale. Being a co-creator of lakeFS, you can well expect me to bring my biases to this, but I will be doing my best to be fair 🙂
Let’s first take a step back and align on the problem that we’re looking to solve, and then examine how these different solutions handle scale.
Data Version Control – Can’t we just use Git?
Before we jump into the specifics and understand how the problem is solved, let’s spend a minute understanding the problem itself. Why do we need a dedicated tool to version data? Can’t we just use Git?
For the full answer, I would recommend my co-founder’s great, in-depth article about data version control. The TL;DR is that source code and data are pretty different. For this article, we’ll focus on three main differences:
- Size – Unlike source code, which is still mostly generated by humans, data comes from many sources: some are human, but many are not. Even data that results from a human interaction (such as a transaction in a shop) gets pretty big when you start to capture granular data about it. Once you get on to purely machine-generated data, the volumes quickly increase – consider areas like log data, sensor data from IoT, video and image feeds, and more.
While a prolific software engineer might generate Kilobytes of code on a good day (or bad, depending on whom you ask), machines will often happily generate many orders of magnitude more. It is not uncommon for companies big or small to have Terabytes or Petabytes of data.
- Velocity of change – For the same reasons, data also changes at a much higher velocity. Data lakes are often partitioned based on time of arrival, such as hourly or daily. This means that a relatively large portion of the data is effectively replaced on a regular basis.
- Access Patterns – The data generated by machines is often also consumed by machines. Typically, a compute engine (whether for analytics, machine learning training, or otherwise) will be the consumer. For this to be efficient, we need to transfer very large amounts of data between a repository and the consuming process. Rare are the cases in which very large datasets are made for humans to consume.
Of course, there are other differences, some more nuanced than the ones specified here – and I do encourage you to read more about the topic.
Designing a Scalable Data Version Control System
lakeFS and DVC both provide data version control—but how do they compare when it comes to scalability? Let’s consider their architecture as one angle on how they will scale.
DVC provides versioning for data files by coupling its metadata into that of a Git repository. This is done by setting up a local directory that needs to exist on all machines that consume data. It’s the cache directory that has several interesting traits:
- Content Addressability – Just like Git, files are stored in the cache directory with an ID that represents their content. This allows DVC to be efficient in storage: files are automatically deduplicated if they represent the exact same bytes.
- File Metadata – Information about files and directory contents is stored in files that are also content addressable within the cache directory. These would typically have a
.dirsuffix and are represented as a JSON encoded array of entries, each representing a child object.
DVC will additionally use auxiliary indices when possible: it will store each file’s modified time and size along with its hash to avoid having to rehash all files to calculate a local diff. This, however, is stored in a temp directory and is not persistent.
- Linking – While the cache directory (might) contain all versions of all files ever referenced, a specific version will likely only reference a subset of those. As DVC aims to be efficient in storage, a version is represented locally as a set of links to the cache. There are several types of links to choose from – reflinks being the “ideal”, supporting both deduplication and copy-on-write for changes, but unfortunately are not supported on all OSs and filesystems. Other options include hard/soft links that are not editable and plain old copying.
- Remotes – Since caching works on a local filesystem directory, sharing and backing up data in DVC is done using remotes. A remote would typically be shared over a network, making data accessible to other machines and users. DVC supports many types of remotes, from object stores (AWS S3, Google Cloud Storage, etc.) to SFTP and even Google Drive.
DVC Scalability Pros
- This architecture is very simple and allows you to get started with DVC without a central server or remote storage.
- When using reflinks, data is deduplicated between cache and workdir and no copy is required.
- Content addressable storage further reduces storage use by automatically deduplicating content that is byte equivalent.
DVC Scalability Cons
- Storing a snapshot of file metadata as a serialized JSON array is highly inefficient. It means that DVC has to deserialize the entire blob to know what a version contains. Whilst this is fine for smaller datasets, it scales poorly for larger collections.
- Concurrency requires duplicating storage anyway. Since a version has to be materialized locally in a working dir to be usable, the only way to use a version is to link to its files or copy them.
- reflinks are not supported at all on Windows and have limited support on other OSs as well. Users would have to choose between the default (copy) or hard/symlinks that impose limitations on mutability
- Diffing between versions is inefficient. This is because each commit is represented as a JSON-serialized array; diffing involves deserialization of both commits and a full scan of both arrays to figure out what changed. This makes diff an O(n) operation, linear to the size of the managed dataset.
lakeFS was designed from the ground up to scale to billions of objects. Both data and metadata are optimized for storage on object stores such as S3, GCS, and Azure Blob.
Unlike DVC, lakeFS by default has no notion of “checking out” a branch. This is by design as well: the larger the data, the less feasible it becomes to consume it from a single machine. Typically, distributed systems such as Apache Spark, Trino, and many others read and write data in parallel from multiple machines.
lakeFS uses a client/server architecture with the following components:
- lakeFS Server: a stateless component providing metadata management and, in some cases, serving objects directly. Data is stored in two places:
- Object Store: the underlying storage for both data and the commit metadata. This allows lakeFS to scale horizontally in both throughput and storage.
- K/V store: the data stored in this enables concurrent access to branches and references. It can be an embedded database within the lakeFS server or an external database such as PostgreSQL, DynamoDB, or CosmosDB.
- lakeFS Clients: lakeFS is compatible with most compute, ingest, and orchestration tools. It provides a native Python and Java SDKs, as well as a CLI, web UI, and a plethora of other integrations.
Separate metadata and object data access paths
The lakeFS clients are designed to separate the data and metadata access paths. Typically, metadata operations will be done by calling the lakeFS server, while reading and writing the data itself will usually be done by directly communicating between the client and object store.
This allows lakeFS to stay out of the data path, making it very efficient even with extremely large volumes of data.
Optimized metadata storage format
lakeFS includes a custom designed metadata storage engine called Graveler. At its core, Graveler implements a versioned key/value store and is optimized for versioning billions of items. lakeFS uses Graveler to map logical paths to a physical location on the object store.
A Graveler commit consists of K/V ranges (stored as rocksDB-compatible sstables) that make up a range tree. Each range is stored by a hash representing its value (yes, similar to Git and DVC), with a top meta-range holding a list of ranges.
This model has several big advantages:
- ranges, being immutable and content addressable, are easy to store on an object store.
- since a hash always points to the same range, ranges are also easily cacheable without complex invalidation logic.
- changes happening on a specific range will result in a new range being created. The rest of the tree is easily reused. This makes commits, merges, and diffs linear to the size of the change rather than the total size of the commit.
lakeFS Scalability Pros
- Data held in lakeFS can be read by multiple parallel consumers.
- Clients can read data directly from the object store, using lakeFS only for metadata and taking advantage of the existing vast scalability of object stores.
- Copy-on-write can be used by all clients and is not OS dependent. By using CoW clients, it reduces the amount of data stored and transferred.
- Diffs—as well as commits and merges—are efficient because of how the metadata storage engine is implemented.
lakeFS Scalability Cons
- lakeFS requires a server as part of its deployment, which can make initial deployment more daunting. However, lakeFS is available as a fully-managed service on lakeFS Cloud.
- lakeFS was previously best suited to data engineering workloads, and its lack of local checkout made ML/AI use cases less easy to implement because objects from the repository could only be referenced using an object store client. Fortunately, this is no longer the case.
The best of both worlds: Introducing local checkouts with lakeFS
We saw above how lakeFS is designed to scale but previously had limitations in its use in machine learning cases where the lack of local object checkout caused challenges. We are pleased to announce lakeFS 0.106.1 which includes the ability to locally checkout objects from a lakeFS repository.
lakectl now includes the
local command. Using it, you can now “clone” data stored in lakeFS to any machine, track which versions you were using in Git, and create reproducible local workflows that both scale very well and are easy to use.
Here’s an example of how you can use it. In our lakeFS repository, we’ve got several sets of data, and we want to work with just a subset of them locally – the data in the OrionStar folder.
Let’s clone that folder on its own to our local machine:
$ lakectl local clone lakefs://quickstart/main/data/OrionStar/ orion
download QTR1_2007.csv ... done! [867B in 0s]
download PRICES.csv ... done! [1.29KB in 0s]
download ORDER_FACT_UPDATES.csv ... done! [1.44KB in 0s]
download QTR2_2007.csv ... done! [1.67KB in 0s]
download CUSTOMER.csv ... done! [6.86KB in 0s]
Successfully cloned lakefs://quickstart/main/data/OrionStar/ to /Users/rmoff/work/orion.
An important point to note is that—just like a cloned Git repository—the local folder retains references to the
main branch of the
quickstart repository. This means that as well as working with the data locally, we can sync subsequent changes.
With the data locally, we can use whatever we want on our machine to work with it. Here I’ll run a quick DuckDB query, but this could be something more complex like training an ML model or scoring the data before pushing it back to the repository.
🟡◗ SELECT c.customer_name, c.customer_address, co.product_name, co.total_retail_price
> FROM READ_CSV_AUTO('orion/CUSTOMER_ORDERS.csv') co
> INNER JOIN
> READ_CSV_AUTO('orion/CUSTOMER.csv') c ON co.customer_name=c.customer_name;
│ Customer_Name │ Customer_Address │ Product_Name │ Total_Retail_Price │
│ varchar │ varchar │ varchar │ varchar │
│ Kyndal Hooks │ 252 Clay St │ Kids Sweat Round Neck,Large Logo │ $69.40 │
│ Kyndal Hooks │ 252 Clay St │ Fleece Cuff Pant Kid'S │ $14.30 │
│ Dericka Pockran │ 131 Franklin St │ Children's Mitten │ $37.80 │
│ Wendell Summersby │ 9 Angourie Court │ Bozeman Rain & Storm Set │ $39.40 │
│ Sandrina Stephano │ 6468 Cog Hill Ct │ Teen Profleece w/Zipper │ $52.50 │
As a side note, the above may not be the best example—because with the web UI in lakeFS it has DuckDB embedded, so you can work with the data directly:
But I digress! Back to our local example. In the lakeFS repository someone has helpfully added a data dictionary:
To get this—and any other changes in this path—to our local copy, we can run a
$ lakectl local pull
diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/0b51ece0d7c39904c20054617165bbc5acc05b5d79b40ff2a1364cb9f15579d7/data/OrionStar/'...
download 00_data_dictionary.pdf ... done! [129.17KB in 1ms]
Successfully synced changes!
What about if we want to track a different branch from the
main one that we initially cloned? We can use
checkout for that, specifying the local folder (from which we’re running the command, so just a “.”) and referencing the branch name:
$ lakectl local checkout . --ref dev
diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/8e0e62cb7c20d016eba74abd2510bf8fcaac94563fe860fb4259626bed7e66a2/data/OrionStar/'...
diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/7eaeafd3fb90df419ca26eca69ba2b153eae13bd3dccf80ba47cbed1260cb0c1/data/OrionStar/'...
delete local: PRICES.csv ... done! [%0 in 0s]
If you look at the output from the command, you’ll see that the
PRICES.csv file was removed from the
dev branch. Let’s remove a few more files locally, and add one more (
RATINGS.csv)—just to see the two-way sync that can be done here:
$ cp ../RATINGS.csv .
$ rm -v EMPLOYEE_*
$ lakectl local status
║ SOURCE ║ CHANGE ║ PATH ║
║ local ║ removed ║ EMPLOYEE_ADDRESSES.csv ║
║ local ║ removed ║ EMPLOYEE_DONATIONS.csv ║
║ local ║ removed ║ EMPLOYEE_ORGANIZATION.csv ║
║ local ║ removed ║ EMPLOYEE_PAYROLL.csv ║
║ local ║ added ║ RATINGS.csv ║
Now we’ll commit the change:
$ lakectl local commit . -m "Remove employee data and add ratings information" --pre-sign=false
Getting branch: main
diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/7eaeafd3fb90df419ca26eca69ba2b153eae13bd3dccf80ba47cbed1260cb0c1/data/OrionStar/'...
diff 'lakefs://quickstart/38df1480d835fe52dc6db31b87ef35b78403d0c9c621c74a6d032c686551abc5/data/OrionStar/' <--> 'lakefs://quickstart/main/data/OrionStar/'...
delete remote path: EMPLOYEE_ORGANIZATI~ ... done! [%0 in 8ms]
delete remote path: EMPLOYEE_ADDRESSES.~ ... done! [%0 in 8ms]
delete remote path: EMPLOYEE_DONATIONS.~ ... done! [%0 in 8ms]
delete remote path: EMPLOYEE_PAYROLL.csv ... done! [%0 in 8ms]
upload RATINGS.csv ... done! [6.86KB in 9ms]
Both lakeFS and DVC have a similar goal: being able to version data and build repeatable, versioned data products using it.
However, while the mission is the same, they work very differently. DVC uses Git for metadata management and works in tandem with a Git repository, whilst lakeFS is a standalone system built from the ground up to work at any scale.
With the addition of support for
local in lakeFS, you can cover use cases small and large, local and distributed.
Table of Contents