Introduction
When we first thought about a tagline for our open source project lakeFS, we instinctively gravitated to terms like “Data versioning”, “Manage data the way you manage code”, “Git for data”, or any variation of the three that is grammatically correct.
We were very pleased with ourselves for 5 minutes, or maybe 7, before realizing these phrases don’t really mean anything. Or more precisely, they mean too many things to properly describe the value of lakeFS. (They are also commonly used by other players in the domain that address completely different use cases.)
So, we decided to map the projects that describe themselves as data versioning tools according to their use cases. We organized the existing ones into three categories: Collaboration, Machine Learning Management, and Table Formats.
And lakeFS – which didn’t fit naturally into these categories – got its own: Manageability and Resilience.


Having made sense of the space, let’s go deeper into the problems characterized by these categories, and discuss how the tools in each category solve them.
Use Case #1: Collaboration Over Data
The Pain
Data engineers and scientists manage multiple datasets – both external and internal – that change all the time. Managing access to the data and keeping track of versions over time is mentally intensive and error prone.
The Solution
An interface that allows collaboration over the data and version management. The actual repository may be a proprietary database (e.g. DoltHub), or provide efficient access to data distributed within your systems (e.g. Quilt or Splitgraph). These interfaces also make it easy to access and manage different versions of the same dataset.
Most players in this category also provide collaboration in other aspects of data workflows. Perhaps most notable is the ability to collaborate over ML models. In this category you can find the likes of DAGsHub, DoltHub, data.world, Kaggle, Splitgraph, Quilt, FloydHub and DataLad.
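To make this concrete, here is a minimal, purely illustrative Python sketch of what such a versioned-dataset interface looks like; the class and method names are hypothetical and do not correspond to the API of any tool listed above.

```python
# Hypothetical sketch of a shared, versioned dataset repository -- not the API
# of DoltHub, Quilt, Splitgraph, or any other tool mentioned in this post.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DatasetVersion:
    version_id: str        # e.g. a content hash or a human-readable tag
    files: Dict[str, str]  # logical path -> where the data actually lives


@dataclass
class VersionedDataset:
    name: str
    versions: Dict[str, DatasetVersion] = field(default_factory=dict)

    def publish(self, version_id: str, files: Dict[str, str]) -> None:
        # Each published version is immutable; older versions stay readable.
        self.versions[version_id] = DatasetVersion(version_id, files)

    def get(self, version_id: str) -> DatasetVersion:
        # Consumers pin their pipelines to an explicit version id.
        return self.versions[version_id]


# A producer publishes a new version while a teammate keeps reading the old one.
ds = VersionedDataset("customer-events")
ds.publish("v1", {"events.parquet": "s3://bucket/events/2020-08/part-0.parquet"})
ds.publish("v2", {"events.parquet": "s3://bucket/events/2020-09/part-0.parquet"})
print(ds.get("v1").files)
```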
Use Case #2: Managing ML Pipelines
The Pain
There are many, many steps in a machine learning pipeline: from raw input data to labeled data, validation sets, feature engineering, hyperparameter optimization, and finally productionization. Simply put, there’s no easy way to manage the complexity of these pipelines. Some of the complexity is born out of necessity; some of it comes from the variety of tools used that don’t play nicely together.
The Solution
MLOps tools. You might be asking yourself, “Why would Ops tools be mentioned in the context of data versioning?” Well, because managing data pipelines is a major challenge in the lifecycle of an ML application.
Since ML is scientific work, it requires reproducibility, and reproducibility means data + code (at a minimum). Several MLOps tools enable data versioning, including DVC, Pachyderm, MLflow, and Neptune.
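For example, here is a minimal sketch of this kind of reproducibility using DVC’s Python API, assuming a DVC-tracked file in a Git repository; the repository URL, file path, and tag below are placeholders:

```python
# Minimal sketch: read a dataset pinned to a specific Git revision with DVC.
# The repo URL, file path, and "v1.2" tag are illustrative placeholders.
import dvc.api

with dvc.api.open(
    "data/train.csv",                        # file tracked by DVC in that repo
    repo="https://github.com/example/repo",  # hypothetical Git repository
    rev="v1.2",                              # Git tag/commit pinning code + data
) as f:
    print(f.readline())
```

Because the revision identifies both the code and the exact data it was run against, rerunning the experiment later reproduces the same inputs.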
Use Case #3: Transactional Guarantees in Data Lakes
The Pain
Data lakes over object storage are immutable (both the objects and the formats). This conflicts, however, with the need for mutability in order to:
- Comply with GDPR and other privacy regulations (delete records on demand)
- Ingest streaming data (requires appends)
- Backfill or handle late-arriving data (requires updates to already-saved data).
The Solution
Structured Data Formats that allow Insert, Delete, and Upsert operations. The formats are columnar and provide the ability to change an existing object by saving the delta of the changes into another object.
The metadata of those objects includes instructions for generating the latest version of an object from its saved delta objects. Data versioning here serves mainly to provide concurrency control. In this category you can find the open source projects Apache Iceberg, Apache Hudi, and Delta Lake by Databricks.
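To illustrate, here is a minimal PySpark sketch using the Delta Lake API (Iceberg and Hudi expose similar operations); the table path, column names, and source data are assumptions, and the Spark/Delta session configuration is omitted:

```python
# Minimal sketch of mutating a Delta Lake table from PySpark.
# Assumes a Spark session already configured with the Delta Lake extensions;
# the paths and column names below are illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = DeltaTable.forPath(spark, "s3://bucket/events")

# GDPR-style delete: remove a specific user's records on demand.
table.delete("user_id = '42'")

# Backfill / late-arriving data as an upsert (merge) from a corrections DataFrame.
corrections = spark.read.parquet("s3://bucket/late_events")
(
    table.alias("t")
    .merge(corrections.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```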
Use Case #4: Data Lake Manageability and Resilience
The Pain
Managing the many data producers and consumers of an object-storage-based data lake. The consumers access the data using different tools, such as Hadoop/Spark, Presto, and analytic databases.
Coordinating data contributors and data consumers is challenging: it relies on internal processes and manual updates of catalogs or files. In addition, there’s no easy way to provide isolation without copying data, and no way to ensure consistency between multiple data collections.
The Solution
An interface that allows collaboration over the data and version management. For example, the interface can use Git terminology, versioning the lake through branching, committing, and merging changes.
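As a rough sketch of the idea (assuming lakeFS’s S3-compatible gateway, with placeholder endpoint, credentials, repository, and branch names), writing to an isolated branch can look like a regular object-store write; commits and merges themselves go through the lakeFS API or lakectl:

```python
# Illustrative only: lakeFS exposes an S3-compatible gateway where the bucket
# is the repository and the first path element of the key is the branch.
# Endpoint, credentials, repository, and branch names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

# Write an object to an isolated "experiment" branch of the "analytics" repo.
s3.put_object(
    Bucket="analytics",                      # repository
    Key="experiment/events/part-0.parquet",  # branch/path
    Body=b"...",
)

# Readers on the "main" branch are unaffected until "experiment" is merged.
obj = s3.get_object(Bucket="analytics", Key="main/events/part-0.parquet")
```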
Final Thoughts
We decided to create lakeFS after meeting with over 30 companies managing a data lake. These pains, familiar from our own experience, came up over and over.
lakeFS is designed to make managing data lakes as simple as possible, no matter how big the data is, what format it’s stored in, or what technologies you use to analyze it. Go ahead, give lakeFS a try, without installing anything.
About lakeFS
The lakeFS project is an open source technology that provides a Git-like version control interface for data lakes, with seamless integration with popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.
Originally published August 20, 2020 and updated on November 22, 2021.