When we first thought about a tagline for our open source project lakeFS, we instinctively gravitated to terms like “Data versioning”, “Manage data the way you manage code”, “Git for data”, or any variation of the three that is grammatically correct.
We were very pleased with ourselves for 5 minutes, or maybe 7, before realizing these phrases don’t really mean anything. Or, more precisely, they mean too many things to properly describe the value of lakeFS. (They are also commonly used by other players in the domain that address completely different use cases.)
So we decided to map the projects that describe what they do as data versioning, according to their use cases. We organized the existing ones into three categories: Collaboration, Machine Learning Management, and Table Formats.
And lakeFS – which didn’t fit naturally into these categories – got its own: Manageability and Resilience.
Having made sense of the space, let’s go deeper into the problems these categories characterize and discuss how the various tools solve them.
Use Case #1: Collaboration Over Data
Data engineers and data scientists manage multiple datasets, both external and internal, that change all the time. Managing access to the data and keeping track of versions over time is mentally intensive and error-prone.
The tools in this category offer an interface that allows collaboration over the data and version management. The actual repository may be a proprietary database (e.g. DoltHub), or the tool may provide efficient access to data distributed within your systems (e.g. Quilt or Splitgraph). These interfaces also make it easy to access and manage different versions of the same dataset.
Most players in this category also provide collaboration in other aspects of data workflows. Perhaps most notable is the ability to collaborate over ML models. In this category you can find the likes of DAGsHub, DoltHub, data.world, Kaggle, Splitgraph, Quilt, FloydHub and DataLad.
Use Case #2: Managing ML Pipelines
There are many, many steps in machine learning pipelines: from input data through tagged data, validation sets, feature modeling, and hyper-parameter optimization to, finally, productionization. Simply put, there’s no easy way to manage complexity in these pipelines. Some of the complexity is born out of necessity; some of it comes from the variety of tools used that don’t play nicely together.
The answer here is MLOps tools. You might be asking yourself, “Why would Ops tools be mentioned in the context of data versioning?” Well, because managing data pipelines is a major challenge in the lifecycle of an ML application.
Since ML is scientific work, it requires reproducibility, and reproducibility means data + code (at a minimum). Several MLOps tools enable data versioning, including DVC, Pachyderm, MLflow, and Neptune.
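To make "reproducibility means data + code" concrete, here is a minimal sketch of an identifier that changes whenever either the data or the code changes. It is an illustration only, not how DVC, Pachyderm, MLflow, or Neptune work internally; the function names, the dataset name, and the revision strings are all hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a blob of bytes."""
    return hashlib.sha256(data).hexdigest()


def run_fingerprint(data_blobs: dict[str, bytes], code_revision: str) -> str:
    """Combine data hashes and a code revision into one identifier.

    data_blobs maps a dataset name to its raw bytes (hypothetical inputs);
    code_revision stands in for something like a Git commit ID.
    """
    digest = hashlib.sha256()
    digest.update(code_revision.encode("utf-8"))
    # Sort by name so the fingerprint does not depend on dict ordering.
    for name in sorted(data_blobs):
        digest.update(name.encode("utf-8"))
        digest.update(content_hash(data_blobs[name]).encode("utf-8"))
    return digest.hexdigest()


# Same data and same code give the same fingerprint.
baseline = run_fingerprint({"train.csv": b"a,b\n1,2\n"}, "abc123")
same = run_fingerprint({"train.csv": b"a,b\n1,2\n"}, "abc123")

# Changing either the data or the code gives a different fingerprint.
changed_data = run_fingerprint({"train.csv": b"a,b\n1,3\n"}, "abc123")
changed_code = run_fingerprint({"train.csv": b"a,b\n1,2\n"}, "def456")
```

Recording such a fingerprint next to a trained model is one simple way to tell whether two runs really used the same inputs.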
Use Case #3: Transactional Guarantees in Data Lakes
Data lakes over object storage are immutable (both objects and formats). This conflicts, however, with the mutability requirements to:
- Comply with GDPR and other privacy regulations (delete records on demand)
- Ingest streaming data (requires appends)
- Backfill or handle late-arriving data (requires updates to already-saved data)
The solution is structured data formats that allow insert, delete, and upsert operations. These formats are columnar and make it possible to change an existing object by saving the delta of the changes into another object.
The metadata of those objects includes instructions for generating the latest version of an object from its saved delta objects. In these formats, data versioning serves mainly to provide concurrency control. In this category you can find the open source projects Apache Iceberg, Apache Hudi, and Delta Lake by Databricks.
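The idea of rebuilding the latest version from an immutable base plus saved deltas can be sketched in a few lines. This is a toy, in-memory model under simplifying assumptions (rows keyed by an integer ID, deltas held in an ordered list); it does not reflect the actual file layouts or APIs of Iceberg, Hudi, or Delta Lake, and every name in it is hypothetical.

```python
from typing import Optional

Record = dict  # one row, keyed by column name
Delta = tuple[str, int, Optional[Record]]  # (operation, row key, new row or None)


def apply_deltas(base: dict[int, Record], deltas: list[Delta]) -> dict[int, Record]:
    """Rebuild a snapshot from an immutable base and an ordered list of deltas."""
    snapshot = dict(base)  # copy, so the base "object" is never modified
    for op, key, row in deltas:
        if op == "upsert":
            snapshot[key] = row  # insert a new row or replace an existing one
        elif op == "delete":
            snapshot.pop(key, None)  # e.g. a delete-on-demand privacy request
        else:
            raise ValueError(f"unknown operation: {op}")
    return snapshot


base = {1: {"user": "ann"}, 2: {"user": "bob"}}
deltas = [
    ("upsert", 3, {"user": "cy"}),    # append (streaming ingest)
    ("delete", 2, None),              # delete a record on demand
    ("upsert", 1, {"user": "anne"}),  # late-arriving correction
]

latest = apply_deltas(base, deltas)
# Applying only a prefix of the deltas yields an earlier version.
version_1 = apply_deltas(base, deltas[:1])
```

Because the base is never rewritten, each prefix of the delta list corresponds to a version of the table, which is where the versioning in these formats comes from.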
Use Case #4: Data Lake Manageability and Resilience
An object-storage-based data lake typically has multiple data producers and consumers to manage. The consumers access the data using different tools, such as Hadoop/Spark, Presto, and analytic databases.
Coordination between the data contributors and data consumers is challenging. It relies on internal processes and manual updates of catalogs or files. In addition, there’s no easy way to provide isolation without copying data, and there is no way to ensure consistency between multiple data collections.
The solution is an interface that allows collaboration over the data and version management. For example, the interface can use Git-like terminology that allows versioning of the lake by branching, committing, and merging changes.
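To illustrate what branching, committing, and merging can mean for a lake, here is a toy, in-memory sketch in which a commit is a mapping from object paths to object IDs, so a branch is just a pointer and no data is copied. It is not the lakeFS API or implementation; the class, its methods, and the paths are hypothetical, and the merge is deliberately naive (no deletions, no conflict detection).

```python
class ToyRepo:
    """A toy model of Git-like versioning over object paths."""

    def __init__(self) -> None:
        # Each commit is a snapshot: path -> object ID (a pointer, not the data).
        self.commits: dict[int, dict[str, str]] = {0: {}}
        self.branches: dict[str, int] = {"main": 0}
        self.branch_points: dict[str, int] = {"main": 0}
        self.staged: dict[str, dict[str, str]] = {"main": {}}

    def branch(self, name: str, source: str) -> None:
        # Creating a branch copies a commit pointer, never the objects.
        head = self.branches[source]
        self.branches[name] = head
        self.branch_points[name] = head
        self.staged[name] = {}

    def put(self, branch: str, path: str, object_id: str) -> None:
        # Uncommitted writes stay isolated on this branch.
        self.staged[branch][path] = object_id

    def _new_commit(self, branch: str, snapshot: dict[str, str]) -> int:
        commit_id = len(self.commits)
        self.commits[commit_id] = snapshot
        self.branches[branch] = commit_id
        return commit_id

    def commit(self, branch: str) -> int:
        snapshot = {**self.commits[self.branches[branch]], **self.staged[branch]}
        self.staged[branch] = {}
        return self._new_commit(branch, snapshot)

    def merge(self, source: str, target: str) -> int:
        # Naive merge: apply paths the source changed since its branch point.
        base = self.commits[self.branch_points[source]]
        ours = self.commits[self.branches[source]]
        changes = {p: o for p, o in ours.items() if base.get(p) != o}
        merged = {**self.commits[self.branches[target]], **changes}
        return self._new_commit(target, merged)

    def read(self, branch: str) -> dict[str, str]:
        return dict(self.commits[self.branches[branch]])


repo = ToyRepo()
repo.put("main", "events/day1.parquet", "obj-1")
repo.commit("main")

repo.branch("etl-test", "main")
repo.put("etl-test", "events/day2.parquet", "obj-2")
repo.commit("etl-test")

main_before_merge = repo.read("main")  # consumers of main do not see day2 yet
repo.merge("etl-test", "main")
main_after_merge = repo.read("main")   # both paths become visible together
```

Consumers reading `main` see either the state before the merge or the state after it, never a half-written one, which is the kind of isolation and cross-collection consistency described above.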
Originally published August 20, 2020 and updated on November 22, 2021.