Data Versioning – Does It Mean What You Think It Means?

Einat Orr, PhD

Last updated on January 28, 2026



Introduction

When we first thought about a tagline for our open source project lakeFS, we instinctively gravitated to terms like “Data versioning”, “Manage data the way you manage code”, “Git for data”, or any variation of the three that is grammatically correct.

We were very pleased with ourselves for 5 minutes, or maybe 7, before realizing these phrases don’t really mean anything. Or more precisely, mean too many things to properly describe the value of lakeFS. (They are also commonly used by other players in the domain that address completely different use cases.)

So we decided to map the projects that describe themselves as data versioning tools according to the use cases they address. We organized the existing ones into three categories: Collaboration, Machine Learning Management, and Table Formats.

And lakeFS – which didn’t fit naturally into these categories – got its own: Manageability and Resilience.

Having made sense of the space, let’s go deeper into the problems these categories describe and discuss how the various tools solve them.

Use Case #1: Collaboration Over Data

The Pain

Data engineers and scientists manage multiple datasets, both external and internal, that change all the time. Managing access to the data and keeping track of versions over time is mentally taxing and error-prone.

The Solution

An interface that allows collaboration over the data and version management. The actual repository may be a proprietary database (e.g. DoltHub), or it may provide efficient access to data distributed within your systems (e.g. Quilt or Splitgraph). These interfaces also make it easy to access and manage different versions of the same dataset.

Most players in this category also provide collaboration in other aspects of data workflows. Perhaps most notable is the ability to collaborate over ML models. In this category you can find the likes of DAGsHub, DoltHub, data.world, Kaggle, Splitgraph, Quilt, FloydHub and DataLad.

Use Case #2: Managing ML Pipelines

The Pain

There are many, many steps in machine learning pipelines: from input data to tagged data, validation sets, feature modeling, hyper-parameter optimization, and finally productionization. Simply put, there’s no easy way to manage complexity in these pipelines. Some of the complexity is born out of necessity, and some of it comes from the variety of tools used that don’t play nicely together.

The Solution

MLOps tools. You might be asking yourself, “Why would Ops tools be mentioned in the context of data versioning?” Well, because managing data pipelines is a major challenge in the lifecycle of an ML application.

Since ML is scientific work, it requires reproducibility, and reproducibility means data + code (at a minimum). Several MLOps tools enable data versioning, including DVC, Pachyderm, MLflow, and Neptune.
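To make "reproducibility means data + code" concrete, here is a minimal sketch of how a run could be identified by both of its inputs. It is not taken from any of the tools above; the `fingerprint` helper, the sample CSV bytes, and the `"abc123"` code version string are all hypothetical, chosen only for illustration.

```python
import hashlib


def fingerprint(data: bytes, code_version: str) -> str:
    """Derive one identifier from the data content and the code version.

    Hypothetical helper: two runs share a fingerprint only if both the
    data bytes and the code version are identical.
    """
    data_hash = hashlib.sha256(data).hexdigest()
    combined = f"{data_hash}:{code_version}".encode("utf-8")
    return hashlib.sha256(combined).hexdigest()


# Same data and same code version: the run is reproducible.
run_a = fingerprint(b"id,label\n1,cat\n", "abc123")
run_b = fingerprint(b"id,label\n1,cat\n", "abc123")

# One changed label in the data: a different run, even with identical code.
run_c = fingerprint(b"id,label\n1,dog\n", "abc123")

print(run_a == run_b)
print(run_a == run_c)
```

Versioning the code alone would treat `run_a` and `run_c` as the same experiment, which is why the data has to be versioned as well.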

Use Case #3: Transactional Guarantees in Data Lakes

The Pain

Data lakes over object storage are immutable (both objects and formats). This conflicts, however, with the mutability requirements to:

Comply with GDPR and other privacy regulations (delete records on demand)
Ingest streaming data (requires appends)
Backfill or handle late-arriving data (requires updates to already saved data)

The Solution

Structured Data Formats that allow Insert, Delete, and Upsert operations. The formats are columnar and provide the ability to change an existing object by saving the delta of the changes into another object.

The metadata of those objects includes the instructions on how to generate the latest version of an object from its saved delta objects. Data versioning is added mainly to provide concurrency control. In this category you can find the open source projects Apache Iceberg, Apache Hudi, and Delta Lake by Databricks.
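The idea of rebuilding the latest version from a base object plus saved deltas can be sketched in a few lines. This is a simplified, in-memory illustration only, and it does not reflect the actual file layout or metadata of Iceberg, Hudi, or Delta Lake. The names `Row`, `Delta`, and `materialize` are hypothetical, and a delete is assumed to be recorded as a key paired with `None`.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
# A delta is a list of (key, row) changes; a row of None marks a delete.
Delta = List[Tuple[str, Optional[Row]]]


def materialize(base: Dict[str, Row], deltas: List[Delta]) -> Dict[str, Row]:
    """Replay the deltas, in order, on top of a copy of the base object."""
    current = dict(base)  # the base object itself is never modified
    for delta in deltas:
        for key, row in delta:
            if row is None:
                current.pop(key, None)  # delete
            else:
                current[key] = row      # insert or update (upsert)
    return current


base = {"u1": {"name": "Ada"}, "u2": {"name": "Bob"}}

# The ordered list of deltas plays the role of the metadata.
log: List[Delta] = [
    [("u3", {"name": "Cy"})],       # insert
    [("u2", None)],                 # delete, e.g. a GDPR request
    [("u1", {"name": "Ada L."})],   # update of an existing record
]

latest = materialize(base, log)
previous = materialize(base, log[:2])  # an earlier version, from a log prefix

print(latest)
print(previous)
```

Because each change is written as a new object and the base is left alone, any prefix of the log describes a consistent earlier version, and that is the property that makes versioning useful for concurrency control.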

Use Case #4: Data Lake Manageability and Resilience

The Pain

Multiple data producers and consumers share a data lake built on object storage. The consumers access the data using different tools, such as Hadoop/Spark, Presto, and analytic databases.

Coordination between the data contributors and data consumers is challenging. It relies on internal processes and manual updates of catalogs or files. In addition, there’s no easy way to provide isolation without copying data, and no way to ensure consistency between multiple data collections.

The Solution

An interface that allows collaboration over the data and version management. For example, the interface can use Git-like terminology that allows versioning of the lake by branching, committing, and merging changes.
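As a rough illustration of what branching, committing, and merging over a lake could look like, and of how a branch can provide isolation without copying data, here is a toy in-memory model. It is not the lakeFS API or implementation: the `ToyRepo` class and all of its methods are hypothetical, and the merge is deliberately naive (the source branch wins, with no conflict detection and no propagation of deletes).

```python
import hashlib
from typing import Dict, List


class ToyRepo:
    """Hypothetical in-memory model of Git-like versioning over objects."""

    def __init__(self) -> None:
        # address -> content; each distinct content is stored only once
        self.objects: Dict[str, bytes] = {}
        # branch -> {path: address}; a branch is only a set of pointers
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}
        # branch -> list of committed snapshots of its pointers
        self.commits: Dict[str, List[Dict[str, str]]] = {"main": []}

    def put(self, branch: str, path: str, content: bytes) -> None:
        address = hashlib.sha256(content).hexdigest()
        self.objects[address] = content
        self.branches[branch][path] = address

    def get(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch][path]]

    def commit(self, branch: str) -> int:
        # Record an immutable snapshot of the branch's current pointers.
        self.commits[branch].append(dict(self.branches[branch]))
        return len(self.commits[branch]) - 1

    def create_branch(self, name: str, source: str) -> None:
        # Copies pointers only; no object content is duplicated.
        self.branches[name] = dict(self.branches[source])
        self.commits[name] = list(self.commits[source])

    def merge(self, source: str, destination: str) -> None:
        # Naive merge: pointers from the source branch win.
        self.branches[destination].update(self.branches[source])
        self.commit(destination)


repo = ToyRepo()
repo.put("main", "events/day1.csv", b"a,b\n1,2\n")
repo.commit("main")

# Work in isolation on a branch; main is untouched until the merge.
repo.create_branch("experiment", "main")
repo.put("experiment", "events/day1.csv", b"a,b\n1,3\n")
repo.commit("experiment")
main_before_merge = repo.get("main", "events/day1.csv")

repo.merge("experiment", "main")
main_after_merge = repo.get("main", "events/day1.csv")
```

In this sketch, creating the branch stores no new object content, and consumers reading `main` keep seeing the old file until the merge makes the change visible in one step.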

Final Thoughts

We decided to create lakeFS after meeting with over 30 companies managing a data lake. These pains, familiar from our own experience, came up over and over.

lakeFS is designed to make managing data lakes as simple as possible, no matter how big the data is, what format it’s stored in, or what technologies you use to analyze it. Go ahead, give lakeFS a try, without installing.

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.
