Based on my presentation at PyData Global 2025.
Throughout my career, I’ve focused on data and AI systems in highly regulated industries and on high-stakes use cases, where the cost of a mistake is high and the consequences of failure are serious.
It’s one thing to use AI to create a TikTok video, write advertising copy, or come up with a recipe for a holiday party. It’s entirely different to use AI in healthcare, finance, or government use cases, where if an AI system makes a mistake, people can get hurt or even die.
Because of these serious repercussions, we see a much higher bar for building AI systems that are reliable, trustworthy, safe to use, accurate, and auditable.
There is a broad class of general-purpose models we can use for everyday speech and language tasks, like chatbots. But for many other use cases, those models just aren’t good enough: they make too many mistakes and aren’t reliable.
What about agents? There has been a lot of excitement about agents, but in my experience they aren’t ready for many high-stakes use cases: they still make mistakes, and we can’t yet make them reliable enough.
Is there a way for sectors like healthcare to build better models? I believe the answer lies in reliable, reproducible, AI-ready first-party data. And data version control is at its core.
I’ll go through the basics of data version control and show an example of how teams can use data versioning in practice while building a computer vision model.
What is data version control?
Let’s start with what data version control is. I think the best way to understand it is by talking about lakeFS customers and what they’re trying to do. Our customers are very different: Fortune 500 companies across many industries, government agencies, research organizations, startups, companies building their own models, and companies using models built by others.
But they have some things in common:
- They tend to work with large-scale data, often stored in object storage, to build or run models.
- They do a lot of iteration.
- Sometimes the data changes constantly, sometimes they run many experiments, and either way they need to do it efficiently.
- They often have distributed teams where different people and teams handle different parts of the data lifecycle: capturing, managing, storing, processing, preparing for machine learning, training models, and running inference.
- They also care deeply about reliability. Many of our customers work in highly regulated or high-stakes environments. They need AI systems that are trustworthy, reliable, reproducible, and, when appropriate, auditable. That’s a big part of why lakeFS exists and why people use data version control.
lakeFS sits in the middle of a typical data and AI stack. At the top, you have all the tools that data engineers, MLOps teams, and data scientists use to build and deploy systems. At the bottom, you have object storage, where the data actually lives. Most customers use object storage, often at a very large scale, in data lakes.

Why do we need data versioning?
Object stores and data lakes tend to get messy and complicated. One idea behind data lakes is that it should be easy to ingest data without strict schemas or validation. That makes it easy to collect data, but it creates a tax later when you want to use that data. It becomes harder to understand what you have and how to use it.
We often hear the term “data swamp.” It’s a cliché, but it’s real.
Over time, data lakes become difficult to manage. Many tools used by data engineers and data scientists have an imperfect understanding of what’s really happening in the object storage layer, because things move fast and become messy.
Our goal is to provide a layer between the data itself and the AI tools, so it’s easier to use data and build the systems customers want.
Data version control helps with this. We can borrow ideas from software engineering, especially from Git. The same concepts used to manage code – branching, merging, validation, commits, and tags – can be applied to data. We can experiment safely, collaborate, merge changes back into a main dataset, enforce validation rules, and create snapshots of data that can be recreated months or years later.
Data version control solves a number of other pain points
Data discovery
In large organizations with many teams creating and modifying data, people often end up defensively copying data. Many versions of the same datasets start to appear. When someone needs data, they have to spend time on “data archaeology” to figure out which version is correct.
Data provisioning
Another pain point is provisioning data. A team might want production data for safe experimentation, which often means copying large datasets. That takes time and creates even more data sprawl.
Collaboration
Many customers have complex pipelines, and data engineering teams must stitch all of this together before ML teams can use it. There are many steps and teams, and mistakes can easily cascade, leading to significant rework.
GPU utilization
GPUs are scarce and expensive, so it makes sense that teams want to keep them as utilized as possible. If a data pipeline fails at step 22, they don’t want to restart from step 1 if they can avoid it. Faster recovery means GPUs don’t sit idle. In some cases, delivering data to the GPU is the bottleneck, so improving that flow saves money and improves efficiency.
Reproducibility
Companies building AI models need to answer questions like, “Who built this model six months ago? How did they build it? What data did they use?” For the healthcare sector, reproducibility is crucial – teams must be able to reproduce exactly the data used to train a model.
Does versioning mean copying data?
When we apply data version control to object storage, we have to remember that data can be enormous: terabytes or petabytes. We don’t want to copy data every time we run an experiment. Instead, lakeFS creates zero-copy branches using metadata operations.
How does that work? We describe which objects belong to a dataset at a given point in time. When we make changes, we update metadata, not the data itself. This makes data version control fast and scalable.
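To make that concrete, here’s a minimal sketch using the high-level `lakefs` Python SDK (an assumption on my part; the same idea works through the UI or the API directly). The repository and branch names are placeholders.

```python
import lakefs

# Open an existing repository (client configuration comes from the environment / lakectl config).
repo = lakefs.repository("example-repo")

# Branch creation is a metadata operation: no objects are copied.
experiment = repo.branch("my-experiment").create(source_reference="main")

# The new branch immediately "sees" everything main has, without duplicating any data.
for obj in experiment.objects(max_amount=3):
    print(obj.path, obj.size_bytes)
```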
Now that we’ve got the theory out of the way, let’s see how this works in the real world.
Step-by-step guide to data versioning for computer vision
Setting up a data version control environment: lakeFS
You can use lakeFS either locally or in the cloud. Let’s start with the lakeFS samples on GitHub.
The setup creates several containers: a notebook server, a MinIO S3-compatible object store, and a lakeFS instance. You could also use AWS, GCP, Azure, or another S3-compatible store. I’m also installing MLflow for experiment tracking.

In the lakeFS samples repository, there are many examples and notebooks. I encourage you to explore them. There’s also a getting-started guide and a step-by-step tutorial in the documentation.

I’ve already set up lakeFS, so I’ll just show you the basics now. This is what the UI looks like:

In the actual model-building workflow, we’ll use the API programmatically, but the UI helps visualize what’s happening.
For example, I can create a repository and load data into it. That’s what I’m doing here in the UI.
Let’s say this is my main branch. We’ll start with our initial data on main, and then I’ll walk through the Git-like operations we use to manage data versions. We can create tags, for example, to say “this is our raw data.”
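Programmatically, that setup step looks roughly like this sketch, again assuming the high-level `lakefs` SDK; the endpoint, credentials, repository name, and storage namespace are placeholders for my demo environment.

```python
import lakefs
from lakefs.client import Client

# Connect to the local lakeFS instance (placeholder credentials).
clt = Client(host="http://localhost:8000",
             username="ACCESS_KEY_ID",
             password="SECRET_ACCESS_KEY")

# Create a repository backed by the MinIO bucket and load an initial object into main.
repo = lakefs.Repository("dr-screening", client=clt).create(
    storage_namespace="s3://example-bucket/dr-screening", exist_ok=True)

main = repo.branch("main")
with open("retina_0001.png", "rb") as f:
    main.object("images/retina_0001.png").upload(data=f.read(),
                                                 content_type="image/png")
main.commit(message="Initial raw retina scans")

# Tag the commit so "raw data" stays addressable later.
repo.tag("raw-data").create("main")
```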

We can also create a branch. Part of the idea of data version control is that we can branch off main so we can safely experiment without affecting the production dataset. This happens really fast. Here it’s tiny data, so it doesn’t matter, but if this were terabyte- or petabyte-scale data – either tens of thousands of objects or very large files – creating the branch would still be fast because it’s a metadata operation.

Now we can make changes on this branch. We can experiment safely without breaking the production dataset.
Let’s say I’m doing some experimentation: maybe I’m changing the data I’m using to train a model. I delete some files and add some files, then commit those changes. At this point, my experiment branch differs from main.
Just like with Git and code, I can compare the changes. I can see what objects I deleted and what objects I added. If I decide these changes are what I want reflected in production, I can merge them back. I can do a pull request and manage changes to main.
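The compare-and-merge step might look like this sketch (high-level `lakefs` SDK assumed; the branch names are the demo’s placeholders):

```python
import lakefs

repo = lakefs.repository("dr-screening")
main = repo.branch("main")
experiment = repo.branch("my-experiment")

# What differs between main and the experiment branch?
for change in main.diff(other_ref=experiment):
    print(change.type, change.path)   # e.g. added / removed object paths

# If these changes are what we want in production, merge them back into main.
experiment.merge_into(main)
```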
Setting up the object store: MinIO
In this case, on my Mac, I’m using MinIO, which is S3-compatible. When I was interacting with lakeFS, I was actually interacting with this bucket under the hood. I created a repo, which created a path in this bucket, and the data I loaded shows up here. So there’s a relationship between lakeFS and the underlying object storage you’re using.
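If you’re curious, you can peek at the backing bucket directly. Here’s a hedged sketch with boto3; the MinIO endpoint, credentials, and bucket name are placeholders for my local setup.

```python
import boto3

# Talk to MinIO directly through its S3-compatible API (placeholder endpoint and credentials).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# lakeFS keeps the repository's data and metadata under its storage namespace in this bucket.
resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="dr-screening/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```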

The notebook server looks like this, and it includes the sample notebooks. There’s a little index, and there are plenty of notebooks you can use to get started with lakeFS. I encourage you to check them out.
Setting up model tracking: MLflow
I’m going to use MLflow to track the models we create and the experiments we run, alongside lakeFS, to show the data we used throughout the process.
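Pointing the notebook at MLflow is just a couple of lines; the tracking URI and experiment name here are assumptions for this demo setup.

```python
import mlflow

# The sample environment runs an MLflow tracking server locally (placeholder URI).
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("diabetic-retinopathy-screening")
```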

Example scenario: Building a computer vision model to detect diabetic retinopathy (DR)
The scenario is a healthcare AI use case: we’re building a computer vision model to detect diabetic retinopathy (DR). We’ll use images of normal and diseased retinas and explore different ways to train a model to predict whether a scan shows this condition – an early screening tool.
I’m going to flip over to VS Code. I’ll run a few housekeeping cells:
- load libraries,
- configure a few things,
- and connect to my lakeFS and MLflow environment.

I’ve specified which repo I’ll use, and I’m starting from a single branch that contains retina scan data – some with DR and some without. I’m also setting up a few helper functions to interact with lakeFS programmatically.
Next, I’m going to load data. I’m loading data into my local environment from the main branch. You could imagine this as a cloud GPU instance or another piece of infrastructure you use for processing – this is just a stand-in for the demo.
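As a sketch, pulling the training images from main down to local disk with the high-level `lakefs` SDK might look like this; the repo name, prefix, and local path are placeholders.

```python
from pathlib import Path

import lakefs

repo = lakefs.repository("dr-screening")
main = repo.branch("main")

# Download every object under images/ from the main branch into a local working directory.
local_dir = Path("data/main")
for obj in main.objects(prefix="images/"):
    target = local_dir / obj.path
    target.parent.mkdir(parents=True, exist_ok=True)
    with main.object(obj.path).reader(mode="rb") as r:
        target.write_bytes(r.read())
```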

Now let’s look at the images. At the top, we have normal retinas. At the bottom, we have retinas with the disease. Our goal is to train a model to spot patterns in the pixel data and predict whether the scan has DR (yes/no).

A couple caveats:
- First, this is a simulation. I’m using a small sample, so everything trains quickly, which also introduces some variability in performance. That’s actually helpful because I don’t fully know what the outcome will be.
- Second, I’m not a professional data scientist. I’ve worked with a lot of talented PhD data scientists and learned things by osmosis, but if you’re a real data scientist, you may see things you’d do differently.
Now I’m defining the model architecture and some training parameters. I’m going to keep the architecture consistent across runs. First, I’ll train a baseline model using the baseline data. Since the dataset is small, it trains relatively fast. Then we’ll check performance.
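The exact architecture isn’t the point of the talk, so here’s a stand-in: a small Keras CNN for binary classification, with assumed input size and hyperparameters. The important part is that it stays fixed across runs so that only the data varies.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(224, 224, 3)):
    # A deliberately small CNN so training stays fast on a small sample.
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Rescaling(1.0 / 255),
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # probability that the scan shows DR
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy", keras.metrics.Recall(name="recall")])
    return model

model = build_model()
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # datasets built from the branch data
```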

We have 200 samples in the validation set. In this simulation, let’s say we care most about false negatives. We want a rapid screening tool: if someone has the condition, we want to catch it and send them for follow-up testing. False positives matter too, but let’s assume false negatives are the primary concern.
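Evaluating with that priority in mind mostly means reading the confusion matrix. A quick sketch, where `model`, `X_val`, and `y_val` stand in for the demo’s actual variables:

```python
from sklearn.metrics import confusion_matrix

# Predict on the ~200-sample validation set (1 = DR present, 0 = normal).
y_pred = (model.predict(X_val).ravel() >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall on the DR class: the metric we care most about
print(f"false negatives: {fn}, false positives: {fp}, sensitivity: {sensitivity:.2f}")
```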

This is our baseline model. Now I’m going to log some things:
- I’ll write some metadata – it’s simulated here, but in real healthcare AI there are heavy metadata requirements: where the data came from, how it was captured, population details, imaging equipment, and other factors. This is important so we can evaluate, reproduce, and audit the model later.
- Log the baseline model to MLflow – that records the experiment.
- Commit model-related metadata to lakeFS as well – a sketch of this logging flow follows the list.
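Here’s that logging flow as a rough sketch, assuming the high-level `lakefs` SDK and MLflow; the run name, parameter values, tag keys, file paths, and placeholder metrics are all illustrative.

```python
import json

import lakefs
import mlflow

repo = lakefs.repository("dr-screening")
main = repo.branch("main")
data_commit = main.get_commit().id      # immutable reference to the training data snapshot

val_accuracy, fn = 0.80, 12             # placeholders for the metrics from the evaluation step

with mlflow.start_run(run_name="baseline") as run:
    mlflow.log_params({"architecture": "small_cnn", "epochs": 10, "batch_size": 32})
    mlflow.log_metrics({"val_accuracy": float(val_accuracy), "false_negatives": int(fn)})
    # Link the run to the exact data it was trained on.
    mlflow.set_tags({"lakefs_repo": "dr-screening", "lakefs_commit": data_commit})

# And the reverse link: store the run id alongside the data in lakeFS.
main.object("models/baseline/run_info.json").upload(
    data=json.dumps({"mlflow_run_id": run.info.run_id, "model": "baseline"}),
    content_type="application/json")
main.commit(message="Record baseline training run metadata",
            metadata={"mlflow_run_id": run.info.run_id})
```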
At this point, we have a baseline model. It’s not great, but we’ll deploy it to production for now.
Now let’s look at what we did. In MLflow, we have an experiment for the baseline model that includes performance metrics, model parameters, and the metadata we captured. Crucially, we also have a link to the exact data used to train the model – an immutable reference to the snapshot of data. Even if the object store changes later, we can still retrain the model on the exact same data snapshot.

If we flip over to lakeFS, we can also see MLflow-related metadata stored there. So there’s a bidirectional linkage: from MLflow, we can see which data snapshot was used; from the data snapshot, we can see which experiments ran against it.

Now let’s say we want to improve the model. I’ll keep the architecture the same, but change the data.
First, I’ll create a branch: a zero-copy branch off the production dataset so I can experiment safely.

Then I’ll augment the dataset by generating simulated variations – rotations, shifts, zooming, and flips. I’ll run a data processing job, then load the augmented images into the experiment branch in lakeFS and commit the result.
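As a sketch of that augmentation step (assuming Keras’ ImageDataGenerator and the high-level `lakefs` SDK; file names, the branch name, and parameter values are illustrative):

```python
import io

import lakefs
from tensorflow.keras.preprocessing.image import (ImageDataGenerator, array_to_img,
                                                  img_to_array, load_img)

# Random rotations, shifts, zooms, and flips.
augmenter = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
)

# The zero-copy experiment branch created off main.
exp = lakefs.repository("dr-screening").branch("augmented-data")

source = img_to_array(load_img("data/main/images/retina_0001.png"))
for i in range(3):   # a few simulated variations per source image
    variant = array_to_img(augmenter.random_transform(source))
    buf = io.BytesIO()
    variant.save(buf, format="PNG")
    exp.object(f"images/aug/retina_0001_{i}.png").upload(
        data=buf.getvalue(), content_type="image/png")

exp.commit(message="Add augmented retina scans")
```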

Now if we look at the repo, we have a new branch with new data that doesn’t exist on main. It’s isolated and safe.

Next we’ll train the model on this augmented data and see how it performs.
Here’s the V2 model.

As hoped, it performs a bit better on the metric we care most about: fewer false negatives. Still amateur-grade, but it’s progress. Now we log this model to MLflow, and we log the MLflow metadata back into lakeFS again.
If we look in MLflow, we now have another experiment.

It has similar metadata, but it’s tied to a different lakeFS snapshot – so recreating it later is straightforward. And again, while this might feel like “a lot of logging,” the value shows up when teams are large and distributed, or when you’re trying to understand what someone else did. It also helps if you forget what you did yesterday.
Now we’ll try one more experiment.
We’ll change the data by creating a new branch and applying CLAHE (contrast-limited adaptive histogram equalization). It increases contrast in the structures we want the model to pick up, and it’s used in medical imaging contexts. We’ll apply this processing, inspect the results, load the processed data into the new branch, commit it, and train a model on it.
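Here’s a sketch of the CLAHE step with OpenCV, writing the results to the new branch; the clip limit, tile size, paths, and branch name are illustrative choices rather than the talk’s exact parameters.

```python
import cv2
import lakefs

def apply_clahe(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    # Equalize contrast on the lightness channel only, then convert back to BGR.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid).apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

repo = lakefs.repository("dr-screening")
clahe_branch = repo.branch("clahe-data").create(source_reference="main", exist_ok=True)

image = cv2.imread("data/main/images/retina_0001.png")
ok, encoded = cv2.imencode(".png", apply_clahe(image))
clahe_branch.object("images/clahe/retina_0001.png").upload(
    data=encoded.tobytes(), content_type="image/png")
clahe_branch.commit(message="Add CLAHE-processed retina scans")
```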

In this run, it doesn’t help. It improves the wrong thing: false positives look better, but false negatives get worse. That’s the nature of machine learning – you don’t always know what will work until you try. Still, we log the experiment to MLflow and log the metadata into lakeFS so we have a record for the future.

Now we compare models: baseline, augmented-data, and CLAHE. CLAHE performs better in some areas, but worse on the most important metric.

So we can decide to promote the augmented-data model to production. That means we merge the augmented-data branch back into main – because that’s now our “golden” training data – and we promote the corresponding model in MLflow.
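That promotion step could look roughly like this, assuming the high-level `lakefs` SDK and MLflow’s model registry; the registry name, alias, and `augmented_run_id` are placeholders, and I’m assuming the V2 model was logged under the artifact path "model".

```python
import lakefs
import mlflow
from mlflow.tracking import MlflowClient

# Merge the augmented-data branch into main: this data is now the golden training set.
repo = lakefs.repository("dr-screening")
repo.branch("augmented-data").merge_into(repo.branch("main"))

# Register the corresponding MLflow model and mark it as the production version.
augmented_run_id = "PLACEHOLDER_RUN_ID"   # the run id of the V2 (augmented-data) experiment
version = mlflow.register_model(model_uri=f"runs:/{augmented_run_id}/model",
                                name="dr-screening-classifier")
MlflowClient().set_registered_model_alias("dr-screening-classifier",
                                          "production", version.version)
```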
If we step back and look at everything, the benefit becomes clearer over time. Imagine six months pass, the team changes, and a new data scientist needs to understand what was tried, what worked, and which data were used.
MLflow provides experiment history and metadata, and lakeFS provides an exact snapshot of the training data. Together, they make reproduction and auditing much easier.
Wrap up
Imagine you’re building a healthcare AI product and want regulatory approval – FDA in the US, MDR in the EU. A lot of what’s required is documentation: how the model was built, how to reproduce it, and evidence that it can be audited.
Without this tracking, it’s hard to compile. With MLflow and lakeFS, you can provide model performance details, model training information, infrastructure details, and – crucially – the exact data commit used for training. You can recreate the repo exactly as it existed at training time, even if the object store has changed. That’s a major part of the value of data version control in high-stakes settings.
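As a final sketch: months later, an auditor or a new team member can read the data exactly as it was at training time by resolving the commit id recorded on the MLflow run (again assuming the high-level `lakefs` SDK; the repo name and commit id below are placeholders).

```python
import lakefs

repo = lakefs.repository("dr-screening")
snapshot = repo.ref("COMMIT_ID_FROM_MLFLOW_RUN")   # the lakefs_commit tag stored on the run

# Whatever has changed on main since, this reference still resolves to the training-time data.
for obj in snapshot.objects(prefix="images/", max_amount=5):
    print(obj.path, obj.size_bytes)
```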


