Based on my presentation at PyData Global 2025.
My colleague Yoav recently wrote about why reproducibility matters so much in healthcare AI and how data version control addresses the gap. This post is a follow-up to that, with an overview of how to incorporate data version control into your existing ML workflow to address the reproducibility challenge.
I’ll walk through the key concepts using an end-to-end example where I built a simple computer vision model to detect retinal disease (using this dataset). If you want a full step-by-step tutorial, check out this presentation from PyData Global, or grab an example notebook from lakeFS samples that outlines a similar end-to-end approach.
In this post I want to focus on what matters conceptually and how the pieces fit together as you incorporate lakeFS into your ML workflow.
Your experiment tracking has a blind spot
Here’s the typical ML experiment workflow: you pull data, preprocess it, train a model, evaluate it, and log everything to MLflow or some other experiment tracking platform. Your experiment record captures the model architecture, training parameters, metrics, and the model artifact itself. If someone asks how a model was built, you can answer most of their questions from the MLflow record.
But there’s a gap. MLflow knows that you trained on a dataset. It doesn’t know which version of that dataset. If the underlying data has changed – new samples added, bad samples removed, preprocessing updated – there’s no way to get back to the exact state of the data at training time. The experiment record points to a location, not a snapshot.
Git doesn’t solve this either. Git is built for code: text files, relatively small, diffable. Training datasets are often hundreds of gigabytes or terabytes of binary objects in cloud storage. You can’t check a million retinal scans into a Git repo.
This is the problem data version control solves. lakeFS applies Git-like operations – branches, commits, tags – to object storage, using metadata operations rather than data copies. A commit in lakeFS is an immutable snapshot of your dataset at a point in time. A branch is a zero-copy clone that lets you modify data without touching the original. And it all works at petabyte scale because you’re versioning metadata, not duplicating files.
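To make the “metadata, not copies” idea concrete, here is a deliberately simplified toy sketch. This is my own illustration of the concept, not how lakeFS is implemented internally: a commit is an immutable mapping from file paths to content addresses, and a branch is just a mutable pointer to a commit, so branching copies a pointer, never data.

```python
# Toy illustration of metadata-pointer versioning (NOT lakeFS internals):
# commits are frozen path -> content-address manifests; branches are pointers.

class ToyVersionStore:
    def __init__(self):
        self.commits = {}   # commit_id -> {path: content_address}
        self.branches = {}  # branch_name -> commit_id

    def commit(self, branch, manifest):
        commit_id = f"c{len(self.commits)}"
        self.commits[commit_id] = dict(manifest)  # immutable snapshot
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name, source):
        # Zero-copy: only a pointer is created, regardless of dataset size
        self.branches[name] = self.branches[source]

    def read(self, branch):
        return self.commits[self.branches[branch]]

store = ToyVersionStore()
store.commit("main", {"scan_001.png": "addr-aaa", "scan_002.png": "addr-bbb"})
store.create_branch("experiment", source="main")

# Modifying the experiment branch never touches main's snapshot
manifest = dict(store.read("experiment"))
manifest["scan_aug_001.png"] = "addr-ccc"
store.commit("experiment", manifest)

assert "scan_aug_001.png" not in store.read("main")
assert "scan_aug_001.png" in store.read("experiment")
```

The real system does far more (scalable metadata storage, merges, conflict detection), but the shape of the guarantee is the same: a commit ID always resolves to the same manifest.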
What that means in practice is you can close the loop in your experiment tracking: your MLflow record points to a specific lakeFS commit, and that commit is a reproducible snapshot of the exact training data that was used.
For anyone who wants to follow along and replicate this approach in your own environment, check out the Quick Start guide in our documentation. For the steps I outline below, I’m using the lakeFS high-level Python SDK (pip install lakefs), which provides a clean interface for branching, committing, and tagging operations. For MLflow, we use the standard mlflow library. Use this code to set up the Python SDK:
```python
import lakefs
from lakefs.client import Client
import mlflow
import json

# Connect to your lakeFS instance
# If lakectl is configured, you can skip the Client() call entirely —
# the SDK will auto-discover credentials from ~/.lakectl.yaml or environment variables.
client = Client(
    host="http://localhost:8000",  # your lakeFS endpoint
    username="your_access_key_id",
    password="your_secret_access_key",
)

REPO_NAME = "YOUR_REPO_NAME"
LAKEFS_ENDPOINT = "http://localhost:8000"  # also used for UI links in MLflow tags

repo = lakefs.Repository(REPO_NAME, client=client)
```

Branching: experiment with data without breaking anything
One pattern we see constantly with large enterprises is defensive copying. A team wants to experiment with different data preparation strategies, maybe cleaning data, format conversion, or otherwise preparing data for ML, but they’re working against a production dataset that other teams depend on. So they copy the entire dataset to a separate location, do their work there, and now there are two (or five, or twenty) copies of the data floating around, each slightly different, with no clear lineage or version history.
Applying Git operations to data helps mitigate this. With lakeFS, instead of copying data, you create a zero-copy branch. In the retinal imaging example, the workflow looks like this: we have a main branch that represents our production training data – in this case, 1,000 retinal scan images. When we want to try augmenting this data, we create an experiment branch:
```python
# Create a branch for our augmentation experiment
# This is a zero-copy operation — no data is duplicated
branch = repo.branch("experiment-augmentation").create(
    source_reference="dataset-v1.0-baseline"  # branch from a tag, not main
)
print(f"Branch created: {branch.id}")
```

Two things to note here. First, this is a zero-copy operation. Whether the dataset is a thousand images or a million, the branch is created in milliseconds because we’re only creating metadata pointers. Second, we’re branching from a tag (dataset-v1.0-baseline), not from the current state of main. This means our experiment starts from a known, fixed baseline, regardless of what other changes have been merged into main since then. If we run multiple experiments in parallel, they all share the same starting point.
On this branch, we can add augmented images, delete samples, restructure directories – whatever the experiment requires. None of it affects main. If the experiment doesn’t work out, we just abandon the branch. If it does work out, we merge it back in a controlled way. The same model you’re used to from Git feature branches, applied to your training data.
Snapshots: immutable records of your training data
Every time you commit to a lakeFS branch, you create an immutable snapshot of the dataset at that point in time. This is the piece that makes reproducibility actually work – not “we think this is probably the data we used” but “here is the exact, byte-for-byte state of every object in the dataset when we trained this model.”
In the retinal imaging example, after augmenting our training data on the experiment branch, we commit with metadata that describes what we did:
```python
# Commit the augmented data with descriptive metadata
branch = repo.branch("experiment-augmentation")
ref = branch.commit(
    message="Add 300 augmented training images",
    metadata={
        "augmentation_count": "300",
        "augmentation_params": json.dumps({
            "rotation_range": 15,
            "width_shift_range": 0.1,
            "horizontal_flip": True,
            "zoom_range": 0.15,
        }),
        "original_dataset_size": "1000",
        "total_dataset_size": "1300",
    },
)
data_commit_id = ref.get_commit().id
```

That commit ID is now a permanent, immutable reference. Even if the object store changes a thousand times after this point, anyone with that commit ID can reconstruct the dataset exactly as it existed at training time. The data itself hasn’t been copied or locked – it’s the metadata snapshot that provides the guarantee.
We also tag significant snapshots with human-readable names so they’re easy to find and reference later:
```python
# Tag this snapshot for easy reference
repo.tag("dataset-v1.1-augmented").create(source_ref=data_commit_id)
```

Closing the loop: linking lakeFS to your experiment tracker
This is where data version control integrates with the workflow you already have. The idea is simple: when you log an experiment to MLflow, include the lakeFS commit ID. When you commit data changes to lakeFS, include the MLflow run ID. Two small additions to your existing logging, and suddenly you have complete bidirectional traceability between models and data.
The forward link – MLflow pointing to lakeFS – is just a few tags added to your MLflow run. You’d add these alongside your existing metric and parameter logging:
```python
mlflow.set_tracking_uri("http://localhost:5001")  # your MLflow instance
mlflow.set_experiment("my-experiment")

# Get the current commit ID for the branch we trained on
branch = repo.branch("experiment-augmentation")
data_commit_id = branch.get_commit().id

with mlflow.start_run(run_name="experiment-augmentation") as run:
    # === Your existing MLflow logging ===
    mlflow.log_param("training.epochs", 5)
    mlflow.log_param("training.learning_rate", 0.0005)
    mlflow.log_metric("accuracy", float(metrics["accuracy"]))
    mlflow.log_metric("sensitivity", float(metrics["sensitivity"]))
    # ... log model artifact, etc.

    # === Add lakeFS data version tags (the forward link) ===
    mlflow.set_tag("lakefs.repo", REPO_NAME)
    mlflow.set_tag("lakefs.branch", "experiment-augmentation")
    mlflow.set_tag("lakefs.commit_id", data_commit_id)
    mlflow.set_tag("lakefs.commit_url",
                   f"{LAKEFS_ENDPOINT}/repositories/{REPO_NAME}/commits/{data_commit_id}")

run_id = run.info.run_id
experiment_id = run.info.experiment_id
```

The backward link – lakeFS pointing to MLflow – goes in a commit to the lakeFS branch after you’ve logged the experiment:
```python
# Commit model metadata to lakeFS with MLflow linkage (the backward link)
branch = repo.branch("experiment-augmentation")
branch.commit(
    message=f"Model documentation - links to MLflow run {run_id[:12]}",
    metadata={
        "mlflow_run_id": run_id,
        "mlflow_experiment_id": experiment_id,
        "mlflow_run_url": f"http://localhost:5001/#/experiments/{experiment_id}/runs/{run_id}",
        "model_name": "augmentation-mobilenet-v1.1",
        "sensitivity": str(metrics["sensitivity"]),
        "training_data_tag": "dataset-v1.1-augmented",
    },
    allow_empty=True,
)
```

That’s it. The rest of your MLflow workflow – logging metrics, parameters, model artifacts, registering models – stays exactly the same. You’re adding a handful of tags and metadata fields, and in return you get something that was previously very hard to achieve: from any model in your registry, you can trace back to the exact training data, and from any dataset snapshot, you can see every experiment that was run against it.
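If you adopt this pattern across many training scripts, it can help to centralize the forward-link tags in one place. The helper below is hypothetical, not part of the lakeFS SDK; it just builds the same tag dictionary used in the examples in this post:

```python
# Hypothetical convenience helper (not part of the lakeFS SDK): build the
# forward-link tags that point an MLflow run at an exact lakeFS data snapshot.

def lakefs_run_tags(endpoint: str, repo: str, branch: str, commit_id: str) -> dict:
    """Tags to attach to an MLflow run for data-version traceability."""
    return {
        "lakefs.repo": repo,
        "lakefs.branch": branch,
        "lakefs.commit_id": commit_id,
        "lakefs.commit_url": f"{endpoint}/repositories/{repo}/commits/{commit_id}",
    }

# Example values for illustration only
tags = lakefs_run_tags("http://localhost:8000", "retina", "experiment-augmentation", "abc123")
```

Inside your mlflow.start_run() block you could then call mlflow.set_tags(tags) once instead of four separate mlflow.set_tag calls.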
Here’s what this looks like in practice. In the MLflow UI, the lakeFS tags appear alongside your other experiment metadata – including a clickable link to the exact data commit:

And in lakeFS, the commit metadata shows which MLflow experiment used this data snapshot, with a direct link back to the run:

This bidirectional linkage means you can start from either side and get the full picture.
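Because the two links live in two different systems, it’s worth verifying them periodically, for example in CI or an audit script. Here’s a minimal sketch, assuming the tag and metadata names used in this post; audit_link is a hypothetical helper, not part of either library:

```python
# Hypothetical audit check: given the tags on an MLflow run and the metadata
# of the lakeFS model-documentation commit, confirm both directions of the
# link are present and consistent.

def audit_link(mlflow_tags: dict, doc_commit_metadata: dict, run_id: str) -> list:
    """Return a list of problems; an empty list means the link is intact."""
    problems = []
    if "lakefs.commit_id" not in mlflow_tags:
        problems.append("MLflow run is missing the lakefs.commit_id tag")
    if doc_commit_metadata.get("mlflow_run_id") != run_id:
        problems.append("lakeFS documentation commit does not reference this run")
    return problems

# Intact link: forward tag present, backward metadata matches the run
problems = audit_link(
    mlflow_tags={"lakefs.commit_id": "abc123"},
    doc_commit_metadata={"mlflow_run_id": "run-42"},
    run_id="run-42",
)
```

In a real pipeline you would fetch mlflow_tags from the MLflow client and doc_commit_metadata from the lakeFS commit, then fail the check if the returned list is non-empty.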
Metadata for compliance
In regulated industries – healthcare, finance, defense, automotive – reproducibility isn’t just good practice. It’s a legal requirement. The FDA regulates AI-based medical devices. The EU has the Medical Device Regulation and the AI Act. Financial regulators require model governance and audit trails. All of them expect you to document how models were built, what data was used, and to demonstrate that results can be reproduced.
Both MLflow parameters and lakeFS commit metadata give you natural places to capture compliance-relevant information as part of your normal workflow. In the retinal imaging example, I show how one might log clinical and regulatory metadata alongside the training parameters.
```python
with mlflow.start_run(run_name="experiment-augmentation") as run:
    # ... lakeFS tags and training metrics as shown above ...

    # Compliance metadata — logged once, available for auditing forever
    mlflow.log_param("clinical.data_source", "APTOS Retinal Imaging Study")
    mlflow.log_param("clinical.imaging_equipment", "Topcon NW400")
    mlflow.log_param("clinical.population", "Aravind Eye Hospital patients, 2019")
    mlflow.log_param("fda.device_class", "Class II")
    mlflow.log_param("fda.target_sensitivity", 0.87)
    mlflow.log_param("fda.intended_use", "Screening aid for diabetic retinopathy")
```

This is a small amount of extra work, and it follows the same logging pattern you’re already using. The difference is that now, when you need to compile a regulatory submission or respond to an audit, the information is structured, versioned, and linked to the exact data and model artifacts. You’re not doing forensic data archaeology six months after the fact.
In the walkthrough, we actually generate a structured regulatory submission report that pulls model architecture details, training parameters, clinical metrics, and data provenance directly from MLflow and lakeFS. The documentation becomes a byproduct of doing the work, not a separate effort.
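As a rough sketch of that report-assembly idea (the helper and field names below are illustrative, not the actual walkthrough code), the report can be a pure function of the records already stored in MLflow and lakeFS:

```python
import json

# Illustrative sketch: assemble a structured submission report from the
# MLflow run record (params, tags) and the lakeFS commit metadata.
def build_submission_report(run_params: dict, run_tags: dict, commit_metadata: dict) -> str:
    report = {
        "model": {
            "intended_use": run_params.get("fda.intended_use"),
            "device_class": run_params.get("fda.device_class"),
        },
        "training": {
            "epochs": run_params.get("training.epochs"),
            "learning_rate": run_params.get("training.learning_rate"),
        },
        "data_provenance": {
            "source": run_params.get("clinical.data_source"),
            "lakefs_repo": run_tags.get("lakefs.repo"),
            "lakefs_commit": run_tags.get("lakefs.commit_id"),
            "dataset_tag": commit_metadata.get("training_data_tag"),
        },
    }
    return json.dumps(report, indent=2)

# Example inputs for illustration only
report = build_submission_report(
    run_params={"fda.intended_use": "Screening aid for diabetic retinopathy",
                "training.epochs": 5},
    run_tags={"lakefs.repo": "retina", "lakefs.commit_id": "abc123"},
    commit_metadata={"training_data_tag": "dataset-v1.1-augmented"},
)
```

In practice you would pull run_params and run_tags from the MLflow client and commit_metadata from the lakeFS SDK, so the report always reflects the systems of record rather than someone’s memory.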
The reproducibility scenario
Now let’s describe a scenario that illustrates the benefits of this workflow. Six months after training, a new team member needs to understand and improve the production model. Or a regulator asks you to demonstrate how the model was built. Without data version control, here’s what that looks like:
The experiment record in MLflow tells you the model architecture, training parameters, and performance metrics. It might tell you the data came from a particular S3 path. But that path now contains different data – updated, modified, reorganized. Reproducing the model means hoping someone documented the data state somewhere, or trying to reconstruct it from logs. In practice, teams tell us this kind of data archaeology can consume five to ten hours a week of their ML practitioners’ time.
With the lakeFS integration, the path is straightforward:
- Find the production model in MLflow’s model registry
- Read the lakefs.commit_id tag to get the exact data snapshot
- Use that commit to check out the dataset in the exact state it was in at training time – even if the underlying object store has changed hundreds of times since then
- Retrain with the same parameters (also logged in MLflow) and verify you get equivalent results
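The middle steps above can be sketched in a few lines: given the tags on the production run, build the lakeFS address of the exact snapshot. The lakefs://&lt;repo&gt;/&lt;ref&gt;/&lt;path&gt; addressing scheme is standard lakeFS; the helper function itself is my own illustration:

```python
# Hypothetical helper: turn an MLflow run's lakeFS tags into the address of
# the exact training snapshot, using the lakefs://<repo>/<ref>/<path> scheme.

def training_data_uri(run_tags: dict, path: str = "") -> str:
    repo = run_tags["lakefs.repo"]
    commit_id = run_tags["lakefs.commit_id"]
    return f"lakefs://{repo}/{commit_id}/{path}"

# Example tag values for illustration only
uri = training_data_uri(
    {"lakefs.repo": "retina", "lakefs.commit_id": "abc123"},
    path="train/",
)
```

You can then point your data-loading code, lakectl, or the lakeFS SDK at that address to read the dataset exactly as it was at training time.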
Going the other direction works too. Starting from a data snapshot in lakeFS, the commit metadata tells you which MLflow experiments used this data, with direct links to the experiment runs. You can see performance metrics, model versions, and the full experiment history – all from the data side.
This is what a complete audit trail looks like. Code versioned in Git. Experiments tracked in MLflow. Training data versioned in lakeFS. Each one linked to the others. No gaps.
Fitting this into your workflow
If you’re already using MLflow and Git, adding lakeFS to your workflow is a smaller change than it might seem. The core additions are:
Before training: create a lakeFS branch for your experiment (or use main for your baseline), and commit your training data with descriptive metadata. Tag significant snapshots.
During logging: add lakeFS commit information as tags on your MLflow run. This is four or five lines of code.
After training: commit model documentation and metadata back to the lakeFS branch, including the MLflow run ID. Merge winning experiments to main when you’re ready to promote.
The rest of your workflow – data preprocessing, model architecture, training loops, evaluation, model registration – stays exactly as it is. You’re not replacing anything. You’re closing the gap in what was already there.
If you want to follow along step by step, the retinal imaging example in this post is based on my PyData Global 2025 presentation, which walks through the full end-to-end workflow live, including training three models with different data strategies, comparing results, and promoting a winner to production. Complete end-to-end notebook examples are also available in the lakeFS samples repository, alongside integration examples for dozens of other tools in the data and AI ecosystem.
And if you’re building models in an environment where reproducibility and compliance aren’t optional – healthcare, finance, defense, automotive, government – I’d genuinely love to hear what you’re working on and what challenges you’re running into. Reach out anytime.



