Reproducibility is a fundamental challenge in building reliable machine learning (ML) models and AI applications.
It’s not just about debugging a model when it fails in production; it’s also about ensuring that experiments are consistent, avoiding unintended variance, and making incremental progress with confidence.
Without reproducibility, ML teams risk wasting time on unreliable results and struggling to trace back issues in model behavior.
The Three Pillars of ML Reproducibility
To build a reproducible ML experiment, three key components must remain stable across different runs:
- Input Data – The dataset used for training and evaluation.
- Code and Parameters – The logic that defines the model, including hyperparameters and configurations.
- Execution Environment – The dependencies, libraries, and compute infrastructure that run the experiment.
This trio forms the Holy Trinity of ML Reproducibility. If any of these components change unexpectedly, the results of an experiment can differ, making it difficult to iterate systematically.
An easy way to understand this is through the concept of pure functions in programming. Given the same input, a pure function will always return the same output. Ideally, ML experiments should follow the same principle—ensuring that a given experiment is always repeatable, regardless of when or where it is executed.
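To make the analogy concrete, here is a minimal sketch of a pure function (the function name and values are illustrative, not from any specific library):

```python
def scale_features(values, factor=0.5):
    """A pure function: its output depends only on its inputs.

    No global state, no I/O, no randomness - calling it twice with
    the same arguments always produces the same result.
    """
    return [v * factor for v in values]

# Same input, same output - no matter when or where it runs
assert scale_features([1.0, 2.0]) == [0.5, 1.0]
```

An ML experiment that behaves like this function, given fixed data, code, and environment, is reproducible by construction.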
How Git Solves Reproducibility for Code and Environment
Version control systems like Git have revolutionized how we manage reproducibility for code and execution environments:
Code and Parameters:
Git enables tracking every change to the model code, hyperparameters, and configurations, ensuring that an experiment can be reproduced exactly as it was executed. In many cases, this also means explicitly stating any variance in execution as part of the code base. A common example is setting an explicit seed for random number generation when sampling, or pinning a model's temperature to a fixed, versioned value.
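As a sketch of the seeding pattern (the seed value and function name here are hypothetical), committing the seed alongside the code makes the "random" sampling step repeatable:

```python
import random

SEED = 42  # committed with the code, so the sampling below is repeatable

def sample_rows(rows, k, seed=SEED):
    """Draw k rows deterministically for a given seed."""
    rng = random.Random(seed)  # a local RNG avoids hidden global state
    return rng.sample(rows, k)

# Two runs with the same seed select exactly the same rows
assert sample_rows(list(range(100)), 5) == sample_rows(list(range(100)), 5)
```

Using a local `random.Random` instance rather than the module-level functions also keeps the experiment insulated from other code that might reseed the global generator.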
Execution Environment:
Infrastructure-as-Code (IaC) tools like Docker, Conda, or Kubernetes allow teams to define and version their compute environments, preventing inconsistencies due to dependency changes. The environment is every layer of the stack on which our code executes: from the OS version, to the versions of the dependencies used in the code, to the specific GPU make and model. All of these can make a difference and should be tracked.
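Beyond lockfiles and container images, it can help to record an environment fingerprint with each run. A minimal sketch, assuming we only care about the interpreter and OS layers (real setups would also capture dependency and GPU driver versions):

```python
import platform
import sys

def environment_fingerprint():
    """Collect basic execution-environment details so they can be
    logged or versioned alongside the experiment's code and results."""
    return {
        "python": sys.version.split()[0],   # e.g. "3.12.1"
        "os": platform.platform(),          # OS name and version
        "machine": platform.machine(),      # CPU architecture
    }

fingerprint = environment_fingerprint()
```

Storing this dictionary with each experiment's artifacts makes it possible to spot environment drift between runs that were supposed to be identical.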
While code and environment have standardized solutions and can be made reproducible through good code hygiene and source control, one critical piece remains challenging to version effectively: data.
The Missing Piece: Data Versioning
Unlike code and environments, data is highly mutable by nature. It can change between runs due to various reasons:
| Reason | Description |
|---|---|
| Collaboration | Other team members may update the dataset, add new data, or modify existing records |
| External Dependencies | If your model relies on third-party data sources, updates to those datasets can introduce unintended variance |
| Data Management Policies | Data can be rewritten, deleted, or partitioned differently, affecting downstream ML experiments |
A naive approach to ensuring data reproducibility might be to copy the dataset for each experiment, but this is impractical due to cost, privacy, and governance concerns, not to mention that it is a time-consuming and error-prone process. Duplicating large datasets is expensive in storage terms, and it can introduce compliance risks when dealing with regulated data.
Enter lakeFS: Scalable Data Version Control
lakeFS solves the challenge of data versioning by applying Git-like principles to object storage. Instead of duplicating data, lakeFS maintains pointers to objects, allowing teams to:
- Create immutable snapshots of data at any point in time.
- Track changes to datasets over multiple experiments.
- Ensure that the input data for an experiment remains unchanged, even if the underlying dataset evolves.
Example: Reproducible Data Access with lakeFS
The following example demonstrates how to read data from a lakeFS repository using Pandas, ensuring reproducibility by using both the latest dataset and a stable reference:
```python
# pip install pandas lakefs-spec
import pandas as pd

pd.read_parquet('lakefs://my-datalake/main/datasets/my_data.parquet')
```

Now that we can read from our lakeFS repository, we want to create our own isolated branch: reading data directly from the main branch means someone might change it at a later date, rendering our experiment non-reproducible!
Let’s create a branch and read from a snapshot of the data that we control:
```python
import lakefs
import pandas as pd

# create a separate branch for this experiment - this is a zero-copy operation!
lakefs.repository('my-datalake').branch('jane-experiment-13').create(source_reference='main')

# read from our isolated branch
pd.read_parquet('lakefs://my-datalake/jane-experiment-13/datasets/my_data.parquet')
```

By using our own branch, we ensure that our datasets remain unchanged across runs, even if the main branch gets updated.
Going further, we might want to tag this specific version: by doing so, lakeFS will ensure this input data cannot change as tags are immutable:
```python
import lakefs
import pandas as pd

lakefs.repository('my-datalake').tag('model-run-exp-13').create(source_reference='jane-experiment-13')

# read from our immutable tag
pd.read_parquet('lakefs://my-datalake/model-run-exp-13/datasets/my_data.parquet')
```

Bonus: lakeFS Mount for Seamless Reproducibility
We can simplify data loading even further, while still ensuring reproducibility, by using lakeFS Mount, which lets us read from lakeFS just like a local filesystem. This provides the benefits of data versioning without requiring code modifications.
Simply mount a lakeFS branch or commit as a file system and access data with existing tools:
```shell
$ cd my_git_repo/
$ everest mount lakefs://my-datalake/jane-experiment-13/datasets/ ./my_dir
$ cat ./my_dir/my_data.csv  # mount will lazily fetch data as needed
$ git commit -am 'added data from my data lake'
$ git show
commit ab8ab8cca9b77a3b228985ec2dbdb105a828498a (HEAD -> main)
Author: Jane Doe <jane.doe@example.com>
Date:   Thu Feb 13 15:41:01 2025 +0200

    added data from my data lake

new file mode 100644
+++ b/my_dir/.everest/source
+lakefs://my-datalake/a2c284fb/datasets/
```

As you can see, with lakeFS Mount, reproducibility is built in: by default, any mounted directory that lives inside a Git repository automatically adds its source reference to Git's tracking. This avoids adding the data itself to Git, which would be prohibitively expensive, while still allowing different versions of the code to use different versions of the data.
Conclusion
ML reproducibility requires a disciplined approach to managing input data, code, and execution environments. While Git provides robust solutions for code and infrastructure, data versioning remains the missing piece in many ML pipelines. With lakeFS, teams can version data efficiently without incurring high storage costs, ensuring that experiments remain stable, repeatable, and trustworthy. Whether using SDKs or lakeFS Mount, lakeFS empowers ML teams with the tools they need to iterate faster and deploy models with confidence.


