ML Data Version Control and Reproducibility at Scale

Amit Kesarwani

Last updated on July 16, 2025

Introduction

In the ever-evolving landscape of machine learning (ML), data is the cornerstone on which successful models are built. However, as ML projects grow to encompass larger and more complex datasets, efficiently managing and controlling data at scale becomes harder.

Below are the conventional approaches data scientists commonly use, along with the constraints of each:

The Copy/Paste Predicament: In the world of data science, it’s commonplace for data scientists to extract subsets of data to their local environments for model training. This method allows for iterative experimentation, but it introduces challenges that hinder the seamless evolution of ML projects.

Reproducibility Constraints: Traditional practices of copying and modifying data locally lack the version control and auditability crucial for reproducibility. Iterating on models with various data subsets becomes a daunting task.

Inefficient Data Transfer: Regularly shuttling data between the central repository and local environments strains resources and time, especially when choosing different subsets of data for each training run.

Limited Compute Power: Working in a local environment makes it hard to take advantage of parallel computing and of distributed systems like Apache Spark.

In this guide, we will explain how to tackle these challenges by using data version control. We will demonstrate:

How to use Git and lakeFS, both open-source tools, to version control your code and data when working locally.
How to use lakeFS to train your model at scale in a distributed computing environment without copying data. lakeFS takes a zero-copy approach, in which different versions of your ML datasets and models are managed efficiently without duplicating the data.
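
To make the zero-copy idea concrete, here is a minimal conceptual sketch in plain Python. It is not the lakeFS implementation or its API: `ToyRepo` and its methods are hypothetical names used only for illustration. The assumption it illustrates is that a branch can be represented as a mapping from paths to stored objects, so creating a branch copies only that mapping and never the underlying data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model: objects are stored once; branches map paths to object IDs."""
    objects: dict = field(default_factory=dict)   # object_id -> bytes (stored once)
    branches: dict = field(default_factory=dict)  # branch -> {path: object_id}

    def put(self, branch: str, path: str, data: bytes) -> None:
        # Every write creates a new object and repoints the path on this branch.
        object_id = f"obj-{len(self.objects)}"
        self.objects[object_id] = data
        self.branches.setdefault(branch, {})[path] = object_id

    def create_branch(self, name: str, source: str) -> None:
        # Only the path -> object_id mapping is copied, never the data itself.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch][path]]


repo = ToyRepo()
repo.put("main", "images/cat.png", b"raw-bytes")
repo.create_branch("experiment", source="main")
repo.put("experiment", "images/cat.png", b"resized-bytes")

print(len(repo.objects))                    # 2 objects stored in total
print(repo.get("main", "images/cat.png"))   # main still sees b'raw-bytes'
```

In this toy model, changing a file on the experiment branch adds one new object and leaves the main branch untouched, which is the property that makes branching cheap.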

We will also delve into parallel ML, that is, running experiments in parallel with different parameters, and explore how lakeFS and Databricks together can speed up your ML experiments and streamline your ML data pipeline. Databricks provides a distributed computing environment for Spark, so you can run parallel ML without compromising on performance or scalability. We will illustrate how to scale parallel ML from a small local instance to a full GPU/CPU cluster. Databricks is not a requirement; you can run Spark in whichever way you prefer.
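
As a rough illustration of what "running experiments in parallel with different parameters" means, the sketch below fans a small parameter grid out over a local thread pool. It is a simplified stand-in, not the demo's code: it uses a plain grid and the standard library rather than a tuning library or a Spark cluster, the parameter values are illustrative, and `run_trial` returns a fake number instead of training a model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space; the values are illustrative only.
ARCHITECTURES = ["Unet", "FPN"]
ENCODERS = ["resnet18", "resnet34"]
LEARNING_RATES = [1e-3, 1e-4]


def run_trial(params: dict) -> dict:
    """Stand-in for one training run; returns a fake, deterministic loss."""
    # A real trial would train a model here and return a validation metric.
    fake_loss = params["learning_rate"] * 10
    return {"params": params, "loss": fake_loss}


def run_experiments(max_workers: int = 4) -> list:
    """Run one trial per parameter combination, several at a time."""
    grid = [
        {"architecture": a, "encoder": e, "learning_rate": lr}
        for a, e, lr in product(ARCHITECTURES, ENCODERS, LEARNING_RATES)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the grid in the returned results.
        return list(pool.map(run_trial, grid))


results = run_experiments()
best = min(results, key=lambda r: r["loss"])
print(len(results), best["params"])
```

Each trial is independent of the others, which is why the same pattern can move from a few local workers to many workers on a cluster.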

We will also integrate with MLflow to provide full experiment tracking and model logging.

Throughout this article, we will provide a step-by-step guide, accompanied by a Jupyter notebook, to demonstrate how these tools work together to enhance your ML workflows. Whether you’re a data scientist, ML engineer, or AI enthusiast, this blog will equip you with the knowledge and tools needed to leverage parallel ML effectively and accelerate your model development process.

Target Architecture

We will store our data using the open-source Linux Foundation project Delta Lake. Under the hood, Delta Lake stores the raw data in the Parquet format.

For the Deep Learning model, we rely on segmentation_models.pytorch, a library built on PyTorch. PyTorch Lightning is a great way to simplify your PyTorch code and bootstrap your Deep Learning workloads.

Our data will be stored as a Delta table and made available as a Spark dataframe. However, PyTorch expects data in its own dataset format, so we need a library that creates a PyTorch-compatible dataset and manages caching from blob storage to local SSD. For that, we will use Petastorm, which takes on the data loading duties and provides the interface between the Lakehouse and the Deep Learning model.

MLflow will provide experiment tracking, log our experiment metrics automatically, and let us save the model to our model registry.
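
The hand-off from Spark to PyTorch can be sketched roughly as follows, using Petastorm's Spark dataset converter. This is a minimal sketch under assumptions, not the demo's actual code: the helper names `cache_dir_url` and `make_torch_loader`, the cache directory, and the batch size are all hypothetical, and `spark` and `df` are assumed to be an existing Spark session and dataframe.

```python
from pathlib import Path


def cache_dir_url(local_dir: str) -> str:
    """Turn a local directory into the file:// URL form used for the cache setting."""
    return Path(local_dir).absolute().as_uri()


def make_torch_loader(spark, df, local_cache_dir: str, batch_size: int = 32):
    """Wrap a Spark dataframe so it can be iterated as PyTorch batches."""
    # Imported lazily so this module loads even where Petastorm is not installed.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Tell Petastorm where to materialize its intermediate Parquet cache.
    spark.conf.set(
        SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
        cache_dir_url(local_cache_dir),
    )
    converter = make_spark_converter(df)
    # The returned object is a context manager that yields a data loader.
    return converter.make_torch_dataloader(batch_size=batch_size)
```

Assuming a dataframe named `gold_df`, usage would look like `with make_torch_loader(spark, gold_df, "/tmp/petastorm_cache") as loader:` followed by a loop over `loader` inside the training step.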

Demo Notebook

You will run the demo notebook either in a local Docker container or on a Databricks cluster. The full process consists of the following steps:

Create a lakeFS and Git repository.
Create an Experiment branch.
Import selected images from object storage (S3 in this case) to the Experiment branch. This will be a zero-copy operation and will take a few seconds.
If you run the demo locally, you will clone the Experiment branch, which downloads the images to your machine. A smaller dataset is recommended for local runs.
The demo builds a data pipeline following the Medallion Architecture:
1. Convert raw images and image masks to Delta table format and save as Bronze dataset.
2. Resize images, transform image masks into images and save as Silver dataset.
3. Join images and image masks. Save it as a Gold dataset.
4. You will commit all these datasets in the lakeFS repository for version control and data lineage purposes.
Split Gold dataset into train and test datasets.
Prepare the dataset in PyTorch format by using Petastorm.
Train the base model. When running locally, you train the model once with a particular set of parameters (architecture, encoder and learning rate). When running on the distributed cluster, you tune the hyperparameters with Hyperopt.
Save the best model to the MLflow registry and save the best model information in the lakeFS repository.
Flag the best model version as production-ready in MLflow.
Save the demo notebooks (code) to the Git repo. Git will not add the local images to the staging area, but it will add the “.lakefs_ref.yaml” file, which includes lakeFS commit information.
Optionally, you can merge the best model information and datasets into the main production branch in lakeFS; this step is not part of the demo notebook.
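
The Bronze, Silver and Gold stages and the train/test split described above can be sketched in plain Python as follows. This is a toy illustration only: the real demo works on Delta tables with Spark, whereas here images and masks are small lists, "resizing" is simple truncation, and all function names are hypothetical.

```python
import random

# Bronze: raw records as ingested (toy stand-ins for image and mask files).
bronze_images = [{"id": i, "pixels": [i] * 8} for i in range(10)]
bronze_masks = [{"id": i, "mask": [i % 2] * 8} for i in range(10)]


def to_silver(records: list, key: str, size: int = 4) -> list:
    """Silver: cleaned records; 'resizing' here is just truncating a list."""
    return [{"id": r["id"], key: r[key][:size]} for r in records]


def to_gold(images: list, masks: list) -> list:
    """Gold: one row per image, joined with its mask on the shared id."""
    masks_by_id = {m["id"]: m["mask"] for m in masks}
    return [
        {"id": img["id"], "pixels": img["pixels"], "mask": masks_by_id[img["id"]]}
        for img in images
        if img["id"] in masks_by_id
    ]


def split_train_test(rows: list, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle with a fixed seed so the split is reproducible across runs."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


silver_images = to_silver(bronze_images, "pixels")
silver_masks = to_silver(bronze_masks, "mask")
gold = to_gold(silver_images, silver_masks)
train, test = split_train_test(gold)
print(len(gold), len(train), len(test))  # 10 8 2
```

Fixing the shuffle seed is one small piece of reproducibility; committing each stage's output to version control, as the steps above describe, is what lets you trace a model back to the exact data it was trained on.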

Demo Prerequisites

Docker installed on your local machine
This demo requires connecting to a lakeFS Server. You can spin up lakeFS Server for free on the lakeFS cloud.
To run the demo on a distributed Spark cluster, you can use a Databricks cluster, but this is not required.

Step 1: Demo Setup

You will use a prepackaged environment (a Docker container) that includes Python, Jupyter notebook, Git, Spark and MLflow. You can read more about the demo in this Git repository.

Clone the repo:
