ML Data Version Control and Reproducibility at Scale

In machine learning (ML), data is the foundation on which successful models are built. However, as ML projects grow to include larger and more complex datasets, managing and controlling that data efficiently at scale becomes harder.

Breaking Down Conventional Approaches: The Copy/Paste Predicament

In data science, it’s common for data scientists to extract subsets of data to their local environments for model training. This approach allows iterative experimentation, but it introduces challenges that slow ML projects down as they grow:

Reproducibility Constraints: Copying and modifying data locally provides none of the version control and auditability that reproducibility depends on. Without them, iterating on models with different data subsets becomes hard to track and repeat.

Inefficient Data Transfer: Regularly moving data between the central repository and local environments costs time and resources, especially when a different subset of data is chosen for each training run.

Limited Compute Power: Working in a local environment makes it hard to use parallel computing at full scale or to take advantage of distributed systems such as Apache Spark.


In this session, we will demonstrate:

  • How to use lakeFS to version control your data when working with it locally.
  • How to use lakeFS without copying data locally, and train your model at scale directly in the cloud. We will use the following technology stack:
    • AWS S3
    • Databricks Delta Lake
    • PyTorch
    • MLflow
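
To picture the second approach: lakeFS addresses objects by repository, ref (a branch, tag, or commit ID), and path, so a training job can point at a versioned location instead of a local copy. The sketch below is not taken from the session. `DataRef`, the repository name, and the paths are hypothetical, and the code only builds address strings, assuming a lakeFS S3-compatible endpoint is configured elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRef:
    """Hypothetical helper: a dataset location pinned to a lakeFS ref."""

    repository: str  # lakeFS repository name (assumed for illustration)
    ref: str         # branch name, tag, or commit ID
    path: str        # object path inside the repository

    def _key(self) -> str:
        # Drop a leading slash so the pieces join cleanly.
        return self.path.lstrip("/")

    def lakefs_uri(self) -> str:
        # lakeFS URI form: lakefs://<repository>/<ref>/<path>
        return f"lakefs://{self.repository}/{self.ref}/{self._key()}"

    def s3a_uri(self) -> str:
        # Via the S3-compatible endpoint, the repository takes the bucket
        # position and the ref is the first component of the object key.
        return f"s3a://{self.repository}/{self.ref}/{self._key()}"


# Same path, two refs: an experiment branch and the main branch.
experiment = DataRef("example-repo", "experiment-1", "/datasets/train/")
baseline = DataRef("example-repo", "main", "/datasets/train/")

print(experiment.s3a_uri())
print(baseline.lakefs_uri())
```

Recording the ref alongside each training run (for example, as a run tag in an experiment tracker) is what ties a model back to the exact data it was trained on.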


Amit Kesarwani

Director Solution Engineering, lakeFS

Iddo Avneri

VP Customer Success, lakeFS
