



Amit Kesarwani

Amit heads the solution architecture group at Treeverse, the company...

Last updated on December 24, 2024

Data versioning is a central aspect of modern data management, especially in the context of GenAI and machine learning. Teams need a solution to version both their data and models. By keeping track of various iterations of datasets and models, they can manage changes smoothly and ensure the reproducibility of results. 

MLflow has become a cornerstone in ML data version control because it solves this and many other problems. 

In this article, we take a deep dive into the data versioning capabilities of MLflow to show you why teams use it for data version control – and how you can complement these capabilities with the open-source data versioning solution lakeFS.

What is Data Versioning in MLflow?

MLflow helps teams track ML experiments – models, model parameters, datasets, and hyperparameters – and then reproduce them as needed. MLflow provides a packaging format for reproducible runs on any platform and distributes models to your preferred deployment tools.

The MLflow Tracking API and UI allow you to record runs, organize them into experiments, and log extra data. The solution also includes various useful components for monitoring operations, such as model training, model storage, model management, model loading to production, and pipeline creation.

Key Components of MLflow Data Versioning

The most important data versioning components of MLflow include:

Component: What It Does
Tracking: Lets you record and compare experiment settings and results
Models: Allows you to manage and deploy models from various ML libraries to different model serving and inference platforms
Projects: Packages machine learning code in a reusable, reproducible format that can be shared with other data scientists or deployed to production
Model Registry: Provides a centralized model repository for managing a model's entire lifecycle, including stage transitions from staging to production, with versioning and annotation capabilities
Model Serving: Hosts MLflow models as REST endpoints

Benefits of Data Versioning in ML Operations

Improved Traceability and Reproducibility 

Versioning ensures that experiments can be replicated and results validated. At the same time, reproducibility encourages trying out new ideas without the worry of losing earlier work, since all versions are saved.

Enhanced Collaboration Across Teams 

Version control enables each team member to work on multiple features or fixes simultaneously, isolating their changes via branches. When a feature is finished, it can be merged back into the main codebase in a controlled manner, avoiding the turmoil of conflicting modifications and ensuring that everyone's contributions are integrated seamlessly.

Furthermore, version control systems maintain a full history of changes, making it simple to determine who made specific changes and why, which is extremely useful for debugging and code reviews. 

Efficient Rollbacks and Issue Resolution 

As the world changes, so does data. If you need to return to or investigate an earlier state of the data, data version control lets you restore it easily from the recorded history.

Maintained Consistency in Production Deployments 

Data versioning allows teams to capture the versions of their data and models in Git commits, providing a way to switch between these different data contents. The end result is a consistent history of data, code, and machine learning models that team members can navigate. This ensures that projects use consistent logical file names and lets you employ diverse storage options for your data and models in any cloud or on-premises solution.

Regulatory Compliance and Auditing 

Data versioning also improves data compliance by allowing teams to audit data changes, which the system records in detail.

Faster Experimentation and Model Iteration 

Extended model management features provide standardized components for each stage of the machine learning lifecycle, making it easier to design ML applications.

Data Integrity and Version Control for Large Datasets

Data version control capabilities can scale to match the size of your datasets.

Data Versioning Techniques in MLflow

MLflow Integration for Experiment Tracking

MLflow’s Tracking feature allows users to document experiments by logging key information such as parameters, metrics, and artifacts. This makes it easy to track the performance of different models, monitor hyperparameters, and store artifacts like trained models and plots, ensuring the entire experimentation process is transparent and organized. 

Additionally, MLflow enhances reproducibility by recording the complete computing environment, including libraries, dependencies, and configurations. This ensures that experiments can be reproduced consistently across different machines or over time, providing reliable results and minimizing the risk of discrepancies caused by environment changes. By combining tracking with reproducibility, MLflow streamlines the end-to-end machine learning workflow.

Model Registry and Version Control

The MLflow Model Registry component consists of a centralized model store, APIs, and a user interface for collaboratively managing an MLflow Model's whole lifecycle. It includes model lineage (the MLflow experiment and run that created the model), model versioning, stage transitions (such as staging to production), and annotations.

You can add an MLflow Model to the Model Registry. A registered model has a unique name and includes versions, transitional phases, model lineage, and other metadata. 

Each registered model may have one or more versions. When a new model is added to the Model Registry, it is assigned “version 1.” Each model version can be allocated a single stage at any given moment. MLflow includes predefined stages for common user scenarios, such as Staging, Production, and Archive. You can move a model version from one stage to another.

Managing Data Pipelines for Versioning

MLflow Pipelines is a system that allows users to quickly create high-quality models and put them into production. Compared to ad-hoc ML workflows, MLflow Pipelines have three important benefits.

Predefined templates for common tasks, such as regression modeling, eliminate the large amount of boilerplate code typically required to curate datasets or train and tune models, letting teams start quickly and focus on building great models.

The intelligent pipeline execution engine speeds up model creation by caching results from each phase of the process and re-running the smallest number of steps as modifications are made.

The modular, Git-integrated pipeline structure makes the move from development to production much easier by making sure that all model code, data, and configurations are simple for ML developers to review and deploy.

Deployment Consistency and Rollbacks

For programmatic interaction, you can register models through the UI or the API. Additionally, you can organize models using aliases and tags, and deploy consistently across environments. 

Deployment consistency is just one advantage. In case anything goes wrong, you can easily roll back to a specific version of your model and navigate all the different versions easily, thanks to consistent naming.

Challenges and Considerations in Data Versioning in MLflow

Scaling Version Control for Large Datasets  

Managing numerous versions of large datasets can be difficult due to increased storage and performance demands.

Automation in Data Versioning  

Unlike code, datasets can be large, dynamic, and difficult to track effectively over time. Automating the tracking of data lineage, changes, and dependencies requires integrating tools that handle not just dataset versions but also metadata, preprocessing steps, and transformations. Moreover, handling version conflicts and scalability for large datasets further complicates automation efforts.

Handling Changes in Data Schemas 

Data schema versioning can be difficult because you must handle dependencies and conflicts among different sources, targets, and pipelines. Furthermore, you must examine how data schema changes may affect existing data and whether transformation or backfilling is necessary. Striking the balance between data schema evolution and stability is also critical, especially when dealing with frequent or complicated modifications that could cause problems in your data warehouse or downstream applications. 
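One practical way to reason about schema evolution is to diff two schema versions before applying a change, then decide whether transformation or backfilling is needed. A pure-Python sketch (the column names and types are hypothetical):

```python
def diff_schemas(old, new):
    """Compare two {column: type} schemas and classify the changes."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c])
               for c in old.keys() & new.keys() if old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"order_id": "int", "amount": "float", "region": "str"}
v2 = {"order_id": "int", "amount": "decimal", "country": "str"}

delta = diff_schemas(v1, v2)
# Added columns may need backfilling; type changes may need transformation.
needs_backfill = bool(delta["added"] or delta["changed"])
```

A diff like this can gate a pipeline run, failing fast when a change would break downstream consumers.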

Tooling and Integration Challenges  

Integrating data versioning with current data pipelines, particularly ETL (Extract, Transform, Load) operations, necessitates careful planning and execution to maintain consistent data flows and version tracking across the data lifecycle.

Real-World Applications of Data Versioning in MLflow

Experiment Management in Production

Experiment tracking is the process of carefully monitoring and maintaining all aspects of machine learning experiments to promote reproducibility, comparability, collaboration, and transparency/traceability. This includes: 

  • Documenting hyperparameters utilized in model optimization 
  • Evaluation metrics acquired from the model
  • Artifacts (such as datasets and graphs) generated during the experiment
  • Code versions used, and more

In this context, an ML experiment is a systematic process of applying several configurations, algorithms, or models to a dataset in order to investigate, analyze, and enhance performance on a particular task. The fundamental purpose of an experiment is to validate hypotheses and refine models using iterative testing and evaluation.

In MLflow, users often divide this process into several phases. First, they define the experiment, specifying its purpose and the parameters to track. As the experiment progresses, they record the parameters, measurements, and artifacts. They then carry out multiple runs with different setups and examine the results to discover the most effective approaches.

Versioning in Deep Learning Projects

MLflow provides a set of features to power your deep learning workflows:

Feature: What It Does
Experiment Tracking: MLflow keeps track of your deep learning experiments, including their parameters, metrics, and models. Experiments are saved on the MLflow server, allowing you to compare and share them
Model Registry: You can save your trained deep learning models to the MLflow server and easily retrieve them later for inference
Model Deployment: After training, use MLflow to serve the trained model as a REST API endpoint, allowing you to easily incorporate it into your application
Native Library Support: MLflow features native integrations with popular deep learning libraries like PyTorch, Keras, and TensorFlow, so you can fold MLflow into your workflow with little effort

Scalable MLflow Implementations

MLflow is designed to work easily with a variety of data settings, ranging from tiny datasets to Big Data applications. Machine learning outcomes frequently rely on solid data sources, so the solution needs to scale effectively to meet variable data requirements.

Here’s how MLflow handles scalability across dimensions:

Distributed Execution  

MLflow runs can be performed on distributed clusters. For example, integration with Apache Spark enables distributed processing. Furthermore, runs can be started on your preferred distributed architecture, with results sent to a centralized Tracking Server for analysis. MLflow also provides an integrated API for starting runs on Databricks.

Parallel Runs 

MLflow can organize many runs with different parameters at the same time, which is useful for hyperparameter tuning.

Interoperability with Distributed Storage  

MLflow Projects may communicate with several distributed storage technologies, such as Azure ADLS, Azure Blob Storage, AWS S3, Cloudflare R2, and DBFS. Whether it’s automatically fetching files to a local context or directly interacting with a distant storage URI, MLflow ensures that projects can manage large datasets, such as analyzing a 100 TB file.

Centralized Model Management with Model Registry  

Large-scale businesses can benefit from the MLflow Model Registry, a unified platform designed for collaborative model lifecycle management. The Model Registry is especially useful in circumstances where different data science teams are generating numerous models at the same time. It speeds up model discovery, tracks experiments, manages versions, and helps teams grasp a model's objective.

Best Practices for Data Versioning in MLflow

Guidelines for Organizing Data Versions

One way to organize data versions is with a dedicated data repository. Once you've implemented a data versioning system, you'll need to define your data repository. A data repository is akin to a Git repository or a cloud storage solution, such as an Amazon S3 bucket.

A well-defined repository will include datasets that are used together for a logical unit of analysis and must stay consistent over time, such as all datasets linked to sales funnel optimization or the ML model used to predict customer churn.

In open-source version control systems like lakeFS, a repository is a logical namespace that contains data, branches, and commits. Before starting to version your data, you must first determine which data will be included in your repository.

Automating Version Control with Pipelines

Write-audit-publish and automated deployment practices ensure that modifications and upgrades to data pipelines are automatically tested, integrated, and deployed to production, allowing for consistent and dependable data processing and delivery.

In data engineering, this includes automated ETL code testing, validating data structure, monitoring data quality, detecting anomalies, deploying updated data models to production, and ensuring that databases or data warehouses are properly configured.

Git’s core function is version management, but its integration with CI/CD solutions such as GitHub Actions makes it an effective deployment tool. Data engineers can automate pipeline deployment by defining certain “actions” or “workflows.” This means that when you push code to a Git repository, it will be deployed to a production environment automatically if it passes all of the required checks.

This technique incorporates production-level engineering into data operations. It ensures that data pipelines are stable, dependable, and constantly monitored. It also means that data engineers may focus on writing and optimizing their code while the deployment process is automated and secure.
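The write-audit-publish pattern itself can be sketched in a few lines: write new data to an isolated staging area, run audit checks, and only publish once the checks pass (the quality rules here are illustrative):

```python
def audit(rows):
    # Illustrative quality checks: no null amounts, no negative amounts.
    return all(r.get("amount") is not None and r["amount"] >= 0 for r in rows)

def write_audit_publish(new_rows, production):
    staging = list(new_rows)          # write: isolated staging copy
    if not audit(staging):            # audit: validate before exposure
        raise ValueError("audit failed; production left untouched")
    production.extend(staging)        # publish: promote to production
    return production

prod = [{"amount": 10.0}]
write_audit_publish([{"amount": 5.0}], prod)
```

The key property is that a failed audit leaves production untouched; consumers only ever see data that passed validation.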

Managing Multiple Data Sources

Handling data coming from multiple sources is one of the biggest challenges in managing data in general – no wonder it affects data versioning as well. When approaching the issue, remember that a key problem in shared data environments is avoiding stepping on your peers’ toes. Data assets are frequently treated as a shared folder from which anybody may read, edit, and modify.

This is why it makes so much sense to create personal versions of the data while developing a solution. This eliminates the chance that a modification you make would inadvertently harm another member of the team. 

Ensuring Consistency Across Development and Production Environments

Database versioning is an essential component of managing changes to your data over time. It enables teams to monitor changes, roll back to prior versions as needed, and ensure consistency across environments.

To manage migrations effectively, keep them small and incremental, as smaller changes are easier to monitor and debug. Use descriptive names for migration scripts to clearly indicate the changes they implement, making their purpose easy to track. Also, automate the migration process by integrating it into CI/CD pipelines, ensuring smooth and consistent deployments while reducing the risk of human error.

MLflow Data Version Control and lakeFS

Data is a big part of any ML project, and you need a solid data versioning solution for it – not only for your ML models. Many solutions for data version control have limited capabilities and integration with other tools in the ecosystem. But lakeFS, an open-source project, bridges this gap.

It enables teams to manage their data using Git-like operations (commit, merge, etc.) while scaling to billions of files and petabytes of data. Bringing best practices from software engineering to data is a wise decision. In this scenario, you add a management layer to your object store, such as S3, which turns your entire bucket into something like a code repository.

lakeFS takes a novel, zero-copy approach to efficiently maintaining different versions of your ML datasets and models without duplicating the data.

Have a look at this guide to see how to use Git and lakeFS to version control your code and data when working locally: ML Data Version Control and Reproducibility at Scale.

Conclusion

Data versioning plays a crucial role in modern machine learning workflows, especially in deep learning, where models undergo regular updates and iterations. MLflow’s Model Registry and other data versioning features offer reliable mechanisms for tracking these versions, allowing users to trace changes, revert to previous iterations, and efficiently compare alternative models.

Are you working with a data lake? Explore MLflow on Databricks to get all the best practices for a smooth data versioning experience.
