
Einat Orr, PhD

Einat Orr is the CEO and Co-founder of lakeFS, a...

Last updated on December 19, 2025

Developing a machine learning application is a complex process that involves steps such as processing massive volumes of data, testing multiple ML models, parameter optimization, feature tuning, and others.

This is why data version control is critical in the ML environment.

If you want your experiments and data to be reproducible, you need to use the right version control tooling that enables you to keep track of all of the aspects listed above. 

We have already covered the topic of ML data version control. But what about ML model versioning? Keep reading to learn what it is and which tools you can use to version your models.

What is Machine Learning Model Versioning?

Model versioning is the process of tracking and controlling changes to software across time. Whether you’re developing an app or an ML model, you must keep track of every change made by team members to fix errors and avoid disagreements.

A version control framework helps you achieve this goal. Such frameworks track every individual change made by each contributor and store it in a repository, allowing you to spot discrepancies and prevent conflicts when merging concurrent work.

Model versioning vs. data versioning

Model versioning tracks the modifications made to a model. These changes can be attributed to optimization efforts, changes in training data, etc. 
Data version control entails recording changes to a dataset. The data teams work with tends to vary over time as a result of feature engineering efforts, seasonalities, and other factors. This can occur when the original dataset is reprocessed, rectified, or even supplemented with additional data. It’s critical that teams track these developments.

Machine Learning Version Control Types

When it comes to version control, we can differentiate between centralized and distributed systems.

Centralized Version Control Systems

In this scenario, teams have a single repository on a central server. Versions are saved directly to this central repository and it’s thanks to that repository that smooth collaboration can happen. 

However, a centralized system requires a network connection to the central repository, and that repository is also a single point of failure: if the server is lost without a backup, so is the project history. Subversion is a well-known centralized version control system.

Distributed Version Control Systems 

Distributed systems may still rely on a central repository for collaboration, but every client keeps a full local replica of it. Versions are committed to the local repository first and then synchronized with the central one.

The central repository enables collaboration, while local copies allow for offline work and increased backup capabilities. 

Note that keeping a local copy of the entire repository takes up more disk space. Distributed version control systems include Git and Mercurial.

Benefits of Model Versioning

The machine learning development process is highly iterative, with developers seeking the best-performing model while adjusting hyperparameters, code, and data. It is critical to retain a record of these modifications to track model performance relative to the parameters, saving time spent retraining the model for experimentation.

Using a model version control system provides numerous advantages:

Collaboration

If you’re a solo researcher, versioning may not be necessary. But when you work with a team on a large project, collaboration becomes extremely difficult without a version control system in place.

Reproducibility

By capturing snapshots of the complete machine learning process, you can reproduce identical outputs, including the trained weights, saving time on retraining and testing.

Rollback capabilities

When making updates, the model may break. A version control system provides a changelog, which might be useful when your model breaks and you need to roll back your changes to return to a stable version.

Dependency monitoring 

This entails tracking several versions of datasets (training, assessment, and development) and tweaking model hyperparameters and parameter values. Version control enables you to test several models on separate branches or repositories, modify model parameters and hyperparameters, and track the accuracy of each change.

Model updates

Model development does not occur in a single step but in cycles. Version control allows you to control which versions are issued while continuing to develop for future release.

How Do You Version Control Models?

The answer to this question depends on where you are in the model development process. Let’s examine what needs to be versioned in each development stage.

Choosing the Algorithm

To select a model, first decide on the algorithm to be used. You may need to test more than one algorithm and compare their results. Each algorithm should have its own versioning to track changes independently and select the best-performing model.

Making Performance Changes

When building your model or making performance modifications, you should keep track of the changes to understand why the performance changed. This can be accomplished by designating separate repositories for each model. It allows you to test numerous models in parallel while providing isolation between them.

Parameter Versioning

When training models, keep track of the hyperparameters you used. You can create a separate branch for each hyperparameter change and observe how the model’s performance responds.

Trained parameters, like model code and hyperparameters, should be versioned to ensure reproducibility of the same trained weights and to save time when retraining the model. 

To do this using version control systems, you must build a branch for each feature, parameter, and hyperparameter you intend to update. This allows you to analyze each change in isolation while keeping all updates to the same model in a single repository.

The versioning system should store the holdout and performance results and record the performance metrics for each step. Once you’ve chosen the parameters that best suit your needs, you’ll need to merge the change into an integration branch and execute the assessment on that branch.
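The record-keeping described above can be approximated even without a dedicated tool. Here is a minimal, purely illustrative Python sketch that derives a deterministic version id from a run’s hyperparameters; all names are hypothetical, not a specific system’s API:

```python
import hashlib
import json

def record_experiment(params: dict, metrics: dict) -> dict:
    """Create an immutable record of one training run.

    The version id is a content hash of the hyperparameters, so the
    same configuration always maps to the same id. This is a sketch,
    not a real experiment tracker's API.
    """
    blob = json.dumps(params, sort_keys=True).encode()
    version_id = hashlib.sha256(blob).hexdigest()[:12]
    return {"version": version_id, "params": params, "metrics": metrics}

run = record_experiment(
    params={"learning_rate": 0.01, "max_depth": 6},
    metrics={"val_accuracy": 0.91},
)
# Because keys are sorted before hashing, the same hyperparameters
# always produce the same version id, regardless of dict order.
same = record_experiment({"max_depth": 6, "learning_rate": 0.01}, {})
assert run["version"] == same["version"]
```

A real tracker would also capture the code revision and data version alongside this record.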

Model Validation

Model validation ensures that the model performs as intended on real data. This stage requires you to keep track of each validation result and the model’s performance over time. 

Which adjustments to the model improved its performance? If you’re comparing different models, keep track of the validation metrics used to evaluate them.

To achieve this, after assessing the performance on the integration branch, model validation may be performed on the main branch, where you can merge the evaluated changes, execute your validation, and mark the changes that meet your client specifications as deployment-ready versions.

Model Deployment

When your model is ready for deployment, track which versions were sent and the modifications made between them. This allows you to have a staged deployment by deploying your most recent version on the main branch while still developing and refining your model. 

Version control will also provide the fault tolerance required if your model fails during deployment, allowing you to roll back to the prior working version.
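As a sketch of that rollback behavior, here is a minimal in-memory registry; it assumes nothing about any particular deployment platform, and the class and method names are illustrative:

```python
class ModelRegistry:
    """Minimal in-memory sketch: deploy model versions and roll
    back to the previous working one. Illustrative only."""

    def __init__(self):
        self._history: list = []

    def deploy(self, version: str) -> None:
        self._history.append(version)

    @property
    def current(self) -> str:
        return self._history[-1]

    def rollback(self) -> str:
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()  # discard the broken version
        return self.current

registry = ModelRegistry()
registry.deploy("v1.0")
registry.deploy("v1.1")   # suppose v1.1 breaks in production
registry.rollback()
assert registry.current == "v1.0"
```

A production system would persist this history and record who deployed what and when.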

Model Changes

Finally, model versioning can assist ML engineers in understanding what was changed in the model, which functionality was updated by the researchers, and how the functionality was modified. 

Knowing what was changed and how can reduce deployment time and make it simpler to combine different features.

Methods for Versioning Machine Learning Models

Versioning Configurations for Model Training and Deployment

One of the key best practices for versioning ML models is versioning configurations for training and deploying models. These settings include dependencies such as libraries and packages to maintain consistency across the training and deployment environments.

The deployment scripts used to deliver the model can be versioned, making the deployment process more reproducible. Versioning the deployment environment’s dependencies, such as the operating system, runtime libraries, and software packages, ensures consistency.
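One common way to pin those dependencies is a requirements file committed alongside the model; the packages and versions below are purely illustrative:

```text
# requirements.txt — hypothetical pinned training/deployment
# dependencies; the packages and versions are only examples.
scikit-learn==1.4.2
numpy==1.26.4
mlflow==2.12.1
```

Pinning exact versions keeps the training and serving environments reproducible from the same commit.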

Containerization for Model Portability

Model packaging is all about combining model artifacts, dependencies, configuration files, and metadata into a single format that can be easily distributed, installed, and reused. 

The ultimate goal is to simplify the process of deploying a model, making the transition to production easy. Model packaging ensures that teams can easily deploy and maintain a machine learning model in a production setting.

There are various advantages to adopting containerization technologies like Docker and Kubernetes to package ML models. 

One of them is portability. ML models packaged using Docker or Kubernetes can be readily moved between different settings, including development, testing, and production. This enables developers to test their models in various situations and guarantee that they function properly before deployment.

Containerization solutions ensure that machine learning models work consistently across several environments, reducing the need to worry about dependencies and infrastructure. Moreover, containers provide a safe environment for executing ML models, blocking access to sensitive data and lowering the risk of attack.
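As a hedged sketch, packaging a model with Docker might look like the Dockerfile below; the base image, directory layout, and serve.py entrypoint are assumptions for illustration, not a prescribed structure:

```dockerfile
# Hypothetical Dockerfile packaging a model for serving.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model/ ./model/        # versioned model artifacts
COPY serve.py .
CMD ["python", "serve.py"]
```

Tagging the resulting image with the model version ties the container back to the versioned artifacts it contains.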

Versioning Pre-Trained Models and Fine-Tuning Pipelines

We talk about model fine-tuning when a pre-trained model (a “foundation model”) is modified with additional data to learn new knowledge or be trained for a specific job. With an effective training process, a pretrained model can produce more accurate and contextually appropriate outcomes. 

There are major advantages to adopting a pre-trained model. It lowers processing costs, reduces carbon footprint, and enables you to use cutting-edge models without having to train them from scratch. Naturally, versioning pre-trained models makes a lot of sense because it helps you keep track of the changes applied during fine-tuning.

Top Technologies for Machine Learning Model Versioning 

lakeFS


lakeFS is an open-source version control system that runs over the data lake and uses Git-like semantics. Data scientists can use it to train models in isolation, run parallel experiments and achieve full data reproducibility.

A machine learning model is, in essence, a binary file. This means that lakeFS can easily manage and version an ML model since it supports any data format. 

Using lakeFS’s integration with Git, you can manage the versions of your model, the data used to train it, and the code together. If most of your day-to-day work is local, you can use lakectl local to sync local directories with lakeFS paths, or lakeFS Mount, which mounts a lakeFS path to a local directory while maintaining consistency and good performance.

MLflow

MLflow model versioning tool
Source: https://mlflow.org/docs/2.7.0/what-is-mlflow.html 

MLflow is an open-source project that helps ML engineers manage the ML lifecycle. MLflow, like other platforms, allows you to version data and models and repackage code to ensure reproducible results. The platform works well with a variety of ML libraries and tools, including TensorFlow, PyTorch, XGBoost, and Apache Spark.

MLflow’s key components are as follows:

  • MLflow Tracking – Monitor experiments by logging parameters, metrics, code versions, and output files. Log and query experiments using Python, Java APIs, and so forth.
  • MLflow Projects – Organize implementation code in a reproducible manner using coding rules. This allows you to rerun your code.
  • MLflow Models – Standardize the packaging of machine learning models. This lets you use and interact with your models over the REST API. Batch prediction is also possible with Apache Spark.

  • MLflow Model Registry – You can version your models and create a model lineage that displays the model’s development lifecycle.

DVC

DVC
Source: https://dvc.org/doc/use-cases/versioning-data-and-models

Data Version Control (DVC), acquired by lakeFS, lets users capture versions of their data and models in Git commits and store them on-premises or in cloud storage. It also includes a system for switching between the various data contents. The result is a single history for data, code, and ML models that teams can navigate and treat as a work log.

DVC permits data versioning via codification. You create basic metafiles once that describe the datasets, ML objects, and so on that you want to track. The metadata can be stored in Git instead of huge files. You can use DVC to make data snapshots, restore prior versions, replicate experiments, and track evolving metrics.
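For example, after running dvc add on a large file such as data.xml, DVC writes a small metafile (data.xml.dvc) that Git tracks in place of the data itself; the hash and size values below are illustrative:

```yaml
# data.xml.dvc — the small metafile Git tracks instead of the data.
# The md5 and size values here are made up for illustration.
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml
```

Committing this metafile versions the dataset’s identity in Git while the bytes live in the DVC cache or remote storage.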

When you use DVC, unique versions of your data files and folders are cached systematically (avoiding file duplication). To keep the project light, the working data storage is segregated from your workspace but remains connected via file linkages that DVC automatically handles.

GitLFS

Git-LFS
Source: https://git-lfs.com/

Git LFS is a solution for managing huge files and model artifacts in Git repositories. It replaces large files and model artifacts with lightweight pointers, decreasing repository size and allowing for efficient versioning of those files.

Using Git LFS can help preserve reproducibility and collaboration in machine learning projects, making it easy to exchange models and datasets with team members and replicate experiments.

In addition to huge files, Git LFS is handy for handling model artifacts such as trained models, weights, and configuration files. Tracking them with Git LFS lets you version them alongside your source code, making it easier to replicate experiments and share models with team members.

To add model artifacts to Git LFS, use the same git lfs track command as you would for large files. Once tracked, artifacts can be added, committed, and pushed to the repository using Git commands.
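For instance, running git lfs track "*.pt" appends a rule like the following to .gitattributes (the *.pt and *.onnx patterns for model weights are only examples):

```text
# .gitattributes entries written by `git lfs track`
*.pt filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
```

Committing .gitattributes itself ensures every clone of the repository applies the same LFS rules.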

Challenges and Pitfalls in ML Model Versioning

Addressing Model Drift and Concept Drift

Model drift, or the steady degradation of model performance caused by changes in data patterns or real-world conditions, is one of the most common issues in ML deployment.

There are two common types of model drift:

  • Data drift happens when the distribution of input data shifts over time. For example, a content recommendation engine was trained using a historical database of the most popular movies. However, the platform has seen numerous new releases and changes in user viewing behaviors since then.
  • Concept drift happens when the task the model was trained to perform itself changes. For example, a fraud detection model was trained to detect IRSF fraud in telecom, but fraudsters have devised new tactics. Concept drift can occur suddenly, gradually, incrementally, or in a recurring pattern.

To catch these early, use continuous model monitoring tools that spot significant data shifts or model deviations by measuring important model performance metrics.
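One simple drift signal is the Population Stability Index (PSI) over a model input. The sketch below is a minimal pure-Python version; the thresholds mentioned in the docstring are a common rule of thumb, not a standard:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between two numeric samples binned over their shared range.

    Rule of thumb (heuristic only): PSI < 0.1 means little shift,
    0.1-0.25 moderate shift, > 0.25 significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        left = lo + b * width
        right = lo + (b + 1) * width
        count = sum(
            1 for x in sample
            if left <= x < right or (b == bins - 1 and x == hi)
        )
        return max(count / len(sample), 1e-6)  # avoid log(0)

    return sum(
        (frac(actual, b) - frac(expected, b))
        * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

train = [0.1 * i for i in range(100)]   # distribution at training time
shifted = [x + 5 for x in train]        # inputs drifted in production
assert population_stability_index(train, list(train)) < 0.01
assert population_stability_index(train, shifted) > 0.25
```

A monitoring job would compute this per feature on a schedule and alert when the index crosses the chosen threshold.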

Balancing Storage Efficiency and Versioning Detail

Versioning large datasets can quickly use a lot of storage space, especially if they have frequent updates or are high in volume. Managing storage efficiently without sacrificing access to previous versions can be costly and technically challenging.

Managing Multi-Model Dependencies

Organizing various versions of datasets, related metadata, and preparation scripts can strain the infrastructure. Teams must carefully manage the dependencies between different versions of data and code to minimize errors or mismatches that could jeopardize model performance.

Handling Real-Time Model Updates in Production

It’s essential to monitor models in production. That way, you can detect issues with your model and the system that serves it in production before they begin to cause problems. 

Real-time monitoring also lets you take action by triaging and debugging production models or the inputs and systems that support them. Always have a plan for maintaining and developing the model in production.

Best Practices for Machine Learning Model Versioning

Establishing a Model Versioning Strategy Early On

It’s key to develop your model versioning strategy early on. 

First, decide which version control system you’re going to use. You may be tempted to manage your machine learning models with a general-purpose version management system, such as Git. However, this can quickly become time-consuming and inefficient because ML models are frequently big, binary, and dynamic files that are not well suited to Git’s text-based and static approach. Instead, use a version control system specifically intended for ML models.

Another key step in your strategy is establishing a consistent naming convention. One of the issues of versioning ML models is keeping track of the many versions and their properties. A viable solution to this problem is to employ a consistent naming standard that reflects the model’s purpose, design, data, and performance.
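A naming convention like the one described can be generated programmatically. The format below (purpose-algorithm-data-metric-date) is one illustrative choice, not a standard:

```python
from datetime import date

def model_name(purpose: str, algorithm: str, data_version: str,
               metric: str, score: float) -> str:
    """Build a model name encoding purpose, design, data, and
    performance. The convention itself is an illustrative choice."""
    return (f"{purpose}-{algorithm}-data_{data_version}"
            f"-{metric}_{score:.3f}-{date.today():%Y%m%d}")

name = model_name("churn", "xgboost", "v12", "auc", 0.8731)
# e.g. "churn-xgboost-data_v12-auc_0.873-20251219"
```

Whatever convention you pick, enforcing it in code rather than by hand keeps names consistent across the team.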

Automating the Versioning Process for Scalability

Automation through CI/CD pipelines ensures that the version control stages are seamlessly integrated into the ML workflow. It reduces the likelihood of human error and increases efficiency.

Use tools that automatically track and version models and associated metadata, guaranteeing that each model version is reproducible and traceable. You can also configure the CI pipelines to automatically initiate the model training process whenever changes occur in the codebase or data (for example, if you use Git, every push can trigger the pipeline).
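As an illustrative sketch (not an official template), a GitHub Actions workflow that kicks off retraining on pushes touching code or data manifests might look like this; the paths and script names are assumptions:

```yaml
# Hypothetical CI workflow: retrain when code or data manifests change.
name: retrain-model
on:
  push:
    paths:
      - "src/**"
      - "data/**.dvc"
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python train.py   # logs params and metrics to the tracker
```

Scoping the trigger to specific paths avoids retraining on unrelated changes such as documentation edits.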

Make sure that you have automatic monitoring in place to check the model’s performance after deployment; relying entirely on manual inspections is bound to become time-consuming.

Integrating Versioning with Experimentation Workflows

Make sure to document experiments and changes, including the reasons for changes and the details of the experiments to enable smooth collaboration. To improve the documentation process, you can incorporate automatic document production into CI/CD pipelines, ensuring that documentation is constantly up to date with the most recent code changes.

Don’t forget to provide descriptive comments throughout the code to explain sophisticated logic, algorithms, or any complex decisions made during development and deployment. 

Leveraging Versioning for Compliance and Auditing

Model versioning and governance allow you to recreate and compare the performance of various versions of your models, which helps in optimization and debugging. They also enable you to roll back to prior versions in the event of mistakes or malfunctions, which can boost your reliability and availability. Finally, they help in documenting and auditing your models’ history and lineage.

Governing your models entails ensuring that they meet corporate and legal requirements and expectations, including accuracy, fairness, robustness, privacy, and ethics. Depending on your context and aims, you can regulate your models using various processes and methods. For example, you can use a code review process to evaluate and approve your models’ code and logic before deploying them. You can also use a testing and validation procedure to validate and evaluate the findings and behavior of your models in various scenarios and contexts. 

One more approach is employing a monitoring and feedback method, in which you track and assess the performance and impact of your models in production while also collecting and incorporating user and customer input.

Machine Learning Model Versioning Use Cases

Every use case that involves model versioning calls for data versioning as well. This is where lakeFS can help. lakeFS brings version control to data lakes and creates data versions using Git-like semantics. 

lakeFS allows you to apply ideas to your data lake, such as branching to create an isolated version of the data, committing to create a reproducible point in time, and merging to combine your changes in a single atomic action.

Data preprocessing

Developing machine learning models requires more than just executing code; it also requires training data and the necessary parameters. Updating machine learning models is an iterative process that requires tracking all prior changes.

Data versioning allows you to save a snapshot of the training data and experimental outcomes, making implementation easier with each iteration.

Many data preparation activities become more efficient when data is kept in the same format as code. lakeFS helps implement data versioning at all stages of the data lifecycle via zero-copy isolation and pre-commit and pre-merge hooks that allow you to design an automated procedure.

Parallel experimentation

Using lakeFS, you can run experiments in parallel with different settings without sacrificing performance or scalability. For example, you can combine lakeFS with MinIO to achieve parallelism without incurring additional storage costs since lakeFS lets you efficiently maintain different versions of your ML datasets and models without duplicating the data.

Data reproducibility

Since lakeFS provides a Git-like interface for data, you can easily track different versions of your datasets and easily reproduce their state at any point in time.

To make data reproducible, all you need to do is create a new commit for your lakeFS repository when the data in it changes. As long as a commit has been made, reproducing a particular state is as simple as reading data from a path containing the unique commit_id generated for each commit.
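A minimal sketch of that addressing scheme, assuming a hypothetical repository name and commit id (an actual read requires a running lakeFS server):

```python
def lakefs_object_uri(repo: str, ref: str, path: str) -> str:
    """Build a lakeFS URI that pins a read to a branch or commit.

    This sketch only shows how the commit id addresses the data;
    reading through such a path needs a lakeFS endpoint.
    """
    return f"lakefs://{repo}/{ref}/{path}"

# A branch name resolves to the latest state; a commit id resolves to
# an immutable, reproducible snapshot of the same dataset.
latest = lakefs_object_uri("ml-repo", "main", "datasets/train.parquet")
pinned = lakefs_object_uri("ml-repo", "c7a632d0", "datasets/train.parquet")
assert pinned == "lakefs://ml-repo/c7a632d0/datasets/train.parquet"
```

Recording the pinned URI with each trained model is what makes the experiment reproducible later.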

Wrap up

Machine learning model versioning is a critical step during and after model development. It supports collaboration, history-keeping, and performance monitoring across different versions. Whether you choose a centralized or distributed version control system, make sure that the solution you pick meets the unique requirements of your project. 

Read this guide for more insights about version control in ML: ML Data Version Control and Reproducibility at Scale.
