

Amit Kesarwani

Amit heads the solution architecture group at Treeverse, the company...

Last updated on February 13, 2025

Machine learning teams face many hurdles, from data sources with missing values to experiment reproducibility issues. MLflow is a tool that tackles these challenges across the machine learning lifecycle, and Databricks makes working with it even more straightforward thanks to its managed MLflow offering.

Managed MLflow expands the capabilities of MLflow, with an emphasis on dependability, security, and scalability. Keep reading to learn more about MLflow on Databricks and how to add data versioning to the mix to achieve experiment reproducibility.

What is MLflow on Databricks?

MLflow is an open-source platform that manages the whole machine learning lifecycle. It comes with the following components:

  • Tracking – Records and compares parameters and outcomes across experiments.
  • Models – Maintains and deploys models from a range of machine learning libraries to a variety of model serving and inference platforms.
  • Projects – Packages ML code in a reusable, reproducible format that can be shared with other data scientists or deployed in production.
  • Model Registry – A centralized model store that manages a model’s whole lifecycle of stage transitions, from staging to production, with versioning and annotation features. Databricks offers a managed version using Unity Catalog.
  • Model Serving – Hosts MLflow models as REST endpoints. Databricks offers a uniform interface for deploying, managing, and querying your AI models.
MLflow
Source: MLflow

Databricks offers a fully managed and hosted version of MLflow, complete with corporate security measures, high availability, and additional Databricks workspace features like experiment and run management, as well as notebook revision recording.

Benefits of MLflow on Databricks

Model Development

MLflow on Databricks
Source: Databricks 

A unified framework for production-ready models helps improve and accelerate machine learning lifecycle management. Managed MLflow Recipes enable smooth ML project startup, quick iteration, and large-scale model deployment.

With MLflow’s LLM services, you can easily develop generative AI apps that interact smoothly with LangChain, Hugging Face, and OpenAI.

Experiment tracking 

Managed MLflow on Databricks
Source: Databricks 

This feature lets users run experiments using any machine learning library, framework, or language, and each experiment’s parameters, metrics, code, and models will be automatically tracked. 

MLflow allows you to securely share, manage, and compare experiment results, along with the matching artifacts and code versions, thanks to built-in integrations with the Databricks workspace and Notebooks. With MLflow’s evaluation feature, you can also review the outcomes of GenAI trials and improve their quality.

Model management

MLflow model management
Source: Databricks 

Use a single location to identify and share ML models, collaborate on their transition from experimentation to online testing and production, link with approval and governance procedures and Write-Audit-Publish pipelines, and track ML deployments and performance. 

The Model Registry promotes the exchange of skills and knowledge while keeping you in control.

Model deployment

Model deployment MLflow
Source: Databricks 

This feature lets you easily deploy production models for batch inference on Apache Spark or REST APIs with built-in integration with Docker, Azure ML, or Amazon SageMaker. 

Managed MLflow on Databricks allows you to operationalize and monitor production models using the Databricks jobs scheduler and auto-managed clusters that grow based on business requirements.

The most recent improvements to MLflow smoothly bundle GenAI apps for deployment. Databricks Model Serving allows you to scale up your chatbots and other GenAI applications.

How to Run MLflow Projects on Databricks

If you’re already a Databricks customer, you can access the MLflow service through your Databricks workspace. 

Note: The MLflow APIs in Databricks are identical to the open-source version, so you can run the same code on Databricks or on your own infrastructure.

To get started with MLflow from within your Databricks account, check the specific instructions for your cloud provider.

Running MLflow on Databricks Community Edition

Databricks Community Edition (CE) is a fully managed and hosted version of the Databricks platform. 

Many of the corporate features of the Databricks platform are not available on CE, but most MLflow features are. The only notable difference is that serving endpoints cannot be created on CE, which prevents model deployment.

To get started, go to the Databricks Community Edition page and follow the steps there. Setup takes around 5 minutes, and you’ll have an almost fully working Databricks workspace where you can log your tutorial experiments, runs, and artifacts.

When you log into the Community Edition, you’ll see a landing page like this:

Databricks Community Edition page
Source: MLflow 

To access the MLflow UI, click on the “Experiments” link on the left side. When you initially open the MLflow UI on CE, you will see this:

Databricks Community Edition MLflow UI
Source: MLflow

With a Databricks-managed instance of MLflow, you have two choices for running the tutorial notebooks:

  1. Import the notebooks directly into CE.
  2. Run the notebooks locally and use CE as a remote tracking server.

How to Create MLflow Projects in Databricks: Tutorial

Step 1: MLflow Tracking

MLflow on Databricks provides a seamless experience for tracking and securing training runs for machine learning and deep learning models.

Learn more about this step here: Track model development using MLflow

Step 2: Model Lifecycle Management

Model Registry is a centralized model repository with a UI and APIs that let you manage the whole lifecycle of MLflow Models.

Databricks offers a hosted version in the Unity Catalog, which comes with centralized model governance, cross-workspace access, lineage tracking, and deployment.

If your workspace is not set up for Unity Catalog, you can use the Workspace Model Registry. Here are the key concepts you need to know:

  • Model – An MLflow Model logged from an experiment or run.
  • Registered model – An MLflow Model that has been registered with the Model Registry. A registered model has a unique name, versions, model lineage, and additional metadata.
  • Model version – The first model registered under a name is assigned Version 1; each new model registered under the same name increments the version number.
  • Model alias – A mutable, named reference to a specific version of a registered model. Aliases are commonly used to identify which model versions are deployed in a particular environment, and to create inference workloads that target a certain alias.
  • Model stage (workspace model registry only) – MLflow provides preset stages for typical scenarios: None, Staging, Production, and Archived. You can transition a model version between stages or request a stage transition.
  • Description – You can annotate a model’s purpose with a description and any information relevant to the team, such as the algorithm, dataset, or methodology.

Example notebooks

The following is an example of how to use the Model Registry to develop a machine learning application that predicts the daily electricity production of a wind farm:

Step 3: Model Deployment

Finally, it’s time to deploy MLflow models! Mosaic AI Model Serving offers a uniform interface for deploying, managing, and querying AI models. Each served model is exposed as a REST API that you can call from your web or client application.

Model Serving supports the following model types:

  • Custom models – They may be registered in either Unity Catalog or the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models.
  • Models from Foundation Model APIs – These models are carefully curated foundation model architectures that enable optimized inference. Base models such as Llama-2-70B-chat, BGE-Large, and Mistral-7B are available immediately with pay-per-token pricing, while workloads that require performance guarantees or fine-tuned model variants can be deployed with provisioned throughput.
  • External models – These models are hosted outside of Databricks. Examples include generative AI models such as OpenAI’s GPT-4 and Anthropic’s Claude. Endpoints that serve external models can be centrally managed, and you can set rate limits and access controls for them.
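Whichever model type sits behind the endpoint, querying it is a JSON POST. The sketch below builds a request body in the dataframe_split format that serving endpoints accept; the workspace URL, endpoint name, and token are placeholders, so the actual network call is left commented out.

```python
import json

# Hypothetical endpoint details; substitute your workspace URL,
# endpoint name, and a real personal access token.
endpoint_url = "https://<workspace>/serving-endpoints/demo-model/invocations"
headers = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# One accepted input format is "dataframe_split": column names plus rows.
payload = {
    "dataframe_split": {
        "columns": ["wind_speed", "temperature"],
        "data": [[11.2, 8.5], [9.7, 12.1]],
    }
}
body = json.dumps(payload)
# requests.post(endpoint_url, headers=headers, data=body)  # the real call
print(body)
```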

Additional Capabilities of MLflow on Databricks

MLflow Model Registry on Databricks

The MLflow Model Registry component consists of centralized model storage, APIs, and a user interface for collaboratively managing an MLflow Model’s full lifecycle. It includes model lineage (the MLflow experiment and run that created the model), model versioning, model aliasing, model tagging, annotations, and stage/status information for each registered model.

MLflow Model Registry allows you to manage the entire lifecycle of MLflow Models. Databricks offers a hosted version of the MLflow Model Registry in the Unity Catalog.

MLflow Model Serving on Databricks

After training and testing your machine learning model, the next step is to deploy it in a production environment. This process can be complicated, but MLflow makes it easier by providing a simple toolkit for deploying your ML models to a variety of targets, such as local environments, cloud services, and Kubernetes clusters.

Model Serving allows you to host machine learning models from the Model Registry via REST APIs. These endpoints are automatically updated according to the availability of model versions and stages. The built-in deployment tools in MLflow allow you to deploy a model to third-party serving frameworks as well.

MLflow supports a number of deployment targets:

  • Deploying locally – With MLflow, deploying a model locally as an inference server is as simple as running the command mlflow models serve.
  • Amazon SageMaker – A fully managed solution for scaling machine learning inference containers. MLflow makes deployment easier with simple commands that eliminate the need to build container descriptions.
  • AzureML – An MLflow Model can be deployed to Azure ML’s managed online/batch endpoints, as well as Azure Container Instances (ACI) and Azure Kubernetes Service.
  • Databricks – Provides a managed service for delivering MLflow models at scale, including performance optimization and monitoring tools.
  • Kubernetes – MLflow deployment interfaces with Kubernetes-native ML serving frameworks, including Seldon Core and KServe (formerly KFServing).
  • Community Supported Targets – MLflow also supports other deployment targets, including Ray Serve, Redis AI, Torch Serve, and Oracle Cloud Infrastructure (OCI), via community-supported plugins.

Databricks Autologging

Databricks Autologging is a no-code solution that builds on MLflow automatic logging to provide automatic experiment tracking for machine learning training sessions on Databricks.

Databricks Autologging automatically captures model parameters, metrics, files, and lineage information as you train models from a number of popular machine learning libraries. 

Training sessions are recorded using MLflow tracking runs. Model files are also recorded, allowing you to simply log them into the MLflow Model Registry and distribute them for real-time scoring using Model Serving.

Managing Machine Learning Infrastructure with lakeFS

Data teams working in ML settings need ways to work on isolated “copies” of the data to preprocess it prior to training. This ensures data integrity, allows for safe experimentation, and supports parallel workflows among team members. Additionally, versioning the data is crucial for reproducing the exact dataset used for training specific models. This practice enhances reproducibility, maintains accountability, ensures regulatory compliance, and facilitates model monitoring and maintenance over time.

lakeFS is a data version control solution that matches the machine learning use case perfectly. Here’s how you can use it to version your data for ML experimentation:

  • ML Data Reproducibility – lakeFS allows you to version the data components of an ML experiment without duplicating the data for multiple experiments.
  • Data Preparation in Isolation – Branches enable each data scientist to prepare data in isolation.
  • Parallel ML Experimentation – Run multiple experiments simultaneously using different dataset versions.
  • Advanced Unstructured Data Filtering – Simplify model development by filtering objects using custom tags.
  • Fast Data Loading for Deep Learning Workloads – Localize data to reduce latency and cut costs by optimizing GPU utilization.

lakeFS and MLFlow in practice

Imagine taking an S3 bucket and making changes on separate branches, committing them once tested, or rolling back to the previous working version of your model.

The ecosystem of tools we have in our environment will access the data through the lakeFS API and get those versioning capabilities.

From a reproducibility perspective, we can include a branch name or commit identifier in the data path to pin down the state of our data at a specific point in time: a specific experiment, the raw data, modified data, and the sample data used to train a specific model, along with the model itself.

We can run these Git-like actions via a notebook, the command-line interface, or the lakeFS UI.
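A minimal sketch of the path convention: lakeFS addresses objects as s3://&lt;repo&gt;/&lt;ref&gt;/&lt;key&gt;, so choosing between a feature branch and a pinned commit is just a matter of which ref goes into the path. The repository, branch, and commit names below are hypothetical.

```python
# lakeFS addresses objects as s3://<repo>/<ref>/<key>, where <ref> is a
# branch name, tag, or commit identifier. Names here are hypothetical.
repo = "ml-datasets"

def lakefs_path(ref: str, key: str) -> str:
    """Build a branch- or commit-qualified lakeFS object path."""
    return f"s3://{repo}/{ref}/{key}"

# Read from a feature branch while experimenting in isolation...
train_path = lakefs_path("experiment-42", "wind-farm/train.parquet")
# ...and pin the exact commit that trained a released model.
pinned_path = lakefs_path("f7a9c1b", "wind-farm/train.parquet")
print(train_path)
print(pinned_path)
```

Any S3-compatible reader (Spark, pandas, boto3) can then consume these paths through the lakeFS gateway and inherit the versioning for free.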

To learn more about how lakeFS works with MLflow, watch this on-demand tutorial: Version control of local datasets for MLflow experiments.

Conclusion

Managed MLflow on Databricks allows you to manage the whole machine learning lifecycle with the dependability, security, and scalability required for corporate applications. It offers an integrated platform for tracking and securing machine learning model training and ML projects in production.

MLflow can run fully managed on Databricks or self-hosted on your own infrastructure. On top of the open-source MLflow capabilities, the managed version adds enterprise-grade security, experiment management, high availability, and Databricks features such as notebook revision recording.
