Ready to dive into the lake?

lakeFS is currently only
available on desktop.

For an optimal experience, provide your email below and one of our lifeguards will send you a link to start swimming in the lake!

lakeFS Community
Vino SD
April 21, 2023

MLOps is mostly data engineering. As organizations ride past the hype cycle of MLOps, we realize there is significant overlap between MLOps and data engineering. As ML engineers, we spend most of our time collecting, verifying, pre-processing, and engineering features from data before we can even begin training models. 

Only 5% of developing and deploying an ML system in production involves ML code. The rest of it is managing training data and serving infrastructure.

Challenges with ML reproducibility

While there are similarities between data engineering and ML engineering, the way ML applications are actually developed is different compared to traditional software development. 

ML workflows are not linear. Engineers experiment with different ML algorithms and parameters in an incremental and iterative manner to arrive at a more accurate ML model. Due to this iterative nature of development, one of the biggest challenges in ML is to make sure that work is reproducible. For example, an ML system designed to diagnose cancer should make the same prediction every time for the same input specimen.

ML workflows are not linear. Reference:

Today, end-to-end ML pipelines involve multi-step complex workflows that comprise: 

  • Pre-processing the training data
  • Generating features
  • Running ML experiments 
  • Validating the ML models for explainability and bias
  • Deployment of model artifacts in production
  • Model monitoring for performance degradation and data drift
  • Re-training the models periodically, and so on

Due to these many moving parts, the reproducibility of the end-to-end ML pipeline is far from a solved problem today.  

How do we achieve ML reproducibility today?

ML practitioners tend to maintain different versions of the ML pipeline and data to achieve reproducibility.  However, copying massive training datasets each time we want to experiment is not scalable. On top of this, there’s no way to maintain versions of several model artifacts and their associated training data atomically. Add to this the complexities of managing versions of structured, semi-structured, and unstructured training data, such as video, audio, IoT sensor data, tabular data, and so on. Lastly, it’s hard to enforce data privacy best practices and data access controls when ML teams create duplicate copies of the same data for collaboration. 

An effective solution to these challenges requires a versioning tool. 

Engineering best practices may suggest Git for versioning data, just like we version code. But Git isn’t safe, sufficient, or scalable for data. In addition, raw training data is often persisted in cloud object stores (S3, GCS, Azure Blob), so we need a versioning tool that works for data in-place (in object stores). 

Let’s dive into how to use a data versioning tool such as lakeFS to version training data, ML code, and models together.

Taking Data-centric Approach to MLOps using lakeFS

lakeFS offers a Git-like interface for data at scale and enables reproducibility of the entire ML training pipeline. That is, it opens the door to atomic versioning of all components of ML experimentation: training data, features, model artifacts, and train/test metrics. 

Git-like branching and experimenting with lakeFS.

Let’s now use a wine-quality dataset to predict the quality of wines and see how lakeFS can be used to run the ML experiments.

To follow along, you can clone the lakeFS-samples repo.

ML Experimentation using lakeFS branches

Setting up the environment:

  1. We are using minIO as our storage. So let’s start by creating a minIO bucket `wine-quality-dataset`.
  2. Head over to the lakeFS UI and create a repository called `wine-quality-prediction`.
  1. In the lakeFS repo, create a branch `ingest-data` from `main`. Upload the wine dataset to `ingest-data` branch.
  1. Next, perform a quick exploratory analysis of the data to understand the dataset better. It’s good that there are no null values here. Quality is the target class, and the remaining columns are independent variables.
  1. As you can see from the histogram below, the target class (wine quality) is highly imbalanced.
  1. Run further analysis on the data set using a pair plot of feature correlations. It seems the features `acidity` and `pH` might be correlated. 
  1. Refer to the Jupyter Notebook where I ran additional exploratory analysis on the dataset as well.

Next, let’s dive into experimenting with different ML models to predict wine quality.

Experiment #1:

For the first experiment, I want to use a random forest classifier to do a multi-class classification of wine quality. My ML training pipeline steps include standard scaling of the input features, then running principal component analysis to understand the feature variance.

  1. Create a new lakeFS branch `exp-1` and write training config.json to the branch and commit it.
config = {
'branch_name': 'exp-1',
'drop_columns': ['type'],
'f1_average': 'micro', #imbalance class problem
'is_scale_input': True,
'is_pca': True,
'test_size': '0.25'
  1. The next step is feature creation. I use StandardScaler to scale my features and plot the PCA. As you can see in the line chart below, six principal components capture 90% of the feature variance, so I will use PCA(n_components=6) to transform the features.
The graph shows principal components on the x-axis and percent of feature variance captured by the components.
  1. Now, let’s write these features to our lakeFS repo and commit them.
  1. When you switch to the commits tab, you can look at the commit history for the exp-1 branch as well.
  1. Before we start the training, let’s split the dataset into train and test sets, and then write the pre-processed train and test data sets to lakefs://wine-quality-prediction/exp-1/preprocessed.
  2. You can review the changes in the uncommitted changes tab. You may then revert the changes as needed or commit them if you’re happy with the dataset.
  1. I then train the random forest classifier and get a training f1-score of 0.65. Definitely not a great model, but I want to save this experiment in the lakeFS repo. So, I write the model as a joblib in exp-1 branch and save the metrics.csv along with it. At the end of experiment #1, this is what my lakeFS repo looks like.

Experiment #2:

For the second experiment, I want to group my wine quality into just three target classes – good, okay, and bad. I use a weighted f1-score as my target metric to account for the class imbalance problem. My ML training pipeline steps doesn’t include standard scaling or PCA.

  1. Create a new lakeFS branch exp-2 from the ingest-data branch and write the config file to the lakeFS repo.
config = {
'branch_name': 'exp-2',
'drop_columns': ['type'],
'f1_average': 'weighted', #imbalance class problem
'is_scale_input': False,
'is_pca': False,
'test_size': '0.25'
  1. Next,I want to categorize the target class (wine quality) into three classes only. The imbalance class problem still exists but let me experiment with this approach and see if it improves the f1-score.
  1. Once I’m done with creating these new target classes, I write them to the lakeFS repo as features.csv and labels.csv as seen below.
  1. Next up, I split the dataset into training and test sets. I train the random forest classifier model and arrive at an f1-score of 0.87. Let’s write the train/test dataset, model artifacts, and metrics to the lakeFS repo as well. The commit history looks like this:
  1. I can traverse the commit history and access the lineage of the training data and other ML components as well. This way, using lakeFS I can version the raw training data, pre-processed data, features, config, model artifacts, and metrics in one place.
  1. In the end, I want to diff the exp-1 and exp-2 branches to see which experiment and model has a better f1-score and decide to deploy that model to main (prod). In the diff, we can see that exp-2 is the winning branch and I will merge exp-2 into the main branch. That is, we’re deploying ml-pipeline-2 to production.
  1. After I merge exp-2 to the main, you can see the commit history to verify the merge. 

ML Reproducibility with lakeFS tags:

We’re done with running two different ML experiments, but the real challenge is in reproducing these experiments at a later point in time. This is where lakeFS can help.

I will now show you how to create a lakeFS tag and reproduce a specific experiment by tagging a commit id.

  1. Let’s start by creating a new lakeFS tag on the exp-1 branch as shown below.
  1. Next up, I’m loading the model artifact from the tag. Notice that I use the tag name instead of the branch name to load the joblib file from the exp-1 branch.
  1. When I run this loaded model against the test data, I get the f1-score of 0.65 which is the same as the experiment-1. That is, we were able to reproduce experiment-1 by simply checking out the lakeFS tag.

This way, you could use lakeFS to build an ML experimentation platform to version multiple ML components atomically and achieve reproducible results.

If you want to try the above ML experiments hands-on, you can find the Jupyter Notebook in the lakeFS-samples repo, and follow the steps.

Git for Data – lakeFS

  • Get Started
    Get Started
  • The annual State of Data Engineering Report is now available. Find out what’s new in 2023 -