Jan Willem Kleinrouweler, appliedAI
Max Mynter, appliedAI

Last updated on March 18, 2024

TL;DR

In this blog post, we will explore how to add data versioning to an ML project: a simple end-to-end rain prediction project for the Munich area. The data assets will be stored in lakeFS, and we will use the lakeFS-spec Python package for easy interaction with lakeFS. After training a model on the initial data, we will update the dataset and retrain the model. Finally, we will revert to the initial data version to reproduce the original model and verify its initial accuracy.

Introduction: Data version control for ML Projects

Data version control is an important part of any machine learning (ML) project. It allows ML practitioners to systematically track different versions of data sets, which are used for training and validating ML models. Data versioning ensures traceability, auditability, and reproducibility. It also enables team members to experiment in isolation and collaborate on projects smoothly. 

What is lakeFS-spec?

lakeFS-spec is a Python library implementing a fsspec backend for the lakeFS data lake. lakeFS is an open-source tool that provides scalable and format-agnostic version control for data lakes, using Git-like semantics to create and access different data versions.

The purpose of lakeFS-spec is to streamline versioned data operations in lakeFS, particularly for Python developers in data science use cases. It combines file access to objects in lakeFS repositories with a unified interface for performing versioning operations.

The integration with the fsspec ecosystem enables popular data science tools such as Pandas, Polars, and DuckDB to seamlessly work with lakeFS repositories.
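For example, once lakeFS-spec is installed, fsspec-aware libraries can read lakefs:// URIs directly. A minimal sketch (the repository, branch, and file names below are placeholders):

import pandas as pd

# Read a CSV straight from a lakeFS repository; "my-repo", "main", and the
# file path are placeholder names used only for illustration.
df = pd.read_csv("lakefs://my-repo/main/data/example.csv")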

What is fsspec?

fsspec is a project to provide a unified Pythonic interface to local, remote, and embedded file systems and bytes storage. It provides a set of filesystem classes with uniform APIs (i.e. functions such as cp, rm, cat, mkdir, etc.). 

A variety of third-party Python libraries (for example, Pandas, DuckDB, Polars, and PyArrow) already use fsspec to perform remote I/O operations, allowing their users to immediately benefit from any additional file system implementations, such as lakeFS-spec.
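Because lakeFS-spec registers itself with fsspec under the lakefs:// protocol, the same uniform filesystem API also becomes available for lakeFS repositories. A small sketch, again with placeholder repository and file names:

import fsspec

# lakeFS-spec registers the "lakefs" protocol, so fsspec can hand back a
# LakeFSFileSystem instance (credentials come from config autodiscovery).
fs = fsspec.filesystem("lakefs")

# Uniform filesystem-style operations against a repository/branch/path.
fs.ls("my-repo/main/")                    # list objects on the main branch
fs.exists("my-repo/main/data/example.csv")  # check whether an object exists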

The value of lakeFS-spec for Python users

Users who work with lakeFS through the lakeFS-spec package reap the following benefits:

  • High-level abstraction over basic lakeFS repository operations
  • Convenient access to the underlying block storage, in addition to lakeFS versioning operations
  • Seamless integration into the fsspec ecosystem
  • Accessing objects in lakeFS directly from popular data science libraries (e.g., Pandas, Numpy, Polars, DuckDB, PyArrow) with just one or two lines of code
  • Transaction support for conducting data version control operations in a conscious and reliable way
  • Zero-config option through config autodiscovery
  • Caching to avoid unnecessary data transfers
Interacting with lakeFS with and without the lakeFS-spec package

  Try it out

Prerequisites 

To follow the code examples, you need: 

  • Access to an empty lakeFS repository (you can easily set up a local instance – refer to the Quickstart guide). Note: In the code snippets in this blog post, we are using the weather-data repository. 
  • For lakeFS-spec to find the credentials for accessing the lakeFS instance, make sure to have a .lakectl.yaml in your home directory or set the LAKEFS_HOST, LAKEFS_USERNAME, and LAKEFS_PASSWORD environment variables (see the sketch after this list). 
  • Python 3.9 or higher installed
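
If you prefer not to rely on autodiscovery, the credentials can also be passed explicitly when constructing the filesystem. A minimal sketch with placeholder values:

from lakefs_spec import LakeFSFileSystem

# Explicit configuration; the values below are placeholders. With a
# ~/.lakectl.yaml or the LAKEFS_* environment variables in place,
# LakeFSFileSystem() without arguments is enough.
fs = LakeFSFileSystem(
    host="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)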

Installation 

Start by installing the lakeFS-spec package and other dependencies for this project using pip (or your favorite Python package manager). We recommend installing the dependencies in a virtual environment: 

$ python -m venv .venv 
$ source .venv/bin/activate
(.venv) $ pip install lakefs-spec pandas requests scikit-learn

Rain Prediction Model 

Our goal is to create a model that predicts whether it will rain in Munich one day ahead. The code below contains a simple ML pipeline that loads the raw weather data from the Open-Meteo API, does some minimal feature engineering, and trains a Random Forest Classifier** model. 

import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

WEATHER_DATA_URL = "https://archive-api.open-meteo.com/v1/archive?latitude=52.52&longitude=13.41&start_date={year}-01-01&end_date={year}-12-31&hourly=temperature_2m,relativehumidity_2m,rain,pressure_msl,surface_pressure,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m"


def load_weather_data(year: int) -> pd.DataFrame:
    weather_data_url = WEATHER_DATA_URL.format(year=year)
    weather_data = requests.get(weather_data_url).json()

    return pd.DataFrame.from_dict(weather_data["hourly"])

def process_weather_data(df: pd.DataFrame) -> pd.DataFrame:
    features_df = df.copy()

    features_df["time"] = pd.to_datetime(features_df["time"])
    features_df = features_df.set_index("time")

    features_df["is_raining"] = features_df["rain"] > 0
    features_df["is_raining_in_1_day"] = features_df["is_raining"].shift(24)

    features_df = features_df.dropna()

    return features_df


def train_rain_prediction_model(train_df: pd.DataFrame) -> RandomForestClassifier:
    x_train = train_df.drop("is_raining_in_1_day", axis=1)
    y_train = train_df["is_raining_in_1_day"].astype(bool)

    model = RandomForestClassifier(random_state=7)
    model.fit(x_train, y_train)

    return model

def score_rain_prediction_model(model: RandomForestClassifier, test_df: pd.DataFrame) -> float:
    x_test = test_df.drop("is_raining_in_1_day", axis=1)
    y_test = test_df["is_raining_in_1_day"].astype(bool)
    return model.score(x_test, y_test)


def rain_prediction_pipeline(year: int) -> None:
    raw_data_df = load_weather_data(year)
    features_df = process_weather_data(raw_data_df)

    train_df, test_df = train_test_split(features_df, random_state=42)

    model = train_rain_prediction_model(train_df)
    accuracy = score_rain_prediction_model(model, test_df)

** Note that a Random Forest Classifier is not the best choice for solving this problem. We use it here to keep the example simple. 

To run the pipeline for 2018, you can call it as follows: 

rain_prediction_pipeline(2018) 

Versioning the Data Assets 

Along the way, several data assets are produced: 

  • Hourly weather data from the API
  • The dataset with the added is_raining and is_raining_in_1_day features
  • The train and test data sets 

Depending on your use case, you may want to version one or more of the data assets. In this example, we are going to version all of them. We use the native functions from Pandas to store the data sets. 

Besides the data, you can store the model in lakeFS and benefit from its versioning capabilities (note: for production setups, you may prefer a proper model registry). In this example, we are serializing the model as a Pickle file and uploading it using lakeFS-spec file buffers. 

The lakeFS-spec library provides a file-system abstraction over the lakeFS Python SDK by implementing the fsspec standard. This allows users of third-party libraries like Pandas to immediately benefit from the additional functionality through their existing familiar I/O interface. 

We can adapt the pipeline to include the upload to the lakeFS repository as follows: 

import pickle

from lakefs_spec import LakeFSFileSystem

…

def rain_prediction_pipeline(year: int) -> None:
    raw_data_df = load_weather_data(year)
    features_df = process_weather_data(raw_data_df)

    train_df, test_df = train_test_split(features_df, random_state=42)

    model = train_rain_prediction_model(train_df)
    accuracy = score_rain_prediction_model(model, test_df)

    fs = LakeFSFileSystem()

    with fs.transaction("weather-data", "main") as tx:
        raw_data_df.to_csv(f"lakefs://weather-data/{tx.branch.id}/data/raw.csv")
        features_df.to_csv(f"lakefs://weather-data/{tx.branch.id}/data/features.csv")

        train_df.to_csv(f"lakefs://weather-data/{tx.branch.id}/data/train.csv")
        test_df.to_csv(f"lakefs://weather-data/{tx.branch.id}/data/test.csv")

        with fs.open(f"lakefs://weather-data/{tx.branch.id}/model/rain_prediction_model.pkl", "wb") as f:
            pickle.dump(model, f)

        commit = tx.commit(f"Weather data and rain prediction model for {year}")
        tx.tag(commit, f"rain-prediction-{year}")

Inside the transaction, all files are written to an ephemeral transaction branch (tx.branch.id); when the with block completes successfully, the changes are committed, tagged, and merged back into main. To complete the upload to lakeFS, we run the pipeline function for 2018 again:

rain_prediction_pipeline(2018) 

Retraining the Model with Recent Data 

Every year, new data comes in, and we can use more recent data to retrain the model. Executing the pipeline again, in this case for 2019, would look as follows: 

rain_prediction_pipeline(2019) 

Executing the pipeline overwrites the data sets and the model on the main branch with the latest versions. However, thanks to the versioning capabilities of lakeFS, we can always retrieve earlier versions. 
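
Before retraining, you can check that the earlier version is still reachable through its tag, for example by listing the data files recorded under it. A quick sketch, assuming the rain-prediction-2018 tag created above:

fs = LakeFSFileSystem()

# List the data files exactly as they were committed for 2018.
print(fs.ls("weather-data/rain-prediction-2018/data/"))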

Revisiting the Model Training 

Later in a project, you may want to revisit the model training. Your goal might be reproducing the model and verifying the results, or tuning the model for better performance.  

In this example, we are going to reproduce the 2018 model. To ensure that the model is trained on exactly the same data, we load the train and test data sets using the tag created earlier. 

train_df = pd.read_csv(f"lakefs://weather-data/rain-prediction-2018/data/train.csv", index_col=0)
test_df = pd.read_csv(f"lakefs://weather-data/rain-prediction-2018/data/test.csv", index_col=0)

model = train_rain_prediction_model(train_df)
accuracy = score_rain_prediction_model(model, test_df)

Comparing Model Improvement 

Being able to retrieve earlier versions of the data and the model is also useful for comparing model performance after retraining. 

In the example below, we compare the performance of the models trained on the 2018 and 2019 training data when evaluated on the 2019 test data set to get a sense of how well these models generalize: 

def load_model(fs: LakeFSFileSystem, ref: str) -> RandomForestClassifier:
    with fs.open(f"lakefs://weather-data/{ref}/model/rain_prediction_model.pkl", "rb") as f:
        model = pickle.load(f)
    return model

fs = LakeFSFileSystem()

# Get the two models
model_2018 = load_model(fs, "rain-prediction-2018")
model_2019 = load_model(fs, "rain-prediction-2019")

# Get the 2019 test data set
test_df = pd.read_csv("lakefs://weather-data/rain-prediction-2019/data/test.csv", parse_dates=True, index_col="time")

# Compare
accuracy_2018 = score_rain_prediction_model(model_2018, test_df)
accuracy_2019 = score_rain_prediction_model(model_2019, test_df)

print(f"{accuracy_2018} vs {accuracy_2019}") # 0.842032967032967 vs 0.8736263736263736

Thanks to version control for both data and models, we could load each version of the trained model and evaluate it on the same test set. As observed, the model retrained on 2019 data achieves slightly higher accuracy on the 2019 test data (87.36%) than the model trained on 2018 data (84.20%).

Conclusion 

Many ML projects don't use data version control because of the adjustments it requires to existing workflows. In this article, we showed that with lakeFS and lakeFS-spec, adding data version control to your ML projects is as simple as writing files, whether from your favorite data science libraries or through low-level file operations. 

Enjoyed working with lakeFS-spec? Follow the project on GitHub and share the repository with your teammates.
