Data Version Control With Python: A Comprehensive Guide for Data Scientists

Oz Katz

Oz Katz is the CTO and Co-founder of lakeFS, an...

August 11, 2023

With the AI revolution well on its way, the amount of data, whether real or synthetic, is growing at an exponential rate. While this is generally a good thing – expanding our understanding of the universe and allowing us to build smarter, more capable AI and machine learning algorithms – it also comes with new challenges.

Ensuring that the data teams use throughout the development lifecycle is reliable and of high quality is a challenge that scales with the amount of data used.

As a result, the share of time spent on this task keeps growing as well:

Pie chart illustrating the most time-consuming tasks of data scientists

Being able to not only build reliable models but to do so in a testable, reproducible, and reliable manner is a common data management problem today. This is especially painful for ML engineers who need to keep track of the machine learning models or large files – and be able to revert to a previous version if necessary.

Understanding Useful Data Version Control Concepts and Elements for Python

How can a data version control system help?

So how do we go about building testable, repeatable, and reliable workflows? We need a methodology for tracking files.

Let’s break it down:

Testing is the means of ensuring our input data (or ground truth) adheres to a set of well-defined and measurable criteria.

Common examples of this would be ensuring a column has no null values or that image files are of a specific resolution. Testing helps us validate that our data cleaning processes are correct and that we don’t accidentally feed garbage into our pipeline.
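
As a rough illustration of such a check, the sketch below flags rows whose value is missing in a given column. It uses plain Python dictionaries so that it runs without any dependencies; the function name find_missing and the sample rows are made up for this example. With pandas, df[column].isna() expresses the same test.

```python
import math


def find_missing(rows, column):
    """Return the indexes of rows where `column` is absent, None, or NaN."""
    missing = []
    for index, row in enumerate(rows):
        value = row.get(column)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing.append(index)
    return missing


# Hypothetical sample rows, shaped like the lakes dataset used later on
rows = [
    {'Lake_name': 'Superior', 'Depth_m': 406.0},
    {'Lake_name': 'Winnipeg', 'Depth_m': None},
    {'Lake_name': 'Great Bear', 'Depth_m': float('nan')},
]

missing = find_missing(rows, 'Depth_m')
if missing:
    print(f'rows with a missing depth: {missing}')
```

A check like this can run before data is written anywhere, so that bad rows are caught before they reach the pipeline.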

Reproducible and Reliable are more nuanced terms. By definition, any scientific work should be repeatable. If we cannot run the same experiment and get the same results, not only will we have issues troubleshooting and improving our algorithms incrementally, but we’ll also have issues scaling them.

This is especially important in fields like ML where data practitioners can benefit a lot from versioning both data and model files to achieve reproducible machine learning experiments.

Elements

So what do we need in order to version our data? Assuming we already have data stored on cloud-based storage systems from most major cloud providers, such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage, we need one more piece: a data versioning system. This is where lakeFS comes in.

What is essential for data versioning: data, data versioning system, and storage.

lakeFS itself is not a storage system. Your data stays in place within the object store, as it always has. When installed, lakeFS connects to the object store and keeps track of changes as they are made to the data. lakeFS adds the following concepts to our storage layer:

Repository – In lakeFS, a repository is a set of related objects (or collections of objects), similar to a Git repository (Git is probably the version control system most software developers use to version code). In many cases, these represent tables of various formats for tabular data, semi-structured data such as JSON, CSV files, or log files – or a set of unstructured objects such as images, videos, sensor data, data and model files, etc.

Commit – Using a commit – similar to a Git commit – you can view a repository at a certain point in its history, and you’re guaranteed that the data you see is exactly as it was at the point of committing it. These commits are immutable “checkpoints” containing all the contents of a repository at a given point in the repository’s history.

Each commit contains metadata – the committer, timestamp, and commit message, as well as arbitrary key/value pairs you can choose to add.

Branch – Branches in lakeFS allow users to create their own “isolated” view of the repository. Changes on one branch don’t appear on other branches. Users can make changes to one branch and apply them to another by merging them. Multiple versions created by multiple users aren’t a problem: under the hood, branches are simply a pointer to a commit along with a set of uncommitted changes.

Merge – Merging is the way to integrate all the data changes from one branch into another. The result of a merge is a new commit, with the destination as the first parent and the source as the second.

Learn more about the basic concepts in the lakeFS documentation.
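
To make these concepts more concrete, here is a toy, in-memory model of repositories, commits, branches, and merges. It is only a sketch for building intuition, not how lakeFS is implemented: the class names are invented for this example, each commit stores a full dictionary of objects, and the merge is deliberately naive (the source branch wins, with no conflict detection).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    objects: dict   # path -> content, a full snapshot at commit time
    parents: tuple  # ids of the parent commits
    message: str


@dataclass
class Branch:
    head: str                                   # id of the commit the branch points to
    staged: dict = field(default_factory=dict)  # uncommitted changes


class ToyRepository:
    def __init__(self):
        self.commits = {}
        root = self._store(Commit(objects={}, parents=(), message='initial'))
        self.branches = {'main': Branch(head=root)}

    def _store(self, commit):
        commit_id = f'c{len(self.commits)}'
        self.commits[commit_id] = commit
        return commit_id

    def create_branch(self, name, source):
        # Only the pointer is copied, never the data
        self.branches[name] = Branch(head=self.branches[source].head)

    def write(self, branch, path, content):
        # Writes are staged on the branch until they are committed
        self.branches[branch].staged[path] = content

    def commit(self, branch, message):
        b = self.branches[branch]
        snapshot = {**self.commits[b.head].objects, **b.staged}
        b.head = self._store(Commit(snapshot, (b.head,), message))
        b.staged = {}
        return b.head

    def merge(self, source, destination, message):
        # Naive merge: the source's objects win, no conflict detection
        src, dst = self.branches[source], self.branches[destination]
        merged = {**self.commits[dst.head].objects,
                  **self.commits[src.head].objects}
        # Destination is the first parent, source is the second
        dst.head = self._store(Commit(merged, (dst.head, src.head), message))
        return dst.head


repo = ToyRepository()
repo.create_branch('cleanup', source='main')
repo.write('cleanup', 'lakes.parquet', b'cleaned data')
cleanup_head = repo.commit('cleanup', 'removed outliers')
main_before = repo.branches['main'].head
merge_id = repo.merge('cleanup', 'main', 'merge cleanup into main')
```

Until the merge, the main branch keeps pointing at its original commit, which is the isolation property described above.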

Getting Started with Data Version Control in Python for Data Science

OK, so now that we understand where lakeFS fits in and the basic concepts behind it, let’s try it out!

In this guide, we’ll set up a local lakeFS installation for data version control. Python and pandas come next. We will do some data preparation on an isolated branch! Finally, we’ll commit our changes for reproducibility and see how we can later re-run our experiment on the commit we generated.

Ready? Let’s get started!

Setting Up

The first thing we’ll have to do is set up a lakeFS environment for versioning data.

The simplest option is to use the free lakeFS playground, which requires no installation: sign up and skip ahead to the client installation.

Alternatively, run lakeFS locally as described next.

Run lakeFS

We can run lakeFS locally with a single command from a terminal (this requires Docker):

docker run --name lakefs --pull always \
             --rm --publish 8000:8000 \
             treeverse/lakefs:latest \
             run --local-settings

This will download the lakeFS Docker Image and run a local lakeFS instance on our machine. See the full guide in the lakeFS documentation.

Set up via UI

Once lakeFS is running, you should see something like this in your terminal:

[…]
│
│ If you're running lakeFS locally for the first time,
│     complete the setup process at http://127.0.0.1:8000/setup
│
[…]

Go ahead and follow the link to http://127.0.0.1:8000/setup.

Make a note of the Access Key ID and Secret Access Key and click on Go To Login. Log in with the credentials you’ve just created.

You’ll notice that there haven’t been any repositories created yet. Click the Create Sample Repository button.

How to create sample repository with lakeFS

By now, we’ve successfully created a repository. Feel free to play around with it in the UI! You can look for the elements we talked about: commits, branches, merging, etc.

Managing repository with lakeFS using commits, branches, and merging elements

Pip install client

Our setup is almost complete. Whether we used the lakeFS playground or the steps above to install lakeFS locally, next we’ll want to use lakeFS from our notebook or code.

For that, we need the lakeFS Python SDK. Let’s install it:

$ pip install lakefs_client    

Basic Data Version Control Workflow in Python

Cool, now we should have a lakeFS environment with a sample repository and the Python SDK installed too. Let’s use it!

Instantiating a lakeFS client in Python

Let’s open a Python shell (or notebook) and create a new lakeFS client:

import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = 'AKIAIOSFODNN7EXAMPLE'
configuration.password = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
configuration.host = 'http://localhost:8000'

client = LakeFSClient(configuration)    

If you’re using a hosted lakeFS installation rather than a local one, replace the host above with your installation’s host (https://<installation-id>.us-east-1.lakefscloud.io/). Fill the username and password fields with the access key ID and secret access key, respectively, for your installation.
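
Hard-coding credentials is fine for a quick experiment, but it is easy to leak them that way. One possible alternative, sketched below, is to read the connection settings from environment variables. The function name and the variable names (LAKEFS_HOST, LAKEFS_ACCESS_KEY_ID, LAKEFS_SECRET_ACCESS_KEY) are hypothetical and chosen for this example; they are not something lakeFS itself defines here.

```python
import os


def lakefs_settings(env=None):
    """Collect lakeFS connection settings from environment variables.

    The variable names are hypothetical, chosen for this example.
    """
    env = os.environ if env is None else env
    required = ('LAKEFS_ACCESS_KEY_ID', 'LAKEFS_SECRET_ACCESS_KEY')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f'missing environment variables: {", ".join(missing)}')
    return {
        # Fall back to the local installation used in this guide
        'host': env.get('LAKEFS_HOST', 'http://localhost:8000'),
        'username': env['LAKEFS_ACCESS_KEY_ID'],
        'password': env['LAKEFS_SECRET_ACCESS_KEY'],
    }


# Demonstration with an explicit mapping instead of the real environment
settings = lakefs_settings({
    'LAKEFS_ACCESS_KEY_ID': 'AKIAIOSFODNN7EXAMPLE',
    'LAKEFS_SECRET_ACCESS_KEY': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
})
```

Each value can then be assigned to the attribute of the same name (host, username, password) on the configuration object created above.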

Reading and writing files

So we have a client! Let’s start by using it to upload some dummy data into our lakeFS repository. Since we’re only experimenting, let’s not do it on the main branch: in a real deployment, main is likely the “production” branch that other users rely on.

If you remember, the way to achieve isolation in lakeFS is by creating a branch – so let’s do that:

# Create a branch to upload to
client.branches.create_branch(
    repository='quickstart', 
    branch_creation=models.BranchCreation(
        name='my-first-branch', 
        source='main'))

Here we created a new branch named my-first-branch which branches off our main branch.

Branches are cheap in lakeFS! Even if our main branch held petabytes of data, creating a branch is always a quick, zero-copy operation. lakeFS keeps track of the changes made on each branch, so only those changes are stored.

So now that we have an isolated branch, let’s write some dummy data to it:

# Write an example file that we will upload
with open('example.txt', 'w') as f:
    f.write('hello world!\n')

# Upload the file to a lakeFS branch (opened in binary mode)
with open('example.txt', 'rb') as f:
    client.objects.upload_object(
        repository='quickstart',
        branch='my-first-branch',
        path='data/example.txt',
        content=f)

We can now ask lakeFS for the data we just wrote. Reading a lakeFS object is pretty simple:

# Reading a file
f = client.objects.get_object(
    repository='quickstart',
    ref='my-first-branch', 
    path='data/example.txt')

print(f.read())  # the contents we uploaded: hello world!

Import existing files

Another way to get data into lakeFS is by importing existing data. Importing doesn’t move data at all: it is done by instructing lakeFS to scan our existing storage and simply pointing to it from our repository.

This allows us to branch, modify, commit, and use all other lakeFS capabilities on existing data without creating a physical copy of it.

Illustration: importing existing data by creating pointers to data in another object store location

Check out the quick import guide in the lakeFS documentation.

pandas example

Reading and writing raw files is nice, but in the real world we’re typically looking to aggregate, select, and filter.

Let’s see how we can use the popular pandas library to read and write to lakeFS.

First, let’s make sure we have pandas installed. We’ll also include pyarrow since it includes utilities that allow pandas to read the Parquet files included in the lakeFS samples:


$ pip install pandas pyarrow     
    

Now, we can import pandas and read a dataframe:

>>> import pandas as pd

>>> f = client.objects.get_object(
...     repository='quickstart',
...     ref='my-first-branch',
...     path='lakes.parquet')

>>> df = pd.read_parquet(f)
>>> df
       Hylak_id    Lake_name                   Country      Depth_m
0             1  Caspian Sea                    Russia  1025.000000
1             2   Great Bear                    Canada   446.000000
2             3  Great Slave                    Canada   614.000000
3             4     Winnipeg                    Canada    36.000000
4             5     Superior  United States of America   406.000000
...         ...          ...                       ...          ...
99995     99996                                 Canada     6.580317
99996     99997                                 Canada     9.083196
99997     99998                                 Canada    20.822592
99998     99999                                 Canada    10.418225
99999    100000                                 Canada    20.179936

[100000 rows x 4 columns]

>>>      

That’s nice!

However, in many cases, we want deeper integration between pandas and the underlying storage. This allows us to read more than one file at a time, and allows pandas to efficiently write back to storage.

Luckily, lakeFS also supports the S3 API, which pandas works with seamlessly! Let’s see how to use it:


$ pip install boto3 s3fs     
    

Now that we have our dependencies installed, we can read and write data frames from a lakeFS branch using the S3 API:

import pandas as pd

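# lakeFS credentials and S3-compatible endpoint, passed through to s3fs.
# Replace these with your own access key, secret key, and lakeFS host
# (for the local setup above, that would be http://localhost:8000).
lakefs = {
    'key': 'AKIAIOSFOLQUICKSTART',
    'secret': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    'client_kwargs': {
        'endpoint_url': 'https://treeverse.us-east-1.lakefscloud.io',
    }
}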

# Read dataframe
df = pd.read_parquet('s3://quickstart/my-first-branch/lakes.parquet', storage_options=lakefs)

# Let's filter it
df = df[df['Depth_m'] > 100]

# Write it back!
df.to_parquet('s3://quickstart/my-first-branch/lakes.parquet', storage_options=lakefs)    

You’ll notice that we’re using a URI scheme here that includes the name of the branch in the path: this is how lakeFS can distinguish between different versions of the data. We’ll use that later to read from a specific point-in-time by replacing the name of the branch with a commit identifier.

If we now look at our my-first-branch in the lakeFS UI, we can see that we have uncommitted changes:

Uncommitted changes in the lakeFS UI

Great, we now have a working example of reading and modifying pandas dataframes on a lakeFS branch! 🥳

Integrating Data Version Control with ML and Data Science

Let’s try a more realistic example:

Imagine that we have some existing data that we want to extract features from.

The first thing we’d want to do is make sure our data is of sufficient quality. For this, we’ll typically want to remove empty values, outliers, etc.

Data Version Control in Data Pre-Processing Phase

Create a branch

Like in previous examples, our first step would be to create an isolated branch:

# Create a branch to do our cleanup on
client.branches.create_branch(
    repository='quickstart', 
    branch_creation=models.BranchCreation(
        name='cleanup', 
        source='main'))   

Remove outliers

Now, let’s explore our dataset. We probably don’t want to include extremely deep lakes or extremely shallow ones, as these might be very different from the rest.

# Plotting from the 0.1% shallowest to the 0.1% deepest
df = pd.read_parquet('s3://quickstart/cleanup/lakes.parquet', storage_options=lakefs)
df['Depth_m'].quantile([0.001, 0.01, 0.1, 0.5, 0.75, 0.90, 0.99, 0.999]).plot(style='.-')   

Note: If you’re running in an interactive shell instead of a notebook, you might need to save the output as an image file to view the plot:


df['Depth_m'].quantile([0.001, 0.01, 0.1, 0.5, 0.75, 0.90, 0.99, 0.999]).plot().figure.savefig('depth.png')   
    

We should see something like this:

Graph representing deep lakes

The plot shows that the deepest lakes are far deeper than the rest. For the sake of our experiment, let’s filter these out:

# Remove the top 5% deepest lakes
df = df[df['Depth_m'] <= df['Depth_m'].quantile(0.95)]   

Things are now looking a bit better!

Graph showing the result of a filtering operation that reduced the number of lake depth outliers

No more outliers!
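
To see what the 95th-percentile cutoff does on a small scale, the sketch below reimplements a quantile with linear interpolation (the default interpolation in pandas) in plain Python and applies the same filter to a made-up list of depths. The data and function name are invented for illustration; in practice, the pandas one-liner above is all that is needed.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)  # floor, since position is never negative
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical depths: 19 ordinary lakes and one extreme outlier
depths = [float(d) for d in range(1, 20)] + [1000.0]

cutoff = quantile(depths, 0.95)
kept = [d for d in depths if d <= cutoff]
print(f'kept {len(kept)} of {len(depths)} lakes')
```

The cutoff lands between the two largest values, so the single extreme depth is dropped while every ordinary lake is kept.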

Let’s also validate that we don’t have any empty or zero values:

>>> df['Depth_m'].hasnans
False
>>> df[df['Depth_m'] <= 0.0]
Empty DataFrame
Columns: [Hylak_id, Lake_name, Country, Depth_m]
Index: []      

Things are looking good; let’s save these values back to our repository and commit!

# Save to branch
df.to_parquet('s3://quickstart/cleanup/lakes.parquet', storage_options=lakefs)

# Commit changes
client.commits.commit(
    repository='quickstart',
    branch='cleanup',
    commit_creation=models.CommitCreation(
        message='removed outliers'))

Yay! We’ve saved our changes and committed them, making them available to others. Let’s see our commit in the lakeFS UI:

Verifying commit in the lakeFS UI

Data Versioning in the Modeling Phase

Branch per experiment

While working on a model, we might want to experiment with different input selections and different datasets.

Branching makes this a repeatable process. We can create multiple experiments, each tied to a specific lakeFS branch, to version both inputs and outputs:

# Create 9 experiment branches, all branched off of "cleanup"
for experiment_num in range(1, 10):
    client.branches.create_branch(
        repository='quickstart',
        branch_creation=models.BranchCreation(
            name=f'experiment-jane-doe-lakes-{experiment_num}',
            source='cleanup'))

Looking at the “Branches” tab in the UI, we now see this:

Verifying branches tab and commit ID in the lakeFS UI

As we can see, all of these branches currently point to the same commit ID. Since we’ve made no changes to them, they simply point to the same commit as the cleanup branch.

However, main and cleanup are no longer identical since we’ve made changes and committed them on our cleanup branch.

Commit + metadata

Now that we have multiple branches, we can use them all in parallel to play around with different model weights, input data, and hyperparameters.

Actually building a machine learning model is beyond the scope of this guide, so let’s mock that part:

import os
import io
model = io.BytesIO(os.urandom(1024 * 1024 * 10)) # random 10MB of data

# Let's upload
client.objects.upload_object(
    repository='quickstart', 
    branch='experiment-jane-doe-lakes-1', 
    path='models/lakes.h5', 
    content=model)

Uncommitted changes tab in the lakeFS UI

Let’s commit our changes, this time documenting the experiment using commit metadata:

# Committing, notice the metadata dictionary!
client.commits.commit(
    repository='quickstart', 
    branch='experiment-jane-doe-lakes-1', 
    commit_creation=models.CommitCreation(
        message='modeling experiment: #1',
        metadata={
            'hp_param_a': '35',
            'hp_param_b': 'false',
            'hp_param_c': '3.142',
            'tensorflow_version': '2.13.0',
        }))   

Let’s see the commit in the UI:

Commits tab in the lakeFS UI

We can see all the parameters and any additional metadata recorded for the experiment that generated this model.

Of course, if we had also modified our input data on the branch, the same commit would have captured those changes along with our model.
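
The commit above passes every metadata value as a string. Assuming string values are what the API expects (an inference from the example, not something this guide states), a small helper can convert a dictionary of experiment parameters before committing. The function name is hypothetical.

```python
def to_commit_metadata(params):
    """Convert a dict of experiment parameters into string-only metadata."""
    metadata = {}
    for key, value in params.items():
        if isinstance(value, bool):
            # Match the lowercase 'true' / 'false' style used above
            metadata[str(key)] = 'true' if value else 'false'
        else:
            metadata[str(key)] = str(value)
    return metadata


metadata = to_commit_metadata({
    'hp_param_a': 35,
    'hp_param_b': False,
    'hp_param_c': 3.142,
    'tensorflow_version': '2.13.0',
})
```

The resulting dictionary can be passed as the metadata argument of models.CommitCreation, in place of the hand-written one.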

Reproducibility

Reading from a commit

Let’s imagine one of our experiments proved successful. A day or a month goes by, and now we want to further improve our model. Does anyone remember how the model was generated?

Which parameters were used? Let’s use the blame functionality in the lakeFS UI!

Blame functionality in the lakeFS UI
Last commit to modify object in the lakeFS UI

So we know exactly how this model/dataset was created, by whom, when, and using which parameters.
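
With the commit identified, we can read the data exactly as it was at that point by using the commit ID in place of the branch name in the S3-style path, as described earlier. The sketch below shows one way to wrap this; the helper names are hypothetical, and '<commit-id>' is a placeholder for the real ID shown in the lakeFS UI.

```python
def lakefs_uri(repository, ref, path):
    """Build an S3-style URI; `ref` may be a branch name or a commit ID."""
    return f's3://{repository}/{ref}/{path.lstrip("/")}'


def read_lakes(ref, storage_options, repository='quickstart'):
    """Read lakes.parquet as it existed at `ref` (needs a running lakeFS)."""
    import pandas as pd  # imported here so lakefs_uri has no dependencies
    return pd.read_parquet(
        lakefs_uri(repository, ref, 'lakes.parquet'),
        storage_options=storage_options)


# '<commit-id>' is a placeholder: copy the real ID from the lakeFS UI
branch_uri = lakefs_uri('quickstart', 'cleanup', 'lakes.parquet')
commit_uri = lakefs_uri('quickstart', '<commit-id>', 'lakes.parquet')
```

Calling read_lakes with a commit ID and the storage options defined earlier should return the data frame as it was committed, even if the branch has moved on since.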

Best Practices for Data Version Control in Python for Data Science And Engineering

Attaching Git commit hashes to lakeFS commits

This is the last piece required for our workflow: our training code is hopefully versioned as well! For that, we’re likely using GitHub or a similar Git hosting service.

What if we could tie our model back to the code that generated it? lakeFS provides a nifty way of doing so. All we have to do is take the Git commit ID from GitHub and add it to the lakeFS commit metadata:

How to add the Git commit ID from GitHub to the lakeFS commit metadata

Let’s take that commit hash and add it to our metadata, along with a direct link to the GitHub repository:

# Notice the additional 2 metadata fields!
client.commits.commit(
    repository='quickstart', 
    branch='experiment-jane-doe-lakes-1', 
    commit_creation=models.CommitCreation(
        message='modeling experiment: now with Git!',
        metadata={
            'hp_param_a': '35',
            'hp_param_b': 'false',
            'hp_param_c': '3.142',
            'tensorflow_version': '2.13.0',
            'git_commit_sha': 'e1eff68ac5db',
            '::lakefs::GitHub::url[url:ui]': 'https://github.com/ozkatz/is_alpaca/commit/e1eff68ac5db51ae71b5be3e43902ce686cf3a82'
        }))   

Now, if we look at the commit, we see a new button appear:

Modeling experiment with Git in the lakeFS UI

Clicking it will take us directly to the GitHub commit that modified this model.
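
Copying the hash by hand works, but it can also be collected automatically when the training code runs from a Git checkout. The following is a minimal sketch under that assumption: the function names are hypothetical, git must be available on the PATH, and the two metadata keys are the ones used in the commit above.

```python
import subprocess


def current_git_sha(repo_dir='.'):
    """Return the full SHA of HEAD in the Git working tree at `repo_dir`."""
    result = subprocess.run(
        ['git', 'rev-parse', 'HEAD'],
        cwd=repo_dir, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def git_metadata(sha, github_repo_url):
    """Build the two Git-related metadata fields used in the commit above."""
    return {
        'git_commit_sha': sha[:12],
        '::lakefs::GitHub::url[url:ui]':
            f'{github_repo_url.rstrip("/")}/commit/{sha}',
    }


# Using the commit from the example above rather than a live repository
fields = git_metadata(
    'e1eff68ac5db51ae71b5be3e43902ce686cf3a82',
    'https://github.com/ozkatz/is_alpaca')
```

In a training script, git_metadata(current_git_sha(), repo_url) could be merged into the metadata dictionary right before committing.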

Data versioning is within your reach

In this guide, we covered why data versioning is an important practice to adopt when performing data tasks, from isolation to reproducibility. We’ve also learned how to set up and use lakeFS with common data science tools such as pandas.

Of course, the journey doesn’t have to stop here! Here are a few resources that should be helpful to those looking to adopt lakeFS:

  • Browse the extensive list of lakeFS integrations to see how to connect lakeFS with the tools you’re using,
  • Check out the sample notebooks with hands-on, runnable guides on many data science use cases,
  • Looking to use lakeFS with your own data? Start using lakeFS Cloud for free or deploy the open source version on your environment.

Of course, you’re more than welcome to join over 1,000 other data practitioners who are already members of the lakeFS Slack group!

Chapters:

At lakeFS, we’ve created a series of articles, with each one delving into a unique aspect of data version control.

Data Version Control: What Is It and How Does It Work?

This guide will provide you with a comprehensive overview of data version control, explaining what it is, how it functions, and why it’s essential for all data practitioners.

Learn more in our guide to Data Version Control.

Data Version Control Best Practices

This article outlines a range of best practices for data version control, designed to help teams maintain the highest quality and validity of data throughout every stage of the process.

Learn more in our detailed guide to Data Version Control Best Practices.

Best Data Version Control Tools

As datasets grow and become more complex, data version control tools become essential in managing changes, preventing inconsistencies, and maintaining accuracy. This article introduces five leading solutions that practitioners can rely on to handle these daily challenges.

Learn more in our detailed guide to Best Data Version Control Tools.

