
The massive increase in generated data presents a serious challenge to organizations looking to unlock value from their data sets. Data practitioners have to deal with many consequences of the huge data volume, including manageability and collaboration.

This is where data versioning can help. Data version control is crucial because it allows data teams to work faster while lowering the cost of errors. 

Data versioning is especially important to teams working on data science research projects that call for running multiple experiments on data. Letting several people experiment on data in isolation or easily reproducing data and code are key capabilities today. 

This article dives into a feature that enables teams to implement data versioning locally with the open-source tool lakeFS: lakectl local. 

The value of using lakeFS in data science and research

Data versioning plays a critical role in data science projects that integrate hundreds of distinct data sets, as well as in the ongoing refinement of the data processes that applications rely on to carry out tasks. A good example is machine learning algorithms that discover new data patterns or predict output values from input variables.

Data versioning is a must for a research team that constantly integrates new data sources and algorithms and needs to experiment on and test them quickly.

For example, Enigma used lakeFS branches to address the isolation problem easily. Each developer and researcher can create a distinct data branch that contains a complete snapshot of the production data (at no extra storage cost). They can make changes and assess their impact on the final data set without worrying about interfering with someone else’s work or contaminating the production data.

Here are three powerful use cases of data version control for data science and research teams.

Parallel experimentation

Some projects – in particular, machine learning model creation – are based on dynamic and iterative processes that involve testing various elements: data versions, transformations, algorithms, and hyperparameter settings.

To make the most of such an iterative strategy, teams must run tests in a timely, easily traceable, and repeatable manner. Localizing model data during development brings benefits to the entire process, accelerating it via interactive and offline development and reducing data access latency.

Local data availability also makes it easy to connect data version control systems with source control systems such as Git. This link is essential for achieving model reproducibility, which allows for a more efficient and collaborative model development environment.

Reproducibility of data and code

Data changes rapidly, making it challenging for teams to maintain an accurate record of its state over time.

Organizations often keep only one state of their data: the present state. Exposing a Git-like interface to data enables tracking of more than just the present status of the data. 

Reproducibility expands on this by allowing teams to time travel between different versions of the data. You can take snapshots of the data at different points in time and at different stages of modification, and you can create branches to test new versions against the same input data.

The end result is repeatable, atomic, and versioned data lake operations, which lead to improved data management.
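As a rough illustration, here is how that time travel looks with the lakectl CLI; the repository name, prefix, and commit ID below are placeholders rather than part of this article’s example:

# Read the dataset as it exists today on the main branch
lakectl fs ls lakefs://example-repo/main/datasets/

# Read the same prefix as it existed at an earlier commit (time travel)
lakectl fs ls lakefs://example-repo/<commit-id>/datasets/

# Branch off that commit to test a new version against the same input data
lakectl branch create lakefs://example-repo/experiment --source lakefs://example-repo/<commit-id>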

What is lakectl local in lakeFS?

lakeFS introduces Git-like techniques into the world of data, so the simplest way to describe lakectl local (local checkouts) is to use the Git analogy.

When you get a pull request from a fork or branch of your repository, you can merge it locally to resolve a merge conflict or to test and validate the changes before merging them back to the main branch.
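In plain Git terms, that workflow looks roughly like this (the remote and branch names are illustrative):

# Fetch the contributor's branch and check it out locally
git fetch origin feature-branch
git checkout feature-branch

# Merge main into it to surface and resolve conflicts before merging back
git merge main

lakectl local brings the same pattern to data: you check out lakeFS-managed data to your machine, work on it locally, and sync the changes back to the remote branch.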

How to use lakectl local in lakeFS

The local command of the lakeFS CLI, lakectl, enables you to work with lakeFS data locally: it lets you transfer lakeFS data into a directory on any system, synchronize local directories with remote lakeFS locations, and integrate lakeFS with Git.

Once you clone data stored in lakeFS to your machine, you can tie it to the Git version of the code that uses it and build reproducible local workflows that scale well and are simple to use.
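At the time of writing, lakectl local exposes a handful of subcommands; the one-line descriptions below are our own summary rather than official reference text:

lakectl local clone     # clone a lakeFS path into a new local directory
lakectl local init      # connect an existing local directory to a lakeFS path
lakectl local list      # show directories synced with lakeFS and the commits they track
lakectl local status    # show local changes compared to the synced lakeFS commit
lakectl local pull      # fetch the latest data from lakeFS into the local directory
lakectl local commit    # upload local changes and commit them to the lakeFS branch
lakectl local checkout  # reset local data to the lakeFS version the directory tracks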

Check out our documentation on how to work with lakeFS data locally.

Practical example of using lakectl local with Git

In this example, we’re going to develop an ML model that predicts whether or not an image is an alpaca. Our goal? Improving the input for the model.

The code for the model is versioned using Git, while the model dataset is versioned with the help of lakeFS. 

We’ll be using lakectl local to tie code versions to data versions to achieve model reproducibility, which is essential for machine learning projects!

Let’s get started!

Setup

To get started, we initialize a Git repo called is_alpaca, which includes the model code:

[Screenshot: the is_alpaca Git repository]

We created a lakeFS repository and uploaded the is_alpaca training dataset from Kaggle into it:

[Screenshot: the is-alpaca lakeFS repository]

Create an Isolated Environment for Experiments

Our objective is to improve the model’s predictions. To do that, we will experiment with modifying the training dataset, carrying out the tests in isolation and changing nothing until we’re confident that the data quality has improved and the data is ready.

Let’s create a new lakeFS branch called experiment-1. Our is_alpaca dataset is accessible from that branch, and we will only interact with data from that branch.

[Screenshot: the experiment-1 branch in lakeFS]

On the code side, we will create a Git branch called experiment-1 to avoid polluting our main branch with a dataset that is being tuned.
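In practice, both branches take one command each to create; the sketch below assumes the lakeFS repository’s default branch is named main:

# Data branch in lakeFS: a zero-copy snapshot of the source branch
lakectl branch create lakefs://is-alpaca/experiment-1 --source lakefs://is-alpaca/main

# Matching code branch in Git
git checkout -b experiment-1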

Clone lakeFS Data into a Local Git Repository

Inspecting the train.py script, we can see that it expects its input data in the input directory.

import tensorflow as tf

input_location = './input'
model_location = './models/is_alpaca.h5'

# Load the training and validation splits from the local input directory
def get_ds(subset):
     return tf.keras.utils.image_dataset_from_directory(
          input_location, validation_split=0.2, subset=subset,
          seed=123, image_size=(244, 244), batch_size=32)

train_ds = get_ds("training")
val_ds = get_ds("validation")

# A simple convolutional classifier with two output classes (alpaca / not alpaca)
model = tf.keras.Sequential([
     tf.keras.layers.Rescaling(1./255),
     tf.keras.layers.Conv2D(32, 3, activation='relu'),
     tf.keras.layers.MaxPooling2D(),
     tf.keras.layers.Conv2D(32, 3, activation='relu'),
     tf.keras.layers.MaxPooling2D(),
     tf.keras.layers.Conv2D(32, 3, activation='relu'),
     tf.keras.layers.MaxPooling2D(),
     tf.keras.layers.Flatten(),
     tf.keras.layers.Dense(128, activation='relu'),
     tf.keras.layers.Dense(2)])

# Fit and save
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(train_ds, validation_data=val_ds, epochs=3)
model.save(model_location)

This means that in order to train and experiment with our model locally, we must have access to the lakeFS-managed is_alpaca dataset at that path. To do this, we’ll use the lakectl local clone command from our local Git repository root:

lakectl local clone lakefs://is-alpaca/experiment-1/dataset/train/ input

This command compares our local input directory (which did not exist until now) to the specified lakeFS path and determines whether there are files to be downloaded from lakeFS.

Successfully cloned lakefs://is-alpaca/experiment-1/dataset/train/ to ~/ml_models/is_alpaca/input

Clone Summary:

Downloaded: 250
Uploaded: 0
Removed: 0

Running lakectl local list from our Git repository root will show that the input directory is now in sync with a lakeFS prefix (Remote URI), and which lakeFS version of the data is being tracked (Synced Commit):

is_alpaca % lakectl local list                 
+-----------+------------------------------------------------+------------------------------------------------------------------+
| DIRECTORY | REMOTE URI                                     | SYNCED COMMIT                                                    |
+-----------+------------------------------------------------+------------------------------------------------------------------+
| input     | lakefs://is-alpaca/experiment-1/dataset/train/ | 589f87704418c6bac80c5a6fc1b52c245af347b9ad1ea8d06597e4437fae4ca3 |
+-----------+------------------------------------------------+------------------------------------------------------------------+

Tie Code Version and Data Version

Now, let’s tell Git to stage the dataset we’ve cloned and check our Git branch status:

is_alpaca % git add input/
is_alpaca % git status 
On branch experiment-1
Changes to be committed:
     (use "git restore --staged <file>..." to unstage)
          new file:   input/.lakefs_ref.yaml

Changes not staged for commit:
     (use "git add <file>..." to update what will be committed)
     (use "git restore <file>..." to discard changes in working directory)
          modified:   .gitignore

We can see that the .gitignore file has changed, and the files we cloned from lakeFS into the input directory are not tracked by Git. This is intentional; keep in mind that lakeFS manages the data. But wait, what is the special input/.lakefs_ref.yaml file that Git keeps track of?

is_alpaca % cat input/.lakefs_ref.yaml

src: lakefs://is-alpaca/experiment-1/dataset/train/
at_head: 589f87704418c6bac80c5a6fc1b52c245af347b9ad1ea8d06597e4437fae4ca3

This file includes the lakeFS version of the data that the Git repository is currently pointing to. Let’s commit the changes to Git with:

git commit -m "added is_alpaca dataset"

By committing to Git, we tie the current code version of the model to the dataset version in lakeFS as it appears in input/.lakefs_ref.yaml.
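As a quick sanity check of that link, inspecting the commit we just made should list input/.lakefs_ref.yaml among its files (output omitted here):

is_alpaca % git show --stat HEAD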

Experiment and Version Results

We executed the train script on the cloned input, which produced a model. Now we’ll use the model to determine whether an axolotl is an alpaca. Here are the surprising results:

is_alpaca % ./predict.py ~/axolotl1.jpeg
{'alpaca': 0.32112, 'not alpaca': 0.07260383}

We expected the model to give a clearer prediction, so let’s try to improve it. To do this, we will add more photos of axolotls to the model’s input directory:

is_alpaca % cp ~/axolotls_images/* input/not_alpaca

To check what changes we made to our dataset, we will use lakectl local status:

is_alpaca % lakectl local status input 
diff 'local:///ml_models/is_alpaca/input' <--> 'lakefs://is-alpaca/589f87704418c6bac80c5a6fc1b52c245af347b9ad1ea8d06597e4437fae4ca3/dataset/train/'...
diff 'lakefs://is-alpaca/589f87704418c6bac80c5a6fc1b52c245af347b9ad1ea8d06597e4437fae4ca3/dataset/train/' <--> 'lakefs://is-alpaca/experiment-1/dataset/train/'...

╔════════╦════════╦════════════════════════════╗
║ SOURCE ║ CHANGE ║ PATH                       ║
╠════════╬════════╬════════════════════════════╣
║ local  ║ added  ║ not_alpaca/axolotl2.jpeg ║
║ local  ║ added  ║ not_alpaca/axolotl3.png  ║
║ local  ║ added  ║ not_alpaca/axolotl4.jpeg ║
╚════════╩════════╩════════════════════════════╝

At this point, the dataset changes aren’t tracked by lakeFS yet. We can validate that by looking at the uncommitted changes area of our experiment branch in the lakeFS UI and verifying that it is empty.
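The same check can be done from the CLI by diffing the branch against its own committed state; with nothing uncommitted on the lakeFS branch, the output should be empty (a hedged example, since we verified this via the UI):

is_alpaca % lakectl diff lakefs://is-alpaca/experiment-1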

To commit these changes to lakeFS we will use lakectl local commit:

is_alpaca % lakectl local commit input -m "add images of axolotls to the training dataset"

Getting branch: experiment-1

diff 'local:///ml_models/is_alpaca/input' <--> 'lakefs://is-alpaca/589f87704418c6bac80c5a6fc1b52c245af347b9ad1ea8d06597e4437fae4ca3/dataset/train/'...
upload not_alpaca/axolotl3.png              ... done! [5.04KB in 679ms]
upload not_alpaca/axolotl2.jpeg             ... done! [38.31KB in 685ms]
upload not_alpaca/axolotl4.jpeg             ... done! [7.70KB in 718ms]

Sync Summary:

Downloaded: 0
Uploaded: 3
Removed: 0

Finished syncing changes. Perform commit on branch...
Commit for branch "experiment-1" completed.

ID: 0b376f01b925a075851bbaffacf104a80de04a43ed7e56054bf54c42d2c8cce6
Message: add images of axolotls to the training dataset
Timestamp: 2024-02-08 17:41:20 +0200 IST
Parents: 589f87704418c6bac80c5a6fc1b52c245af347b9ad1ea8d06597e4437fae4ca3

Looking at the lakeFS UI, we can see that the lakeFS commit includes metadata recording the code version of the associated Git repository at the moment of the commit.

[Screenshot: Git metadata attached to the lakeFS commit]

Inspecting the Git repository, we can see that input/.lakefs_ref.yaml now points to the latest lakeFS commit:

0b376f01b925a075851bbaffacf104a80de04a43ed7e56054bf54c42d2c8cce6
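Since input/.lakefs_ref.yaml has changed, committing it to Git keeps the new data version tied to the code (the commit message below is ours):

is_alpaca % git add input/.lakefs_ref.yaml
is_alpaca % git commit -m "train on dataset with additional axolotl images"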

We will now re-train our model using the changed dataset and attempt to identify whether an axolotl is an alpaca:

is_alpaca % ./predict.py ~/axolotl1.jpeg
{'alpaca': 0.12443, 'not alpaca': 0.47260383}

The results are indeed more accurate.

Reproduce Model Results

What if we wanted to rerun the model that indicated an axolotl was more likely to be an alpaca? That question translates to: “How do I roll back my code and data to the point before we optimized the training dataset?” In other words: “What was the Git commit ID at that point?”

Searching our Git log, we find this commit:

commit 5403ec29903942b692aabef404598b8dd3577f8a

     added is_alpaca dataset

So, all we have to do now is git checkout 5403ec29903942b692aabef404598b8dd3577f8a and we are good to reproduce the model results!
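Checking out that Git commit restores input/.lakefs_ref.yaml to the earlier lakeFS reference; to reset the data in the input directory to that version as well, one option is lakectl local checkout (a sketch, assuming the local directory holds no changes we want to keep):

is_alpaca % git checkout 5403ec29903942b692aabef404598b8dd3577f8a
is_alpaca % lakectl local checkout input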

Check out our article about ML Data Version Control and Reproducibility at Scale for another example of how lakeFS and Git work seamlessly together.
