lakeFS lets you create isolated environments for experimentation or development.

Let’s say we want to build a machine learning model and need to prepare or clean some data. With lakeFS, we can do this in isolation without creating an entire copy of the dataset.

Let’s see how it works in practice.

Creating a new branch with zero copy

We start with a lakeFS repository we created beforehand. Before touching any of the datasets in it, we will create our environment: a branch called oz-exp1.

Creating the branch is nearly instant. Since it’s a metadata-only operation, no data gets copied. Right after creation, main and oz-exp1 both point to the same commit ID.
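To make the "metadata operation" idea concrete, here is a minimal conceptual sketch in Python. It is a toy model, not lakeFS's actual implementation: the `ToyRepo` class, the commit ID `c1`, and the storage address are all hypothetical. It only illustrates that creating a branch copies a pointer to a commit, not the objects the commit refers to.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy model of zero-copy branching (conceptual; not lakeFS internals)."""

    # commit ID -> mapping of object path to a storage address
    commits: dict = field(default_factory=dict)
    # branch name -> commit ID the branch currently points to
    branches: dict = field(default_factory=dict)

    def create_branch(self, name: str, source: str) -> str:
        """Create a branch by copying the source branch's commit pointer only."""
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        self.branches[name] = self.branches[source]
        return self.branches[name]


# Hypothetical repository with a single commit on main.
repo = ToyRepo(
    commits={"c1": {"stanford_dogs/img1.jpg": "s3://bucket/obj-001"}},
    branches={"main": "c1"},
)

new_head = repo.create_branch("oz-exp1", source="main")
same_commit = repo.branches["main"] == repo.branches["oz-exp1"]
print(new_head, same_commit)
```

In this model the cost of branching does not depend on how many objects the commit contains, which is the property the article relies on.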

Let’s go to our branch. We can do anything we want to the data on the branch without affecting the dataset on main.

For this example, we’ll use the stanford_dogs dataset on the branch, which contains images of different dog breeds.

Building an image classifier

Let’s say we want to build an image classifier based on this dataset.

The first thing we need to do is copy the URI, which contains both the path and the name of the branch we’re working on. We want to use the data locally on our machine, but we don’t want to copy or download the entire dataset. That would dramatically increase our storage costs.
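As an illustration of what that URI carries, the sketch below splits a lakeFS URI of the form `lakefs://<repository>/<ref>/<path>` into its parts using only the standard library. The repository name `example-repo` is a hypothetical placeholder, and `parse_lakefs_uri` is a helper written for this example, not part of any lakeFS tool.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class LakeFSUri(NamedTuple):
    repository: str
    ref: str   # branch name, tag, or commit ID
    path: str  # object path or prefix inside the repository


def parse_lakefs_uri(uri: str) -> LakeFSUri:
    """Split lakefs://<repository>/<ref>/<path> into its components."""
    parsed = urlparse(uri)
    if parsed.scheme != "lakefs" or not parsed.netloc:
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    ref, _, path = parsed.path.lstrip("/").partition("/")
    if not ref:
        raise ValueError(f"URI has no branch or ref: {uri!r}")
    return LakeFSUri(parsed.netloc, ref, path)


# "example-repo" is an assumed name; the branch and dataset come from the article.
parts = parse_lakefs_uri("lakefs://example-repo/oz-exp1/stanford_dogs/")
print(parts)
```

Because the branch name is part of the URI, anything that consumes the URI works against the experiment branch rather than main.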

So we’re going to use the everest command line tool from lakeFS and mount the URI onto a local directory called dogs in write mode.
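The mount step might look roughly like the sketch below, which only builds the command line and prints it. It assumes the `everest mount <uri> <directory>` form with a `--write-mode` flag; verify the exact spelling with `everest mount --help` for your version. The repository name `example-repo` and the helper `build_mount_command` are hypothetical.

```python
import shlex


def build_mount_command(uri: str, mount_dir: str, write_mode: bool = True) -> list:
    """Build (but do not run) an everest mount command line.

    The --write-mode flag is an assumption about the CLI; check the
    installed version's help output before relying on it.
    """
    cmd = ["everest", "mount", uri, mount_dir]
    if write_mode:
        cmd.append("--write-mode")
    return cmd


mount_cmd = build_mount_command(
    "lakefs://example-repo/oz-exp1/stanford_dogs/",  # assumed repository name
    "./dogs",
)
print(shlex.join(mount_cmd))

# To actually mount (requires everest installed and configured):
# import subprocess; subprocess.run(mount_cmd, check=True)
```

Building the argument list separately from running it keeps the example safe to execute on a machine without everest installed.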

After a second, we have a dogs folder here with all the images that we just saw of all the different dog breeds.

But everest didn’t actually copy all this data over to our machine. It presents the data as if it were a local directory, and files are downloaded from the remote object store only as we read them.
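The lazy-loading behavior can be pictured with a small toy class. This is a conceptual sketch only and says nothing about how everest is implemented: `LazyMount`, its in-memory "remote" dictionary, and the file names are all hypothetical. The point it illustrates is that listing uses metadata alone, while content is fetched on first read and reused afterwards.

```python
class LazyMount:
    """Toy stand-in for a lazily loaded mount (not everest's implementation)."""

    def __init__(self, remote: dict):
        self._remote = remote  # path -> bytes; pretends to be the object store
        self._cache = {}       # path -> bytes that were already downloaded
        self.downloads = 0     # number of simulated downloads so far

    def listdir(self) -> list:
        # Listing needs only metadata, so nothing is downloaded here.
        return sorted(self._remote)

    def read(self, path: str) -> bytes:
        # Fetch on first access, then serve from the local cache.
        if path not in self._cache:
            self._cache[path] = self._remote[path]
            self.downloads += 1
        return self._cache[path]


mount = LazyMount({"Chihuahua/a.jpg": b"\xff\xd8", "Beagle/b.jpg": b"\xff\xd8"})
names = mount.listdir()
downloads_after_listing = mount.downloads

mount.read("Beagle/b.jpg")
mount.read("Beagle/b.jpg")  # second read is served from the cache
downloads_after_reads = mount.downloads
print(names, downloads_after_listing, downloads_after_reads)
```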

So what can we do next?

Let’s say that we want to clean the dataset by removing all the chihuahua images from the folder. We can remove them on our branch through the mounted directory and commit the change with everest.
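A cleaning step like this could be scripted as in the sketch below. It assumes the mounted folder has one subfolder per breed with names such as `n02085620-Chihuahua`, which is an assumption about the dataset layout, and it runs against a temporary directory standing in for the real `./dogs` mount. The final `everest commit <directory> -m <message>` form is likewise an assumption to check against the CLI help.

```python
import shutil
import tempfile
from pathlib import Path


def remove_breed(mount_dir: Path, breed: str) -> list:
    """Delete every class folder whose name contains `breed` (case-insensitive)."""
    removed = []
    for folder in sorted(p for p in mount_dir.iterdir() if p.is_dir()):
        if breed.lower() in folder.name.lower():
            shutil.rmtree(folder)
            removed.append(folder.name)
    return removed


# Demo on a throwaway directory standing in for the mounted ./dogs folder.
with tempfile.TemporaryDirectory() as tmp:
    dogs = Path(tmp)
    for name in ("n02085620-Chihuahua", "n02088364-beagle"):  # assumed layout
        (dogs / name).mkdir()
        (dogs / name / "example.jpg").write_bytes(b"")
    removed = remove_breed(dogs, "chihuahua")
    remaining = sorted(p.name for p in dogs.iterdir())

print(removed, remaining)

# Hypothetical follow-up on a real mount: commit the deletion to the branch.
commit_cmd = ["everest", "commit", "./dogs", "-m", "Remove chihuahua images"]
```

Only the experiment branch records the deletion once it is committed; nothing in this step touches main.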


There will be no chihuahuas in our branch, but if we switch to main, we’ll see that it still contains those chihuahuas.

Result: Cleaning data in full isolation

What we’ve essentially done is clean our data in isolation. If we now train a model on this data, we can tie it back to the exact commit we used and see precisely what data it contains, including the modifications we made to the original dataset to get there.
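One simple way to keep that link is to store the commit ID next to the model's metadata. The sketch below is an assumption about how you might do this, not a lakeFS feature: the model name, the repository name `example-repo`, and the commit ID `abc123` are hypothetical placeholders. It relies on the fact that a lakeFS URI can use a commit ID as its ref, so the recorded URI always resolves to the same data.

```python
import json


def training_record(model_name: str, repository: str, branch: str, commit_id: str) -> dict:
    """Describe which exact data snapshot a model was trained on."""
    return {
        "model": model_name,
        "branch": branch,
        "commit_id": commit_id,
        # Pinning the URI to the commit ID, not the branch, keeps it immutable.
        "data_uri": f"lakefs://{repository}/{commit_id}/stanford_dogs/",
    }


# All values below are placeholders for illustration.
record = training_record("dog-classifier-v1", "example-repo", "oz-exp1", "abc123")
print(json.dumps(record, indent=2))
```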

Wrap up

lakeFS provides a powerful capability to create isolated environments for data experimentation and development without duplicating large datasets – thanks to its zero-copy branching mechanism.

Users can modify and process data in isolation, avoiding the high storage costs of traditional data copying. At the same time, every change is recorded as a commit, so data cleaning is traceable and reversible. This provides a solid foundation for model training while keeping the original dataset intact.

Preprocessing Data Locally with Zero Copy Using lakeFS