lakeFS Community
Amit Kesarwani

September 30, 2023

Introduction

In the ever-evolving landscape of machine learning (ML), data stands as the cornerstone upon which triumphant models are built. However, as ML projects expand and encompass larger and more complex datasets, the challenge of efficiently managing and controlling data at scale becomes more pronounced.

The conventional approach used by data scientists, and the constraints that come with it, are as follows:

  • The Copy/Paste Predicament: In the world of data science, it’s commonplace for data scientists to extract subsets of data to their local environments for model training. This method allows for iterative experimentation, but it introduces challenges that hinder the seamless evolution of ML projects.
  • Reproducibility Constraints: Traditional practices of copying and modifying data locally lack the version control and auditability crucial for reproducibility. Iterating on models with various data subsets becomes a daunting task.
  • Inefficient Data Transfer: Regularly shuttling data between the central repository and local environments strains resources and time, especially when choosing different subsets of data for each training run.
  • Limited Compute Power: Operating within a local environment hampers the ability to harness the full power of parallel computing, as well as the distributed prowess of systems like Apache Spark.

In this guide, we will explain how to tackle these challenges by using data version control. We will demonstrate:

  • How to use Git and lakeFS, both open-source tools, to version control your code and data when working locally.
  • How to use lakeFS to train your model at scale in a distributed computing environment without copying data. lakeFS takes a zero-copy approach (zero clone copies), in which different versions of your ML datasets and models are managed efficiently without duplicating the data.

We will also delve into the power of parallel ML, that is, running experiments in parallel with different parameters, and explore how lakeFS and Databricks together can streamline your ML experiments and your ML data pipeline. By combining lakeFS with Databricks, a distributed computing environment for Spark, you can run parallel ML without compromising on performance or scalability. We will illustrate how to scale parallel ML from a small local instance to a full GPU/CPU cluster. Databricks is not a requirement; you can run Spark in whatever way you like best.

We will also integrate with MLflow to provide full experiment tracking and model logging.

Throughout this article, we will provide a step-by-step guide, accompanied by a Jupyter notebook, to demonstrate how these tools work together to enhance your ML workflows. Whether you’re a data scientist, ML engineer, or AI enthusiast, this blog will equip you with the knowledge and tools needed to leverage parallel ML effectively and accelerate your model development process.

Target Architecture

Target Architecture: Accelerating Deep Learning with PyTorch Lightning
Source (Databricks Blog): Accelerating Your Deep Learning with PyTorch Lightning on Databricks

We will store our data using the open-source Linux Foundation project Delta Lake. Under the hood, Delta Lake stores the raw data in the Parquet format.

We rely on the PyTorch-based library segmentation_models.pytorch for the Deep Learning model. PyTorch Lightning is a great way to simplify your PyTorch code and bootstrap your Deep Learning workloads.

Our data will be stored as a Delta table and available as a Spark dataframe. However, PyTorch is expecting a specific type of data. We need a library to create a dataset in PyTorch format and manage the caching from blob storage to local SSD. For that, we will use Petastorm. So, Petastorm takes on the data loading duties and provides the interface between the Lakehouse and the Deep Learning model.

MLflow will provide experiment tracking, log our experiment metrics automatically and let us save the model to our model registry.

Demo Notebook

You will run the demo notebook either in a local Docker container or on the Databricks cluster. This picture explains the full process:

Demo Notebook
  1. Create a lakeFS and Git repository.
  2. Create an Experiment branch.
  3. Import selected images from object storage (S3 in this case) to the Experiment branch. This will be a zero-copy operation and will take a few seconds.
  4. If you run the demo locally, you will clone the Experiment branch to your machine, which downloads the images. A smaller dataset is recommended for local runs.
  5. The demo will build a data pipeline with the Medallion Architecture:
    1. Convert raw images and image masks to Delta table format and save as Bronze dataset.
    2. Resize images, transform image masks into images and save as Silver dataset.
    3. Join images and image masks. Save it as a Gold dataset.
    4. You will commit all these datasets in the lakeFS repository for version control and data lineage purposes.
  6. Split Gold dataset into train and test datasets.
  7. Prepare the dataset in PyTorch format by using Petastorm.
  8. Train the base model. If running locally then train the model once with particular parameters (architecture, encoder and learning rate). If running on the distributed cluster then you will fine-tune hyperparameters with Hyperopt.
  9. Save the best model to the MLflow registry and save the best model information in the lakeFS repository.
  10. Flag the best model version as production-ready in MLflow.
  11. Save the demo notebooks (code) to the Git repo. Git will not add the local images to the staging area, but it will add the “.lakefs_ref.yaml” file, which includes the lakeFS commit information.
  12. Optionally, merge the best model information and datasets into the main production branch in lakeFS. This step is not part of the Demo Notebook.

Demo Prerequisites

  • Docker installed on your local machine
  • This demo requires connecting to a lakeFS Server. You can spin up lakeFS Server for free on the lakeFS cloud.
  • Optionally, a Databricks cluster if you want to run the demo on a distributed Spark cluster. This is not required.

Step 1: Demo Setup

You will use a prepackaged environment (a Docker container) that includes Python, Jupyter notebook, Git, Spark and MLflow. You can read more about the demo in this Git repository.

Clone the repo:

git clone https://github.com/treeverse/lakeFS-samples
cd lakeFS-samples/01_standalone_examples/image-segmentation

And build the Docker image and run the container:

docker build -t lakefs-image-segmentation-demo .

docker run -d -p 8889:8888 -p 4041:4040 -p 5001:5000 --user root -e GRANT_SUDO=yes -v $PWD:/home/jovyan -v $PWD/jupyter_notebook_config.py:/home/jovyan/.jupyter/jupyter_notebook_config.py --name lakefs-image-segmentation-demo lakefs-image-segmentation-demo

NOTES:

  • The first time you build the Docker image, it might take up to 20–30 minutes to come up, depending on the dependencies. The second time, it will take a few seconds.
  • The Docker image is around 10GB, so make sure your Docker environment has enough virtual disk space available.
  • If any of the port numbers (8889, 4041 and 5001) used in the above command are already in use then change the port numbers to any available ports.
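Before starting the container, it can help to check which of these host ports are already taken. The following is a minimal sketch that uses only the Python standard library; the helper name `port_is_free` is our own and is not part of the demo.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepts the connection
        return sock.connect_ex((host, port)) != 0


# Host ports used by the "docker run" command in this guide
for port in (8889, 4041, 5001):
    print(port, "free" if port_is_free(port) else "in use")
```

If a port is reported as in use, change the left-hand side of the matching `-p host:container` mapping in the `docker run` command.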

Once you run the Docker container, open Jupyter UI http://127.0.0.1:8889/ in your web browser. You will be using the `Image Segmentation` notebook throughout this demo (in the Jupyter UI). The notebook starts with configuring and setting up the environment.

Step 2: Notebook Config

Open the `Image Segmentation` notebook in the Jupyter UI and complete the following Config steps in the notebook:

1. Change lakeFS Cloud endpoint and credentials:

lakeFS Cloud endpoint will be in this format: ‘https://username.aws_region_name.lakefscloud.io’

If you don’t have lakeFS access and secret keys, log in to lakeFS and click on Administration -> Create Access Key

Change lakeFS Cloud endpoint and credentials

2. Storage Information:

Since you are going to create a repository in the demo, you will need to change the storage namespace to point to a unique path. If you are using your own bucket, insert the path to your bucket. If you are using our bucket in lakeFS Cloud, you will want to create a repository in a subdirectory of the sample repository that was automatically created for you.

For example, if you log in to your lakeFS Cloud and see:

Storage Information

Add a subdirectory to the existing path (in this case, s3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/). i.e. insert:

storageNamespace = 's3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/image-segmentation-repo/'
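To avoid slash mistakes when appending the subdirectory, the namespace can be assembled by a small helper. This is only a sketch: the function `build_storage_namespace` and the bucket and prefix in the usage line are hypothetical and are not part of the demo notebook.

```python
def build_storage_namespace(base_path: str, subdirectory: str) -> str:
    """Append a repository-specific subdirectory to an S3 base path."""
    if not base_path.startswith("s3://"):
        raise ValueError("expected a path that starts with s3://")
    sub = subdirectory.strip("/")
    if not sub:
        raise ValueError("subdirectory must not be empty")
    # Normalize so there is exactly one slash between parts and one at the end
    return f"{base_path.rstrip('/')}/{sub}/"


# Hypothetical bucket and prefix, for illustration only
storageNamespace = build_storage_namespace(
    "s3://example-bucket/sample-prefix/", "image-segmentation-repo")
print(storageNamespace)
```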

3. Local vs. Distributed Computing:

If you are running this demo in the local Docker container then the `localOrDistributedComputing` variable will be “LOCAL”. But if you are running this demo on a distributed Spark cluster like Databricks then the `localOrDistributedComputing` variable will be “DISTRIBUTED”.

4. Number of images to use for each experiment:

The demo dataset has about 200K images. Choose how many images to use for each experiment: we recommend a small number for local computing (e.g. 100-500) and a larger number for distributed computing (e.g. 1000-10000). The demo will randomly select this number of images from the demo dataset.

5. Download Demo Dataset:

You will need a dataset to work with. Download the dataset for the “Airbus Ship Detection Challenge” from Kaggle and upload to the “airbus-ship-detection” folder in your S3 bucket. This demo dataset is around 31GB in size.

lakeFS must have permissions to list and read the objects in this S3 bucket, and this S3 bucket must be in the same region as the S3 bucket used by lakeFS Cloud.

Change the “bucketName” and “awsRegion” variables in the notebook:

bucketName = '<S3 Bucket Name>'
awsRegion = '<AWS Region>'
prefix = "airbus-ship-detection/"

6. AWS credentials:

Provide your AWS credentials to access the demo dataset.
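The mode and image-count settings from the Config steps above can be sanity-checked before running the rest of the notebook. The sketch below encodes the recommended ranges from this guide; the function `check_config` is our own illustration and does not exist in the demo.

```python
# Recommended image counts per mode, taken from the Config steps above
RECOMMENDED_IMAGES = {
    "LOCAL": (100, 500),
    "DISTRIBUTED": (1000, 10000),
}


def check_config(mode: str, images_per_experiment: int) -> list:
    """Return a list of warnings; raise ValueError for invalid settings."""
    if mode not in RECOMMENDED_IMAGES:
        raise ValueError(f"mode must be one of {sorted(RECOMMENDED_IMAGES)}")
    if images_per_experiment < 1:
        raise ValueError("images_per_experiment must be positive")
    low, high = RECOMMENDED_IMAGES[mode]
    warnings = []
    if not low <= images_per_experiment <= high:
        warnings.append(
            f"{images_per_experiment} images is outside the recommended "
            f"range {low}-{high} for {mode} computing")
    return warnings


print(check_config("LOCAL", 200))
print(check_config("LOCAL", 5000))
```

The ranges are recommendations rather than hard limits, which is why values outside them produce a warning instead of an error.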

Step 3: Notebook Setup

  1. Run the Setup process: Run the “ImageSegmentationSetup” notebook as is. This setup notebook imports the required Python libraries, creates the lakeFS repo, and defines helper functions and the code for the ML model.
  2. Create the Git repo: If you are running locally, you will create an empty Git repository. Git will version control your code while lakeFS will version control your data.

if localOrDistributedComputing == "LOCAL":
    !git init {repo_name}

Now, you will notice the “image-segmentation-repo” folder in the Jupyter File Browser on the left side panel:

"image-segmentation-repo" folder in the Jupyter File Browser

You can now run the notebook step by step and view the changes locally as well as on your lakeFS server.

Step 4: Experimentation Branch 

You will create a separate branch in the lakeFS repo for running your experiments. If you are running locally then you will also create a branch in the Git repo.

If you want to run the experiment multiple times then you can just increment the branch number variable (experimentBranchN) and create a new experiment branch:

experimentBranchN = experimentBranch+"-1"

if localOrDistributedComputing == "LOCAL":
    !cd {repo_name} && git checkout -b {experimentBranchN}

lakefs.branches.create_branch(
    repository=repo_name,
    branch_creation=BranchCreation(
        name=experimentBranchN,
        source=emptyBranch))
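If you expect to repeat the experiment several times, the branch names can be generated instead of edited by hand. This is a small sketch; the helper `experiment_branch` is hypothetical, and the base name `LOCAL-experiment` is inferred from the branch name `LOCAL-experiment-1` that appears later in this guide.

```python
def experiment_branch(base: str, run_number: int) -> str:
    """Build the branch name for the given experiment run (1-based)."""
    if run_number < 1:
        raise ValueError("run_number starts at 1")
    return f"{base}-{run_number}"


experimentBranch = "LOCAL-experiment"  # assumed base name
branch_names = [experiment_branch(experimentBranch, n) for n in range(1, 4)]
print(branch_names)
```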

Step 5: Zero Copy Import 

Next you will get the list of around 200K images from S3 in the demo dataset and you will randomly select a few images (number of images as set in “imagesPerExperiment” variable) from this dataset. In a real life scenario, you might filter the list based on certain labels or somebody might provide you a list of images to use for your experiments.

file_list = list_images() 
file_list_random = random.choices(file_list, k=imagesPerExperiment)
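Note that `random.choices` draws with replacement, so the same image can be picked more than once and the number of distinct images may be lower than `imagesPerExperiment`. If you want distinct images, `random.sample` is an alternative. The sketch below uses a made-up file list in place of `list_images()` and a fixed seed; both are our additions for illustration.

```python
import random

# Stand-in for list_images(); the real list comes from S3
file_list = [f"img_{i:04d}.jpg" for i in range(1000)]
imagesPerExperiment = 100

rng = random.Random(42)  # fixed seed so the selection can be repeated
# sample() draws without replacement, so every selected image is distinct
file_list_random = rng.sample(
    file_list, k=min(imagesPerExperiment, len(file_list)))

print(len(file_list_random), len(set(file_list_random)))
```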

You will import randomly selected images from S3 to the lakeFS repository. It is worth noting that the import is a zero copy operation i.e. none of the data will be actually copied over. However, you will be able to access the data and version it going forward with lakeFS.

Once you run the code below:

import_images(file_list_random)

Log into lakeFS and view the content of the repository for the experiment branch:

view the content of the repository for the experiment branch
Image sample of experiment branch

Step 6: Clone Experiment Branch Locally

If you are running locally, you will clone the experiment branch to your machine with the lakeFS CLI command “lakectl local”. To learn more about the “lakectl local” commands, read the section “The best of both worlds: Introducing local checkouts with lakeFS” of the blog post “Scalable Data Version Control – Getting the Best of Both Worlds with lakeFS”.

The “lakectl local clone” command will download the images locally:

lakectl local clone lakefs://{repo.id}/{experimentBranchN}/ {repo_path}

You will notice the “image-segmentation-repo/lakefs_local” folder in the Jupyter File Browser on the left side panel:

"image-segmentation-repo/lakefs_local"

You can browse the files inside this folder:

Browse files in this folder

Step 7: Git + lakeFS Together

By using “lakectl local” commands, you can “clone” data stored in lakeFS to any machine, track which versions you were using in Git, and create reproducible local workflows that both scale very well and are easy to use.

When you clone a lakeFS branch locally in a directory inside a Git repo by using “lakectl local clone” command, it includes that directory in the “.gitignore” file so Git will ignore the data folder. Your “.gitignore” file will look like this:

# ignored by lakectl local:
lakefs_local/*
!lakefs_local/.lakefs_ref.yaml

Git will, however, version control the “.lakefs_ref.yaml” file created and updated by the “lakectl local” commands. This file includes the lakeFS source/branch information and the commit ID. This way, the code and the commit information about the data are kept together in the Git repo.

“.lakefs_ref.yaml” file looks like this:

src: lakefs://image-segmentation-repo/LOCAL-experiment-1/
at_head: a2129ac1ccbd6c305254c57bf227593a3285ff2cea712169e7b9ac57d56994ba
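Because the file records the commit ID, it can be turned into a lakeFS URI that points at the exact data version rather than at the moving branch head. The sketch below parses the two-line file shown above with plain string handling; it assumes the file stays a flat list of `key: value` lines, and the helper names are our own.

```python
def read_lakefs_ref(text: str) -> dict:
    """Parse flat 'key: value' lines such as those in .lakefs_ref.yaml."""
    ref = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        ref[key.strip()] = value.strip()
    return ref


def pinned_uri(ref: dict) -> str:
    """Replace the branch in 'src' with the commit ID from 'at_head'."""
    prefix = "lakefs://"
    if not ref["src"].startswith(prefix):
        raise ValueError("src is not a lakefs:// URI")
    repo, _, tail = ref["src"][len(prefix):].partition("/")
    _branch, _, path = tail.partition("/")
    return f"{prefix}{repo}/{ref['at_head']}/{path}"


sample = """src: lakefs://image-segmentation-repo/LOCAL-experiment-1/
at_head: a2129ac1ccbd6c305254c57bf227593a3285ff2cea712169e7b9ac57d56994ba
"""
ref = read_lakefs_ref(sample)
print(pinned_uri(ref))
```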

Step 8: Build the Data Pipeline

In the next few notebook cells, you will use the Medallion Architecture to transform your data and get it ready for ML training:

  • You will ingest the raw images as a bronze dataset and save it as a Delta table. You will then commit the bronze dataset to the lakeFS repository and tag it. A tag gives a meaningful name to a specific commit and can be used later to pull the data.
  • You will enrich the bronze dataset, save the result as a silver dataset, commit it to the lakeFS repository and tag it.
  • You will load the raw image masks as a bronze dataset, save it as a Delta table, commit it and tag it.
  • You will transform the masks into images as a silver dataset, save it as a Delta table, commit it and tag it.
  • You will join the images and image masks and save the result as a Gold dataset.
  • You will split the Gold dataset into train and test datasets.
  • You will prepare the dataset in PyTorch format by using Petastorm library.
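The join and split steps above can be illustrated in plain Python. The demo itself does this with Spark and Delta tables; the sketch below only shows the logic on in-memory dictionaries, and the function names, the 25% test fraction and the seed are assumptions made for the example.

```python
import random


def join_images_and_masks(images: dict, masks: dict) -> list:
    """Inner join on image ID: keep only images that have a mask."""
    return [
        {"image_id": image_id,
         "image": images[image_id],
         "mask": masks[image_id]}
        for image_id in sorted(images.keys() & masks.keys())
    ]


def train_test_split(rows: list, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle with a fixed seed so the split can be reproduced."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the silver image and mask datasets
images = {f"img{i}": f"<pixels {i}>" for i in range(10)}
masks = {f"img{i}": f"<mask {i}>" for i in range(2, 12)}

gold = join_images_and_masks(images, masks)
train, test = train_test_split(gold, test_fraction=0.25, seed=7)
print(len(gold), len(train), len(test))
```

Fixing the seed serves the same purpose as committing the datasets to lakeFS: a later run can recover exactly the same train and test sets.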

Step 9: Train the Base Model: Local Experiment

Up to this point, you created a lakeFS & Git repository and imported data from S3 into lakeFS repository within a Local Experiment branch. Now with the data imported & transformed, you can run multiple experiments locally with slightly different parameters or you can create multiple local experiment branches with different training datasets/images.

In the next notebook cell, you will set the parameters for the local experiment and train the model once:

Train the Base Model: Local Experiment

In the next few cells, you will save the best model/run information to the MLflow registry as well as to the lakeFS repository. If you are satisfied with your best model run, you can flag the best model version as production-ready; otherwise, skip this cell.

Step 10: Commit “Code+Data” to Git Repository

You will copy the notebooks (code) to the Git repo and run the “git add” command to add the changes in the working directory to the staging area. Git will not add the data files to the staging area, but it will add the notebooks/code and the “.lakefs_ref.yaml” file, which includes the lakeFS commit information:

Commit “Code+Data” to Git Repository

Step 11: Start Local MLflow Server

Click on the link in the notebook to open the “start-mlflow-ui” notebook, run its first cell to start the MLflow server, and then click on the link in that notebook to go to the MLflow UI:

Start Local MLflow Server

Step 12: Review lakeFS Commits and MLflow Experiment Logs

Go back to the “Image Segmentation” notebook and run its last cell to generate a hyperlink to the Commits page in lakeFS. Then click on the generated link:

Review lakeFS Commits and MLflow Experiment Logs

On the Commits page, you will see the metrics and information for the best run (which can be used later for references):

Review lakeFS Commits

Click on the “Open Registered Model UI” button on the Commits page and it will take you straight to the registered model in the MLflow UI:

Open Registered Model UI

Now click on the link for the “Source Run” to see datasets, parameters, metrics, tags, and model artifacts including pickle file in MLflow:

Source Run

If you expand the “Tags” section, it will show you the lakeFS repo name, branch name and dataset URL:

Expand the "Tags" section

Copy the URL for the “lakefs_dataset” tag and open it in a new browser tab. It will take you straight to the Gold dataset used to train this best run:

Copy the URL for the “lakefs_dataset” tag

Go to the Branches tab in the lakeFS UI and select the LOCAL-experiment-1 branch:

Go to the Branches tab in the lakeFS UI

You will see that lakeFS versioned the datasets (raw, bronze, silver and gold) and information on the best model:

lakeFS versioned the datasets

Click on the best_model.txt file. It contains the information for the best run, metrics, parameters and tags:

best_model.txt

So, you versioned:

  • Code along with lakeFS commit info in Git
  • Datasets (raw, bronze, silver and gold) and information on the best model in lakeFS
  • Model along with info on lakeFS dataset in MLflow

In other words, you versioned everything (Code+Data+Model) in an easily reproducible way.

Step 13: Distributed Computing Environment

You can run multiple experiments in parallel with larger training datasets in a distributed computing environment. You can use Databricks for this or you can run your own Spark cluster.

If you use Databricks, run the cluster with the following configuration:

  • Databricks Runtime Version: 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12)
  • Python Libraries: lakefs-client, pytorch-lightning==1.5.4, segmentation-models-pytorch==0.3.3
  • lakeFS Library on Maven: io.lakefs:hadoop-lakefs-assembly:0.1.12

Also, refer to Databricks and lakeFS Integration: Step-by-Step Configuration Tutorial to review all steps needed to configure lakeFS on Databricks.

Step 14: Parallel ML Experiments

Import the “Image Segmentation” and “ImageSegmentationSetup” notebooks to Databricks. Open the “Image Segmentation” notebook in the Databricks UI and change the following variable to DISTRIBUTED in the “Config” section:

localOrDistributedComputing = "DISTRIBUTED"

Change the number of images for each experiment to 1000-10000 or more, depending on your Databricks cluster size:

imagesPerExperiment = 1000

Run the other cells as you did for the local experiments. When training the model, however, you will fine-tune the hyperparameters with Hyperopt. You can also increase the “max_evals” parameter, which controls how many parameter combinations are evaluated (search for “max_evals” in the “Image Segmentation” notebook).
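To make the role of “max_evals” concrete, here is a minimal sketch of a hyperparameter search over the three parameters mentioned in this guide (architecture, encoder and learning rate). It is a plain random search written with the standard library and does not reproduce Hyperopt's own search algorithms. The candidate values and the `toy_objective` function are placeholders chosen for illustration; a real objective would train the model and return a loss.

```python
import math
import random

# Illustrative search space; the values are assumptions, not the demo's
SEARCH_SPACE = {
    "architecture": ["Unet", "FPN"],
    "encoder": ["resnet34", "resnet50"],
    "learning_rate": (1e-4, 1e-2),  # sampled log-uniformly
}


def sample_params(rng: random.Random) -> dict:
    low, high = SEARCH_SPACE["learning_rate"]
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return {
        "architecture": rng.choice(SEARCH_SPACE["architecture"]),
        "encoder": rng.choice(SEARCH_SPACE["encoder"]),
        "learning_rate": 10 ** exponent,
    }


def random_search(objective, max_evals: int, seed: int = 0):
    """Evaluate max_evals random parameter sets and keep the lowest loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_evals):
        params = sample_params(rng)
        trials.append((objective(params), params))
    best_loss, best_params = min(trials, key=lambda trial: trial[0])
    return best_params, best_loss, trials


def toy_objective(params: dict) -> float:
    # Placeholder loss that is smallest when the learning rate is 1e-3
    return abs(math.log10(params["learning_rate"]) + 3)


best_params, best_loss, trials = random_search(toy_objective, max_evals=8)
print(best_params, round(best_loss, 3))
```

Each trial is independent of the others, which is what allows a distributed cluster to evaluate several of them in parallel.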

Step 15: Comparing and Evaluating Results

Once the experiments have completed, you can compare them and promote the best one to production.

The following code in the notebook selects the best model based on the Intersection over Union (IoU) metric:

# Pick the finished run with the highest value of metrics.valid_per_image_iou
best_model = mlflow.search_runs(
    filter_string='attributes.status = "FINISHED" and tags.lakefs_demos = "image_segmentation"',
    order_by=["metrics.valid_per_image_iou DESC"],
    max_results=1).iloc[0]
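For readers unfamiliar with the metric, IoU is the number of pixels marked in both the predicted and the true mask divided by the number of pixels marked in either. The sketch below computes it for binary masks given as nested lists and averages it per image. How the notebook computes `valid_per_image_iou` is not shown in this guide, so treat this as an illustration of the general definition; returning 1.0 when both masks are empty is a convention we assume here.

```python
def iou(pred, target) -> float:
    """Intersection over Union for two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match
    return 1.0 if union == 0 else intersection / union


def mean_per_image_iou(preds, targets) -> float:
    """Average the IoU over a batch of (prediction, target) mask pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


pred_a = [[1, 1], [0, 0]]
true_a = [[1, 0], [1, 0]]  # 1 shared pixel, 3 pixels in the union
pred_b = [[0, 1], [0, 1]]
true_b = [[0, 1], [0, 1]]  # identical masks

print(iou(pred_a, true_a))
print(mean_per_image_iou([pred_a, pred_b], [true_a, true_b]))
```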

OPTIONAL: If you would like to promote the best model and its associated training dataset to the production branch in lakeFS, run a merge (via code or the UI). The merge promotes the data from the experiment branch into the main branch:

lakefs.refs.merge_into_branch(repository=repo.id,
                              source_ref="branch name with best model",
                              destination_branch="main")

Once again, no data is being duplicated in the underlying S3 buckets. However, now you have a single production branch with reference commits/tags, which includes the entire data set, the configuration and the performance of the model.
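The idea that import, branching and merging move pointers rather than data can be pictured with a toy model. The class below is purely conceptual: it is not the lakeFS API or its actual data model, and the bucket name and file names are made up. Each branch is simply a mapping from a logical path to the address of an object that stays where it is in S3.

```python
class ToyRepo:
    """Toy model: branches map logical paths to physical object addresses."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name: str, source: str) -> None:
        # Copies only the path -> address mapping, never the objects
        self.branches[name] = dict(self.branches[source])

    def import_objects(self, branch: str, addresses: list, prefix: str) -> None:
        # Registers existing objects under logical paths without reading them
        for address in addresses:
            filename = address.rsplit("/", 1)[-1]
            self.branches[branch][prefix + filename] = address

    def merge(self, source: str, destination: str) -> None:
        # Simplified merge: the destination takes over the source's pointers
        self.branches[destination].update(self.branches[source])


repo = ToyRepo()
repo.create_branch("experiment-1", source="main")
repo.import_objects(
    "experiment-1",
    ["s3://example-bucket/airbus-ship-detection/a.jpg",
     "s3://example-bucket/airbus-ship-detection/b.jpg"],
    prefix="raw/")
repo.merge("experiment-1", "main")
print(repo.branches["main"])
```

In this picture, both branches end up referring to the same two objects, and no image bytes were ever copied.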

Summary

You have built an end-to-end pipeline that incrementally imports the dataset, cleans it and version-controls it for reproducibility, and you have trained a Deep Learning model at scale. The model is now ready for deployment and production-grade usage.

lakeFS accelerates your team and simplifies the version control process for the ML use cases:

  • Unique zero-copy import of datasets, with different versions available to everyone
  • Support for both structured and unstructured datasets
  • Model training locally as well as at scale in a distributed computing environment
  • Integration with Git and other tools such as MLflow
  • Security and compliance covered all along, from data security (RBAC) to data lineage

Want to learn more?

If you have questions about lakeFS, then drop us a line at hello@treeverse.io or join the discussion on lakeFS’ Slack channel.
