Idan Novogroder, Author

Idan has an extensive background in software and DevOps engineering....

Last updated on April 26, 2024

The larger your data, the less of it you can work with on a single machine. lakeFS addresses this by letting you efficiently manage large-scale data that is stored remotely.

Besides managing massive datasets, lakeFS lets you perform partial checkouts so you can work locally with just the parts of the data you need. Two of the most relevant use cases are developing machine learning models and optimizing GPU utilization.

Keep reading to learn more about the use cases of lakectl local and see how it works in lakeFS with a practical example.

What is lakectl local?

lakeFS brings Git-like mechanisms to the world of data, so the easiest way to explain lakectl local (local checkouts) is with an analogy to Git.

When someone sends you a pull request from a fork or branch of your repository, you may check it out locally to resolve a merge conflict or to test and validate the changes before merging them. lakectl local gives you the same workflow for data: you bring part of a lakeFS repository to your machine, work on it there, and sync your changes back.

Use cases for lakectl local

Local development of machine learning models

Machine learning model development is a dynamic and iterative process that involves experimenting with different data versions, transformations, algorithms, and hyperparameter values. 

To get the most out of this iterative approach, teams must run experiments in a timely, easy-to-track, and reproducible manner. Keeping model data local during development helps: it allows interactive and offline work and lowers data access latency, which speeds up the whole process.

Local data availability is necessary to smoothly integrate data version control systems with source control systems like Git. This connection is critical for attaining model reproducibility, which enables a more efficient and collaborative model development environment.

Optimize GPU utilization through data locality during training

Deep learning models rely on high-end GPUs for compute. When running such workloads, the goal is to keep the GPUs busy rather than idle, which makes the most of the hardware and keeps costs down.

Many deep learning tasks read images, often the same ones several times. Keeping the data local avoids repeated round trips to remote storage, which can result in significant cost savings.
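To make the round-trip argument concrete, here is a minimal sketch in Python. It is not lakeFS code: `CountingRemote` and `LocalCopy` are hypothetical stand-ins for remote object storage and a local copy of the data, and the dataset of four images read over three epochs is made up for illustration.

```python
from typing import Callable, Dict, List


class CountingRemote:
    """Hypothetical stand-in for remote storage that counts its reads."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects
        self.reads = 0

    def read(self, key: str) -> bytes:
        self.reads += 1  # each call models one network round trip
        return self._objects[key]


class LocalCopy:
    """Fetch each object from the remote once, then serve it locally."""

    def __init__(self, remote: CountingRemote) -> None:
        self._remote = remote
        self._local: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        if key not in self._local:
            self._local[key] = self._remote.read(key)
        return self._local[key]


def run_epochs(read: Callable[[str], bytes], keys: List[str], epochs: int) -> None:
    """Read every sample once per epoch, as a training loop would."""
    for _ in range(epochs):
        for key in keys:
            read(key)


# Made-up dataset: 4 images, read over 3 epochs.
images = {f"img_{i}.jpg": b"\x00" * 16 for i in range(4)}
keys = list(images)

# Reading straight from remote storage: one round trip per read.
direct = CountingRemote(images)
run_epochs(direct.read, keys, epochs=3)

# Reading through a local copy: one round trip per distinct object.
cached_remote = CountingRemote(images)
local = LocalCopy(cached_remote)
run_epochs(local.read, keys, epochs=3)

print(direct.reads, cached_remote.reads)  # 12 4
```

Without a local copy, the number of remote reads grows with the number of epochs. With one, it stays at the number of distinct objects.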

How to use lakeFS locally

The local command of lakectl, the lakeFS CLI, lets you work with lakeFS data locally. It enables copying lakeFS data into a directory on any system, synchronizing local directories with remote lakeFS locations, and integrating lakeFS with Git.

With this command, you clone data stored in lakeFS to any machine. You can then track in Git which data versions you are using and build reproducible local workflows that scale well and are easy to use.

Example of lakectl local

Imagine that you’re a lakeFS user and you’ve got several data sets stored in your lakeFS repository. Now you want to work locally with just one subset of it: the data in the OrionStar folder.

First things first, you need to clone this folder to your local machine:

$ lakectl local clone lakefs://quickstart/main/data/OrionStar/ orion 
download QTR1_2007.csv                   ... done! [867B in 0s]
download PRICES.csv                      ... done! [1.29KB in 0s]
download ORDER_FACT_UPDATES.csv          ... done! [1.44KB in 0s]
download QTR2_2007.csv                   ... done! [1.67KB in 0s]
download CUSTOMER.csv                    ... done! [6.86KB in 0s]
[…]

Successfully cloned lakefs://quickstart/main/data/OrionStar/ to /Users/rmoff/work/orion.

Clone Summary:

Downloaded: 14
Uploaded: 0
Removed: 0

Just like a cloned Git repository, the local folder keeps references to the main branch of the quickstart repository. This means that you can sync subsequent changes as well as work with the data locally.
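The source address passed to the clone command has the form lakefs://<repository>/<ref>/<path>, where the ref can be a branch name such as main or a commit ID. As a small illustration, the sketch below splits such an address into its parts. `parse_lakefs_uri` and `LakeFSUri` are hypothetical helper names, not part of lakectl or any lakeFS SDK.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class LakeFSUri(NamedTuple):
    repository: str
    ref: str   # branch name or commit ID
    path: str  # path inside the repository; may be empty


def parse_lakefs_uri(uri: str) -> LakeFSUri:
    """Split lakefs://<repository>/<ref>/<path> into its three parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "lakefs" or not parsed.netloc:
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    # parsed.path looks like "/main/data/OrionStar/"; the first segment is the ref.
    ref, _, path = parsed.path.lstrip("/").partition("/")
    if not ref:
        raise ValueError(f"URI has no ref: {uri!r}")
    return LakeFSUri(repository=parsed.netloc, ref=ref, path=path)


uri = parse_lakefs_uri("lakefs://quickstart/main/data/OrionStar/")
print(uri.repository, uri.ref, uri.path)  # quickstart main data/OrionStar/
```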

With the data available locally, you can use whatever tool you want. Here’s an example of a DuckDB query:

🟡◗ SELECT c.customer_name, c.customer_address, co.product_name, co.total_retail_price
> FROM   READ_CSV_AUTO('orion/CUSTOMER_ORDERS.csv') co
>        INNER JOIN
>        READ_CSV_AUTO('orion/CUSTOMER.csv') c ON co.customer_name=c.customer_name;
┌─────────────────────┬───────────────────────────────────┬────────────────────────────────────────┬────────────────────┐
│    Customer_Name    │         Customer_Address          │              Product_Name              │ Total_Retail_Price │
│       varchar       │              varchar              │                varchar                 │      varchar       │
├─────────────────────┼───────────────────────────────────┼────────────────────────────────────────┼────────────────────┤
│ Kyndal Hooks        │ 252 Clay St                       │ Kids Sweat Round Neck,Large Logo       │ $69.40             │
│ Kyndal Hooks        │ 252 Clay St                       │ Fleece Cuff Pant Kid'S                 │ $14.30             │
│ Dericka Pockran     │ 131 Franklin St                   │ Children's Mitten                      │ $37.80             │
│ Wendell Summersby   │ 9 Angourie Court                  │ Bozeman Rain & Storm Set               │ $39.40             │
│ Sandrina Stephano   │ 6468 Cog Hill Ct                  │ Teen Profleece w/Zipper                │ $52.50             │

Naturally, you could do something more complex here, such as training a machine learning model or scoring the data before pushing it back to the repository.
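Because the cloned files are ordinary files on disk, any local tool can read them. As a rough sketch, the same join can be written with Python's standard csv module. So that it runs on its own, the example writes two small stand-in files to a temporary folder instead of reading the real orion directory. The rows are copied from the query output above, but the column layout of the two files is an assumption based on that output, not taken from the actual dataset.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence


def join_orders_to_customers(folder: Path) -> List[Dict[str, str]]:
    """Inner-join CUSTOMER_ORDERS.csv to CUSTOMER.csv on Customer_Name."""
    with open(folder / "CUSTOMER.csv", newline="") as f:
        # Assumes customer names are unique in CUSTOMER.csv.
        address_by_name = {
            row["Customer_Name"]: row["Customer_Address"]
            for row in csv.DictReader(f)
        }

    joined: List[Dict[str, str]] = []
    with open(folder / "CUSTOMER_ORDERS.csv", newline="") as f:
        for order in csv.DictReader(f):
            name = order["Customer_Name"]
            if name in address_by_name:  # inner join: skip unmatched orders
                joined.append({
                    "Customer_Name": name,
                    "Customer_Address": address_by_name[name],
                    "Product_Name": order["Product_Name"],
                    "Total_Retail_Price": order["Total_Retail_Price"],
                })
    return joined


def write_csv(path: Path, header: Sequence[str], rows: Sequence[Sequence[str]]) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Stand-in files; with a real clone you would pass Path("orion") instead.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    write_csv(folder / "CUSTOMER.csv",
              ["Customer_Name", "Customer_Address"],
              [["Kyndal Hooks", "252 Clay St"],
               ["Dericka Pockran", "131 Franklin St"]])
    write_csv(folder / "CUSTOMER_ORDERS.csv",
              ["Customer_Name", "Product_Name", "Total_Retail_Price"],
              [["Kyndal Hooks", "Kids Sweat Round Neck,Large Logo", "$69.40"],
               ["Kyndal Hooks", "Fleece Cuff Pant Kid'S", "$14.30"],
               ["Dericka Pockran", "Children's Mitten", "$37.80"]])
    rows = join_orders_to_customers(folder)

for row in rows:
    print(row)
```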

The example above is a little special, since the web UI of lakeFS has DuckDB embedded and lets you work with the data directly.

Let’s continue with the local example. Go into the lakeFS repository and you’ll find a data dictionary that isn’t in your local copy yet.

To get this and other changes in this path into your local copy, run a pull command:

$ lakectl local pull

diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/0b51ece0d7c39904c20054617165bbc5acc05b5d79b40ff2a1364cb9f15579d7/data/OrionStar/'...

download 00_data_dictionary.pdf ... done! [129.17KB in 1ms]

Successfully synced changes!

Pull Summary:

Downloaded: 1
Uploaded: 0
Removed: 0

What if you’d like to track a branch other than the main one you initially cloned? You can use checkout for that. Specify the local folder (here just a full stop, since you’re running the command from inside it) and reference the branch name:

$ lakectl local checkout . --ref dev

diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/8e0e62cb7c20d016eba74abd2510bf8fcaac94563fe860fb4259626bed7e66a2/data/OrionStar/'...
diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/7eaeafd3fb90df419ca26eca69ba2b153eae13bd3dccf80ba47cbed1260cb0c1/data/OrionStar/'...

delete local: PRICES.csv ... done! [%0 in 0s]

Checkout Summary:

Downloaded: 0
Uploaded: 0
Removed: 1

Look at the output of the command. You’ll see that PRICES.csv was deleted from the local folder because it doesn’t exist on the dev branch. Let’s remove a few more files locally and add one more (RATINGS.csv) just to see the two-way sync in action:

$ cp ../RATINGS.csv .

$ rm -v EMPLOYEE_*
EMPLOYEE_ADDRESSES.csv
EMPLOYEE_DONATIONS.csv
EMPLOYEE_ORGANIZATION.csv
EMPLOYEE_PAYROLL.csv  

$ lakectl local status

[…]

╔════════╦═════════╦═══════════════════════════╗
║ SOURCE ║ CHANGE  ║ PATH                      ║
╠════════╬═════════╬═══════════════════════════╣
║ local  ║ removed ║ EMPLOYEE_ADDRESSES.csv    ║
║ local  ║ removed ║ EMPLOYEE_DONATIONS.csv    ║
║ local  ║ removed ║ EMPLOYEE_ORGANIZATION.csv ║
║ local  ║ removed ║ EMPLOYEE_PAYROLL.csv      ║
║ local  ║ added   ║ RATINGS.csv               ║
╚════════╩═════════╩═══════════════════════════╝
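Conceptually, a status report like this is a comparison between what was last synced and what is in the folder now. The sketch below illustrates that idea only and is not how lakectl is implemented. `local_changes` is a hypothetical helper, and the fingerprint strings are placeholders for whatever is used to tell two versions of a file apart.

```python
from typing import Dict, List, Tuple


def local_changes(synced: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two {path: fingerprint} listings and report what changed locally."""
    changes: List[Tuple[str, str]] = []
    for path in sorted(synced.keys() | current.keys()):
        if path not in current:
            changes.append(("removed", path))
        elif path not in synced:
            changes.append(("added", path))
        elif synced[path] != current[path]:
            changes.append(("modified", path))
    return changes


# File names are from the example above; fingerprints are placeholders.
last_synced = {
    "CUSTOMER.csv": "a1",
    "EMPLOYEE_ADDRESSES.csv": "b2",
    "EMPLOYEE_DONATIONS.csv": "c3",
    "EMPLOYEE_ORGANIZATION.csv": "d4",
    "EMPLOYEE_PAYROLL.csv": "e5",
}
on_disk_now = {
    "CUSTOMER.csv": "a1",
    "RATINGS.csv": "f6",
}

changes = local_changes(last_synced, on_disk_now)
for change, path in changes:
    print(f"local  {change:<8} {path}")
```

Unchanged files such as CUSTOMER.csv produce no entry, which is why only the four removals and the one addition show up.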

Now let’s commit the change:

$ lakectl local commit . -m "Remove employee data and add ratings information" --pre-sign=false

Getting branch: main

diff 'local:///Users/rmoff/work/orion' <--> 'lakefs://quickstart/7eaeafd3fb90df419ca26eca69ba2b153eae13bd3dccf80ba47cbed1260cb0c1/data/OrionStar/'...
diff 'lakefs://quickstart/38df1480d835fe52dc6db31b87ef35b78403d0c9c621c74a6d032c686551abc5/data/OrionStar/' <--> 'lakefs://quickstart/main/data/OrionStar/'...

delete remote path: EMPLOYEE_ORGANIZATI~ ... done! [%0 in 8ms]
delete remote path: EMPLOYEE_ADDRESSES.~ ... done! [%0 in 8ms]
delete remote path: EMPLOYEE_DONATIONS.~ ... done! [%0 in 8ms]
delete remote path: EMPLOYEE_PAYROLL.csv ... done! [%0 in 8ms]
upload RATINGS.csv                       ... done! [6.86KB in 9ms]

Wrap up

The local command of lakectl, the lakeFS CLI, lets you work with lakeFS data locally. You can copy lakeFS data into a directory on any system, synchronize local directories with remote lakeFS locations, and integrate lakeFS with Git.

Check out this documentation page to learn more about lakectl local.
