Iddo Avneri

Last updated on December 12, 2024

When working with large datasets stored in object stores (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage in the cloud, or MinIO and Dell ECS on-premises), users often need to work with that data locally. Data scientists and engineers may prefer local work for several reasons: they might want to stay close to their GPU resources for faster data processing, or they may simply find it easier to develop and experiment in a familiar local environment. Working locally also avoids the need to interact with potentially unfamiliar cloud interfaces. Whatever the reason, local access to data stored in object stores is a common requirement.

lakeFS offers two primary ways for users to work with data locally:

  1. lakeFS CLI (utilizing the lakectl local command)
  2. lakeFS Mount

In this post, I’ll explore both methods, explain their usage, and provide a comparison of their key features.

1. Working Locally with lakectl local

What is lakectl local?

lakectl local is a feature that lets you check out data from a lakeFS branch or commit into a local directory. You can then work with the data locally, make changes, and later commit those changes back to the lakeFS repository. It provides full read/write capability, meaning you can modify the data offline and push the changes back to lakeFS when ready.

This method essentially creates a local copy of the data from the object store, allowing you to work as though the data resides locally on your file system.

How It Works:

  • Checkout Data Locally: Using the lakectl local command, you can check out a specific branch, commit, or tag from a lakeFS repository into a local directory. This pulls the relevant data down from the object store to your machine.
  • Modify Data Locally: Once the data is checked out, you can make changes to the data locally. Since it provides full read/write access, you can modify files, delete or add new data, and process the data as needed.
  • Push Changes Back: After making changes, you can use the lakeFS CLI to commit those changes back to lakeFS, ensuring version control and keeping the object store data up to date.

Example Workflow:

  1. Checkout data locally:
    lakectl local clone <path URI> [directory]
    This command clones lakeFS data from a path into an empty local directory and initializes the directory.
  2. Work with the data locally:
    Once checked out, you can interact with the data as if it were local, performing any processing or modifications.
  3. Check the status of your changes:
    lakectl local status
    This command shows remote and local changes to the directory and the remote location it tracks.
  4. Commit changes:
    lakectl local commit [directory]
    This command uploads your local changes to lakeFS and creates a commit on the tracked branch, keeping the remote repository in sync with your local directory.
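Put together, the steps above can be sketched as a short script. This is a minimal sketch only: it assumes lakectl is already installed and configured, and the repository URI (lakefs://example-repo/main/datasets/) and directory name are hypothetical examples, not real values from this post.

```shell
# Sketch of the full lakectl local cycle; repo, branch, and paths are
# made-up examples, and lakectl is assumed to be configured already.
WORKDIR=./my-dataset
mkdir -p "$WORKDIR"   # clone expects an empty local directory

# 1. Clone a branch prefix from lakeFS into the local directory
lakectl local clone lakefs://example-repo/main/datasets/ "$WORKDIR" || true

# 2. ...modify files in $WORKDIR with your usual local tools...

# 3. Compare the local directory against the remote location it tracks
lakectl local status "$WORKDIR" || true

# 4. Commit local changes back to the branch in lakeFS
lakectl local commit "$WORKDIR" -m "Update datasets after local processing" || true
```

The `|| true` guards simply keep the sketch from aborting if lakectl is unavailable; in a real pipeline you would let failures propagate.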

Learn more about lakectl local.

2. Working Locally with lakeFS Mount

What is lakeFS Mount?

lakeFS Mount allows you to mount a lakeFS repository as a file system on your local machine, giving you access to data stored in the object store as if it were part of your local file system. This lets you interact with lakeFS-versioned datasets as though they were local, with some key differences in how the data is handled. Currently, lakeFS Mount is read-only: you can explore and read data, but you cannot make changes.

The data remains in the object store, and lakeFS handles the retrieval of data efficiently through a two-step process involving metadata fetching and local caching.

How It Works:

  1. Metadata Fetching: When you mount a lakeFS repository, only the metadata of the files is fetched and made available locally. This means that the file structure, sizes, and other metadata are presented to you right away, without actually downloading the file contents. This gives you immediate access to view the directory structure and file listings, but not the actual data itself at this stage.
  2. Lazy Loading & Caching: The actual file contents are only retrieved from the object store when you attempt to read or access the files. Once accessed, the file is copied locally and cached to minimize repeated retrieval from the object store for future reads. This allows for efficient and seamless browsing of large datasets without needing to download all the data upfront.

    The caching mechanism ensures that files that have already been accessed are stored locally, reducing latency for future reads of the same data.
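The metadata-first, fetch-on-read behavior can be illustrated with a small, self-contained Python sketch. This models the idea only; it is not the actual lakeFS Mount implementation, and all class and variable names here are invented for illustration.

```python
# Sketch of the metadata-first, lazy-load-and-cache pattern described
# above (simplified model, not lakeFS Mount's real implementation).
from dataclasses import dataclass, field

@dataclass
class LazyObjectStore:
    remote: dict                                 # pretend object store: path -> bytes
    cache: dict = field(default_factory=dict)    # locally cached contents
    fetches: int = 0                             # how many remote downloads occurred

    def list_metadata(self):
        # Step 1: only names and sizes are fetched up front, no contents
        return {path: len(data) for path, data in self.remote.items()}

    def read(self, path):
        # Step 2: contents are downloaded (and cached) only on first access
        if path not in self.cache:
            self.fetches += 1
            self.cache[path] = self.remote[path]
        return self.cache[path]

store = LazyObjectStore({"data/a.csv": b"1,2,3", "data/b.csv": b"4,5,6"})
print(store.list_metadata())  # browsing the listing triggers no downloads
print(store.fetches)          # 0
store.read("data/a.csv")
store.read("data/a.csv")      # second read is served from the local cache
print(store.fetches)          # 1
```

Listing the mount is cheap because only metadata moves; the download cost is paid once per file, on first read.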

Supported Protocols:

There are two supported protocols for mounting data locally:

  • NFSv3 (Network File System) is supported on macOS.
  • FUSE (Filesystem in Userspace) is supported on Linux, without requiring root access.

Explore documentation to learn more about lakeFS Mount.
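As a rough, hypothetical sketch of what mounting looks like in practice: the Mount client ships as its own binary (named everest in lakeFS Enterprise), but the exact command names, flags, and URI below are assumptions that may differ by version, and the repository and paths are made-up examples.

```shell
# Hypothetical sketch of mounting a lakeFS branch locally; command
# shape and all names here are assumptions, not verbatim from the post.
MOUNTPOINT=./mounted-data
mkdir -p "$MOUNTPOINT"

# Mount a branch prefix read-only (fetches metadata only at this point)
everest mount lakefs://example-repo/main/datasets/ "$MOUNTPOINT" || true

# Listing is instant; file contents are fetched and cached on first read
ls "$MOUNTPOINT"

# Unmount when done
everest umount "$MOUNTPOINT" || true
```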

Comparison: lakeFS CLI Local vs. lakeFS Mount

| Feature | lakectl local | lakeFS Mount |
| --- | --- | --- |
| Usage Pattern | Check out data for local offline use. | Mount data for read-only access. |
| Data Location | Data is copied locally to your machine. | Metadata is fetched initially; files are cached locally after access. |
| Read/Write Capability | Full read/write access. | Read-only access (for now). |
| Performance | Best for small to medium datasets that can be fully checked out. | Ideal for large datasets, leveraging lazy loading and local caching. |

When to Use Each?

  • Use lakeFS CLI Local if you need to work with a snapshot of the data, modify it offline, and then commit changes back to lakeFS. This is useful when you want to minimize access to object stores and work on a version of the data locally.
  • Use lakeFS Mount when you need to work with the data continuously without downloading it, or when working with large datasets that wouldn’t fit comfortably on your local storage. This method provides real-time access without the need to stage data locally first.

By leveraging these two methods, lakeFS provides flexibility for working with object store data locally, whether you’re performing large-scale reads or making fine-tuned modifications.
