Last updated on July 3, 2024

I am happy to share that lakeFS Mount is now available (currently released in private preview). lakeFS Mount allows for mounting a repository (or a specific path within one) as a local filesystem. 

What is a mount? 

A filesystem mount is the ability to present a local device or a remote location as a local directory. It is a basic feature provided by all operating systems and is widely used by system admins and developers.

Mount for object storage

Mounting an object store location as a POSIX directory isn’t a novel concept (see S3 Mountpoint, gcs-fuse and blobfuse). These tools provide an abstraction layer on top of the object store that allows it to appear as a directory on the machine. Reading and writing objects behaves exactly like reading and writing files from a local drive. There are several reasons why this is beneficial:

  1. Ease of integration: Typically, interacting with an object store requires an SDK and custom code to handle network calls, authentication and configuration. Reading and writing local files, on the other hand, is ubiquitous and supported by pretty much any framework, tool or language.
  2. Compatibility: Developers can implement their logic once against a local directory, then swap that directory out for a mounted object store when required (see the sketch after this list).
  3. Separation of concerns: A data scientist can focus on business logic rather than I/O scalability, while software developers and operators can take that logic and run it at larger scale by simply replacing the input directory with a mount.
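To make the compatibility point concrete, here's a minimal sketch (the directory paths and function name are made up for illustration): the same plain file I/O works whether it points at a local folder or at a mounted object store location.

import os

def load_documents(data_dir: str) -> list[str]:
    # Read every .txt file in a directory using plain local I/O – no SDK required
    docs = []
    for name in sorted(os.listdir(data_dir)):
        if name.endswith(".txt"):
            with open(os.path.join(data_dir, name)) as f:
                docs.append(f.read())
    return docs

# During development, point at a local folder:
docs = load_documents("./sample_data")

# Later, point at a mounted object store path – the code itself is unchanged:
docs = load_documents("/mnt/my-bucket/datasets")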

However, while these benefits are real, existing object storage mount solutions often fall short on performance and consistency, especially in machine learning and deep learning environments. In this article, we'll review why that is, and how lakeFS, as a data version control system, is well positioned to provide a performant and consistent object storage mount.

Why mounting object storage typically leads to poor performance

Applications (and libraries) expect file system metadata operations to be very cheap

Here’s a quick example: see this small bit of code, commonly found in ML applications:

import tensorflow as tf

dataset = tf.data.Dataset.list_files("/path/*.txt")

Let’s run this through strace to see what this does:

strace --summary -f -e stat  -- python3 ./load_dataset.py
...
[pid 274851] +++ exited with 0 +++
+++ exited with 0 +++
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    2.048897          31     64344           stat
------ ----------- ----------- --------- --------- ----------------
100.00    2.048897          31     64344           total

Looking at the summary above, we see that our one line of Python actually triggered 64,344 filesystem operations just to look up metadata.

Now, imagine every such operation has to be translated into an HTTP call with ~10-50 ms of latency (vs the current average of 31 microseconds). This could substantially slow down our script even before we read a single byte of actual data.

Some implementations (such as S3 Mountpoint) get around this by allowing the user to sacrifice some consistency for performance, caching metadata responses.

This alleviates some of that latency, but not all of it – many applications will still attempt to stat each file (or readdir every directory), expecting the operation to be very cheap. Caching helps by doing this only once per file, but every file encountered for the first time still incurs an unavoidable round trip.

(Deep learning) applications will typically read the same file many, many times

Training deep neural networks requires passing over the same input multiple times as the network is trained, so training is often bottlenecked on I/O rather than on those expensive GPUs. In practice, this translates to the same files being accessed thousands of times in relatively quick succession. Feeding the GPU optimally depends not only on overall throughput (at which object stores excel) but also on latency.

Especially with smaller files, object stores are notorious for a relatively long time to first byte (TTFB), sometimes reaching dozens of milliseconds. While this is fine for smaller datasets, at high throughput we care about latency – remember Little's law, which states:

L=λW

Where L is the number of requests that are being processed, λ is the client QPS (query per second) and W is the average latency for a client request to be processed.

From this, we can deduce that in order to achieve high throughput, we really should care about our average latency, and aim to reduce it.
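As a quick back-of-the-envelope illustration (the target request rate below is an assumption; the two latencies are the 31 microseconds measured above and a ~30 ms object store round trip as a representative value), Little's law tells us how much concurrency we'd need to keep a pipeline fed:

# Little's law: L = λ * W
target_qps = 10_000  # λ: metadata requests per second we want to sustain (assumed)

for latency_s in (0.000031, 0.030):  # 31 µs (local stat) vs ~30 ms (object store round trip)
    in_flight = target_qps * latency_s
    print(f"W = {latency_s * 1e3:.3f} ms -> ~{in_flight:.1f} requests in flight")

# W = 0.031 ms -> ~0.3 requests in flight
# W = 30.000 ms -> ~300.0 requests in flight

In other words, at object store latencies we'd need hundreds of requests in flight just to sustain the same rate – or, more realistically, we wouldn't keep up and the GPU would sit idle.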

How lakeFS Mount optimizes deep learning workloads

lakeFS Mount allows for mounting a repository – or a specific path within one – as a local filesystem. When mounting, the everest command line utility spins up a local FUSE or NFS server (listening on localhost) and mounts it to a local path in the operating system:

how lakeFS Mount works

For this post, I’ll focus on read-only mounts, referring to a specific lakeFS commit.

Sub-millisecond metadata operations

The first and major improvement utilized by lakeFS Mount is its ability to efficiently prefetch a commit’s metadata onto a local cache dir. This leverages lakeFS’ architecture and data model: commits in lakeFS are represented as a set of pointers to data on an object store.

Each pointer is a key/value pair: The key is the logical path within the lakeFS repository (for example: ”data/file.parquet”) and the value is a structure with the following attributes:

  • size – Size of the object in bytes
  • mtime – Last modification date
  • physical_address – Location of the actual data file on the object store
  • user_metadata – Additional, user-controlled attributes
  • identity – A collision-resistant identifier representing this object (based on path, size, etag and other attributes)
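To make this layout concrete, here's an illustrative sketch of a single commit entry as a Python structure. The field names follow the list above, but the class itself is not the actual lakeFS implementation, and the values are hypothetical:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class CommitEntry:
    path: str              # key: the logical path within the lakeFS repository
    size: int              # size of the object in bytes
    mtime: int             # last modification time (epoch seconds)
    physical_address: str  # location of the actual data file on the object store
    identity: str          # collision-resistant identifier for this object
    user_metadata: dict = field(default_factory=dict)  # additional, user-controlled attributes

entry = CommitEntry(
    path="data/file.parquet",
    size=1_234_567,
    mtime=1_719_964_800,
    physical_address="s3://example-bucket/data/objects/ab12cd34",  # hypothetical address
    identity="e3b0c44298fc1c149afbf4c8996fb924",                   # hypothetical identity
)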

These key/value pairs are stored as immutable chunks (1-8MB in size) – each one representing some lexicographical range within a commit. They are stored in RocksDB-compatible sstable files, which are named by the hash of all included identities.

These range files are then referenced by a “meta-range” – a special type of range that points to other range files, constructing a Merkle tree.

lakeFS range files (on the right), efficiently storing file system metadata

With this layout, lakeFS Mount can pre-fetch these range files very efficiently:

lakeFS Mount can pre-fetch range files efficiently

Once pre-fetched, the local mount server is then able to satisfy all filesystem metadata operations (such as stat and readdir) directly from these sstables. This is a very fast lookup – sstables tend to be relatively compact and are optimized for random access reads – many orders of magnitude faster than an object store lookup.

Data prefetching and caching

Caching data is notoriously hard to get right, especially when we care about consistency and reproducibility. As the famous saying goes, There are only two hard things in Computer Science: cache invalidation and naming things.

However, lakeFS commits – along with their meta-ranges and ranges – are guaranteed to be immutable! Furthermore, each file within such a commit has an identity; if we store files in the cache keyed by that identity, we can also reuse a cached object across commits without sacrificing consistency. This means there's no invalidation to worry about – our eviction algorithm only has to keep the most frequently accessed objects in storage.

lakeFS Mount implements this using a read-through cache: when objects are requested by the operating system, the mount server will first look them up in the cache dir based on their identity. If it is not found, the file will be fetched from the remote object store into the cache dir and then served from there. 

This is both simple and effective – subsequent reads and seeks happen locally.
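Here's a minimal sketch of that read-through lookup – the cache location and helper names are hypothetical, not the actual lakeFS Mount internals:

import os

CACHE_DIR = "/tmp/lakefs-mount-cache"  # hypothetical cache directory

def open_object(identity: str, fetch_to_path):
    # Serve an object from the local cache, fetching it from the object store on first access.
    # `identity` is the commit entry's collision-resistant identifier;
    # `fetch_to_path` is any callable that downloads the object's bytes to a given local path.
    cached_path = os.path.join(CACHE_DIR, identity)
    if not os.path.exists(cached_path):
        # Cache miss: fetch once, then serve locally from now on.
        os.makedirs(CACHE_DIR, exist_ok=True)
        fetch_to_path(cached_path)
    # Cache hit (or freshly populated): all subsequent reads and seeks are local.
    return open(cached_path, "rb")

Because the cache key is the identity rather than the path, an unchanged file referenced by multiple commits is cached only once.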

The second part of the story is using that cache optimally – most workloads are fairly deterministic, so we can anticipate with high accuracy which objects are likely to be accessed. In some cases, it can be beneficial to pre-fetch them before processing begins. For this, lakeFS Mount allows granular pre-fetching, not only for metadata as described above, but also for data files.

Getting started with lakeFS Mount

Prerequisites: 

  1. A working lakeFS Server running either lakeFS Enterprise or lakeFS Cloud
  2. You’ve installed the lakectl command line utility: this is the official lakeFS command line interface, on top of which lakeFS Mount is built. 
  3. lakectl is configured properly to access your lakeFS server, as detailed in the configuration instructions.

To use lakeFS Mount, request a private preview to download and install the everest command line utility, currently available for macOS and Linux.

Mounting a path to a local directory:

$ everest mount lakefs://repository/reference/path/ ./my_local_directory

Once the command completes, the specified path is mounted at my_local_directory.
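At this point, the tf.data one-liner from the beginning of this post can be pointed at the mounted directory unchanged (using the example mount target from above):

import tensorflow as tf

# Same listing as before, now reading through the lakeFS mount
dataset = tf.data.Dataset.list_files("./my_local_directory/*.txt")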

To unmount the directory, simply run:

$ everest umount ./my_local_directory

This will unmount the path and terminate the local mount server.

Of course, there are a lot of knobs that we can turn here to improve I/O efficiency – prefetching parallelism, usage of pre-signed URLs, cache directory size and location, and many others. For a complete list, visit the lakeFS Mount official documentation.

What’s Next?

A lot! 

lakeFS Mount is still in private preview, expected to be in GA by the end of Q3 2024. 

On its roadmap for the second half of the year:

  • Granular data pre-fetching strategies
  • Native Kubernetes support via a CSI Driver
  • Write support (by branching out of the mounted commit and applying changes in isolation!)

This is only a partial list, of course. If you have a use case for lakeFS Mount and would like to help shape its future, we're actively looking for design partners to help us make it the highest-performance way to use object stores for deep learning, data science and data engineering.
