

Author: Amit Kesarwani

Amit heads the solution architecture group at Treeverse, the company...

Published on September 12, 2024

What is a mount?

A filesystem mount presents a local device or a remote location as a local directory. It is a basic feature of all operating systems and is widely used by system administrators and developers.

Let’s break down the differences between Mountpoint for Amazon S3 and lakeFS Mount:

Amazon S3 Mountpoint

What it is

Amazon S3 Mountpoint is a high-throughput, open-source file client for mounting an Amazon S3 bucket as a local file system. With Mountpoint, your applications can access objects stored in Amazon S3 through file system operations such as open and read. Mountpoint automatically translates these operations into S3 object API calls, giving your applications access to the elastic storage and throughput of Amazon S3 through a file interface.

How it works

Mountpoint supports basic file system operations, and can read files up to 5 TB in size. It can list and read existing files, and it can create new ones. It cannot modify existing files or delete directories, and it does not support symbolic links or file locking.
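
As a rough illustration of the operations described above, the sketch below uses only standard Python file APIs against a directory path. It assumes a bucket is already mounted at `mount_dir`; the helper names (`list_objects`, `read_object`, `create_object`) are hypothetical and are not part of Mountpoint. Nothing here is specific to S3, which is the point: the application only sees files.

```python
import os


def list_objects(mount_dir):
    """Return the sorted relative paths of all files under the mounted directory."""
    paths = []
    for root, _dirs, files in os.walk(mount_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), mount_dir)
            paths.append(rel.replace(os.sep, "/"))
    return sorted(paths)


def read_object(mount_dir, key):
    """Read a whole file through the mount using ordinary open/read calls."""
    with open(os.path.join(mount_dir, key), "rb") as f:
        return f.read()


def create_object(mount_dir, key, data):
    """Create a new file under the mount.

    Mode "xb" raises FileExistsError if the file already exists. This is a
    choice made in this sketch to mirror the restriction in the text that
    existing files cannot be modified.
    """
    path = os.path.join(mount_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "xb") as f:
        f.write(data)
```

Because the helpers take a plain directory path, they can be tried against any local directory before pointing them at a real mount.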

To use Mountpoint for Amazon S3, your host needs valid AWS credentials with access to the bucket or buckets that you would like to mount.

Key Benefits

  • Direct Access: Mountpoint translates file operations straight into S3 object API calls, so your applications work with the bucket directly and you keep full control over your data.
  • No Third-Party Dependencies: You don’t need to install any software other than Mountpoint for Amazon S3.

Considerations

  • Limited Versioning and Branching: S3 offers only basic versioning capabilities, and Mountpoint adds no branch management or version control on top of them.

lakeFS Mount

What it is

lakeFS Mount is a feature that lets you work with Amazon S3 data through your local file system while keeping lakeFS’s versioning and branching capabilities. It creates a local “mount point” (a directory on your machine) that acts as a gateway to your versioned S3 data.

How it works

When you access files through the lakeFS mount, lakeFS handles the communication with S3 in the background, giving you the impression of a regular file system.

To use lakeFS Mount, your host needs a lakeFS Mount binary file and valid lakeFS credentials with access to the lakeFS branch/commit that you would like to mount.

Key Benefits

  • Simple Access:  No need to write complex code to interact with S3 APIs. You can treat S3 data as if it’s on your local drive.
  • Faster Development:  The familiar file system interface speeds up development and testing.
  • Improved Data Visibility:  You can easily explore and manipulate S3 data from your local tools (like editors, IDEs, or even command-line tools).
  • Versioning and Branching:  lakeFS’s versioning and branching capabilities extend to your S3 data when mounted, providing powerful data management features.
  • Listing Operations: lakeFS Mount prefetches file metadata into a local cache directory, so listing files and directories is much faster because the metadata is already on the mounting host.
  • Git Integration & Data Reproducibility: lakeFS Mount has reproducibility built in. When you mount a lakeFS repository path inside a Git repository, Git automatically tracks which version of the data was mounted, linking code and input data together. When you check out an older version of the code, you automatically get the version of the data that the code was run on.
  • Collaboration: Teams using lakeFS can work simultaneously on different versions of data without conflicts and share specific data versions with others to ensure data consistency.

When to use Amazon S3 Mountpoint or lakeFS Mount

| Feature | Amazon S3 Mountpoint | lakeFS Mount |
|---|---|---|
| Ideal For | Applications that need fast reads and writes but do not need all of the features of a shared file system | Data-heavy applications and projects that need versioning and consistent reads |
| Usage | Large-scale, read-heavy applications | Quick prototyping, development, data exploration |
| Key Strength | Elastic throughput for large S3 datasets | Seamless access, versioning, and branching for S3 data |
| Supported OS | Linux only | Linux, macOS, Windows (upcoming) |
| Cache Mechanism | Not mentioned | Read-through cache with pre-fetching for deep learning |
| Advanced Features | No branch management or version control | Branch management, version control, and granular pre-fetch |

Amazon S3 Mountpoint

  • Mountpoint for Amazon S3 is generally used by large-scale read-heavy applications: data lakes, machine learning training, image rendering, autonomous vehicle simulation, extract, transform, and load (ETL), and more.
  • Mountpoint is ideal for applications that do not need all of the features of a shared file system and POSIX-style permissions but require Amazon S3’s elastic throughput to read and write large S3 datasets.
  • Mountpoint for Amazon S3 is available only for Linux operating systems.
  • Mountpoint does not provide the advanced features of lakeFS, such as branch management and version control.

lakeFS Mount

  • Quick prototyping, development, or data exploration when seamless file system access is desired. You could mount your S3 data on your local machine and use familiar tools like Pandas to read and manipulate it, relying on lakeFS’s versioning for data consistency.
  • Deep learning applications, which typically read the same file many times. lakeFS Mount implements a read-through cache: when the application requests an object, the mount server first looks it up in the cache directory by its identity. If the object is not found, it is fetched from the remote object store into the cache directory and then served from there.
    Most deep learning workloads are also fairly deterministic, so the objects likely to be accessed can be anticipated with high accuracy. In some cases it is beneficial to pre-fetch them before processing begins. For this, lakeFS Mount allows granular pre-fetching, not only for metadata as described above but also for data files.
  • Projects that benefit from versioning and branching for S3 data.
  • lakeFS Mount is available for Linux and macOS, with Microsoft Windows support upcoming.
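
The read-through cache and pre-fetching behavior described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not lakeFS Mount's actual implementation: the class name `ReadThroughCache`, the `fetch` callable standing in for the remote object store, and the choice of a SHA-256 digest of the identity as the cache file name are all hypothetical.

```python
import hashlib
import os
import tempfile


class ReadThroughCache:
    """Serve objects from a local cache dir, fetching from the remote on a miss."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical callable: identity (str) -> bytes
        self.remote_fetches = 0  # counts cache misses, for illustration only
        os.makedirs(cache_dir, exist_ok=True)

    def _cache_path(self, identity):
        # Assumption: the cache file is named after a hash of the object identity.
        digest = hashlib.sha256(identity.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def _ensure_cached(self, identity):
        path = self._cache_path(identity)
        if not os.path.exists(path):
            data = self.fetch(identity)
            self.remote_fetches += 1
            # Write to a temporary file first, then rename it into place so
            # readers never observe a partially written cache entry.
            fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp_path, path)
        return path

    def read(self, identity):
        """Return the object's bytes, always served from the cache dir."""
        with open(self._ensure_cached(identity), "rb") as f:
            return f.read()

    def prefetch(self, identities):
        """Warm the cache for objects expected to be read later."""
        for identity in identities:
            self._ensure_cached(identity)
```

In a training loop that reads the same files every epoch, only the first read of each object (or an earlier `prefetch` call) would reach the remote store; later reads come from local disk.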

Summary

The choice between lakeFS Mount and Amazon S3 Mountpoint depends on the requirements of your use case. If you need the advanced features provided by lakeFS, such as branch management and version control, lakeFS Mount may be the better choice. If you have simpler needs and just want to access S3 data as a file system, Mountpoint for Amazon S3 may be a more lightweight and efficient option.
