Advancing lakeFS: Version Data At Scale With Spark
When integrating two technologies, the aim should be to expose the strengths of each as much as possible.
With this philosophy in mind, we are excited to announce the release of the lakeFS FileSystem! This native Hadoop FileSystem implementation allows for Spark applications on lakeFS to realize the best of both worlds. Spark workers can utilize their full capacity for distributed data operations, while lakeFS provides versioning capability to large-scale datasets.
In this article, we will explain what the lakeFS Filesystem is, how it works, and how you can use it in your own Spark applications!
Let’s start with a simple example. Previously, in order for a Spark application to access data objects managed by lakeFS, we would leverage S3 Gateway functionality like so:
spark.read.parquet("s3a://example-repo/main/example-path/example-file.parquet")
Note the URI scheme “s3a” at the beginning of the object path. It tells Spark to use the S3AFileSystem to communicate with lakeFS via its S3-compatible endpoint. From there, the lakeFS server is responsible for interacting with the underlying object store and carrying out any operations.
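For reference, this is roughly what the S3 Gateway setup looks like in a Spark session. The snippet below is a minimal sketch with a placeholder lakeFS endpoint and credentials; check the lakeFS documentation for the exact S3A properties used in your deployment.

from pyspark.sql import SparkSession

# Sketch: point the S3A filesystem at the lakeFS S3-compatible endpoint.
# The endpoint URL and credentials are placeholders.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-access-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-repo/main/example-path/example-file.parquet")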
With the new lakeFS Hadoop FileSystem implementation however, the scheme undergoes a small change:
spark.read.parquet("lakefs://example-repo/main/example-path/example-file.parquet")
Updating the URI scheme to “lakefs” allows Spark to access lakeFS using the new lakeFS Hadoop FileSystem and benefit from its performance improvements.
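Making the lakefs:// scheme resolve requires configuring Spark with the lakeFS FileSystem implementation and your lakeFS credentials (and having the lakeFS client jar available on the classpath). The snippet below is a sketch of that configuration with placeholder endpoint and keys; verify the property names against the lakeFS documentation for your version.

from pyspark.sql import SparkSession

# Sketch: register the lakeFS Hadoop FileSystem for the lakefs:// scheme.
# Endpoint and credentials are placeholders.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-access-key>")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
    .getOrCreate()
)

df = spark.read.parquet("lakefs://example-repo/main/example-path/example-file.parquet")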
Let’s dive into an example to show where these improvements come from!
The key concept of the lakeFS Hadoop FileSystem is that it distinguishes between metadata and data operations.
Less intensive metadata operations continue to route through the lakeFS server. Resource-heavy data operations, however, are handled directly by an underlying Hadoop FileSystem, which can utilize the full throughput of the object store.
This separation prevents the lakeFS server from being overwhelmed by the massive data I/O typical of Spark workloads.
Let’s look at what happens from the lakeFS FileSystem’s perspective during a typical read operation like the one below, which reads data from a lakeFS repository (backed by S3) into a Spark DataFrame.
df = spark.read.parquet("lakefs://example-repo/main/example-path/example-file.parquet")
The above read operation gets split into its data and metadata parts. In this case, the metadata operation will take the object’s lakeFS path as input and return its physical address on S3.
The pseudo-code snippet below gives a sense of the procedure:
# Metadata operation:
# query the lakeFS OpenAPI for the location of the object on the underlying storage.
path = "lakefs://example-repo/main/example-path/example-file.parquet"
physical_address = get_physical_address(path)

# Data operation:
# use the underlying FileSystem (S3A here) to operate on the data in the underlying storage.
s3a = get_underlying_file_system(path)
stream = s3a.open(physical_address)
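To make the metadata half more concrete, here is a hedged sketch of the same lookup using the lakeFS Python SDK (lakefs_client); the lakeFS Hadoop FileSystem performs the equivalent OpenAPI call internally. The endpoint and credentials are placeholders, and method names may differ slightly between SDK versions.

import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholder endpoint and credentials.
configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "<lakefs-access-key-id>"
configuration.password = "<lakefs-secret-access-key>"
client = LakeFSClient(configuration)

# stat_object returns the object's metadata, including its physical address
# in the repository's underlying storage namespace (e.g. an s3:// location).
stats = client.objects.stat_object(
    repository="example-repo", ref="main", path="example-path/example-file.parquet"
)
print(stats.physical_address)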
Before, Spark apps reading from a lakeFS repository used the S3A filesystem to access the lakeFS server via the S3 Gateway.
With this setup, distinguishing between data and metadata operations happens entirely on the lakeFS server. As a consequence, data throughput is bounded by the lakeFS server’s throughput.
Comparing this diagram to the one above makes it clear how the new lakeFS Hadoop FileSystem achieves these performance gains, and why we chose to prioritize its release: to ensure lakeFS is never a bottleneck for Spark workflows.
The lakeFS Hadoop FileSystem currently supports using the S3A Hadoop FileSystem for data access, but there is more to come! We plan to extend support for others including EMRFS, Databricks FileSystem, GCS FileSystem, and Azure FileSystem.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.