Combining lakeFS and Spark brings a new standard of scale and elasticity to distributed data pipelines.
When integrating two technologies, the aim should be to expose the strengths of each as much as possible.
With this philosophy in mind, we are excited to announce the beta release of the lakeFS FileSystem! This native Hadoop FileSystem implementation lets Spark applications on lakeFS realize the best of both worlds. Spark workers can use their full capacity for distributed data operations, while lakeFS provides versioning capabilities for large-scale datasets.
In this article, we will explain what the lakeFS FileSystem is, how it works, and how you can use it in your own Spark applications!
The lakeFS FileSystem: How to use
When a Spark application accesses lakeFS through the S3-compatible endpoint, the object path begins with the URI scheme “s3a”. This tells Spark to use the S3AFileSystem to communicate with lakeFS. From there, the lakeFS server is responsible for interacting with the underlying object store and carrying out any operations.
With the new lakeFS Hadoop FileSystem implementation, however, the scheme undergoes a small change: “s3a” becomes “lakefs”.
Updating the URI scheme to “lakefs” allows Spark to access lakeFS using the new lakeFS Hadoop FileSystem and benefit from its performance improvements.
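As a minimal sketch of that change, the helper below rewrites an existing “s3a” object path to the “lakefs” scheme. Note that to_lakefs_uri is a hypothetical name of our own, not part of lakeFS or Spark, and the sketch assumes the Spark job has already been configured to load the lakeFS Hadoop FileSystem for the “lakefs” scheme.

```python
def to_lakefs_uri(uri: str) -> str:
    """Swap the s3a scheme for lakefs (hypothetical helper, not a lakeFS API)."""
    prefix = "s3a://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected an s3a:// URI, got: {uri!r}")
    # Only the scheme changes; repository, branch, and object path stay the same.
    return "lakefs://" + uri[len(prefix):]


old_path = "s3a://example-repo/main/example-path/example-file.parquet"
new_path = to_lakefs_uri(old_path)
print(new_path)
```

Because everything after the scheme is left untouched, existing jobs should only need their path prefixes updated, for example in a shared configuration value rather than in every read and write call.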
Let’s dive into an example to show where these improvements come from!
Lifting the Hood: How it works
The key concept of the lakeFS Hadoop FileSystem is that it distinguishes between metadata and data operations.
Less intensive metadata operations continue to route through the lakeFS server. Resource-heavy data operations, however, get handled by the existing Hadoop FileSystem, capable of utilizing the full object store throughput.
This separation carried out by the lakeFS Hadoop FileSystem prevents overloading the lakeFS server with massive data I/O typical of Spark workloads.
Let’s look at what happens from the lakeFS FileSystem’s perspective during a typical read operation like the one below that takes data from a lakeFS repository (backed by S3) and reads it into a Spark DataFrame.
df = spark.read.parquet("lakefs://example-repo/main/example-path/example-file.parquet")
The above read operation gets split into its data and metadata parts. In this case, the metadata operation will take the object’s lakeFS path as input and return its physical address on S3.
The pseudo-code snippet below gives a sense of the procedure:
# Metadata operation -
# Query lakeFS OpenAPI for the location of the object holding the data on underlying storage.
path = "lakefs://example-repo/main/example-file.parquet"
physical_address = get_physical_address(path)

# Data operation -
# Utilize the underlying fileSystem to operate on data in the underlying storage.
s3a = get_underlying_file_system(path)
s3a.open(physical_address)
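To make the two steps concrete, here is a runnable toy version of the same flow. Everything in it is a stand-in: FakeLakeFSMetadata and FakeObjectStore are hypothetical in-memory classes, not the lakeFS API client or the S3A FileSystem, and the physical address is invented for illustration. It only shows the control flow: parse the lakeFS path into repository, ref, and key, ask the metadata service for the physical address, then read the bytes directly from the object store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakeFSPath:
    repo: str
    ref: str
    key: str


def parse_lakefs_path(uri: str) -> LakeFSPath:
    """Split lakefs://<repo>/<ref>/<key> into its three parts."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    repo, ref, key = uri[len(prefix):].split("/", 2)
    return LakeFSPath(repo, ref, key)


class FakeLakeFSMetadata:
    """In-memory stand-in for the lakeFS server (hypothetical)."""

    def __init__(self, mapping):
        self._mapping = mapping

    def get_physical_address(self, path: LakeFSPath) -> str:
        return self._mapping[path]


class FakeObjectStore:
    """In-memory stand-in for the underlying FileSystem (hypothetical)."""

    def __init__(self, objects):
        self._objects = objects

    def open(self, physical_address: str) -> bytes:
        return self._objects[physical_address]


def read(uri: str, metadata: FakeLakeFSMetadata, store: FakeObjectStore) -> bytes:
    path = parse_lakefs_path(uri)
    # Metadata operation: a small lookup answered by lakeFS.
    physical_address = metadata.get_physical_address(path)
    # Data operation: the bytes come straight from the object store.
    return store.open(physical_address)


uri = "lakefs://example-repo/main/example-path/example-file.parquet"
physical = "s3://example-bucket/data/abc123"  # invented address for illustration
metadata = FakeLakeFSMetadata({parse_lakefs_path(uri): physical})
store = FakeObjectStore({physical: b"parquet-bytes"})
data = read(uri, metadata, store)
print(data)
```

The point of the split is visible in read: the metadata stand-in only ever handles a short address string, while the object payload never passes through it.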
How it used to work
Previously, Spark apps reading from a lakeFS repository used the S3A FileSystem to access the lakeFS server via the S3 Gateway.
With this setup, the distinction between data and metadata operations is made entirely on the lakeFS server, so all data passes through it. As a consequence, data throughput is limited by the lakeFS server’s throughput.
Comparing this diagram to the one above makes it clear where the performance gains of the new lakeFS Hadoop FileSystem come from. It also shows why we chose to prioritize its release: to ensure lakeFS is not a bottleneck for Spark workflows.
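The contrast can also be put in rough numbers with a toy model. The sketch below is not a benchmark and uses no lakeFS code. The function names, the 200-byte metadata response size, and the assumption of one metadata lookup per object are all ours, chosen purely for illustration. It counts how many bytes would pass through the lakeFS server when reading a set of objects via each route.

```python
def bytes_through_server_gateway(object_sizes):
    """S3 Gateway route: the server proxies every object's data."""
    return sum(object_sizes)


def bytes_through_server_filesystem(object_sizes, metadata_response_bytes=200):
    """lakeFS FileSystem route: the server answers one small lookup per object.

    The per-lookup response size is an illustrative assumption.
    """
    return len(object_sizes) * metadata_response_bytes


# Three hypothetical Parquet files of 128 MiB each.
sizes = [128 * 1024 * 1024] * 3
gateway_load = bytes_through_server_gateway(sizes)
filesystem_load = bytes_through_server_filesystem(sizes)
print(gateway_load, filesystem_load)
```

In this model the gateway load grows with the size of the data, while the FileSystem load grows only with the number of objects, which is the property that keeps the lakeFS server out of the data path.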
The lakeFS Hadoop FileSystem currently supports using the S3A Hadoop FileSystem for data access, but there is more to come! We plan to extend support for others including EMRFS, Databricks FileSystem, GCS FileSystem, and Azure FileSystem.
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration with popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.