Iddo Avneri
January 3, 2023

lakeFS is gaining momentum as a solution for data versioning on top of an object store, and more and more data-driven organizations adopt lakeFS as their data version control system. Once you start using lakeFS, the files on your object store are organized in a new structure. Other solutions, such as Iceberg, also introduce a new metadata structure. Such a structure might be concerning unless the solution is deployed responsibly and offers methods to import and export data into different formats.

In previous articles, we covered the benefits of using lakeFS as a data engineer, mainly to gain an easy-to-use ETL testing environment, enable reproducibility for experiments, and eventually reach CI/CD for data to avoid future failures.


In this article, we will examine the structure of lakeFS files under the hood in a practical way. Developers who understand these internals will better understand the solutions for high availability and disaster recovery of lakeFS.

Understanding lakeFS File Structure

Creating a repository

Let’s first briefly review the file structure once lakeFS is implemented. First, I will create a new repository. In this case, I’ll use the web interface:

This creates an “empty” lakeFS repository understand-lakefs-repo sitting on an S3 bucket my-lakefs-managed-bucket:

If we look on the AWS side (similar behavior on Azure, GCP, MinIO etc.), we will see a single file created at this time:

The dummy file is created to check the permissions of the AWS role used by lakeFS to write into the bucket. 

Importing files

A typical first step when adopting lakeFS is to import files into the solution. Importing doesn’t copy the files into the lakeFS-managed bucket (as we are about to demonstrate). This also means that if you change the data directly in the original bucket, lakeFS will not track those changes.
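
Conceptually, an import only records pointers to the objects where they already live. Here is a toy sketch of that idea (the function and names are illustrative, not lakeFS's actual import code):

```python
def import_objects(source_listing, metadata):
    """'Import' records each object's existing location under its logical
    key; no bytes are copied, so the data stays in the original bucket."""
    for logical_key, physical_address in source_listing:
        metadata[logical_key] = physical_address  # pointer only, no data movement
    return metadata

# Importing one parquet file: the physical address still points
# at the original bucket, not at the lakeFS-managed one.
meta = import_objects(
    [("product-reviews/part-00000.parquet",
      "s3://my-original-data/product-reviews/part-00000.parquet")],
    {},
)
```

This is also why changes made directly to the original bucket go unnoticed: lakeFS holds a pointer, not a copy.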

I will be importing data from my bucket s3://my-original-data, which contains a single directory product-reviews with 2 parquet files:

Now, examining my bucket, we see that a new folder, _lakefs, was created.

That folder contains a couple of files:

Notice that while the original parquet files are around 700 KB each, these files are only around 1.4 KB. This means they are not copies of the imported files, but range and metarange files recording the locations of those files.

Pro tip 

To view the content of these files, I can use the command:

 aws s3 cp <path to file on s3> - | lakectl cat-sst -f -

For example (shortened): 

% aws s3 cp s3://my-lakefs-managed-bucket/_lakefs/58..ec - | lakectl cat-sst -f -
+----------+------------------+------------------------------------------------------+-------+----------------+
| RANGE_ID | MIN_KEY          | MAX_KEY                                              | COUNT | ESTIMATED_SIZE |
+----------+------------------+------------------------------------------------------+-------+----------------+
| d7....28 | product-reviews/ | product-reviews/part-00001-tid-39..00.snappy.parquet | 3     | 1521           |
+----------+------------------+------------------------------------------------------+-------+----------------+
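
Conceptually, a metarange maps contiguous, sorted slices of the key space to range files, and each range file lists the objects whose keys fall inside its slice. A minimal sketch of that lookup idea (the ids and key boundaries below are made up for illustration; this is not lakeFS's actual on-disk format):

```python
from bisect import bisect_left

# Illustrative range entries, loosely modeled on the lakectl output above,
# kept sorted by max_key. Each range covers a contiguous slice of keys.
ranges = [
    {"range_id": "range-1", "min_key": "product-reviews/",
     "max_key": "product-reviews/part-00004.parquet"},
    {"range_id": "range-2", "min_key": "product-reviews/part-00005.parquet",
     "max_key": "product-reviews/part-00009.parquet"},
]

def find_range(key):
    """Return the range whose [min_key, max_key] interval contains key, or None."""
    i = bisect_left([r["max_key"] for r in ranges], key)  # first range with max_key >= key
    if i < len(ranges) and ranges[i]["min_key"] <= key:
        return ranges[i]
    return None

print(find_range("product-reviews/part-00001.parquet")["range_id"])  # range-1
```

Because ranges cover sorted, non-overlapping key intervals, locating the range that holds a key is a simple binary search over the max keys.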

Now that I have a separate branch containing my imported data, I can merge that data into my main production branch:

In this case, no additional files were added to my object store. I can now query my data through lakeFS using the regular lakeFS path:

# Review data from the production branch

df = spark.read.parquet("lakefs://understand-lakefs-repo/production/product-reviews/")


df.show()

Adding / Uploading Files

Going forward, we will write new objects to our repository; these will be written to the lakeFS-managed S3 bucket (in my case, s3://my-lakefs-managed-bucket/, the bucket I used when creating the repository).

In this example, I’ll add 3 images my (perfect) 8-year-old daughter created on Dall-E of a “Cotton Candy boat in a chocolate Ocean”:

These are my files:

% ls -lh | awk '{print $5, $9}' | grep Image 
1.3M Image1.png
1.3M Image2.png
1.3M Image3.png

Once again, I’ll upload the files using the web interface:

Once uploaded, I now have a new directory, data/:

If I browse the subdirectories of the data folder, I will see 3 files in 2 different subdirectories:

These are, of course, the three 1.3 MB image files I’ve uploaded. As expected, the files are named differently when uploaded via lakeFS, and the references to these files are maintained in the range and metarange files for the different commits.
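
A toy model of what happens on upload: the object gets an opaque physical name under data/, and only the repository metadata remembers the logical path (the class and naming scheme here are illustrative, not lakeFS's actual implementation):

```python
import uuid

class StagingArea:
    """Illustrative sketch: uploads get opaque physical names under data/,
    while the logical path lives only in the metadata mapping."""

    def __init__(self, storage_namespace):
        self.storage_namespace = storage_namespace
        self.metadata = {}  # logical key -> physical address

    def upload(self, logical_key, payload):
        # The physical name carries no trace of the logical path.
        physical = f"{self.storage_namespace}/data/{uuid.uuid4().hex}"
        # (A real server would write `payload` to `physical` here.)
        self.metadata[logical_key] = physical
        return physical

staging = StagingArea("s3://my-lakefs-managed-bucket")
addr = staging.upload("images/Image1.png", b"...png bytes...")
```

This is why browsing the bucket directly shows unrecognizable file names: the logical names are resolved through the commit metadata, not the object store listing.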

At this point, we uploaded the files, but didn’t commit the changes yet. Let’s commit the files via the UI (click on the uncommitted changes tab):

Once committed, new files are added under the _lakefs directory:

These are additional range and metadata files that were added once I created the new commit:


Another way to get the physical address of a file on S3 from lakeFS is to click the cog-wheel icon next to the object name in the UI and get the physical address:

Note: In general, the exact structure of the storage namespace might differ between lakeFS versions. However, the data files will always be stored on your own bucket.

Conclusion

Using lakeFS, the files themselves are stored in a data directory, with range and metarange files sitting on the object store under the _lakefs directory, associating the individual files with the different commits.

Regardless of where lakeFS runs, data, range, and metarange files are stored in place, on your object store.

Restoring the original human-readable file structure without lakeFS

We humans often want to be sure we have a way to read our data, in a logical structure we understand, directly on top of the object store. lakeFS offers a couple of ways to export the data. For example, the branch lakefs://understand-lakefs-repo/production can be exported to s3://axolotl-company/production/latest. Data consumers, unaware of lakeFS, could use that base URL to access the latest production files. Of course, we can continue using the lakeFS S3 endpoint to access files on lakefs://understand-lakefs-repo/ or on any branch or any historical commit.
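
Conceptually, an export is just a walk over a commit's metadata: copy each physical object to its logical key under the destination prefix, so the result is readable without lakeFS. A toy sketch of that idea (the function and sample addresses are illustrative, not the actual exporter):

```python
def export_commit(commit_metadata, dest_prefix, copy_object):
    """Materialize a commit as a human-readable tree: copy every physical
    object to its logical key under dest_prefix."""
    for logical_key, physical_address in sorted(commit_metadata.items()):
        copy_object(physical_address, f"{dest_prefix}/{logical_key}")
    return len(commit_metadata)

# Toy run: record the copies instead of calling the object store.
copies = []
exported = export_commit(
    {"product-reviews/part-00000.parquet":
         "s3://my-lakefs-managed-bucket/data/3f9c0a"},
    "s3://axolotl-company/production/latest",
    lambda src, dst: copies.append((src, dst)),
)
```

The real export tools (Spark or the rclone-based Docker image shown below) do exactly this walk, only in parallel and against the actual object store.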

Exporting data with docker

If you have access to Spark, it is just as easy, and more performant, to use Spark for the export instead of exporting with Docker. However, here I’ll demonstrate a single command to export my data using Docker, which is always available.

First, make sure you have a bucket available to store the data. As described above, I’ll use s3://axolotl-company/production/latest.

I’ve created a bucket and confirmed my role has access to write to that bucket:

Next, to export the data, I’ll run the following command (I’m using --platform linux/amd64 because I’m running on an M1 processor):

docker run --platform linux/amd64 \
  -e LAKEFS_ACCESS_KEY=AKIA_LFS \
  -e LAKEFS_SECRET_KEY=LAKEFS_SECRET \
  -e LAKEFS_ENDPOINT=https://YOUR.lakefs.io \
  -e S3_ACCESS_KEY=AKIA_S3 \
  -e S3_SECRET_KEY=S3_SECRET \
  -it treeverse/lakefs-rclone-export:latest understand-lakefs-repo \
  s3://axolotl-company/production/latest --branch=production

And… voila:

We now have a human-readable version of our data stored on our bucket, which anyone (with the right permissions) can access. 

More ways to access your data

Another technique worth mentioning to access your data: lakeFS also exposes its metadata in an open format, readable as a Spark DataFrame.

For example:

import io.treeverse.clients.LakeFSContext
    
val commitID = "a1b2c3d4"
val df = LakeFSContext.newDF(spark, "example-repo", commitID)
df.show
/* output example:
   +------------+--------------------+--------------------+-------------------+----+
   |        key |             address|                etag|      last_modified|size|
   +------------+--------------------+--------------------+-------------------+----+
   |     file_1 |791457df80a0465a8...|7b90878a7c9be5a27...|2021-03-05 11:23:30|  36|
   |     file_2 |e15be8f6e2a74c329...|95bee987e9504e2c3...|2021-03-05 11:45:25|  36|
   |     file_3 |f6089c25029240578...|32e2f296cb3867d57...|2021-03-07 13:43:19|  36|
   |     file_4 |bef38ef97883445c8...|e920efe2bc220ffbb...|2021-03-07 13:43:11|  13|
   +------------+--------------------+--------------------+-------------------+----+
 */

Conclusion

There are no usage limitations due to the new file structure introduced by lakeFS – we can easily export the data to a normal “human-readable” structure if needed.

Manage Failures

Deployment

lakeFS is stateless, implementing a shared-nothing architecture. We suggest deploying lakeFS on multiple instances behind a load balancer, a configuration supported out of the box on Kubernetes using the lakeFS Helm chart.
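
As a sketch, such a highly available deployment might use Helm values along these lines (the value names below are assumptions for illustration; check the lakeFS Helm chart documentation for the exact keys):

```yaml
# values.yaml (illustrative sketch, not a verified configuration)
replicaCount: 3            # several stateless lakeFS pods behind one Service
lakefsConfig: |
  database:
    type: postgres         # shared metadata store, so any replica can serve any request
```

Because the servers are stateless, replicas can be added or replaced freely; all shared state lives in the metadata store and on the object store itself.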

Of course, you can also take advantage of the hosted lakeFS cloud solution, where this is completely transparent to the users consuming lakeFS as a service (while the data itself sits on top of your buckets). 

Disaster Recovery

We demonstrated how you can easily export your data from the lakeFS-managed structure to your “regular” object store structure.

Having said that, we also demonstrated how all files are stored on your object store. This means that even in the most extreme case, where the lakeFS server cannot be restored, a new lakeFS server can be spun up against your data and you can continue working as usual.

Summary

Due to the structure of lakeFS, with all files saved on your managed bucket within your cloud (or on-premises), there are multiple ways to achieve high availability for lakeFS. Furthermore, there are multiple easy ways to export the data to the original human-readable structure for other systems or people to consume, or as a backup.
