
lakeFS Community

Ariel Shaqed (Scolnicov), Principal Software Engineer at lakeFS

Published on February 21, 2024

If you’ve come across our content before, you may have noticed blog posts diving into the technical details of lakeFS, and this is one of them. It covers lakeFS internals: you do not need to know any of the details below in order to use lakeFS at any level. But if you’re curious about the technical side of lakeFS, or want to start digging into its internals, stick around and read on.

For a data management platform, lakeFS can sometimes seem to take pains to hide your data. Indeed, one very common question on our Slack #help channel is some polite variation on “where’s my data?”. lakeFS does keep your data, inside your storage namespace; this core functionality works and is well tested. But lakeFS does such a good job of abstracting away the details that it can be hard to find the data!

[Image: “Where is Pluto?” story garden. Source: Wikimedia]

Where, indeed, is my data?

My repository has an object allstar_games_stats.csv on branch main.

When I look inside the storage namespace of a repository with just a few commits, I see something like this:

❯ aws s3 ls s3://$storage_namespace/
                        PRE _lakefs/
                        PRE data/
2022-04-28 19:26:16      70 dummy

But it is nowhere to be found in the storage namespace!  Even a recursive listing of the storage namespace cannot find it:

❯ (aws s3 ls --recursive s3://$storage_namespace/ | grep -q allstar_games_stats.csv) || echo not found.
not found.

If I search for an object with the exact same size in bytes, I can find it…

❯ aws s3 ls --recursive s3://$storage_namespace/ | grep -w 2146
2023-03-12 10:51:15       2146 .../data/gp0n1l7d77pn0cke6jjg/cg6p50nd77pn0cke6jk0
❯ aws s3 cp s3://$storage_namespace/data/gp0n1l7d77pn0cke6jjg/cg6p50nd77pn0cke6jk0 - | head -4
Season,Age,Team,ORB,TRB,AST,STL,BLK,TOV,PF,PTS,Player
2004-05,20,CLE,1,8,6,2,0,3,0,13,Lebron James
2005-06,21,CLE,2,6,2,2,0,1,2,29,Lebron James
2006-07,22,CLE,0,6,6,1,0,4,0,28,Lebron James

The object is clearly there, but why the weird names?  How does lakeFS find these files, and how does making this mess help tidy things up?

Object stores: Names are forever

lakeFS manages immutable data, but the mapping from a path to an object changes as you upload, commit and merge on lakeFS. When lakeFS stores an object, behind the scenes it uses an object store.  lakeFS needs the name of the object there to be stable and unique.  A stable name is one that will never change; a unique name is guaranteed not to be used for any other object.

Naming objects after their lakeFS paths cannot preserve these properties. Let’s start by asking: why does lakeFS not store the object allstar_games_stats.csv from branch main at an object store path such as main/allstar_games_stats.csv?

  • The same path on a branch can correspond to multiple versions. Suppose I upload another version, and commit that one as well. The object store name main/allstar_games_stats.csv, used for the first version, is unique and cannot be re-used. The name of the existing version cannot change, because it needs to be stable! So the new version would need another name, and would end up including a unique, randomly chosen identifier in its object store path after all.
  • The same object path can appear on multiple branches. If I branch out of main to a new branch dev, the branch name prefix “main/” in the object store only adds confusion.  Similarly, after a merge the prefix will not be helpful.
  • Finally, object store pathnames are long. Standard data lake naming conventions, such as Azure’s recommendations, combined with table partitions, can easily yield pathnames that are hundreds of characters long! Storing all of them in full would bloat lakeFS metadata.

Instead, all object names are random strings.

lakeFS uses its object store in an immutable manner: anything uploaded is never changed there.  Renaming an object on the object store would require renaming all references to it. 
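For illustration, here is a minimal Go sketch of minting such a name. The 20-character base32 names in the listings above and below resemble xid identifiers; whether lakeFS uses this exact library is an assumption here, and any generator with the same stability and uniqueness guarantees would do:

package main

import (
    "fmt"

    "github.com/rs/xid" // globally-unique, timestamp-prefixed IDs; an assumption, not necessarily what lakeFS uses
)

func main() {
    // The ID is generated exactly once, when the object's data is first
    // written. It never changes afterwards (stable), and no other object
    // will ever receive it (unique), so it can safely name immutable data.
    id := xid.New()
    fmt.Println("data/<staging-token>/" + id.String()) // e.g. data/<staging-token>/cg6p50nd77pn0cke6jk0
}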

What’s in a name?  (data/ edition)

lakeFS stores all repository data objects under the prefix data/.  Let’s peek in there!

❯ aws s3 ls s3://$storage_namespace/data/
                          PRE gkfl0n82r3o715gv9hg0/
                          PRE gkfl0qitvaclb8o14vdg/
                          PRE gkg7uu02r3o715gv9h8g/
                          PRE gkg809atvaclb8o14v80/
                          PRE gkg8ulitvaclb8o14v50/
                          PRE gkg9820a1b1dmlf8fu6g/
                          PRE gkgclnl6qh8ttnpbl61g/
                          PRE gkgcm30qitsjq3gqsfo0/
                          PRE gm3j76auepe2pfne1sc0/
                          PRE gm3j76t2vadktuh5sv10/
                          PRE gp0n1l7d77pn0cke6jjg/
                          PRE gpqcrbnald867bpq783g/
                          PRE gq9d1csnsopenecnt8og/
                          PRE gq9d5lk7ifk0v5jqvf40/

Remember, this is a small repository with only a few commits. Looking under the first and last prefixes, we see that the prefixes are sorted in reverse chronological order:

❯ aws s3 ls s3://$storage_namespace/data/gkfl0n82r3o715gv9hg0/
2023-10-18 13:25:40       6898 cknr58o2r3o715gv9hgg
2023-10-18 13:25:55       5437 cknr5cg2r3o715gv9hh0
❯ aws s3 ls s3://$storage_namespace/data/gq9d5lk7ifk0v5jqvf40/
2023-01-09 17:29:27          4 ceu35lk7ifk0v5jqvf60

Without going into too many details: the names “g…” at the top level under data/ are staging tokens: lakeFS puts all uncommitted data that it manages for a branch in a staging area named by a token. These tokens list in rough reverse chronological order, which helps garbage-collect objects that were never committed. lakeFS switches tokens during commits and around related activities. Full technical details are available in the designs for KV storage of uncommitted objects and for their garbage collection.
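How can a listing come out in reverse chronological order, when object stores list keys lexicographically? One way to get that property (shown here as an illustration of the technique, not as lakeFS’s documented token format) is to start each token with an inverted timestamp, so that newer tokens compare smaller:

package main

import (
    "encoding/base32"
    "fmt"
    "math"
    "strings"
    "time"
)

// base32hex encoding preserves byte order: if a < b as byte strings,
// their encodings compare the same way.
var enc = base32.HexEncoding.WithPadding(base32.NoPadding)

// reverseToken embeds an inverted timestamp so that newer tokens sort
// first in a lexicographic listing. Hypothetical, for illustration only.
func reverseToken(t time.Time) string {
    inv := math.MaxUint64 - uint64(t.UnixNano())
    var buf [8]byte
    for i := 7; i >= 0; i-- {
        buf[i] = byte(inv)
        inv >>= 8
    }
    // Lowercasing keeps the ordering: the base32hex alphabet maps
    // monotonically onto its lowercase form.
    return strings.ToLower(enc.EncodeToString(buf[:]))
}

func main() {
    older := reverseToken(time.Now().Add(-time.Hour))
    newer := reverseToken(time.Now())
    fmt.Println(newer < older) // true: a plain listing shows newest first
}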

How does lakeFS find my data?

To use lakeFS data, I use a lakeFS URL such as lakefs://ariels-repo/main/allstar_games_stats.csv. It indicates a repository (ariels-repo) and a branch (main) in which to find the object at path allstar_games_stats.csv. The lakeFS objects and versioning component, named Graveler, needs to search several places, in order (sketched in code after this list):

  • In the Graveler KV store under the current staging token of branch main. If the object was recently uploaded and never committed, it will be found there.
  • In the Graveler KV store under any sealed staging tokens that are currently being committed. If the object is being committed (or, in the future, if lakeFS needs to change something else in how it manages staging), it will be found there.
  • In the Graveler metarange, the stored list of all objects as they appeared in the latest commit for branch main.
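Here is that search order as a runnable Go sketch. All the types and storage below are hypothetical, simplified stand-ins; the real logic lives in the lakeFS sources:

package main

import "fmt"

// Entry is a simplified version record; the real one carries more fields.
type Entry struct{ PhysicalAddress string }

// Branch holds the state Graveler consults, in lookup order.
type Branch struct {
    StagingToken string   // current staging token
    SealedTokens []string // tokens frozen while a commit is in flight
    CommitID     string   // latest commit on the branch
}

// Hypothetical stand-ins for the Graveler KV store and committed metaranges.
var (
    kv         = map[string]map[string]Entry{} // staging token -> path -> entry
    metaranges = map[string]map[string]Entry{} // commit ID -> path -> entry
)

// lookup searches in exactly the order described above: current staging
// token, then sealed tokens, then the latest commit's metarange.
func lookup(b Branch, path string) (Entry, bool) {
    if e, ok := kv[b.StagingToken][path]; ok {
        return e, true
    }
    for _, token := range b.SealedTokens {
        if e, ok := kv[token][path]; ok {
            return e, true
        }
    }
    e, ok := metaranges[b.CommitID][path]
    return e, ok
}

func main() {
    b := Branch{StagingToken: "gp0n1l7d77pn0cke6jjg", CommitID: "c1"}
    metaranges["c1"] = map[string]Entry{
        "allstar_games_stats.csv": {PhysicalAddress: "data/gp0n1l7d77pn0cke6jjg/cg6p50nd77pn0cke6jk0"},
    }
    e, _ := lookup(b, "allstar_games_stats.csv")
    fmt.Println(e.PhysicalAddress)
}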

Whichever of these matches first holds a record that includes the so-called “physical” path s3://$storage_namespace/data/gp0n1l7d77pn0cke6jjg/cg6p50nd77pn0cke6jk0 that we saw earlier. The same object usually has many names on lakeFS, one for each version in which it appears. They can all share the same physical path. That’s why lakeFS does not copy any data objects when committing, merging, or branching out.

How can I see it?

As a data management platform, lakeFS of course controls and is always aware of the locations of all versions of your data. lakeFS also allows controlled direct access to data for clients. To determine this physical location you can use lakectl fs stat without pre-signing:

❯ lakectl fs stat --pre-sign=false lakefs://ariels-repo/main/allstar_games_stats.csv
Path: allstar_games_stats.csv
Modified Time: 2022-11-10 11:30:31 +0200 IST
Size: 2146 bytes
Human Size: 2.1 kB
Physical Address: s3://.../83e1cc11afc447eea553efac96a1bc5d
Checksum: 99d48a86ddc815ebaaafc6ebf862663b
Content-Type: text/csv

The “physical address” field holds the exact location of my data on S3. With proper credentials I can even access it directly! However, you must never modify this data in any way: it is controlled by lakeFS.

It’s probably safer to access the data with a presigned URL. If you have permissions to read the object, lakeFS can give you direct HTTP access to the underlying object on S3:

❯ lakectl fs stat lakefs://ariels-repo/main/allstar_games_stats.csv
Path: allstar_games_stats.csv
Modified Time: 2022-11-10 11:30:31 +0200 IST
Size: 2146 bytes
Human Size: 2.1 kB
Physical Address: https://bucket-name.s3.us-east-1.amazonaws.com/.../83e1cc11afc447eea553efac96a1bc5d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAAWSKEY%2F20240205%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240205T134053Z&X-Amz-Expires=900&X-Amz-Security-Token=XYZZY.long.secret.data.XYZZY&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=abcdef99887766554443221
Physical Address Expires: 2024-02-05 15:55:53 +0200 IST
Checksum: 99d48a86ddc815ebaaafc6ebf862663b
Content-Type: text/csv

Any HTTP client can read the data, directly from S3, without going through lakeFS:

❯ curl -o - 'https://bucket-name.s3.us-east-1.amazonaws.com/.../83e1cc11afc447eea553efac96a1bc5d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAAWSKEY%2F20240205%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240205T134053Z&X-Amz-Expires=900&X-Amz-Security-Token=XYZZY.long.secret.data.XYZZY&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=abcdef99887766554443221' | head -3
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  2146  100  2146    0     0   2342      0 --:--:-- --:--:-- --:--:--  2342
Season,Age,Team,ORB,TRB,AST,STL,BLK,TOV,PF,PTS,Player
2004-05,20,CLE,1,8,6,2,0,3,0,13,Lebron James
2005-06,21,CLE,2,6,2,2,0,1,2,29,Lebron James

The object really is there!

lakeFS can show you any version of your object.  For instance, to see data on branch dev from 3 versions ago, use reference dev~3:

❯ lakectl fs stat lakefs://ariels-repo/dev~3/allstar_games_stats.csv

Can I leave?

lakeFS always handles this indirection for you, and you never need to manage these details directly. Even if you need to snapshot your data or to leave lakeFS entirely, you can learn how to export your data or how to migrate away. (We’ll be sorry to see you go; please tell us why!)
