We recently released the first version of lakeFS built on the sstable library from Pebble, a RocksDB-inspired key-value store written in Go. The release introduced a new data model that is much closer to Git's. Instead of relying on a PostgreSQL server that quickly becomes a bottleneck, committed metadata now lives on the object store itself.
Early on we realized that we needed to cache those metadata files in the lakeFS local disk storage.
The lakeFS “secret sauce” is fast metadata operations. A diff operation is far less useful if it has to fetch dozens of 10MB files from the object store just to compare entries. Caching metadata files locally with LRU eviction made sense for us because some files are accessed more frequently than others, for example the metadata files that represent the latest commit of master.
Another argument supporting our decision to add a local cache is that those files are immutable by nature. A commit represents a snapshot of your objects at a certain point in time and is guaranteed not to change. Not having to deal with cache invalidation is a huge advantage since consistency is a must for our users.
This led us to conclude that simply keeping the most commonly accessed files locally, and handling eviction by deleting them, would give us the performance boost we wanted. How hard could that be?
Challenge #1 – Bookkeeping storage usage
We wanted to use one of the many in-memory cache implementations in Go to keep track of the files on disk and decide when to evict them. The pattern we had in mind:
- Whenever we store a file on disk (when a new metadata file is written, or when we fetch it from the object store), insert the filename into the in-mem cache.
- When the file is accessed, try to get the filename from the in-mem cache to update the last access time.
- Whenever the in-mem cache evicts the file, delete it from the disk.
Our requirements for the in-mem cache were clear:
- LRU with cost-based eviction. Not all files are the same size. It’s sometimes better to evict one 10MB file than five 2MB files, if the probability of fetching that file is lower than the probability of fetching any one of those five files.
- Allow running a hook when eviction occurs – so that we can delete the file from the local storage.
Dgraph’s Ristretto was a great fit for us due to its cost-based eviction, pluggable hooks, great throughput, and concurrency control. The downside is that in some cases it was too smart for us: it can silently reject new entries from entering the cache. We can live with rejected files being immediately deleted from local storage. However, if we don’t know that a rejection happened, we lose the chance to delete the file, and lakeFS could eventually exhaust its disk space! Ristretto only recently introduced the “onReject” hook (not officially released yet) that, together with the “onEvict” hook, gives us everything we need.
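The bookkeeping pattern above can be sketched with a small stand-in cache built only on the Go standard library. This is not Ristretto and does not reproduce its admission policy or concurrency control: `costLRU`, `Set`, `Touch` and the hook fields are hypothetical names, eviction is plain least-recently-used under a cost budget, and the only rejection modelled is a file that can never fit. The point is simply that every path leaving (or never entering) the cache reaches a hook, so the file on disk can be deleted.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one tracked file: its path and its cost (size in bytes).
type entry struct {
	path string
	cost int64
}

// costLRU is a simplified, NON-concurrent stand-in for the in-memory cache.
// All names here are illustrative; this is not Ristretto's API.
type costLRU struct {
	budget   int64
	used     int64
	order    *list.List // front = most recently used
	items    map[string]*list.Element
	onEvict  func(path string) // called when a tracked file is pushed out
	onReject func(path string) // called when a file is never admitted
}

func newCostLRU(budget int64, onEvict, onReject func(string)) *costLRU {
	return &costLRU{
		budget:   budget,
		order:    list.New(),
		items:    map[string]*list.Element{},
		onEvict:  onEvict,
		onReject: onReject,
	}
}

// Set records a file. It returns false and fires onReject if the file
// is larger than the whole budget.
func (c *costLRU) Set(path string, cost int64) bool {
	if cost > c.budget {
		c.onReject(path)
		return false
	}
	if el, ok := c.items[path]; ok {
		e := el.Value.(*entry)
		c.used += cost - e.cost
		e.cost = cost
		c.order.MoveToFront(el)
	} else {
		c.items[path] = c.order.PushFront(&entry{path: path, cost: cost})
		c.used += cost
	}
	// Evict least recently used files until the budget is respected.
	for c.used > c.budget {
		e := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, e.path)
		c.used -= e.cost
		c.onEvict(e.path)
	}
	return true
}

// Touch marks a file as recently used and reports whether it is tracked.
func (c *costLRU) Touch(path string) bool {
	el, ok := c.items[path]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

func main() {
	var deleted []string
	// A real hook would call os.Remove(path); here we only record the path.
	hook := func(p string) { deleted = append(deleted, p) }
	c := newCostLRU(10, hook, hook)
	c.Set("a", 4)
	c.Set("b", 4)
	c.Touch("a")      // "a" is now more recent than "b"
	c.Set("c", 4)     // over budget: "b" is evicted
	c.Set("huge", 11) // can never fit: rejected
	fmt.Println(deleted)
}
```

Wiring the same delete function into both hooks mirrors the requirement in the text: an evicted file and a rejected file both need to be removed from local storage.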
Challenge #2 – Storing metadata’s metadata?
We didn’t want to keep track of any additional information about the locally stored files other than their location on disk (which is the key in the Ristretto cache). We also wanted a mapping between a file’s local path and its object store path, so that when lakeFS restarts it knows which files are available in the local cache. Having local paths that match the object store paths also helps observability. We therefore mapped each object-store path to a local disk path, e.g. <bucket-prefix>/foo/bar/file1.json to <root-cache-dir>/foo/bar/file1.json.
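A minimal sketch of such a mapping, in both directions, might look like the following. The function names `localPath` and `objectKey`, the example bucket prefix, and the cache root are all assumptions made for illustration; the actual lakeFS code may differ.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// localPath maps an object-store key to its location under the cache root.
// bucketPrefix and cacheRoot are hypothetical parameters.
func localPath(bucketPrefix, cacheRoot, key string) (string, error) {
	if !strings.HasPrefix(key, bucketPrefix) {
		return "", fmt.Errorf("key %q is outside prefix %q", key, bucketPrefix)
	}
	rel := strings.TrimPrefix(key, bucketPrefix)
	// Normalise the remainder and stop ".." from escaping the cache root.
	rel = path.Clean("/" + rel)
	return filepath.Join(cacheRoot, filepath.FromSlash(rel)), nil
}

// objectKey is the reverse mapping, useful when rebuilding the in-memory
// index from the files found on disk after a restart.
func objectKey(bucketPrefix, cacheRoot, local string) (string, error) {
	rel, err := filepath.Rel(cacheRoot, local)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q is outside cache root %q", local, cacheRoot)
	}
	return strings.TrimSuffix(bucketPrefix, "/") + "/" + filepath.ToSlash(rel), nil
}

func main() {
	const prefix = "s3://example-bucket/prefix" // illustrative value
	const root = "/var/cache/lakefs"            // illustrative value

	local, err := localPath(prefix, root, prefix+"/foo/bar/file1.json")
	fmt.Println(local, err)

	key, err := objectKey(prefix, root, local)
	fmt.Println(key, err)
}
```

Because the mapping is a pure function of the path, nothing beyond the directory tree itself has to be persisted to recover the cache contents on startup.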
Trivial as it may sound, matching directories need to be created when a file is stored, and deleted when it is removed. That almost immediately introduces a concurrency problem. Continuing the previous example, consider file foo/bar/file2.json being written while foo/bar/file1.json is being deleted.
Can you spot the concurrency problem?
The file2 writer gets a green light since the directories exist. Meanwhile, the file1 deleter sees an empty directory (foo/bar) and starts deleting it too. The file2 write now fails since the path foo/bar no longer exists.
To solve this problem, we ended up locking the directory in which the actions (stores/deletes) are performed.
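One way to picture directory-level locking is a mutex per directory path, taken by both the store and the delete paths. The sketch below assumes a hypothetical `dirLocks` type and only cleans up the immediate parent directory; it never discards unused mutexes and is not the lakeFS implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// dirLocks hands out one mutex per directory path (hypothetical helper).
type dirLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock acquires the mutex for dir and returns the function that releases it.
func (d *dirLocks) lock(dir string) (unlock func()) {
	d.mu.Lock()
	m, ok := d.locks[dir]
	if !ok {
		m = &sync.Mutex{}
		d.locks[dir] = m
	}
	d.mu.Unlock()
	m.Lock()
	return m.Unlock
}

// store creates the directory and writes the file while holding the lock,
// so a concurrent remove cannot delete the directory in between.
func (d *dirLocks) store(path string, data []byte) error {
	dir := filepath.Dir(path)
	defer d.lock(dir)()
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// remove deletes the file and then the directory if it became empty.
func (d *dirLocks) remove(path string) error {
	dir := filepath.Dir(path)
	defer d.lock(dir)()
	if err := os.Remove(path); err != nil {
		return err
	}
	// os.Remove fails on a non-empty directory; that failure is expected.
	_ = os.Remove(dir)
	return nil
}

// raceDemo repeatedly deletes file1 while writing file2 in the same directory.
func raceDemo(rounds int) error {
	root, err := os.MkdirTemp("", "tier-cache")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	d := &dirLocks{locks: map[string]*sync.Mutex{}}
	f1 := filepath.Join(root, "foo", "bar", "file1.json")
	f2 := filepath.Join(root, "foo", "bar", "file2.json")
	for i := 0; i < rounds; i++ {
		if err := d.store(f1, []byte("1")); err != nil {
			return err
		}
		errs := make(chan error, 2)
		go func() { errs <- d.remove(f1) }()
		go func() { errs <- d.store(f2, []byte("2")) }()
		for j := 0; j < 2; j++ {
			if err := <-errs; err != nil {
				return err
			}
		}
		if err := d.remove(f2); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println("error:", raceDemo(100))
}
```

Without the lock, the write in `store` can land between the deleter's file removal and its directory removal, which is the failure described above.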
Challenge #3 – Open file handles
On POSIX systems, you’re allowed to delete a file that still has open file handles. Its data is removed from the disk only once all open handles are closed.
Sometimes the Ristretto cache evicts an entry that points to an open file, and we delete the matching file from the disk. We don’t keep file handles open for long, but this means there are short periods in which the disk usage limit set by the user isn’t strictly enforced, as we note in the documentation.
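The POSIX behaviour is easy to observe with a few lines of Go. This sketch assumes a POSIX system such as Linux or macOS (on Windows the delete would typically fail instead); `readAfterDelete` is a name made up for the example.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readAfterDelete deletes a file while holding an open handle to it,
// then reads the contents back through that handle.
func readAfterDelete() (string, error) {
	f, err := os.CreateTemp("", "open-handle")
	if err != nil {
		return "", err
	}
	defer f.Close() // the disk space is only reclaimed after this close

	if _, err := f.WriteString("still readable"); err != nil {
		return "", err
	}
	if err := os.Remove(f.Name()); err != nil {
		return "", err
	}
	// The name is gone from the directory...
	if _, err := os.Stat(f.Name()); !os.IsNotExist(err) {
		return "", fmt.Errorf("expected %s to be gone, stat returned: %v", f.Name(), err)
	}
	// ...but the data is still reachable through the open handle.
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return "", err
	}
	data, err := io.ReadAll(f)
	return string(data), err
}

func main() {
	fmt.Println(readAfterDelete())
}
```

For the cache, this means the bytes of an evicted file keep counting against the disk until the last reader closes its handle, even though the cache already considers the space free.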
There were some other interesting parts that we’ll briefly mention:
- Abstracting the filesystem for all file operations: We didn’t want the logic of reading files from the object store and storing them locally to leak into every reader and writer. We introduced TierFS, which exposes the same Go FS API but stores files in two tiers: object storage and local disk.
- Testing: As always with concurrency, testing all possible races is hard to achieve. Combining it with a third-party cache that controls eviction and rejection makes it even harder.
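To illustrate the two-tier idea from the first point, here is a rough read-through sketch. It is not the real TierFS: the `remoteStore` interface, the `tierFS` type, and its `ReadFile` method are hypothetical, the object store is faked with an in-memory map, and eviction and locking are left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// remoteStore is a hypothetical minimal view of the object store.
type remoteStore interface {
	Get(key string) ([]byte, error)
}

// mapRemote fakes an object store in memory and counts fetches.
type mapRemote struct {
	objects map[string][]byte
	fetches int
}

func (m *mapRemote) Get(key string) ([]byte, error) {
	m.fetches++
	data, ok := m.objects[key]
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}
	return data, nil
}

// tierFS serves reads from local disk and falls back to the remote tier.
type tierFS struct {
	root   string
	remote remoteStore
}

// ReadFile hides the tiering from callers: they only pass a key.
func (t *tierFS) ReadFile(key string) ([]byte, error) {
	local := filepath.Join(t.root, filepath.FromSlash(key))
	data, err := os.ReadFile(local)
	if err == nil {
		return data, nil // local hit
	}
	if !os.IsNotExist(err) {
		return nil, err
	}
	// Local miss: fetch from the object store and keep a copy on disk.
	data, err = t.remote.Get(key)
	if err != nil {
		return nil, err
	}
	if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(local, data, 0o644); err != nil {
		return nil, err
	}
	return data, nil
}

// demoTierFS reads the same key three times and reports remote fetches.
func demoTierFS() (int, string, error) {
	root, err := os.MkdirTemp("", "tierfs")
	if err != nil {
		return 0, "", err
	}
	defer os.RemoveAll(root)
	remote := &mapRemote{objects: map[string][]byte{
		"foo/bar/file1.json": []byte(`{"k":"v"}`),
	}}
	t := &tierFS{root: root, remote: remote}
	var last []byte
	for i := 0; i < 3; i++ {
		if last, err = t.ReadFile("foo/bar/file1.json"); err != nil {
			return remote.fetches, "", err
		}
	}
	return remote.fetches, string(last), nil
}

func main() {
	fmt.Println(demoTierFS())
}
```

Callers see a single read call; only the first read of a key touches the object store, and later reads are served from local disk.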
A task that started as standard file storage leveraging an existing cache package turned out to be much more complicated. Caches are tricky creatures: even under the assumption of immutability, they are hard to get right. It is important to use the right mapping between the object store and the local disk, and between the local disk and the in-memory representation of the available files. During eviction, you should also consider the subtleties of the item being evicted, such as whether it is still open.