Itai David
September 20, 2022

This blog discusses advanced topics within lakeFS. If you are new to lakeFS, or would like to expand your knowledge of how lakeFS works, make sure to check out our documents section.

In the Beginning There Was Postgres

Up until recently, lakeFS was using a strongly consistent SQL DB, namely PostgreSQL, where all metadata was organized in tables, much as one can expect. While PostgreSQL has many advantages, it also falls short in some aspects. Like other SQL databases, it is designed to be scaled vertically, and as space and processing requirements grow, that might become a limitation. This scaling process and other maintenance tasks, may become the user responsibility, requiring knowledge and effort. Even if you run your PostgreSQL on a managed system, such as Amazon RDS, scaling it usually requires some downtime.

For a long time, lakeFS architecture was bound to PostgreSQL. While definitely operable, it forced many of us, using lakeFS to use PostgreSQL. As we are working on managed lakeFS solution, and aspire to bring the maximal value and flexibility to our users, these limitations became more notable.

After many conversations, we  decided to loosen this tight coupling to PostgreSQL and let the users choose their favorite DB. This can be any database that fits your current architecture, performance needs, your scaling plans or your security requirements. You can still use PostgreSQL.

lakeFS v0.80.0 released recently, introduces lakeFS with a Key Value Store. lakeFS abandoned the SQL way of work, and moved to the more flexible KV Store. Along Came Key Value (KV)  Store.

Along Came Key Value (KV)  Store

When deciding to switch from SQL DB to KV Store, while decoupling from PostgreSQL, we had our users in mind. We tried to give you the highest level of flexibility with your DB selection. Our KV is designed as a very generic KV Store, supporting a minimal set of Get/Set/Scan/Delete and a very required SetIF operation.

somewhat pseudo code based on store.go

type Store interface {
	Get(partitionKey, key []byte) (*ValueWithPredicate, error)
	Set(partitionKey, key, value []byte) error
	SetIf(partitionKey, key, value []byte, valuePredicate Predicate) error
	Delete(partitionKey, key []byte) error
	Scan(partitionKey, start []byte) (EntriesIterator, error)
	Close()
}

Backing this generic store, there is a database, accessible via a specific driver. This driver implements the actual store operations over the specific database, and it is a key component in the flexibility aspect. This driver implements the basic KV Store operations. Out of the box, lakeFS can run its KV Store over PostgreSQL database (yes, the same PostgreSQL, used differently) and DynamoDB. It also supports a local, non-persistent, in-memory database, for testing and experimentation purposes, and we are about to release an additional driver, to support Badgerdb, a local fast DB. We will definitely provide other drivers, to support other databases, in the future, but you can add support to your own database of choice by implementing a driver following this interface. For example, You can take a look at our currently implemented dynamodb driver and postgres driver, and follow along to better understand the functionality. When you do implement a new driver, please submit a pull request, so we can include your code with our next release of lakeFS. All contributions are most welcome 😀.

On top of the generic KV Store, we implemented various layers that make specific use cases easier to implement. These layers support the different usages by the different services, but at the end, all layers rely on the basic KV interface, and so, remain unaffected by any changes to any modification concerning the underlying database.

This simplistic architecture, perfectly aligns with lakeFS` services architecture.

lakeFS on DynamoDB

Let’s discuss scalability. Even though the requirements, posed by lakeFS metadata usage, are a mere fraction of those of the user-data, it still requires database resources. Depending on the amount of user-data and usage patterns, the metadata requirements may reach a point where one may need to scale the database. As explained earlier, with PostgreSQL, this may not be trivial. Running our KV Store over PostgreSQL, though absolutely a workable option, might not be the best solution for everyone.

As aforementioned, one of the key motivations to change lakeFS’ database approach, is the ability to use different backing databases, according to one’s needs and preferences. To back that approach, we released a DynamoDB driver, which enables you to run lakeFS with KV Store over DynamoDB, out of the box. All you have to do is to make sure you have an accessible DynamoDB instance, running LakeFS server and configure lakeFS accordingly. lakeFS deployment guide contains some examples of such configurations.

DynamoDB is a designated NoSQL DB, designed for scale and low latency. It is a highly partitionable NoSQL database, that can scale horizontally, with no downtime. The fact that it is a fully managed solution, takes the maintenance burden off the user.

While our KV Store PostgreSQL driver forces a table to behave like a KV Store, using one table column as a partition key and another as a key, our DynamoDB driver elevates the database partition key (also known as `hash attribute`) and key (`range attribute`) to store and access elements. DynamoDB optimizes its storage by keeping items in the same partition physically close together, sorted by their `range attribute` – key – and so, it can provide greater efficiency with KV Store operations – especially `Scan` operations.

lakeFS, as a data repository, is required to provide strong consistency: applications should receive the latest version of data, introduced by the data sources, and consumers should have access to the most recent version of processed data. Once data is committed, you should be able to get this snapshot, or any other, as one piece. It is the single source of truth for your data. To provide that strong consistency, lakeFS uses its database to hold the metadata for any requested data state, and so, all database reads must also be strongly consistent. DynamoDB, on the other hand, is an eventual consistent DB, in order to reduce latency and reduce the chance of failures due to temporary network outages. However, DynamoDB supports Strong Consistency, by a designated flag – `ConsistentRead` – with its `GetItem` and `Query` APIs. This makes all lakeFS’ read operations, from DynamoDB, reflect the most recent updates.

You can read more about the challenges we faced, and the solutions we implemented, as part of lakeFS’ transition to KV Store over DynamoDB in our KV in a Nutshell document.

Summary

This post introduces a major new change in the way lakeFS manages its metadata – the implementation of a KV Store which replaces our SQL DB – and the resulting decoupling from PostgreSQL, much to our users convenience.
Our DynamoDB support, as an underlying database, introduces greater scalability, and much easier maintenance as a managed database. As DynamoDB supports hash partitioning, it will provide better data distribution, and as a result, better performance as KV store, then SQL databases, as PostgreSQL. 

Feel free to read more in our lakeFS on KV Design, browse our code on GitHub and join our Slack group.

LakeFS

  • Get Started
    Get Started
  • Join our live webinar on October 12th:

    Troubleshoot and Reproduce Data with Apache Airflow
    +