Einat Orr, PhD
CEO and Co-founder of lakeFS

Last updated on November 10, 2025

The explosive growth in data volumes was the catalyst for replacing traditional analytics databases with data lakes. While data lakes can handle enormous amounts of data, they don't provide all the capabilities of an analytics database. But we did not have to settle for this tradeoff: a set of technologies emerged to close the gap while keeping the best parts of data lakes.

A data lake stores data in object storage. Unlike an analytics database, where we need a schema before we can write data, a data lake lets us throw files into object storage and figure out how to read and analyze them later. But the schema a database maintains in its catalog, which is essentially its management layer, is now missing in our data lake.

In other words, we moved from a database to a much more scalable but less manageable data lake environment.

What does it mean for us as data practitioners?

Data lakes: gains and losses

[Image: What is a data lake]

The advantages of data lakes include:

  • Great write performance – This is without a doubt the greatest advantage of data lakes.
  • No schema required – We can throw data into the storage and, since object storage is highly parallel, we can write a lot of data in parallel without thinking about what will happen when we have to read it (see the sketch after this list).
  • Suitable for all data types – We can use any data format we see fit and don’t even have to be consistent with it since this problem moves to the read stage. This allows us a lot of freedom and high performance when we write structured or unstructured data.
  • Highly scalable – The scalability of data lakes matches the growing volume of data that teams need to deal with.
  • Highly durable – Data loss is extremely rare with cloud-managed object storage. S3, for example, is designed for eleven nines (99.999999999%) of durability, which is just amazing.
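As a minimal sketch of that schema-free, parallel write path, here is roughly what writing a batch of files straight to object storage might look like in Python. The bucket name and paths are hypothetical, and the example assumes pandas with s3fs installed so that s3:// paths work:

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def write_batch(i: int) -> str:
    # No schema declaration, no table definition: we simply serialize a DataFrame
    # to Parquet and drop it into the bucket. The reader deals with structure later.
    df = pd.DataFrame({"event_id": range(i * 1000, (i + 1) * 1000), "payload": "..."})
    path = f"s3://my-data-lake/events/batch-{i:04d}.parquet"  # hypothetical bucket/prefix
    df.to_parquet(path)
    return path

# Object storage has no central write lock, so many writers can run in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    written = list(pool.map(write_batch, range(16)))
print(written)
```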

That said, if we previously used analytics (OLAP) databases, the move to data lakes also cost us a lot.


We lost quite a few things that are extremely useful for managing data (especially structured data):

  • Schema enforcement and evolution – When we use a database, the schema is enforced and it can evolve safely. 
  • Read performance (secondary indexes) – Object stores effectively give you one primary index (the object key), and reads are efficient if the data is organized in the storage accordingly. But they have no secondary indexes to work with, which is where much of the power of OLAP databases comes from.
  • ACID guarantees and transactionality – Databases provide ACID guarantees (atomicity, consistency, isolation, durability), safeguards that keep concurrent transactions from interfering with one another, for example by ensuring one transaction completes before another reads the data it touches.
  • Mutability: Append, Insert, Delete, Update – If we have a table in a database and would like to delete a row, we can use all those capabilities. Objects in object storage are immutable, so if we want to change one row in a file, we need to rewrite the entire file and replace it (see the sketch after this list).
  • Standard Interface (SQL) – Databases for structured data have a standard interface; it’s called SQL. Accessing data in object storage with SQL requires the adoption of additional technologies.
  • Granular access control – Object storages provide access control as well but at the file or partition level. Databases can provide much more granular access control.
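To make the immutability point concrete, here is a minimal sketch of what "deleting one row" looks like when the file lives in plain object storage. The bucket and path are hypothetical, and the example assumes pandas with s3fs so that s3:// paths work:

```python
import pandas as pd

path = "s3://my-data-lake/users/part-0001.parquet"  # hypothetical object

# There is no UPDATE or DELETE: to remove a single row we read the whole file...
df = pd.read_parquet(path)
df = df[df["user_id"] != 42]  # drop the row(s) we want deleted

# ...and then rewrite and replace the entire object with a new version of the file.
df.to_parquet(path)
```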

So, how do we regain what we lost with the emergence of data lakes, all the while keeping their unique advantages? 

We have data infrastructure components that allow us to do even more than what databases provide. Let’s take a closer look at open table formats, metastores, and data version control systems to see how they help.

1. Open Table Formats

Object stores hold files in formats such as Parquet, CSV, or whatever format you choose to use. But when you access the data directly in the storage, what you see is a list of files. We want an abstraction of a table, so we can treat the data as if it were stored in tables and regain some of the data management capabilities we lost.

This is where open table formats come in. Open table formats basically create an abstraction of a table in the world of files.

How do open table formats work?

Open table formats work by managing a layer of metadata over the data files. The data itself is typically saved in Parquet, a columnar file format. In addition, open table formats maintain another type of file that holds metadata describing the data files.

[Image: What is an open table format]

Let's say we want to append additional data to a certain table. We would just save another file. If we wanted to delete something, we would save another file recording what we wanted to delete.

And then there would be a layer of metadata files (their names change between the different open table formats but the high-level functionality remains the same). Those metadata files hold information about the changes over time that are saved in the data files. For example, if a file includes a few deletions, the metadata file would tell us that.

When we interpret that log together with the data files, we can reconstruct the table as it stands at the current point in time. We can also control the schema this way: if we add a column, the log records that the operation happened and when. When we read the data, interpreting the log tells us that a column was added.
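As a minimal sketch of this log-plus-data-files mechanism, here is what it might look like with the Delta Lake format through the deltalake Python package. The local path and column names are hypothetical, and the same idea applies to Iceberg and Hudi with their own metadata layouts:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Create a table: this writes Parquet data files plus a _delta_log/ directory
# of JSON metadata files describing them.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
write_deltalake("/tmp/events", df)

# Append more rows: a new data file is added and a new commit is recorded in the log.
more = pd.DataFrame({"id": [4, 5], "value": ["d", "e"]})
write_deltalake("/tmp/events", more, mode="append")

# Interpreting the log together with the data files reconstructs the current table.
dt = DeltaTable("/tmp/events")
print(dt.files())      # the Parquet data files that currently make up the table
print(dt.history())    # the operations recorded in the metadata layer
print(dt.to_pandas())  # the table as of the latest version
```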

In other words, we’re adding a layer of metadata that creates an abstraction of a table.

Still, we leave a lot of work to a compute engine: someone needs to read the metadata files, interpret them, and then act on the data files to produce the latest version of the table out of all the changes we have accumulated in storage.

Most of the open table formats started with Apache Spark as the compute engine. With time, other compute engines evolved to support them, meaning those engines implemented the logic for reading and interpreting the metadata files.

What do we gain back by using open table formats?

First of all, we gained back a schema, along with support for enforcing and evolving it. We know when something has changed, and the formats provide logic for enforcing the schema or evolving it automatically.

We also gained back the most important thing: mutability. We can now treat a table as if we were in a database: append, insert, update, and delete rows, and add or drop columns, giving us the control we used to have in a database.
In addition, we gained improved read performance through the indexing, statistics, and data layout optimizations that open table formats provide; Delta Lake's OPTIMIZE command is one example.
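As a hedged sketch of that regained mutability, continuing the deltalake example above (the predicate and path are hypothetical, and APIs may differ slightly between package versions):

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/events")  # the table created in the earlier sketch

# Row-level delete: the format writes new data and metadata files for us,
# instead of rewriting the file by hand as we had to with raw Parquet objects.
dt.delete("id = 4")

# Compact many small files into fewer larger ones to improve read performance
# (roughly what Delta Lake's OPTIMIZE command does).
dt.optimize.compact()

print(dt.to_pandas())
```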

2. Metastores / Catalogs

We want open table formats to be accessible to everyone through the standard interface of SQL, so we add another layer of abstraction called the metastore. Today it's mostly called a catalog, but to avoid confusion with general-purpose data catalogs that are not metastores, let's stick to the term "metastore."

[Image: What is a metastore / catalog]

How does a metastore make open table formats accessible?

Hive Metastore was the first metastore to provide a table abstraction in the Hadoop world. Today, metastores make open table formats accessible.

As mentioned earlier, we need a compute engine to read the metadata files, interpret them, and provide us with a version of the table. This is what a metastore helps us do.

Metastores open the door to open table formats by representing each table in a database that provides an SQL interface. Now an Iceberg, Hudi, or Delta table can be represented within the metastore just as a table is described in a database schema.

Databases also manage data in storage with a schema acting as the interpretation layer on top; the metastore plays that role for the open table formats we have chosen to use.

Once we have a metastore, we gain a general SQL interface. We can now work with Hudi, Delta, or Iceberg tables through SQL via the metastore.
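As a minimal sketch of that SQL interface, here is how a Spark session might be pointed at an Iceberg catalog backed by a Hive metastore. The catalog name lake, the metastore URI, and the table names are hypothetical, and the Iceberg Spark runtime jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metastore-sketch")
    # Register an Iceberg catalog named "lake", backed by a Hive metastore.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .config("spark.sql.catalog.lake.uri", "thrift://metastore:9083")
    .getOrCreate()
)

# Once a table is registered in the metastore, plain SQL is all we need.
spark.sql("CREATE TABLE IF NOT EXISTS lake.analytics.events (id BIGINT, value STRING) USING iceberg")
spark.sql("INSERT INTO lake.analytics.events VALUES (1, 'a'), (2, 'b')")
spark.sql("SELECT COUNT(*) FROM lake.analytics.events").show()
```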

What did we gain back with a metastore?

One major gain is the abstraction of a table so it can be used with SQL. 

Another advantage is better access control. The moment this is a database schema, we can provide better access control through the metastore.

Since it also sees all the tables that we decided to manage in the open table format, it can provide some of the ACID guarantees that we lost. 

We can now implement transactionality: a transaction can span several tables and finish before anyone else has access to the data we've manipulated. This gives us the atomicity we need and the transaction safeguards that exist in databases.

3. Data Version Control Systems

Open table formats and their metastores create this almost-database on a huge scale. Yet it’s still not enough; there’s one critical component missing.

We need to be able to work with data the way we work with code. We need to experiment quickly and safely. All that should be done efficiently and without the need to copy data. 

This is where a data version control system comes into play. It provides the additional capabilities that allow us to adopt engineering best practices when working with data.

How does data version control work?

[Image: What is data version control]

Data version control helps to ensure:

  • Repository-level version control
  • Dev/Test/Staging environments for data pipelines
  • Revert/Rollback capabilities
  • Write-Audit-Publish for data pipelines
  • Full data reproducibility 

Using lakeFS, we get Git-like operations over the data.

We can now define a repository of our data, all the tables that come into play in our analytics, a model that we would like to build, or any other abstraction of the data. We can put all those data sources together in one repository and from that point on, we can use Git-like operations to manage the data.

We can open a branch to experiment in isolation, commit our changes, and move back and forth between commits. A commit records the state of all the tables, so we can travel in time based on the status of the whole repository, not just a single table. We can then merge changes made in different places into one outcome, just as we do with code.
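Here is a minimal sketch of those Git-like operations using the high-level lakeFS Python SDK (the lakefs package). The repository name, branch names, and object path are hypothetical, and exact method names may vary between SDK versions:

```python
import lakefs

repo = lakefs.repository("analytics-repo")  # hypothetical repository

# Branch out to experiment in isolation, without copying any data.
exp = repo.branch("new-pipeline").create(source_reference="main")

# Write (or overwrite) objects on the experiment branch only.
exp.object("tables/events/part-0001.parquet").upload(data=b"...parquet bytes...")

# Commit: a snapshot of the whole repository, across all tables, at this point in time.
commit = exp.commit(message="Re-run pipeline with new aggregation logic")
print(commit.get_commit().id)

# If the results look good, merge the branch back into main, just as we do with code.
exp.merge_into(repo.branch("main"))
```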

[Image: Data version control]

All these capabilities should remind you of the way you manage code and the freedom that you have to collaborate and make mistakes because you’re protected by a version control system.

In a data version control system, you can run a new version of the pipeline on a branch, and if any issues crop up, you can revert to the original state.

But that’s not all…

You can use hooks to validate data at specific points in its lifecycle, for example before a merge into the main branch. This mirrors the CI checks you run on code, and lakeFS exposes the data through an S3-compatible storage API.
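As a sketch of that Write-Audit-Publish flow: lakeFS actions are defined as YAML files stored under _lakefs_actions/ in the repository. Here the action definition is uploaded with the same hypothetical SDK objects as the earlier sketch, and the webhook URL is a placeholder for your own validation service:

```python
import lakefs

repo = lakefs.repository("analytics-repo")   # hypothetical repository
branch = repo.branch("new-pipeline")          # the branch from the earlier sketch

# A pre-merge action: before anything is merged into main, lakeFS calls a
# validation webhook and blocks the merge if the check fails.
action_yaml = """\
name: pre-merge data checks
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: validate_events_table
    type: webhook
    properties:
      url: https://example.com/validate
"""

# Hook definitions live inside the repository itself, under _lakefs_actions/.
branch.object("_lakefs_actions/pre_merge_checks.yaml").upload(data=action_yaml)
branch.commit(message="Add pre-merge data quality hook")
```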

Data version control works with the tools you already use. The only change is that you now specify the version of the data you're accessing, for example by including the branch or commit ID in the path.
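For example, with lakeFS's S3-compatible endpoint, the branch (or commit ID) simply becomes part of the path. The endpoint URL, credentials, repository, and paths below are placeholders, and the example assumes pandas with s3fs installed:

```python
import pandas as pd

storage_options = {
    "key": "AKIA...",                                             # lakeFS access key (placeholder)
    "secret": "...",                                              # lakeFS secret key (placeholder)
    "client_kwargs": {"endpoint_url": "https://lakefs.example.com"},
}

# Same table, two versions: production data on main vs. an isolated experiment branch.
prod = pd.read_parquet("s3://analytics-repo/main/tables/events/", storage_options=storage_options)
test = pd.read_parquet("s3://analytics-repo/new-pipeline/tables/events/", storage_options=storage_options)
```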

What do we gain with data version control?

The data version control layer is innovative in the sense that it doesn’t try to imitate something we had in the past. We’re not trying to gain something back but rather face the fact that the way we work with data has completely changed in the last 50 years and we can’t continue to manage data in the same way we did before. 

In which area of a data practitioner’s job does data version control make the biggest impact? Take a look at the lakeFS documentation to explore the most common lakeFS use cases.

Wrap up

Data lakes offer many advantages over traditional databases, but they also come with limitations. Open table formats, metastores, and data version control tooling can help regain the advantages of databases while keeping all the good stuff about data lakes.

If you'd like to see how these technologies play out in real life, head over here for practical examples with Databricks, Iceberg, and AWS: Building A Management Layer For Your Data Lake: 3 Practical Examples With Databricks, Iceberg, And AWS
