Tal Sofer

Tal Sofer is a product manager at Treeverse, the company behind lakeFS.

Published on October 28, 2024

Originally presented at Big Data LDN 2024.

More than two decades ago, data warehouses outgrew the capacity of single machines, and scaling them became costly and inefficient. This prompted the tech industry to rethink the architecture and move to distributed systems.

If we wanted to store more data, we just bought more commodity hardware. But what about when we tried to query that data? Well, that didn’t scale so well.

Fast forward 20 years, and we got used to the concept of data lakes that are structured somewhat like this diagram:

Data lake data versioning

At the bottom layer, we have storage, which has shifted from self-hosted commodity hardware to object storage. Right on top of it, we have the open table formats that brought performant SQL to the world of big data.

The third layer is the metastore or catalog that tracks those structured tables, whether the legacy Hive Metastore or newer catalogs such as Unity Catalog, Glue, or Snowflake. At the top, we have the distributed compute engines that you’re all familiar with, which allow us to query such huge amounts of data.

How did we get from files on disk to petabyte-scale data lakes? 

Data evolution

Well, this has been a gradual evolution. In the early 2000s, Google published the famous MapReduce paper, which revolutionized big data processing, bringing distributed computing into the industry’s awareness. 

However, the outputs of those MapReduce jobs proved inefficient for large-scale analytics, so we had to shift to a more structured format. 

First, we tried to use CSV and TSV, which existed long before. However, they also proved inefficient for more complex queries or aggregations, leading to the emergence of columnar file formats such as Parquet or ORC, which were designed to solve these scale problems.

However, even with those columnar file formats, it was still very difficult to select specific data due to the lack of indexing in big data systems. 

Then partitioning was introduced, splitting files into directories according to logical segments. This made it possible to query large-scale data, but it was very complex: you had to write cumbersome MapReduce jobs to execute even a simple query, when all we wanted was to use SQL.

So, Hive Metastore was introduced to solve this exact problem. Hive Metastore and Hive-style tables lasted for a long time – some would say too long. While they created a warehouse-like user experience for writing SQL and getting results, quite a few challenges remained unsolved in big data processing.

Problems in big data processing Hive Metastore failed to solve

Transactional consistency

Although the Hive Metastore and Hive-style tables brought SQL to big data processing, they didn’t provide full transactional support, making it very difficult to ensure that changes were ACID-compliant in big data environments.

Schema change management

The next problem was managing schema changes without breaking existing workflows or having to rewrite huge amounts of data. 

Performance and scalability

The next challenge was performance and scalability. Although these solutions operated in large-scale environments, they didn’t provide enough optimization to deliver satisfactory read and write performance.

Reproducibility 

It was also almost impossible for teams to reproduce a dataset at a specific point in time, which is crucial for efficient debugging and achieving compliance.

Error recovery

When something went wrong, it was extremely difficult to recover quickly. Teams struggled to understand what caused an issue, and even when they identified the origin, there was simply no big red button they could press to bring the system back to a healthy state.

Write-Audit-Publish

Lastly, there was no easy or standard way to implement the Write-Audit-Publish pattern. This pattern separates data-producing processes from data-publishing processes, adding an auditing phase between those two and preventing bad data from reaching production.

Open table formats: a potential solution?

Solving these issues was no walk in the park. Let’s examine the various solutions that teams have been testing since.

While Hive Metastore and Hive-style tables solved key challenges in big data processing by bringing SQL support, the need to solve the remaining challenges became increasingly clear at this point. 

Next came the open table formats such as Delta Lake, Iceberg, or Apache Hudi. These open table formats were to Hive-style tables what Parquet was to CSV. They really shifted how we treated and managed big data systems.

Why were open table formats game changers? 

First, they brought to life powerful features designed to operate at scale. 

Open table formats introduced ACID transactions, making it possible to ensure the consistency and reliability of large-scale systems. Time travel made it easy to query historical versions of the data, whether we wanted to look at last week’s or last year’s tables.

Open table formats also enabled schema evolution on large-scale systems without having to rewrite huge amounts of data or break existing workflows, thanks to features like adding or renaming columns and schema enforcement.

Finally, these open table formats were designed to perform at scale, employing techniques like data pruning, caching, better partitioning, advanced query planning, and optimized file formats to speed up analytical queries. 

Did open table formats solve all the issues?

Solving these problems is quite an ambitious task, so let’s break it down and see how open table formats address these issues:

Problem                     | Solved?
Transactional consistency   | Yes, even pretty well!
Schema evolution challenges | Yes
Performance and scalability | Yes, they were literally designed for this

But here is where things get a little more interesting. Open table formats don’t fully solve the next three problems, and I want to explain why.

Let’s break it down and look at them one by one. 

Reproducibility, Error Recovery, Write-Audit-Publish and Open Table Formats

Using the time travel capability of open table formats, we could query historical versions of a table, enabling reproducibility at the table level. 

With open table formats, we could restore a table to a healthy state, and we could shallow clone a table to test changes on it, but again, all of these operations work only at the table level.

The table level is just not effective enough. Why? Because, in reality, we don’t manage single tables; we work across environments. When working with data lakes, thinking in tables is like thinking of one piece of a larger puzzle.

Just think of the data you usually work with. Perhaps you are thinking about your bronze, silver, or gold environments, or about a critical data project. You’re not likely to think of specific tables. No one wakes up in the middle of the night thinking about “table XYZ”.

Because here’s the deal: tables are just building blocks. 

When we try to reproduce data or recover from errors, we do that at the environment level. What happens when we rely on table-level-only solutions to solve the remaining problems? Spoiler alert: it’s not going to be very pretty.

What is reproducibility, and why are open table formats not enough?

Let’s discuss reproducibility and why it’s an important concept in our daily work with data. As long as a process is deterministic, reproducibility ensures that it produces the same result every time we run it on the same input data.

Consider our data workflows. If we have a data workflow that reads data, processes it, and then writes the output, we must ensure that we’re working from the same baseline to tweak it for some improvement. Otherwise, if our results change, we can’t tell if it’s due to our input data changing or the code modifications we made. 

Reproducibility is especially important when the data we deliver impacts customers. If something goes wrong, we want to be able to figure out what happened and then fix it as soon as possible.

Now, let’s take an example: 

Imagine you’re a data engineer working at Spotify on a team building the Discover Weekly playlist. This playlist is generated based on user listening habits, local trends, etc. 

Let’s try to build something reproducible. 

Disclaimer: All the examples use Delta Lake syntax but apply to other table formats.

Here I’m looking at the liked songs table with a simple SQL query. If I run this query today, I get certain results. But what if I run it tomorrow or a few months later? The results may change. Why is that? Because the user may have liked some additional songs, or something in the table was updated or changed.
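For illustration, such a query might look like the following PySpark sketch; the table and column names (liked_songs, user_id, track_id, liked_at) are hypothetical, not from the talk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("discover-weekly").getOrCreate()

# Hypothetical "liked songs" query: which tracks did a user like recently?
recent_likes = spark.sql("""
    SELECT user_id, track_id, liked_at
    FROM liked_songs
    WHERE liked_at >= date_sub(current_date(), 30)
""")
recent_likes.show()
```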

Achieving reproducibility with time travel

This is where Delta Lake time travel comes in. 

Delta Lake time travel allows us to query specific historical versions of the data, ensuring that if I run the exact same SQL query today or at any other time, I will get the same results: the same code, same input, same result.
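As a sketch (the version number and timestamp are illustrative), pinning the query with Delta Lake’s time travel clauses looks like this:

```python
# Assumes the active SparkSession `spark` from the earlier sketch.
# Pin the read to an explicit version or timestamp so reruns see identical input.
by_version = spark.sql("SELECT * FROM liked_songs VERSION AS OF 128")
by_timestamp = spark.sql("SELECT * FROM liked_songs TIMESTAMP AS OF '2024-10-21'")
```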

Achieving reproducibility with open tables

But things get messy when dealing with more than a single table. 

It becomes really difficult to select the correct version of each one of those tables. It’s an endless process. It’s doable, but you have to do a lot of bookkeeping and ensure your input aligns with the exact same point in time, and it gets really confusing quickly.
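To make the bookkeeping concrete, here is a rough sketch of pinning each input table to its own version; the table names and version numbers are hypothetical:

```python
# Assumes the active SparkSession `spark` from the earlier sketch.
# Every input table needs its own pinned version, tracked by hand somewhere.
pinned_versions = {
    "liked_songs": 128,
    "listening_history": 342,
    "local_trends": 57,
}

inputs = {
    name: spark.sql(f"SELECT * FROM {name} VERSION AS OF {version}")
    for name, version in pinned_versions.items()
}
```

Multiply that by every table a pipeline touches, and keeping all the pinned versions aligned to the same point in time is exactly where it gets confusing.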

Error recovery and open table formats

Now, let’s discuss error recovery. Your team members may start sending you messages expressing their annoyance with the results of the latest release you published. 

As a good data engineer, you start debugging and troubleshooting to understand what went wrong. If only the liked songs table were corrupted, then we could use Delta Lake’s restore functionality to bring it back to a healthy state.
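For a single table, that restore is one statement (the version number is illustrative):

```python
# Assumes the active SparkSession `spark` from the earlier sketch.
# Roll the corrupted table back to its last known-good version.
spark.sql("RESTORE TABLE liked_songs TO VERSION AS OF 127")
```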

But how would you go about rolling back hundreds of production tables?

Write-Audit-Publish and open table formats

The last challenge that open table formats don’t fully solve is the ability to easily implement the Write-Audit-Publish pattern. 

So, let’s explore the Write-Audit-Publish pattern. It’s a pretty straightforward concept: first, I write data, then I run automated checks on that data. If the tests pass, I publish the data for consumption. Essentially, it’s like a CI/CD pipeline for data.

Now, why does this matter? Well, in reality, things are much more complex. 

Typically, there isn’t just a single producer and a single data consumer. Instead, we have a much more intricate dependency graph within our systems. One process might write data to a bronze table, while other processes aggregate it into silver tables, with everything consumed further downstream. 

When something breaks earlier in the process, errors can cascade.

The worst-case scenario is that the data appears to be fine, but it isn’t. In such cases, your data consumers start propagating these data errors downstream, which affects the entire system.

This is why it’s crucial to implement the Write-Audit-Publish pattern into our daily data workflows. We want to prevent issues from cascading throughout the system.

Now, let’s try to implement this pattern using Delta Lake.

Let’s take the example of the recommendations table, which stores personalized recommendations for Spotify’s Discover Weekly feature. If we want to ensure that the data we write to this table is correct, we can apply the Write-Audit-Publish pattern to catch errors early, preventing them from affecting downstream systems and data consumers.

If I want to work on changes to that table, I can use Delta Lake’s shallow clone functionality to create a shallow clone of the table. Now, I can work on a copy of my table in isolation. 

Write-Audit-Publish patterns

By the way, shallow cloning means that I’m cloning the table without duplicating the data, so it’s efficient and very useful. 

I’m working on that table, applying my transformations, and once I’m done, I run some SQL validations to ensure that my recommendations are accurate and align with what the user wants to see.
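Put together, the single-table flow might look like this sketch; the table names and the validation query are illustrative:

```python
# Assumes the active SparkSession `spark` from the earlier sketch.

# Write: work on a metadata-only copy of the production table.
spark.sql("CREATE OR REPLACE TABLE recommendations_audit SHALLOW CLONE recommendations")

# ...apply the transformation to recommendations_audit here...

# Audit: validate the clone before anything reaches consumers.
bad_rows = spark.sql("""
    SELECT COUNT(*) AS n
    FROM recommendations_audit
    WHERE user_id IS NULL OR track_id IS NULL
""").first()["n"]
assert bad_rows == 0, "Validation failed, do not publish"

# Publish: the awkward part. Swap in the clone, or rerun the transformation
# against the original table? (See the discussion below.)
```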

But there are two challenges here:

The first is similar to what we’ve discussed with reproducibility and error recovery: we’re still looking at a single table. Imagine doing the same process for multiple tables, maybe even hundreds of them. Even if it’s just ten tables, the complexity increases significantly.

But there’s another challenge here.

Open table formats don’t provide a smooth way to publish changes after auditing them

So, after I run those SQL validations and I’m confident that my change is good, I need to decide. Do I switch to the cloned table or rerun the transformation on the original table? 

There are important decisions to be made, and what’s missing is a smooth process that allows me to publish data without losing the ability to audit across multiple tables.

Let’s recap the problems we argue open table formats don’t fully solve: reproducibility, error recovery, and Write-Audit-Publish. Now, taking you back through the evolution process we’ve been observing, the next solution that emerged to solve these problems was data version control systems.

Enter data version control

Data version control systems were developed with the understanding that these issues had already been solved in software engineering. 

If I want to reproduce something in software engineering, I inspect my Git commit log and check out a specific commit. If I want to recover from an error, I simply revert. The Write-Audit-Publish pattern is essentially CI/CD for data. 

So, just like Git for software engineering, data version control systems bring engineering best practices to data management.

lakeFS is the data version control system that solves these challenges. lakeFS fits right into your data workflow, ensuring that everything from ingestion to analytics can be versioned, audited, and rolled back if needed.

How lakeFS solves these problems

lakeFS and reproducibility

With lakeFS, instead of wasting time tracking down which tables to look at to reproduce a specific state, you can simply refer to a lakeFS tag. These tags have meaningful names and represent key events or checkpoints in your workflow. 

Referring to those tags automatically retrieves the correct version of each table, simplifying the process and making it more reliable.
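A minimal sketch of what that looks like from Spark; the repository, tag, and table paths are hypothetical, and it assumes the lakeFS Hadoop filesystem (lakefs:// URIs) is configured for the session:

```python
# Assumes the active SparkSession `spark` from the earlier sketches.
TAG = "discover-weekly-2024-10-21"

liked_songs = spark.read.format("delta").load(f"lakefs://spotify-lake/{TAG}/tables/liked_songs")
listening_history = spark.read.format("delta").load(f"lakefs://spotify-lake/{TAG}/tables/listening_history")

# Every table resolves to the exact state captured by the tag, with no
# per-table version bookkeeping required.
```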

This is how it looks in lakeFS: tables are essentially collections of objects. 

Achieving reproducibility on open table formats with lakeFS

Now, let’s move on to error recovery.

lakeFS and error recovery

If we encounter a problem, our first step is identifying the commit that caused the issue and then reversing it. With the lakeFS Python SDK, this becomes straightforward—essentially, we have a “big red button” we can press to restore data across multiple tables, not just one. This means we can revert changes at the data lake scale.
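A rough sketch with the high-level lakefs Python package follows; the repository and branch names are hypothetical, and the exact revert() arguments may differ between SDK versions:

```python
import lakefs

main = lakefs.repository("spotify-lake").branch("main")

# Revert the offending commit identified while debugging. This is a metadata
# operation that rolls back every table touched by that commit at once.
# NOTE: the argument names here are an assumption; check your lakefs SDK version.
main.revert(reference="<bad-commit-id>", parent_number=1)
```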

lakeFS and Write-Audit-Publish

The first step is to create a lakeFS branch, which is a metadata-only operation that doesn’t duplicate any data. 

After that, I’ll define a lakeFS hook that adds some validations to my recommendations before the merge occurs. 

Once I finish making changes, I write the output of my transformation to the isolated branch I created. When I’m ready to merge, the pre-merge hook runs. If it passes, the merge will be successful; if not, it will fail.
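Here is a rough end-to-end sketch with the high-level lakefs Python package; the repository, branch, and path names are hypothetical, and the pre-merge hook itself is configured separately as a lakeFS action:

```python
import lakefs

repo = lakefs.repository("spotify-lake")

# Write: branch off production. This is a metadata-only operation; no data is copied.
audit = repo.branch("audit-recommendations").create(source_reference="main")

# Write the transformation output (a hypothetical DataFrame `recs`) to the
# isolated branch, using the SparkSession from the earlier sketches.
recs.write.format("delta").mode("overwrite").save(
    "lakefs://spotify-lake/audit-recommendations/tables/recommendations"
)

# Audit + Publish: merging triggers the pre-merge hook. If the validations
# pass, the change lands on main atomically; if not, the merge fails.
audit.merge_into(repo.branch("main"))
```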

Versioned catalogs: the next step?

Now that we’ve established a solid foundation, addressed the key challenges, and seen how data version control systems like lakeFS work across both structured and unstructured datasets, I want to share an exciting development: we can now interact with tables more effectively by addressing them through a catalog rather than relying on object storage paths.

With versioned catalogs, we’re shifting the data version control systems perspective from managing datasets at the object level to a more intuitive table-level approach. This shift allows us to leverage the guarantees that data version control systems provide while working with our tables more naturally. 

The versioned catalog enables Git-like functionality on top of your catalog, allowing seamless versioning across tables.

Introducing lakeFS for Databricks

If you’re curious about products that support versioned catalogs, let me introduce you to lakeFS for Databricks. This solution is designed specifically for Unity Catalog and allows for versioning Delta Lake tables stored within it. 

Here’s how it works: you continue to work with your tables directly through Unity Catalog while using the lakeFS for Databricks CLI tool to perform versioning operations such as creating branches or commits. This means lakeFS operates outside the data paths, monitoring the changes made to your catalog.

lakeFS for Databricks

So why should you use it?

It allows you to maintain the same guarantees provided by data version control systems while continuing to work with your tables in a familiar way.
