

Oz Katz

Oz Katz is the CTO and Co-founder of lakeFS.

Last updated on September 26, 2024

The lakehouse architecture has become the backbone of modern big data operations, but it comes with its own set of challenges, chief among them data versioning.

The challenge of data versioning arises in various DataOps areas, including: 

  • Ability to Write-Audit-Publish (WAP) to test and verify changes before releases
  • Rolling back changes to a consistent and well-known state
  • Creating reproducible workloads that encapsulate multiple tables (and code!) 
  • Creating cost-effective, ad hoc dev/test environments with no data copies 

Fortunately, open-source tools can help overcome these issues. 

In this article, we’ll demonstrate how, by implementing Git-like semantics, Delta Lake and lakeFS can work together to improve time travel for lakehouses. 

Delta Lake provides a linear history via table snapshots, while lakeFS offers branching and merging options, resulting in higher data quality and better economics for your operations.

Let’s start with what a modern data lake looks like

Most data lakes today consist of these layers:

  • Object storage 
  • Table format
  • Metastore/Catalog
  • Distributed compute engine
Modern data lake

How do you create a resilient data lake?

A data lake that is resilient can withstand failures. But what does that mean?

We broke it down into three different categories:

  • Reproducible process – we can reproduce the issue to check if our solution works
  • Write-Audit-Publish – when releasing new data to its consumers, we want to make sure that the data matches our standards 
  • Ability to travel in time – we all wish we had a big red button we could press to undo a mistake

Let’s dive into each of these categories in detail.

1. Reproducible process

What is reproducibility?

Reproducibility means that for the same process and the same input, we can expect to get the same results (as long as the process is deterministic).

If we think about data processing, our inputs would be existing data sitting somewhere, whether it’s being streamed to us or we’re reading it from another table, data source, or an API.


We would have some process that reads it, transforms it, and eventually creates an output. That output could be another dataset, but also a dashboard, report, machine learning model, etc.

Why does reproducibility matter? 

When we want to improve any of those outputs, we want to make sure that we’re always running on top of that same baseline.

For example, if we want to improve an ETL that takes two tables and joins them, we need to ensure we keep feeding it the same input as the one being tested. If we get different results, it could simply be because our input changed. This is why we need to be able to isolate only what we’re changing, to make sure it has the desired effect.

In essence, we create a feedback loop, which is especially important when dealing with machine learning and AI.

Let’s say we want to build a machine learning model that, when given an image, will tell us whether something is a hot dog or not.

We want to be certain that the input fed into the machine learning algorithm produces a given result. If that input has changed since the moment we trained our model, catching bugs like that, or improving the model at all, becomes very hard.

How to achieve reproducibility

We’re going to use Delta Lake in this example, but this is applicable to other table formats, such as Iceberg.


We have an image classification table and we’re reading the labels out of it. If we feed the results of this query into our machine learning model today, we get one result. If we run it tomorrow, we might get different results.

But we want to keep this reproducible.


With Delta (and most similar systems), we can always refer to either a specific snapshot version or a point in time, and ask for the results as of that exact state. If we run that same query tomorrow or next month, we’re guaranteed to get the same results, as long as no one messes with the existing data and replaces it with something else.
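For example, here’s a minimal PySpark sketch of pinning a read to a snapshot; the bucket path, table name, version, and timestamp are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TABLE = "s3a://my-bucket/tables/image_classification"  # placeholder path

# Pin the read to a specific Delta snapshot by version number...
labels_v42 = (
    spark.read.format("delta")
    .option("versionAsOf", 42)
    .load(TABLE)
)

# ...or to a point in time. Re-running either read later returns the same
# rows, as long as nobody vacuums away or rewrites the underlying files.
labels_then = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-09-01 00:00:00")
    .load(TABLE)
)
```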

This can get a bit difficult to manage at some point because, typically, we’re not dealing with a single table.

If we open our Unity Catalog now, we will see 2300 tables there, which means that it’s pretty hard to maintain this approach. It takes a lot of bookkeeping to make sure that our inputs are all aligned to the same version, whether we’re using a timestamp or log entries.
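Keeping many tables aligned to the same logical version quickly turns into hand-maintained bookkeeping along these lines (all table names and version numbers below are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One pinned version per input table, kept in sync by hand for every run.
PINNED_VERSIONS = {
    "s3a://my-bucket/tables/image_classification": 42,
    "s3a://my-bucket/tables/labels": 17,
    "s3a://my-bucket/tables/annotations": 133,
}

inputs = {
    path: spark.read.format("delta").option("versionAsOf", version).load(path)
    for path, version in PINNED_VERSIONS.items()
}
```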

Let’s be real: going back to that point in time requires quite a bit of tooling and kind of messy SQL.

Before we move on to showing you the solution, let’s jump to the next key category.

2. Write-Audit-Publish

What is the Write-Audit-Publish pattern?

The idea behind the Write-Audit-Publish pattern is pretty self-explanatory if you look at its name. 

First, we make a transformation, and then we’re going to have some form of auditing process in place. This doesn’t have to be an actual person looking at the data; it could just as well be an automated test. Once these tests pass, we’re going to publish that output onto its consumers.

This pattern involves separating whoever is writing the data from those consuming the data.

Software engineers already work this way: CI actions run and produce green checkmarks, and only once those pass do they deploy to production. They typically don’t write code directly to production.

Why is Write-Audit-Publish important?

In many cases, we don’t have a single data producer and a single data consumer.  

Here’s a graph of dependencies that illustrates the situation many of us are dealing with:

Why is Write-Audit-Publish important

Just to give you an example: reading off a stream, we’re writing into our bronze table, where some other process might kick off and, say, do some aggregations. From there, it might sprawl as someone else joins a few tables to create another data set from that or enrich it.

And that’s how we end up with this pretty complex dependency graph illustrating how our data flows throughout an organization.

Let’s say we’re at the beginning of that graph and something fails. Everything depending on that data is either going to fail or be delayed. Either way, we don’t have new data to show, or the data we do have is broken. To put it mildly, not exactly a delightful outcome.

If we’re failing, at least our downstream consumers can see that.

But what happens if we succeed? We wrote bad data, but our DAG is green. We didn’t notice that we wrote a bunch of nulls where there shouldn’t be any. Our downstream consumers are going to read that data, and now they have bad data.

It’s really hard to figure out what kind of damage we’re going to face. How can we prevent this?

How to implement Write-Audit-Publish

Let’s say we have our hot dog table. We can create something called a shallow clone, which is essentially a copy of the table but based on Delta’s metadata. This metadata refers back to the data objects as they exist in our original table.

So, we get something that feels and behaves as if it’s its own copy but we don’t actually have to copy all the terabytes or petabytes of data.


We create a shallow clone where we can carry out our transformations. In our case, we’re inserting more hot dogs, of course.

Next, we run our audit. In this case, we want to do a select to make sure that the bun we’re using is of the correct type.

The publish step is missing here. In the case of Delta Lake, we have two options. We can swap the two tables, declaring that from now on the shallow clone is our source of truth.

Alternatively, we can rerun the transformation, this time on our original table. But that second run is no longer audited, which means we can’t entirely trust it.
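Putting the three steps together, here’s a rough Write-Audit-Publish sketch using Spark SQL and a Delta shallow clone. The hot_dogs table and its name/bun_type columns are made up for illustration, and SHALLOW CLONE assumes a Delta or Databricks version that supports it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write: a metadata-only copy of the production table to stage changes on.
spark.sql("CREATE OR REPLACE TABLE hot_dogs_staging SHALLOW CLONE hot_dogs")
spark.sql("INSERT INTO hot_dogs_staging VALUES ('chicago dog', 'poppy seed')")

# Audit: an automated quality gate that runs against the staged copy only.
bad_buns = spark.sql("""
    SELECT count(*) AS n
    FROM hot_dogs_staging
    WHERE bun_type IS NULL OR bun_type NOT IN ('poppy seed', 'brioche')
""").first()["n"]

# Publish: either point consumers at the clone, or replay the same insert
# against the original table (which is the part that is no longer audited).
if bad_buns == 0:
    spark.sql("INSERT INTO hot_dogs VALUES ('chicago dog', 'poppy seed')")
else:
    raise ValueError(f"Audit failed: {bad_buns} rows with a bad bun")
```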

3. Ability to travel in time

This point is best illustrated with a story about someone who made a mistake in production. Spoiler alert: yes, it was me.

I worked for an analytics company. We had about seven or eight petabytes of data on S3 and there was this one job that everyone hated. Nobody wanted to touch it. The name of the job was retention. It would look for different scenarios where data is referred to by other things and, if there were no more references, it could actually delete the underlying files.

It would run daily for 12 hours.

There were a few edge cases where data was supposed to be deleted but wasn’t. It was all kind of a mess.

We decided to fix it, make it faster and delete the stuff that wasn’t deleted. Long story short, it did finish the job faster, so we went home that evening and suddenly we started getting paged with alerts. And more alerts. And more alerts. And you get the picture…

Our entire flow was suddenly red all over the place, even stuff that was unrelated to what we touched. We had no idea what was happening so we talked to the person who was on call.

We pulled up the graphs, and he showed me the bytes stored on S3. It looked bad.

We saw a drop of about one petabyte of data: roughly one billion objects were suddenly missing.


S3 has object-level versioning. Theoretically, we should be able to restore each and every one of those one billion files that went missing. But just getting that list and understanding which data files actually got deleted by mistake took about two weeks.

In an analytics company, we need data. We all wished there was that big red button that would take us back in time, to five minutes before that ever happened. It would be amazing to just be able to get to the last known good state.

What is time travel?

We can travel in time with our existing architecture. Fortunately, Delta Lake has a restore operation: we can take a table, pull it back in time, and restore it to version X or to timestamp X.
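In Delta SQL, that single-table rollback looks something like this (the table name, version, and timestamp are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Roll the table back to a known-good snapshot by version number...
spark.sql("RESTORE TABLE hot_dogs TO VERSION AS OF 123")

# ...or by timestamp: the 'five minutes before it all went wrong' button,
# but for one table at a time.
spark.sql("RESTORE TABLE hot_dogs TO TIMESTAMP AS OF '2024-09-26 17:55:00'")
```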

But what happens if we have multiple tables?

This probably wouldn’t take two weeks to recover, but it would still require some manual effort to figure out which versions actually include those deletions and then restore the tables one by one to get back to that previous state.

Software engineers have already figured out all three categories

In the world of software engineering, we have a big red button that rolls us back to a previous version. We have reproducible processes, we have Git commits that we can send out, and we have tags and version numbers.

We have all these capabilities, so why not have something similar for our data?

Introducing lakeFS

lakeFS is an open-source data version control solution. It works like a layer between the storage itself and all the components that run on top of it.

It doesn’t matter if we have Delta tables, Iceberg, or a Spark cluster: lakeFS adds a metadata layer on top of the storage that lets you treat even your biggest S3 bucket, with petabytes of data in it, as if it were one giant Git repository.

We can create branches and commits, we can roll them back, and we can do everything that we can do for code, but for data.
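In practice, that means the same Spark code simply addresses data through lakeFS paths: the “bucket” becomes the repository and the first path element is a branch, tag, or commit. The sketch below assumes a hypothetical repository named lakehouse and that Spark’s s3a client is pointed at the lakeFS S3 gateway:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Production data, read from the main branch of the repository.
prod = spark.read.format("delta").load("s3a://lakehouse/main/tables/hot_dogs")

# Exactly the same table, but as it exists on an experiment branch...
exp = spark.read.format("delta").load("s3a://lakehouse/my-experiment/tables/hot_dogs")

# ...or as it was at a specific commit. Any ref works in the path.
pinned = spark.read.format("delta").load("s3a://lakehouse/a1b2c3d4/tables/hot_dogs")
```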

Let’s see how lakeFS can give us all the key requirements to create resilient data lakes.

Achieving reproducibility with lakeFS


We can use a tag or a version number. In this case, we’re using a tag called Model v15 prod and all these tables are in the lakeFS repository sitting on top of our S3 bucket or our Azure Blob Storage.

When we create a commit, we create a snapshot of all these tables together at a given point in time on our branch. Once we assign a tag to that commit, that tag is fully immutable. If we query that same tag again a week from now, we’re guaranteed to get the same results.  
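Here’s a sketch of that flow, assuming the high-level lakefs Python SDK (pip install lakefs) and a hypothetical repository named lakehouse; treat the method names as an approximation to check against the SDK docs for your version:

```python
import lakefs  # high-level lakeFS Python SDK, assumed installed and configured
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
repo = lakefs.repository("lakehouse")  # hypothetical repository name

# One commit captures every table on the branch at this point in time
# (this assumes the branch has uncommitted changes to snapshot); the tag
# then pins that state under an immutable, human-friendly name.
repo.branch("main").commit(message="training data for model v15")
repo.tag("Model-v15-prod").create("main")

# A week (or a year) from now, this query still returns exactly the same
# rows, for every table in the repository, because the tag can never move.
labels = spark.read.format("delta").load("s3a://lakehouse/Model-v15-prod/tables/labels")
```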


How does this work in practice?

Imagine that we’d like to test a new version of a model and want to share it with a colleague. If the test is positive, we can tag it and suggest to our colleague that they use this Unity schema and see if the data looks good to them. 

We can refer back to it even across tables, and this scales to many tables and petabytes of data. And best of all: we do it using just one tag that represents all of it together.

lakeFS & Write-Audit-Publish


Remember our hot dog example? Let’s say that we want to be able to run a transformation. In this case, we’re doing it on an isolated lakeFS branch.

We name our branch ETL plus today’s date, branching out of our main branch just like in Git. Then we run the transformation, insert into our table, run our data quality tests on the isolated branch, and we’re good to go!

When you create a branch in lakeFS, it’s a copy on write, so we don’t have to copy all these petabytes to another location. It’s a metadata-only operation, essentially just creating new pointers to existing data. This takes milliseconds and it only takes up the storage of what we modified on that branch, meaning there’s no need to duplicate the storage.

Once we run all the quality gates and things are looking good, it’s time to merge our changes into the production branch. Just like in Git, we typically recommend setting the main branch as a protected branch. This means no one writes to it directly; changes only arrive through merges, which ensures the Write-Audit-Publish pattern is always maintained.
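End to end, the branch-based flow might look like the sketch below. It again assumes the lakefs Python SDK, a repository named lakehouse, and the made-up hot_dogs table; the SDK method names are an approximation worth verifying against the docs:

```python
import lakefs  # assumed installed and configured against your lakeFS server
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
repo = lakefs.repository("lakehouse")  # hypothetical repository name

# Write: branch off main (a metadata-only, copy-on-write operation)
# and run the transformation against the branch path only.
etl = repo.branch("etl-2024-09-26").create(source_reference="main")
table = "delta.`s3a://lakehouse/etl-2024-09-26/tables/hot_dogs`"
spark.sql(f"INSERT INTO {table} VALUES ('chicago dog', 'poppy seed')")

# Audit: the quality gate sees only the isolated branch.
nulls = spark.sql(f"SELECT count(*) AS n FROM {table} WHERE bun_type IS NULL").first()["n"]

# Publish: commit and merge into the protected main branch. Consumers reading
# from main never see the new data until the gate has passed.
if nulls == 0:
    etl.commit(message="daily hot dog load")
    etl.merge_into(repo.branch("main"))
```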

In an ideal world, all our transformations look like this.

Remember that big DAG from before? What if we could branch out for each step, run modifications on multiple tables, commit them, run the tests and, if successful, merge?

Time travel with lakeFS 

Remember, lakeFS is essentially “Git for data” so by default it also enables time travel.

Here’s a commit log for a repository, and here we’re referring to a specific commit (this could be the commit from that person who deleted a bunch of data).


Using lakeFS, we can take that commit ID for that repository and do the inverse. Whatever files were deleted, we’re going to restore them, essentially giving us that big red button that we really wanted.
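As a sketch of that big red button, again assuming the lakefs Python SDK; the commit ID is a placeholder, and the revert call’s parameter name is my reading of the SDK, so double-check it against the version you run:

```python
import lakefs  # assumed installed and configured

repo = lakefs.repository("lakehouse")  # hypothetical repository name
main = repo.branch("main")

# Create a new commit on main that is the inverse of the bad commit:
# every object that commit deleted comes back, across all tables at once.
main.revert(reference="badc0ffee123")  # placeholder commit ID; verify the parameter name
```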

Using the power of both Delta Lake and lakeFS, we can achieve reproducibility, get all the quality gates in place for Write-Audit-Publish, and get the option to travel in time.

To see how lakeFS works, play around in the sandbox. If you’re looking for an enterprise-ready solution, book a demo with one of our specialists.

Bonus: For a visual talk-through of this article, watch my presentation from the 2024 Data+AI Summit.
