Paul Singman
May 19, 2021

Debugging an issue is never fun, but why make it harder?

In this post, we show how reproducing data is possible whether interacting with a single file, entire table, or data repository.


Introducing Data Reproducibility

There are two types of issues in the world — reproducible and unreproducible. 

A reproducible issue is one where the original conditions for an error can be recreated, allowing for the controlled manufacture of its occurrence. This is the state a seasoned engineer will strive to reach when debugging an error.

It is extremely hard to solve a problem you don’t yet understand. And reproducing it is the best way to establish the necessary foothold to start down the path of solving it.

A challenge when working in data-intensive environments is recreating the prior state of a dataset — what we call data reproducibility.

Think of the last time you troubleshot an error and needed to reproduce the data. What did the process look like?

There is no right answer, and yours will depend on which type of data asset your task interacted with. We identify three types:

  1. A Data Table
  2. A Data Repository 
  3. An Individual Data File or Message

Each of these data types has its own strategy for reproducing data. And to enable reproducibility in all situations, we need to understand how to handle all three.

In this article, we’ll walk through examples of reproducing data in each scenario.

In addition, we’ll touch on strategies that make it easier to achieve reproducibility for tables (hint: snapshots). And finally, since it is a newer concept designed intentionally to enable reproducibility, we’ll explain what a data repository is!

Let’s get started.

Reproducing Data Type #1: Data Tables

Whether defined in a lake or warehouse, tables can present a challenge with regard to data reproducibility. For starters, they are often quite large. And it only takes a single bad row to cause an error in a job that interacts with a table.

When this happens, it is helpful to know the state of all rows in the table at the time of the failure. Sometimes the bad record is still present and reproducing the error is easy. 

Other times the data changes and the issue mysteriously vanishes. This may seem like a good thing, but it gets old quickly to say “We don’t know why the issue happened. We just hope it doesn’t happen again.”

Spoiler alert: it usually does.

When discussing reproducibility on tables, it’s useful to split the discussion between fact tables and dimension tables.

Fact tables containing an append-only log of events (e.g. orders) allow for reproducibility in a straightforward way. By filtering on a creation-timestamp field such as meta_created_at (which best practices say should always be included), we can easily recreate the state of a table at a given time.

For example, a simple query like the one below does the trick:

select * from orders where meta_created_at <= '2021-05-19 12:53:01'
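To make the idea concrete, here is a minimal sketch that runs the same filter against an in-memory SQLite database. The table layout (order_id, amount) and the sample rows are assumptions made up for illustration; only the meta_created_at filter comes from the query above.

```python
import sqlite3


def orders_as_of(conn, as_of):
    """Return the rows of `orders` that already existed at `as_of`."""
    query = "SELECT * FROM orders WHERE meta_created_at <= ? ORDER BY order_id"
    return conn.execute(query, (as_of,)).fetchall()


# Hypothetical append-only fact table with three sample orders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, meta_created_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 20.0, "2021-05-19 09:15:00"),
        (2, 35.5, "2021-05-19 12:53:01"),
        (3, 12.0, "2021-05-19 18:40:22"),  # written after the failure
    ],
)

# Recreate the table as the failing job saw it.
rows_at_failure = orders_as_of(conn, "2021-05-19 12:53:01")
print(rows_at_failure)
```

Because rows are only ever appended, the third order simply drops out of the result and the first two come back exactly as they were. The comparison works on text here only because every timestamp shares the same format; a real warehouse would typically use a proper timestamp column.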

Dimension tables, on the other hand, present more of a challenge, specifically those of the infamous slowly-changing type.

This is because slowly-changing dimension (SCD) tables commonly use an overwrite update strategy: new changes replace the previous state of existing records.

The old state is lost forever unless we do something else — and that something else is to snapshot them¹. It is a common pattern to snapshot dimension tables and there are a couple of ways to go about it.

One is to implement your own snapshotting script, something I have experience with. If you’re a dbt user, another is to leverage the dbt Snapshots feature. 

Without going into more detail (since I plan to make snapshots the subject of a future post) the key point is that by saving snapshots of a table, we make its past states reproducible.

With snapshots we give ourselves a shortcut to understanding that User123’s email used to be blank, and that’s why our job failed.
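As a rough illustration of the hand-rolled approach, the sketch below copies the full contents of a dimension table into a snapshot table on each run, then reads back the latest snapshot at or before a given time. The table names, columns, and timestamps are hypothetical, and this full-copy strategy is only one option; it is not how dbt Snapshots are implemented.

```python
import sqlite3


def take_snapshot(conn, snapshot_ts):
    """Copy the current state of `users` into `users_snapshots`."""
    conn.execute(
        "INSERT INTO users_snapshots (snapshot_ts, user_id, email) "
        "SELECT ?, user_id, email FROM users",
        (snapshot_ts,),
    )


def users_as_of(conn, as_of):
    """Return the most recent snapshot taken at or before `as_of`."""
    return conn.execute(
        "SELECT user_id, email FROM users_snapshots "
        "WHERE snapshot_ts = ("
        "  SELECT MAX(snapshot_ts) FROM users_snapshots WHERE snapshot_ts <= ?"
        ") ORDER BY user_id",
        (as_of,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE users_snapshots (snapshot_ts TEXT, user_id TEXT, email TEXT)"
)

# Day 1: the user has a blank email, and we snapshot the table.
conn.execute("INSERT INTO users VALUES ('User123', '')")
take_snapshot(conn, "2021-05-18 00:00:00")

# Day 2: the record is overwritten in place, then snapshotted again.
conn.execute(
    "UPDATE users SET email = 'user123@example.com' WHERE user_id = 'User123'"
)
take_snapshot(conn, "2021-05-19 00:00:00")

before = users_as_of(conn, "2021-05-18 12:00:00")
after = users_as_of(conn, "2021-05-19 12:00:00")
print(before)
print(after)
```

The live users table no longer shows the blank email, but the day-one snapshot does. The trade-off of a full copy is storage: each snapshot duplicates every row, which is why metadata-based approaches exist.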

Reproducing Data Type #2: Data Repositories

Since data repositories are a newer concept, we’ll start with a definition:

A data repository is a logical namespace used to group data objects and enable git-like functionality such as committing, branching, and reverting over them.

The key point is that repositories let you create commits over data², each of which is essentially a snapshot of an entire branch of a repository at a given point in time.

Each commit gets assigned a unique commit_id that can easily be referenced to return the data to its state at the time of the commit. In this way, repositories easily allow for reproducibility of the data contained in them. 

While this might sound great, you’re probably wondering, how do I create a data repository?

The easiest way is to use lakeFS, an open-source project that enables the creation of repositories over data in an object store. Once a repository is created and data added, you are free to create commits galore!

Example from the lakeFS UI showing commits in a repository.

Using the pattern lakefs://<repo-name>/<commit-id>, we can access the data as it was at the time of any commit. For example, specifying lakefs://my-repo/006be4acddf7f52b in our code gives us the data as of the second commit.
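One way to keep a job pinned to a commit is to build every input location from the commit ID rather than from a branch name. The helper below is a hypothetical convenience function, not part of any lakeFS client, and the object path is made up for illustration; how the resulting URI is actually read depends on the client or connector you use.

```python
def lakefs_uri(repo, ref, path=""):
    """Build a lakefs:// URI from a repository, a ref, and an optional path.

    `ref` is a commit ID here; the object path is an illustrative assumption.
    """
    base = f"lakefs://{repo}/{ref}"
    return f"{base}/{path.lstrip('/')}" if path else base


COMMIT_ID = "006be4acddf7f52b"  # the commit ID used in the example above

pinned_root = lakefs_uri("my-repo", COMMIT_ID)
pinned_file = lakefs_uri("my-repo", COMMIT_ID, "tables/orders/part-0000.parquet")
print(pinned_root)
print(pinned_file)
```

Recording the commit ID alongside each job run (in logs or job metadata, for instance) is what lets you rebuild the same URI later and read exactly the data the failing run saw.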

For more information on how this works and when it makes sense to use repositories, check out the lakeFS documentation.

Reproducing Data Type #3: Individual Files or Messages

It is common in event-driven or streaming architectures to process a single input file or message. From a data reproducibility perspective, this is the simplest scenario to deal with.

Simply point the job to re-run over the same input file or message to reproduce the data.

Easy peasy!
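A minimal sketch of what that re-run looks like, assuming the original message was archived exactly as received and the job itself is deterministic. The message shape and the process_message handler are hypothetical, invented for illustration.

```python
import json


def process_message(raw):
    """Parse one JSON message and return its normalized email.

    Raises ValueError when the email is blank.
    """
    record = json.loads(raw)
    email = record["email"].strip().lower()
    if not email:
        raise ValueError(f"blank email for user {record['user_id']}")
    return email


def replay(raw):
    """Run the handler and capture the outcome instead of crashing."""
    try:
        return ("ok", process_message(raw))
    except ValueError as exc:
        return ("error", str(exc))


# The archived input, byte-for-byte what the job originally received.
archived_message = '{"user_id": "User123", "email": ""}'

first_run = replay(archived_message)
second_run = replay(archived_message)
print(first_run)
print(first_run == second_run)
```

Both runs fail in the same way, which is the whole point: as long as the input is retained unmodified, the failure can be reproduced on demand.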

Wrapping Up

With greater awareness of the different data types your tasks interact with, you can determine with confidence how best to reproduce data in the event of a failure.

If nothing else, understand that data reproducibility is possible for all types of data — from the smallest row to the largest petabyte-scale table. 

So the next time you’re struggling to reproduce data, you have no one to blame but yourself… and maybe your teammates.


If you enjoyed this article, check out our GitHub repo, say hi in our Slack group, and read these related posts:

 — Building A Data Development Environment with lakeFS
 — How to Manage Your Data the Way You Manage Your Code

Notes

¹Generally speaking, a snapshot is “the state of a system at a particular point in time”. In terms of data, a snapshot is a representation of the values of a table or dataset at a specific time. This can take the form of a full copy or utilize metadata-based approaches to achieve more efficient representations.

²And not just tabular data, but any type of data object.

³Cover photo by Marcus Dall Col on Unsplash.
