Oz Katz
November 17, 2020

Modern Data Lakes are a complexity tar pit.

They involve many moving parts: distributed computation engines, running on virtualized servers connected by a software defined network, running on top of distributed object stores, orchestrated by a distributed stream processor or pipeline execution engine.

These moving parts fail. All the time. Handling these failures is not trivial, and testing the code that is supposed to handle them is even more complex. Imagine how hard it is to test the following:

  • What happens if I accidentally delete yesterday’s `events/` partition? Or that of any other random date?
  • What jobs break if I introduce an incompatible schema change?
  • If I kill this job half-way, leaving some intermediate state, will it recover on retry?
  • What happens if a job runs twice, in parallel, by mistake?
  • Can a started job complete in the face of a DNS failure?
  • If input data is missing or empty, will this pipeline continue?

Think about how hard it is to answer those questions for your current environment.


Injecting failure into production

Application engineers have an interesting solution: they test for these errors in production, and do it regularly. This practice, popularized by Netflix in the early 2010s, was given the name “Chaos Engineering”:

“Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.”

The problem with data is that recovering from these failures in production can be extremely expensive. The possibility of corruption or loss is very real, turning whatever value we derive from the experiment into a negative ROI.

Building a separate staging environment is possible, but it either:

  1. Is very expensive (in both storage and operational effort), since it’s a full replica of our production environment, or
  2. Doesn’t represent the reality and complexity of our production system. Samples are not good enough, as they will hardly stress the same system resources: thanks to Adaptive Query Execution and cost-based optimization, the actual behavior of our system will differ, giving us a false sense of security.

Data Development Environment with lakeFS

What if we could inject all the same failures, on top of real, full sized, production data – without affecting our actual production environment? 

lakeFS lets us do exactly that, even at massive scale, without copying any data and with guaranteed, full isolation.

lakeFS is an open source platform that brings resilience and manageability to object-storage-based data lakes by providing a Git-like branching and committing model that scales to petabytes of data, using S3 or GCS for storage.

Since lakeFS supports the AWS S3 API, it works seamlessly with modern data frameworks such as Spark, Hive, AWS Athena, and Presto.

Creating chaos data branches

Once we’ve set up lakeFS, we can create “chaos branches” – complete snapshots of our current data lake, which are immediately available for us to abuse. Here’s an example of creating such a branch.

$ lakectl branch create \
      lakefs://my-lake@chaos-empty-events \
      --source lakefs://my-lake@main

As this is a metadata-only operation (much like Git’s branching), it takes only a few milliseconds.

Once we have an isolated branch, we are free to operate on its data without affecting other users. Since lakeFS is API compatible with S3, we can even use the AWS CLI to do that.

For this example, we’ll delete an arbitrary partition from a production-critical collection to see how dependent jobs behave:

$ aws s3 --endpoint-url https://s3.lakefs.example.com \
      rm --recursive s3://my-lake/chaos-empty-events/collections/events/date=2020-11-04/

When using the S3 API, we prefix the path with the name of the branch we’d like to address.
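Because the branch (or commit ID) is simply the first path component under the repository, experiment code can stay branch-agnostic with a small helper. Here is a hypothetical sketch in Python; the helper itself is not part of lakeFS, but the repository and paths match the examples above:

```python
def lakefs_path(repo: str, ref: str, key: str, scheme: str = "s3a") -> str:
    """Build an object path addressed through the lakeFS S3 gateway.

    `ref` can be a branch name (e.g. "chaos-empty-events") or a commit ID;
    both are addressed the same way: <scheme>://<repo>/<ref>/<key>.
    """
    return f"{scheme}://{repo}/{ref}/{key.lstrip('/')}"

# The same job can target a chaos branch or production just by swapping the ref:
chaos_input = lakefs_path("my-lake", "chaos-empty-events", "collections/events/")
prod_input = lakefs_path("my-lake", "main", "collections/events/")
```

Parameterizing jobs this way makes the branch name the only thing a chaos experiment needs to change.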

This means different jobs and experiments can all happen in parallel, each in its own isolated branch.

Let’s run our Spark job to see how it handles the missing partition:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.lakefs.example.com")

val branch = "chaos-empty-events"
spark.read.parquet(s"s3a://my-lake/${branch}/collections/events/...")

// remainder of our job

Once we’re done, we can simply discard the branch, or revert all changes made to it, atomically:

$ lakectl branch revert lakefs://my-lake@chaos-empty-events

This is especially useful since we might want to fix and reproduce the steps we’ve made.

Alternatively, we can commit these changes, documenting what happened. This will create an immutable snapshot of the lake as it existed during the experiment – great for testing fixes.

$ lakectl commit \
    lakefs://my-lake@chaos-empty-events \
    -m 'chaos experiment #1: deleted a random partition' \
    --meta 'spark_version=2.8' \
    --meta 'spark_job_version=923be9f09c8'
# output:
# Commit for branch "chaos-empty-events" done.
# 
# ID: ~901f7b21e1
# Timestamp: 2020-11-08 19:26:37 +0000 UTC
# Parents: ~a91f56a7e1

We can later run the same experiment by using the commit ID we received instead of the branch name:

spark.read.parquet(s"s3a://my-lake/~901f7b21e1/collections/events/...")

As long as our branch’s lifecycle rules allow, this is guaranteed to always return the same data.

Baking it into your process

lakeFS provides very strong primitives to make experimenting with real data easy.

The process described above could be a good starting point. Imagine a nightly job that creates a branch, injects a random failure somewhere and runs a production DAG on that branch.

This way we can continuously monitor the resilience of our system and improve it over time, without waiting for it to break in production.
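As a concrete starting point, such a nightly job could be a small driver that snapshots production into a chaos branch, injects one randomly chosen failure, and runs the DAG against it. The sketch below is hypothetical: the scenario catalog, branch naming, and job invocation are assumptions, while the `lakectl` and `aws` commands mirror the examples above:

```python
import random
from datetime import date, timedelta

def scenarios(branch: str) -> dict:
    """Hypothetical catalog of failure injections, keyed by scenario name.

    Each entry is the shell command that produces the failure on the branch.
    """
    endpoint = "--endpoint-url https://s3.lakefs.example.com"
    yesterday = date.today() - timedelta(days=1)
    return {
        "delete-partition": (
            f"aws s3 {endpoint} rm --recursive "
            f"s3://my-lake/{branch}/collections/events/date={yesterday}/"
        ),
        "empty-input": (
            f"aws s3 {endpoint} rm --recursive "
            f"s3://my-lake/{branch}/collections/events/"
        ),
    }

def nightly_chaos_commands(seed: int) -> list:
    """Plan a nightly chaos run: branch, inject one random failure, run the DAG."""
    rng = random.Random(seed)
    branch = f"chaos-{date.today()}"
    name, inject = rng.choice(sorted(scenarios(branch).items()))
    return [
        # 1. snapshot production into an isolated chaos branch
        f"lakectl branch create lakefs://my-lake@{branch} --source lakefs://my-lake@main",
        # 2. inject the randomly chosen failure
        inject,
        # 3. run the production DAG against the chaos branch (job name is hypothetical)
        f"spark-submit production_dag.jar --branch {branch}",
    ]
```

A real driver would execute these commands (e.g. with `subprocess.run(cmd, shell=True, check=True)`) and then revert or commit the branch depending on whether the DAG survived.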

If we do want to stay a little closer to the original definition of Chaos Engineering, we can of course inject these failures into our production branch. lakeFS ensures that even if things do go wrong, reverting to a previous, known-working state is immediate and atomic. This requires a little more thought about the potential impact on consumers, who are (for better or worse) no longer isolated from the faults we inject.

Learn More

We believe lakeFS is a great building block for building resilient, manageable data lakes.

It not only allows easy experimentation, but also makes introducing changes to production data safer, reproducing bugs easier, and recovery much faster when things do go wrong.

Learn more about how to branch and merge data, and of course, feel free to check out lakeFS on GitHub.
