Amit Kesarwani, Vino SD, Iddo Avneri
October 26, 2022

You’re bound to ask yourself this question at some point: Do I need to test the Spark ETLs I’m developing?

The answer is yes; you certainly should – and not just with unit testing but also integration, performance, load, and regression testing. Naturally, the scale and complexity of your data set matter a lot, so scalability testing is important too.

Suppose you have a table with one user and one action that user performed. Why not just copy that a million times and test against it?

This approach won’t solve the problem, because you need to test with the real volume of data – not just the number of users, but the volume, complexity, and variety of your data set. You also need to test the environment: all the tools and software versions should match the production environment.

Here’s some good news: we already have an environment that’s exactly like production – it’s complex, it’s big, and it has all the variety and all the configurations. It’s called “Production” :). But obviously, we won’t develop and test directly against production and risk corrupting the data.

Keep reading to find out how to test your ETL pipelines against production data without ever making a copy of it.


Developing ETL pipelines – what it looks like today

Developing and testing ETL pipelines is very complicated. Since you typically don’t want to test against production data, you need to create your own personal space, like a bucket, and then either sample or copy data out of production. Both approaches have their disadvantages: a sample is unlikely to capture the full-scale complexity of production, and a full copy means waiting a long time (and it’s going to be expensive).

Next, you typically point the ETL that you’re developing to that personal bucket. You might need to configure this to get the upstream from somewhere and then put the downstream somewhere else. There’s a lot of configuration fun here.

Finally, you run the ETL and constantly go back to production to compare results and check that everything makes sense. When you’re done, you either “promote” the data to production (which typically means manually copying it), or you point your code at production and execute the ETL.

The next step? Hoping that everything works fine.

This process is very time-consuming, labor-intensive, and error-prone: everything here can break, and the chance of failure rises with each step.

By the time you’re done, so much time has passed that you might need to go back to the very beginning to resample or recopy the data, because the data in production has already changed and your ETL testing is no longer representative.


Sound familiar?

At lakeFS, we asked: why don’t we test data the same way as we test code? 


How lakeFS makes testing ETLs simpler

lakeFS sits on top of the object store and provides Git-like capabilities (branch, commit, merge, revert) via an API, a CLI, or a GUI. The tools in your ecosystem can either access the object store as usual or go through lakeFS to gain versioning capabilities.

The only difference is that where we previously addressed a collection in a bucket – on AWS S3, Azure Blob Storage, or any object store that supports the S3 protocol – we now also include the name of a branch or a commit identifier in the path: for example, main, prod, gold, or whatever you call your production data branch.
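To make the path convention concrete, here is a small sketch (the repository and object names are made up for illustration). With Spark over the S3A filesystem, the branch or commit simply becomes the first path element after the repository:

```python
# lakeFS exposes versioned data through S3-compatible paths of the form
# s3a://<repository>/<branch-or-commit>/<object-key>.
def lakefs_path(repo: str, ref: str, key: str) -> str:
    """Build an S3A URI that addresses `key` at a given branch or commit."""
    return f"s3a://{repo}/{ref}/{key}"

# The same object key, addressed on two different refs:
prod_uri = lakefs_path("my-repo", "main", "events/2022/10/events.parquet")
test_uri = lakefs_path("my-repo", "test-etl", "events/2022/10/events.parquet")

print(prod_uri)  # s3a://my-repo/main/events/2022/10/events.parquet
print(test_uri)  # s3a://my-repo/test-etl/events/2022/10/events.parquet

# With a SparkSession configured to talk to lakeFS, you would then read
# and write as usual, e.g.:
# df = spark.read.parquet(lakefs_path("my-repo", "test-etl", "events/2022/10/"))
```

Everything after the ref behaves like a normal object-store prefix, so existing Spark jobs only need their input and output paths changed.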

How does lakeFS do that without copying data?

Let’s talk a little about how lakeFS works under the hood and manages metadata. Every commit in lakeFS is a collection of pointers to objects in your lakeFS-managed bucket (the data stays in place on your object store). Since objects in the store are immutable, lakeFS can use copy-on-write: when an object changes, a new object is written, and the next commit points to it.


Files that do not change between commits, however, are shared: multiple commits point to the same physical object. This is exactly what makes developing in isolation against production data practical – to create an environment identical to production, you just create a branch, which takes milliseconds because no data is actually copied.

Many lakeFS users have massive lakes with petabytes of data, yet they can still branch out in milliseconds because it is a metadata-only operation. That means you can create 20, 50, or 100 testing environments without copying a single physical object – and save a lot on storage along the way.
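As a toy illustration of this metadata model (this is a simplification, not lakeFS’s actual implementation), think of each commit as an immutable mapping from logical paths to physical object addresses; branching copies only that mapping:

```python
# Toy model of copy-on-write commits: each commit maps logical paths to
# physical objects; branching copies the (small) mapping, never the objects.
physical_store = {}  # address -> object bytes (immutable, append-only)
next_addr = 0

def put_object(data: bytes) -> int:
    """Write a new immutable object and return its physical address."""
    global next_addr
    addr = next_addr
    physical_store[addr] = data
    next_addr += 1
    return addr

# Commit on main: two files, two physical objects.
main_commit = {
    "events/part-0.parquet": put_object(b"day-1 events"),
    "users/part-0.parquet": put_object(b"user table"),
}

# "Branching" is a metadata-only operation: copy the pointer map.
branch_commit = dict(main_commit)

# Rewriting a file on the branch writes a new object; the old one stays put.
branch_commit["events/part-0.parquet"] = put_object(b"day-1 events, cleaned")

# The unchanged file is shared -- both commits point at the same object.
assert branch_commit["users/part-0.parquet"] == main_commit["users/part-0.parquet"]
# Only three physical objects exist, not four.
assert len(physical_store) == 3
```

However many branches you create, storage grows only with the objects that actually change.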

When you develop ETL pipelines with lakeFS, you basically develop against your branch, and – if you’re happy with the result – you can merge back in. It’s that easy.

Advantages of using lakeFS 

Reduction of storage costs by 20-80% 

When you use lakeFS to test your environment and operate in your data lake, you save significantly on storage costs. Every developer can have an environment identical to production without ever copying an object. 

Furthermore, by the nature of data lakes, most files are static and only a subset changes on a regular basis. lakeFS effectively deduplicates the entire lake over time.

Increased engineering productivity

Applying engineering best practices makes engineers more effective and less frustrated. They can get an environment on demand, work on it, and then throw it away or merge it back – in seconds.

99% faster recovery from production outage

If something bad happens in production, then – because the data is continuously committed – you can always roll your entire lake back to the last known good state in milliseconds.

You can get all these benefits through the lakeFS open-source solution hosted locally, or through lakeFS Cloud, our managed hosted solution that adds service-level assurances, security benefits, and other advantages.

Running a Spark ETL pipeline in lakeFS: Demo

The only prerequisite for this demo is the ability to run a Docker container locally. In this demo, we will use the lakeFS playground to spin up an on-demand lakeFS server in a single click.

If you already have lakeFS running – whether it’s an open-source environment or the Cloud version – you can use this for the demo as well.

Running the lakeFS playground

To spin up a playground, go to https://demo.lakefs.io and insert your email. You will receive an email with all the playground details. 

Once you spin up the playground, by default, we create one repository using our S3 storage bucket. If you’d like to use your own S3 bucket with your own data, you can sign up for the free trial or run lakeFS locally in a single command.

The playground is ready, and you can test your ETL pipeline! 

Cloning the Git repository

The next step is cloning the Git repository. Head over to the lakeFS samples repo on GitHub – it contains multiple samples, and here we’re going to use the Spark Python demo. This repo includes the notebooks and all the packages required to run the demo. You can run it on your local machine as long as Docker is available there.

To clone the repo and change into the demo directory:

git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples/03-apache-spark-python-demo

To run the container:

docker build -t lakefs-spark-python-demo .
docker run -d -p 8888:8888 -p 4040:4040 -p 8080:8080 --user root \
  -e GRANT_SUDO=yes \
  -v $PWD:/home/jovyan \
  -v $PWD/jupyter_notebook_config.py:/home/jovyan/.jupyter/jupyter_notebook_config.py \
  --name lakefs-spark-python-demo lakefs-spark-python-demo

Running the demo

Once the container is running, go to the JupyterLab UI at http://127.0.0.1:8888/ (if you’re running this on a server or another VM, replace the IP address accordingly).

We have multiple notebooks here, but we’ll focus only on the Spark Demo notebook, which walks you step by step through all the required setup.

Keep in mind that when using the playground, you can skip creating a repository and use the preconfigured “my-repo” with its existing storage namespace (every lakeFS repository requires a unique storage namespace).

Experimenting in isolation

The step-by-step notebook walks you through creating an experiment branch via a zero-clone copy operation, modifying data on that branch (and that branch only), and finally deciding whether to merge those changes back into prod or simply delete the experiment.
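The branch lifecycle in the notebook can be sketched as follows. The helper names and server URL are mine, and the endpoint paths are assumptions based on the lakeFS REST API (v1), so check them against the API reference before relying on them:

```python
import json

BASE = "https://my-lakefs.example.com/api/v1"  # placeholder server address

# Hypothetical request builders for the experiment lifecycle. In practice
# you would issue these with an HTTP client using your lakeFS credentials,
# or use lakectl / a lakeFS SDK instead.
def create_branch(repo: str, name: str, source: str):
    """Zero-clone branch: only metadata is created, no objects are copied."""
    return ("POST", f"{BASE}/repositories/{repo}/branches",
            json.dumps({"name": name, "source": source}))

def merge(repo: str, source_ref: str, destination: str):
    """Merge the experiment back if the results look good."""
    return ("POST",
            f"{BASE}/repositories/{repo}/refs/{source_ref}/merge/{destination}",
            json.dumps({}))

def delete_branch(repo: str, name: str):
    """...or throw the experiment away."""
    return ("DELETE", f"{BASE}/repositories/{repo}/branches/{name}", None)

# Lifecycle: branch out, run the Spark job against the branch's s3a:// paths,
# then either merge into prod or delete.
print(create_branch("my-repo", "experiment-1", "main")[1])
print(merge("my-repo", "experiment-1", "main")[1])
print(delete_branch("my-repo", "experiment-1")[1])
```

Whichever way the experiment ends, production never sees half-finished data: changes only land on main via an explicit merge.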


Wrap up

Using lakeFS, you can quickly create an entire copy of the environment and develop against it. This just makes life so much easier!

Give this a try and spin up your own playground environment at https://demo.lakefs.io/.

Also, we have a phenomenal community where you can learn new things and meet like-minded people – join it at https://lakefs.io/slack.
