Iddo Avneri
December 28, 2022

What is ETL Testing?

ETL testing is the process of evaluating and verifying that the ETL (Extract, Transform, Load) processes work correctly. 

What is ETL?

An ETL process Extracts data, potentially in many different structured or unstructured formats, from multiple sources into a centralized repository. Then, the ETL process Transforms the data into a format that is easier to consume for specific business use cases. This transformation typically includes “cleaning” the data (e.g., deduplication, standardizing date and time formats, data cleansing) and restructuring it (e.g., joining different data sets into a single table, summarization). Finally, the transformed data is Loaded into a central repository (a warehouse). 

Source: https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl

ETL Testing – What are the challenges?

While many challenges in ETL testing overlap with generic software testing challenges, some are far harder when testing ETLs. 

In this blog, I’ll review some of those specific challenges in more detail, and then address, from a practical point of view, lakeFS’s approach to solving these challenges.

Scale: ETL Testing against Big Data

The first challenge you will typically face when testing ETLs derives from the nature of the data lake use case. Innately, we are writing or streaming massive amounts of data to a central location; therefore, the process we are testing should be tested against massive amounts of data. Replicating production data is time-consuming and expensive. While object stores are cheap, they aren’t free.
Data lakes are often petabytes in size and growing rapidly. Copying files to separate buckets for a “production-like” test may take hours. Furthermore, if, for example, my data lake holds 100 TB of data on S3, creating a single copy of that data for a continuous testing environment will cost roughly $25,000 annually. Want to run multiple test environments in parallel? Keep multiplying. 

If you are reading these lines, you are probably well aware of this obvious statement: Your data usage is ONLY going to grow. Likely, faster than you expect. 

Reproducibility: Repeatedly Test ETLs Against the Same Data Set

Data constantly changes. However, the output of an ETL pipeline shouldn’t necessarily change when it runs on the same input data set. In other words, when testing an ETL, you would like to compare the result set with an expected result set from a previous run. But since your production data has changed, reading the new data makes it difficult to verify that the new code works correctly. 

Accuracy: ETL Testing Against Production-like Data

While size matters (see above point about scale), the complexity and variety of the data matter as well. The variety of the data can dramatically impact the performance of an ETL. For example, if part of the ETL re-partitions data by a specific column, it could be an incredibly expensive operation, or not, depending on the column values. 

Isolation: Test Complex ETLs Without Risking Production Data

The “E” part of your ETL is theoretically easier to address. You could (though it’s not recommended) test your ETLs by reading directly from the production object store where the files are extracted. However, what happens when your transformation needs to delete data? For example, say I want to test a new retention process that supports GDPR requirements. Are you willing to risk accidentally deleting production data?

Automation: Automatically Detect Defects While Testing ETLs

ETLs are long-running, multi-step, expensive processes. You want not only to detect defects (made much easier by repeatedly running tests against the same data set) but also to detect them early. There needs to be a way to automatically compare results mid-way through the process, multiple times, instead of waiting for the process to end and only then comparing results.

ETL Testing – Practical Steps with lakeFS

What is lakeFS?

lakeFS is an open-source project that applies software engineering best practices to data engineering. Concepts such as Dev/Test environments and CI/CD are harder to implement in data engineering, since the data, and not just the code, must be managed. lakeFS provides version control over the data lake and uses Git-like semantics to create and access those versions, so every engineer feels at home with lakeFS within a few minutes.

Scalability

lakeFS’s data versioning is built on metadata manipulation. Therefore, with lakeFS, setting up a new environment identical to production will:

  1. Be done in milliseconds for data lakes of any scale.  
  2. Not consume additional storage.

Using lakeFS, each developer can test ETLs by branching out of production as many times as they want, immediately and at no additional cost.
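As a minimal sketch (assuming a repository named some-repo-name whose main branch holds production data, and hypothetical per-developer branch names):

```shell
# Each branch is a metadata-only operation: it completes in milliseconds
# and consumes no additional storage, regardless of data size.
lakectl branch create lakefs://some-repo-name/dev-alice -s lakefs://some-repo-name/main
lakectl branch create lakefs://some-repo-name/dev-bob -s lakefs://some-repo-name/main
```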

Reproducibility

Once you’ve developed new versions of your ETL, those can easily be tested against historical commits. This way, you can compare the output of different versions of your ETL against the same input.
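For example (a sketch; the repository name follows the tutorial below and the commit ID placeholder is illustrative), you can pin a test run to a historical commit by branching from it:

```shell
# List the commit history of main to find the input state of a previous test run
lakectl log lakefs://some-repo-name/main

# Branch from that specific commit so the new ETL version reads the exact same input
lakectl branch create lakefs://some-repo-name/test-etl-v2 \
  -s lakefs://some-repo-name/<commit-id>
```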

Accuracy

Since branches are created directly from production data, ETL testing with lakeFS runs against production-identical data, including all the complexities that come with it. Rest assured that this is precisely what the production environment looks like. 

Isolation

Accidentally deleted all your data? No problem. Delete the branch and start your testing over, running your new code against a fresh branch. 
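A sketch of the recover-and-retry flow, using the repository and branch names from the tutorial below:

```shell
# Throw away the damaged test branch; production (main) is untouched
lakectl branch delete lakefs://some-repo-name/test-etl --yes

# Start over against a fresh, production-identical branch
lakectl branch create lakefs://some-repo-name/test-etl -s lakefs://some-repo-name/main
```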

Automation

Take advantage of lakeFS hooks: run post-commit hooks that test your ETL at every step of the process. This approach not only helps you quickly identify issues in a specific step of a multi-step ETL, but is also extremely helpful in root cause analysis. 

Get started: Test your First ETL with lakeFS

  1. Deploy a lakeFS server:

docker run --pull always -p 8000:8000 treeverse/lakefs run --local-settings

  2. Optionally, configure the server to run against your cloud storage. For example, to connect to AWS:

docker run --pull always -p 8000:8000 \
  -e LAKEFS_BLOCKSTORE_TYPE='s3' \
  -e AWS_ACCESS_KEY_ID='YourAccessKeyValue' \
  -e AWS_SECRET_ACCESS_KEY='YourSecretKeyValue' \
  treeverse/lakefs run --local-settings

  3. Create a new repository (in this example, I’m using the lakectl command line interface, but this can be achieved from the UI as well):

lakectl repo create lakefs://some-repo-name s3://some-bucket-name

  4. Import your data into lakeFS (this only creates metadata and does not copy any objects). Once again, this can be executed from the web user interface as well:

lakectl import --from s3://bucket/my-data/path/ --to lakefs://some-repo-name/main/path/ --merge

  5. Create a development branch out of your production data (or via the UI):

lakectl branch create lakefs://some-repo-name/test-etl -s lakefs://some-repo-name/main
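After running your ETL against the test-etl branch, you might commit its output so the test run itself becomes a reproducible, referenceable version (a sketch; the commit message is illustrative):

```shell
# Snapshot the ETL's output on the test branch
lakectl commit lakefs://some-repo-name/test-etl -m "ETL test run output"
```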

Watch how to achieve this in under 3 minutes

Bonus: Don’t only Test ETLs, Promote Data

In practicing CI/CD, testing is generally a step (or a couple of steps) before deployment. Imagine this: you developed a new ETL and tested it successfully. Now what?

Option 1 (probably what you do today): Run the ETL again, this time in production instead of your testing environment.

Option 2 (with lakeFS): Since you’ve just tested the ETL on production data, you can atomically promote the data from your testing environment (or perhaps a staging environment) to production by merging it back into production. This enables CI/CD for your data that is not limited to testing ETLs, but also covers deploying the results of those tests with certainty. 
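The promotion itself is a single merge, assuming the tutorial's repository and branch names:

```shell
# Atomically promote the tested data from the test branch into production
lakectl merge lakefs://some-repo-name/test-etl lakefs://some-repo-name/main
```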

Conclusion

Due to the nature of working on top of massive amounts of constantly changing data, ETL testing holds unique challenges beyond your “typical” software test case.

In this blog, I’ve demonstrated a practical way to apply engineering best practices to testing ETLs.

Want to learn more about lakeFS? https://lakefs.io 
Join the conversation on our slack channel: https://lakefs.io/slack
