What’s not to love about data pipeline testing? Adding acceptance tests to your data pipelines reduces the chance of errors and ensures that sufficient quality checks run on the data before it is delivered to end users.
Testing data pipelines involves two components: the data itself and the code that manages the Extract, Transform, and Load (ETL) process. You’re bound to encounter more challenges around testing the data than the code, but always keep in mind that both need to be tested.
How exactly does that work in the context of acceptance testing? How do you test ETLs? Keep reading to get all the answers and improve your testing process with industry best practices.
Table of contents
- Why is testing data pipelines important?
- What exactly is ETL testing?
- Code and data testing in data pipelines
- Challenges for testing data pipelines
- Solution? lakeFS makes testing data pipelines easy
Why is testing data pipelines important?
Data is the foundation of many crucial business choices. As more businesses move toward data-driven decision-making, data plays an increasingly significant role in how modern organizations operate. As a result, high-quality data improves the quality and relevance of business decisions.
However, data is always changing. In contrast to code, which is often static and tidy, data is dynamic. Operational changes, shifts in the broader economy, and events like a global pandemic can all have a big effect on the data.
In most circumstances, data needs to be cleaned before it can be used for analytics. Data testing guarantees that major changes or drift are detected in real time, and that faulty data is correctly filtered out and discarded.
What exactly is ETL testing?
An ETL process takes data from many different sources, which could be in many different structured or unstructured formats, and stores it in one place. The data is then put into a format that makes it easier to use for some business purposes.
This transformation often includes cleaning the data: removing duplicates, standardizing date and time formats, and so on. ETL also involves reshaping data, such as joining different data sets into a single table or summarizing them. Eventually, the transformed data is stored in a central location, like a warehouse.
ETL testing makes sure that data is moved from different sources to the central data warehouse while following transformation rules and passing all validity tests.
ETL testing differs from database testing used in data warehouse systems. It’s a key part of gathering useful data for business intelligence and analytics.
Code and data testing in data pipelines
Code testing in a data pipeline is more or less similar to testing any software product (though it comes with its own unique challenges, which we explore below). Unit tests, integration tests, and end-to-end system tests are all part of code testing. Code testing is typically performed as part of Continuous Integration (CI) pipelines.
Here, the goal is to make sure that the code quality is good and that the data ingestion, data transformation, and data loading functions work as planned.
What about data testing?
Data testing means setting expectations for important pieces of data and making sure the data meets those expectations before the final data is delivered to users.
Data testing is a continuous effort for engineers, and it becomes a lot more important once data pipelines are put into operational systems. Apart from evaluating data quality, teams also need to monitor the testing results to ensure that any data quality breaches are corrected as soon as possible.
Let’s take a closer look at code and data testing for ETL pipelines to understand what exactly engineers deal with when launching a testing process.
Code testing in a data pipeline
There are a few characteristics of data pipelines that make code testing slightly more difficult.
Since data pipelines are inherently data-reliant, you need to create sample data for testing purposes. And we’re talking about huge amounts of sample data!
Running data pipelines also requires many other systems, such as processing engines like Spark and Databricks and data warehouses like Snowflake, Redshift, and Databricks SQL. What you want to achieve here is to test the functionality of the data processing logic separately from its links and interactions with these external systems.
To help you understand why testing code for data pipelines is different, let’s take a look at unit and integration testing in the context of data pipelines.
Unit testing in data pipelines
Unit tests are very low-level and close to an application’s source code. Each method and function used in your data pipelines should be tested so that mistakes can be detected without requiring a complex external environment: refactoring errors, syntax errors in interpreted languages, configuration mistakes, problems with the pipeline’s graph structure, and so on.
To get the final results, you may need to use a number of data transformation functions. You can put these transformation functions through unit tests to make sure they produce the right data.
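As a minimal sketch, here is what a unit test for one such transformation function might look like with pytest. The function and its supported formats are invented for illustration, not taken from any specific pipeline:

```python
# A hypothetical transformation function and a pytest-style unit test for it.
# `normalize_timestamp` and its format list are illustrative assumptions.
from datetime import datetime


def normalize_timestamp(raw: str) -> str:
    """Standardize a few common date formats to ISO 8601."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(raw, fmt).isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")


def test_normalize_timestamp():
    # The test pins down the expected output for each supported format.
    assert normalize_timestamp("2023-01-05 10:30:00") == "2023-01-05T10:30:00"
    assert normalize_timestamp("05/01/2023 10:30") == "2023-01-05T10:30:00"
```

Running `pytest` on a file containing this code exercises the transformation logic without touching any external system.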
To run unit tests on functions that are “in” the data pipeline, you must also make sure that the test has data to work with. As a result, there are two typical methods for gathering enough data to test your pipeline code.
The first is to generate synthetic test data based on the distribution and statistics of the real data. The second is to replicate a sample of real production data in a development or staging environment for testing.
You need to ensure that there are no violations of data privacy, security, or compliance when transferring data to a less restrictive environment than a production environment.
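The first approach, generating synthetic data that mimics production statistics, can be sketched with the standard library alone. The column names, country weights, and amount distribution here are invented for illustration:

```python
# A sketch of generating fake test data that mimics the statistics of real
# data. The columns and parameters (country weights, log-normal amounts)
# are assumptions, stand-ins for what you would measure in production.
import random

random.seed(42)  # deterministic, so tests are reproducible

COUNTRIES = ["US", "DE", "IN", "BR"]
COUNTRY_WEIGHTS = [0.5, 0.2, 0.2, 0.1]  # hypothetical production shares


def fake_orders(n: int) -> list[dict]:
    rows = []
    for i in range(n):
        rows.append({
            "order_id": i,
            "country": random.choices(COUNTRIES, COUNTRY_WEIGHTS)[0],
            # order amounts are roughly log-normal in many real revenue columns
            "amount": round(random.lognormvariate(3.5, 0.8), 2),
        })
    return rows


sample = fake_orders(1_000)
```

Because no real records are involved, this approach sidesteps the privacy concerns above, at the cost of possibly missing edge cases that only real data contains.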
Integration testing for data pipelines
This kind of testing makes sure that your data pipelines’ modules or services work well with each other. The key integration tests for data pipelines cover interactions with data platforms such as data warehouses, data lakes (primarily cloud storage locations), and data source applications, for example OLTP databases and SaaS applications like Salesforce and Workday.
The three important processes of a data pipeline are Extract, Transform, and Load. At least two of them (extract and load), and sometimes all three, connect to the above-listed data platforms.
A data pipeline may also talk to a messaging system, like Slack or Teams, to send out alerts or notifications when important events happen in your pipelines. Because of this, you should run integration tests against any other platforms and systems that your pipelines interact with frequently.
If your data pipeline was built with Python, you can use a Python testing framework like pytest to automate code testing. It lets you create a variety of software tests, such as unit tests, integration tests, end-to-end tests, and functional tests.
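One common pattern, sketched below under assumed names (`FakeWarehouse` and `load_rows` are illustrative, not a real API), is to test the load step against an in-memory stand-in for the warehouse, keeping the real integration tests for a dedicated environment:

```python
# A sketch of isolating pipeline logic from external systems: the load step
# is tested against an in-memory fake instead of a real warehouse client.
class FakeWarehouse:
    """Stands in for a warehouse client (e.g. Snowflake) during tests."""
    def __init__(self):
        self.tables = {}

    def insert(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)


def load_rows(warehouse, table, rows):
    # In production `warehouse` would be a real client; in tests, the fake.
    warehouse.insert(table, rows)


def test_load_rows():
    wh = FakeWarehouse()
    load_rows(wh, "orders", [{"order_id": 1}, {"order_id": 2}])
    assert len(wh.tables["orders"]) == 2
```

The design choice here is dependency injection: because `load_rows` receives its warehouse client as a parameter, the same code path runs in unit tests and in production.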
Consider using a data pipeline orchestration tool in addition to a testing framework to make code testing for your data pipelines easier.
Data testing in a data pipeline
Data testing is all about setting expectations for important pieces of data and making sure that the data going through your data pipelines meets those expectations before it gets to the end users of the data. If any of these data quality expectations are violated, appropriate communication or alerting and corrective action should be performed.
Unlike code testing, which is often done at the compilation or deployment stage, data testing is done constantly whenever fresh streams and batches of data are ingested and processed.
Data testing is a continuous series of acceptance tests in which you make assumptions and expectations about newly arriving data and then test in real time to make sure that these assumptions and expectations are met.
Here’s a quick overview of data testing methods:
- Table-level tests – these concentrate on understanding the general shape of a table.
- Column-level tests – these are classified into two types:
  - single-column tests, which focus on establishing expectations based on statistical features of specific columns,
  - multi-column tests, which are concerned with determining the connections between columns.
Some open-source libraries, like Great Expectations, Deequ, and PyDeequ, were made to help data teams test the quality of their data.
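To make the table-level and column-level distinction concrete, here is a hand-rolled sketch in the spirit of such libraries. The thresholds, column names, and the `check_batch` helper are all invented for illustration:

```python
# A hand-rolled sketch of table- and column-level data checks, in the spirit
# of libraries like Great Expectations. All names and thresholds are
# hypothetical examples, not production values.
def check_batch(rows: list[dict]) -> list[str]:
    failures = []
    # Table-level: the general shape of the incoming batch
    if not (100 <= len(rows) <= 1_000_000):
        failures.append(f"unexpected row count: {len(rows)}")
    # Single-column: a statistical expectation on one column
    if any(r["amount"] < 0 for r in rows):
        failures.append("negative amounts found")
    # Multi-column: a relationship between two columns
    for r in rows:
        if r["shipped_at"] is not None and r["shipped_at"] < r["ordered_at"]:
            failures.append(f"order {r['order_id']} shipped before ordered")
    return failures
```

An empty result means the batch meets expectations; a non-empty one is the signal to alert and quarantine the data rather than pass it downstream.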
Challenges for testing data pipelines
Even though many ETL testing problems are the same as in general software testing, there are some details that are much harder to test when testing data pipelines.
Testing on a large scale
Data pipelines inherently write or stream large volumes of data to a central location, so the testing process needs large amounts of data as well. Replicating production data is time-consuming and costly. While object storage is reasonably priced, it’s still not free.
Data lakes are frequently petabytes in size and are rapidly expanding. It may take hours to copy files to various buckets for a “production-like” test. Also, if a data lake stores 100 TB of data on S3, it will cost about $25,000 per year to make a single copy of that data for a continuous testing environment.
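The figure above holds up to back-of-the-envelope arithmetic. The ~$0.023 per GB-month rate used below approximates S3 Standard list pricing; actual tiered prices vary, so treat the result as a ballpark:

```python
# Back-of-the-envelope cost of keeping one extra copy of a 100 TB data lake
# for a continuous testing environment. The per-GB rate approximates S3
# Standard pricing and is an assumption, not a quote.
TB = 1_000  # GB per TB (decimal, as cloud pricing uses)
price_per_gb_month = 0.023

lake_size_gb = 100 * TB
annual_cost = lake_size_gb * price_per_gb_month * 12
print(f"${annual_cost:,.0f} per year")  # roughly $27,600, the same order as $25k
```

And that is for a single copy; each additional parallel test environment multiplies the bill.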
Do you want to run numerous test environments at the same time? Continue to multiply. And you can be sure that over that year, your data use will only increase.
Reproducibility in testing data pipelines
While testing an ETL, you want to compare the result set to the expected result set from a previous run. But since your production data has changed since then, it’s hard to verify that the new code works correctly without access to the same input data.
While size is important, so are the complexity and diversity of the data. The diversity of data can have a significant influence on an ETL’s performance. For instance, if a part of the ETL re-partitions data by a certain column, it could become a very expensive process (depending on the values of the column).
Testing ETLs without endangering production data
You can test your ETL by reading directly from your production object store, where the extracted files live (though this is usually not a good idea).
But what if your change requires data deletion? Suppose you need to test a new retention procedure that complies with GDPR regulations. Are you prepared to take the risk of mistakenly erasing production data? Testing data pipelines is best done in isolation.
Automation in testing
ETLs are time-consuming, costly, multi-step processes. What every engineer wants is to not only discover problems but to detect them early on. Instead of waiting for the process to finish and then comparing the outcomes, teams need a means to automatically compare results many times throughout the process.
Solution? lakeFS makes testing data pipelines easy
lakeFS is an open-source project that applies best practices from software engineering to the world of data engineering. Dev/Test environments and CI/CD are more difficult to adopt in data engineering since you don’t just handle code.
lakeFS provides version control over the data lake and employs git-like semantics to build and retrieve those versions.
lakeFS versions data through metadata manipulation. As a result, with lakeFS, creating a new environment that is identical to production takes milliseconds, for data lakes of any size, without having to add any storage.
After you’ve created new versions of your ETL, you can quickly test them against previous changes. You may compare the output of multiple versions of your ETL against the same input this way.
Since lakeFS branches are created directly from production data, ETL testing is performed on production-identical data, with all of the complexity that entails. You can be confident that this is exactly how the production environment looks.
Have you inadvertently destroyed all of your data? No worries. Remove the branch and begin testing your new code against a new branch.
lakeFS users can execute post-commit hooks to test ETLs at each stage of the process. This method not only helps in swiftly detecting flaws in a given phase of a multi-step ETL, it’s also incredibly useful for root cause analysis.
Try lakeFS in playground mode to see how it works and what benefits it brings to teams looking to test their ETL pipelines efficiently.