lakeFS Community
The lakeFS Team

Last updated on May 22, 2024

What’s not to love about data pipeline testing? Adding user acceptance testing and contract acceptance testing to your data pipelines makes them less likely to cause errors and ensures that enough quality checks are done on the data before it’s sent to end users.

Data pipeline testing involves the two components of any data pipeline: the data itself and the code that manages the Extract, Transform, and Load (ETL) process. You’re bound to encounter more challenges around testing data than the code, but it’s important to test both when testing data pipelines.

How exactly does that work in the context of acceptance testing? How do you test ETLs? Keep reading to get all the answers and improve your testing process with industry best practices.

Why is testing data pipelines important?

Data is the foundation of many crucial business choices. As more businesses move toward data-driven decision-making, data plays an increasingly significant role in the operations of modern organizations. As a result, having high-quality data improves the quality and relevance of business decisions.

However, data is always changing, making user acceptance testing crucial. Unlike code, which is often static and tidy, data’s dynamic nature adds complexity to data pipeline testing. Changes in operations, changes in the economy as a whole, and events like a global pandemic could have a big impact on the data. Just like the software testing process in software development, data pipeline testing ensures that data meets business requirements and supports teams towards successful project completion.

In most circumstances, data needs to be cleaned before it can be used for analytics purposes. Data testing helps ensure that major changes or drift are noticed in real time, and that faulty data can be correctly filtered and discarded.

Ensuring data quality

Data quality is critical for making educated decisions. If data is erroneous, incomplete, or out of date, it might result in bad or misinformed decisions. Testing pipelines for data quality on a regular basis, covering elements such as correctness, consistency, and completeness, may help companies rely on their data confidently.

Guaranteeing data integrity

Data integrity entails ensuring that the data is accurate and consistent throughout its lifecycle. A strong data pipeline guarantees that data is not lost, duplicated, or incorrectly updated as it travels from source to destination. Pipeline testing ensures that data transactions are atomic, maintainable, and fault-tolerant.
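
As a minimal sketch of what a "not lost, not duplicated" check could look like, the snippet below reconciles record IDs between a source and a destination. The function name `reconcile` and the sample IDs are hypothetical and only illustrate the idea; a real pipeline would read the IDs from its own systems.

```python
from collections import Counter


def reconcile(source_ids, destination_ids):
    """Compare record IDs seen at the source with those at the destination."""
    source_counts = Counter(source_ids)
    dest_counts = Counter(destination_ids)
    # IDs that were in the source but never arrived.
    missing = sorted(set(source_counts) - set(dest_counts))
    # IDs that appear more often in the destination than in the source.
    extra = sorted(
        key for key, n in dest_counts.items() if n > source_counts.get(key, 0)
    )
    return {"missing": missing, "extra": extra}


# Hypothetical sample: record 3 was lost and record 2 was written twice.
report = reconcile([1, 2, 3, 4], [1, 2, 2, 4])
print(report)
```

An empty `missing` and `extra` list means the two sides agree on which records exist; it says nothing about whether the field values are correct.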

Optimizing system performance

Performance problems can be costly and disruptive. Testing pipelines under a variety of conditions, such as huge data volumes, concurrent users, or resource-intensive tasks, helps find bottlenecks and fix them before they affect production.

Enabling continuous improvement

Continuous improvement is critical in today’s nimble and ever-changing world. The pipeline itself is expected to undergo periodic alterations to meet new requirements or technology. Regular testing guarantees that modifications may be seamlessly integrated without disrupting current functionality, promoting an agile development process.

Ensuring compliance and security

GDPR, HIPAA, and industry-specific standards are all examples of legal and regulatory criteria that data must meet. Testing pipelines for compliance and security guarantees that sensitive data is handled appropriately, encrypted both in transit and at rest, and that adequate access restrictions are implemented.

Reducing operational costs

A defective pipeline might result in a significant amount of effort wasted on debugging, hotfixes, and even manual data cleaning. All of these activities might be expensive. Rigorous testing can detect problems early, decreasing the time and resources required for troubleshooting and repair, and thereby lowering operational costs.

Facilitating collaboration and documentation

A well-tested pipeline is often well-documented, making it easy for team members to grasp. This promotes collaboration among data scientists, engineers, and business analysts. Good documentation, which is frequently developed as a result of extensive testing (including user acceptance testing or operational readiness testing), facilitates the onboarding of new team members and project transitions.

What exactly is ETL testing?

Incorporating user acceptance testing as part of ETL testing ensures that the transformed data meets specific user requirements and expectations, enhancing the reliability of data-driven decisions.

An ETL process takes data from many different sources, which could be in many different structured or unstructured formats, and stores it in one place. The data is then put into a format that makes it easier to use for some business purposes. 

Cleaning the data is often a part of this transformation. This means removing duplicate data, standardizing date and time formats, and fixing or discarding invalid values. ETL also involves reshaping data, like joining different data sets into a single table or summarizing them. Eventually, the converted data is stored in a central location, like a warehouse.

ETL testing, guided by acceptance criteria, ensures data is moved from different sources to the central data warehouse while following transformation rules and passing all validity tests. 

ETL testing differs from database testing used in data warehouse systems. It’s a key part of gathering useful business intelligence and analytics data.

Code and data testing in data pipelines

Code testing in a data pipeline, including user acceptance testing, is more or less similar to testing any software product (though it comes with its unique challenges, which we explore below). Unit tests, integration tests, and end-to-end system tests are all part of code testing. Code testing is typically performed as part of Continuous Integration (CI) pipelines. 

Here, the goal is to make sure that the code quality is good and that the data ingestion, data transformation, and data loading functions work as planned.

What about data testing?

Setting expectations for important data pieces, defined by acceptance criteria, and making sure that the data meets these acceptance criteria before giving the final data to users is called “data testing.” 

Engineers have to work on data testing all the time, and it becomes a lot more important as data pipelines are put into operational systems. Apart from evaluating the data quality, teams also need to monitor the testing results to ensure that any data quality breaches are corrected as soon as possible.

Let’s take a closer look at code and data testing for ETL pipelines to understand what exactly engineers deal with when launching a testing process.

Code testing in a data pipeline

There are a few characteristics of data pipelines that make code testing slightly more difficult.

Since data pipelines are inherently data-reliant, you need to create sample data for testing purposes. And we’re talking about huge amounts of sample data!

To run data pipelines, a lot of other systems are needed, such as processing systems like Spark and Databricks and data warehouses like Snowflake, Redshift, and Databricks SQL. As a result, you must come up with methods for testing them separately. What you want to achieve here is to separately test the functionality of the data processing logic from the links and interactions with these other systems.
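
One way to get that separation, sketched here under assumptions, is to keep the processing logic in a pure function and pass the extract and load steps in as parameters. The names `transform` and `run_pipeline` and the order fields are hypothetical; in production the two callables would wrap the real source and warehouse clients.

```python
from typing import Callable, Iterable


def transform(rows: Iterable[dict]) -> list:
    """Pure processing logic: keep completed orders and add a total."""
    return [
        {**row, "total": row["quantity"] * row["unit_price"]}
        for row in rows
        if row["status"] == "completed"
    ]


def run_pipeline(
    extract: Callable[[], Iterable[dict]],
    load: Callable[[list], None],
) -> None:
    """Wire extract -> transform -> load without knowing the systems behind them."""
    load(transform(extract()))


# In a test, in-memory stand-ins replace the source system and the warehouse.
sample = [
    {"status": "completed", "quantity": 2, "unit_price": 5.0},
    {"status": "cancelled", "quantity": 1, "unit_price": 9.0},
]
loaded = []
run_pipeline(lambda: sample, loaded.extend)
print(loaded)
```

With this shape, `transform` can be unit tested on its own, and the connections to external systems are left to integration tests.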

To help you understand why testing code for data pipelines is different, let’s take a look at unit and integration testing in the context of data pipelines.

Unit testing in data pipelines

Unit tests are very low-level and close to an application’s source code. Each method and function used in your data pipelines needs to be tested to detect mistakes without requiring a complex external environment. Mistakes in refactoring, syntax errors in interpreted languages, configuration mistakes, and problems with the graph structure are all examples of these kinds of problems. 

To get the final results, you may need to use a number of data transformation functions. You can put these transformation functions through unit tests to make sure they produce the right data. 
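
As an illustration, here are two small transformation functions with pytest-style tests. The functions, the accepted date formats, and the sample rows are hypothetical; they stand in for whatever cleaning steps your pipeline actually performs.

```python
from datetime import datetime


def standardize_date(value: str) -> str:
    """Convert a few known date formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


def drop_duplicates(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def test_standardize_date():
    assert standardize_date("22/05/2024") == "2024-05-22"
    assert standardize_date("May 22, 2024") == "2024-05-22"


def test_drop_duplicates_keeps_first():
    rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
    assert drop_duplicates(rows, "id") == [{"id": 1, "v": "a"}, {"id": 2, "v": "c"}]
```

Saved in a file such as `test_transforms.py`, these tests would be collected and run by `pytest` without any external system.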

To run unit tests on functions that are “in” the data pipeline, you must also make sure that the test has data to work with. As a result, there are two typical methods for gathering enough data to test your pipeline code. 

The first is to create synthetic test data based on the distribution and statistics of the real data. The second is to replicate a sample of real data in a development or staging environment for testing. 
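
The synthetic-data approach can be sketched as follows, assuming a single numeric column that is roughly normally distributed. That assumption, the function name `synthesize`, and the sample values are illustrative only; real columns often need a different distribution or extra constraints (for example, no negative amounts).

```python
import random
import statistics


def synthesize(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values."""
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical sample of real order amounts.
real_order_amounts = [12.5, 40.0, 33.2, 18.9, 27.4, 51.0, 22.3]
fake_order_amounts = synthesize(real_order_amounts, n=1000)
print(len(fake_order_amounts), round(statistics.mean(fake_order_amounts), 1))
```

Because only summary statistics of the real data are used, no real record is copied into the test environment.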

You need to ensure that there are no violations of data privacy, security, or compliance when transferring data to a less restrictive environment than a production environment.

Integration testing for data pipelines

This kind of testing makes sure that your data pipelines’ modules or services work well with each other. The key integration tests for data pipelines cover interactions with data platforms like data warehouses, data lakes (primarily cloud storage locations), and data source apps, for example OLTP database applications and SaaS applications such as Salesforce and Workday. 

The three important processes of a data pipeline are Extract, Transform, and Load. At least two of them (extract and load), and sometimes all three, connect to the above-listed data platforms. 

A data pipeline often also talks to a messaging system, like Slack or Teams, in order to send out alerts or notifications when important things happen on your data pipelines. Because of this, you need to do integration tests with any other platforms and systems that your pipelines interact with frequently.
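
Before running a real integration test against a messaging system, the alerting path itself can be exercised with a test double. The sketch below uses `unittest.mock.Mock` from the Python standard library in place of a Slack or Teams client; `run_with_alerts` and `failing_load` are hypothetical names.

```python
from unittest.mock import Mock


def run_with_alerts(load_batch, notify):
    """Run one load step and report a failure through the notifier."""
    try:
        load_batch()
    except Exception as exc:
        notify(f"Pipeline load failed: {exc}")
        return False
    return True


def failing_load():
    # Simulates the warehouse being unreachable during the load step.
    raise RuntimeError("warehouse unavailable")


notify = Mock()  # stands in for the real messaging client
succeeded = run_with_alerts(failing_load, notify)
notify.assert_called_once_with("Pipeline load failed: warehouse unavailable")
```

This only verifies that the pipeline tries to send the right message. A true integration test would still need to send it to a test channel or sandbox of the real messaging platform.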

If your data pipeline was built with Python, you can use a Python testing framework like pytest to automate code testing. It lets you create a variety of software tests, such as unit tests, integration tests, end-to-end tests, and functional tests.

Consider using a data pipeline orchestration tool in addition to a testing framework to make code testing for your data pipelines easier. 

Data testing in a data pipeline

Data testing, including operational acceptance testing, is all about setting expectations for important pieces of data and ensuring that the data going through your data pipelines meets those expectations before it gets to the data users. If any data quality expectations are violated, appropriate communication or alerting and corrective action should be performed.

Unlike code testing, which is often done at the compilation or deployment stage, data testing is done constantly whenever fresh streams and batches of data are ingested and processed. 

Data testing is a continuous series of acceptance tests in which you make assumptions and expectations about newly arriving data and then test in real time to make sure that these assumptions and expectations are met.

Here’s a quick overview of data testing methods:

  • Table-level tests – they concentrate on comprehending the general form of a table. 
  • Column-level tests – these are classified into two types:
    • single-column tests (focusing on establishing expectations based on statistical features of specific columns),
    • multi-column tests (concerned with determining the relationships between columns).

Some open-source libraries, like Great Expectations, Deequ, and PyDeequ, were made to help data teams test the quality of their data.
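
To make the table-level and column-level checks listed above concrete, here is a plain-Python sketch. It deliberately does not use the API of any of the libraries mentioned; the function `check_batch`, the column names, and the rules are hypothetical examples of expectations a team might define.

```python
def check_batch(rows, expected_columns, min_rows):
    """Return a list of human-readable expectation failures for one batch."""
    failures = []

    # Table-level: the batch has a sensible size.
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")

    for i, row in enumerate(rows):
        # Table-level: every row has exactly the expected columns.
        if set(row) != set(expected_columns):
            failures.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        # Single-column: amount is present and non-negative.
        if row["amount"] is None or row["amount"] < 0:
            failures.append(f"row {i}: amount must be non-negative")
        # Multi-column: an order cannot ship before it was placed.
        # ISO 8601 date strings compare correctly as plain strings.
        if row["shipped_at"] < row["ordered_at"]:
            failures.append(f"row {i}: shipped_at is before ordered_at")
    return failures


batch = [
    {"amount": 10.0, "ordered_at": "2024-05-01", "shipped_at": "2024-05-03"},
    {"amount": -4.0, "ordered_at": "2024-05-02", "shipped_at": "2024-05-01"},
]
failures = check_batch(
    batch, expected_columns=["amount", "ordered_at", "shipped_at"], min_rows=1
)
for failure in failures:
    print(failure)
```

A non-empty result is the point where alerting and corrective action would be triggered, before the batch reaches data users.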

Challenges for testing data pipelines

Even though many ETL testing problems are the same as in general software testing, there are some details that are much harder to test when testing data pipelines.

Testing on a large scale 

We are inherently writing or streaming large volumes of data to a central location, so the testing process needs large amounts of data as well. Replicating production data is time-consuming and costly. While object storage is reasonably priced, it’s still not free.

Data lakes are frequently petabytes in size and are rapidly expanding. It may take hours to copy files to various buckets for a “production-like” test. Also, if a data lake stores 100 TB of data on S3, it will cost about $25,000 per year to make a single copy of that data for a continuous testing environment. 

Do you want to run numerous test environments at the same time? Continue to multiply. And you can be sure that over that year, your data use will only increase.
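
The arithmetic behind that estimate can be sketched as below. The flat price of $0.021 per GB per month is an assumption chosen for illustration; actual object storage pricing varies by provider, region, storage class, and volume tier.

```python
def yearly_copy_cost(size_tb, price_per_gb_month, copies=1):
    """Yearly storage cost of keeping full copies of a data lake."""
    size_gb = size_tb * 1000  # decimal terabytes
    return size_gb * price_per_gb_month * 12 * copies


ASSUMED_PRICE_PER_GB_MONTH = 0.021  # assumption, not a quoted price

one_copy = yearly_copy_cost(100, ASSUMED_PRICE_PER_GB_MONTH)
three_copies = yearly_copy_cost(100, ASSUMED_PRICE_PER_GB_MONTH, copies=3)
print(f"1 copy:   ${one_copy:,.0f} per year")      # about $25,200
print(f"3 copies: ${three_copies:,.0f} per year")  # about $75,600
```

The cost grows linearly with both the size of the lake and the number of test environments, which is why full copies become hard to justify as the data grows.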

Reproducibility in testing data pipelines

While testing an ETL, you want to compare the result set to an expected result set from the previous run. But once your production data has changed, the input that produced the previous result is no longer available, so it’s hard to tell whether a difference comes from the new code or from the new data.

Testing accuracy

While size is important, so are the complexity and diversity of the data. The diversity of data can have a significant influence on an ETL’s performance. For instance, if a part of the ETL re-partitions data by a certain column, it could become a very expensive process (depending on the values of the column).
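
A small sketch of why the values of the column matter: if one value dominates, re-partitioning by that column puts most rows in a single partition. The helper `largest_partition_share` and both data sets are hypothetical, and the share is only a rough indicator of skew.

```python
from collections import Counter


def largest_partition_share(rows, column):
    """Fraction of rows that would land in the biggest partition."""
    counts = Counter(row[column] for row in rows)
    return max(counts.values()) / len(rows)


# Evenly spread test data: four countries, 25 rows each.
uniform = [{"country": c} for c in ["US", "DE", "FR", "JP"] * 25]
# Production-like data: one country dominates.
skewed = [{"country": "US"}] * 97 + [{"country": c} for c in ["DE", "FR", "JP"]]

print(largest_partition_share(uniform, "country"))  # 0.25
print(largest_partition_share(skewed, "country"))   # 0.97
```

Both data sets have the same size, but only the second one would reveal the expensive case, which is why test data should resemble production data in its distribution and not just its volume.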

Testing ETLs without endangering production data

You can test your ETL by reading directly from your production object store, where the files are extracted (this is usually not a good idea). 

But what if your change requires data deletion? Suppose you need to test a new retention procedure that complies with GDPR regulations. Are you prepared to take the risk of mistakenly erasing production data? Testing data pipelines is best done in isolation.

Automation in testing

ETLs are time-consuming, costly, multi-step processes, making data testing a critical component of early problem detection and resolution. What every engineer wants is to use data testing not only to discover problems but also to detect them early on. Instead of waiting for the process to finish and then comparing the outcomes, teams need a means to compare results many times throughout the process automatically.
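
A minimal sketch of checking results throughout the process, assuming each stage can be paired with a validation function. The runner `run_stages` and the two example stages are hypothetical.

```python
def run_stages(data, stages):
    """Run (name, transform, check) stages, stopping at the first failed check."""
    for name, transform, check in stages:
        data = transform(data)
        if not check(data):
            # Failing here avoids running the remaining, possibly costly, stages.
            raise ValueError(f"check failed after stage {name!r}")
    return data


stages = [
    ("parse", lambda rows: [int(r) for r in rows], lambda rows: len(rows) > 0),
    ("filter", lambda rows: [r for r in rows if r > 0], lambda rows: len(rows) > 0),
]

print(run_stages(["3", "-1", "7"], stages))  # [3, 7]
```

Because the error names the stage whose output failed its check, the same mechanism also narrows down where a problem was introduced.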

Solution? lakeFS makes data pipeline testing easy

lakeFS is an open-source project that applies best practices from software engineering to the world of data engineering. This approach allows teams to conduct beta testing efficiently, ensuring that data pipelines are ready for production environments. Dev/Test environments and CI/CD are more difficult to adopt in data engineering since you don’t just handle code. 

lakeFS provides version control over the data lake and employs git-like semantics to build and retrieve those versions.

Scalability

lakeFS implements data versioning through metadata manipulation. As a result, creating a new environment that is identical to production does not require copying any data. You can do that for data lakes of any size in milliseconds, without having to add any storage.

Reproducibility

After you’ve created new versions of your ETL, you can quickly test them against previous versions. This way, you can compare the output of multiple versions of your ETL against the same input.
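
The comparison itself can be sketched in a few lines. Here the fixed input is just an in-memory list, standing in for a versioned snapshot of the data; `etl_v1`, `etl_v2`, and `diff_outputs` are hypothetical names, and no lakeFS API is used.

```python
def etl_v1(rows):
    """Previous version: total is the raw product."""
    return [{"id": r["id"], "total": r["qty"] * r["price"]} for r in rows]


def etl_v2(rows):
    """New version: total is rounded to two decimals."""
    return [{"id": r["id"], "total": round(r["qty"] * r["price"], 2)} for r in rows]


def diff_outputs(old, new, key="id"):
    """Map each key whose rows differ to its (old_row, new_row) pair."""
    old_by_key = {r[key]: r for r in old}
    new_by_key = {r[key]: r for r in new}
    return {
        k: (old_by_key.get(k), new_by_key.get(k))
        for k in sorted(old_by_key.keys() | new_by_key.keys())
        if old_by_key.get(k) != new_by_key.get(k)
    }


# The same fixed input is fed to both versions.
snapshot = [{"id": 1, "qty": 3, "price": 0.333}, {"id": 2, "qty": 2, "price": 5.0}]
changes = diff_outputs(etl_v1(snapshot), etl_v2(snapshot))
print(sorted(changes))  # [1]
```

Since the input is identical for both runs, every entry in `changes` is attributable to the code change alone.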

Accuracy

Since lakeFS branches are created directly from production data, ETL testing is performed on production-identical data, with all of the complexity that entails. You can be confident that this is exactly how the production environment appears.

Isolation

Have you inadvertently destroyed all of your data? No worries. Remove the branch and begin testing your new code against a new branch.

Automation

lakeFS users can use post-commit hooks to test ETLs at each stage of the process. This method not only helps detect flaws quickly in a given phase of a multi-step ETL, it’s also incredibly useful in root cause analysis.

Try lakeFS in playground mode to see how it works and what benefits it brings to teams looking to test their ETL pipelines efficiently.
