Data Quality Testing: Ways to Test Data Validity and Accuracy
Table of Contents
If Sisyphus had been a data analyst or a data scientist, the boulder he’d be rolling up the hill would have been his data quality assurance. Even if all the engineering processes of ingesting, processing, and modeling work impeccably, data quality testing is essential at any stage of the data pipeline.
Facing the question “Is this OK?” is hard enough. Why? Because it is hard to define what OK means in the data engineering world.
What is data quality testing, and why is it important?
|Data quality dimension||Description|
|Timeliness||Data’s readiness within a certain time frame.|
|Completeness||The amount of usable or complete data, representative of a typical data sample.|
|Accuracy||Accuracy of the data values based on the agreed-upon source of truth.|
|Validity||How much data conforms to acceptable format for any business rules.|
|Consistency||Compares data records from two different datasets.|
|Uniqueness||Tracks the volume of duplicate data in a dataset.|
Traditionally, data quality testing is split into six dimensions:
- Timeliness: Is your data available when you need it?
- Completeness: Does it fulfill your expectations of what’s comprehensive?
- Accuracy: How well does a piece of information reflect reality?
- Validity (aka Conformity): Is the information in a specific format, type, or size? Does it follow business rules/best practices?
- Consistency: Do data assets stored in one place match relevant data stored elsewhere?
- Integrity (aka Uniqueness): Can different data sets be joined correctly to reflect a larger picture? Are relationships well-defined and implemented?
These dimensions were defined while taking a broad view of designing a data warehouse and data quality testing for this scenario. They consider all data sets defined and collected, the relationships between them, and their ability to properly serve the organization.
When we look at a single data set, our quality considerations are more narrow:
- It doesn’t require completeness, as other data sets may compensate.
- Consistency and integrity are irrelevant since other datasets aren’t considered.
- Timeliness depends mostly on the engineering pipelined functioning rather than on the quality of the data.
In our case, asking if a data set is OK is equivalent to asking “Is it valid and accurate?”.
In this article, I will describe validity testing, break down the concept of accuracy testing, and review the testing frameworks available.
Types of data quality tests
When data is updated on a regular basis, it provides an accurate image of the data source. However, as data becomes stale or old, it ceases to be reliable and becomes useless to downstream users.
Data freshness checks assess the quality of data inside a table by monitoring how frequently that data is updated in comparison to established latency restrictions, such as when an ingestion operation is expected to load on any given day. Teams can build manual data these data quality checks using SQL rules.
NULL values test
Missing data (NULL values) is one of the most prevalent data quality concerns. They appear when a field is purposely left blank or as a result of a pipeline problem (like an API outage). A NULL values test validates if values within a defined column for a certain model are missing after the model runs.
Why is it important? Imagine that you’re checking the impact of an event on sales by region, but in some entries, the “region” field is left blank. As a result, the rows missing the region data are removed from the report, giving you inaccurate data. Make sure to test your data with NULL values in mind to avoid this.
Numeric distribution tests
Is my data within a reasonable range? Are my values within the range of a certain column? Distribution testing helps to answer these questions. A distribution test determines whether the data in a given table is reflective of reality.
You can easily define these rules in SQL by specifying minimums and maximums for a specific column.
Referential integrity tests
The parent-child connection between tables in a database is referential integrity. The primary key, also known as the main key or the foreign key, is the fundamental data that we combine across tables to generate models and draw insights.
But what if the data used for that primary key is altered or deleted? Referential integrity tests help here. Referential integrity checks are data quality checks that verify that any data mirrored in a child database has a matching parent table. Referential integrity data quality testing assures that changes to a parent or main key affect dependent tables.
Human mistakes happen, and because data discrepancies are widespread, it is critical that those records are reconciled on a regular basis using data quality testing to guarantee that your data remains clean and correct.
A string-searching method, like RegEx, is an effective approach to checking that strings in a column match a certain pattern. UUIDs, phone numbers, emails, integers, escape characters, dates, and other common patterns are up for validation with the help of string patterns.
Duplicate data is another prevalent quality issue that plagues both data engineers and consumers. Any data record that has been duplicated and shared with another data record in a database is considered duplicate data. You can only imagine the repercussions of keeping duplicate data around.
Luckily, we can use uniqueness tests to automate discovering duplicate entries to clean and normalize raw data before it enters the production warehouse.
Is data being collected? Is it insufficient, or is it too much data? All of these questions relate to data quality challenges caused by the volume of data entering your database.
Volume tests are an important quality control method for assessing the quantity of rows in crucial tables. They should definitely be part of your data quality testing strategy.
When should you test data quality?
When developing new data transformations for new data sources and business entities, add testing throughout the development cycle for high data quality.
It’s a good idea to test the original quality of your data. This is where you need to test for:
- Uniqueness and non-nullness of primary keys
- Column values that meet fundamental assumptions
- Rows that are duplicates
You can also use source freshness checks to check that source data is being updated via an ETL tool on a regular basis.
What about testing during data transformations?
It’s easy to make a mistake when you clean, aggregate, combine, and apply business logic to raw data as you weave in additional data manipulations and generate new metrics and dimensions with SQL and Python.
This is when you should check if:
- Primary keys are distinct and non-null.
- The row counts are accurate.
- Joins don’t introduce duplicate rows.
- Your expectations are met through the relationships between upstream and downstream dependents.
During pull requests
It’s smart to carry out data quality testing before incorporating data transformation modifications into your analytics code base. Contextualized test success/failure results help in code review and serve as a final check before the code is integrated into production.
In practical terms, you’ll be testing a snapshot of the data transformation code on a GitHub pull request.
If you use a Git-based data transformation tool, you can invite other data team members to collaborate. Others can review your code changes, help you debug errors, and ultimately build out a high-quality analytics foundation.
Make sure that no new data models or transformation code can enter your code base without first passing through your team’s review and testing against your standards.
Once your data transformations and tests have been merged into your main production branch, it’s key that you execute them on a regular basis for high data quality.
That’s because many things may happen to your data model. For example, a software engineer may deploy a new feature that alters your source data or a business user adds a new field to the ERP system and the business logic of your data transformation is broken.
Your ETL pipeline might end up putting duplicate or missing data into your warehouse, and so on.
This is where automated tests help. They let you be the first to notice when something happens in your business or data. Airflow, automation servers like GitLab CI/CD or CodeBuild, scheduling cron tasks – these are all common approaches for executing data tests in production.
Validation: Metadata testing
Metadata is the information describing the data, rather than the data itself.
For example, if the data is a table, which is often the case with analytics, the metadata may include the schema, like the number of columns and the name and type of variable in each column. If the data is in a file, the file format and other descriptive parameters, such as version, configuration, and type of compression, may be part of the metadata.
The test definition is straightforward:
There is an expectation for each value of the metadata that is derived from the best practices of the organization and the regulations it must adhere to.
If you’re a software engineer, this type of test is very much like unit testing of a piece of code. Just like with unit test coverage, it may take some time to create all those tests, but reaching high test coverage is both possible and advised.
Maintaining the tests whenever metadata changes is also required. There’s often a gap in expectations. While we are used to updating our unit tests when we change the code, we must be willing to invest the same time and attention to maintain the metadata validation when our schemas evolve.
Three Shades of Data Accuracy
Type 1: Entry level fact-checking
The data we collect comes from the reality around us, and hence some of its properties can be validated by comparing them to known records, for example:
- Is this address real?
- Is this an active webpage?
- Do we sell a product by this name?
- For the price column, are the values non-negative?
- For a mandatory field, is it not null?
- Values come from a given range, so min and max are known.
Getting the values for validation usually requires querying another data set that can reliably provide the answer. This data set can be internal to the company, such as the employee records within the HR systems. As well as sources external to the company, such as OpenStreetMap for a global street, city, and country registration database.
Once the validating values have been obtained, the test itself is a simple compare/contain query, and its accuracy is limited to the accuracy of the external data set used.
This test validates the data itself rather than its metadata. It is best to perform such validations as close as possible to the collection of the data to avoid accuracy issues.
For instance, if a person fills out a form to collect data, the digital form can only present options that are valid. Since this is not always possible, validation at the ingest stage for the values is recommended.
Type 2: Set level sanity
Fact-checking is testing a value within a single record. When it comes to big data, we need to test the attributes of the set we have.
The set may include data from a certain timeframe, from a given operational system, an output of an ETL process, or a model. Whatever its origin, it has characteristics as a set that we would like to verify.
Those characteristics are statistical, such as:
- Data is expected to come from a given distribution.
- The average, variance, or median value is expected to be within a given range with high probability.
Statistical tests still require you to know what to expect, but your expectation now has a different form:
- Is the probability that this data comes from this distribution high enough?
- The average of this column should be within this range with 95% probability.
Consider a table that holds the hands dealt to players in a poker game (yes, gaming websites also have BI :-)). In this case, the expected distribution of hands can be pre-calculated.
The test we perform will look at the values within the column holding the dealt hands and ask: What is the probability this set came from the expected distribution?
In this statistical test, the parameters that define the pass/fail will have to be probabilistic. You’ll get an alert if the distribution is uniform with less than X% probability.
Certainly, you would rather manually check an alerted data set than have an error cascade into your data pipelines.
Type 3: History-based set level sanity
Just like statistical accuracy tests, we’re looking at the attributes of a set of records. In this case, we don’t have a real world source of truth to rely on. We have the history of the data set itself: the same data set as it evolved over time.
We can use this historical data to create a baseline for the characteristics of the data, and then test to see if the new data set of today is consistent with the baseline. Data quality testing needs to take this into account.
Some examples of characteristics we can deduce from historical data are:
- Expectation and variance of values in a given column
- Distribution of the values in a given column, e.g., the number of events per minute of the day.
- Look for features of the data and their expected distribution.
- A certain seasonality over time is expected, e.g., a spike in sales on Black Friday and less traffic on weekends.
- Run anomaly detection algorithms to look at the history of features and see if the current values are normal
Learning the baseline adds a probabilistic aspect not only to the result of the test but also to the validity of the baseline values. We perform the same statistical test as in type 2, but we have an additional risk to its correctness as the baseline we compare to is correct only with a certain probability as it was statistically deduced from historical data.
Should we still perform the test? If the probability that the baseline is correct is high enough and you use the thresholds wisely, absolutely.
It’s advisable to remember that alerting systems need to balance between false positive and false negative:
- False-negative: Test failed when it should have passed
- False-positive: Test passed when it should have failed.
You should structure your tests in a way that optimizes the desired error according to business needs. You will see errors, so make sure you have the errors you can live with, with high probability. 🙂
What data quality testing frameworks are out there?
An open source tool out of AWS Labs that can help you define and maintain your metadata validation. Deequ is a library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets. Deequ works on tabular data, e.g., CSV files, database tables, logs, flattened JSON files. Basically, anything that you can fit into a Spark data frame.
The project is working to evolve to the accuracy tests described above, but its primary capabilities are within the validation domain.
Also focused on validation, this open source tool allows easy integration into your ETL code and can test data from a SQL or file interface. Since it is structured as a logging system, it can be used in documentation format and create automatic documentation from the tests defined. It also offers to profile the data and automatically generate expectations that are asserted during testing.
Torch allows validation using a rule based engine. Rules can be defined based on your own domain expertise and a large set of rules provided by Torch. This is really helpful for testing data quality.
The system offers some capabilities related to the analysis of the history of the data set, but those are very basic type 2 tests. Acceldata provides a wider set of tools for data pipeline observability, covering other aspects of the six dimensions of data quality, and Torch is one of its modules.
OwlDQ is a data quality testing tool based on a dynamic analysis of the data sets and an automatic adaptation of expectations. The rules allow a definition of a feature to be tracked and the probability of a pass/fail, but the heavy lifting of data characterization is left to the OwlDQ engine.
This is a code free implementation and observability platform that comes in handy for data quality testing. It uses machine learning to infer and learn what your data looks like, proactively identify data issues, assess their impact, and issue alerts through integrations with common operations systems. It also allows root cause analysis.
A pipeline metadata monitoring tool that also provides out-of-the-box data quality metrics (e.g., data schemas, data distributions, completeness, and custom metrics) with no code changes required.
Take a deeper dive into data quality tools: Top Data Quality Tools in the Age of Scalable Data 
CI/CD for data
Continuous integration and continuous deployment of data are automated processes that rely on the ability to identify issues with the data and prevent them from cascading into production. Ideally, you should run data quality tests whenever the occasion calls for it.
This is where version control solutions like lakeFS can help.
With lakeFS, you get zero-copy isolation, pre-commit and pre-merge hooks to support the automated data quality tests. The solution also integrates with data quality testing tools that provide the testing logic described above, helping you test your data seamlessly and at every important point.
Learn more about how lakeFS enables CI/CD on data lakes.
Table of Contents
Table of Contents