If Sisyphus had been a data analyst or a data scientist, the boulder she’d be rolling up the hill would have been her data quality assurance. Even if all engineering processes of ingesting, processing, and modeling are working impeccably, the ability to test data quality at any stage of the data pipeline, and being faced with the question “Is this OK” is hard enough. Why? Because it is hard to define what OK means.
Traditionally data quality is split into 6 dimensions:
- Accuracy: How well does a piece of information reflect reality?
- Completeness: Does it fulfill your expectations of what’s comprehensive?
- Consistency: Does information stored in one place match relevant data stored elsewhere?
- Timeliness: Is your information available when you need it?
- Validity (aka Conformity): Is the information in a specific format, type, or size? Does it follow business rules/best practices?
- Integrity: Can different data sets be joined correctly to reflect a larger picture? Are relations well defined and implemented?
These dimensions were defined while taking a wide view of designing a data warehouse. They consider all data sets defined and collected, the relations between them, and their ability to properly serve the organization.
When we look at a single data set, our quality considerations are more narrow:
- It doesn’t require completeness, as other data sets may compensate.
- Consistency and integrity are irrelevant since other datasets are not considered.
- Timeliness depends mostly on the engineering pipelined functioning, rather than on the quality of the data.
In our case, asking if a data set is OK is equal to asking “Is it valid and accurate?”.
In this blog, I will describe validity testing, breakdown the concept of accuracy testing, and review the testing frameworks available.
Validation: Metadata testing
Metadata is the information describing the data, rather than the data itself. For example, if the data is a table, which is often the case with analytics, the metadata may include the schema, e.g. the number of columns, and the name and type of variable in each column. If the data is in a file, the file format and other descriptive parameters, such as version, configuration, and type of compression may be part of the metadata.
The test definition is straightforward: there is an expectation for each value of the metadata that is derived from the best practices of the organization and the regulation it must adhere to. If you are a software engineer, this type of test is very much like unit testing of a piece of code. Just like with unit test coverage, it may take some time to create all those tests, but reaching high test coverage is both possible and advised.
Maintaining the tests whenever metadata changes is also required. There’s often a gap in expectations. While we are used to updating our unit tests when we change the code, we must be willing to invest the same time and attention to maintain the metadata validation when our schemas evolve.
Three Shades of Data Accuracy
Type 1: Entry level fact-checking
The data we collect comes from the reality around us, and hence some of its properties can be validated by comparing them to known records, for example:
- Is this address real?
- Is this an active webpage?
- Do we sell a product by this name?
- For the price column, are the values non-negative?
- For a mandatory field, is it not null?
- Values come from a given range, so min and max are known.
Getting the values for validation usually requires querying another data set that can reliably provide the answer. This data set can be internal to the company, such as the employee records within the HR systems. As well as sources external to the company, such as OpenStreetMap for a global street, city, country registration database.
Once the validating values have been obtained, the test itself is a simple compare/contain query, and its accuracy is limited to the accuracy of the external data set used.
This test validates the data itself, rather than its metadata. It is best to perform such validations as close as possible to the collection of the data, to avoid accuracy issues. For example, if the data is collected by a person filling a form, the digital form can offer only valid options. Since this is not always possible, validation at the ingest stage for the values is recommended.
Type 2: Set level sanity
Fact-checking is testing a value within a single record. When it comes to big data, we need to test the attributes of the set we have. The set may include data from a certain timeframe, from a given operational system, an output of an ETL process, or a model. Whatever its origin, it has characteristics as a set that we would like to verify. Those characteristics are statistical, such as:
- Data is expected to come from a given distribution.
- Average, variance, or median value is expected to be within a given range in high probability.
Statistical tests still require you to know what to expect, but your expectation now has a different form:
- Is the probability that this data comes from this distribution high enough?
- The average of this column should be within this range with 95% probability.
Consider a table that holds the hands dealt to players in a poker game (yes, gaming websites also have BI :-)). In this case, the expected distribution of hands can be pre-calculated.
The test we perform will look at the values within the column holding the dealt hands, and asking, what is the probability this set came from the expected distribution?
In this statistical test, the parameters that define the pass/fail will have to be probabilistic. You will get an alert if the distribution is uniform with less than X% probability. Certainly, you rather check manually an alerted data set, than have an error cascading into your data pipelines.
Type 3: History-based set level sanity
Just like statistical accuracy tests, we are looking at attributes of a set of records. Only in this case, we don’t have a real world source of truth to rely on. We have the history of the data set itself: the same data set as it evolved over time.
We can use this historical data to create a baseline for the characteristics of the data, and then test to see if the new data set of today is consistent with the baseline.
Some examples of characteristics we can deduce from historical data:
- Expectation and variance of values in a given column
- Distribution of the values in a given column, e.g. the number of events per minute of the day.
- Look for features of the data and their expected distribution.
- A certain seasonality over time is expected, e.g spik in sales on black friday, less traffic on weekends.
- Run anomaly detection algorithms to look at the history of features and see if the current values are normal
Learning the baseline adds a probabilistic aspect not only to the result of the test but also to the validity of the baseline values. We perform the same statistical test as in type 2, but we have an additional risk to its correctness, as the baseline we compare to is correct only with a certain probability as it was statistically deduced from historical data.
Should we still perform the test? If the probability the baseline is correct is high enough, and you use the thresholds wisely, absolutely. It is advisable to remember that alerting systems need to balance between false positive and false negative:
- False-negative: Test failed when it should have passed
- False-positive: Test passed when it should have failed.
You should structure your tests in a way that optimizes the desired error according to business needs. You will have errors, so make sure you have the errors you can live with, with high probability :-).
What data quality testing frameworks are out there?
An open source tool out of AWS labs that can help you define and maintain your metadata validation. Deequ is a library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets. Deequ works on tabular data, e.g., CSV files, database tables, logs, flattened json files. Basically anything that you can fit into a Spark data frame.
The project is working to evolve to the accuracy tests described above, but its primary capabilities are within the validation domain.
Also focused on validation, this open source tool allows easy integration into your ETL code and can test data from SQL or file interface. Since it is structured as a logging system, it can be used through a documentation format, and create automatic documentation from the tests defined. It also offers to profile the data and automatically generate expectations that are asserted during testing.
Torch allows validation using a rule based engine. Rules can be defined based on your own domain expertise and a large set of rules provided by Torch. The system offers some capabilities related to the analysis of the history of the data set, but those are very basic type 2 tests. Acceldata provides a wider set of tools for data pipeline observability, covering other aspects of the 6 dimensions of data quality, and torch is one of its modules.
OwlDQ is based on a dynamic analysis of the data sets and automatic adaptation of the expectation. The rules allow a definition of a feature to be tracked, and the probability of pass/fail, but the heavy lifting data characterization is left to the OwlDQ engine.
This is a code free implementation observability platform. It uses machine learning to infer and learn what your data looks like, proactively identify data issues, assess its impact, and alerts through integrations to common operations systems. It also allows root cause analysis.
CI/CD for data
Continuous integration and continuous deployment of data are automated processes that rely on the ability to identify issues with the data, and prevent them from cascading into production. With lakeFS you get zero copy isolation, pre-commit and pre-merge hooks to support the automatic process. It also integrates with data quality testing tools that provide the testing logic described above.