Enterprises increasingly use data as the foundation for their decisions and operations. The number of digital products that collect, analyze, and feed data into decision-making algorithms to improve future services is also growing rapidly. As a result, data quality has become a critical asset for businesses in almost every industry, from finance to healthcare to retail. Many organizations are asking themselves how to maintain data quality in their data lakes.
Over time, data engineering teams have built new data management capabilities on top of various technologies. These ecosystems revolve around data lakes and data pipelines, which make it easy and cheap to store and analyze data.
Despite several breakthroughs in data tools and processes, engineers still face many challenges. One is the ever-present question: how do you maintain data quality in the face of constant data changes?
Why is maintaining data quality in the data lake so hard?
The need to store, organize, and connect data gave rise to a new role: data engineering. Its original goal was to support business intelligence and database management. Since then, the field has changed dramatically, driven by the need to handle huge volumes of data and to support machine learning techniques.
Data engineers are dealing with more data than ever before and struggling to keep data pipelines in shape in the face of poor infrastructure performance or obsolete ETL methods.
Here are a few issues that most data engineers face today:
- It’s difficult to validate data quality and consistency before it flows into the lake. That’s because – unlike software engineers – data practitioners don’t really have staging or QA environments for data. Everything, including possible issues, gets washed into the lake.
- Engineers can’t test and debug new data sets in isolation – whether in the pre-production phase, during deployment, or in final QA before the data reaches end users. That’s because data doesn’t have a specialized testing environment; everything ends up in one lake.
- Troubleshooting is hard because data engineers lack an effective technique for discovering, assessing, and fixing errors in production.
As you might expect, a large portion of data engineering is based on manual work and involves heavy lifting. Unlike software developers, data engineers don’t have access to a wide range of automation technologies that make low-level manual work unnecessary and eliminate mistakes. Not to mention that the cost of making a mistake is quite high, which often stops businesses from moving quickly.
Is there an escape route? It’s right around the corner—you can find it in every modern software development team that uses Git for its operations.
Maintain data quality with versioning
The great news is that all of these issues have already been addressed at the application level. In a standard development team, different developers contribute to the same repository without stepping on each other’s toes. Different users run different versions of the software at the same time, yet developers can easily reproduce a user’s problem by checking out the exact version that user is running.
This is the purpose of data version control tools. They bring time-tested best practices from software development to data.
Managing data in the same way you manage code improves the efficiency of many data operations tasks. Here are a few examples.
Versioning and branching of data
When data exists in many versions, the version history doubles as lineage. Engineers can easily track changes to their repositories and datasets and point consumers to data that has just been published.
Work in isolation
When introducing changes or fixes to existing data pipelines, these changes need to be tested to make sure that they indeed improve the data and don’t introduce new errors. To do that, data engineers need to be able to develop and test these changes in isolation before they become part of production data.
If you expose new data to production users and anything goes wrong, you can always roll back to the prior version in a single atomic move.
Imagine a problem with the quality of the data that causes a drop in performance or a rise in infrastructure costs. If you have versioning, you may open a branch of the lake from the point where the modifications were brought into production. Using the information, you can recreate all of the environment’s features as well as the problem itself to start figuring out what’s wrong.
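To make the branching-and-rollback model concrete, here is a minimal in-memory sketch in Python. It is illustrative only – tools like lakeFS implement this over object storage with zero-copy branches – and all names in it are hypothetical:

```python
# Minimal sketch of Git-like versioning for data: immutable commits plus
# movable branch pointers. Illustrative only; not how lakeFS stores data.
import copy


class VersionedStore:
    """Tracks immutable snapshots of a dataset and named branches."""

    def __init__(self):
        self.commits = []                 # list of dataset snapshots
        self.branches = {"main": None}    # branch name -> commit index

    def commit(self, branch, dataset):
        """Record an immutable snapshot and advance the branch pointer."""
        self.commits.append(copy.deepcopy(dataset))
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def branch(self, name, from_commit):
        """Open a branch at a historical commit -- e.g. the point where a
        bad change entered production -- to reproduce the issue in isolation."""
        self.branches[name] = from_commit

    def rollback(self, branch, to_commit):
        """Atomically move the branch pointer back to a known-good commit."""
        self.branches[branch] = to_commit

    def read(self, branch):
        idx = self.branches[branch]
        return None if idx is None else self.commits[idx]


store = VersionedStore()
good = store.commit("main", {"rows": 1000, "schema": ["id", "amount"]})
store.commit("main", {"rows": 10, "schema": ["id"]})        # a bad ingest

store.branch("debug", from_commit=store.branches["main"])   # investigate in isolation
store.rollback("main", to_commit=good)                      # users see good data again
print(store.read("main")["rows"])  # -> 1000
```

The key property is that rollback is a pointer move, not a data copy, which is what makes it a single atomic operation.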
Version control systems allow you to set actions to be triggered when certain events occur. For example, a webhook can check a new file to see if it fits into one of the allowed data types.
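A validation hook of this kind can be sketched as a simple allow-list check. The function and paths below are hypothetical; in a real system such as lakeFS, the check would be wired to commit or merge events through its actions mechanism:

```python
# Sketch of an event-triggered check: reject files whose format is not on
# an allow-list. Hypothetical names; a real hook receives the changed paths
# from the version control system's event payload.
ALLOWED_SUFFIXES = {".parquet", ".csv", ".json"}


def pre_merge_format_check(changed_paths):
    """Return the offending paths; an empty list means the merge may proceed."""
    return [
        p for p in changed_paths
        if not any(p.endswith(s) for s in ALLOWED_SUFFIXES)
    ]


violations = pre_merge_format_check(
    ["events/2023/01/data.parquet", "events/2023/01/notes.docx"]
)
print(violations)  # -> ['events/2023/01/notes.docx']
```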
Using a data version control platform eliminates the problems that haunt large data engineering teams working on the same data. And when an issue arises, troubleshooting is significantly faster.
The open-source solution lakeFS is a great illustration of this, since it allows engineers to handle data like code, taking advantage of all the best practices and Git-like processes used by software developers today.
How to improve data quality with testing
Data quality is often classified into six dimensions:
- Accuracy – measures how effectively a piece of information reflects reality.
- Completeness – does the data meet your expectations of what constitutes comprehensiveness?
- Consistency – does data stored in one location match data stored elsewhere?
- Timeliness – is the data up to date and available when you need it?
- Validity (also known as conformity) – is the data in a specified format, kind, or size? Is it compliant with rules/best practices?
- Integrity – can you combine various data sets to provide a bigger picture? Are relationships specified and executed correctly?
These dimensions were originally developed for data warehousing in the broad sense. They examine all defined and gathered data sets, their relationships, and their capacity to adequately serve the enterprise. That is what makes them a great foundation for data quality testing.
We make assumptions about the format, structure, and content of the data we’re going to use – whether it’s the most recent data from an existing set, such as data from the last 5 minutes, or a whole new set. To make sure the data is good, we should test it to confirm those assumptions hold. Validation tests include schema or format validation as well as checks on the data itself – its distribution, variance, or other characteristics.
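As a minimal sketch, here is what schema and distribution checks might look like in plain Python. The expected schema and the value range are assumptions for illustration; in practice you would express such expectations with a testing framework:

```python
# Illustrative validation tests over a batch of records. The schema and the
# accepted mean range are assumed examples, not rules from any real pipeline.
import statistics

EXPECTED_SCHEMA = {"id": int, "amount": float}


def validate_schema(records):
    """Every record must carry exactly the expected fields, with the right types."""
    return all(
        set(r) == set(EXPECTED_SCHEMA)
        and all(isinstance(r[k], t) for k, t in EXPECTED_SCHEMA.items())
        for r in records
    )


def validate_distribution(records, field="amount", lo=0.0, hi=500.0):
    """A simple distribution check: the batch mean must stay in a sane range."""
    return lo <= statistics.mean(r[field] for r in records) <= hi


batch = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 120.50}]
print(validate_schema(batch), validate_distribution(batch))  # -> True True
```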
The emergence of new testing frameworks has made this process easier. These include tools like Monte Carlo and the open-source project Great Expectations. Both enable you to create quality tests for data and manage test results.
Metadata is information that describes data. For example, if data is a table, as is common in analytics, the metadata may include the schema, such as the number of columns and the name and type of variable in each column. If the data is in a file, the metadata may include the file format and other descriptive features such as version, configuration, and compression method.
The test definition is simple: for each metadata value, there is an expectation derived from the organization’s best practices and the rules it must follow.
This form of test is quite similar to unit testing a piece of code if you’re a software developer. As with unit test coverage, creating all of the tests may take some time, but achieving high test coverage is both doable and recommended.
It is also necessary to keep the tests running whenever the metadata changes, since expectations frequently fall out of sync. While we’re accustomed to updating unit tests when we update the code, we must be prepared to devote the same time and effort to maintaining metadata validation as our schemas evolve.
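To illustrate the analogy, here is a unit-test-style metadata check in Python. The expected column names, file format, and compression values are assumptions standing in for your organization’s rules:

```python
# Unit-test-style checks over table metadata. The expected values are
# hypothetical examples of organizational rules.
def table_metadata():
    # Stand-in for metadata read from a catalog, file header, or schema registry.
    return {
        "columns": {"id": "bigint", "amount": "double"},
        "format": "parquet",
        "compression": "snappy",
    }


def check_metadata(md):
    """Each expectation mirrors a unit test; a failure names the violated rule."""
    failures = []
    if set(md["columns"]) != {"id", "amount"}:
        failures.append("unexpected column set")
    if md["columns"].get("amount") != "double":
        failures.append("amount must be double")
    if md["format"] != "parquet":
        failures.append("format must be parquet")
    if md["compression"] not in {"snappy", "zstd"}:
        failures.append("unsupported compression")
    return failures


print(check_metadata(table_metadata()))  # -> []
```

As with unit tests, each check is cheap to run, and the failure messages make it obvious which expectation drifted when a schema evolves.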
Continuous data integration
So, how can we accomplish high-quality data intake using atomic “Git-like” operations? One best practice is to ingest data into a separate branch so that data consumers are unaware of it. This makes it possible to test the data on the branch and merge it only if the tests pass.
You can automate the procedure by configuring a series of pre-merge hooks that trigger data validation checks. Only if the tests pass will the hook merge the data into the lake’s master branch. If a test fails, a solution like lakeFS will notify the monitoring system and provide a link to important information about the validation failure. And since the newly ingested data was committed to the ingestion branch, you have a snapshot of the data repository at that point, which makes it easy to figure out what the problem is.
This technique enables data quality validation tests to run before ingestion is finalized. Testing data before it is merged into the master branch prevents the cascading quality issues that occur when a new data batch triggers a DAG of operations over the data.
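The ingest-test-merge flow can be sketched with a few stand-in functions. All names here are hypothetical; with lakeFS, the branch, hook, and merge steps are provided by the platform rather than hand-written:

```python
# Sketch of the ingest-test-merge flow: ingest to a branch, validate there,
# and merge into main only if the checks pass. Hypothetical helpers only.
def create_branch(repo, name, source="main"):
    # A real system branches with zero copy; here we copy for illustration.
    repo.setdefault(name, list(repo[source]))


def run_quality_tests(repo, branch):
    """Toy validation: every record must be a dict carrying an 'id' field."""
    return all(isinstance(r, dict) and "id" in r for r in repo[branch])


def merge(repo, source, target="main"):
    repo[target] = list(repo[source])


repo = {"main": [{"id": 1}]}
create_branch(repo, "ingest-2023-01-01")
repo["ingest-2023-01-01"].append({"id": 2})        # the new batch lands on the branch

if run_quality_tests(repo, "ingest-2023-01-01"):   # the pre-merge hook
    merge(repo, "ingest-2023-01-01")               # consumers only ever see tested data
print(len(repo["main"]))  # -> 2
```

If the tests fail, `main` is simply never updated, and the branch holds the exact snapshot needed to debug the bad batch.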
Quick guide to data testing frameworks
Great Expectations
This open-source tool focused on validation is straightforward to integrate into your ETL code. It can test data through an SQL or file interface. Since it was designed as a logging system, you can pair it with a documentation format to generate documentation automatically from the tests you define. It also lets you profile the data and derive expectations that you can then apply during testing.
Deequ
AWS created this open-source tool to help engineers build and maintain metadata validation. Deequ is an Apache Spark-based framework for writing “unit tests for data,” which measure data quality in large datasets. It is designed for tabular data – CSV files, database tables, logs, flattened JSON files – anything you can fit into a Spark data frame.
Torch
Torch by Acceldata supports validation using a rule-based engine. You can define rules using your domain knowledge and the large rule library that Torch provides. The system has some capabilities for analyzing data set history, although these are relatively simple type 2 tests. Acceldata offers a wider range of tools for observing data pipelines, including modules that cover more of the six dimensions of data quality.
OwlDQ
OwlDQ is based on dynamic analysis of data sets and automatic adaptation of expectations. Its rules let you define a feature to be monitored as well as the pass/fail likelihood, and the OwlDQ engine does the heavy lifting of data characterization.
Monte Carlo
Monte Carlo is a framework for implementing observability without writing code. It uses machine learning to infer what your data looks like, proactively discover data issues, analyze their impact, and send alerts via integrations with standard operational systems. It also supports root cause analysis.
This pipeline metadata monitoring tool also provides out-of-the-box data quality measures such as data schemas, data distributions, completeness, and custom metrics, without requiring you to modify any code.
The quality of the data we ingest determines how reliable our data lake is as a whole, which makes the ingestion step critical to keeping our services and data safe. Just as software engineers use automated testing to verify new code, data engineers need to test newly ingested data regularly to make sure it meets data quality standards.
Even though running your data lake on object storage brings scalability and performance benefits, it is still hard to apply best practices and guarantee high data quality. How do you maintain data quality in this context? The answer lies in adding automation to the mix.
Continuous integration and continuous deployment of data are automated processes that depend on the ability to find data errors and stop them from spreading into production. You can use several open-source solutions to build this capability.
One of them is lakeFS.
To facilitate this automation, lakeFS offers zero-copy isolation, pre-commit hooks, and pre-merge hooks. These open the door to the data quality testing technologies that provide the test logic discussed above.
Get involved with our growing community of data innovators:
- Check out the lakeFS repository on GitHub
- Follow us on Twitter or LinkedIn
- Say “Hi” in our friendly Slack Group