The quality of the data we introduce determines the overall reliability of our data lake, and the ingestion stage is a critical point for ensuring the soundness of our service and data. In the same way that software engineers apply automated testing to new code, data engineers should continuously test newly ingested data to ensure it meets data quality requirements.
Despite the scalability and performance advantages of running your data lake on top of object stores, it remains extremely challenging to enforce best practices and ensure high data quality. In this post, we’ll explore how lakeFS, an open-source tool with ‘Git-like’ capabilities over object storage, can be used to create an automated CI process for newly ingested data.
Whether we need to ingest the latest increment of an existing data set, such as the data from the last 5 minutes, or a new data set altogether, we make assumptions about the format, structure, or content of the data we are about to use. To ensure data quality, we must test the data to validate those assumptions. Examples of validation tests include schema or format validation, and tests on the data itself that check its distribution, variance, or features.
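As a rough illustration of what such validation tests look like, here is a minimal sketch in plain Python. It is not tied to any testing framework, and the column names, types, and bounds (`user_id`, `amount`, the 5 to 50 range) are hypothetical values chosen only for the example.

```python
from statistics import mean

# Hypothetical expectations for an ingested batch; the column names and
# bounds are illustrative and not taken from any real data set.
EXPECTED_SCHEMA = {"user_id": int, "amount": float}
AMOUNT_MEAN_RANGE = (5.0, 50.0)


def validate_batch(rows):
    """Return a list of failure messages; an empty list means the batch passed."""
    failures = []
    for i, row in enumerate(rows):
        # Schema check: the exact set of columns, each with the expected type.
        if set(row) != set(EXPECTED_SCHEMA):
            failures.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        for column, expected_type in EXPECTED_SCHEMA.items():
            if not isinstance(row[column], expected_type):
                failures.append(f"row {i}: {column} is not {expected_type.__name__}")
    # Distribution check: only run it once the schema checks have passed.
    if rows and not failures:
        batch_mean = mean(row["amount"] for row in rows)
        low, high = AMOUNT_MEAN_RANGE
        if not low <= batch_mean <= high:
            failures.append(f"mean amount {batch_mean:.2f} outside [{low}, {high}]")
    return failures
```

A dedicated framework adds result storage, reporting, and a much richer set of checks, but the basic idea is the same: each assumption about the data becomes an explicit check that either passes or produces a message explaining why it failed.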
This task has become easier with the rise of new testing frameworks like Monte Carlo and the open-source project Great Expectations. Both allow you to build quality tests for the data and manage the test results.
Continuous Integration of Data
So, how do we achieve high-quality ingestion of data with atomic ‘Git-like’ operations? A good practice is to ingest the data into an isolated branch, so that data consumers are not exposed to it. This allows us to test the data on the branch and merge it into the main data branch only if the tests pass. To automate the process, we can define a set of pre-merge hooks that trigger data validation tests. Only after the tests pass is the merge into the lake’s master branch performed. If a test fails, lakeFS will send an event to a monitoring system, with a link to relevant information about the validation test failure. Because the newly ingested data is committed to the ingestion branch, the commit includes a snapshot of your data repository, which makes the issue at hand easier to debug.
This approach makes it possible to run data quality validation tests before the data is ingested. Testing data before it is merged into master prevents the cascading quality issues that often occur when the arrival of a new data batch triggers a DAG of operations over the data.
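To make the isolation idea concrete, here is a toy, in-memory model of the pattern. It is only a sketch: `ToyRepo` and its methods are hypothetical and are not the lakeFS API, and a real repository tracks commits rather than plain dictionaries.

```python
class ToyRepo:
    """In-memory stand-in for a branch-aware object store (not the lakeFS API)."""

    def __init__(self):
        # Each branch maps object paths to their contents.
        self.branches = {"master": {}}

    def create_branch(self, name, source="master"):
        # A new branch starts as a copy of its source's listing.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, path, data):
        self.branches[branch][path] = data

    def merge(self, source, destination, pre_merge_check):
        """Merge only if the check accepts the objects the merge would add or change."""
        target = self.branches[destination]
        incoming = {
            path: data
            for path, data in self.branches[source].items()
            if target.get(path) != data
        }
        if not pre_merge_check(incoming):
            return False  # the destination branch is left untouched
        # Replace the listing in a single assignment, so readers see
        # either the old state or the new one, never a partial merge.
        self.branches[destination] = {**target, **incoming}
        return True


repo = ToyRepo()
repo.create_branch("ingest")
repo.write("ingest", "events/batch-1.csv", "user_id,amount\n1,10.0\n")

# Consumers reading master cannot see the new batch yet.
assert "events/batch-1.csv" not in repo.branches["master"]

# A failing check blocks the merge; a passing one publishes the batch.
assert repo.merge("ingest", "master", lambda objs: False) is False
assert repo.merge("ingest", "master", lambda objs: all(objs.values())) is True
assert "events/batch-1.csv" in repo.branches["master"]
```

The point of the model is the ordering: data lands on the ingestion branch first, the check runs against what the merge would change, and master moves only after the check passes.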
Data Quality Branching Model
To tackle this problem, we created this short step-by-step guide on how you can ensure high-quality data ingestion using lakeFS and testing frameworks.
New data ingestion
- Ingest data to a designated Ingest Branch.
- A webhook (think GitHub Actions) triggers a test run on a testing framework.
- A pass/fail message is sent back to lakeFS, with a string specifying where to find information about the test result.
- If a test fails, the merge to master fails and your monitoring system is alerted with a string of test information.
- If a test passes, data is automatically merged to the master branch.
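The steps above can be sketched as a single decision function that a webhook endpoint might call. This is a minimal sketch under stated assumptions: `handle_pre_merge`, `fake_test_run`, the event field names (`source_ref`, `branch_id`), and the status codes are all hypothetical, so check the lakeFS hooks documentation for the actual payload and for how responses are interpreted.

```python
import json


def handle_pre_merge(event, run_tests):
    """Decide whether a merge may proceed.

    event:     dict describing the merge; the field names used here
               ("source_ref", "branch_id") are assumptions.
    run_tests: callable taking the source ref and returning
               (passed, report_location).
    Returns an (http_status, json_body) pair for the webhook response.
    """
    passed, report_location = run_tests(event["source_ref"])
    body = json.dumps(
        {
            "passed": passed,
            "source_ref": event["source_ref"],
            "target_branch": event["branch_id"],
            # Where a human can find the detailed test results.
            "report": report_location,
        }
    )
    # Assumption: a 2xx status lets the merge proceed, anything else blocks it.
    return (200 if passed else 412), body


def fake_test_run(ref):
    # Stand-in for a call into a testing framework.
    return ref != "bad-ingest", f"reports/{ref}.html"


status, body = handle_pre_merge(
    {"source_ref": "ingest", "branch_id": "master"}, fake_test_run
)
```

Keeping the decision in a plain function, separate from any HTTP server code, makes it easy to unit test the gate itself and to swap in a real test runner later.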
We hope this post gave you a good idea of how to ensure data quality in your data lake environment. If you have ideas for other branching models that help achieve this goal, we’d love to hear from you. Join our Slack channel and say hello.