An Up-to-Date Approach to Building and Maintaining a Data Quality Framework

Idan Novogroder

Idan has an extensive background in software and DevOps engineering....

October 9, 2023

Many companies have data assumptions that just don’t match reality. The problem isn’t that they don’t have the data they require. The problem lies in data quality issues – low-quality data can’t be used properly and doesn’t add value.

That’s why it’s in your best interest to identify and correct data quality concerns before letting others use that data for business decision-making. Maintaining high data quality is a priority for every organization that wants to stay competitive today and in the future.

While some teams worry about gaps in data lineage and substance, others question the data's completeness and uniformity. That's why there's no silver bullet for data quality management – you can't use the same set of approaches and procedures to address every data quality issue.

Implementing a data quality framework is a key step towards improving data quality. But where do you get started?

This guide covers modern approaches to building and maintaining a data quality framework.

What is a Data Quality Framework?

A data quality framework is a technique for measuring data quality within an organization. It allows teams to identify their data quality goals and standards, as well as the actions they need to carry out to achieve those goals.

More broadly, a data quality framework is a complete set of principles, processes, and tools used by enterprises to monitor, enhance, and assure data quality. You can also see it as a roadmap for developing your data quality management plan.

Why build a data quality framework?

Data quality is far too essential to leave to chance. If your data quality suffers, it might have far-reaching ramifications not only for the systems that rely on it but also for the business decisions you make. That is why creating a viable data quality framework for your organization’s pipeline makes sense.

Data quality frameworks are often created for data lakes. Data lakes are centralized storage locations for massive volumes of data. It’s critical to have a process in place for keeping your data safe and consistent as it travels through your pipeline and to its final destination, wherever that may be.

What are the Main Components of a Data Quality Framework?

Figure: the data lifecycle, from ingestion and transformation through testing and deployment to monitoring and debugging.

A data quality framework entails processes for validating, cleaning, transforming, and monitoring data to ensure that it is accurate, consistent, comprehensive, dependable, and timely for its intended usage.

In this section, we give an overview of the major components of a data quality framework.

Data Workflow

A data workflow that focuses on quality needs to include data quality checks – criteria for testing and monitoring data quality. You can perform such checks at various stages of the data pipeline, including data collection, transformation, storage, and analysis. They can be manual or automated, depending on the required complexity and frequency.
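
For illustration only, here is a minimal sketch of such a check in pandas; the column names (`customer_id`, `amount`) and the two rules are hypothetical placeholders for your own criteria:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, stage: str) -> list[str]:
    """Return human-readable failures for a given pipeline stage."""
    failures = []
    if df["customer_id"].isna().any():
        failures.append(f"[{stage}] customer_id contains null values")
    if (df["amount"] < 0).any():
        failures.append(f"[{stage}] amount contains negative values")
    return failures

# Run the same checks after collection and again after transformation.
raw = pd.DataFrame({"customer_id": [1, 2, None], "amount": [10.0, -5.0, 7.5]})
print(run_quality_checks(raw, stage="collection"))
```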

Data Quality Rules

Data quality rules define the criteria used to audit or review data quality performance on a regular basis. You can capture these rules in data quality scorecards tailored to the company's data quality requirements.
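
As a rough illustration, a scorecard can be as simple as a set of named rules whose pass rates are computed for every batch; the `email` and `order_date` columns below are hypothetical:

```python
import pandas as pd

# Hypothetical rules: each maps a name to a boolean check per record.
RULES = {
    "email is present": lambda df: df["email"].notna(),
    "order_date is not in the future": lambda df: pd.to_datetime(df["order_date"]) <= pd.Timestamp.now(),
}

def scorecard(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the pass rate of every rule as a simple scorecard."""
    rows = [{"rule": name, "pass_rate": rule(df).mean()} for name, rule in RULES.items()]
    return pd.DataFrame(rows)

batch = pd.DataFrame({"email": ["a@example.com", None], "order_date": ["2023-10-01", "2099-01-01"]})
print(scorecard(batch))
```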

Data Issue Management

Data quality issues discovered during data profiling, data quality assessment, and data monitoring must be resolved. This is a critical point in a data quality framework. To achieve this in a timely manner that gains the trust of data consumers, you need proper data issue management processes and tooling.

Data Issue Root Cause Analysis

Identifying the underlying causes of a data-related problem is a common task for data teams. It pays to investigate the problem further and find the elements or sources that contribute to it.

To uncover the root causes faster, you can use methods such as fishbone diagrams, the 5 Whys, Pareto charts, and data profiling. A fishbone diagram, for example, might reveal that the main causes of a sales data problem are data input problems, data integration errors, data processing errors, and data governance concerns.
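
As a small example of a Pareto-style analysis, you can count logged issues by cause and look at the cumulative share; the cause categories below are hypothetical:

```python
import pandas as pd

# Hypothetical issue log: each row is one recorded data quality incident.
issues = pd.DataFrame({"cause": [
    "data input", "data input", "data input",
    "integration", "integration", "processing", "governance",
]})

pareto = issues["cause"].value_counts().to_frame("count")
pareto["cumulative_pct"] = pareto["count"].cumsum() / pareto["count"].sum() * 100
print(pareto)  # the top one or two causes typically account for most incidents
```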

Data Quality Process Automation

Manual data quality management systems open the door to errors in data input and other areas, undermining data quality. Errors ranging from a minor, undiscovered typo to an entry filled in the wrong field or completely missed can have a substantial influence on data quality.

Manual systems also require hands-on tactical effort from data experts, who may otherwise be working on more strategic business tasks.

The solution to this is to automate your data quality operations. This will speed up and improve both the efficiency and accuracy of data quality management. A proper setup of automated data quality processes with the right rules and integrations helps to improve the overall quality of the data and avoid the most impactful data quality issues.
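
One common automation pattern is a fail-fast quality gate between pipeline steps: blocking problems raise an error and stop the run, while minor ones are only logged. A minimal sketch with assumed thresholds:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

class DataQualityError(Exception):
    """Raised to halt the pipeline when a blocking check fails."""

def quality_gate(df: pd.DataFrame) -> pd.DataFrame:
    """Automated gate between pipeline steps: raise on hard failures, log soft ones."""
    if df.empty:
        raise DataQualityError("batch is empty")
    null_rate = df.isna().mean().mean()   # overall share of missing cells
    if null_rate > 0.1:                   # assumed tolerance
        raise DataQualityError(f"too many nulls: {null_rate:.0%}")
    if df.duplicated().any():
        logging.warning("duplicate rows detected; continuing with a warning")
    return df

clean = quality_gate(pd.DataFrame({"id": [1, 2, 2], "value": [3.0, 4.0, 4.0]}))
```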

Continuous Improvement Processes

A continuous data quality improvement process helps to deliver valid and trustworthy data in a consistent manner. It establishes long-term expectations for data teams to deliver data that consumers can rely on.

How Do You Implement a Data Quality Framework?

Here are a few steps for a practical data quality framework implementation at your company.

Assessment

The first step here is defining data quality in terms of sources, metadata, and data quality indicators. Next, analyze how well your current data measures up to that definition.

Here are a few steps you can take at this point:

  1. Choose sources for incoming data, such as CRMs, third-party providers, etc.
  2. Pick the properties you need for data completeness (examples include customer name, phone number, and address).
  3. Define the data type, size, pattern, and format for the chosen properties, such as a phone number that should have 10 digits and follow the pattern (XXX)-XXX-XXXX (see the profiling sketch after this list).
  4. Select data quality metrics that determine acceptance requirements in your data quality framework.
  5. Run data profile checks to see how existing data compares to the required data quality.
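
To make step 5 concrete, here is a minimal profiling sketch in pandas; the `phone` column and the (XXX)-XXX-XXXX pattern follow the example from step 3:

```python
import pandas as pd

PHONE_PATTERN = r"^\(\d{3}\)-\d{3}-\d{4}$"  # the (XXX)-XXX-XXXX format from step 3

existing = pd.DataFrame({"phone": ["(555)-123-4567", "5551234567", None]})

profile = {
    "completeness": existing["phone"].notna().mean(),
    "format_validity": existing["phone"].dropna().str.match(PHONE_PATTERN).mean(),
}
print(profile)  # roughly {'completeness': 0.67, 'format_validity': 0.5}
```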

Pipeline design

The next step is to create a data pipeline to guarantee that all incoming data is transformed into the state defined during the assessment stage.

At this stage, you need to choose the data quality methods required to clean, match, and safeguard data quality.

Data quality techniques at this stage include:

  • Data parsing and merging are used to separate or connect columns to make the data more intelligible.
  • Data cleaning and standardization remove issues such as null values and leading/trailing spaces, and convert values into an acceptable format (see the sketch after this list).
  • Data matching and deduplication are used to identify records that belong to the same entity and to remove duplicate entries.
  • Data merge and survivorship erase obsolete information and combine records to create a single view.
  • Data governance rules capture update history and provide role-based access.
  • Scheduling determines when to run the desired data quality processes: at the start, in the middle, or before the data is committed to the database.
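
Here is a minimal pandas sketch of a few of these techniques (cleaning, standardization, and deduplication); the column names and matching keys are hypothetical:

```python
import pandas as pd

raw = pd.DataFrame({
    "name": ["  Alice ", "Bob", "Bob"],
    "phone": ["(555)-123-4567", "(555)-987-6543", "(555)-987-6543"],
    "amount": ["10", "20.5", "20.5"],
})

cleaned = (
    raw
    .assign(
        name=lambda d: d["name"].str.strip(),         # standardization: trim whitespace
        amount=lambda d: pd.to_numeric(d["amount"]),   # cleaning: enforce a numeric type
    )
    .drop_duplicates(subset=["name", "phone"])         # deduplication on the matching keys
)
print(cleaned)
```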

Monitoring

Once you define the data quality levels and set up the data quality processes, you’re ready to execute them on existing data and then enable them for incoming data streams.

Monitoring and profiling the data processed by the data quality pipeline is essential since it lets you:

  • Check that the configured processes are functioning properly.
  • Ensure that any data quality concerns are addressed or minimized before transferring data to the target source.
  • Set up warnings whenever major faults occur in the system (see the sketch below).
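
As an illustration, a lightweight monitoring step can compare fresh metrics against agreed thresholds and raise an alert when they are breached; the thresholds and the `send_alert` stub below are placeholders for your own alerting integration:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

THRESHOLDS = {"completeness": 0.99, "duplicate_rate": 0.01}  # assumed acceptance levels

def send_alert(message: str) -> None:
    # Placeholder: swap in Slack, PagerDuty, email, or another channel.
    logging.error("DATA QUALITY ALERT: %s", message)

def monitor(df: pd.DataFrame) -> None:
    completeness = df.notna().all(axis=1).mean()   # share of fully populated rows
    duplicate_rate = df.duplicated().mean()        # share of exact duplicate rows
    if completeness < THRESHOLDS["completeness"]:
        send_alert(f"completeness dropped to {completeness:.1%}")
    if duplicate_rate > THRESHOLDS["duplicate_rate"]:
        send_alert(f"duplicate rate rose to {duplicate_rate:.1%}")

monitor(pd.DataFrame({"id": [1, 2, 2], "value": [3.0, None, 4.0]}))
```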

Iterate on your data quality lifecycle and processes. You might need to add new data quality metrics, change your definition of data quality, modify your data quality pipeline, or run new data quality processes against the data.

4 Stages of a Data Quality Framework

1. Define your data workflow

The process of handling data in a systematic manner is called a data workflow. It entails gathering, organizing, and processing data in order for it to be used for a variety of reasons. You can probably tell by now that it’s a key part of any data quality framework.

The primary goal of developing a data workflow is to guarantee that data is appropriately saved and arranged so that anyone can access it at any time.

Before you create a data pipeline, understand the data workflow diagram first. A data workflow diagram depicts the procedures involved in data processing. It’s a useful tool for teams working on data-related initiatives.

2. Create a continuous improvement process for data quality rules

A continuous data quality improvement strategy helps teams deliver trustworthy data. A periodic review of the data quality framework and the metrics used to measure its progress is the main building block of a continuous improvement process. The second aspect is deriving action items for improvement and implementing them in a timely manner. Those two aspects ensure that the data quality framework we choose to implement is constantly evolving and improving to suit our needs.

3. Choose your infrastructure

Next, it’s time to consider the infrastructure and how it will help scale your data quality framework processes. You need to have the flexibility and opportunity to expand the capacity and performance of your data infrastructure.

This is where vertical scaling and horizontal scaling can help. Vertical scaling involves expanding your existing system’s resources, such as adding extra memory, CPU, or disk space.

Horizontal scaling is the process of adding extra nodes or instances to your system and distributing the workload across them. Each approach has its advantages and drawbacks, depending on your data quality metrics, task patterns, and cost limits.

4. Measure success: data quality metrics

The six data quality dimensions define what data quality means and serve as key metrics for understanding whether data quality processes work or not:

  • Accuracy – A measure of how well a piece of data resembles reality.
  • Completeness – Does the data fulfill your expectations of comprehensiveness?
  • Timeliness – Often known as currency, this dimension measures the age of data in a database and whether it is available when you need it.
  • Consistency – Quantifies how well individual data points from two or more data sources synchronize; for example, does data saved in one location match data stored in another? When two data points disagree, one of the records is incorrect.
  • Validity – Is the data in the correct format, type, or size? Does it conform to the rules and best practices?
  • Integrity – Can you merge different data sets to create a more complete picture? Are relationships appropriately declared and enforced?

These data quality dimensions assess all defined and acquired data sets, their linkages, and their ability to serve the organization appropriately. That’s why they serve as an excellent foundation for a data quality framework.

Data quality dimension | Description
Timeliness | Data's readiness within a certain time frame.
Completeness | The amount of usable or complete data, representative of a typical data sample.
Accuracy | The accuracy of the data values based on the agreed-upon source of truth.
Validity | How much of the data conforms to an acceptable format and to business rules.
Consistency | Compares data records from two different datasets.
Uniqueness | Tracks the volume of duplicate data in a dataset.
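
To make a few of these dimensions measurable, here is a small pandas sketch that computes completeness, uniqueness, validity, and timeliness for a hypothetical `orders` table; the freshness cutoff and email pattern are assumptions:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
    "updated_at": pd.to_datetime(["2023-10-01", "2023-10-02", "2023-10-02", "2023-09-01"]),
})

metrics = {
    "completeness": orders.notna().all(axis=1).mean(),
    "uniqueness": 1 - orders["order_id"].duplicated().mean(),
    "validity": orders["email"].dropna().str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$").mean(),
    "timeliness": (orders["updated_at"] >= pd.Timestamp("2023-09-15")).mean(),  # assumed freshness cutoff
}
print(metrics)
```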

Data Quality Framework Tools

Data Observability Tools

Monte Carlo


This is a code-free implementation and observability platform that is useful for assessing data quality. It employs machine learning to infer and understand what your data looks like, discover data issues proactively, analyze their impact, and deliver warnings via connections with standard operational systems. It also enables the investigation of root causes.

Databand


A pipeline metadata monitoring tool that also offers out-of-the-box data quality measures (e.g., data schemas, data distributions, completeness, and custom metrics) without requiring any code modifications.

Torch by Acceldata


Torch is one of Acceldata’s modules for data pipeline observability, which covers additional parts of the six pillars of data quality. Torch supports validation using a rule-based engine. Rules may be defined using your subject expertise as well as the huge library of rules offered by Torch. This is quite useful for assessing the quality of data.

The system has certain capabilities relating to data set history analysis, although they are relatively simple type 2 tests.

Data Orchestration Tools

Deequ


AWS Labs has released an open source tool to help you create and maintain your metadata validation. Deequ is an Apache Spark-based framework for building “unit tests for data” that analyze data quality in huge datasets. Deequ works with tabular data such as CSV files, database tables, log files, and flattened JSON files – everything that can fit into a Spark data frame.

The project is evolving toward the accuracy tests mentioned above, although its core competencies lie in the validation area.
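
For a flavor of how this looks in practice, here is a short sketch based on PyDeequ, the Python wrapper for Deequ; it assumes a Spark environment with the matching Deequ jar available, and the `customer_id` and `amount` columns are hypothetical:

```python
import pydeequ
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Sketch only: Deequ runs on Spark and needs its jar on the classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame([(1, 10.0), (2, 25.5), (3, -4.0)], ["customer_id", "amount"])

check = Check(spark, CheckLevel.Error, "basic quality checks")
result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.isComplete("customer_id")  # no nulls allowed
             .isUnique("customer_id")    # no duplicate keys
             .isNonNegative("amount")    # simple business rule
    )
    .run()
)
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```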

Great Expectations


This open-source tool is an interesting addition to a data quality framework. It's also focused on validation, is simple to integrate into your ETL code, and can test data through a SQL or file interface. Because it's organized as a logging system, you can generate documentation automatically from the tests you define. It also provides the ability to profile the data and develop expectations that are asserted during testing.
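
As a quick illustration (the Great Expectations API has changed considerably across versions, so treat this as a sketch of the older Pandas-backed interface, with a hypothetical `orders.csv` file):

```python
import great_expectations as ge

# Sketch of the older Pandas-backed dataset API; newer releases organize this
# around data contexts, validators, and expectation suites instead.
df = ge.read_csv("orders.csv")  # hypothetical file

df.expect_column_values_to_not_be_null("customer_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

results = df.validate()  # evaluates all expectations defined above
print(results)
```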

OwlDQ


OwlDQ (acquired by Collibra) is a data quality assessment tool based on dynamic analysis of data sets and automated adaptation of expectations in data quality processes. The rules let you define which feature to monitor and the likelihood of a pass/fail, while the OwlDQ engine handles the hard work of data characterization.

Investigate data quality tools in further depth: Top Data Quality Tools for the Scalable Data Era [2023].

Data Version Control as a key part of a data quality framework

Many data quality concerns stem from challenges connected to the specific ways in which data practitioners operate – and the lack of tooling available to them.

Consider a typical software development team. Team members can contribute to the same repository without stepping on each other's work. Users can run several versions of the program at the same time, and developers can easily reproduce a user's problem by checking out the exact version in which it was reported.

The purpose of data version control techniques is to bring the same capabilities to the data realm. Many data processing tasks that are part of a data quality framework – including data quality testing – become more efficient when data is managed in the same way that code is managed. One such open-source tool is lakeFS.

lakeFS

lakeFS offers zero-copy isolation along with pre-commit and pre-merge hooks to support automated processes. It provides a solution for evaluating data quality in accordance with the best practices outlined above.

A data quality framework is a must-have

Continuous integration and continuous deployment of data are automated procedures that need to be part of modern data quality frameworks. Managing data quality is simply easier that way. They give you the capability to discover data errors and prevent them from cascading into production. Ideally, you should run data quality tests at every critical step of the pipeline.

This is when version control systems like lakeFS might come in handy.

To facilitate automated data quality checks, lakeFS provides zero-copy isolation along with pre-commit and pre-merge hooks. The system also integrates with data quality testing solutions that provide the testing logic mentioned above, allowing you to test your data effortlessly at every critical step and deliver high-quality data.
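
As a rough sketch of the branch-then-merge pattern this enables (the repository and branch names are hypothetical, and the method names follow the high-level lakeFS Python SDK and may differ between SDK versions):

```python
# Hypothetical sketch: an isolated branch receives new data, quality checks run
# against it, and only a passing batch is committed and merged into main.
import lakefs

repo = lakefs.repository("example-repo")
branch = repo.branch("nightly-ingest").create(source_reference="main")

# 1. Write the new batch to the isolated branch (e.g., point Spark at the branch path).
# 2. Run the data quality checks described above against that branch only.
# 3. Commit and merge only if everything passed; otherwise discard the branch.
branch.commit(message="ingest nightly batch", metadata={"quality_checks": "passed"})
branch.merge_into(repo.branch("main"))
```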
