
Data Quality Monitoring: Key Metrics, Techniques And Benefits

Idan Novogroder

Idan has an extensive background in software and DevOps engineering....

January 12, 2024

If you manage data pipelines in your system, how do you know they’re working correctly and that the right data moves through them? If you aren’t monitoring aspects like data uptime or validating your data regularly, you’re bound to run into trouble later on.

Ensuring that high-quality data is sent into your training pipeline from upstream data systems is critical to the overall success of your project.

Neglecting continuous monitoring leads to poor data quality, which can degrade your application long before anyone triages and troubleshoots the problem. Poor application performance, regulatory and compliance concerns, customer attrition, and revenue loss are just a few examples of the issues you might have to handle.

If you’d like to avoid them, you should definitely invest in data quality monitoring. Here’s a primer on this incredibly important aspect of data quality management. 

What is data quality monitoring?

Data quality monitoring is the practice of measuring and reporting changes in data quality dimensions and concerns. It comprises examining, measuring, and managing data for correctness, consistency, and reliability, using a number of strategies to detect and fix data quality issues.

Monitoring helps to uncover data quality concerns early, before they snowball into a real problem that impacts application performance or business operations.

But maintaining high data quality is still a challenge. According to Forrester, 42% of data analysts spend more than 40% of their time vetting and confirming data. Gartner estimated that poor data quality costs an average of $15 million each year in lost revenue.

Data quality dashboards and alerts are common tactics for data quality monitoring. Dashboards highlight crucial indicators such as the amount, kind, and severity of data quality concerns, as well as the percentage of requirements that have been satisfied. They allow you to share status and trends with many stakeholders, as well as guide data quality testing efforts.

Alerts notify users of substantial or unexpected changes in metrics or indicators, such as a rise in errors, a drop in completeness rates, or departures from expected ranges. These warnings enable you to respond immediately in order to avoid problems that may affect downstream processes or outcomes.
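
To make the alerting idea concrete, here is a minimal sketch that checks a single hypothetical metric (completeness of one column in a pandas DataFrame) against a threshold and logs a warning when it falls out of range. The DataFrame, column name, and threshold are all assumptions for the example, not part of any particular tool.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

COMPLETENESS_THRESHOLD = 0.98  # hypothetical lower bound for the alert


def check_completeness(df: pd.DataFrame, column: str) -> float:
    """Return the share of non-null values in `column`, warning if it falls below the threshold."""
    completeness = df[column].notna().mean()
    if completeness < COMPLETENESS_THRESHOLD:
        logging.warning(
            "Completeness of %s dropped to %.1f%% (threshold %.1f%%)",
            column,
            completeness * 100,
            COMPLETENESS_THRESHOLD * 100,
        )
    return completeness


# Made-up sample: one of five customer_id values is missing, so the alert fires.
orders = pd.DataFrame({"customer_id": [1, 2, None, 4, 5]})
check_completeness(orders, "customer_id")
```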

6 key data quality dimensions

  • Accuracy – Accuracy of the data values based on the agreed-upon source of truth.
  • Completeness – The amount of usable or complete data, representative of a typical data sample.
  • Consistency – Whether data records from two different datasets agree with each other.
  • Validity – How much of the data conforms to the accepted format required by business rules.
  • Uniqueness – Tracks the volume of duplicate data in a dataset.
  • Integrity – Prevents broken links between datasets.

To understand the quality of your data, you need a solid reference point. This is where data quality dimensions come into play (a short sketch of how a few of them can be computed follows the list):

  1. Accuracy – the degree to which values are correct as compared to their real representation.
  2. Completeness – assesses the extent to which all necessary data is present and accessible.
  3. Consistency – the degree to which data agrees across diverse sources or systems.
  4. Validity – relates to a dataset’s conformance to prescribed formats, rules, or standards for each attribute.
  5. Uniqueness – assures that there are no duplicate records in a dataset.
  6. Integrity – helps to preserve referential linkages between datasets by preventing broken links.
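
As a rough illustration of how a few of these dimensions translate into code, the sketch below computes completeness, uniqueness, and validity scores for a small, hypothetical pandas DataFrame; the column names and the deliberately naive email rule are assumptions made for the example.

```python
import pandas as pd

# Hypothetical customer records used only to illustrate the dimensions.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "not-an-email"],
})

# Completeness: share of non-null values in each column.
completeness = customers.notna().mean()

# Uniqueness: share of rows whose key is not a duplicate of an earlier row.
uniqueness = 1 - customers.duplicated(subset=["customer_id"]).mean()

# Validity: share of emails matching a deliberately naive pattern.
validity = customers["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

print(completeness, uniqueness, validity, sep="\n")
```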

Why do you need to monitor data quality?


Where do data quality concerns arise in the data lifecycle? What challenges can you expect at each stage? 

There are three major areas where problems might affect data quality:

Data ingestion

Imagine that dreaded phone call from a data analyst notifying you that the report dashboard for some customers has been faulty for a few days and has resulted in incorrect decisions.

You start to panic. The tests you developed should have detected these problems in the pipeline, right? If you’re lucky, it will take you anywhere from a few hours to several days to properly investigate the root cause of the problem. Finally, you learn the reason: mobile app developers changed the structure of the database that collects data from the app.

But the faulty data has found its way through to your reporting layer, since you weren’t notified and hadn’t built validation tests to handle such edge cases.

While some source-related issues are impossible to anticipate, several are common (a simple schema validation sketch follows the list):

  • Duplicate data
  • Outdated data
  • Data distribution drifts
  • Missing data
  • Incorrect data type and format
  • Incorrect data syntax and semantics
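
One lightweight way to catch the kind of upstream schema change described above is to validate incoming data against the structure you expect before it moves further down the pipeline. The sketch below shows the idea with pandas; the expected schema and column names are hypothetical, and production ingestion would typically use a dedicated validation framework.

```python
import pandas as pd

# Hypothetical schema the mobile-app events are expected to follow.
EXPECTED_SCHEMA = {
    "event_id": "int64",
    "user_id": "int64",
    "event_time": "datetime64[ns]",
    "event_type": "object",
}


def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return human-readable schema problems (empty list means the batch looks fine)."""
    problems = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(f"{column}: expected {expected_dtype}, got {df[column].dtype}")
    return problems


# A batch where the app team renamed event_time and started sending user_id as text.
events = pd.DataFrame({"event_id": [1], "user_id": ["u-42"], "event_type": ["click"]})
print(validate_schema(events))  # ['user_id: expected int64, got object', 'missing column: event_time']
```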

Data systems or pipelines

The Data Lifecycle

Whether it’s an ETL, ELT, or rETL-based pipeline, keeping it clean will be challenging. You might come across faulty or buggy data transformations that cause data quality issues.

For example, you may write your transformation steps incorrectly, leading your pipeline stages to execute in the wrong sequence. Data pipeline modifications can create further problems like data downtime, data corruption, and even difficulties for downstream consumers.

Downstream systems

Data quality concerns can also arise when data flows to downstream consumers such as a machine learning training pipeline or analytics software. A code change in your ML pipeline may prevent an API from delivering data to a live or offline model. Or a BI analysis tool may stop receiving updates from your data source after a software upgrade or dependency change, leaving it serving stale results.

Continuously monitoring and evaluating your data quality ensures that errors are detected and gives you a chance to investigate them before they turn into silent or highly visible failures downstream.

Luckily, you can use tools to monitor the data pipeline and to detect and rectify faults that arise during pipeline execution, such as failures, delays, or bottlenecks.

Key metrics to monitor

  • Error ratio – The proportion of records in a dataset that contain errors.
  • Empty values – The rate of records with missing or null values in required fields.
  • Duplicate record rate – The percentage of duplicate entries in a particular dataset vs. all records.
  • Data transformation errors – Errors surfaced by tracking the frequency, duration, volume, and latency of the pipeline runs that deliver the transformed data.
  • Address validity percentage – The proportion of valid addresses in a dataset relative to the total number of entries with an address field.
  • Volume of dark data – The amount of data that is collected but never used, often a result of organizational silos.
  • Data time-to-value – The pace at which value is extracted from data after it has been gathered.

Aside from data quality dimensions, you need to keep a close eye on other metrics that allow early detection and remediation of issues before they impact you.

Error ratio

The error ratio quantifies the proportion of records in a dataset that contain errors. A high error ratio suggests poor data quality, which may result in inaccurate insights or poor decision-making. To calculate the error ratio, divide the number of records with errors by the total number of records.
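
As a tiny illustration of that calculation, assuming records have already been flagged by some upstream validation step (the `has_error` column is hypothetical):

```python
import pandas as pd

records = pd.DataFrame({
    "value": [10, -1, 7, None],
    "has_error": [False, True, False, True],  # flags from a hypothetical validation step
})

error_ratio = records["has_error"].sum() / len(records)
print(f"error ratio: {error_ratio:.2%}")  # error ratio: 50.00%
```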

Empty values

Empty or null values point to information that is missing or was recorded in the wrong field. A high rate of empty values in required fields signals incomplete data that can distort downstream analysis. Track the number of empty fields relative to the total and investigate unexpected spikes.

Duplicate record rate

Duplicate records can arise when many entries for a single entity are produced as a result of system problems or human error. Duplicates consume storage space and potentially skew analytical results, affecting the decision-making process. The duplicate record rate is the percentage of duplicate entries in a particular dataset vs. all records.
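
The same calculation in pandas, assuming duplicates are defined by a hypothetical `customer_id` key:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 2, 3, 3, 3]})

# Every row beyond the first occurrence of a key counts as a duplicate.
duplicate_rate = customers.duplicated(subset=["customer_id"]).mean()
print(f"duplicate record rate: {duplicate_rate:.2%}")  # duplicate record rate: 50.00%
```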

Data transformation errors

To catch transformation errors, keep track of the data pipeline’s performance. This entails tracking and evaluating the frequency, duration, volume, and latency with which the data pipeline runs and delivers the transformed data. To monitor the data pipeline, you can use logging, auditing, alerting, or dashboarding tools.

Address validity percentage

For businesses that rely on location-based services, such as delivery or customer assistance, having an exact address is critical. The address validity percentage compares the proportion of valid addresses in a dataset to the total number of entries having an address field. It’s critical to cleanse and check your address data on a regular basis to ensure excellent data quality.
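
Thorough address validation usually relies on reference data or a geocoding service, but a deliberately naive format check is enough to show how the percentage is derived. The pattern and sample addresses below are assumptions for the example, not a production rule.

```python
import pandas as pd

addresses = pd.Series(["221 Baker Street", "N/A", "1600 Amphitheatre Parkway", ""])

# Naive rule: an address must start with a house number followed by a street name.
valid = addresses.str.match(r"\d+\s+\w+", na=False)
print(f"address validity: {valid.mean():.2%}")  # address validity: 50.00%
```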

Volume of dark data

Dark data often comes about as a result of organizational silos. One team generates data that might be beneficial to another, but the other team is unaware of it. Breaking down such silos allows data to be made available to the team that requires it.

Data time-to-value

Data time-to-value defines the pace at which value is extracted from data after it has been gathered. A quicker time-to-value suggests that your company is effective at processing and evaluating data for decision-making purposes. Monitoring this indicator helps detect bottlenecks in the data pipeline and ensures that business users have access to timely information.

9 data quality monitoring techniques

1. Data auditing

This is the practice of verifying the correctness and completeness of data by comparing it to predetermined criteria or standards. The goal here is to identify and track data quality concerns such as missing, wrong, or inconsistent data. 

Data auditing can be done manually by analyzing records and looking for problems, or automatically by scanning and flagging data inconsistencies.

To conduct a successful data audit, you must first create a set of data quality norms and criteria that your data must follow. Then, using data auditing tools, you can compare your data to these rules and standards, discovering any anomalies or errors. 

Finally, examine the audit results and take remedial measures to rectify any discovered data quality issues.
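
The sketch below illustrates the comparison step: each audit criterion names a column and a predicate it must satisfy, and the audit reports how many rows violate it. The rules, columns, and data are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [25.0, -10.0, 40.0, None],
    "status": ["paid", "paid", "unknown", "shipped"],
})

# Audit criteria: column -> predicate that returns True for acceptable values.
AUDIT_RULES = {
    "amount": lambda s: s.notna() & (s >= 0),
    "status": lambda s: s.isin(["paid", "shipped", "cancelled"]),
}

for column, rule in AUDIT_RULES.items():
    violations = int((~rule(orders[column])).sum())
    print(f"{column}: {violations} violating row(s)")
# amount: 2 violating row(s)
# status: 1 violating row(s)
```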

2. Data cleansing

Data cleansing is the act of discovering and resolving flaws, inconsistencies, and inaccuracies in your data. To guarantee that your data is correct, complete, and trustworthy, data cleansing procedures employ a variety of approaches such as data validation, data transformation, and data deduplication.

Typical steps in this process include:

  • Identifying data quality issues 
  • Determining the root causes of these issues 
  • Selecting appropriate cleansing techniques 
  • Applying the cleansing techniques to your data
  • Validating the results

This is how you get your hands on high-quality data that supports efficient decision-making.
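
A rough pandas illustration of a few of those steps (deduplication, standardizing values, and flagging invalid dates); the column names, mappings, and data are assumptions for the example, and real cleansing logic would be driven by your own rules.

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "country": ["US", "US", "usa", None],
    "signup_date": ["2024-01-03", "2024-01-03", "2024-02-30", "2024-02-10"],
})

cleaned = (
    raw.drop_duplicates(subset=["customer_id"])  # drop duplicate records (keep first per key)
    .assign(
        # Make values consistent and fill gaps with an explicit placeholder.
        country=lambda df: df["country"].str.upper().replace({"USA": "US"}).fillna("UNKNOWN"),
        # Invalid dates (like 2024-02-30) become NaT so they can be reviewed.
        signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"),
    )
)
print(cleaned)
```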

3. Data profiling

Data profiling is an umbrella term for tasks such as studying, analyzing, and comprehending the content, structure, and connections within your data. This method includes inspecting data at the column and row levels for patterns, abnormalities, and inconsistencies. Data profiling provides essential information such as data types, lengths, trends, and unique values to help you get insights into the quality of your data.

Data profiling is classified into three types: 

  • Column profiling, which evaluates individual attributes in a dataset 
  • Dependency profiling, which discovers links between attributes 
  • Redundancy profiling, which detects duplicate data 

You may acquire a full overview of your data and discover any quality concerns that need to be addressed by utilizing data profiling tools.
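
A very small column-profiling sketch with pandas; dedicated profiling tools go much further, but the basic idea is the same. The dataset is hypothetical.

```python
import pandas as pd

products = pd.DataFrame({
    "sku": ["A1", "A2", "A2", "B9"],
    "price": [9.99, 14.5, 14.5, None],
})

# One row of profile information per column: type, completeness, and cardinality.
profile = pd.DataFrame({
    "dtype": products.dtypes.astype(str),
    "non_null": products.notna().sum(),
    "unique_values": products.nunique(),
})
print(profile)
print(products.describe(include="all"))  # richer per-column summary statistics
```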

4. Data quality rules

Data quality rules are set criteria that your data must follow to be accurate, complete, consistent, and reliable. Checking for duplicate entries, verifying data against reference data, and ensuring that data adheres to certain formats or patterns are all examples of data quality rules.

To put in place effective data quality rules, first establish them based on your organization’s data quality requirements and standards. Then, using data quality tools or custom scripts, you can apply these criteria to your data, identifying any anomalies or concerns. 

Finally, you should check and update your data quality standards on a regular basis to ensure they stay relevant and effective in sustaining data quality. Monitoring data quality at this point is key.
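
One common pattern is to express the rules declaratively so they can be reviewed and updated without touching pipeline code, then apply them with a small runner. The rule format, column names, and checks below are assumptions for the example rather than any standard.

```python
import pandas as pd

# Declarative rules: each entry names a column and the check it must pass.
RULES = [
    {"column": "order_id", "check": "unique"},
    {"column": "email", "check": "matches", "pattern": r"[^@\s]+@[^@\s]+\.[^@\s]+$"},
    {"column": "quantity", "check": "non_negative"},
]


def apply_rules(df: pd.DataFrame) -> dict[str, bool]:
    """Evaluate each declarative rule against `df` and report pass/fail per rule."""
    results = {}
    for rule in RULES:
        col = df[rule["column"]]
        if rule["check"] == "unique":
            passed = not col.duplicated().any()
        elif rule["check"] == "matches":
            passed = col.str.match(rule["pattern"], na=False).all()
        else:  # non_negative
            passed = (col.dropna() >= 0).all()
        results[f"{rule['column']}:{rule['check']}"] = bool(passed)
    return results


orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "email": ["a@example.com", "b@example.com", "oops"],
    "quantity": [3, 0, -1],
})
print(apply_rules(orders))
# {'order_id:unique': False, 'email:matches': False, 'quantity:non_negative': False}
```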

5. Ingesting the data

Data ingestion is the process by which data enters a system. Data for ingestion might originate from a variety of internal and external sources. It can be ingested in real-time or in batches by the system. Existing databases, data lakes, real-time systems and platforms (such as CRM and ERP solutions), software and apps, and IoT devices all contribute to it. 

A proper data ingestion process does more than simply import raw data. Instead, it converts data from many sources in diverse formats into a single, standardized format. Data ingestion can also convert unformatted data into a pre-existing data format.

Make sure to implement data ingestion monitoring to detect low-quality or poorly structured data, clean and format the data, and prepare the data for use by others. 

6. Data performance testing

Data performance testing is the process of analyzing the efficiency, efficacy, and scalability of your data processing systems and infrastructure. This method helps data practitioners ensure that their data processing systems can manage growing data quantities, complexity, and velocity without sacrificing data quality.

First, create performance standards and goals for your data processing systems, and then use data performance testing tools to simulate various data processing scenarios (like huge data volumes or complicated data transformations). Then you’re ready to compare your systems’ performance against set benchmarks and objectives. 

Finally, review the findings of your data performance testing and make any required changes to your data processing systems and infrastructure.
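
A bare-bones sketch of benchmarking a single transformation against a latency target; the transformation, data volume, and target are assumptions for the example, and full pipelines are better served by dedicated load-testing tooling.

```python
import time

import numpy as np
import pandas as pd

LATENCY_TARGET_SECONDS = 2.0  # hypothetical performance objective for this step

# Simulate a reasonably large input to put some load on the transformation.
events = pd.DataFrame({
    "user_id": np.random.randint(0, 10_000, size=1_000_000),
    "amount": np.random.random(size=1_000_000),
})

start = time.perf_counter()
totals_per_user = events.groupby("user_id")["amount"].sum()  # the transformation under test
elapsed = time.perf_counter() - start

verdict = "PASS" if elapsed <= LATENCY_TARGET_SECONDS else "FAIL"
print(f"transformation ran in {elapsed:.2f}s against a {LATENCY_TARGET_SECONDS}s target: {verdict}")
```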

7. Metadata management

Metadata management is all about organizing, preserving, and using metadata to improve the quality, consistency, and usefulness of your data. Metadata is data about data, such as data definitions, data lineage, and data quality norms, that aids businesses in better understanding and managing their data. 

By applying strong metadata management standards, you can increase the overall quality of your data and ensure that it is easily accessible, understood, and usable across your company.

Start by creating a metadata repository that stores and organizes your metadata in a consistent and systematic manner. You can then use metadata management solutions to capture, maintain, and update that metadata.
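
At its simplest, a metadata record ties a dataset to its definition, owner, lineage, and quality rules. The structure below is purely illustrative, not a standard metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """A minimal, illustrative metadata record for one dataset."""
    name: str
    description: str
    owner: str
    upstream_sources: list[str] = field(default_factory=list)
    quality_rules: list[str] = field(default_factory=list)


orders_meta = DatasetMetadata(
    name="orders",
    description="One row per confirmed customer order.",
    owner="data-platform@example.com",
    upstream_sources=["crm.contacts", "payments.transactions"],
    quality_rules=["order_id is unique", "amount >= 0"],
)
print(orders_meta)
```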

8. Real-time data monitoring

This point focuses on continually tracking and evaluating data as it is created, processed, and stored within your organization. Rather than waiting for periodic data audits or reviews, you monitor data quality continuously and discover and fix issues as they arise.

Real-time data quality monitoring helps you maintain high-quality data and ensure that decision-making processes are based on correct and up-to-date information.

9. Tracking data quality metrics

The data quality metrics (and dimensions) mentioned above are quantitative indicators that help companies assess the accuracy of their data. You can use them to:

  • Track and monitor the quality of your data over time 
  • Discover trends and patterns 
  • Assess the success of your data quality monitoring approaches 

To track data quality metrics, first determine which ones are most relevant to your organization’s data quality requirements and standards. Then, using data quality tools or custom scripts, generate these metrics for your data, offering a quantifiable assessment of its quality. 

Finally, examine and analyze your data quality metrics on a regular basis to find areas for improvement and confirm the effectiveness of your data quality monitoring tools.
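
Tracking metrics over time requires storing each measurement with a timestamp so trends can be compared between runs. The sketch below appends one snapshot per run to a CSV log; the file name, metrics, and column choices are assumptions for the example.

```python
import os
from datetime import datetime, timezone

import pandas as pd


def record_metrics(df: pd.DataFrame, log_path: str = "quality_metrics.csv") -> None:
    """Append a timestamped quality snapshot for `df` to a CSV log for trend analysis."""
    snapshot = {
        "measured_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(df),
        "completeness": df.notna().mean().mean(),  # average share of non-null cells
        "duplicate_rate": df.duplicated().mean(),  # share of fully duplicated rows
    }
    pd.DataFrame([snapshot]).to_csv(
        log_path,
        mode="a",
        header=not os.path.exists(log_path),  # write the header only on the first run
        index=False,
    )


record_metrics(pd.DataFrame({"customer_id": [1, 2, 2], "email": ["a@x.com", None, "b@x.com"]}))
```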

Data quality monitoring challenges

Data consistency evaluation

Many companies keep the same data in multiple places. If the data matches across these locations, it is said to be “consistent.” To address inconsistency issues, examine your datasets to confirm that they are the same in every instance.

Data accuracy measurement

Data accuracy is another challenge. To address it, you need to take care of multiple issues, such as:

  • Incorrect data input
  • Problems with data integration and transformation
  • Obstacles to data governance and management
  • Technological constraints
  • The data architecture’s complexity
  • Workflow complexity in a data pipeline
  • Synchronization of data sources

Data quality monitoring benefits

Data quality monitoring is like having a watchdog constantly looking for flaws that might impact decision-making processes. 

Monitoring data quality comes with many benefits:

Data implementation is simpler

Monitoring introduces a higher level of accountability. It delivers early warnings, makes recommendations for changes, and assesses application performance. This makes implementing data-driven applications more straightforward and reliable.

Better organizational decision-making

Data quality monitoring prevents errors from seeping into the decision-making process through inaccurate insights. This also helps companies avoid resource waste or compliance difficulties.

Enhanced customer relationships

Reliable data quality monitoring means that teams get more accurate data, which in turn helps to build better relationships with customers (for example, during targeted marketing campaigns or direct contact).

Better revenue and profitability

Data quality monitoring saves time and money by reducing the need for resource-intensive pre-processing.

How to implement data quality monitoring

1. Collect the data

To check data quality, the first step is to collect the data, usually by importing data to a target data repository from several sources. 

Data can be consumed in batches or in real time and can originate from a variety of sources, including:

  • Platforms for customer relationship management (CRM)
  • Platforms for enterprise resource planning (ERP)
  • Platforms for collecting payments
  • Other internal and external databases, as well as data lakes

The data ingestion process should also transform data from several sources into a unified format. The data may then be monitored, cleansed, and made available for use within the enterprise.

2. Detect data quality concerns

The second phase in the data quality monitoring process is to detect any concerns with data quality. A data quality monitoring solution specifically checks for all the data quality dimensions and key metrics mentioned above. 

You can discover these data quality issues in a variety of ways. Traditional data quality monitoring systems use a set of manually generated criteria to identify issues, which is time-consuming and resource-intensive.

Newer systems use machine learning to automate the rule-creation and error-identification processes, bringing more consistent and accurate outputs while increasing productivity and cutting costs. 

3. Clean your data

When data quality issues are discovered, it’s time to address them:

  • Replace incorrect data
  • Fill in missing data
  • Make data consistent
  • Save older data in a separate data archive 
  • Merge duplicate data 
  • Reformat invalid data 

When the data cleaning procedure is finished, the data quality has been verified and the data is ready for use in your organization.

Choosing the right data quality monitoring tool

A good data quality monitoring system should have the following features: 

  • Self-service – simple to use for everyone on the team, including the visualization component and setting up dashboards. 
  • Scalable – to accommodate increasing data volume and cardinality. 
  • Collaborative – data, insights, dashboards, and reports should be easily shared with others via Slack channels or e-mail.
  • Holistic – the solution should give a comprehensive picture of your data, including data lineage tracking from the data consumer to the data transformation process to the data producer. 
  • Automatable – it should allow you to programmatically automate the monitoring process and workflow. This involves creating automation scripts, YAML code, queries, and real-time analysis to turn insights into action without the need for manual involvement.
  • Privacy-preserving – the data quality monitoring tool needs to provide safeguards against external threats in order to keep your data secure.

Check out this overview of the best data quality tools on the market.

Conclusion

Teams must monitor data to guarantee accurate and trustworthy data, make informed decisions, improve operations, and manage risks. That’s why it’s in your interest to enhance data quality, spot issues quickly, and maintain a robust data infrastructure by applying monitoring best practices and deploying automated data quality monitoring solutions.
