What is Data Quality: Definition, Framework, and Best Practices
Data quality standards are crucial to every business because they underpin the accuracy of the data-driven decisions teams make every day. When data quality is in question, teams cannot rely on data to guide their decision-making processes.
Poor data quality costs organizations an average of $12.9 million each year, according to Gartner. Luckily, we have data quality methods that limit the harmful impact of bad data. This is key because data quality opens the door to greater trust in data from data consumers. They can have confidence in the data and utilize it to improve decision-making, resulting in the development of new business strategies or the improvement of existing ones.
When a standard isn’t met for some reason, data quality solutions add value by assisting organizations in diagnosing underlying data issues. A root cause analysis enables teams to swiftly and efficiently address data quality issues.
How to maintain data quality is more than just a concern for day-to-day company operations – as companies integrate artificial intelligence (AI) and automation technologies into their workflows, high-quality data will be critical for the successful implementation of these tools. The classic adage “garbage in, garbage out” applies to machine learning and generative AI solutions as well.
What is data quality?
Data quality is a metric that assesses the state of data based on variables such as accuracy, completeness, consistency, reliability, and timeliness. Measuring data quality levels helps you identify data issues and determine whether your data is fit to serve its intended purpose.
As data processing has become more tightly integrated with business operations and businesses increasingly use data analytics to drive business decisions, the emphasis on data quality has increased as well.
Data quality management is an essential component of the overall data lifecycle management process (or master data management process), and efforts to improve data quality are often connected to data governance initiatives that guarantee data is formatted and used uniformly throughout an organization.
Data quality vs. data governance vs. data integrity
Data quality, data integrity, and data governance are all interconnected.
Data quality is a broad range of criteria used by businesses to assess the accuracy, completeness, validity, consistency, uniqueness, timeliness, and suitability for the purpose of their data. Poor data quality impacts the trust consumers have for data and so influences their decision-making process.
Data integrity is concerned with only a subset of these characteristics, namely accuracy, consistency, and completeness. It also looks at this from the perspective of data security, putting in place protections to avoid data corruption by malevolent actors. Data integrity also refers to the protection and safety of data in terms of regulatory compliance, such as GDPR compliance.
Data governance is the process of managing data availability, accessibility, integrity, and security in corporate systems using internal data standards and policies that also control data usage. Data governance guarantees that data is consistent and trustworthy and that it’s not misused in any way. Compliance with key regulations around customer data is one of the outcomes of data governance policies.
Key data quality dimensions
| Data quality dimension | Description |
| --- | --- |
| Timeliness | Data's readiness within a certain time frame. |
| Completeness | The amount of usable or complete data, representative of a typical data sample. |
| Accuracy | Accuracy of the data values based on the agreed-upon source of truth. |
| Validity | How well data conforms to the acceptable format for any business rules. |
| Consistency | Agreement between data records from two different datasets. |
| Uniqueness | The volume of duplicate data in a dataset. |
Quality is measured along a number of set data quality dimensions, all of which should be addressed by a data quality improvement process. They may vary depending on the data source:
Timeliness
This dimension refers to the data's readiness within a certain time frame. A customer in an e-commerce store may expect to receive an order number immediately after making a purchase, so this data must be created in real time.
Completeness
This dimension shows the amount of usable or complete data. If the data is not representative of a typical data sample, a significant percentage of missing values may result in a skewed or misleading analysis.
Accuracy
Accuracy refers to the correctness of the data values based on the agreed-upon "source of truth." Since numerous sources may report on the same measure, it's critical that the company identify a primary data source for data accuracy. You can also use additional data sources to corroborate the primary one's accuracy. To boost confidence in data accuracy, technologies can determine if each data source is moving in the same direction.
Validity
This dimension assesses how well data conforms to the acceptable format for any business rules. Metadata like valid data types, ranges, patterns, and so on are commonly included in formatting. Check out this article that dives into these two dimensions of data quality: Ways to Test Data Validity and Accuracy
Consistency
This dimension compares data records from two different datasets to check for inconsistent data. As previously stated, many sources might be identified in order to report on a single statistic. Using many sources to look for consistent data trends and behavior allows companies to have confidence in any actionable insights derived from their investigations. The same reasoning can also be applied to data relationships. The number of employees in a department, for example, should not exceed the overall number of employees in a corporation.
Uniqueness
The amount of duplicate data in a dataset is accounted for by uniqueness. Consider a machine learning model trained on millions of images: if the dataset includes duplicates, they will hurt both the efficiency of building the model and its accuracy.

These metrics assist teams in conducting data quality reviews across their businesses to determine how relevant and usable the data is for a specific purpose.
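The dimensions above translate directly into programmatic checks. Here is a minimal sketch in Python; the record fields, sample data, and thresholds are illustrative assumptions, not part of any particular tool:

```python
# Illustrative checks for three dimensions: completeness, uniqueness,
# and consistency. Field names and sample records are hypothetical.

def check_completeness(records, field):
    """Fraction of records with a non-null value for `field`."""
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)

def check_uniqueness(records, field):
    """Fraction of values for `field` that are unique."""
    values = [r[field] for r in records if field in r]
    return len(set(values)) / len(values)

def check_consistency(dept_headcounts, total_employees):
    """A department's headcount must not exceed the company total."""
    return all(count <= total_employees for count in dept_headcounts.values())

orders = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 2, "email": "c@example.com"},  # duplicate order_id
]

completeness = check_completeness(orders, "email")   # 2 of 3 filled
uniqueness = check_uniqueness(orders, "order_id")    # 2 distinct of 3
consistent = check_consistency({"sales": 10, "eng": 25}, total_employees=40)
```

Scores below an agreed threshold (say, completeness under 0.95) would flag the dataset for review rather than letting it flow downstream.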
Why is data quality management important at every stage of the data lifecycle?
Data quality management is there to build trust and confidence around data when it’s served to consumers via data analytics projects such as business intelligence dashboards, machine learning, and generative AI-based applications in health care and automotive, and many more.
Without a solid data management strategy and tooling, a business might suffer severe consequences from consumers making decisions based on poor quality data, be it managers running a business, drivers using self-driving functions, or doctors using machine learning to help them diagnose or treat patients.
A simple data quality assessment framework alone won't be enough. To properly measure data quality and keep it in check, you'll likely need several tools and processes working in conjunction to get the job done.
Where do you get started with data quality assessment, monitoring, and testing?
Here's what ensuring data quality looks like at every stage of the data lifecycle:
Data collection
Also called data ingestion or data entry, this initial stage of the data lifecycle is about collecting customer data from multiple internal and external sources.
This is our most vulnerable spot from a quality perspective since, in most cases, we don’t own the source of the data. If something went wrong in the collection process before the data entered the data lake, we wouldn’t know. That is, unless we validate the data quality.
For example, data from operational systems may be wrong due to human error or late due to a malfunction in the system that stores it or saves it to the data lake. So, it’s critical to validate the quality of data and ensure that issues like inaccurate or inconsistent data don’t cascade into our ETLs.
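One way to catch such issues is to validate every incoming record before it lands in the lake. The sketch below checks a hypothetical schema and freshness window; the required fields and 24-hour cutoff are illustrative assumptions:

```python
# Sketch: validate raw records before they enter the data lake.
# The schema and freshness window are hypothetical assumptions.
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = {"event_id": str, "amount": float, "created_at": datetime}
MAX_AGE = timedelta(hours=24)  # reject events older than a day

def validate_record(record, now):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}")
    created = record.get("created_at")
    if isinstance(created, datetime) and now - created > MAX_AGE:
        problems.append("stale record")
    return problems

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
good = {"event_id": "e1", "amount": 9.99, "created_at": now - timedelta(hours=1)}
late = {"event_id": "e2", "amount": 1.0, "created_at": now - timedelta(days=3)}

good_problems = validate_record(good, now)  # empty list: record is valid
late_problems = validate_record(late, now)  # flagged as stale
```

Records that fail validation can be quarantined for inspection instead of cascading into downstream ETLs.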
Data storage
Next comes data storage. At this point, many organizations fall into the trap of dispersing data across multiple teams and tools – a phenomenon called data silos.
When data is managed in silos and storage is distributed, consistency issues become the norm.
Once we move data to a single source of truth, we must validate the consistency of the data from the different sources and make sure to fix any consistency issues before the next stages of the lifecycle.
Data preparation
The next step is to prepare your data for use by curating, deduplicating, and doing other preprocessing required for the use of the data, depending on the application.
Since these preprocessing steps are meant to increase data quality and create datasets ready for analysis, we expect specific outcomes in terms of both data and metadata. We must validate that, once preprocessing is done, the data meets our expectations.
A best practice would be to validate each step in the data preprocessing – in some organizations, we might be talking about tens of such steps.
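The step-by-step validation described above can be sketched as a pipeline where every step is paired with a check that runs on its output. The steps, field names, and sample data here are illustrative:

```python
# Sketch: run each preprocessing step, then validate its output before
# the next step runs. Step names and checks are illustrative examples.

def deduplicate(rows):
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out

def normalize_names(rows):
    return [{**row, "name": row["name"].strip().lower()} for row in rows]

def assert_unique_ids(rows):
    ids = [row["id"] for row in rows]
    assert len(ids) == len(set(ids)), "duplicate ids after dedup step"

def assert_names_normalized(rows):
    assert all(row["name"] == row["name"].strip().lower() for row in rows)

# Each (step, check) pair is validated before moving on.
pipeline = [(deduplicate, assert_unique_ids),
            (normalize_names, assert_names_normalized)]

data = [{"id": 1, "name": " Alice "}, {"id": 1, "name": "ALICE"},
        {"id": 2, "name": "Bob"}]
for step, check in pipeline:
    data = step(data)
    check(data)  # fail fast if the step broke a quality expectation
```

Failing fast at the step that broke an expectation makes root cause analysis trivial, even in pipelines with tens of steps.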
Data analysis
Machine learning, statistical modeling, artificial intelligence, data mining, and algorithms are some of the tools available at this stage. This is where we get the real value from the data that influences decision making and user satisfaction, improves our business, and provides value to our customers, no matter what vertical we are in or what type of data we analyze.
In this stage, we create and run data pipelines, and when we develop those pipelines for machine learning or business intelligence needs, we must be able to test the quality of those models during the development or improvement phases.
Deployment
The deployment stage is where data validation, sharing, and utilization take place. If you leave data validation – the process of verifying the accuracy, structure, and integrity of your data – for this last stage, prepare for trouble.
Even if you have performed data quality validation at all previous stages of the lifecycle, you still need those tests here – not just before deploying into production but also after it, as a form of monitoring to ensure data stays high quality while your analysis models are in production. Here we test for model drift, dashboard health, and so on.
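A basic form of the post-deployment monitoring mentioned above is a drift check that compares a model input feature in production against its training baseline. The feature, numbers, and 20% threshold below are illustrative assumptions:

```python
# Sketch of a post-deployment drift check: compare the mean of a model
# input feature in production against its training-time baseline.
import statistics

def drifted(baseline, current, max_relative_shift=0.2):
    """Flag drift when the mean shifts by more than 20% of the baseline mean."""
    base_mean = statistics.mean(baseline)
    curr_mean = statistics.mean(current)
    return abs(curr_mean - base_mean) > max_relative_shift * abs(base_mean)

training_ages = [30, 35, 40, 45, 50]     # baseline mean: 40
production_ages = [55, 60, 58, 62, 65]   # production mean: 60 -> 50% shift

alert = drifted(training_ages, production_ages)          # drift detected
steady = drifted(training_ages, [38, 41, 39, 42, 40])    # within tolerance
```

In practice you would run such checks on a schedule and wire the alert into your monitoring system; production tools use more robust statistics (distribution distances rather than means), but the flow is the same.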
Further data quality issues and how to avoid them
Data engineers and data scientists are working with more data than ever before and are struggling to keep data pipelines in good shape due to outdated work methods.
Here are a few difficulties that most data engineers and data scientists are dealing with today:
Validating data before it enters the lake
Validating data quality and consistency before it enters a data lake is difficult. That is because, unlike software developers, data practitioners do not have data staging or QA environments. Everything, including potential problems, gets washed into the lake – and engineers need to find a way to deal with it.
Testing and troubleshooting new data sets
Whether in pre-production, deployment, or final quality assurance before reaching end consumers, testing new data sets is tough because data doesn't have its own dedicated testing environment – everything eventually ends up in one data lake.
Many other problems come to light during troubleshooting because data engineers lack an effective method for detecting, analyzing, and debugging production data quality issues.
Lack of automation
Data engineering entails a lot of manual labor and heavy lifting in distributed computation systems. Unlike software developers, data engineers don’t have access to a large range of automation tools that allow CI/CD for the data and, by doing so, remove low-level manual work and eliminate errors.
Not to mention the hefty cost of making a mistake, which frequently prevents organizations from advancing in their data-driven journey as fast as they would like.
Improving data quality: 3 best practices
| Best practice | Quick overview |
| --- | --- |
| Data validation | Verifying the data itself to validate its distribution, variance, or features, as well as validating the schema or format. |
| Metadata validation | Checking the data's metadata (schema, data types, file formats) against the organization's best practices and standards. |
| Real-time data validation | Ingesting data into a distinct branch that data consumers can't see, testing the data on the branch, and merging it only if it passes the tests. |
1. Data validation
It’s inevitable that, as data engineers and data scientists, we make assumptions about the data that we’ll be using. It doesn’t matter if it’s the most current data from an existing data set or a completely new data set.
We may have made assumptions about the completeness, timeliness, distribution, variance, or coverage of a problem space we are looking to build a model for. Whatever our assumptions are, if they don’t hold, we will face poor results at the other end of our calculation.
To ensure that the data is reliable, we should put it to the test to determine if it supports our assumptions. Validation tests are a key part of data quality testing and include verifying the data itself to validate its distribution, variance, features, or any other assumption we made, to make sure it holds.
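One way to put such assumptions to the test is to check that summary statistics fall within an expected range. The sketch below validates mean and standard deviation; the metric, expected values, and tolerance are illustrative assumptions:

```python
# Sketch: test distributional assumptions before trusting a dataset.
# The expected mean/stdev and tolerance are hypothetical examples.
import statistics

def validate_distribution(values, expected_mean, expected_stdev, tolerance=0.25):
    """Check that mean and stdev sit within a relative tolerance band."""
    mean_ok = abs(statistics.mean(values) - expected_mean) <= tolerance * expected_mean
    stdev_ok = abs(statistics.stdev(values) - expected_stdev) <= tolerance * expected_stdev
    return mean_ok and stdev_ok

latencies_ms = [98, 102, 95, 105, 100, 99, 101]

holds = validate_distribution(latencies_ms, expected_mean=100, expected_stdev=3)
broken = validate_distribution(latencies_ms, expected_mean=250, expected_stdev=3)
```

If the assumption doesn't hold, the dataset is rejected (or at least flagged) before any model is built on top of it.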
2. Metadata validation
We’ve already talked about data, but let’s not forget about metadata! Metadata is data that describes data. This includes data types, data schema, file formats and metadata they may hold, and more.
For example, if the data is a table, the metadata may include the schema, which includes the number of columns as well as the name and type of variable in each column. If the data is stored in a file, the metadata may include the file type as well as other descriptive characteristics such as version, configuration, and compression method.
The test itself is straightforward: the organization's best practices and standards define an expectation for each metadata value, and the test verifies that the metadata actually produced meets those expectations.
If you’re a software developer, this type of test is quite similar to unit testing a piece of code. Creating tests may take some time, but obtaining high test coverage is both achievable and recommended.
You also need to keep running these tests whenever the metadata changes, because expectations here frequently fall out of alignment. While we're used to updating unit tests as we update code, we must be prepared to dedicate the same time and effort to maintaining metadata validation as our schemas evolve.
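In the spirit of the unit-test analogy, a metadata test can be as simple as comparing an actual table schema against the expected one. The column names and types below are hypothetical:

```python
# Sketch of metadata validation: compare a table's actual schema against
# the organization's expected schema. Names and types are hypothetical.

EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "email": "string",
    "signup_date": "date",
}

def validate_schema(actual_schema):
    """Return a list of mismatches between actual and expected schema."""
    errors = []
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in actual_schema:
            errors.append(f"missing column: {column}")
        elif actual_schema[column] != dtype:
            errors.append(f"{column}: expected {dtype}, got {actual_schema[column]}")
    for column in actual_schema:
        if column not in EXPECTED_SCHEMA:
            errors.append(f"unexpected column: {column}")
    return errors

ok = validate_schema({"customer_id": "int64", "email": "string",
                      "signup_date": "date"})
bad = validate_schema({"customer_id": "string", "email": "string"})
```

Like unit tests, these checks run on every change – here, every time a schema evolves – and the `EXPECTED_SCHEMA` definition is updated alongside it.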
3. Real-time data validation
How can data practitioners achieve high-quality data during intake? One practice is to ingest data into a distinct branch that data consumers can’t see. This allows you to test the data on the branch and only merge it if the tests pass. And naturally, this calls for data versioning.
To automate the procedure, teams can set up a series of pre-merge hooks that trigger data validation tests. The changes will only be merged into the lake’s master branch if the tests pass. If a test fails, the testing solution should notify the monitoring system and provide a link to crucial validation test failure details.
Because the newly ingested data was committed to the ingestion branch, you have a snapshot of the data repository at that point. This makes determining the source of the problem simple.
This technique allows data quality validation checks to run before the data reaches consumers. All in all, testing data before it is merged into the master branch will prevent quality concerns.
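The branch-and-merge flow above can be mirrored with a simplified in-memory analogy. This is not any real versioning API (tools like lakeFS expose branch, commit, and merge operations over object storage); the validation rule and data are also illustrative:

```python
# Simplified in-memory analogy of branch-based ingestion: new data lands
# on an ingestion branch, validation runs there, and only passing data is
# merged into main. NOT a real versioning API - an illustration of the flow.

def run_validations(records):
    # Illustrative pre-merge check: amounts must be non-negative.
    return all(r.get("amount", -1) >= 0 for r in records)

lake = {"main": [{"amount": 10}]}

def ingest(new_records):
    lake["ingest-branch"] = lake["main"] + new_records  # branch from main
    if run_validations(lake["ingest-branch"]):
        lake["main"] = lake.pop("ingest-branch")        # merge on success
        return "merged"
    lake.pop("ingest-branch")                           # discard, then alert
    return "rejected"

first = ingest([{"amount": 5}])    # passes validation, reaches main
second = ingest([{"amount": -3}])  # fails validation, never reaches main
```

In a real system, `run_validations` would be triggered by a pre-merge hook, and a failure would notify the monitoring system with a link to the failing test details.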
Quick overview of data quality management tools and how to choose one
Here’s an overview of various data quality tools and testing frameworks that bring teams one step closer to high-quality data.
This open-source validation tool is simple to incorporate into your ETL code. Data can be tested using a SQL or file interface. Because it was created as a logging system, it can be used in conjunction with a documentation format to generate automatic documentation from the stated tests. It also allows you to create a data profile and set expectations that you may discuss during testing for effective data quality management.
AWS has developed an open-source tool to help developers establish and maintain metadata validation. Deequ is an Apache Spark-based framework for creating “unit tests for data,” which examine the quality of data in large datasets. Deequ is intended to work with tabular data, such as CSV files, database tables, logs, and flattened JSON files – basically, anything that fits into a Spark data frame.
This is a framework for implementing observability (one of the key data quality measures) without requiring any code. It uses machine learning to infer and understand what your data should look like, proactively find data issues, analyze their consequences, and send warnings via integrations with conventional operational systems. It also allows for the exploration of underlying causes.
Anomalo helps teams to stay ahead of data issues by automatically detecting them as soon as they occur in the data and before they affect anybody else. Data practitioners can connect Anomalo to their data warehouses and immediately start monitoring the tables they care about. The ML-powered tool can understand the historical structure and trends of the data automatically, alerting users to many concerns without the need to define rules or set thresholds.
Lightup lets data practitioners easily install and scale prebuilt data quality checks on massive volumes of data. Deep data quality checks can be deployed in minutes, not months. The solution also lets teams scale data quality tests across enterprise data pipelines quickly and efficiently using time-bound pushdown queries – without sacrificing performance. Plus, there’s an AI engine that can automatically monitor and detect data irregularities.
Bigeye monitors the health and quality of data pipelines, so teams never have to wonder if their data is reliable. Global data pipeline health checks and extensive data quality monitoring, backed by anomaly detection technology, surface issues before they disrupt the business. The tool also comes with lineage-driven root cause and effect analysis for quick insight into the roots of problems and a clear path to solutions.
Enforcing data quality with data version control
A lot of data quality problems arise from issues related to the unique ways in which data practitioners work – and the lack of tooling at their disposal.
Take a look at a typical software development team. Team members can contribute to the same repository without any misunderstandings. Different users can use different versions of the software at the same time, but developers can quickly replicate a user problem by utilizing the same version that a given user was using when they reported the problem.
Bringing the same capabilities to the data world is the goal of data version control tools. Managing data in the same manner you manage code increases the efficiency of many data operations jobs:
Data branching and versioning
When there are several versions of data, the version history is quite evident from a lineage standpoint. Engineers can simply track changes to their repositories or datasets and point consumers to newly available data.
Isolating your work
When bringing updates or corrections to existing data pipelines, these changes must be evaluated to ensure that they actually improve the data quality and do not introduce new mistakes. To do so, data engineers must be able to design and test these modifications in isolation before they become part of production data.
If you expose users to production data and something goes wrong, you can always roll back to a previous version in a single atomic operation. Eventually, this improves consumer trust in the data you deliver thanks to the good data quality.
Suppose a problem with data quality causes a decline in performance or an increase in infrastructure expenditures. With versioning, you can open a branch of the lake from the point where the changes went into production, and use it to reproduce both the environment and the problem itself, so you can start determining what's wrong.
Version control systems allow you to configure actions to be triggered when particular events occur. A webhook, for example, can verify a new file to determine if it matches one of the authorized data types.
Using a data version control platform eliminates the issues that plague large data engineering teams working on the same data. When an issue emerges, troubleshooting is substantially faster and helps everyone to focus on increasing data quality.
Data quality management calls for the right process and tooling
The reliability of your data lake is determined by the data quality of everything you add to it. High-quality data and services begin at ingestion, which calls for constant testing of newly imported data to ensure that it fulfills data quality criteria.
Even though hosting your data lake on object storage provides advantages in terms of scalability and performance, it’s still difficult to follow best practices and ensure excellent data quality. In this situation, how do you maintain data quality? The only hope is to include automation in the mix.
Continuous integration and continuous data deployment are automated processes that rely on the capacity to detect and prevent data mistakes from moving into production. You can build this functionality using a variety of open-source solutions and move towards good data quality faster.
lakeFS is one of them. It includes zero-copy isolation plus pre-commit and pre-merge hooks to help with the automated process. That way, lakeFS supports testing data quality in line with the best practices discussed above.