If you put garbage in, you’re likely to get garbage out. This phrase rings particularly true in the era of Generative AI, where models keep hallucinating despite the time and money teams pour into them.
The saying ultimately relates to the quality of the data you use to develop and train your ML models.
This article explores the role of high-quality data in ML model development and training, explaining what data quality is, how to achieve it, and the challenges to expect when improving data quality for machine learning.
What Is Data Quality?
Data quality is an assessment of how well data serves its intended purpose, typically evaluated against a data quality framework. A good level of data quality is key to supporting an organization’s operational, planning, and decision-making needs.
How to Measure Data Quality
Data quality is often measured using these six data quality dimensions:
| Dimension | Description |
|---|---|
| Accuracy | How well a piece of data conforms to reality |
| Completeness | Does the data fulfill your expectations of comprehensiveness? Are all required records and fields present, without gaps? |
| Timeliness | How current the data is and whether it is available when you need it |
| Consistency | How well data points from two or more sources agree with each other (when two records contradict each other, at least one of them is incorrect) |
| Validity | Are the data values in the correct format, type, and range? Do they conform to defined rules and best practices? |
| Integrity | Can you join different datasets to get a more complete picture? Are relationships between records defined and enforced correctly? |
When to Measure Data Quality
When should you test data quality, and what factors influence data quality dimensions?
It’s imperative to conduct data quality testing at every stage of the data lifecycle, including ingestion, transformations, testing, deployment, monitoring, and debugging.
Data quality testing during development
Testing the quality of your source data early on is a smart move. At this stage, it pays to run the following tests (a minimal pandas sketch follows the list):
- Primary keys are unique and not null
- Column values meet fundamental assumptions
- No rows contain duplicates
- Source freshness checks confirm that your ETL tool regularly updates source data
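Here’s a minimal sketch of those checks in pandas; the file path and column names (`order_id`, `amount`, `updated_at`) are illustrative assumptions, not part of any specific pipeline.

```python
import pandas as pd

# Hypothetical raw extract; swap in your own source.
df = pd.read_csv("orders.csv")

# Primary key: unique and not null.
assert df["order_id"].notna().all(), "order_id contains nulls"
assert df["order_id"].is_unique, "order_id contains duplicates"

# Column values meet fundamental assumptions.
assert (df["amount"] >= 0).all(), "amount contains negative values"

# No duplicate rows.
assert not df.duplicated().any(), "dataset contains duplicate rows"

# Source freshness: the most recent record should be no older than one day.
latest = pd.to_datetime(df["updated_at"]).max()
assert latest >= pd.Timestamp.now() - pd.Timedelta(days=1), "source data is stale"
```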
Data quality testing during transformation
Many issues can arise when you clean, aggregate, join, and apply business logic to raw data, layering in further manipulations and generating new metrics and dimensions with SQL and Python.
Now is the moment to assess data quality and check whether (see the sketch after this list):
- The primary keys are unique and not null
- The row counts are accurate
- Joins don’t generate duplicate rows
- Upstream and downstream dependencies interact the way you expect
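As a rough illustration, the sketch below verifies a join and its row counts with pandas; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical staging tables produced by earlier transformation steps.
orders = pd.read_parquet("staging/orders.parquet")
customers = pd.read_parquet("staging/customers.parquet")

joined = orders.merge(customers, on="customer_id", how="left")

# Primary key is still unique and not null after the transformation.
assert joined["order_id"].notna().all() and joined["order_id"].is_unique

# Row counts: a left join should neither add nor drop order rows.
assert len(joined) == len(orders), "join generated duplicate or missing rows"

# Upstream/downstream expectation: every order maps to a known customer.
assert joined["customer_name"].notna().all(), "orders reference unknown customers"
```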
Data quality testing during pull requests
It’s good practice to assess data quality during pull requests before incorporating data transformation modifications into your analytics code base. Contextualized test success/failure results help with code review and serve as a final check before releasing the code to production.
In practice, you will test a GitHub pull request containing a snapshot of the data transformation code.
If you use a Git-based data transformation tool, you can ask other members of your data team to contribute. By reviewing your changes, they can help you resolve errors and establish a solid analytics foundation.
Ensure that no new data models or transformation code enter your code base without being vetted and tested against your team’s standards.
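One common pattern is to wrap such checks in a test runner that your CI system executes on every pull request. The sketch below assumes pytest and a hypothetical `build_revenue_model()` transformation in a `transformations` module.

```python
# test_revenue_model.py -- executed by CI on every pull request.
import pandas as pd
import pytest

from transformations import build_revenue_model  # hypothetical module under review


@pytest.fixture(scope="module")
def revenue() -> pd.DataFrame:
    # Build the model against a small, representative snapshot of source data.
    return build_revenue_model(snapshot="ci_sample")


def test_primary_key_is_unique_and_not_null(revenue):
    assert revenue["revenue_id"].notna().all()
    assert revenue["revenue_id"].is_unique


def test_no_negative_amounts(revenue):
    assert (revenue["amount"] >= 0).all()
```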
Data quality testing during production
Once your data transformations and tests have been merged into your main production branch, you must run them on a regular basis to ensure high data quality.
This is because your data model could undergo a variety of changes. For example, a software engineer may implement a new feature that alters your source data, or a business user may add a new field to the ERP system, leading your data transformation’s business logic to fail.
For example, your ETL process may end up duplicating or missing data in your warehouse. This is when automated testing is useful. It lets you be the first to notice any unexpected activity in your business or data. Airflow, automation servers like GitLab CI/CD or CodeBuild, and cron job scheduling are all common ways to execute data tests in production.
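For instance, a scheduled Airflow DAG (assuming Airflow 2.x) can run the same checks every day and fail loudly when they break; the warehouse path and checks below are illustrative.

```python
# daily_data_quality.py -- a minimal Airflow 2.x sketch; the check itself is illustrative.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def check_orders():
    # Hypothetical warehouse export; swap in your own connection or query.
    df = pd.read_parquet("warehouse/orders.parquet")
    assert df["order_id"].notna().all(), "null order ids in the warehouse"
    assert df["order_id"].is_unique, "duplicate orders in the warehouse"


with DAG(
    dag_id="daily_data_quality",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    PythonOperator(task_id="check_orders", python_callable=check_orders)
```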
The Impact of Poor Data Quality on ML Models
When attempting to address real-world problems using ML applications, the first step is typically data preparation. Raw data frequently has inconsistencies you need to resolve before the dataset can be given to machine learning algorithms.
What issues can you encounter when working with raw data? Here are some common problems (a short cleaning sketch follows the list):
- Missing dataset values – It’s not always possible to have a value for every feature variable in every row that corresponds to the target variable. A data gathering system may be unable to capture some values due to technical limitations. These gaps end up in the dataset as missing values.
- Different file formats – When working with real-world machine learning datasets, it’s rare for all files to share the same format. Prepare to handle several formats.
- Variable value inconsistency – A dataset’s variables may contain inconsistent or unusable values. You need to clean them up to get the most out of your data processing effort.
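The sketch below tackles a couple of these issues with pandas; the column names and cleaning rules are hypothetical, and real projects will need more deliberate choices (for example, between imputing and dropping missing values).

```python
import pandas as pd

df = pd.read_csv("raw/customers.csv")  # hypothetical raw extract

# Missing values: impute numeric gaps, drop rows that lack the target variable.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["churned"])

# Variable value inconsistency: normalize free-text categories.
df["country"] = (
    df["country"].str.strip().str.upper().replace({"U.S.": "US", "USA": "US"})
)
```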
Common Data Quality Challenges in ML
The process of achieving high data quality comes with several challenges:
- Data collection challenges – Data collection may appear to be the simplest phase in an ML project. But this is often the step that prevents organizations from tapping into the potential of their data for ML purposes.
- Legal regulations – One problem in collecting text, audio, photo, and video data for ML training models is complying with regulatory rules regarding personal data acquisition.
- Data management and integration – Effective data management and integration are critical for successful machine learning initiatives. Without a strong data management strategy, companies struggle to consolidate and harmonize data from many sources, which can result in fragmented and compartmentalized data that is difficult to evaluate comprehensively.
- Data governance and compliance – Ensuring compliance with data governance policies and regulations is another difficult task. Organizations must navigate a tangle of legal regulations for data privacy and security. Failure to comply with these regulations can result in significant fines and reputational damage.
- Scalability issues – As the amount of data increases, so does the complexity of maintaining and analyzing it. Growing data infrastructure to accommodate massive datasets while preserving performance and reliability is a huge challenge that calls for strong data architectures and innovative technologies.
- Bias and fairness – Data bias can lead to biased ML models, resulting in unfair or discriminating outcomes. To address bias, consider data sources, collection techniques, and continuous model output monitoring.
Best Practices For Managing Data In Machine Learning
Keep data quality in check
The first thing to examine is whether your data is reliable for machine learning training. Even the most advanced machine learning algorithms can’t work with bad data, so data quality is an important goal for every ML team, and the pre-processing phase should ensure that the data is clean.
There are numerous questions you might ask at this point (the sketch after this list turns a few of them into code):
- Were there any technical challenges with the data transfer?
- How many missing values are in your data?
- Is your data sufficient for your task?
- Is your data biased?
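A quick profiling pass can answer several of these questions; here’s a minimal sketch assuming a pandas DataFrame with a hypothetical `label` column.

```python
import pandas as pd

df = pd.read_parquet("training_data.parquet")  # hypothetical training set

# How many missing values are in your data?
print(df.isna().mean().sort_values(ascending=False).head(10))

# Is your data sufficient for your task?
print(f"rows: {len(df)}, columns: {df.shape[1]}")

# Is your data biased? A heavily skewed label distribution is one warning sign.
print(df["label"].value_counts(normalize=True))
```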
Data quality efforts begin with the initial stages of data collection, including ingestion and transformation. Make data quality management a central component of your ML project.
Develop a data governance structure
Data governance is a set of principles, methodologies, and technologies that ensure the proper administration and usage of data in projects like machine learning systems. It’s an essential component of data management in any company.
It should come as no surprise that it has become much more important with the rise of machine learning applications.
Access to sensitive data must be restricted, and teams must adhere to data protection regulations such as GDPR and CCPA. Encryption, access control, and regular system audits all contribute to improving data security and privacy.
Another aspect of governance activities is tracking the origin and transformation of data as it moves through the ML pipeline. This allows you to assess the influence of data on model performance and ensures data pipeline traceability. Data lineage is extremely important in machine learning applications because it lets you discover data sources and changes that influence model outcomes.
Finally, there is adherence to acceptable industry standards and norms, such as the Health Insurance Portability and Accountability Act (HIPAA), to avoid any legal or ethical issues linked with data use in ML applications.
Implement version control for datasets
Keeping track of multiple versions of your data can be just as challenging. Without proper coordination, balance, and accuracy, things can quickly unravel. Data versioning is a technique that allows you to keep track of multiple versions of the same data without incurring significant time investment, human error, or storage costs.
Creating machine learning models is about iterating on code, data, and models together; it also entails training models on different data and optimizing the appropriate parameters.
Updating models is an iterative process that may require you to maintain track of all previous modifications. Data versioning enables you to save a snapshot of the training data and experimental results, making implementation easier at each iteration. The adoption of a data version control system provides faster time to market for AI/ML models and increases the quality of the results they provide.
Data version control can automate data quality checks to help data teams develop and enforce data governance. For example, having the ability to segregate data at ingest or after generation and only expose it to users if it has been verified requires data branching (shallow copy), data merging processes, and pre-merge hooks that run validation checks.
Almost every business vertical is subject to audits of the data behind the models it uses with customers, which compels organizations to retain certain information to verify compliance and the history of their data sources. In this scenario, data versioning can benefit both internal and external audits.
Monitoring and maintenance
As with any project, ongoing monitoring and maintenance are key in ML. If you want your models to produce reliable results, you should update their datasets on a regular basis to reflect real-world changes.
Another factor is quality. Smart teams use technologies to constantly monitor data quality and drift and implement automated data validation techniques.
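As one illustration, a simple drift monitor can compare a production feature’s distribution against the training distribution with a two-sample Kolmogorov–Smirnov test; the file paths, column name, and threshold below are assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("training_data.parquet")  # hypothetical reference sample
recent = pd.read_parquet("last_7_days.parquet")   # hypothetical production sample

statistic, p_value = ks_2samp(train["amount"], recent["amount"])
if p_value < 0.01:  # illustrative threshold
    print(f"Possible drift in 'amount' (KS statistic={statistic:.3f}, p={p_value:.4f})")
```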
What Tools Can You Use To Improve Data Quality?
Overview of Popular Tools for Data Cleaning and Transformation
A data cleaning tool speeds up and simplifies the process by automating many tasks. Some useful data cleaning tools are Trifacta Wrangler, Astera Centerprise, OpenRefine, Winpure, and TIBCO Clarity.
You can use such a tool for the following (a small pandas sketch approximating a few of these steps follows the table):
| Process | What It Does |
|---|---|
| Automated data profiling | It automatically scans and profiles the entire dataset to identify potential data quality issues such as missing values, duplication, inconsistencies, and formatting errors. |
| Standardization | These tools use standardization criteria to ensure that data follows a consistent format and compare them to predefined rules or reference data. |
| Deduplication | Data cleansing technologies can easily detect duplicate data or entries and automatically combine or remove them. |
| Parsing and transformation | A tool can parse complex data structures like addresses or names and automatically convert them into a standard format. |
| Error correction | Using predefined criteria, these technologies can automatically correct common errors like misspellings or incorrect values. |
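Dedicated tools automate these steps end to end, but the sketch below approximates a few of them in pandas to make the ideas concrete; the column names and correction rules are hypothetical.

```python
import pandas as pd

df = pd.read_csv("contacts.csv")  # hypothetical input

# Profiling (a crude approximation): nulls per column and duplicate rows.
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

# Parsing and transformation: coerce dates into a single standard format.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Deduplication: drop repeated entries keyed on email.
df = df.drop_duplicates(subset=["email"])

# Error correction: fix common misspellings using predefined rules.
df["state"] = df["state"].replace({"Calfornia": "California", "N.Y.": "NY"})
```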
Tools for Data Quality Validation
A good data cleaning and transformation tool should support extensive data profiling, data quality checks, and workflow automation so you can automate the entire data cleansing process, from profiling to transformation, validation, and loading into the desired destination.
Write-Audit-Publish for ML Preprocessing Pipelines
Automated data quality tests must be used to ensure the quality of data created by preprocessing pipelines. Two sets of tools are required to implement this CI/CD process for data pipelines: data version control systems and data quality tools.
Preprocessing pipelines move processed data from data lakes to downstream applications like business dashboards and machine learning algorithms. They are vital to ensuring that production data meets corporate data governance standards.
Data Version Control Systems
Data version control systems offer several mechanisms for bringing CI/CD to data in the context of machine learning. For example, the open-source solution lakeFS allows you to implement a Write-Audit-Publish process by triggering specific rules and validation checks with pre-commit and pre-merge hooks.
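Conceptually, the write-audit-publish flow looks like the sketch below. The branching and merging helpers are hypothetical stand-ins for whatever your data version control system exposes (lakeFS, for example, provides branching, merging, and hooks through its CLI, API, and SDKs).

```python
import pandas as pd

# Hypothetical helpers standing in for your data version control system's API.
from data_versioning import create_branch, write_parquet, merge_branch, delete_branch


def audit(df: pd.DataFrame) -> bool:
    """Validation checks that must pass before the data is published."""
    return bool(df["order_id"].is_unique and df["order_id"].notna().all())


def write_audit_publish(df: pd.DataFrame) -> None:
    branch = create_branch(source="main", name="staging-orders")  # write in isolation
    write_parquet(df, branch=branch, path="orders/orders.parquet")
    if audit(df):
        merge_branch(branch, into="main")  # publish: expose verified data to consumers
    else:
        delete_branch(branch)              # discard the unverified write
```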
Data Quality Tools
Data quality tools let you streamline and, in many situations, automate data management operations, ensuring that your data is ready for analytics, data science, and machine learning applications. A solution like this allows teams to examine current data pipelines, detect quality bottlenecks, and automate various repair operations.
Using a data quality tool increases customers’ trust in the data: they know the tool has filtered out low-quality data, leaving only high-quality data and enabling true data-driven decision-making.
Examples of data quality tools include Great Expectations, Monte Carlo, Deequ, Lightup, Anomalo, Acceldata, Bigeye, and Datafold.
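As a small illustration, the sketch below uses the classic pandas-dataset API from older (0.x) Great Expectations releases; newer releases restructure the API, so treat this as a sketch of the idea rather than current usage, and the table and column names as assumptions.

```python
# Illustrative only: classic pandas-dataset API from older (0.x) Great Expectations releases.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("warehouse/orders.parquet")  # hypothetical table
gdf = ge.from_pandas(df)

results = [
    gdf.expect_column_values_to_not_be_null("order_id"),
    gdf.expect_column_values_to_be_unique("order_id"),
    gdf.expect_column_values_to_be_between("amount", min_value=0),
]
assert all(r.success for r in results), "data quality expectations failed"
```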
Real-World Example: Data Preprocessing Using lakeFS
AI/ML projects require iterative work in every phase and depend on high-quality data. During the pre-processing phase, lakeFS helps ensure the data used to train models meets quality standards and effectively exposes features and their underlying data to data scientists. Without proper coordination, balance, and precision, everything can easily unravel, and when launching a new machine learning project, that is the last place you want to find yourself.
Data versioning is a mechanism for keeping track of many versions of the same data without incurring considerable storage costs.
It allows you to save a snapshot of the training data and experimental outcomes, making implementation easier with each iteration. lakeFS helps you implement data versioning at all stages of the data’s lifetime.

lakeFS provides zero-copy branch isolation along with pre-commit and pre-merge hooks, allowing you to design an automated validation procedure. Overall, it’s a huge help in assessing data quality using the methods outlined above.
Wrap Up
Machine learning relies heavily on high-quality data. Data quality challenges are nothing new, but as ML models grow larger and data becomes more available, you must discover scalable solutions for compiling high-quality training data.
Fortunately, data practitioners now have an increasing number of tools to help them overcome the challenges of producing clean, accurate, and trustworthy data.


