14 Most Common Data Quality Issues and How to Fix Them

Idan Novogroder

December 21, 2023

Data quality is a key objective for any organization looking to extract value from data. We can generally define high data quality as a state in which data is accurate, consistent, comprehensive, and up-to-date. However, the concept of quality is also context-dependent. 

Different jobs or applications require different types of data and, as a result, focus on different quality criteria. 

There is no universal standard of quality, either. A collection of credit card transactions full of cancelled transactions and verification issues may be of little use for sales analysis; that team would call it poor data quality. But the team running fraud analysis might have an entirely different opinion.

Before setting out to build your quality management process or data quality framework, it’s essential to understand exactly what can go wrong with data. Let’s dive into the most common data quality issues, such as inaccurate data or duplicate data, to help you prepare for managing data quality at your organization.

14 most common data quality issues

Data Quality Issue – How To Deal With It

  • Duplicate data – Use rule-based data quality management and tools that detect fuzzy and perfectly matching data, quantify it as a probability score for duplication, and assist in delivering continuous data quality across all applications.
  • Inaccurate and missing data – Automation can help to some extent, but specialized data quality solutions can offer considerably greater data accuracy.
  • Ambiguous data – Track down issues as they emerge by continuously monitoring using autogenerated rules.
  • Hidden or dark data – Use tools that can find hidden correlations, such as cross-column anomalies and ‘unknown unknowns’ in your data. A data catalog is a good solution as well.
  • Outdated data – Review and update data on a regular basis, develop a data governance plan, consider data management outsourcing services, and find a machine learning solution for detecting obsolete data.
  • Inconsistent data – Use a data quality management tool that automatically profiles the datasets, flagging quality concerns around inaccurate data.
  • Irrelevant data – Define your data needs for a project and consider using filters to eliminate irrelevant data from huge data collections.
  • Orphaned data – Orphaned data should be detectable by data quality management tools; identify the source of the discrepancy and correct it.
  • Unstructured data – To deal with the challenge of unstructured data and unlock value from it, consider using automation and machine learning.
  • Data format inconsistencies – Use a data quality monitoring solution that profiles individual datasets and finds formatting flaws.
  • Data downtime – Continually monitor data downtime and minimize it using automated methods, and consider a solution for assuring continuous access to reliable data.
  • Data overload – Use a tool that delivers continuous data quality across various sources without transferring or removing any data.
  • Data illiteracy – Consider running training sessions and data literacy workshops to explain the data to all teams working on it and help team members maximize its value for decision-making.
  • Human error – Make sure that everyone knows how to use data management systems to limit the risk of human error.

1. Duplicate data

Modern companies are bombarded with data from all sides: local databases, cloud data lakes, and streaming data. They have multiple applications and deal with system silos. The sheer scale of data sources often causes redundancy and overlap via duplicate records. 

Data issues like duplication of contact information may impact the customer experience. Marketing initiatives suffer when certain prospects are overlooked while others are addressed repeatedly. Duplicate records also increase the likelihood of distorted analytical outcomes. On top of that, duplicate data can skew ML models when used as training data.

How to deal with duplicate data?

You can reduce duplicate and overlapping data using rule-based data quality management. Such rules are developed automatically and continually refined by learning from the data. 

There are tools on the market that detect fuzzy and perfectly matching data, quantify it as a probability score for duplication, and assist in delivering continuous data quality across all applications.  
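
To make this concrete, here is a minimal sketch of exact and fuzzy duplicate detection in Python, using pandas and the standard library's difflib rather than any particular commercial tool; the `contacts` table, its columns, and the 0.85 threshold are hypothetical.

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical contact records; in practice these would come from your sources.
contacts = pd.DataFrame({
    "name":  ["Ada Lovelace", "Ada Lovelace", "Ada Lovelaec", "Grace Hopper"],
    "email": ["ada@example.com", "ada@example.com", "ada@exampel.com", "grace@example.com"],
})

# Exact duplicates: every row that matches another row on all columns.
exact_dupes = contacts[contacts.duplicated(keep=False)]
print("Exact duplicates:\n", exact_dupes)

# Fuzzy duplicates: score each pair of remaining rows and treat the
# similarity ratio as a rough "probability of duplication".
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = contacts.drop_duplicates().reset_index(drop=True)
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records.loc[i, "name"] + records.loc[i, "email"],
                           records.loc[j, "name"] + records.loc[j, "email"])
        if score > 0.85:  # the threshold is a tuning knob, not a standard
            print(f"Likely duplicates (score {score:.2f}): rows {i} and {j}")
```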

2. Inaccurate and missing data

Inaccurate or incorrect data doesn’t provide a true picture of the situation and cannot be used to plan an effective response. If your customer data is inaccurate, tailored customer experiences fall short and marketing initiatives perform poorly. Data accuracy is crucial in highly regulated industries such as healthcare. 

Data inaccuracies can be attributed to a variety of sources, including human error, data drift, and data decay. According to Gartner, around 3% of data globally decays each month. Data quality can deteriorate with time, and data integrity can be compromised as data travels through numerous systems.

How to deal with inaccurate and missing data? 

Automation can help to some extent, but specialized data quality solutions can offer considerably greater data accuracy. It’s important to discover data quality concerns early in the data lifecycle and proactively correct them to fuel trustworthy analytics. 
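
As an illustration of catching these issues early, the sketch below runs simple completeness and accuracy checks with pandas; the `orders` table, its rules, and the 5% tolerance are hypothetical assumptions, not a standard.

```python
import pandas as pd

# Hypothetical orders table with a missing price and an impossible quantity.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "price":    [19.99, None, 5.00, 12.50],
    "quantity": [1, 2, -3, 4],
})

# Completeness: how many values are missing per column?
print("Missing values per column:\n", orders.isna().sum())

# Accuracy: flag rows that violate simple business rules.
invalid = orders[orders["price"].isna() | (orders["quantity"] <= 0)]
print("Rows failing basic accuracy rules:\n", invalid)

# Gate the pipeline if issues exceed a tolerated rate (hypothetical threshold).
error_rate = len(invalid) / len(orders)
if error_rate > 0.05:
    print(f"Data quality gate failed: {error_rate:.0%} of rows are invalid")
```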

3. Ambiguous data

Even with tight monitoring, mistakes may slip into big databases or data lakes. When data values are pouring in at tremendous speed, the situation becomes much more overwhelming. Column titles can be deceptive, formatting flaws can occur, and spelling errors can go undiscovered. Such inaccurate data can lead to a slew of problems in reporting and analytics. 

How to deal with ambiguous data? 

The best approach is to track down issues as they emerge by continuously monitoring the data with autogenerated rules. This builds high-quality data pipelines for real-time analytics and reliable results.
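
One simplified way to approximate autogenerated rules (a sketch, not any vendor's implementation) is to derive an expected range from historical data and flag new values that fall outside it; the column name and the three-sigma rule below are assumptions.

```python
import pandas as pd

# Historical data used to learn a simple expectation for a numeric column.
history = pd.DataFrame({"daily_revenue": [1000, 1100, 950, 1050, 990, 1020]})
new_batch = pd.DataFrame({"daily_revenue": [1010, 15000]})  # 15000 looks suspicious

# "Autogenerate" a rule: values should stay within mean +/- 3 standard deviations.
mean, std = history["daily_revenue"].mean(), history["daily_revenue"].std()
lower, upper = mean - 3 * std, mean + 3 * std

violations = new_batch[(new_batch["daily_revenue"] < lower) |
                       (new_batch["daily_revenue"] > upper)]
print(f"Learned rule: {lower:.0f} <= daily_revenue <= {upper:.0f}")
print("Values breaking the rule:\n", violations)
```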

4. Hidden or dark data

Most businesses use just a portion of their data, with the remainder lost in data silos or discarded in data graveyards. For example, accessible customer data from sales might never get shared with the customer care team, resulting in a missed chance to build accurate client profiles. 

Hidden data might easily lead an organization to miss out on possibilities to improve services, build novel products, and optimize procedures.

How to deal with hidden or dark data? 

To mitigate this problem, try using tools that can find hidden correlations, such as cross-column anomalies and ‘unknown unknowns’ in your data. A data catalog is a good solution here. According to a recent survey, best-in-class companies are 30% more likely to have a dedicated data catalog.

5. Outdated data

Collected data can soon become obsolete, unavoidably leading to data decay. All data that is no longer current, accurate, or useful is called obsolete data. 

Information about a customer, such as name, address, contact information, and so on, is a great example of information that must be kept up to date. Otherwise, you may miss marketing or sales opportunities.

Obsolete data often signals an organization’s lack of investment and interest in data management technologies. Outdated data can lead to inaccurate insights, poor decision-making, and misleading results.

How to deal with outdated data? 

That’s why it’s important to do the following (a simple staleness check is sketched after this list):

  • Review and update data on a regular basis
  • Develop a data governance plan
  • Consider data management outsourcing services
  • Find a machine learning solution for detecting obsolete data
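
The staleness check mentioned above can be as simple as comparing each record's last-update timestamp against a cutoff; the `customers` table, its columns, and the 365-day threshold in this sketch are hypothetical.

```python
import pandas as pd
from datetime import datetime, timedelta

# Hypothetical customer records with a last_updated timestamp.
customers = pd.DataFrame({
    "customer_id":  [101, 102, 103],
    "last_updated": pd.to_datetime(["2023-11-01", "2021-03-15", "2023-12-10"]),
})

# Anything not touched within the last 365 days is considered obsolete here.
# A fixed "today" keeps the example deterministic; use datetime.now() in practice.
cutoff = datetime(2023, 12, 21) - timedelta(days=365)
stale = customers[customers["last_updated"] < cutoff]
print("Records that look outdated and need review:\n", stale)
```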

6. Inconsistent data

Mismatches in the same information across sources tend to happen when you’re working with various data sources. The differences might be in formats, units, or spellings. You might also introduce inconsistent data during mergers and acquisitions.

Inconsistencies in data values tend to accumulate and degrade the usefulness of data if they’re not continually resolved. Data-driven companies need to pay attention to data consistency – after all, they only want reliable data to power their analytics.

How to deal with inconsistent data? 

To solve this problem, you need a data quality management tool that automatically profiles the datasets, flagging quality concerns. Some may include adaptive rules that continue to learn from data, ensuring that discrepancies are resolved at the source and that data pipelines only supply trustworthy data.  
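
For instance, a simplified profiling pass can flag value-level disagreements between two sources by joining them on a shared key; the `crm` and `sales` tables and their columns below are hypothetical.

```python
import pandas as pd

# The same customers exported from two hypothetical systems.
crm   = pd.DataFrame({"customer_id": [1, 2], "country": ["United States", "Germany"]})
sales = pd.DataFrame({"customer_id": [1, 2], "country": ["USA", "Germany"]})

# Join on the shared key and flag rows where the two sources disagree.
merged = crm.merge(sales, on="customer_id", suffixes=("_crm", "_sales"))
mismatches = merged[merged["country_crm"] != merged["country_sales"]]
print("Inconsistent values between sources:\n", mismatches)
```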

7. Irrelevant data

Many businesses assume that acquiring and retaining every customer’s data will help them at some point in the future. But since the amount of data is huge and not all of it is instantly helpful, organizations might instead confront another issue: irrelevant data.

Irrelevant data that has been retained for a long time will quickly become obsolete and lose its value, burdening IT infrastructure and taking up the valuable time of data teams. Such data doesn’t give important insights into, say, a company’s product sales patterns. It may even be distracting when you’re examining the data points that matter.

How to deal with irrelevant data? 

To solve this problem, define your data needs for a project, such as data components, sources, etc. Consider using filters to eliminate irrelevant data from huge data collections. Data visualization tools are helpful here, as they draw your attention to important patterns.
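
As a small example, once the relevant components are defined, trimming a dataset can be a single filtering step; the `events` table and its columns are hypothetical.

```python
import pandas as pd

# A wide, hypothetical event export with columns the analysis doesn't need.
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "product":  ["A", "B", "A"],
    "revenue":  [10.0, 0.0, 25.0],
    "debug_payload": ["...", "...", "..."],   # irrelevant for sales analysis
})

# Keep only the columns and rows that are relevant to the question at hand.
relevant = events.loc[events["revenue"] > 0, ["event_id", "product", "revenue"]]
print(relevant)
```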

8. Orphaned data

Orphaned data doesn’t deliver any value. Data usually becomes orphaned when it’s incompatible with an existing system or difficult to transform into a usable format. For example, an orphan is a customer record that exists in one database but has no matching record in another.

How to deal with orphaned data? 

Orphaned data should be detectable by data quality management tools. Once discovered, you can identify the source of the discrepancy and, in many cases, correct it so the orphaned data becomes fully usable.
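
Under the hood, detecting orphans amounts to an anti-join between two tables; here is a minimal pandas sketch with hypothetical `billing` and `crm` tables.

```python
import pandas as pd

# Customers known to the billing system vs. customers known to the CRM.
billing = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
crm     = pd.DataFrame({"customer_id": [1, 2, 4]})

# Anti-join: billing records with no matching CRM record are orphans.
orphans = billing.merge(crm, on="customer_id", how="left", indicator=True)
orphans = orphans[orphans["_merge"] == "left_only"].drop(columns="_merge")
print("Orphaned customer records:\n", orphans)
```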

9. Unstructured data

Unstructured data is not only a type of data but also a potential data quality concern for a variety of reasons. Because unstructured data refers to anything that isn’t arranged in a predefined structure, such as free text, audio, or images, it can be difficult for organizations to store and analyze. Not to mention initiatives like data quality testing!

Unstructured data originates from numerous sources and may contain duplicates, irrelevant data, or errors. Converting unstructured data into relevant insights calls for specialized tools and integration techniques. 

How to deal with unstructured data? 

To deal with unstructured data and unlock value from it, consider using automation and machine learning. Building a strong team with professionals who have particular data administration and analytical skills is essential. Data governance policies are helpful too because they guide data management practices. To limit the introduction of unstructured data, use data validation checks.


10. Data format inconsistencies

The same piece of information can be formatted in a variety of ways. Consider the various ways you can write a date: June 16, 2023; 6/16/2023; 6-16-23; 16.06.2023. Since diverse sources frequently use different formats, these mismatches can lead to serious data quality problems.

Working with different unit systems can lead to comparable problems. If one source uses metric measures while another uses feet and inches, you need to establish an internal standard and guarantee that all imported data is transformed accurately.

This is the problem NASA encountered when it lost the $125 million Mars Climate Orbiter because the Jet Propulsion Laboratory used metric units while contractor Lockheed Martin Astronautics used the English system of feet and pounds.

How to deal with data format inconsistencies? 

To address this issue, you need a data quality monitoring solution that profiles individual datasets and finds formatting flaws. Once found, changing data from one format to another should be a piece of cake.
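
As a sketch of that conversion step, the snippet below normalizes several date formats into one canonical representation and converts feet to meters; the formats, columns, and data are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "ship_date": ["June 16, 2023", "6/16/2023", "16.06.2023"],
    "length_ft": [3.0, 12.5, 7.2],   # one source reports length in feet
})

# Try each known source format in turn until one parses.
KNOWN_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%d.%m.%Y"]

def parse_date(value: str) -> pd.Timestamp:
    for fmt in KNOWN_FORMATS:
        try:
            return pd.to_datetime(value, format=fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")

raw["ship_date"] = raw["ship_date"].map(parse_date)

# Convert imperial units to the internal metric standard (1 ft = 0.3048 m).
raw["length_m"] = raw["length_ft"] * 0.3048
print(raw[["ship_date", "length_m"]])
```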

11. Data downtime

This is one of the most impactful and common data quality issues a team may encounter. Companies that are data-driven rely on data to fuel their decisions and operations. However, there may be brief periods when their data is not trustworthy or ready – for example, during M&As, reorganizations, infrastructure improvements, and migrations. 

A data outage can have a significant impact on businesses, including consumer complaints and poor analytical results. The causes of data outages might range from schema modifications to migration concerns. The complexity and size of data pipelines can also be troublesome. 

How to deal with data downtime? 

Continually monitor data downtime and minimize it using automated methods. Implementing SLAs can help control data downtime. But what you ultimately need is a solution for assuring continuous access to reliable data.
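
A minimal sketch of such an automated freshness check is shown below; the load timestamp, the two-hour SLA, and the alerting behavior are hypothetical placeholders for whatever your pipeline actually exposes.

```python
from datetime import datetime, timedelta, timezone

# In practice this timestamp would come from your warehouse's metadata,
# e.g. the max(updated_at) of the most recently loaded partition.
last_successful_load = datetime(2023, 12, 21, 6, 0, tzinfo=timezone.utc)
now = datetime(2023, 12, 21, 9, 30, tzinfo=timezone.utc)

SLA = timedelta(hours=2)          # hypothetical agreement with data consumers
lag = now - last_successful_load

if lag > SLA:
    # In a real pipeline this would page on-call or block downstream jobs.
    print(f"Data downtime: freshness lag {lag} exceeds SLA of {SLA}")
else:
    print(f"Data is fresh (lag: {lag})")
```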

12. Data overload

Is having too much data a problem with data quality? It can be. When looking for data relevant to your analytical initiatives, it’s easy to become lost in a sea of data; data scientists reportedly spend around 80% of their time obtaining and preparing the right data. Other data quality challenges become more serious as data volume increases, particularly with streaming data and large files or databases.

How to deal with data overload? 

If data overload is becoming a problem, consider using a tool that delivers continuous data quality across various sources without transferring or removing any data, using techniques such as automated profiling, outlier identification, schema change detection, and pattern analysis.
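
One of those techniques, schema change detection, can be sketched as a comparison between two schema snapshots; the column names and types below are hypothetical.

```python
# Hypothetical schema snapshots captured on two consecutive pipeline runs.
yesterday = {"order_id": "int", "price": "float", "currency": "string"}
today     = {"order_id": "int", "price": "string", "country": "string"}

added   = today.keys() - yesterday.keys()
removed = yesterday.keys() - today.keys()
retyped = {c for c in today.keys() & yesterday.keys() if today[c] != yesterday[c]}

# Surfacing these differences early prevents silent breakage downstream.
print(f"Added columns:   {added}")
print(f"Removed columns: {removed}")
print(f"Type changes:    {retyped}")
```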

13. Data illiteracy

Despite their best efforts, organizational teams who are not data literate will make inaccurate data quality assumptions. Understanding data characteristics is difficult since the same field might mean different things in different records. You also need experience to visualize the impact of adjustments and what each attribute means. 

How to deal with data illiteracy? 

Consider running training sessions and data literacy workshops to explain the data to all teams working on it and help team members maximize its value for decision-making. 

14. Human error

Human error is one of the most prevalent sources of common data quality issues. Data entry relies on human input, so when this fails, the data is rendered essentially useless. Note that human error can occur on both the client and the company side of data entry.

For example, if consumers and prospects enter their contact information into a website form, there is a chance that they may enter it incorrectly.

How to deal with human error? 

Make sure that everyone knows how to use data management systems to limit the risk of human error.
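
As a small illustration, validating input before it reaches your systems removes a whole class of human error; the form fields and the deliberately simple email pattern below are assumptions for the sketch.

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # deliberately simple

def validate_contact(form: dict) -> list[str]:
    """Return a list of human-readable problems with a submitted form."""
    problems = []
    if not form.get("name", "").strip():
        problems.append("name is required")
    if not EMAIL_PATTERN.match(form.get("email", "")):
        problems.append("email does not look valid")
    return problems

print(validate_contact({"name": "Ada", "email": "ada@example.com"}))  # []
print(validate_contact({"name": "",    "email": "not-an-email"}))     # two problems
```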

Where to address data quality issues

[Figure: the data lifecycle, from ingestion and transformations through testing and deployment to monitoring and debugging]

Source system

The best place to identify data quality issues is at the source of the data. This means addressing the systems and processes involved in data collection. Solving issues at this layer is challenging due to the high volume of interaction required at the business process layer. Other problems, like inaccurate data, may arise if the data is supplied by a third party over which you have no control.

Even if source systems are a great place to start working on data quality issues, it may be difficult to get buy-in throughout the organization for such an initiative.

ETL process

How can you use ETL to address data quality problems and make sure that the data in your data warehouse is clean, consistent, and relevant? During the ETL process, data is transported from diverse sources to a data warehouse while undergoing transformations and validations. 

Some of the steps you can execute during the ETL process are listed below (a small validation sketch follows the list):

  • Data profiling – assessing data source structure, content, and metadata and generating statistics and summaries that define their features and quality. This is how you identify issues such as inconsistent or incompatible data types and formats, valid values and ranges, completeness and correctness, consistency and integrity, timeliness, and relevance. 
  • Data cleaning – this is where you rectify, delete, or replace any incorrect, incomplete, or inconsistent data. It’s the process of modifying or improving data using rules, functions, or algorithms to guarantee it fulfills quality requirements. 
  • Data validation – this step is all about verifying that the data is loaded into the data warehouse accurately and that it matches the source data and the desired output. Data validation is the process of comparing, testing, and confirming data prior to, during, and after the ETL process to discover and resolve any mistakes or inconsistencies. 
  • Monitoring – finally, you need to continuously track, measure, and evaluate the quality of your data warehouse. Ongoing monitoring helps maintain and improve that quality over time.
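
To make the validation step concrete, here is a minimal sketch that reconciles a hypothetical source extract against what landed in the warehouse; the tables, keys, and checks are illustrative assumptions.

```python
import pandas as pd

# Hypothetical extracts: what left the source vs. what landed in the warehouse.
source    = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
warehouse = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 31.0]})

checks = {
    # Completeness: did every row make it across?
    "row_count_matches": len(source) == len(warehouse),
    # Integrity: do the key sets match exactly?
    "keys_match": set(source["order_id"]) == set(warehouse["order_id"]),
    # Accuracy: does a simple aggregate reconcile?
    "amount_total_matches": source["amount"].sum() == warehouse["amount"].sum(),
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```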

Metadata layer

If you lack control over the ETL process and need to analyze a dataset “as is,” you can apply rules and logic within a metadata layer to deal with data quality issues. 

You can apply some of the rules and logic you would have used in an ETL process, but the underlying data is not changed. Instead, the rules are applied to the query at run time, and corrections are done on the fly to address the most pressing problems like inaccurate data.

How to fix data quality issues

Step 1: Evaluate your current data quality

All company stakeholders, from business divisions to IT to the Chief Data Officer, should be aware of the state of the data in the system at the present time. The data management team should inspect the database for mistakes, duplication, and missing records. 

Also, make sure to examine the obtained data for correctness, consistency, and completeness. To comprehend the content and structure of data, use techniques such as data profiling. This stage lays the groundwork for all subsequent data quality operations.

Step 2: Create a data quality plan

Create a data quality plan, including the strategies and procedures for improving and maintaining data quality. It will serve as a blueprint that describes data use cases, data quality requirements for each use case, and data collection, storage, and processing procedures. 

Choose the tools that you’ll use – they might range from internally generated scripts to feature-rich data quality solutions. This is also the time to define how you will deal with problems or inconsistencies that may arise at some point in the future.

Step 3: Perform preliminary data cleanup

In this stage, you will clean, prepare, and rectify any data quality issues found in the data. Data cleansing efforts include removing duplicate entries, filling in missing data, and correcting discrepancies across data sets.

This is done to start the data quality management process from the best possible state of data.

Step 4: Implementation of your data quality plan

Now it’s time to implement the strategic plan and the data quality strategy to enhance the way data is handled throughout your organization. The idea here is to incorporate data quality norms and standards into day-to-day business operations. 

You should educate employees on the new data quality procedures, and this might imply changes to existing procedures to add data quality checks. In an ideal scenario, data quality management becomes a self-correcting, ongoing process.

Step 5: Monitor data quality

Finally, you need to keep track of how things are progressing. Data quality management is a continuous process. To guarantee that requirements are constantly met, organizations must track and analyze data quality on a regular basis. Regular audits, reports, and dashboard evaluations give insight into the consistency of data quality across time.

Important data quality checks

Setting up your quality metrics comes first. Then you’re ready to test, find, and fix. Here are a few examples of common data quality checks (a combined code sketch follows the list):

  • Detect data duplicates or overlaps to ensure uniqueness.
  • Check for necessary fields, null values, and missing values to discover and correct data incompleteness.
  • Check formatting to ensure uniformity.
  • Check validity by evaluating the range of values.
  • Verify data freshness by checking how recent the data is or when it was last updated.
  • Row, column, conformance, and value validation tests are great for testing data integrity.
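
The sketch below strings several of these checks together with plain pandas; the `patients` table, its columns, and the thresholds are hypothetical stand-ins for your own rules.

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id":   [1, 2, 2, 4],                       # duplicate id -> uniqueness issue
    "age":          [34, 51, 51, 230],                  # 230 -> validity issue
    "diagnosis":    ["flu", None, None, "fracture"],    # None -> completeness issue
    "last_updated": pd.to_datetime(["2023-12-20", "2023-12-19", "2023-12-19", "2022-01-05"]),
})

report = {
    "uniqueness":   patients["patient_id"].duplicated().sum(),      # duplicate keys
    "completeness": patients["diagnosis"].isna().sum(),             # missing values
    "validity":     (~patients["age"].between(0, 120)).sum(),       # out-of-range ages
    "freshness":    (patients["last_updated"]
                     < pd.Timestamp("2023-12-21") - pd.Timedelta(days=30)).sum(),
}
print(report)   # counts of violations per check; zero everywhere means a clean pass
```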

Examples of data quality checks differ depending on the vertical. In healthcare, the freshness of patient data for the most recently delivered therapy or diagnostic might be most valuable. Freshness checks for Forex trading, on the other hand, might depend on the time of testing.

Tools to resolve data quality issues

Data quality tools support successful data quality management and governance throughout business operations and decision-making by helping teams discover, analyze, and fix any data quality issues they encounter. 

Data quality tools support a variety of processes, including: 

  • Data cleansing – the process of correcting unknown data types (reformatting), removing redundant entries, and improving subpar data representations. 
  • Data monitoring – the process of monitoring and ensuring that an organization’s data quality is generated, used, and maintained. 
  • Data profiling – used to establish trends and detect irregularities in the data.
  • Data parsing – such tools determine if data adheres to known patterns. 
  • Data matching – it helps to avoid data duplication and has the potential to increase data accuracy. 
  • Data standardization – such tools support the process of transforming data from many sources and formats into a uniform and consistent format. 
  • Data enrichment – the process of adding missing or incomplete data.
  • Data version control – such a tool will help you implement data branching and versioning, working in isolation, time travel, and rollback to previous data versions.

[Figure: how data version control improves data quality through data branching and versioning, isolating your work, rollback, time travel, and hooks]

Conclusion: you can solve data quality issues successfully

Data quality has a significant impact on business growth. As data assets become more diverse in terms of sources and types, organizations need to face the quality challenge head-on.

Like any entropic system, data systems accumulate common data quality issues such as erroneous, redundant, and duplicated data. Poor data quality carries an average yearly financial cost of $15 million. That’s why underestimating data quality can have a catastrophic impact on decision-making and a company’s competitive advantage.

lakeFS is an open source system that lets you test for data quality issues using engineering best practices. If you’re ready to tackle your data quality issues with automated processes, you can launch the Quickstart here.
