Data Quality Dimensions: The Key to Trustworthy Data
According to Gartner, poor data quality costs enterprises an average of $12.9 million every year. Fortunately, teams can use quality assurance approaches that reduce the impact of faulty data. Data quality dimensions are critical for any data team because they are the best indicator of the correctness of data-driven decisions made in the organization on a daily basis.
When data quality becomes a concern, customers may lose trust in data and refrain from using it to drive their decision-making processes.
Maintaining data quality is more than just an issue for day-to-day business operations; as businesses incorporate more artificial intelligence (AI) and automation technologies into their workflows, high-quality data will be vital for the effective adoption of these tools. The old phrase “garbage in, garbage out” applies equally to machine learning and generative AI solutions.
What data quality dimensions can data teams follow during data quality assessment? Keep reading to dive into data quality dimensions to ensure robust data quality monitoring.
What is a data quality dimension?
Data quality dimensions are data features that can be evaluated or analyzed against a set of criteria to determine data quality. Measuring data quality dimensions helps you identify data problems and determine whether your data is fit for its intended purpose.
Even among data quality specialists, however, the essential data quality dimensions are not uniformly agreed upon. This state of affairs has caused some consternation within the data quality community, and it is even more perplexing for individuals new to the subject and, more crucially, for business stakeholders.
Data quality has grown more important as data processing has become more tightly linked to company processes and firms increasingly employ data analytics and ML models to drive business decisions.
Data quality management is an important part of the overall data lifecycle management process (also called the master data management process), and efforts to improve data quality are often linked to data governance initiatives that ensure data is formatted and used consistently across an organization.
How to measure data quality
Before asking about “how,” it’s important to consider “when.” When should you test data quality, and how do data quality dimensions come into play?
Needless to say, data quality testing needs to take place throughout the entire data lifecycle – from ingestion to transformations, testing, deployment, monitoring, and debugging.
Data quality testing during development
It’s a good practice to test new data transformations for new data sources and business entities throughout the development cycle to ensure high data quality.
It’s smart to test the quality of your source data from the start. This is where it pays to carry out the following tests:
- Primary key uniqueness and non-nullness
- Column values that satisfy basic assumptions
- Rows with duplicates
Consider using source freshness checks to ensure that source data is being updated on a frequent basis by an ETL tool.
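As an illustration, these source-level checks can be expressed as plain SQL assertions. This is a minimal sketch using an in-memory SQLite database; the `customers` table and its columns are hypothetical:

```python
import sqlite3

# Hypothetical source table, populated with sample rows for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com"), (2, "b@example.com")])

# Primary key non-nullness
null_ids = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE id IS NULL").fetchone()[0]

# Primary key uniqueness
dupe_ids = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM customers "
    "GROUP BY id HAVING COUNT(*) > 1)").fetchone()[0]

# Fully duplicated rows
dupe_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id, email FROM customers "
    "GROUP BY id, email HAVING COUNT(*) > 1)").fetchone()[0]

print(null_ids, dupe_ids, dupe_rows)  # prints: 0 1 1
```

Each count that should be zero but isn’t points at a concrete quality problem to fix before the data moves downstream.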
Data quality testing during transformation
A lot of mistakes can happen when you clean, aggregate, combine, and apply business logic to raw data – all the while weaving in additional data manipulations and generating new metrics and dimensions with SQL and Python.
This is the time to measure data quality and check if:
- The primary keys are unique and non-null.
- The row counts are correct.
- Joins don’t create duplicate rows.
- Interactions between upstream and downstream dependencies match your expectations.
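A common way to catch join-induced duplication is to compare row counts before and after the join. Here is a minimal sketch; the `orders` and `customers` tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 11)])
# Customer 10 appears twice -- a classic source of join fan-out
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(10, "EU"), (10, "US"), (11, "EU")])

before = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
after = conn.execute(
    "SELECT COUNT(*) FROM orders o "
    "JOIN customers c ON o.customer_id = c.customer_id").fetchone()[0]

# A join that should preserve the left side must not change the row count
fan_out = after > before
print(before, after, fan_out)  # prints: 2 3 True
```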
Data quality testing during pull requests
It’s a good idea to do data quality testing during pull requests before implementing data transformation changes into your analytics code base. Contextualized test success/failure results aid in code review and act as a final check before integrating the code into production.
In practice, you’ll be testing a GitHub pull request with a snapshot of the data transformation code.
You can invite additional data team members to contribute if you use a Git-based data transformation tool. Others may help you resolve mistakes and establish a high-quality analytics foundation by reviewing your code updates.
Make sure that no new data models or transformation code enter your code base without first being reviewed and tested against your standards by your team.
Data quality testing in production
Once your data transformations and tests have been included in your main production branch, it’s critical that you run them on a frequent basis to ensure excellent data quality.
That’s because many things might happen to your data model. For example, a software engineer may implement a new feature that modifies your source data, or a business user may add a new field to the ERP system, causing the business logic of your data transformation to fail.
Your ETL pipeline might wind up dumping duplicate or missing data into your warehouse, for example.
This is when automated testing comes in handy. It enables you to be the first to discover when anything unusual occurs in your business or data. Airflow, automation servers such as GitLab CI/CD or CodeBuild, and cron job scheduling are all typical techniques for running data tests in production.
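Whichever scheduler you use, a production data test job usually boils down to running a set of checks and signaling failure through the exit code. A simplified sketch, in which the check functions and sample batch are made up for illustration:

```python
def check_not_empty(rows):
    """The batch should never arrive empty."""
    return len(rows) > 0

def check_no_null_keys(rows):
    """Every row should carry a non-null primary key."""
    return all(r.get("id") is not None for r in rows)

# Hypothetical batch pulled from the warehouse
batch = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 12.0}]

checks = {"not_empty": check_not_empty, "no_null_keys": check_no_null_keys}
failures = [name for name, check in checks.items() if not check(batch)]

for name in failures:
    print(f"FAILED: {name}")

# A non-zero exit code lets cron, Airflow, or your CI server surface the failure
exit_code = 1 if failures else 0
print("exit code:", exit_code)
```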
6 dimensions of data quality you need to use
| Data quality dimension | Description |
| --- | --- |
| Timeliness | Data’s readiness within a certain time frame. |
| Completeness | The amount of usable or complete data, representative of a typical data sample. |
| Accuracy | Accuracy of the data values based on the agreed-upon source of truth. |
| Validity | How much data conforms to acceptable formats for any business rules. |
| Consistency | Compares data records from two different datasets. |
| Uniqueness | Tracks the volume of duplicate data in a dataset. |
How do you rate the quality of your data? Use these six data quality dimensions:
1. Completeness

Data is termed “complete” when it meets comprehensiveness criteria. Imagine that you ask a customer to provide their name. You may make the middle name optional, but the data is complete as long as you have the first and last name.
There are things you can do to improve this aspect of data quality. You should determine whether all of the necessary information is available and whether any pieces are missing.
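A completeness check along these lines might look like the following sketch, where the required fields and sample records are hypothetical:

```python
REQUIRED_FIELDS = ["first_name", "last_name"]  # middle_name stays optional

def is_complete(record, required=REQUIRED_FIELDS):
    """A record is complete when every required field has a non-empty value."""
    return all(record.get(f) not in (None, "") for f in required)

customers = [
    {"first_name": "Dana", "last_name": "Lee", "middle_name": None},
    {"first_name": "Sam", "last_name": ""},
]

# Share of complete records in the sample
completeness = sum(is_complete(c) for c in customers) / len(customers)
print(completeness)  # prints: 0.5
```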
2. Accuracy

Accuracy is a key data quality dimension because it indicates how well information represents the real-world event or object depicted. For example, if a consumer is 32 years of age but the system thinks they are 34, the system’s information is inaccurate.
What can you do to enhance your accuracy? Consider if the data accurately depicts the situation. Is there any inaccurate data that should be corrected?
3. Consistency

At many businesses, the same data may be maintained in several locations. If that information matches, it’s said to be “consistent.” Consistency is a data quality dimension that plays a key role especially in environments with multiple data sources.
For example, if your human resources information systems indicate that an employee no longer works there but your payroll system indicates that they’re still receiving a paycheck, your data is inconsistent.
To tackle inconsistency concerns, examine your data sets to check if they are the same in every occurrence. Is there any evidence that the data contradicts itself?
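Simple set arithmetic is often enough to surface this kind of cross-system inconsistency. A sketch, assuming both systems can export their active employee IDs (the IDs below are made up):

```python
# Hypothetical extracts from two systems that should agree
hr_active = {"emp_001", "emp_002"}
payroll_active = {"emp_001", "emp_002", "emp_003"}

# Employees paid by payroll but marked inactive in HR -- an inconsistency
inconsistent = payroll_active - hr_active
print(sorted(inconsistent))  # prints: ['emp_003']
```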
4. Validity

Data validity is an attribute that shows the degree to which data meets business standards or conforms to a specified format. Birthdays are a common example: many systems require you to input your birthdate in a specified format. If you fail to do that, the data will be invalid.
To achieve this data quality dimension, you must ensure that all of your data adheres to a certain format or set of business standards.
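A minimal validity check for the birthdate example might look like this sketch, where the `%Y-%m-%d` format is an assumed business rule:

```python
from datetime import datetime

def is_valid_birthdate(value, fmt="%Y-%m-%d"):
    """Valid when the value parses in the agreed format and is not in the future."""
    try:
        return datetime.strptime(value, fmt) <= datetime.now()
    except (ValueError, TypeError):
        return False

print(is_valid_birthdate("1990-04-21"))  # prints: True
print(is_valid_birthdate("21/04/1990"))  # prints: False (wrong format)
```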
5. Uniqueness

“Unique” data is data that appears just once in a database. Data duplication is a common occurrence, as we all know. For example, it’s possible that “Daniel A. Lawson” and “Dan A. Lawson” are the same person – but in your database, they’ll be treated as unique entries.
Meeting this data quality dimension entails examining your data to ensure that it is not duplicated.
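Exact-match deduplication won’t catch “Daniel” versus “Dan,” so near-duplicate detection often relies on string similarity. A rough sketch using Python’s standard-library `difflib`; the 0.8 threshold is an arbitrary assumption you would tune for your data:

```python
from difflib import SequenceMatcher

def likely_duplicates(names, threshold=0.8):
    """Flag pairs of names whose similarity ratio meets the threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

names = ["Daniel A. Lawson", "Dan A. Lawson", "Maria Chen"]
print(likely_duplicates(names))  # prints: [('Daniel A. Lawson', 'Dan A. Lawson')]
```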
6. Timeliness

Is your data readily available when it is required? This data quality dimension is known as “timeliness.” Assume you want financial data every quarter; if the data is available when it should be, it can be considered timely.
The timeliness component of data quality relates to specific user expectations. If your data isn’t available when you need it, it does not meet that dimension.
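A freshness check for quarterly data might be sketched as follows, where the 92-day maximum age stands in for the user’s assumed expectation:

```python
from datetime import datetime, timedelta, timezone

def is_timely(last_updated, max_age=timedelta(days=92)):
    """Quarterly data is timely if it landed within roughly one quarter."""
    return datetime.now(timezone.utc) - last_updated <= max_age

fresh = datetime.now(timezone.utc) - timedelta(days=10)
stale = datetime.now(timezone.utc) - timedelta(days=200)
print(is_timely(fresh), is_timely(stale))  # prints: True False
```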
Ensuring data quality and integrity
| When should we test data quality? | Why? |
| --- | --- |
| Data collection | Data from operational systems may be incorrect owing to human error or late due to a fault in the system that stores or saves it to the data lake. |
| Storage | Consistency issues become the norm when data is managed in silos and storage is spread. |
| Processing | We must ensure that the data satisfies our expectations after preprocessing. |
| Evaluation | This is where you obtain the true value from the data that drives decision making. |
| Application | Quality monitoring guarantees data quality remains good while your analytic models are in production. |
When data is sent to teams through data analytics projects like business intelligence dashboards, machine learning, and generative AI-based apps, data quality management helps to build trust and confidence among customers.
A business may suffer severe consequences if consumers make decisions based on poor quality data, whether it’s managers running a business, drivers using self-driving functions, or doctors using machine learning to help them diagnose or treat patients.
A basic data quality evaluation strategy won’t be enough for your company. To correctly test and maintain data quality, you will most likely require numerous tools and procedures working in tandem.
How do you begin assessing, monitoring, and evaluating data quality dimensions? Let’s walk through the stages of the data lifecycle and the benefits teams gain from guaranteeing data quality at each:
1. Data collection
Data ingestion, collection, and data input basically denote the same thing. At the beginning of the data lifecycle, you acquire consumer data from different internal and external sources.
This is the most vulnerable point in terms of quality since, in most circumstances, teams don’t own the sources of this data. You wouldn’t know if something went wrong during the gathering procedure before the data entered the data lake – unless, of course, you confirm the data quality.
Data from operational systems, for example, may be incorrect owing to human error or late due to a fault in the system that stores or saves it to the data lake. As a result, it’s vital to evaluate data quality and ensure that data quality issues such as erroneous or inconsistent data don’t cascade into your ETLs.
2. Storage

The following step is data storage. Many businesses are already falling into the trap of spreading data among numerous teams and platforms, a phenomenon known as data silos.
Consistency difficulties become the norm when data is managed in silos and storage is spread. That’s why measuring data quality dimensions here is a smart move.
Once we have moved data to a single source of truth, we must test the consistency of the data from the various sources and ensure that any consistency concerns are resolved before moving on to the next phases of the lifecycle.
3. Processing

The next stage is to prepare your data for usage by curating, deduplicating, and performing any other preprocessing required by the application. There’s space for data quality assessment at this stage too.
Since such preprocessing operations are intended to improve data quality and provide data sets suitable for analysis, we anticipate outcomes in terms of both data and metadata. We must ensure that the data satisfies our expectations after preprocessing.
An ideal practice would be to validate each stage of data preparation – in some businesses, this may be tens of processes.
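Validating each preparation step can be as simple as asserting an expectation after every transformation. A toy sketch in which the cleaning steps and expectations are hypothetical:

```python
def dedupe(rows):
    """Collapse fully identical rows."""
    return list({tuple(sorted(r.items())): r for r in rows}.values())

def drop_incomplete(rows):
    """Remove rows with any null value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def expect(condition, message):
    """Fail the pipeline loudly when an expectation is not met."""
    if not condition:
        raise AssertionError(message)

rows = [{"id": 1, "v": 5}, {"id": 1, "v": 5}, {"id": 2, "v": None}]

rows = dedupe(rows)
expect(len(rows) == len({r["id"] for r in rows}), "ids must be unique after dedupe")

rows = drop_incomplete(rows)
expect(all(r["v"] is not None for r in rows), "no nulls after cleaning")

print(len(rows))  # prints: 1
```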
4. Evaluation

At this stage, some of the techniques available include machine learning, statistical modeling, artificial intelligence, data mining, and algorithms.
No matter which vertical you’re in or what sort of data you analyze, this is where you obtain the true value from the data that drives decision making, improves business outcomes, and delivers value to data consumers.
We design and execute data pipelines at this point, and when we develop those pipelines for machine learning or business intelligence purposes, we must be able to assess the quality of data and models throughout the development or improvement phases. This is where we need to assess data using data quality dimensions.
5. Application

Data validation, exchange, and use occur during the deployment stage. Data validation is the process of confirming the accuracy, structure, and integrity of your data – prepare for disaster if you leave it until the end.
However, if you’ve completed data quality validation at all phases of the data lifecycle, you must still include those tests here. We’re not only talking about before deployment into production but also afterwards – as a type of monitoring to guarantee data quality remains good while your analytic models are in production.
Metadata testing for validation
Metadata is data that describes data rather than data itself.
For instance, if the data is a table, as is frequently the case in analytics, the metadata may include the schema, such as the number of columns and the name and type of variable in each column. If the data is in a file, the metadata may include the file format and other descriptive features such as version, configuration, and compression method.
The definition of metadata testing is relatively straightforward:
There is an expectation for each value of metadata that is generated from the organization’s best practices and the standards it must follow.
This form of testing is quite similar to unit testing a piece of code if you’re a software developer. As with unit test coverage, creating all of those tests may take some time, but achieving high test coverage is both doable and recommended.
It’s also key to keep the tests running whenever the metadata changes, because data quality expectations frequently fall out of sync with the evolving data.
While we’re accustomed to upgrading unit tests when we modify the code, we must be prepared to devote the same time and care to maintaining metadata validation as our schemas expand. Metadata testing needs to be part of our overall data quality assessment framework.
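Such a metadata test can be sketched by comparing an expected schema against what the database reports. Here’s an illustration using SQLite’s `PRAGMA table_info`; the `users` table and expected schema are hypothetical:

```python
import sqlite3

# The schema we expect, derived from the organization's standards
EXPECTED_SCHEMA = {"id": "INTEGER", "email": "TEXT", "created_at": "TEXT"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, created_at TEXT)")

# PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column
actual = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(users)")}

missing = set(EXPECTED_SCHEMA) - set(actual)
mismatched = {c for c in EXPECTED_SCHEMA
              if c in actual and actual[c] != EXPECTED_SCHEMA[c]}

print(missing, mismatched)  # prints: set() set()
```

As with unit tests, any non-empty result here should fail the run before the schema drift reaches consumers.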
Data quality dimensions are indispensable for data teams
The quality of the data you contribute to your data lake will influence its reliability, and poor data quality quickly becomes a serious risk to business operations. High-quality data and services start at the ingestion process, which calls for continuous testing of freshly imported data to verify that it meets data quality requirements.
Even if you build your data lake on object storage – which brings advantages in scalability and throughput – adhering to best practices and ensuring great data quality (across all data quality dimensions) remains a challenge.
How do you ensure data quality in this situation? The only hope is to include an automation solution in your data quality tools.
Continuous integration and continuous data deployment are automated procedures that rely on the ability to detect and prevent data errors from reaching the production environment. You may quickly achieve high data quality by building this feature with a range of open-source alternatives.
One of them is the open-source data versioning tool lakeFS. It offers zero-copy isolation along with pre-commit and pre-merge hooks to aid in the automation process. As a result, lakeFS provides teams with a way to enforce data quality in accordance with the best practices outlined above.