
Idan Novogroder

Last updated on June 27, 2024

Decisions based on data can only make a positive impact as long as the data itself is accurate, consistent, and dependable. High data quality is critical, and data quality checks are a key part of handling data at your organization.

This is where dbt comes in. dbt (data build tool) provides a complete framework for executing data quality checks via its testing capabilities.

By implementing data quality testing into your dbt workflow, you create a process that not only protects the trustworthiness of your data but also boosts confidence in the insights and decisions that result from it.

Running these tests regularly helps to identify data quality concerns early on, decreasing the risk of flawed analysis and improper decision-making based on inaccurate or inconsistent data.

What exactly are dbt quality checks, and how do you implement them? Keep reading to find out.

What are dbt Data Quality Checks?

dbt (data build tool) is a popular open-source command-line utility built in Python. dbt enables data teams to build, test, and deploy complicated data transformations and run built-in data quality checks. These tests let you confirm the accuracy and consistency of your data across the data pipeline. With dbt data quality tests, you can ensure that your data is correct before using it for downstream analysis.

In general, tests are assertions defined in the dbt codebase that audit the data on your data platform. The dbt ecosystem expands on the out-of-the-box test logic with extra packages such as dbt-expectations.

When a dataset is generated in the dbt pipeline, the tool carries out an audit and decides whether to build the next dataset based on the test result (pass, fail, warn). If a check fails, you may instruct dbt to issue a warning and continue building downstream models, or specify that the run should terminate and throw an exception.
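
For illustration, here is a minimal sketch of that behavior in a schema.yml entry (the model and column names are hypothetical); setting severity to warn tells dbt to report the failure and keep building, while the default severity of error stops the run:

models:
  - name: orders
    columns:
      - name: discount_code
        tests:
          - not_null:
              config:
                severity: warn  # report the failure but continue building downstream models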


dbt Testing vs. Data Observability

You’ve likely heard of data observability and might wonder how it differs from data testing. Data observability is the continual monitoring of the condition of data that is already in production and the identification of anomalies.

Observability helps uncover live data quality concerns in real time, such as checking if:

  • A column has an extraordinary percentage of NULL values
  • An analytics event has stopped sending data
  • Rolling up a revenue column yields half of the expected value

While testing and observability are both critical components of a data quality plan, concentrating on prevention via testing eliminates most problems before they reach production. It also makes monitoring in production more effective by decreasing the noise you have to deal with when issues arise.

Types of dbt Tests for Data Quality Checks

Generic Tests

Generic tests are built-in modules that ship with the core dbt installation, which includes four of them:

  1. unique
  2. not_null
  3. accepted_values
  4. relationships

Generic tests are relatively straightforward to include in your testing approach. You describe generic tests in a schema.yml file in the project folder.
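
A minimal schema.yml sketch applying all four built-in tests (the model and column names here are hypothetical):

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id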

What if you need more testing capabilities? You can add more tests using external packages, such as dbt-expectations, which is inspired by Great Expectations.

Custom Generic Tests

dbt allows users to write their own custom tests to satisfy specific use cases. However, before building your own tests, check whether they are already available in open-source packages. If you find the test you need, you can easily import it.

To import a package—in our example, calogica/dbt_expectations—create a packages.yml file in the project folder and include the following:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.8.0", "<0.9.0"]

To install this package, use the following command:
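
dbt deps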

If the package was properly installed, you’re ready to import tests from it. To use these tests, just add them to the schema YAML file you prepared for the core dbt generic tests.

Singular Tests

Singular tests are useful when you need to build your own custom tests. They are SQL definitions that relate to a specific model, tailored to your own use cases. If the test is successful, it produces an empty result set.
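
For example, a minimal singular test might look like the following (the file path, model, and column names are hypothetical). It selects the rows that violate the assertion, so an empty result set means the test passes:

-- tests/assert_no_negative_order_totals.sql
-- Fails if any order has a negative total
select
    order_id,
    total_amount
from {{ ref('orders') }}
where total_amount < 0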

dbt also allows users to convert their singular tests into generic tests by wrapping them in macros. Macros are comparable to functions in programming languages such as Python and Java; by writing tests as macros in the Jinja templating language, dbt avoids repeating the same code across different models.
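
As a rough sketch, the singular test above could be rewritten as a custom generic test (the macro and file names are hypothetical) and then attached to any column in schema.yml like a built-in test:

-- tests/generic/assert_non_negative.sql
{% test assert_non_negative(model, column_name) %}
select *
from {{ model }}
where {{ column_name }} < 0
{% endtest %}

Once defined, it can be referenced in YAML the same way as unique or not_null, for example as - assert_non_negative under a column's tests.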

Benefits of Using dbt Data Quality Checks

Using dbt data quality tests comes with various advantages, including:

  1. Increased trust in data correctness – Running automated checks on your data allows data consumers to be more confident in its quality and completeness.
  2. Early identification of data quality concerns – By using dbt data quality checks, you can identify errors early in the data pipeline before they cause difficulties downstream.
  3. Faster resolution of data quality issues – By using automated testing, you can rapidly detect and handle data quality issues, freeing up your team’s time for more strategic activities.

How to Implement dbt Data Quality Checks

Define Metrics

The first step is to identify the metrics that are important to you for each model – for example, completeness, accuracy, and consistency. These metrics will assist you in ensuring that your data satisfies all the important standards.

Once you’ve discovered your metrics, you can validate them using SQL tests in dbt.

Identify the Data That Needs to Be Tested

Start by identifying the tables or views that you’d like to evaluate for data quality.

Define Your Testing Criteria

Using SQL queries, specify the tests you wish to perform on your data. These might include checks for missing values, data types, or completeness.

Set Up Your dbt Project

Configure your dbt project to perform the data quality checks you’ve established.

Run Your Tests

Once your checks are in place, run them manually or on a schedule to confirm that your data fulfills your quality requirements.
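
In practice, this usually means invoking dbt's test commands locally or from your scheduler (the model name below is hypothetical):

dbt test                   # run every test defined in the project
dbt test --select orders   # run only the tests attached to one model
dbt build                  # run and test models together, in dependency order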

Use dbt Extensions for Enhanced Data Quality Checks

dbt offers extensions in the form of packages that you can use to expand the available pre-built tests. One such package is ‘dbt-expectations,’ inspired by the open-source Great Expectations (described above). 

Using dbt-expectations, you can add a range of new tests to your data models, including data distribution, row counts, and more.
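
For instance, a hedged sketch of two dbt-expectations tests in a schema.yml (the model and column names are hypothetical):

models:
  - name: orders
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: total_amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0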

Implementing tests for standard data quality checks, bespoke business rules, and packages with expanded testing capabilities is bound to dramatically increase the quality and trustworthiness of your data.

7 dbt Data Testing Best Practices

1. Shift Data Quality to the Left with dbt Testing

Testing shifts data quality to the left in the development pipeline. dbt testing lets you evaluate your data transformations as part of the development cycle, ensuring that mistakes or anomalies are recognized before they propagate downstream.

Testing data pipelines proactively not only increases their reliability but also streamlines the debugging process, saving time and money in the long run.

As an added advantage, incorporating testing into your development workflow fosters a culture of better data quality standards inside and beyond your data team.

2. Add Essential dbt Tests

As in traditional software development testing, it's common in data testing to want to start with the most complicated cases and tests. Resist that urge.

Begin with the most basic tests and progress to more complicated tests dealing with business logic. Verify the data’s structure and assumptions. If the data’s underlying assumptions are incorrect, all sophisticated tests constructed on top of these assumptions will be rendered ineffective.

3. Add Unit Tests for Complex SQL Logic

dbt unit tests check model logic using predefined inputs and outputs. Unlike a “classic” dbt data test, which runs an assertion against a model generated using database data, unit tests require the developer to construct “test cases” that include input data and anticipated results. 

This enables more granular testing and separates the evaluation of source data from the validation of the model code itself.
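
A minimal sketch of a dbt unit test, which requires dbt 1.8 or later (the model and column names are hypothetical):

unit_tests:
  - name: test_lifetime_value_sums_order_totals
    model: customer_totals
    given:
      - input: ref('orders')
        rows:
          - {customer_id: 1, total_amount: 10}
          - {customer_id: 1, total_amount: 20}
    expect:
      rows:
        - {customer_id: 1, lifetime_value: 30}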

4. Implement Data Diffing to Achieve 100% Test Coverage

Data diff is a value-based and statistical comparison of datasets. When creating and deploying dbt models, you can use data diff to compare data in production (the ground truth) with data in development or staging. This lets the developer and code reviewer properly analyze the impact of changes to the business logic.

Implemented in CI and combined with a column-level lineage that extends to BI integrations, data diff protects against unexpected changes, lowering the time required to review and test the code and increasing overall team velocity.
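
Conceptually, a value-based diff amounts to comparing the same table across environments, along the lines of the query below (the schema and table names are hypothetical, exact set-operator syntax varies by warehouse, and dedicated data diff tools automate and summarize this):

-- Rows present in production but missing or different in the development build
select * from analytics_prod.orders
except
select * from analytics_dev.orders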

5. Apply Each Test Type Smartly

Knowing when each testing methodology works best, as well as its strengths and limitations, is critical for designing an effective testing plan that uses all three forms of dbt testing to optimize each method’s impact and ROI.

Let us compare each testing method regarding implementation effort, coverage, specificity, and optimal application.

Implementation effort
  • dbt data tests: Medium – test cases must be written
  • dbt unit tests: High – needs test cases and input/output dataset curation
  • Data diff: Low – requires no manual test setup

Expected code coverage
  • dbt data tests: Medium – it is quite simple to implement generic tests that address data quality fundamentals
  • dbt unit tests: Low – because of the effort necessary to construct them, unit tests are frequently applied only to the most complicated or business-critical logic
  • Data diff: High – automatically detects changes across your data

Result clarity (specificity)
  • dbt data tests: Medium – highly responsive to both code and data changes
  • dbt unit tests: High – focused on testing code with fixed input
  • Data diff: Medium – requires the user to interpret acceptable data disparities

Use cases
  • dbt data tests: testing generic principles such as not-nullness, uniqueness, value ranges, and referential integrity
  • dbt unit tests: testing sophisticated business logic
  • Data diff: understanding how a code update may affect your data and downstream assets

Scalability for data volume
  • dbt data tests: Low – can slow down the project if the tested datasets are large
  • dbt unit tests: Great – the user-defined test rows ensure independence from data volume
  • Data diff: Great – scales nicely with filtering and sampling

Given the relative benefits of each testing approach, the following simple criteria aid in developing an effective data quality strategy:

  • Write unit tests for a tiny portion of critical/complex business logic; they are more difficult to build but provide the most precise form of testing.
  • Write data tests to ensure critical quality checks, such as uniqueness, non-nullity, referential integrity, and other model assumptions. Data tests are straightforward and can be incorporated into simple testing scenarios.
  • Use data diff for most “unknown unknowns” that contribute to data quality concerns. Data diff lets data teams identify the long tail of unanticipated quality issues and regressions that typical data or unit tests would miss. It emphasizes how the data changes when the code is adjusted.

6. Run Tests in CI to Provide Guardrails

Continuous Integration (CI) is a software development method in which developers often integrate code changes into a common repository. Each suggested integration (pull/merge request) can then be validated using an automated build and tests. 

In data and analytics engineering, continuous integration increases data quality and team velocity while also reducing the time required to validate and release dbt code modifications.

CI is an automated process that performs all tests and validations for each proposed code change before merging and deploying it to production.

While dbt data and unit tests are examples of validation techniques that can be used to validate code changes, having those tests defined in the dbt project without CI means that each developer must manually execute all required tests. Slips are unavoidable once a dbt project grows beyond a few models and contributors, and they can have major negative consequences.

Running your dbt build (which will execute your dbt tests) and data diffs throughout your CI process guarantees that every PR receives the same testing, resulting in a more manageable system for code changes.
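
As a rough sketch, a CI job might look like the following, assuming GitHub Actions, a Snowflake adapter, and a "ci" target defined in profiles.yml (all of these are assumptions to adapt to your stack):

# .github/workflows/dbt-ci.yml
name: dbt-ci
on: pull_request
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake
      - run: dbt deps
      # dbt build runs models and their tests; any test failure fails the PR check
      - run: dbt build --target ci
        env:
          DBT_PROFILES_DIR: .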

7. Never Deploy With Failing Tests

The team should always ensure that tests pass and never deploy with failing tests. That is easier said than done: in a big dbt project with hundreds or thousands of models, the likelihood of some test failing at any given time is considerable.

However, if you proceed with merging despite a failing test because “we all know this one test is noisy, so we have just been ignoring it,” you (and your team) will be on a slippery slope of allowing more and more tests to fail, eventually leading to tests not being taken seriously and significant data quality issues slipping through.

How to Avoid Broken dbt Tests

When a data or unit test fails, you have the following choices:

  • Investigate the test – check whether the test has failed due to a change in the underlying data. If it’s a dbt unit test, check if you made a code change that violates earlier assumptions about how the code should function
  • Ask your team – Check with the person who developed the original code and test

Consider making changes based on the test condition to improve the code or test.

If the anticipated outcome has changed, modify the test. Remove tests that don't provide incremental benefit.

How to Fix Data Quality Issues With dbt and lakeFS

When it comes to handling multiple environments, the dbt documentation is straightforward: you can set up multiple targets inside a profile to generate distinct output locations for models, usually called “dev” and “prod.”

It’s recommended to employ several schemas within a single data warehouse to keep your environments independent.

We’ve seen that once a data organization reaches a certain size, this strategy becomes less practical. A team of a hundred analysts and engineers cannot each generate their own copy of a petabyte-sized app_clickstream_fact_table on which to run tests.

At the opposite end of the spectrum, having a single development environment to accommodate a large number of users would always result in competing modifications that must be coordinated. Even with a single development environment, it's difficult to keep it up to date with the most recent data from production, a prerequisite if you want to use unit and integration tests to detect possible issues before logic changes are deployed in production.

What you need is a system that lets you produce logical copies of production data without incurring the costs associated with data replication. This capability opens the door to creating fully replicated development environments in which you can construct models and run dbt tests before exposing the results to any downstream consumers.

This is where lakeFS can help.

By incorporating lakeFS into the design, we can add the following abstractions and achieve the aforementioned functionality:

  • Branches – creating a new branch from production gives you an exact logical copy of the data without physically duplicating it (see the sketch after this list).
  • Pre-merge hooks – to link the addition of fresh data into production with the passing of data validation tests.
  • Protected branches – to prohibit users from writing straight into a branch, mandating the usage of pre-merge hooks, as previously stated.
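
For example, creating a zero-copy development branch with the lakeFS CLI looks roughly like this (the repository and branch names are hypothetical):

lakectl branch create lakefs://example-repo/dbt-dev --source lakefs://example-repo/main

dbt can then read and write under that branch's prefix, and merging back to main happens only once the pre-merge hooks report that the dbt tests have passed.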

Conclusion

Data quality is critical in every data-driven enterprise. Using dbt data quality tests, data teams can ensure data correctness and dependability across the data processing pipeline. By combining various tools and technologies like Great Expectations or Airflow, you can build a data architecture that assures data quality and dependability. If you don’t already use data quality checks in your data processing pipeline, start now to build stakeholder trust in your data.

Take a look at this guide to learn how to implement dbt and lakeFS in development and staging environments for dbt-managed data pipelines: dbt Tests – Create Staging Environments for Flawless Data CI/CD
