Using dbt Tests with Metaplane

dbt tests are suited for catching known issues, by testing binary criteria on critical tables. Metaplane is suited for catching unknown issues, by testing scalar criteria on a wide range of assets. Use each in the right circumstances. Most Metaplane customers use both dbt test and Metaplane in tandem.

January 3, 2023

Head of Data at Metaplane

Co-founder / Data and ML

January 3, 2023
Using dbt Tests with Metaplane

What are dbt tests?

As data and analytics engineers, many of us are familiar with the concept of tests in the dbt framework for data transformation.

These tests allow you to assert boolean outcomes to define quality and they are excellent for ensuring:

  • That an id is unique
  • That a boolean field is never null
  • That an order amount is never above $10000
  • Referential integrity

Where you know what should be the case, and it’s black and white, you should be using dbt tests.

However, where the truth is gray, they are not always the best solution. For example, a scenario where a boolean outcome is expected requires one test, but a scenario that expects a scalar outcome can require many tests.

I firmly believe that data teams need to invest in test suites, but they should be careful about building too much. This is especially true where the test outcomes aren’t so clear and data drift can make the tests fail without there being a real problem. Alert fatigue is real and to be avoided at all costs.

What is Metaplane?

Metaplane is a data observability tool that helps teams to be the first to know of data issues. Specifically, Metaplane learns from historical metadata to identify anomalies in scalar outcomes that might change over time. Examples of scalar outcomes include:

  • What % of the time a foreign key is null
  • How many possible values a categorical field can take
  • The distribution of a numeric value - min/max/avg
  • The increase in row count of an incremental table
  • The time between updates of a table

Modeling scalar outcomes are important for three reasons. First, scalar outcomes often have temporal dependencies such as components of seasonality, growth trends and white noise that affect their value on any given day. Making rules and thresholds to monitor them is difficult and most likely will lead towards alert fatigue.

Furthermore, data is not perfect: a field that is supposed to be unique may have a small number of duplicates that would continually set off alerts when testing for a binary is-unique outcome. By measuring the scalar uniqueness of the column, rather than binary is-unique criteria, we can get a stronger sense of where the data is trending.

Lastly, data is dynamic. An expected criteria today could change tomorrow. For example, if a table was previously refreshed every week, but now is refreshed on a daily cadence, testing for binary is-refreshed-within-week outcome would require shipping a PR. This means that making changes to your tests can be a slower process and it can require coordination with other members of your team.

Modeling scalar outcomes is where data observability tools, like Metaplane, excel, because they use unsupervised learning to make a forecasted confidence interval of the scalar trend. The algorithms adapt for the three components above, and can be retrained automatically to fit to new data. As a result, Metaplane is able to adapt to dynamic, imperfect, temporally dependent data, without requiring effort from your team.

How dbt tests and Metaplane complement each other

Where teams rigidly stick to only using dbt tests, including to test scalar outcomes, they will find these test suites expand to hundreds, if not thousands, of tests. It will become the case that it’s rare for any dbt job to run without multiple alerts, and most of these alerts will be ignored as really they are within an expected tolerance. This expected tolerance is based on the intuition of data engineers and won’t necessarily account for trends in the data. Soon enough, alert fatigue sets in and the test suite is a hindrance rather than a help.

Observability tooling saves the data team from having to write so many tests and avoids experiencing alert fatigue from testing scalar outcomes with hard set rules. Data observability alerts them to a myriad of changes to these scalar outcomes, with little effort to set up and maintain and no need for expansion of their test suite.

Many Metaplane customers, such as Imperfect Foods, Vendr, and Mux, all use dbt tests in tandem with Metaplane, to achieve full test coverage between binary and scalar outcomes and between known and unknown issues.

dbt tests are a powerful tool for verifying the accuracy and consistency of your data. They are particularly useful for testing for a small set of known, static issues, such as referential integrity and duplication in primary keys. These types of tests are easy to set up and automate, and they can help you ensure that your data is correct and consistent.

One potential downside to using dbt tests is that they require a pull request to add or change tests. This means that making changes to your tests can be a slower process and it can require coordination with other members of your team.

In contrast, data observability allows you to monitor and understand the data flow in your system without requiring any changes to your code. This can be useful for identifying potential issues or anomalies in your data, as well as for gaining insights into how your system is performing. For example, you can use data observability to monitor trends in row counts or the uniqueness of data, which may not be detectable using dbt tests.

Overall, using data observability in addition to dbt tests can provide a more comprehensive view of your data and help you ensure that your system is functioning as expected. While dbt tests are effective for testing a small set of known issues, data observability is better suited for testing a large set of unknown, dynamic issues. Many customers of Metaplane use both dbt tests and data observability, to ensure the quality and reliability of their data.

Contents

    Start monitoring your data in minutes.

    Connect your warehouse and start generating a baseline in less than 10 minutes. Start for free, no credit-card required.