Great Expectations and Metaplane
1. Great Expectations is suited for catching known issues by testing binary criteria on critical tables.
2. Metaplane is suited for catching unknown issues by testing scalar criteria on a wide range of assets.
3. Use each in the right circumstances. Some Metaplane customers use both Great Expectations and Metaplane in tandem.
I previously covered how to use dbt tests with a data observability tool like Metaplane.
What is Great Expectations?
“Great Expectations is a shared, open standard for data quality. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling.”
What are expectations?
📖 “Expectations are assertions for data. They are the workhorse abstraction in Great Expectations, covering all kinds of common data issues.
Expectations are declarative, flexible and extensible. They provide a rich vocabulary for data quality.”
By this definition, “expectations” are very similar to dbt tests - they are assertions that can be declared about your data. There is even a dbt_expectations dbt package (partly funded by re_data) which implements the functionality of Great Expectations inside dbt, to avoid having to run tests elsewhere on your data warehouse.
Expectations are declarative rather than imperative: the user declares what they want to be tested, and that declaration is interpreted into one or more tests. The implementation is in SQL (itself a declarative language) and is not written by the user.
So why use Great Expectations or dbt expectations over dbt tests? They allow you to build many tests or complex tests with very little code. With dbt tests, complex tests often require you to build them as custom tests (like this one I made) or import them from packages like dbt_expectations. Great Expectations is an acceleration and expansion of simpler tests like dbt tests.
For example, you can make an expectation like this:
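As a hedged stand-in for an actual Great Expectations call, here is a minimal pure-Python sketch of the binary check behind an expectation such as `expect_column_values_to_be_between` (the helper below is hypothetical, not the Great Expectations API itself):

```python
# Hypothetical helper illustrating the binary, declarative shape of an
# expectation like Great Expectations' expect_column_values_to_be_between.
def expect_values_to_be_between(values, min_value, max_value):
    """Return a GE-style result: overall success plus the offending values."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_values": unexpected}

ages = [23, 45, 31, 130]
result = expect_values_to_be_between(ages, 0, 120)
print(result)  # {'success': False, 'unexpected_values': [130]}
```

The result is strictly binary: the expectation passes or it fails, and the failing values are surfaced alongside the verdict.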
The simplicity of Great Expectations assertions means you can make true-or-false assertions, which are best tested deterministically rather than by an ML algorithm. This is particularly useful when testing for known, explicit, static issues in data frames in a notebook environment or in the outputs of components in a data pipeline.
What is Metaplane?
Metaplane is a data observability tool that helps teams to be the first to know of data issues. Specifically, Metaplane learns from historical metadata to identify anomalies in scalar outcomes that might change over time. Examples of scalar outcomes include:
- What % of the time a foreign key is null
- How many possible values a categorical field can take
- The distribution of a numeric value - min/max/avg
- The increase in row count of an incremental table
- The time between updates of a table
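Each of these outcomes reduces to a single number computed from data or metadata. A sketch of a few of them in plain Python (the rows and field names are made up for illustration):

```python
from datetime import datetime

# Toy rows standing in for a warehouse table (hypothetical schema).
rows = [
    {"customer_id": 1, "plan": "pro", "updated_at": datetime(2023, 5, 1)},
    {"customer_id": None, "plan": "free", "updated_at": datetime(2023, 5, 2)},
    {"customer_id": 3, "plan": "pro", "updated_at": datetime(2023, 5, 3)},
    {"customer_id": 4, "plan": "enterprise", "updated_at": datetime(2023, 5, 4)},
]

# % of the time a foreign key is null
null_pct = 100 * sum(r["customer_id"] is None for r in rows) / len(rows)

# How many values a categorical field takes
plan_cardinality = len({r["plan"] for r in rows})

# Time between the two most recent updates (freshness)
updates = sorted(r["updated_at"] for r in rows)
latest_gap = updates[-1] - updates[-2]

print(null_pct, plan_cardinality, latest_gap)  # 25.0 3 1 day, 0:00:00
```

Unlike a pass/fail expectation, none of these numbers is inherently good or bad; the question is whether today's value is consistent with its history.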
Modeling scalar outcomes is important for three reasons. First, scalar outcomes often have temporal dependencies, such as seasonality, growth trends, and white noise, that affect their value on any given day. Writing static rules and thresholds to monitor them is difficult and will most likely lead to alert fatigue.
Second, the data we want to test does not always have a straightforward code entry point. Sometimes you’re writing an Airflow operator and can insert an expectation. At other times, like when testing data already in a data warehouse, you have to run code on a schedule. Tools such as Metaplane, which come with orchestration built into the service, save you from having to run yet another scheduler.
Lastly, data is dynamic. An expected criterion today could change tomorrow. For example, if a table was previously refreshed every week but is now refreshed on a daily cadence, a binary is-refreshed-within-a-week test would require shipping a PR to update. Changing your tests is therefore a slower process, and it can require coordination with other members of your team.
Modeling scalar outcomes is where data observability tools like Metaplane excel, because they use unsupervised learning to forecast a confidence interval around the scalar’s trend. The algorithms account for the three factors above and can be retrained automatically to fit new data. As a result, Metaplane is able to adapt to dynamic, imperfect, temporally dependent data without requiring effort from your team.
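The shape of this approach can be sketched with a trivial rolling confidence interval. This is not Metaplane's actual algorithm, which accounts for seasonality and trend; the window size and the 3-sigma width below are arbitrary choices for illustration:

```python
import statistics

def anomalous(history, latest, window=14, sigmas=3.0):
    """Flag `latest` if it falls outside mean +/- sigmas * stdev of the
    trailing window. A toy stand-in for a learned confidence interval."""
    recent = history[-window:]
    mean = statistics.mean(recent)
    stdev = statistics.stdev(recent)
    lower, upper = mean - sigmas * stdev, mean + sigmas * stdev
    return not (lower <= latest <= upper)

# Daily row-count increases hovering around 1000.
row_count_deltas = [1000, 1012, 987, 995, 1021, 1003, 990, 1008, 999, 1015,
                    992, 1001, 1007, 996]
print(anomalous(row_count_deltas, 998))  # False: within the interval
print(anomalous(row_count_deltas, 0))    # True: well outside the interval
```

The key property is that no one wrote a threshold: the interval comes from the metric's own history, so it moves as the metric's normal behavior moves.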
How Great Expectations and Metaplane complement each other
Great Expectations is a great tool for testing data frames and data pipelines for known, explicit issues that can be codified in an assertion and can have a clear code entry point. Metaplane is particularly suited for testing data warehouses for unknown, dynamic issues, with test orchestration handled by an external service.
In the absence of data observability tools like Metaplane, Great Expectations has been leaned on, perhaps too much, to deal with complex data trends where a machine learning model would be better suited to cope with the complexity.
Observability tooling saves the data team from the alert fatigue of testing scalar outcomes with hard-coded rules. Data observability alerts the team to a myriad of changes in these scalar outcomes, with little effort to set up and maintain and no expansion of their test suite.
Many Metaplane customers use Great Expectations in tandem with Metaplane, to achieve full test coverage across data pipelines, notebooks, and warehouses, and between known and unknown issues.