Edgar Codd, inventor of the relational model for databases

Edgar Codd, whose work at IBM in the 1960s and 1970s laid the foundation for modern databases, did not believe databases were the total solution for all data problems, but that “numerous complementary technologies” were needed to unlock the promise of improved productivity.


The State of Data Quality Monitoring in 2023

Where there is data, there is the risk of bad data. And bad data is expensive. As a result, data people today often have more tools in their stack for ingesting, storing, transforming, and visualizing data than friends in their bubble.

This overview focuses specifically on data observability tools by describing where they started, how they co-evolved with the modern data stack, and where they might be going.

✊

Who are data people? Their job title might be data engineer, analytics engineer, data analyst, data scientist, or not include the word "data" at all. When in doubt, just ask if SQL column names should have periods in them. If you get a strong response, you're probably talking to a data person.

Quality problems

The problems that data observability tools solve are simple to state:

Time: Debugging in the dark costs time and peace of mind.
Trust: Trust in data, as in all things, is easy to lose and hard to regain.
Cost: The cost of bad data is the cost of bad decisions.

But before getting into how today’s ecosystem addresses these problems, it’s worth exploring how we got here.

The Big Bang

Market hype can make it seem like databases did not exist before Snowflake, Redshift, and BigQuery. But companies ran for decades (and continue to run) on mainframes from the 1960s, and later on the main combatants in the 1990s database wars: Oracle Database, IBM DB2, and Microsoft SQL Server. In this ecosystem, data teams used all-in-one or vendor-specific database integration tools to manage data quality:

Informatica Data Quality - Informatica provides a suite of tools for Doing All The Things with databases, one of which is a Data Quality solution that provides "end-to-end support for growing data quality needs across users and data types with AI-driven automation."
Talend Data Quality - Like Informatica, Talend provides a Data Quality tool to "profile, clean, and standardize data across your systems."
Oracle Data Quality - Oracle's offering "provides a comprehensive data quality management environment, used to understand, improve, protect and govern data quality."
Microsoft SQL Server Data Quality Services - Microsoft's offering "enables you to build a knowledge base and use it to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and de-duplication of your data."
IBM InfoSphere Information Server for Data Quality - IBM's offering "enables you to cleanse data and monitor data quality on an ongoing basis, helping to turn your data into trusted information."

Until the arrival of cloud databases like RDS in 2009 and data warehouses like Redshift in 2012, “data” work within companies was typically done by IT teams using on-premises enterprise software. Transformation was performed either before the data arrived in the database or by a downstream business intelligence (BI) application.

🤖

All warehouses are databases, but not all databases are warehouses. Unlike transactional databases optimized for reading and writing entries quickly and reliably, warehouses are ideal for running analytic queries like getting the average revenue of a customer segment. The big three warehouses are Snowflake, Amazon Redshift, and Google BigQuery. Data lakes warrant an article of their own, as does their convergence with warehouses.

Warehouses took their name from the warehouse architecture pioneered by Ralph Kimball and Bill Inmon in the 1990s, which centralized business data in a single source of truth instead of separate systems. 2010s technology fit the 1990s architecture perfectly.

Are spreadsheets cool again?

Like the color black, spreadsheets will always be in, whether or not they’re the hot new item. In the early 2010s when the “big data” buzzword was taking off, many analysts who bugged their database admins for extracts were confronted with the realization that their data was not clean. Enter the universal API: tabular data.

Trifacta - Founded in 2012 based on Stanford research, Trifacta is a tool for interactively transforming data that helps you by automatically inferring relevant transformations. Trifacta originally focused on the enterprise, but has since launched a freemium version.
Note that if you prefer using AWS tools, AWS released the Data Wrangler tool, which looks to replicate some of Trifacta’s functionality.
OpenRefine (GitHub) - An open source desktop application released in 2010, OpenRefine helps users transform data by providing functionality like reconciliation, faceting, and clustering. Activity on the GitHub repository seems to have slowed.

There will always be a place for an interactive tool for cleaning the last mile of data. But as a new pattern called the modern data stack takes off, there are new upstream opportunities to make sure that the data is ingested properly, transformed into usable forms, and validated before the last mile.

🛣️

The last mile of data refers to how the data in a warehouse is used. Internal use cases include powering reporting dashboards and operational tools like Salesforce. External use cases include triggering in-product experiences, training machine learning models, and sharing data with partners or customers.

Enter the modern data stack

By the mid-2010s, the cloud data warehouse war between Google BigQuery, Amazon Redshift, and Snowflake was in full force. As data warehouses emerged at the top of the food chain, an ecosystem of tools co-evolved alongside them, including easy ways to extract and load data from sources into a warehouse, transform that data, then make it available for consumption by end users.

💡

The modern data stack is a cutting-edge setup for centralizing and using data within a company. Tools move data from sources like Salesforce into data warehouses like Snowflake that make it very cheap to store gobs of data and easy to analyze it quickly. As a result, data teams have the leverage to do more with less.

The TLDR? Because data storage is dirt cheap, let’s put it all into one place and transform it later.

With the adoption of the modern data stack, more and more data is centralized in one place and then used for critical applications like powering operational analytics, training machine learning models, and driving in-product experiences. And, importantly, we can keep changing the data even after it’s in the warehouse, lending flexibility to data that was once rigid.
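As a rough illustration of the "load first, transform later" idea, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The raw_payments and revenue_by_customer tables, their columns, and the sample rows are hypothetical names and values chosen for this example, not part of any particular tool.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (an assumption
# made so the example is self-contained).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows as-is, even with untyped text amounts.
raw_rows = [
    ("acme", "1250"),
    ("acme", "750"),
    ("globex", "3000"),
]
conn.execute("CREATE TABLE raw_payments (customer TEXT, amount_cents TEXT)")
conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", raw_rows)

# Transform: build a cleaned, aggregated table later, inside the warehouse.
conn.execute("""
    CREATE TABLE revenue_by_customer AS
    SELECT customer,
           SUM(CAST(amount_cents AS INTEGER)) / 100.0 AS revenue_dollars
    FROM raw_payments
    GROUP BY customer
""")

revenue = conn.execute(
    "SELECT customer, revenue_dollars FROM revenue_by_customer ORDER BY customer"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_payments").fetchone()[0]
```

Because the raw table is left untouched, the transformed table can be dropped and rebuilt with different logic at any time, which is the flexibility described above.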

Increasing amounts of data. Increasing importance of data. Increasing fragmentation of vendors. These trends, coupled with investor appetite to fund the next Snowflake, make it no surprise that we’re witnessing what the CEO of Fishtown Analytics calls a Cambrian explosion of new tools supporting data quality.

Some of these tools are open source:

dbt tests (GitHub) - dbt is a tool for transforming data within warehouses. Described as the "T in ELT", dbt lets you write composable SQL that can be executed in dependency order (specifically, as a DAG). A fan-favorite feature of dbt is the ability to specify tests like ensuring that a column is unique or that every value in one column refers to a value in another (referential integrity). Created by Fishtown Analytics, dbt has seen massive growth in the past year, and we highly recommend checking it out. It just makes sense.
deequ (GitHub) - Created by AWS Labs, deequ is a library that lets you define "unit tests for data" that are translated into Spark jobs. The main components are Profilers and Analyzers that compute metrics like Completeness and Uniqueness of a column. These metrics are then verified against constraints, the results of which are stored in a Deequ repository. Deequ is written in Scala, though there is a Python wrapper as well.
Great Expectations (GitHub) - GE is an open-source Python package for validating data in a declarative and extensible way. The main abstraction is an Expectation such as expect_column_values_to_not_be_null, which GE compiles into code that runs in the context of your data, whether it's a Pandas dataframe or a Snowflake table.
Apache Griffin (GitHub) - Griffin is an open source tool for defining data quality ("accuracy, completeness, timeliness, profiling") on both batch and streaming sources, measuring it on a schedule, then reporting output metrics to a destination.
Elementary Data (GitHub) - Elementary is an open source data observability tool built for dbt users. It offers immediate visibility into data quality and performance. Features include data observability reports, anomaly detection, enriched test results, model performance visibility, data lineage, and Slack alerts. It operates by creating tables of metadata and test results in your data warehouse, with the CLI tool generating the user interface and alerts.
re_data (GitHub) - re_data is an open-source data reliability framework built for the modern data stack. Its main aim is to assist data teams in maintaining data reliability. It achieves this by observing the dbt project and providing features such as alerts about bad data, computation of data quality metrics, and the capability to write your own data assertions. One of its prime functionalities is enabling users to easily collaborate on data issues discovered in various open-source apps.
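The kinds of tests these tools describe (uniqueness, not-null, referential integrity) can be sketched as plain SQL queries that return the offending rows, so that a test passes when its query returns nothing. The sketch below uses Python's built-in sqlite3 module in place of a warehouse and does not call the dbt, deequ, or Great Expectations APIs; the customers and orders tables and the failing_rows helper are hypothetical.

```python
import sqlite3

# Hypothetical tables with deliberately bad data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES
        (1, 'a@example.com'), (2, NULL), (2, 'c@example.com');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
""")


def failing_rows(sql):
    """Run a test query; the test passes when no rows come back."""
    return conn.execute(sql).fetchall()


# Uniqueness: any id that appears more than once is a failure.
duplicate_ids = failing_rows(
    "SELECT id, COUNT(*) FROM customers GROUP BY id HAVING COUNT(*) > 1"
)

# Not-null: any row with a missing email is a failure.
null_emails = failing_rows("SELECT id FROM customers WHERE email IS NULL")

# Referential integrity: every orders.customer_id must exist in customers.id.
orphan_orders = failing_rows("""
    SELECT o.id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.id
    WHERE c.id IS NULL
""")
```

Here each of the three checks returns one failing row: the duplicated customer id, the customer with no email, and the order pointing at a customer that does not exist.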

Many more tools are young commercial offerings. Recently, many tools (including ours) have described themselves as offering “data observability,” possibly inspired by the success of Datadog, a public company that provides observability for software engineers.

🤖

Observability is a concept from control theory that describes how well the state of a system can be inferred from outputs. Applied to software systems like an EC2 instance hosting a web server, poor observability might be a health check ping, while strong observability could be a Datadog agent sending system metrics like CPU utilization, network performance, and system logs.

One way we like to think of observability is to ask: How many questions do you need to ask before you're confident in the state of the system?

Applied to a modern data stack, “data observability” tools aim to provide the data needed to answer questions of this sort:

Are the EC2 instances running my Airflow jobs maxing out on CPU?
How often are rows loaded into my database from Fivetran?
Do the tables in my database contain as many rows as I expect?
Are there unexpected outliers in this column?
Which dashboards rely on which tables?
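A few of these questions can be sketched as small checks. The example below is a simplified illustration in plain Python, not how any specific vendor implements monitoring. The function names, the 24-hour freshness window, and the 3.5 cutoff (a common rule of thumb for the modified z-score) are assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import median


def is_fresh(last_loaded_at, now, max_age):
    """Freshness: was the table loaded recently enough?"""
    return now - last_loaded_at <= max_age


def row_count_ok(row_count, expected_min, expected_max):
    """Volume: does the table hold roughly as many rows as expected?"""
    return expected_min <= row_count <= expected_max


def outliers(values, threshold=3.5):
    """Flag values whose modified z-score, based on the median absolute
    deviation (MAD), exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Fixed timestamps keep the example deterministic.
now = datetime(2023, 1, 20, 12, 0)
fresh = is_fresh(datetime(2023, 1, 20, 6, 0), now, timedelta(hours=24))
stale = is_fresh(datetime(2023, 1, 18, 6, 0), now, timedelta(hours=24))
flagged = outliers([10, 12, 11, 13, 12, 95])
```

The median-based score is used here instead of a mean-based one because a single extreme value would otherwise inflate the very statistics used to detect it.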

The new vendors in this space, which may outnumber galaxies in the universe, include:

acceldata - Acceldata’s tools “measure, monitor, and model data that constantly moves across complex pipelines.”
Anomalo - Founded by the former VP of data science at Instacart, Anomalo “is the easiest way to validate and document all the data in your data warehouse. All without writing a single line of code.”
Bigeye - Bigeye is “automatic data monitoring for modern data teams” focusing on the freshness, volume, formats, categories, outliers, and distributions of data within a warehouse.
Datafold - Datafold “gives you confidence in your data quality through diffs, profiling, and anomaly detection.” Notably, they provide a “data diff” feature that compares the data before and after code changes or the arrival of new data.
Databand - Databand helps “monitor your data health and pipeline performance.” Unlike other tools in this list, Databand looks to be focused on data pipelines over warehouses.
Decube - Decube is a unified data observability platform that offers features like dynamic masking, automated data discovery, end-to-end lineage tracking, and simplified governance.
Metaplane - Metaplane integrates across the data stack from source systems to warehouses to BI dashboards, then identifies normal behavior and alerts the right people when things go awry.
Monte Carlo - Monte Carlo helps “increase trust in data by eliminating and preventing data downtime” by monitoring the freshness, volume, distribution, schema, and lineage of data.
Soda - Soda “finds the data worth fixing” through automated data monitoring.
Sifflet - Sifflet is a "Full Data Stack Observability" platform that provides comprehensive data and metadata monitoring and ML-based anomaly detection.
Telm.ai - Telmai is a data quality analysis and monitoring platform that detects and resolves data anomalies in real-time, using statistical analysis and machine learning, without the need for coding or predefined rules.
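To make the "data diff" idea mentioned in the list above concrete, here is a minimal sketch that compares two snapshots of a table by primary key. It illustrates the general technique only and does not reflect how Datafold or any other vendor implements it; diff_rows and the sample rows are hypothetical.

```python
def diff_rows(before, after, key):
    """Compare two snapshots of a table (lists of dict rows) by primary key."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    return {
        # Keys present only in the new snapshot.
        "added": sorted(new.keys() - old.keys()),
        # Keys present only in the old snapshot.
        "removed": sorted(old.keys() - new.keys()),
        # Keys in both snapshots whose row contents differ.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


before = [
    {"id": 1, "plan": "free"},
    {"id": 2, "plan": "pro"},
    {"id": 3, "plan": "pro"},
]
after = [
    {"id": 1, "plan": "free"},
    {"id": 2, "plan": "team"},
    {"id": 4, "plan": "free"},
]
result = diff_rows(before, after, key="id")
```

Running a comparison like this against the output of a transformation before and after a code change shows which rows the change would touch.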

How do you choose between these vendors as a consumer? We found that it’s useful to segment observability tools along four dimensions:

What does the tool monitor? Pipelines, warehouses, or BI tools? Or from end-to-end?
Are tests automatically generated or manually set? Simple manual tests can provide a lot of mileage and are much better than nothing. Automatic tests can help you move past the low-hanging fruit onto the hairier, harder-to-catch issues.
How rich is the observability? Does the tool provide simple metrics, like the uptime of a pipeline, or richer metadata, like the lineage between dashboards and tables?
How deep is the workflow integration? The job of an observability tool is to make itself invisible: the less you have to use it, the better it is doing its job. Ideally, though, it integrates with tools your team already uses, like Slack and PagerDuty.
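The difference between manually set and automatically generated tests can be sketched with daily row counts. This is a simplified illustration: the function names and sample history are hypothetical, and the three-standard-deviation band is an assumed rule of thumb. Real tools may use more sophisticated models.

```python
from statistics import mean, stdev


def manual_check(row_count, minimum):
    """A hand-written test: pass if the count clears a fixed floor."""
    return row_count >= minimum


def learned_bounds(history, width=3.0):
    """Derive an expected range from past counts (mean +/- width * stdev)."""
    mu = mean(history)
    sigma = stdev(history)
    return (mu - width * sigma, mu + width * sigma)


def automatic_check(row_count, history, width=3.0):
    """A generated test: pass if the count falls inside the learned range."""
    low, high = learned_bounds(history, width)
    return low <= row_count <= high


# Hypothetical daily row counts for one table.
history = [1000, 1020, 980, 1010, 990, 1005, 995]

passes_manual = manual_check(600, minimum=500)
passes_automatic = automatic_check(600, history)
```

In this example a drop to 600 rows clears the hand-set floor of 500 but falls well outside the range learned from history, which is the kind of harder-to-catch issue that automatic tests are meant to surface.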

The most important question when evaluating a vendor, of course, is:
Does this tool solve a real problem for me? How frequently does a downstream business user ping you on Slack because a dashboard is broken? How long does it take to peel away layers of logs to identify the root cause?

❓

Why can’t I just use Datadog? Datadog is the default tool for monitoring infrastructure and applications, and if you’re mostly concerned with your data pipelines themselves, Datadog can probably do that for you. But Datadog is not designed to monitor the data flowing through those pipelines or sitting at rest in the data warehouse. You could probably do it, but you’d be shoehorning abstractions.

The end of data history?

Is the modern data stack the final configuration of systems that maximizes utility for data producers and consumers alike? Or is it one apex of the pendulum swinging between bundling and unbundling? One thing is certain: the grooves cut by the flow of data within organizations are here to stay, as are the people responsible for ensuring that flow. And we find it hard to imagine a future in which data teams continue to fight an uphill battle to build trust in data.

Now that data warehousing and extraction/loading are becoming solved problems, we’re eager to see the new technologies that will emerge as warehouses pull more and more data into their gravitational fields. Just as the first filmed movies were simply plays in front of a camera, “data observability” can feel like reproducing old concepts in a new domain. But once technologists gain their bearings in this new space, we’re excited to see a future in which data can be delivered faster, more reliably, and in a more usable way to where it needs to be.

We believe data observability tools will play a critical role because, unlike the physical universe, our data warehouses do not have to tend toward disorder.

CREATED_AT 2023-01-20T20:12:13Z
UPDATED_AT 2023-05-22T13:34:58Z

Metaplane is a data observability tool that helps everyone trust the data that powers your business. Talk to us to learn more.
