The Four Pillars of Data Observability
Answering questions like “is our data anomalous?” and “how has this metric trended over time?” requires historical knowledge of our data. Until storing complete historical snapshots of our data becomes feasible, we’re left with storing approximations.
How much data do we need to reconstruct a useful picture of our data? That’s the question we attempt to answer in this article. The groupings of questions we need to answer in order to describe our data are the “pillars” that underlie data observability in 2023.
Why do we need yet another Pillars of X post?
Pillars are commonplace in the world of observability. For example, Software Observability is built on top of the three pillars of metrics, traces, and logs.
Metrics are numeric values that describe components of a software system over time, like the CPU utilization of a microservice, the response time of an API endpoint, or the size of a cache in a database.
Traces describe dependencies between pieces of infrastructure, for example the lifecycle of an application request from an API endpoint to a server to a database.
Logs are the finest grained piece of information describing both the state of a piece of infrastructure and its interaction with the external world. With these three pillars in mind, software and devops engineers can gain increased visibility into their infrastructure throughout time.
The purpose of the three pillars is to describe the three categories of information that can be used to reconstruct, in as much detail as is relevant to a use case, the state of software infrastructure.
The goal of approximation is closely tied to why the term “observability” was borrowed from control theory, where observability is defined as a “measure of how well internal states of a system can be inferred from knowledge of its external outputs.” The pillars, together, should let users infer knowledge of the internal state of a system at any point in the past.
Data Observability draws inspiration from Software Observability, though there are important differences like the lineage between pieces of data and the components of a data system.
Because of those differences, the three pillars of Software Observability don’t quite address the needs of data teams, whether they’re DataOps, data engineering, data science, data analytics, or analytics engineering teams. Like my old computer science professor would say, compared with software, data management is "the same, but completely different." The overarching similarity, though, is the goal of increasing visibility into their data systems over time.
How can we derive the four pillars?
In that spirit, we ask: what can we know about your organization's data to derive its state at any point in time? We add two additional constraints: we want to minimize the number of pillars to be maximally concise, while making sure that they’re orthogonal to maximize information value of each.
We borrow a concept from thermodynamics: the intensive and extensive properties of materials. Intensive properties do not depend on the amount of material. For example, the temperature, density, pressure, and viscosity of a material do not depend on how much material there is. One cup of water can have the same temperature as an entire ocean.
Metrics: internal characteristics of the data
In the world of data, the analogues of intensive properties are properties of the data itself.
If the data is numeric, properties include summary statistics about the distribution like the mean, standard deviation, and skewness. If the data is categorical, summary statistics of the distribution can include the number of groups and the uniqueness of values. Across all types of data, metrics like completeness, accuracy, and whether the data includes sensitive information can be computed to describe the data itself.
These are all different data quality metrics that describe some aspect that summarizes the underlying data, whether they’re calculated for data tables at rest in a warehouse or data in transit in data pipelines.
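As a rough illustration of these metrics, here is a minimal Python sketch that profiles a single column held in memory. The function names `profile_numeric` and `profile_categorical` are hypothetical, `None` is assumed to mark a missing value, "uniqueness" is taken to mean the share of distinct values among non-missing ones, and each column is assumed to contain at least one non-missing value.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Summarize a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    stdev = statistics.pstdev(present)  # population standard deviation
    # Population skewness: the mean cubed z-score (0.0 for a constant column).
    skewness = (
        sum(((v - mean) / stdev) ** 3 for v in present) / len(present)
        if stdev > 0
        else 0.0
    )
    return {
        "completeness": len(present) / len(values),
        "mean": mean,
        "stdev": stdev,
        "skewness": skewness,
    }


def profile_categorical(values):
    """Summarize a categorical column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "completeness": len(present) / len(values),
        "num_groups": len(counts),
        # Share of distinct values among the non-missing values.
        "uniqueness": len(counts) / len(present),
    }
```

The same profile could be computed on a table at rest or on a batch in transit; only the way the values are fetched would change.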
Metadata: external characteristics about the data
Extensive properties, in contrast, do not depend on which material you have, like mass, volume, and heat capacity. You can have a pound or a liter of both water and gold. Note that the analogy isn’t exact: in thermodynamics, extensive properties scale with the amount of material, whereas the freshness and structure of data do not depend on how much data there is.
Metadata is frequently defined as “data about data,” but we’d add that metadata is “data about data that is independent of the data itself.” Direct analogies to the world of data include properties like data volume (number of rows), the structure of data (schema), and the timeliness of data (freshness).
While the volume, schema, and freshness of data have an impact on the internal metrics, they can be scaled independently while preserving the statistical characteristics. Conversely, the internal characteristics of data can change without impacting the volume, schema, or freshness. Together with metrics, metadata can be used to identify data quality issues.
Lineage: dependencies between data
Using metrics and metadata, we can describe a single dataset with as much fidelity as we desire. However, datasets in the real world rarely exist in isolation; they do not land in a data warehouse with no relationship to one another.
We can draw another analogy from the physical sciences, where systems can be modeled within themselves, but our understanding can be enriched by modeling interactions. For example, thermodynamic systems have smaller components with internal interactions, and also have interactions with the external environment.
Within the data world, the primary internal interaction is the derivation of one dataset from another. Datasets are derived from upstream data, and can be used to derive downstream data. These bidirectional dependencies are referred to as the lineage of data (also called the provenance). Lineage ranges in level of abstraction from dependencies between entire systems (this warehouse depends on those sources), to dependencies between tables, between columns in tables, and between values in columns.
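Table-level lineage can be pictured as a directed graph. The sketch below is one possible representation, not a standard one: the table names and the `UPSTREAM` mapping are made up for illustration, and `invert` and `reachable` are hypothetical helpers. Walking the graph in one direction finds everything a table was derived from; walking the inverted graph finds everything derived from it.

```python
from collections import deque

# Hypothetical table-level lineage: each key is derived from the tables in its list.
UPSTREAM = {
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
    "fct_revenue": ["stg_orders", "stg_customers"],
    "revenue_dashboard": ["fct_revenue"],
}


def invert(edges):
    """Turn an upstream map (child -> parents) into a downstream map (parent -> children)."""
    downstream = {}
    for child, parents in edges.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(child)
    return downstream


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start (start excluded)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Here `reachable("fct_revenue", UPSTREAM)` lists the candidate root causes of an issue in `fct_revenue`, while `reachable("raw_orders", invert(UPSTREAM))` lists what a problem in `raw_orders` could affect.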
Logs: interactions between data and the real world
With metrics describing the internal state of data, metadata describing its external state, and lineage describing dependencies between pieces of data, we’re only missing one piece: how that data interacts with the external world. We break these interactions into machine-machine interactions and machine-human interactions.
Machine-machine interactions with data include movement, like when data is being replicated from data sources like transactional databases or external providers to an analytical warehouse by an ELT tool.
Interactions also include transformations, for example when a dbt job transforms source tables into derived tables. Logs also document attributes of these interactions, for example the amount of time that a replication or transformation takes, or the timestamp of that activity.
Crucially, logs capture machine-human interactions between data and people, like data engineering teams creating new models, stakeholders consuming dashboards for decision making, or data scientists creating machine learning models. These machine-human interactions contribute an understanding of who is responsible for data and how data is used.
Putting it all together
With metrics describing the internal properties of data, metadata describing the external properties, lineage describing the dependencies, and logs describing the interactions, we have four levers that we can pull in order to fully describe the state of our data at any point in time.
Without any one of the pillars, our ability to reconstruct the state of data is incomplete. Without metrics, we do not have knowledge about the internal properties of the data itself, making alerting based on real-time anomaly detection on metrics impossible. If we only had metadata, we would know the shape, structure, and timing of data, but not necessarily whether we had bad data.
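To make the anomaly detection use case concrete, here is a deliberately simple sketch that flags a new metric value when it sits far from the historical mean. The z-score rule and the default threshold of 3 standard deviations are assumptions chosen for illustration; `is_anomalous` is a hypothetical helper, and production systems often need to account for trend and seasonality as well.

```python
import statistics


def is_anomalous(history, latest, threshold=3.0):
    """Flag latest if it lies more than `threshold` sample standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly constant history: any different value is a change.
        return latest != mean
    return abs(latest - mean) / stdev > threshold
```

The key dependency is `history`: without stored metric values from earlier points in time, there is nothing to compare the latest value against.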
Without metadata, we do not know the shape, structure, or timing of data, making use cases like schema change detection or satisfying Service Level Agreements (SLAs) by tracking outages impossible. These use cases are critical to improve data reliability and decrease data downtime. If we only had metrics, we would know whether the data is correct, but not necessarily if it was refreshed in an appropriate amount of time.
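Both metadata use cases reduce to comparing snapshots. The following sketch assumes a schema is available as a `{column: type}` mapping and that a table’s last update time is known; `diff_schema` and `freshness_breach` are hypothetical helper names, not part of any particular tool.

```python
from datetime import datetime, timedelta, timezone


def diff_schema(old, new):
    """Compare two {column: type} mappings taken at different times."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }


def freshness_breach(last_updated, now, max_staleness):
    """Return how far past the freshness SLA a table is, or None if it is within the SLA."""
    staleness = now - last_updated
    return staleness - max_staleness if staleness > max_staleness else None


# Example: a table expected to refresh at least every 6 hours.
last_updated = datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2023, 1, 1, 7, 0, tzinfo=timezone.utc)
overdue = freshness_breach(last_updated, now, timedelta(hours=6))
```

Neither check looks at the values inside the table, which is why metrics alone cannot answer these questions.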
Without lineage, we do not know how different pieces of data depend on each other, making it difficult to conduct triage workflows like upstream root cause analysis and downstream impact analysis. If we only had metrics and metadata, we would have a holistic view of the health of data, but an incomplete picture of how issues are related.
Without logs, we do not know how external systems like ELT / ETL and transformation tools impact our data and how external users are impacted by our data. With only data lineage, we know how data is related, but not necessarily how important those relationships are (does anyone use this data?) or who is responsible for upstream changes.
Missing any one of the four pillars would be a meaningful gap in a data observability tool or program. At the same time, other categories of information we might add are redundant with one of the four. Therefore, these four pillars are necessary and sufficient.
Putting the pillars to work
Building up these pillars is an exercise in not letting the perfect get in the way of the good. Most data teams in 2023 have little information about their systems, so the first step is to just get started.
To start collecting metrics, identify the most important tables and the data quality metrics to monitor on those tables. A simple implementation is to periodically run SQL queries that compute properties like the `nullness`, `mean`, and `standard_deviation` of key columns. Typically the easiest place to start is with data at rest within a data lake or data warehouse, then extending upstream into data in motion. As with the rest of the pillars, profiling at one point in time is useful but insufficient for spotting poor data quality, which requires periodic checks over time.
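A minimal sketch of that approach follows, using Python’s built-in `sqlite3` module as a stand-in for a warehouse connection. The `orders` table, its `amount` column, and the `profile` helper are hypothetical. SQLite has no built-in standard deviation function, so the query computes the population variance from averages and the square root is taken in Python; a real warehouse would likely offer a native function for this. The sketch assumes the column has at least one non-null value.

```python
import math
import sqlite3

# In-memory database standing in for a warehouse; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, None), (4, 30.0)],
)

PROFILE_SQL = """
SELECT
    COUNT(*) AS row_count,
    AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END) AS nullness,
    AVG(amount) AS mean,
    AVG(amount * amount) - AVG(amount) * AVG(amount) AS variance
FROM orders
"""


def profile(conn):
    """Run the profiling query once and return the metrics as a dict."""
    row_count, nullness, mean, variance = conn.execute(PROFILE_SQL).fetchone()
    return {
        "row_count": row_count,
        "nullness": nullness,
        "mean": mean,
        # Guard against tiny negative values caused by floating-point rounding.
        "standard_deviation": math.sqrt(max(variance, 0.0)),
    }
```

Running `profile(conn)` on a schedule and appending each result with a timestamp to a history table is what turns a one-off profile into a time series that can be monitored.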
Metadata is often provided out of the box by your data warehouse, with warehouses like Snowflake and Google BigQuery providing snapshots of the row count, schema, and last update time of tables in `INFORMATION_SCHEMA`. You can begin by storing that metadata into a separate table then tracking changes over time.
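One way to track that metadata over time is an append-only snapshot table. The sketch below again uses `sqlite3` as a stand-in; the `metadata_snapshots` table, its columns, and both helper functions are assumptions for illustration. How the row count and last update time are fetched from your warehouse is left out, since it differs between vendors.

```python
import sqlite3

store = sqlite3.connect(":memory:")
store.execute(
    """
    CREATE TABLE metadata_snapshots (
        table_name   TEXT,
        captured_at  TEXT,    -- ISO 8601 timestamp, so text order matches time order
        row_count    INTEGER,
        last_updated TEXT
    )
    """
)


def record_snapshot(store, table_name, captured_at, row_count, last_updated):
    """Append one metadata observation; earlier snapshots are never overwritten."""
    store.execute(
        "INSERT INTO metadata_snapshots VALUES (?, ?, ?, ?)",
        (table_name, captured_at, row_count, last_updated),
    )


def row_count_change(store, table_name):
    """Relative change in row count between the two most recent snapshots, or None."""
    rows = store.execute(
        "SELECT row_count FROM metadata_snapshots "
        "WHERE table_name = ? ORDER BY captured_at DESC LIMIT 2",
        (table_name,),
    ).fetchall()
    if len(rows) < 2 or rows[1][0] == 0:
        return None  # no baseline to compare against
    latest, previous = rows[0][0], rows[1][0]
    return (latest - previous) / previous
```

Keeping every snapshot, rather than only the latest, is the design choice that makes questions about trends answerable later.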
Lineage can be a bit more challenging. Inferring the lineage between tables and columns is a difficult task that involves detailed query parsing, but there are some open source libraries that provide a starting point. If you use a workflow tool like dbt or dagster or Prefect, you’re in luck, and can start ingesting the transformation metadata generated by those systems. Data integration tools like Fivetran also increasingly provide lineage metadata.
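As an example of ingesting transformation metadata, dbt writes a `manifest.json` artifact in which each node lists the nodes it depends on. The sketch below works on a hand-written dictionary shaped like a tiny excerpt of such a manifest; the project and model names are invented, the exact manifest layout can differ between dbt versions, and `table_lineage` is a hypothetical helper, so treat this as a starting point rather than a reference parser.

```python
# Hand-written excerpt shaped like a dbt manifest; names are invented.
MANIFEST = {
    "nodes": {
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw.orders"]},
        },
        "model.shop.fct_revenue": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
    }
}


def table_lineage(manifest):
    """Map each node's unique id to the unique ids of the nodes it depends on."""
    return {
        unique_id: list(node.get("depends_on", {}).get("nodes", []))
        for unique_id, node in manifest.get("nodes", {}).items()
    }


lineage = table_lineage(MANIFEST)
```

With a real project, the dictionary would come from loading the manifest file with `json.load` instead of being written by hand.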
Storing and parsing logs is highly dependent on the tools in your data stack, though often ELT/ETL and BI tools in the "modern data stack" will provide an API for accessing those logs, and your warehouse will likely store query history. Sometimes, these APIs will even provide you with analytics about when and how data analysts and stakeholders are using your data products.
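Once query history is available, even a simple aggregation answers “does anyone use this data?”. The sketch below assumes each log entry has already been parsed into a dictionary with a `user` and the list of `tables` it read; that record shape, the sample entries, and `usage_by_table` are hypothetical, and extracting table names from raw query text is a separate problem not shown here.

```python
from collections import defaultdict

# Hypothetical, already-parsed query history entries.
QUERY_LOG = [
    {"user": "ana", "tables": ["fct_revenue"]},
    {"user": "ben", "tables": ["fct_revenue", "stg_orders"]},
    {"user": "ana", "tables": ["fct_revenue"]},
]


def usage_by_table(query_log):
    """Count queries and distinct users per table."""
    usage = defaultdict(lambda: {"queries": 0, "users": set()})
    for entry in query_log:
        for table in entry["tables"]:
            usage[table]["queries"] += 1
            usage[table]["users"].add(entry["user"])
    return {
        table: {"queries": stats["queries"], "users": sorted(stats["users"])}
        for table, stats in usage.items()
    }
```

Tables that never appear in the result are candidates for deprecation, and the user lists hint at who to notify when a table has an incident.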
Taken as a group of four, these pillars provide as much visibility as desired into your organization's data health and provide the information needed for data governance programs. While maintaining high-quality data is one purpose of data observability, there are many adjacent use cases like usage analytics, effective onboarding, and spend management as well.
If you don’t have the bandwidth to build systems that collect metrics, metadata, lineage, and logs of your data assets, that’s where data observability platforms come into the picture. There is a lively ecosystem of commercial and open source data observability solutions that automate the collection of this information, synthesize it into a usable form, and integrate with the tools you already use. Metaplane is one option to consider if you want to start using a tool out-of-the-box within 30 minutes.
- Until it’s feasible to store snapshots of data over time, we rely on storing approximations of our data, à la lossy compression
- Four categories of characteristics about our data form the four pillars of Data Observability
- The Metrics and Metadata pillars describe the internal characteristics of our data itself and its external characteristics
- The Lineage and Logs pillars describe internal dependencies within our data and its interactions with the external world
- Without any one of these four pillars, we’re left with an incomplete picture of our data at a given point in time