The Origins, Purpose, and Practice of Data Observability

Everything old is new again, especially in the data world. With hype around the modern data stack, and even the post-modern data stack, you’d be right to step back and ask: “Haven’t we dealt with data since the first computer? What changed since Edgar Codd formalized the relational database in 1970?”

Kevin Hu, PhD

May 15, 2023

Co-founder / Data and ML

Our present is indeed a continuation of that moment in time, but has a lineage extending all the way back to when humans first developed writing.

One of the oldest extant pieces of writing is a cuneiform tablet created in Sumeria about 5,000 years ago. The writing isn’t an epic poem or religious rite, but a sales receipt signed by a person named Kushim.

Specifically, this clay fragment records the delivery of 29,086 measures of barley over 37 months. While Kushim didn’t have access to enterprise software back then, this is essentially a Salesforce transaction. Good thing his data storage medium had auto-save capabilities.

While our technologies have progressed beyond clay tablets, customer complaints will never go away. Image courtesy of author. Images courtesy of the Metropolitan Museum of Art and The Trustees of the British Museum.

The Origins of Data Observability

While data is thousands of years old, observability is hundreds of years old. “Data observability,” a concept describing the degree of visibility we have into our data systems, is a riff on “software observability,” which itself draws from control theory.

In the 17th century, when Dutch polymath Christiaan Huygens wasn’t busy studying the rings of Saturn, he was thinking about windmills. At the time, windmills were often used to grind grain into flour.

But one problem plagued millers: wind blows at variable speeds, and when it blows too fast, the millstones generate too much friction. This friction could scorch the flour or even throw sparks, which could ignite the flour in the air and burn entire mills down.

The problem was regulating rotational velocity, and Huygens’ solution was the centrifugal governor. Like an ice skater extending their arms out to slow down and pulling their arms in to speed up, the governor has two swinging pendulums that keep the speed of the millstones within an acceptable range.

In other words, Huygens acknowledged he couldn’t control the input (wind speed) and opted to better control the system (milling speed). About a hundred years later, Scottish engineer James Watt adapted the centrifugal governor for industrial uses, including steam engines.

Diagram of a centrifugal governor used to control flow speed out of a valve. From “Discoveries & Inventions of the Nineteenth Century” by R. Routledge.

As with the steam engine, the technology of the centrifugal governor came before the theoretical understanding. In this case, the pioneer was fellow Scotsman James Clerk Maxwell, whose main hustle was studying electromagnetism.

His idea was: given this complex system, it’s untenable to observe every piece of information being processed at any given time, so we satisfy ourselves by observing a few key outputs and making inferences. This became the concept of “observability” in the field of control theory.

Skip ahead another hundred years, and data was taking a more familiar shape. In the late 1960s and early 1970s, Edgar Codd at IBM introduced the relational database model. Codd had a notion that much of the data being collected on physical artifacts like book pages, carbon copies, and punch cards could be represented digitally in the form of tables with relationships between them.

This relational model and algebra laid the groundwork for the relational database. Furthermore, these tables and relations could be acted upon by a well-defined set of operations, which became SQL. These early database prototypes gave rise to the enterprise databases we still use today.

My point is that data is not new and neither is observability. Here’s what is new: in Codd’s time, data fed into a single artery running through an organization. Data supplemented business strategy by supporting decisions.

Today, data teams centralize, transform, and expose an increasing amount of data from an increasing number of data sources. This data serves a growing number of stakeholders in use cases that are more and more diverse. The potentially real-time data flowing through these data pipelines powers businesses by training machine learning or AI models that improve operations and product experiences, feeding back into software tools for operational purposes, and driving automated actions.

Five thousand years after Kushim recorded that barley transaction in ancient Sumeria, we are finally fulfilling the promise of data.

The Purpose of Data Observability: Beyond Data Quality

Moore’s law has driven down the cost of computation and data storage exponentially, which led to the rise of the players we know and love in the modern data stack. Increasing amounts of data are replicated by tools like Fivetran and Airbyte into cloud warehouses like Snowflake, BigQuery, and Redshift, where the data is then transformed with tools like dbt.

Progress is often a double-edged sword, and the explosion of data is no different. The problem is that more data means more surface area to maintain. As maintenance overhead increases, data breakage becomes more frequent and severe. And as data breaks down, the use cases that depend on data begin to falter. More often than not, stakeholders then lose trust in the data.

No matter how many zettabytes of data you have, it’s useless if it can’t be trusted.

Just as entropy tends to increase over time, creating more data without increasing your capacity to manage it will only lead to greater chaos and disorder. The flood of data is like the wind blowing too hard on Huygens’ windmill. In this analogy, data observability is the governor that keeps the mill from burning down by preventing lapses in data quality and unacceptable data downtime.

Modern data technologies make it easier than ever to ingest, transform, and activate data. But more surface area leads to higher likelihood of breakage, which can result in lost trust from stakeholders.

This fight against entropy is why data observability is not the same as data quality monitoring of your data assets, though some people speak about them interchangeably. Data quality monitoring alerts us to problems, like a burglar alarm. Data observability strives to understand why the problem occurred and how it can be prevented. It’s like a burglar alarm with security camera footage we can review.

The potential use cases for data observability extend beyond data quality monitoring of datasets to include incident management, root-cause analysis, and usage analytics. These are the areas where organizations can apply data observability concepts to solve the problems that plague data teams.
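To make the burglar alarm versus camera footage distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular tool works: the table names, the row-count history, the z-score threshold of 3, and the lineage graph are all hypothetical. The first function is the alarm (monitoring); the second supplies context for root-cause analysis by walking upstream through lineage.

```python
from statistics import mean, stdev

# Hypothetical lineage: each table maps to the tables it reads from.
UPSTREAM = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
}


def is_anomalous(history, latest, threshold=3.0):
    """Monitoring: flag `latest` if it lies more than `threshold` standard
    deviations from the mean of `history` (a simple z-score test)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def upstream_of(table, graph=UPSTREAM):
    """Context: every table that `table` depends on, directly or
    transitively, in breadth-first order."""
    seen, queue = [], list(graph.get(table, []))
    while queue:
        node = queue.pop(0)
        if node not in seen:
            seen.append(node)
            queue.extend(graph.get(node, []))
    return seen


def investigate(table, history, latest):
    """Return None when nothing looks wrong, else an alert with context."""
    if not is_anomalous(history, latest):
        return None
    return {
        "table": table,
        "latest_row_count": latest,
        "expected_about": round(mean(history)),
        "check_upstream": upstream_of(table),
    }


# Hypothetical daily row counts for one table, followed by a sudden drop.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]
alert = investigate("fct_orders", daily_row_counts, latest=400)
print(alert)
```

The alert alone says that something broke; the list of upstream tables suggests where to start looking. A production system would likely use a more robust detector than a z-score and would extract lineage automatically rather than from a hand-written dictionary.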

The Pillars of Data Observability

What are the core components of data observability? Data engineering is still a relatively young discipline. Throughout its development, it has often looked to the example of its big brother software engineering for guidance. “Pillars of observability” will sound familiar if you know the three pillars of software observability: metrics, traces, and logs. Data observability is following a similar path, albeit with a different spin.

A decade ago, software engineers were in the same position we data engineers find ourselves in today. The shift from physical disks to cloud storage meant people were placing greater expectations on software. The accelerated pace of updates meant a danger of entropy. This led to the development of software and DevOps observability tools like Datadog and Sumo Logic for monitoring virtual machines, microservices, and more.

Now we’re at the dawn of the data observability era. While they have many things in common, software and data have a few fundamental differences that mean we can’t blindly replicate principles from software observability to data. We need to take the observability concept and tailor it to our specific domain. The four pillars of data observability are metrics, metadata, lineage, and logs:

Metrics: internal characteristics of data. Examples: distribution of a numeric column, number of distinct groups in a categorical column.
Metadata: external characteristics of data. Examples: schema of a table, time elapsed since a view was materialized, number of rows in a table.
Lineage: dependency relationships between pieces of data. Example: column-level lineage between database columns and downstream dashboards.
Logs: interactions between data and the external world. Examples: number of times a dashboard has been viewed, who pushed the latest dbt commit.

Taken together, these four pillars are necessary and sufficient to describe the state of a data system at any point in time, in arbitrary detail.

The four categories of information that approximate the state of our data system to arbitrary precision.
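As a rough illustration of the four pillars, the Python sketch below captures one small example of each for a toy table. Everything in it is hypothetical: the rows, the refresh timestamp, the lineage edges, the event log, and the function names are invented for this example, and a real system would read them from the warehouse, the transformation tool, and the BI layer.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean, median

# Hypothetical table contents and the time the table was last refreshed.
ROWS = [
    {"order_id": 1, "amount": 20.0, "status": "paid"},
    {"order_id": 2, "amount": 35.5, "status": "paid"},
    {"order_id": 3, "amount": 12.5, "status": "refunded"},
    {"order_id": 4, "amount": 80.0, "status": "pending"},
]
LAST_REFRESHED = datetime(2023, 5, 15, 6, 0, tzinfo=timezone.utc)

# Lineage: hypothetical edges from a column to the assets that read it.
LINEAGE = {"orders.amount": ["revenue_dashboard", "finance_export"]}

# Logs: hypothetical events recording interactions with the outside world.
EVENTS = [
    {"asset": "revenue_dashboard", "action": "view", "user": "ana"},
    {"asset": "revenue_dashboard", "action": "view", "user": "raj"},
    {"asset": "finance_export", "action": "view", "user": "ana"},
]


def column_metrics(rows, numeric_col, categorical_col):
    """Metrics: internal characteristics computed from the values."""
    values = [row[numeric_col] for row in rows]
    groups = Counter(row[categorical_col] for row in rows)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "distinct_groups": len(groups),
    }


def table_metadata(rows, last_refreshed, now):
    """Metadata: external characteristics that describe the table."""
    schema = {col: type(val).__name__ for col, val in rows[0].items()}
    hours = (now - last_refreshed).total_seconds() / 3600
    return {"schema": schema, "row_count": len(rows), "hours_since_refresh": hours}


def view_counts(events):
    """Logs: how many times each asset has been viewed."""
    return Counter(e["asset"] for e in events if e["action"] == "view")


now = datetime(2023, 5, 15, 9, 0, tzinfo=timezone.utc)
metrics = column_metrics(ROWS, "amount", "status")
metadata = table_metadata(ROWS, LAST_REFRESHED, now)
views = view_counts(EVENTS)
print(metrics, metadata, LINEAGE["orders.amount"], views, sep="\n")
```

Note that the metrics require scanning the data itself, while the metadata, lineage, and logs describe the table from the outside. That split is one way to read the internal versus external distinction drawn in the list above.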

The Five Goals of Data Observability

As with data quality initiatives, efforts to bring on data observability are most successful when there is a goal in mind. Through working with data teams ranging from small startups to large enterprises across every vertical, we’ve come across five consistent goals when adopting data observability:

Saving engineering time: If your engineering team is spending an inordinate amount of time dealing with inbound service tickets, data observability is one way you can solve problems faster and prevent them from recurring.
Reducing data quality outages: Especially when data is operationalized beyond business intelligence to power go-to-market operations, automations, and customer experiences, the cost of poor data quality can directly affect the top line and bottom line of a company. Conversely, improved data reliability can improve the commercial performance of a company that depends on data.
Increasing the leverage of the data team: The biggest problem facing most data teams is hiring great talent. Once you have great people on board, you don’t want their time wasted on menial tasks that could be automated. Instead of putting out fires, data teams can spend their time developing data pipelines and delivering data products to the business.
Expanding data awareness: There is a school of thought that data practitioners should help other teams make better decisions, automate workflows, and understand the impact of their work. But what about data teams? They need the ability to optimize their own workflows. Data observability provides the metadata they need to do that.
Preserving trust: Trust is easy to lose and almost impossible to regain. Once an executive loses trust in data, their team begins losing trust in data too. We’ve seen teams spin up alternative “shadow” data stacks that invalidate the efforts of the data team, or worse, stop using data altogether. Recovering that trust is an uphill battle of constantly providing assurances that your product is correct.

The purpose of data observability, from most concrete (but not necessarily most valuable) to most abstract.

Putting Data Observability into Practice

These five goals are listed in order from most concrete to most abstract. However, no one goal is necessarily more important than the others; it depends on your business. Once a goal is identified, though, the next step is to gain alignment on metrics, people, processes, and tools:

Metrics: We’re preaching to the choir here, but to ensure progress against a goal, an effective approach is defining clear and measurable metrics that are proxies for that goal. For example, if your goal is to save engineering time, you could track the number of data quality issues, the average time-to-identify issues, or the average time-to-resolve issues.
People: Data comes into the system from a myriad of tools, and outputs are used by just about everybody in the organization. So who is responsible for data quality? We’ve seen teams that fall everywhere along the spectrum between “only the data team” and “the entire organization.” All can work, but the most important piece is alignment. As long as the organization is bought into the importance of data quality and feels like their needs are being heard, then the data team is empowered to own the process of measuring and improving data quality.
Processes: With a goal in mind, metrics that track progress towards that goal, and people aligned around those metrics, the next piece of the puzzle is implementing processes for systematic measurement and improvement. These processes include passive observability that continuously monitors the four pillars, active observability that is integrated into CI/CD workflows, incident management playbooks, and service level objectives/agreements (SLOs/SLAs).
Tools: The three main options for bringing on a data observability tool are to build one in-house, bring on an open source solution, or adopt a commercial tool. These tools often have functionality like anomaly detection on metadata, automated lineage extraction, and CI/CD integrations. Each approach has its pros and cons in terms of customizability, support, time cost, and financial cost.
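The example metrics for the engineering-time goal (number of issues, average time-to-identify, average time-to-resolve) can be computed from a simple incident log. The Python sketch below assumes a hypothetical log where each incident records when it started, when it was identified, and when it was resolved. It also assumes that time-to-resolve is measured from identification rather than from the start of the incident; teams define this differently, so treat that as a choice rather than a standard.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: when each issue began, was noticed, and was fixed.
INCIDENTS = [
    {
        "started": datetime(2023, 5, 1, 8, 0),
        "identified": datetime(2023, 5, 1, 10, 0),
        "resolved": datetime(2023, 5, 1, 13, 0),
    },
    {
        "started": datetime(2023, 5, 3, 9, 0),
        "identified": datetime(2023, 5, 3, 9, 30),
        "resolved": datetime(2023, 5, 3, 11, 30),
    },
    {
        "started": datetime(2023, 5, 7, 14, 0),
        "identified": datetime(2023, 5, 7, 17, 30),
        "resolved": datetime(2023, 5, 7, 18, 30),
    },
]


def mean_duration(incidents, start_key, end_key):
    """Average elapsed time between two timestamps across incidents."""
    if not incidents:
        return timedelta()
    total = sum((i[end_key] - i[start_key] for i in incidents), timedelta())
    return total / len(incidents)


def summarize(incidents):
    """Proxy metrics for the 'saving engineering time' goal."""
    return {
        "issue_count": len(incidents),
        # Assumption: measured from when the issue began to when it was noticed.
        "mean_time_to_identify": mean_duration(incidents, "started", "identified"),
        # Assumption: measured from when it was noticed to when it was fixed.
        "mean_time_to_resolve": mean_duration(incidents, "identified", "resolved"),
    }


summary = summarize(INCIDENTS)
print(summary)
```

Tracking these numbers over time, rather than as a single snapshot, is what shows whether an observability effort is actually moving the goal.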

Summary

What is old is new again, but with a vengeance. Humans have recorded data since we could first press language into clay tablets, and we have always tried to use that data to improve outcomes. Moore’s law has made it possible to move, store, and activate data in greater quantity and to greater effect.

With this increased surface area of data, the entropy of our data systems tends to increase and lead to lost trust, unless we put practices in place to maintain order. Data observability is one emerging technology to help maintain order within our data.

—

This piece is adapted from talks by the author at Open Data Science Conference 2022 and DataOps Unleashed 2022.