Column-Level Lineage: An Adventure in SQL Parsing

March 22, 2023

Column-level lineage is one of the hottest buzzwords in the data space. And it’s easy to see why: tracing data not just from table to table, but from column to column unlocks enormous insights into how your data moves. Those insights can enable things like finer-tuned root cause analysis and more robust downstream issue prevention. 

This post covers Metaplane’s approach to column-level lineage: why it’s important, and our technical journey toward building a fully automated column-level lineage solution for our customers.

Early Days

Until recently, Metaplane, like many other data-focused companies, only supported table-level lineage. There’s a simple reason for it: technically, it’s a whole lot easier to parse out referenced tables from a SQL query. Let’s start with a relatively simple example of a customers model in the classic dbt-labs jaffle_shop:

create or replace view db.analytics.customers as (
  with customers as (
    select * from db.analytics.stg_customers
  ), orders as (
    select * from db.analytics.stg_orders
  ), payments as (
    select * from db.analytics.stg_payments
  ), customer_orders as (
    select
      customer_id,
      min(order_date) as first_order,
      max(order_date) as most_recent_order,
      count(order_id) as number_of_orders
    from orders
    group by customer_id
  ), customer_payments as (
    select
      orders.customer_id,
      sum(amount) as total_amount
    from payments
    left join orders on
      payments.order_id = orders.order_id
    group by orders.customer_id
  ), final as (
    select
      customers.customer_id,
      customers.first_name,
      customers.last_name,
      customer_orders.first_order,
      customer_orders.most_recent_order,
      customer_orders.number_of_orders,
      customer_payments.total_amount as customer_lifetime_value
    from customers
    left join customer_orders
      on customers.customer_id = customer_orders.customer_id
    left join customer_payments
      on customers.customer_id = customer_payments.customer_id
  )
  select * from final
);

To parse out the table names, you could imagine a few simple heuristics. You could write a regex like FROM (tableName) to try to capture all the tables the query is selecting from. That would work pretty well in this case, except that it would also capture a bunch of the common table expression (CTE) names. That’s not a problem; just write another regex to parse out and exclude the CTE names.
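As a rough illustration (not Metaplane’s actual implementation), a naive version of those heuristics might look something like this in Python:

import re

def naive_table_refs(sql: str) -> set[str]:
    # Capture identifiers that follow FROM or JOIN, e.g. db.analytics.stg_orders
    referenced = set(re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE))
    # Capture names defined as CTEs, e.g. "customers as (", so we can exclude them
    ctes = set(re.findall(r"(?:with|,)\s*(\w+)\s+as\s*\(", sql, re.IGNORECASE))
    return referenced - ctes

Run against the customers model above, this returns just the three stg_ staging tables. It already ignores aliased tables, subqueries, quoted identifiers, and plenty more, but it shows how far simple string matching can get you.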

Then you’re (mostly) done: you have the full table-level lineage for this model. There are other complexities (joined tables, aliased tables, and so on) that might need special attention, but it’s possible to get pretty far with simple(ish) string parsing.

The very first SQL parser at Metaplane more or less used this technique. Unfortunately, table names are the easiest case, since there are only so many ways that a SQL statement can reference them. 

But let’s say we want to try the same thing for columns and parse out a list of column names. Even in the example above, the parsing logic would get very complex very quickly, simply because of the number of ways columns can be referenced. Our first parser, written by our CEO Kevin Hu, did just that, and became infamous as a scary yet impressive piece of string “parsing” magic.

And it (generally) worked—most of the time it was able to pull out referenced tables and columns in a SQL statement. In time, we migrated to an open source SQL parser that gave us more comprehensive coverage. With these building blocks, plus the lineage metadata we got from dbt, we were able to support table-level lineage, as well as fine-but-not-great and definitely-not-comprehensive column-level lineage.

But as we started thinking about building out robust column-level lineage support, we realized that we’d need to go back to the drawing board.

Building a Complete SQL Parser

There’s one key difficulty when it comes to column-level lineage: extracting names from SQL statements simply isn’t enough. We need to understand not just the columns involved, but the structure and relationships between tables and columns within a SQL statement. 

Let’s take another look at the customers model above and try to walk through the lineage for the first_order column. Doing so, we get a path roughly like:

customer_orders.first_order →
min(order_date) as first_order from orders →
select * from db.analytics.stg_orders  

Even if we parsed and pulled out all those referenced columns, we don’t actually know how, or in what order, they reference each other. To get that information we’d have to understand the SQL statement as a full program, not just a string. While our open source SQL parser did some of this work, we ran into a few challenges:

  • Warehouses like Snowflake, BigQuery, and Redshift all have an ANSI SQL core, but each provider has its own “dialect,” with a long list of extensions and quirks on top of the spec. Open source parsers don’t always support these extensions, and it can be a lot of work for both us and the engineers who maintain the SQL parsing library to manage all of those intricacies.
  • When we built our own parser, we wanted it to be permissive: we relied on the warehouse to decide whether a SQL statement was valid and attempted to parse the metadata no matter what. But many existing open source solutions do care about validation, which makes them slow to evolve and unlikely to support multiple dialects.
  • We’d be reliant on whichever SQL parser we chose for the completeness of our column-level lineage solution, and could only move and iterate as quickly as our parser library.

After considering all the shortcomings of the existing parsing libraries on the market, we came across the ANTLR parser generator library. This was the push we needed to take another shot at building our own parser. 

Using the ANTLR library, our parser is able to take a SQL statement and convert it into an abstract syntax tree (AST) based on the grammar we define. This was the key, because that tree structure allows us to understand how statements, substatements, CTEs, and so on relate to one another. Even while we were still only parsing table-level lineage, it became clear that the AST would be fundamental to building column-level lineage.
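To make that concrete, here’s a minimal sketch of driving an ANTLR-generated parser from Python. The SqlLexer/SqlParser classes and the statement entry rule are hypothetical names that ANTLR would generate from whatever grammar you define, not Metaplane’s actual grammar:

from antlr4 import InputStream, CommonTokenStream
from SqlLexer import SqlLexer      # generated by ANTLR from the grammar file
from SqlParser import SqlParser    # generated by ANTLR from the grammar file

lexer = SqlLexer(InputStream("select * from db.analytics.stg_orders"))
parser = SqlParser(CommonTokenStream(lexer))
tree = parser.statement()                # entry rule name depends on the grammar
print(tree.toStringTree(recog=parser))   # lisp-style rendering of the parse tree

Strictly speaking, ANTLR hands back a parse tree, which can then be walked and reshaped into the AST you actually want to work with.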

Parsing Column Relations

With our own parser in place, we decided to start adapting it to support column-level lineage. Let’s walk through the whole parsing pipeline using the same customers model above.

Step 1

The first step is to take the raw SQL and parse it into an AST. For the customers model, even a partial view of that tree is fairly involved.
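As a rough, heavily simplified sketch of its shape (hypothetical node names and structure, not our actual types), the top of the tree looks something like this:

# Heavily simplified, hypothetical sketch of the AST's shape
tree_sketch = {
    "create_view": {
        "name": "db.analytics.customers",
        "query": {
            "ctes": [
                {"name": "customers", "select": {"columns": ["*"], "from": ["db.analytics.stg_customers"]}},
                {"name": "orders",    "select": {"columns": ["*"], "from": ["db.analytics.stg_orders"]}},
                # ... payments, customer_orders, customer_payments, final ...
            ],
            "select": {"columns": ["*"], "from": ["final"]},
        },
    }
}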

Step 2

Now that we have an AST, we need to write code that actually walks the AST and pulls out the relevant data and context into an intermediate representation (IR). While this code is fairly complex, we’re essentially looking for columns and trying to answer questions like: 

“What tables reference this column?”

“Is this column inside a CTE?”

“Does this column have an alias?” 

…and so on. The IR output is in a format that’s easier to process and contains all the information we need to determine column-level lineage relationships.
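As an illustration, here’s a minimal sketch of the kind of record such an IR might carry for each column reference (hypothetical field names; the real IR tracks considerably more context):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnRef:
    name: str                       # e.g. "order_date"
    table: Optional[str]            # table or CTE the column is read from, if known
    alias: Optional[str]            # e.g. "first_order" for min(order_date) as first_order
    enclosing_cte: Optional[str]    # CTE this reference appears inside, e.g. "customer_orders"
    from_select_star: bool = False  # True when the reference comes from a select *

# One of the references pulled out of the customer_orders CTE:
ref = ColumnRef(name="order_date", table="orders", alias="first_order", enclosing_cte="customer_orders")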

Step 3

Given the IR output of Step 2, we run further preprocessing to normalize and enrich the data. As an example, we effectively remove all table aliases and replace each aliased reference with a fully qualified column name.
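For example, if a query aliased stg_orders as o, a reference like o.customer_id would be rewritten to its fully qualified form. A toy version of that substitution (hypothetical helper, simplified down to a dictionary lookup):

def qualify(column_ref: str, aliases: dict[str, str]) -> str:
    # "o.customer_id" + {"o": "db.analytics.stg_orders"} -> "db.analytics.stg_orders.customer_id"
    prefix, _, column = column_ref.rpartition(".")
    return f"{aliases.get(prefix, prefix)}.{column}" if prefix else column_ref

print(qualify("o.customer_id", {"o": "db.analytics.stg_orders"}))   # db.analytics.stg_orders.customer_id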

Step 4

Looking back at our example lineage above, let’s walk through how we could programmatically determine this lineage chain:

customer_orders.first_order →
min(order_date) as first_order from orders →
select * from db.analytics.stg_orders  

The first step is to figure out what the “root” column is. We then have to walk through several CTEs to get to the real stg_orders table. This involves taking each of the top-level columns and recursively resolving all subpaths until we reach a real table; once we do, we add it to our list of collected lineage. We repeat this for every top-level column in the query until we have all the column-level lineage for the statement.
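Here’s a minimal sketch of that recursion over a toy representation of the CTE chain (hypothetical structures; the real logic also has to handle select *, joins, nested CTEs, and more):

def resolve(column: str, scope: str, ctes: dict, real_tables: set) -> list:
    """Follow a column through CTE definitions until it bottoms out at a real table."""
    if scope in real_tables:
        return [(scope, column)]
    # ctes maps: cte name -> {output column -> (source scope, source column)}
    source_scope, source_column = ctes[scope][column]
    return resolve(source_column, source_scope, ctes, real_tables)

ctes = {
    "final": {"first_order": ("customer_orders", "first_order")},
    "customer_orders": {"first_order": ("orders", "order_date")},
    "orders": {"order_date": ("db.analytics.stg_orders", "order_date")},
}
print(resolve("first_order", "final", ctes, {"db.analytics.stg_orders"}))
# [('db.analytics.stg_orders', 'order_date')]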

Step 5

Phew—all the complex parsing is finally done. We now have a simple map of column → [lineage-columns]. The final step is post-processing, where we match the found columns to the actual warehouse schema. Whenever we find a match, we insert the column → column mapping into our lineage table. Once the data is in our lineage table, it becomes a part of the lineage graph that the rest of Metaplane can query and display.
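A rough sketch of that final matching step (hypothetical shapes for the lineage map and schema; real matching also has to deal with casing, quoting, and partially qualified names):

def to_edges(lineage: dict, schema: set) -> list:
    """Keep only mappings whose source columns actually exist in the warehouse schema."""
    edges = []
    for target_column, source_columns in lineage.items():
        for source_column in source_columns:
            if source_column in schema:
                edges.append((source_column, target_column))   # source column -> target column
    return edges

lineage = {"db.analytics.customers.first_order": ["db.analytics.stg_orders.order_date"]}
schema = {"db.analytics.stg_orders.order_date"}
print(to_edges(lineage, schema))
# [('db.analytics.stg_orders.order_date', 'db.analytics.customers.first_order')]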

While this specific example was for a create statement, the algorithm is much the same for insert, update, and merge statements. The trick in all of them lies in figuring out which columns are the “roots.”

Challenges

While the above steps may seem simple, they hide enormous challenges. These include:

  • In the customers example above, there are many select * statements in the query. This means that we must query the schema of the referenced table to determine which columns are actually being selected, and dynamically add those back into the IR (see the sketch after this list).
  • It can be difficult to distinguish between “direct” lineage that comes from selected data and “filter” lineage that comes from where clauses. This is only solvable at the AST level, as we must work out from the statement’s structure which type of lineage each column contributes.
  • The logic to traverse CTEs is quite complex: CTEs can themselves contain select * statements, and they can even be nested, so recursively searching through layers of CTEs to find the real lineage is tricky.
  • Our customers example above is quite simple, but in the real world, SQL statements can be thousands of lines long and have hundreds of CTEs and thousands of columns. Performance concerns become very relevant at that scale.
  • There are interesting challenges when covering insert statements, as the position of the columns becomes important and must be carried through many layers.
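As an example of the first point, here’s a minimal sketch of expanding a select * against the warehouse schema before lineage is resolved (hypothetical schema lookup; in practice this information comes from the warehouse’s metadata):

def expand_star(columns: list, table: str, schema: dict) -> list:
    """Replace a bare * with the table's actual columns, as reported by the warehouse."""
    expanded = []
    for column in columns:
        if column == "*":
            expanded.extend(schema[table])
        else:
            expanded.append(column)
    return expanded

schema = {"db.analytics.stg_orders": ["order_id", "customer_id", "order_date", "status"]}
print(expand_star(["*"], "db.analytics.stg_orders", schema))
# ['order_id', 'customer_id', 'order_date', 'status']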

Learnings

  • Building our own SQL parser was key to building column-level lineage. We found that many problems could only be solved, or were most easily solved, at the parser level when building our AST. The whole parser → AST → IR pipeline is unavoidably coupled, and having control over the whole stack allowed us to iterate quickly and solve problems at the right level.
  • Building a comprehensive test suite was key to making sure we could confidently make changes and iterate on the lineage parsing stack. There’s a ton of complex logic and edge cases to handle, so we have extensive test coverage to ensure that new changes don’t break things. To make sure we would actually write tests, we made it easy to add a new test case by just plopping in a raw query and defining high-level assertions against it.
  • Testing on real-world data is key. The queries we see every day are significantly more complex than anything we encountered in documentation or examples online.

Future Work

While our lineage solution is comprehensive now, we are constantly improving our parser and coverage. We are also working to handle more complex update chains that pass through temporary tables and schemas.
