Column-Level Lineage: An Adventure in SQL Parsing
Column-level lineage is one of the hottest buzzwords in the data space. And it’s easy to see why: tracing data not just from table to table, but from column to column unlocks enormous insights into how your data moves. Those insights can enable things like finer-tuned root cause analysis and more robust downstream issue prevention.
This post will cover Metaplane’s approach to column-level lineage. We’ll cover why it’s important and our technical journey towards building a fully-automated column level lineage solution for our customers.
Until recently, Metaplane, like many other data-focused companies, only supported table-level lineage. There’s a simple reason for it: technically, it’s a whole lot easier to parse out referenced tables from a SQL query. Let’s start with a relatively simple example of a customers model in the classic dbt-labs jaffle_shop:
To parse out the table names, you could imagine a few simple heuristics that would work pretty well. You could write a regex like FROM (tableName) to try and capture all the tables the query is selecting from. It would work pretty well in this case, except that it would also capture a bunch of the Common Table Expression(CTE) names as well. That’s not a problem; just write another regex to parse out and exclude the CTE names.
Then you’re (mostly) done: you have the full table level lineage for this model. There are other complexities—joined tables, aliased tables, and so on—that might need special attention. But it’s possible to get pretty far with simple(ish) string parsing.
The very first SQL parser at Metaplane more or less used this technique. Unfortunately, table names are the easiest case, since there are only so many ways that a SQL statement can reference them.
But let’s say we want to try the same thing for columns: we try to parse a list of the column names. Even in the example above, our parsing logic would get very complex very quickly, based simply on the number of ways columns can be referenced. Our first parser, written by our CEO Kevin Hu, did just that, and became infamous as a scary yet impressive piece of string “parsing” magic.
And it (generally) worked—most of the time it was able to pull out referenced tables and columns in a SQL statement. In time, we migrated to an open source SQL parser that gave us more comprehensive coverage. With these building blocks, plus the lineage metadata we got from dbt, we were able to support table-level lineage, as well as fine-but-not-great and definitely-not-comprehensive column-level lineage.
But as we started thinking about building out robust column-level lineage support, we realized that we’d need to go back to the drawing board.
Building a Complete SQL Parser
There’s one key difficulty when it comes to column-level lineage: extracting names from SQL statements simply isn’t enough. We need to understand not just the columns involved, but the structure and relationships between tables and columns within a SQL statement.
Let’s take another look at the customers model above and try to walk through the lineage for the first_order column. Doing so, we get path roughly like:
Even if we parsed and pulled out all those referenced columns, we don’t actually know how, or in what order, they reference each other. To get that information we’d have to understand the SQL statement as a full program, not just a string. While our open source SQL parser did some of this work, we ran into a few challenges:
- Warehouses like Snowflake, BigQuery, Redshift all have an ANSI SQL core, but each provider has its own “dialect,” with a long list of extensions and quirks on top of the spec. Open source parsers don’t always support these extensions, and it can be a lot of work for both us and the engineers who maintain the SQL parsing library to manage all of those intricacies.
- When we built our own parser, we wanted it to be permissive. This means that we relied on the warehouse to validate if a SQL statement was valid, and attempted to parse the metadata no matter what. But many existing open source solutions do care about validation, which made them slow to evolve and unlikely to support multiple dialects.
- We were reliant on whichever data parser we chose for the completeness of our column level lineage solution. We’d only be able to move and iterate as quickly as our parser library.
After considering all the shortcomings of the existing parsing libraries on the market, we came across the ANTLR parser generator library. This was the push we needed to take another shot at building our own parser.
Using the ANTLR library, our parser is able to take a SQL statement and convert it into an abstract syntax tree (AST) based on the grammar we define. This was the key, because that tree structure allows us to understand how statements, substatements, CTEs, and so on relate to one another. While we were only parsing table-level lineage, it became clear that the AST was fundamental for building column-level lineage.
Parsing Column Relations
With our own parser in place, we decided to start adapting it to support column-level lineage. Let’s walk through the whole parsing pipeline using the same customers model above.
The first step is to take the raw SQL and parse out an AST. Below is a partial view of the AST generated for the customers model.
Now that we have an AST, we need to write code that actually walks the AST and pulls out the relevant data and context into an intermediate representation (IR). While this code is fairly complex, we’re essentially looking for columns and trying to answer questions like:
“What tables reference this column?”
“Is this column inside a CTE?”
“Does this column have an alias?”
…and so on. The IR output is in a format that’s more easily processed, and which contains all the information we need to determine column-level lineage relationships.
Given the output IR of Step 2, we run further preprocessing to normalize and improve the data even further. As an example, we effectively remove all table aliases and replace them with a fully qualified column name.
Looking back our example lineage above, let’s walk through how we could programmatically determine this lineage chain:
The first step is to figure out what the “root” column is. We then have to walk through several CTEs to get to the real stg_orders table. This involves taking each of the top-level columns and recursively resolving all subpaths until we get to a real table, and once we do so, we add it to our list of collected lineage. We repeated this for every top-level column in the query until we have all the column-level lineage for the statement.
Phew—all the complex parsing is finally done. We now have a simple map of column → [lineage-columns]. The final step is post-processing, where we match the found columns to the actual warehouse schema. Whenever we find a match, we insert the column → column mapping into our lineage table. Once the data is in our lineage table, it becomes a part of the lineage graph that the rest of Metaplane can query and display.
While this specific example was for a create statement, the algorithm is much the same for insert, update, and merge statements. The trick for all lies in figuring out which are the “root” columns.
While the above steps may seem simple, they hide enormous challenges. These include:
- In the customers example above, there are a many select * statements in the query. This means that we must query the schema of the table to determine which tables are being selected, and dynamically add those back into the IR.
- It can be difficult to distinguish between “direct” lineage that comes from selected data vs “filter” lineage that comes from where clauses. This is only solvable at the AST level, as we must parse out which type of lineage the column would be.
- The logic to traverse CTEs is quite complex. CTEs can also have select * statements. Additionally, you can actually have nested CTEs. It’s tricky to recursively search through layers of CTEs to find real lineage.
- Our customers example above is quite simple, but in the real world SQL statements can be thousands of lines long, and have hundreds of CTEs and thousands of columns. Performance concerns become very relevant at this scale.
- There are interesting challenges when covering insert statements, as the position of the columns becomes important and must be carried through many layers.
- Building our own SQL parser was key to building column-level lineage. We found that many problems could only be or were mostly easily solved at the parser level when building our AST. The whole parser → AST → IR pipeline is unavoidably coupled. Having control over the whole stack allowed us to iterate quickly and solve problems at the right level.
- Building a comprehensive test suite was key to making sure we could confidently make changes and iterate on lineage parsing stack. There’s a ton of complex logic and edge cases to solve, so we have extensive test coverage to ensure that new changes don’t break things. To make sure we would actually write tests, we made it easy to add a new test case by just plopping in a raw query and allowing people to define high level assertions.
- Testing on real world data is key. The queries we see every day are significantly more complex than anything we encountered in documentation or examples online.
While our lineage solution is comprehensive now, we are constantly improving our parser and coverage. We are also working to handle more complex update chains that pass through temporary tables and schemas.