The Definitive Guide to Snowflake Data Lineage

Discover best practices for extracting data lineage in Snowflake, the importance of table-level and column-level lineage, and how to make it an essential part of your data workflows. The goal of this guide is to help data and analytics engineers reason about lineage and experience a taste of the technical implementation details.

Co-founder / Data and ML
May 15, 2023

Data lineage, the process of tracking data through its life cycle across various transformations, is a cornerstone of data engineering. It's a fundamental building block of data quality, governance, and collaboration, and its importance is hard to overstate.

In this article, we'll explore best practices for getting data lineage in Snowflake, one of the most popular cloud data warehouses. We'll delve into why data lineage matters, the differences between granularities of lineage, how to extract lineage from various sources, and how lineage can be leveraged to improve your data workflows.

What is Data Lineage and Why is it Important?

All data comes from somewhere, whether it's a machine logging an event or a human putting in a number. That same raw data can take many forms before it arrives at its final destination, like a Sigma dashboard or a Reverse ETL sync. Data lineage (also called data provenance in the research literature) refers to the process of tracking the flow of data through all of these various stages, transformations, and dependencies within your data pipelines.

To make things more concrete: let's say you have an `orders` table that captures all orders in an e-commerce business. What tables were transformed to calculate that `orders` table? That is upstream lineage. Now, what tables and dashboards are dependent on the `orders` table? That is downstream lineage. Taken together, and added up across all the entities within your data flows, you have an end-to-end lineage graph.

Visualization of lineage flowing from Fivetran through Snowflake tables to Looker dashboards in Metaplane.

Now that we know the upstream and downstream lineage of an `orders` table, what can we do? By itself, lineage is just information. But lineage is particularly versatile at helping data engineers and analytics engineers solve a wide set of problems across five categories:

  1. Incident management: analyzing the root cause and impact of incidents
  2. Change management: guiding migrations and understanding the impact of changes
  3. Data governance: maintaining compliance and maximizing security posture
  4. Onboarding: helping new members of the data team and data consumers gain awareness of data flows
  5. Optimization: reducing inefficiencies and redundancies across your data infrastructure

Understanding data lineage is crucial for all data ecosystems, and Snowflake's unique capabilities make it especially valuable in complex and regulated environments. With its flexible compute and storage options, Snowflake is well-suited for intricate data ecosystems where data moves through various stages, transformations, and dependencies. As your data ecosystem becomes more complex, having a clear understanding of how data is processed, transformed, and consumed within Snowflake becomes increasingly important.

Tale of Two Lineages: Table-level vs. Column-level Lineage

Table-level lineage is akin to an overview map of a city, while column-level lineage is more of a street-level view. Both are important when navigating the streets of a new city, but they serve different purposes.

Analogy between different levels of data lineage with the layers of a physical map. Source: "The many layers of data lineage" by Borja Vazquez (highly recommended read)

Table-level lineage tells you how data flows between tables in your database. It can be as simple as identifying the source and destination tables of a SELECT statement. Here's an example from that same `orders` table:

```sql
CREATE TABLE orders_summary AS
SELECT
  customer_id,
  COUNT(*) AS total_orders
FROM
  orders
GROUP BY
  customer_id;
```

In this case, the table-level lineage is straightforward: data from the `orders` table is used to create the `orders_summary` table. Therefore, if the `orders` table is deleted, the job that rebuilds `orders_summary` breaks; and if the `orders` table changes, the next rebuild of `orders_summary` reflects that change.

Note that just like tables, views in a database also have their lineage – both at the table-level and column-level. Table-level lineage for a view demonstrates how data flows from source tables into the view. It helps in understanding which tables contribute data to the view. For instance, consider the following SQL statement:

```sql
CREATE VIEW customer_view AS
SELECT
  customers.customer_id,
  customers.first_name,
  customers.last_name,
  orders_summary.total_orders
FROM
  customers
JOIN
  orders_summary ON customers.customer_id = orders_summary.customer_id;
```

Here, the table-level view lineage indicates that data from the `customers` and `orders_summary` tables flows into the `customer_view` view.

Table-level lineage can be helpful across all of the use cases we mentioned above, but often it's not granular enough. For example, we might want to make a change to a column in a table and understand the downstream impact.

But with only table-level lineage, we'd over-estimate the impact because table-level lineage wouldn't tell us anything about the individual columns involved. That's where column-level lineage comes in. It provides finer-grained details, showing how individual columns are used and transformed. Consider a slightly more complex example:

```sql
CREATE TABLE customer_summary AS
SELECT
  customers.customer_id,
  customers.first_name,
  customers.last_name,
  orders_summary.total_orders
FROM
  customers
JOIN
  orders_summary ON customers.customer_id = orders_summary.customer_id;
```

Here, column-level lineage tells us that `customer_id`, `first_name`, and `last_name` come directly from the `customers` table, while `total_orders` is sourced from `orders_summary`. It's a more detailed look at your data's journey.

Example of drilling down to column-level lineage from table-level lineage in Metaplane.
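
To make that mapping concrete, here is one way the column-level relationships from the `customer_summary` statement could be written down as source-to-target pairs. This is a minimal sketch, not a prescribed storage format:

```python
# Column-level lineage for the CREATE TABLE customer_summary statement above,
# expressed as (source column, target column) pairs. The format is illustrative.
column_lineage = [
    ("customers.customer_id",       "customer_summary.customer_id"),
    ("customers.first_name",        "customer_summary.first_name"),
    ("customers.last_name",         "customer_summary.last_name"),
    ("orders_summary.total_orders", "customer_summary.total_orders"),
]
# The join key (orders_summary.customer_id) contributes indirectly; some systems
# track that separately as "filter" or indirect lineage.
```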

Using Snowflake's Metadata for Data Lineage

Snowflake provides a range of metadata views that can be leveraged to understand data lineage. However, these views offer different levels of detail and are only available on certain Snowflake editions. Make sure to check the Snowflake documentation to verify if your organization has access to these views.

OBJECT_DEPENDENCIES View: The Macroscopic Lens of Objects

The `OBJECT_DEPENDENCIES` view shows dependencies between objects in Snowflake, such as views and UDFs. For example, you can see which objects are dependent on a particular table:

```sql
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES
WHERE REFERENCED_OBJECT_NAME = 'orders';
```

This gives you a list of all Snowflake objects that depend on the `orders` table. One of the most valuable applications is finding which views depend on a given table, which matters a great deal if you lean heavily on views. Lineage without table-to-view relationships is like a map without state highways.

However, `OBJECT_DEPENDENCIES` will not give us table or column dependencies as established through SQL queries, which are the true foundation of our lineage map. They are both the interstate highways and the local roads. For that, we will have to use the `ACCESS_HISTORY` view.

ACCESS_HISTORY View: Tracing the Footprints of Queries

The `ACCESS_HISTORY` view records, for each query, which objects and columns were read and which were modified by write operations like `INSERT`, `UPDATE`, and `DELETE`. For example, you can find all objects associated with a specific query with id `query_id`:

```sql
SELECT DIRECT_OBJECTS_ACCESSED, BASE_OBJECTS_ACCESSED, OBJECTS_MODIFIED
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE QUERY_ID = 'query_id';
```

With the JSON results returned by this query, we're able to extract the exact dependencies of all queries, which lets us establish the full map of column-to-column dependencies. Crucially, these are the dependencies as determined by Snowflake's query parser, which is the ultimate source of truth.
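
As a rough sketch, here's how the `OBJECTS_MODIFIED` JSON could be flattened into column-to-column edges in Python. The key names below follow the documented shape of that field (each modified object carries a `columns` list whose entries have `directSources`); verify them against what your account actually returns:

```python
def column_edges_from_objects_modified(objects_modified: list[dict]) -> list[tuple[str, str]]:
    """Flatten ACCESS_HISTORY.OBJECTS_MODIFIED into (source column, target column) pairs.

    Assumes the documented JSON shape, roughly:
    [{"objectName": "DB.SCHEMA.ORDERS_SUMMARY",
      "columns": [{"columnName": "TOTAL_ORDERS",
                   "directSources": [{"objectName": "DB.SCHEMA.ORDERS",
                                      "columnName": "ORDER_ID"}]}]}]
    """
    edges = []
    for obj in objects_modified:
        target_table = obj.get("objectName", "")
        for col in obj.get("columns", []):
            target = f"{target_table}.{col.get('columnName', '')}"
            for src in col.get("directSources", []):
                source = f"{src.get('objectName', '')}.{src.get('columnName', '')}"
                edges.append((source, target))
    return edges
```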

Note that `ACCESS_HISTORY` requires Enterprise Edition (or higher) and, as such, isn't available to all customers. If you do have access to this view, you should 100% start to play around with it. But without access to `ACCESS_HISTORY`, we will have to undertake the journey of parsing through the queries ourselves.

There are Levels to this: Parsing Query History

Snowflake's `QUERY_HISTORY` view is another powerful tool to obtain data lineage. This view contains historical data about all queries that have run in your Snowflake environment, including the SQL text of the query, the user who ran it, the warehouse it ran on, and more.

There are different levels of query history available in Snowflake: account-level, user-level, session-level, and warehouse-level. 

  • Account-level query history gives you a view of all queries across your entire Snowflake account. This is useful for administrators who need to understand usage across all users and workloads. 
  • User-level query history provides a record of all queries run by a particular user, which can help troubleshoot user-specific issues or understand individual usage patterns.
  • Session-level query history is limited to queries run within a single session, which is handy for debugging or analyzing the performance of a specific set of related queries.
  • Warehouse-level query history returns queries executed by specific warehouses, which can help understand queries constrained to a specific scope, context, or tool.

Here's an example of how to use `QUERY_HISTORY` to get lineage information:

```sql
SELECT QUERY_TEXT
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT LIKE '%orders%'
ORDER BY START_TIME DESC;
```

This query will return the SQL text of all queries that reference the `orders` table, giving you a picture of how data from the `orders` table is being used and transformed across your Snowflake environment.

Even for relatively simple SQL queries, the parsing logic for columns can become incredibly complex. An initial approach could be to use string parsing techniques. However, while that might work for table names, it falls short for column names due to the variety of ways columns can be referenced, as the sketch below illustrates.
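
To see why, here's a deliberately naive sketch: a regular expression can pull candidate table names out of `FROM` and `JOIN` clauses reasonably well, but there's no comparable trick for columns, which may be aliased, qualified, wrapped in expressions, or expanded from a `*`:

```python
import re

SQL = """
CREATE TABLE customer_summary AS
SELECT c.customer_id, c.first_name, o.total_orders
FROM customers AS c
JOIN orders_summary AS o ON c.customer_id = o.customer_id
"""

# Naive: grab the identifier following FROM/JOIN. This misses subqueries, CTEs,
# quoted identifiers, and fully qualified names, and it says nothing about which
# *columns* feed which outputs.
tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", SQL, flags=re.IGNORECASE)
print(tables)  # ['customers', 'orders_summary']
```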

To achieve robust column-level lineage support, a different approach is needed. One fundamental challenge is that merely extracting names from SQL statements is insufficient. Instead, it's crucial to understand not only the columns involved but also the structure and relationships between tables and columns within a SQL statement.

A solution to this problem involves building a complete SQL parser that can handle the various dialects and extensions of SQL used by different warehouses. Using a parser generator library like ANTLR, it's possible to convert a SQL statement into an abstract syntax tree (AST) based on a defined grammar. This tree structure enables understanding how different parts of the SQL statement relate to each other, which is key for building column-level lineage.

Example of an Abstract Syntax Tree (AST) parsed from a SQL query.

After parsing an AST from the SQL statement, code is written to walk the AST and pull out relevant data and context into an intermediate representation (IR). The IR output is in a format that's more easily processed and contains all the necessary information to determine column-level lineage relationships.

Subsequent preprocessing can normalize and improve the data even further, including removing all table aliases and replacing them with fully qualified column names. From there, a recursive approach can be used to determine the lineage chain for each column, adding to a list of collected lineage until all column-level lineage for the statement is obtained.

Finally, the lineage is matched to the actual warehouse schema, and the column-to-column mapping is inserted into a lineage table. This then becomes part of the lineage graph that can be queried and displayed.
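
The steps above describe a purpose-built ANTLR parser. If you just want to experiment with the parse-then-walk idea, an off-the-shelf parser can stand in; here's a minimal sketch using the open-source sqlglot library, purely as an illustration of the approach rather than the implementation described above:

```python
import sqlglot
from sqlglot import exp

sql = """
CREATE TABLE customer_summary AS
SELECT c.customer_id, c.first_name, o.total_orders
FROM customers AS c
JOIN orders_summary AS o ON c.customer_id = o.customer_id
"""

# Parse the statement into an abstract syntax tree using the Snowflake dialect.
tree = sqlglot.parse_one(sql, read="snowflake")

# Walk the AST: every Table node is a potential dependency (including the
# created target table), every Column node a potential contributor to the output.
tables = {t.name for t in tree.find_all(exp.Table)}
columns = {f"{c.table}.{c.name}" if c.table else c.name for c in tree.find_all(exp.Column)}

print(tables)   # {'customer_summary', 'customers', 'orders_summary'}
print(columns)  # {'c.customer_id', 'c.first_name', 'o.total_orders', 'o.customer_id'}
```

Note that the column references still carry table aliases (`c`, `o`); resolving those back to fully qualified tables and chasing each output column to its sources is exactly the preprocessing and recursion described above.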

Despite the apparent simplicity of these steps, they present tricky technical challenges even with relatively mundane SQL operations. These include handling `SELECT *` statements, distinguishing between direct and filter lineage, navigating the complexity of Common Table Expressions (CTEs), managing performance issues with large SQL statements, and handling the intricacies of `INSERT` statements. If you're interested, our founding engineer Todd goes into more detail in this piece.

Caveat: Non-SQL Transformations

As we've seen, Snowflake's metadata views and query history provide a robust framework for understanding data lineage within the SQL layer. They are an indispensable part of our lineage playbook. However, as much as they help us, they have their blind spots. Specifically, they might not fully capture transformations occurring outside the SQL layer. Let's delve into these hidden chapters of data lineage.

The Case of In-Memory Transformations

Consider the scenario where your data pipeline involves in-memory transformations. Perhaps you're using a data processing framework like Apache Spark or Pandas to perform complex operations on your data. These transformations might involve data cleaning, aggregation, reshaping, or even machine learning algorithms.

Unfortunately, these in-memory transformations leave no trace in Snowflake's SQL-based lineage views. This invisibility is due to the simple fact that these transformations occur outside Snowflake, in your application's memory. Therefore, the metadata views, like OBJECT_DEPENDENCIES or QUERY_HISTORY, can't peek into these operations.

External Processes: The Offstage Actors

Similarly, external processes that manipulate data before loading it into Snowflake are also beyond the purview of Snowflake's metadata views. For instance, consider a scenario where you're using a real-time data processing system like Apache Kafka or AWS Kinesis. Such systems might be employed to handle streaming data, filter it, enrich it, or reformat it before it lands in your Snowflake tables.

Again, these transformations are happening offstage, outside of Snowflake's SQL layer. Therefore, they won't show up in the lineage derived from Snowflake's metadata views or query history.

Pre-ETL Data Wrangling

Often, raw data isn't in a pristine state suitable for direct ingestion into a data warehouse. It needs to go through a series of data wrangling steps to cleanse, normalize, and format it. Tools like Trifacta or OpenRefine are commonly used for these tasks.

The transformations performed during this pre-ETL phase are critical pieces of your data's story. Yet, they won't appear in Snowflake's SQL-based lineage. This is because these transformations take place even before the data enters the Snowflake ecosystem.

Lineage Before and After the Warehouse

So far, we've only discussed lineage within a Snowflake data warehouse. While data warehouses are the centers of gravity within your data infrastructure, they neither produce nor consume data. Data is produced upstream and consumed downstream. Therefore, a fully comprehensive view of data lineage in Snowflake must extend to the edges of your ecosystem.

Before the Warehouse

Data lineage starts at the very beginning of your data infrastructure, long before data lands in your warehouse. Data could originate from Kafka event streams, or spreadsheets uploaded into the warehouse, or through an assortment of other means.

Example of lineage from ELT (Fivetran sync) to BI (Sigma workbook) in Metaplane.

Here, we'll focus on data from transactional databases and applications that flows into your warehouse through Extract, Load, Transform (ELT) tools like Fivetran, Stitch, Matillion, and Airbyte. These tools extract raw data from sources and load it into a warehouse like Snowflake, typically standardizing it into a more analysis-friendly format along the way.

The most straightforward approach to establishing lineage from a table or column in a transactional DB or application to a table or column in Snowflake is through an API like the Fivetran Metadata API. Endpoints such as `GET https://api.fivetran.com/v1/metadata/connectors/{connector_id}/columns` let you map qualified names in a source to qualified names in the warehouse.
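
As a sketch, a call to that endpoint might look like the following. The basic-auth scheme matches Fivetran's REST API, but the credentials and connector id are placeholders, and the response field names (`name_in_source`, `name_in_destination`) are assumptions to verify against the Metadata API docs:

```python
import requests

# Hypothetical credentials and connector id; the endpoint is the one named above.
API_KEY, API_SECRET = "your_api_key", "your_api_secret"
CONNECTOR_ID = "connector_id"

resp = requests.get(
    f"https://api.fivetran.com/v1/metadata/connectors/{CONNECTOR_ID}/columns",
    auth=(API_KEY, API_SECRET),  # Fivetran's REST API uses basic auth
    timeout=30,
)
resp.raise_for_status()

# Map source columns to destination (warehouse) columns. The field names here
# are assumptions; check them against the Metadata API documentation.
for item in resp.json().get("data", {}).get("items", []):
    print(item.get("name_in_source"), "->", item.get("name_in_destination"))
```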

However, not all ELT tools have metadata APIs, which calls for more involved methods. If you are orchestrating an open-source ELT tool, or using a customizable commercial one, it's possible to emit lineage after each run. Another alternative is inferring lineage by comparing the names, structure, and types of source tables with those of target tables.

Typically, a comprehensive view of upstream lineage will require a combination of API-based, manually annotated, and inferred lineage.

After the Warehouse: Incorporating BI and Reverse ETL

Once your data resides in the warehouse, it becomes the foundation for various downstream applications. The canonical use case is helping stakeholders make data-driven decisions using Business Intelligence (BI) tools like Looker, Sigma, Tableau, and Mode. More recently, reverse ETL tools like Census and Hightouch have emerged to transport warehouse data back into operational workflows and tools.

These tools play a crucial role in visualizing, analyzing, and operationalizing data. As a result, lineage from warehouses to the primitives (like reports) in these tools is critical for understanding the full extent of data within an organization and the downstream impact of changes.

However, incorporating these tools adds another layer of complexity to the data lineage for two reasons. First, transformations may occur within these tools, sometimes outside of the scope of the warehouse. Second, transformations and primitives may only be accessible via an API, if at all.

For instance, let's consider a Looker view designed to track monthly revenue. It builds a `monthly_revenue` derived table from the `transactions` table in Snowflake, applying transformations and calculations expressed as SQL within LookML:

```lookml
view: monthly_revenue {
  derived_table: {
    sql: SELECT
      DATE_TRUNC('month', transaction_date) AS month,
      SUM(transaction_amount) AS revenue
    FROM transactions
    GROUP BY month ;;
  }

  dimension: month {
    type: time
    timeframes: [month, year]
    sql: ${TABLE}.month ;;
  }

  measure: revenue {
    type: number
    sql: ${TABLE}.revenue ;;
  }
}
```

By parsing LookML, whether with a regular expression, a custom-built string parser, or an AST-based query parser like the one discussed above, it's possible to extract the mapping between Snowflake tables and columns (e.g., `monthly_revenue.revenue`) and Looker entities like measures and dimensions.
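
Here's a minimal sketch of the regular-expression route. The regexes are intentionally simple and purely illustrative; a real implementation would use a LookML parser or handle edge cases far more carefully:

```python
import re

# A trimmed copy of the LookML view above (in practice you'd read the .lkml file).
lookml = """
view: monthly_revenue {
  derived_table: {
    sql: SELECT DATE_TRUNC('month', transaction_date) AS month,
                SUM(transaction_amount) AS revenue
         FROM transactions
         GROUP BY month ;;
  }
  dimension: month { type: time sql: ${TABLE}.month ;; }
  measure: revenue { type: number sql: ${TABLE}.revenue ;; }
}
"""

# Pull the derived table's SQL out of the LookML (";;" terminates SQL blocks).
match = re.search(r"derived_table:\s*\{\s*sql:\s*(.+?);;", lookml, flags=re.DOTALL)
derived_sql = match.group(1).strip() if match else None
print(derived_sql)  # hand this to the same SQL parser used for query history

# Map each dimension/measure's ${TABLE}.column reference back to a warehouse column.
fields = re.findall(r"(dimension|measure):\s*(\w+)\s*\{[^}]*?\$\{TABLE\}\.(\w+)",
                    lookml, flags=re.DOTALL)
for kind, field_name, column in fields:
    print(f"{kind} {field_name} <- monthly_revenue.{column}")
```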

From there, it's possible to use the Looker API to retrieve lineage from Looker measures and dimensions to Looker dashboard elements to Looker dashboards. The full chain from Snowflake column to Looker dashboard is then established.

A similar pattern exists with tools like Tableau and Power BI, which provide both transformations that require parsing and APIs that expose explicit lineage relationships before and after those transformations. Tools like Sigma and Mode are slightly simpler because their transformations are expressed in raw SQL, which lets us repurpose the lineage parsers built for query history.

Moreover, some BI tools provide APIs that grant access to additional metadata about the transformations and primitives within their ecosystems, further enhancing data lineage extraction. For example, the Tableau Metadata API offers insights into Tableau content, including data sources and their corresponding columns, as well as the relationships between them. Mode, for its part, allows extraction of report and space metadata through its API.

By utilizing these APIs, you can extract valuable information about data transformations, relationships, and dependencies within your BI tools, enriching your overall data lineage.

Entity Resolution: Stitching Together the Threads of Data Lineage

As our data pipeline grows more complex with multiple sources, transformation steps, and downstream applications, maintaining a holistic view of data lineage becomes more challenging. This is where entity resolution steps in. It's like a detective working behind the scenes, matching pieces of information across different systems to form a cohesive picture. In the context of data lineage in Snowflake, entity resolution involves matching tables and columns fetched from external APIs to those in your Snowflake information schema.

Why is Entity Resolution Important?

Consider the scenario where you are extracting data from a transactional database using an ELT tool like Fivetran or Stitch, transforming it using a data processing system like Spark, and then loading it into Snowflake. Once inside Snowflake, this data may undergo further transformations before being consumed by a BI tool like Looker.

In this pipeline, each component has its own metadata about the data it handles, available through APIs or system views. Your transactional database, ELT tool, Spark, Snowflake, and Looker, all hold pieces of your data's lineage story.

However, these pieces of information are like puzzle pieces scattered across different systems. To get a complete picture of your data lineage, you need to connect these pieces together. This is where entity resolution becomes critical.

Entity Resolution in Action

Let's say you have a `users` table in your transactional database, which is extracted by Fivetran and loaded into Snowflake as `fivetran_users`. Later, you create a view in Snowflake called `active_users` based on `fivetran_users`. Finally, you have a Looker model that uses this `active_users` view.

Now, Looker's API tells you that your model uses a table called `active_users`. But how do you connect this `active_users` in Looker to the `users` table in your transactional database?

This is where you'd use entity resolution:

1. First, you'd look at Snowflake's `OBJECT_DEPENDENCIES` or `QUERY_HISTORY` views to find out that `active_users` is dependent on `fivetran_users`.

2. Next, you'd use the metadata from Fivetran (available through its API) to connect `fivetran_users` in Snowflake to the `users` table in your transactional database.

By connecting these pieces of information, you've just traced the lineage of your Looker model all the way back to the source table in your transactional database!
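
Here's a minimal sketch of that matching step: normalize qualified names from each system into a common form, then compare them. The helper and its defaults (upper-casing, stripping quotes, falling back to an assumed database and schema) are illustrative only; real resolution usually needs more signals, as discussed below.

```python
def normalize(name: str, default_db: str = "ANALYTICS", default_schema: str = "PUBLIC") -> str:
    """Normalize a possibly partial object name into DATABASE.SCHEMA.OBJECT form.

    The default database and schema are placeholder assumptions; in practice they
    come from the connection or tool configuration.
    """
    parts = [p.strip('"').upper() for p in name.split(".")]
    if len(parts) == 1:
        parts = [default_db, default_schema] + parts
    elif len(parts) == 2:
        parts = [default_db] + parts
    return ".".join(parts)

# Names as reported by different systems for (what should be) the same object.
looker_table = "active_users"                     # from the Looker model
snowflake_view = "analytics.public.ACTIVE_USERS"  # from OBJECT_DEPENDENCIES
assert normalize(looker_table) == normalize(snowflake_view)
```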

Ensuring Effective Entity Resolution

Entity resolution is not without its challenges. Inconsistent naming conventions, schema changes, and lack of unique identifiers can all make it difficult to match entities across different systems. Here are a few best practices to ensure effective entity resolution:

  1. Adopt consistent naming conventions across your data pipeline.
  2. Track schema changes and reflect them in your lineage tracking.
  3. Use unique identifiers wherever possible.
  4. Regularly update your entity resolution logic as new systems or datasets are added to your pipeline.
  5. Reconcile against additional pieces of information, like the number of columns in a table, or the last time a table was accessed.

Storing Lineage (aka: Ancestry.com?)

The approaches above are useful for getting snapshots of lineage for one-off, synchronous usage. But to use lineage in more complex workflows, like understanding how lineage changes over time, it's essential to persist the lineage information.

Persisting Lineage in Your Database

To make data lineage readily available, it's essential to persist the lineage information in your database. This involves storing the relationships, dependencies, and transformations captured in the lineage graph within your database infrastructure.

Choosing the right database technology is crucial for persisting and querying lineage data. Relational databases, such as PostgreSQL, are widely used and offer advantages in managing structured data. They provide a flexible schema design, robust querying capabilities using SQL, and strong transactional guarantees. However, querying complex lineage relationships can become challenging as lineage graphs grow, and recursive traversals become more expensive.

On the other hand, graph databases like Neo4j are purpose-built for managing interconnected data, making them well-suited for lineage information. They efficiently represent and query complex relationships, enabling organizations to explore lineage patterns and perform graph-based analytics. Graph databases scale well and can handle larger lineage graphs effectively. However, they require careful consideration and optimization of the graph schema.

Having used both relational databases and graph databases for storing lineage information, I think ergonomics matter more than scaling for many data ecosystems. That is, unless you have on the order of tens of thousands of nodes and hundreds of thousands of edges (which is entirely possible), performance is likely comparable.

Note on hierarchical relationships: When designing a database model for lineage data, it's important to consider both lineage relationships and hierarchical relationships.

Lineage relationships capture the dependencies and transformations between data elements, while hierarchical relationships represent higher-level organizational structures, such as schemas containing multiple tables or spaces in tools like Mode containing multiple reports. Without representing hierarchical relationships, it's difficult to "roll up" lineage relationships to different levels of granularity, for example to go from column-level lineage to table-level lineage.

By incorporating both types of relationships in the data model, organizations can gain a comprehensive understanding of the data ecosystem's flow, dependencies, and structural organization.
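
As a small sketch of why both relationship types matter, here's a NetworkX graph that stores column-level lineage edges alongside table-to-column "contains" edges, then rolls the former up to table-level lineage (node names come from the earlier `orders` example):

```python
import networkx as nx

G = nx.DiGraph()

# Hierarchical ("contains") edges: table -> column.
G.add_edge("orders", "orders.customer_id", kind="contains")
G.add_edge("orders_summary", "orders_summary.customer_id", kind="contains")

# Lineage edges: source column -> target column.
G.add_edge("orders.customer_id", "orders_summary.customer_id", kind="lineage")

def table_of(column_node: str) -> str:
    # Walk the "contains" edge backwards to find the parent table.
    return next(p for p in G.predecessors(column_node)
                if G[p][column_node]["kind"] == "contains")

# Roll column-level lineage up to table-level lineage.
table_edges = {
    (table_of(src), table_of(dst))
    for src, dst, data in G.edges(data=True)
    if data["kind"] == "lineage"
}
print(table_edges)  # {('orders', 'orders_summary')}
```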

Building Visualization Layers on Top of It

Visualization is a key aspect of making data lineage actionable. By building visualization layers on top of the persisted lineage data, you can create intuitive and informative visual representations of your data lineage graph. Here are some of the options available:

  1. Visualization Tools: Tools like Looker and Tableau, which you probably already use, provide powerful capabilities for building interactive and visually appealing data lineage visualizations.
  2. Desktop Tools: Gephi and Cytoscape are desktop applications specifically designed for visualizing and analyzing complex networks. These tools excel in visualizing large-scale data lineage graphs with intricate relationships, and they offer advanced layout algorithms, interactive features, and customization options.
  3. Python Libraries: Python libraries such as NetworkX and Plotly are widely used for visualizing data lineage in a programmatic manner. NetworkX provides a robust set of tools for creating, manipulating, and analyzing network graphs, while Plotly offers interactive visualization capabilities with a focus on web-based visualizations (see the sketch after this list).
  4. Web Libraries: Web-based visualization libraries like D3.js and Vis.js offer a wide range of graph layout algorithms, interactive controls, and animation capabilities, allowing you to create engaging and informative data lineage visualizations that can be accessed and explored by users through a web interface.
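
For a quick local look at a persisted lineage graph, a few lines of NetworkX plus matplotlib go a long way before reaching for the heavier options above; the edges below are illustrative placeholders:

```python
import matplotlib.pyplot as plt
import networkx as nx

# A handful of table-level lineage edges, e.g. produced by the roll-up above.
edges = [
    ("raw.orders", "analytics.orders_summary"),
    ("raw.customers", "analytics.customer_summary"),
    ("analytics.orders_summary", "analytics.customer_summary"),
    ("analytics.customer_summary", "looker.customer_dashboard"),
]

G = nx.DiGraph(edges)
nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), node_color="lightblue",
                 node_size=1500, font_size=8, arrows=True)
plt.axis("off")
plt.show()
```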

Deprecating Lineage When It's No Longer Relevant

Data lineage evolves as your data ecosystem changes. It's crucial to deprecate lineage information when it becomes obsolete or irrelevant. For example, when tables or columns are removed, or transformations are updated, the associated lineage should be marked as deprecated. This ensures that users are working with up-to-date and accurate lineage information, avoiding confusion and potential errors.

Making Lineage Useful: Visualization, Integration, and Beyond

Having access to raw data lineage information is valuable, but to truly make this information actionable, it needs to be visualized and integrated into your workflows, while also serving various operational and strategic purposes.

1. Root Cause Analysis After an Incident

Data lineage is critical for root cause analysis after a data incident. Anomalies can occur at any point in your data pipeline, and without a clear understanding of your data lineage, it can be like finding a needle in a haystack. However, with accurate data lineage, you can trace back from the point of error to its origin, thereby quickly identifying and addressing the root cause.

2. Impact Analysis After an Incident or Change

Whenever a change is made to a data source, it can have downstream effects that are not immediately apparent. Impact analysis using data lineage allows you to understand what downstream objects will be affected if you're planning to change a table or a column. This capability helps you anticipate and manage the ripple effects of changes, thereby minimizing disruptions and maintaining data integrity.

Example of a Pull Request comment that shows downstream impact of a change.

3. Reducing Inefficiencies

In complex data ecosystems, it's common to have redundant data flows or transformations. By visualizing your data lineage, you can identify these redundancies and streamline your data processes. This not only improves your system's performance but also reduces costs associated with storage and compute resources. It's equally common to have plain inefficiencies: most code can be optimized in some way, and understanding which tables or columns have the most dependencies can help prioritize what to refactor and optimize.

4. Onboarding New Members of the Data Team

Data lineage visualization is a powerful tool for onboarding new team members. It provides them with an overview of the data landscape, helping them understand data sources, transformations, and destinations. This will enable them to get up to speed quickly and contribute effectively to the team.

5. Onboarding New Data Consumers

Similarly, data lineage is crucial when onboarding new data consumers, such as analysts or business users. It helps them understand where the data comes from, how it's transformed, and what it means, enhancing their trust in the data and enabling them to make more informed decisions.

6. Migrating Systems

Migrations are a common use case where data lineage becomes particularly useful. When migrating data from one system to another or restructuring the data infrastructure, understanding the impact on data flows and dependencies is critical. Data lineage helps you assess the potential impact of migrations by visualizing the relationships between data sources, transformations, and downstream objects. This allows you to plan and execute migrations with confidence, minimizing disruptions and ensuring data integrity.

7. Improving Data Governance and Compliance

Maintaining accurate data provenance through comprehensive data lineage is crucial for enhancing data governance, regulatory compliance, and protecting against data exfiltration. By understanding how data flows and implementing proper access controls, organizations can ensure the security of sensitive information and mitigate the risk of unauthorized access. This level of data governance fosters trust, strengthens security measures, and enables compliance with regulatory frameworks.

Integration

In addition to visualization, the integration of data lineage into your workflows can also be automated. With APIs, you can automate the process of capturing data lineage and incorporating it into your data catalog, helping you maintain an up-to-date view of your data landscape. This ensures that your lineage information is always current and accessible, further enhancing its value for the above jobs-to-be-done.

By thoroughly understanding your data lineage, you can manage changes effectively, reduce inefficiencies, onboard team members and data consumers efficiently, and perform root cause and impact analysis with confidence. This not only improves your day-to-day operations but also helps you leverage data as a strategic asset.

Words of Encouragement (aka: Manage Your Expectations)

As you embark on your cross-continental lineage journey, it's important to set realistic expectations.

Attaining 100% accuracy and 100% coverage is arguably impossible. Query parsing will never be flawless without access to proprietary query engines, in-memory transformations are difficult to track, and not all tools offer lineage APIs. Moreover, lineage is dynamic and susceptible to going stale.

But even if perfect lineage is out of reach, useful lineage is very much within reach. Just like all models are wrong but some are useful, all lineage is incomplete but some is useful. That's why it's crucial to define and zero in on the use case that you and your team are trying to accomplish.

Taking a focused approach allows you to identify the point of diminishing returns and prioritize your efforts strategically. Moreover, tethering your progress to a concrete goal is the best way to keep your mind sharp and maintain your motivation on a problem that can feel like pushing a boulder up a hill. How do you eat an elephant (or pumpkin, if you prefer)? Bite by bite.

Summary

Data lineage plays a crucial role in managing and understanding your data landscape. It provides visibility into how data flows and transforms across your systems, which is invaluable for data governance, compliance, debugging, and change impact analysis.

In the context of data lineage, there are two main levels to consider: table-level and column-level lineage. Table-level lineage gives you a broad view of how tables depend on each other, while column-level lineage provides a more granular view into how individual columns are transformed and used.

To capture data lineage, you can leverage database metadata, which includes details like table dependencies and query histories. In Snowflake, you can use views like `OBJECT_DEPENDENCIES`, `ACCESS_HISTORY`, and `QUERY_HISTORY` to gather this information.

However, raw data lineage information is most valuable when visualized and integrated into your workflows. Visualizing data lineage helps you understand complex data relationships, and integrating data lineage into your workflows allows you to use this information for things like impact analysis and debugging.

Data lineage can help with a number of jobs-to-be-done including root cause analysis, impact analysis, reducing inefficiencies, and onboarding new members of the team or data consumers. The right approach to data lineage can provide you with a more robust, transparent, and efficient data landscape.

Obligatory self-promotion: Lineage is an interesting and hard problem, in part because you are never "done." As a result, developing a lineage system from scratch can be rewarding (we did it) but time-consuming and tedious.

Building for yourself is the best way to learn, of course. But if you just want results, Metaplane is a good option: we automatically parse column-level lineage from query history and pull in upstream and downstream dependencies using APIs. This lineage information is displayed in scalable visualizations (scaling to thousands and tens of thousands of entities) and embedded within your CI/CD and incident management workflows.

Start for free or book a demo to make your Snowflake data lineage easier and more comprehensive with Metaplane.
