The ultimate guide to data lineage in dbt
Learn why data lineage is important in dbt, and how to implement it for better data visualization, easier debugging, and more collaboration.
Imagine someone giving you directions to a destination, but you don’t have a map or GPS. The explanation is: “Take the highway north, then follow the signs to downtown.” You might know the general area you're headed to, but without clear directions, you’re left piecing things together, wondering where the highway starts, how far north to go, or even if you’re heading in the right direction.
That’s what it’s like trying to explain how data arrived at a certain point without data lineage. Data lineage is like having a detailed map or GPS. It shows you the exact route your data takes from its origin to its final destination, detailing every stop, transformation, and process each step of the way.
With data lineage, organizations can avoid making decisions based on incorrect, incomplete, or outdated data because they can identify where it came from, if/how it was altered, or whether it's trustworthy.
Data lineage serves two main purposes in data management: tackling technical challenges and empowering data teams. If dbt is part of your data stack like it is for Appcues and Reforge, you can trace errors back to their source, plan upstream changes, and maintain accuracy, governance, and transparency in your analytics pipelines.
Below, we’ll explain what data lineage is, why it matters and how to implement it in dbt.
What is data lineage?
Data lineage is the process of tracking data from its origin to its destination, including all transformations and movements in between. It provides a detailed map of how data flows through an organization’s pipelines and how it impacts downstream systems.
Techniques for capturing data lineage include metadata management, data profiling and data mapping. dbt automates data lineage by using its dependency graph, which maps the relationships between different nodes, such as sources, models, snapshots, and metrics.
dbt is a command-line tool that enables data engineers and analysts to effectively transform data stored in warehouses. Essentially, it is the T in ELT: dbt takes your model code, compiles it to SQL, and runs it against your data warehouse.
To fully understand data lineage in dbt, let’s look at Car and Classic’s data stack.
Car and Classic have designed their data infrastructure to prioritize efficiency, scalability, and engineering best practices. They have a structured approach to data modeling and orchestration. Car and Classic define data models using reusable modular code, which means engineers don’t need to re-develop the same model 50 times.
Raw data comes from various sources such as marketing, ads and payment data and is stored in Snowflake. From here, dbt transforms the raw data into clean, structured datasets, while also creating a map of how these datasets evolve over time.
The transformation process is tracked through dbt's lineage feature, which allows the team to view how each model is connected to others and ensures transparency at every stage. Business teams use business intelligence (BI) tool Metabase to visualize and query the transformed data.
The data engineering team also uses the reverse ETL tool Hightouch to sync the transformed data into operational tools for actions such as personalizing campaigns or updating customer records. Metaplane oversees the entire pipeline, ensuring data integrity and quality by detecting anomalies and maintaining trust in the data.
With this structured approach to data lineage, the engineering team confidently builds, maintains, and iterates on their data products while reducing the complexity and time spent on developing new models.
Why is data lineage important?
Data lineage is important because it improves data quality and accuracy. It helps engineers confirm that data has been transformed correctly, understand how it was changed and updated, and preserve its integrity as it flows through an organization.
Data lineage supports data accuracy and consistency by letting engineers evaluate upstream and downstream pipeline changes to discover anomalies and correct them. Engineers can trace and troubleshoot data points, and identify and fix issues before the data reaches endpoints like BI and data visualization tools.
Data lineage is also helpful for meeting regulatory requirements like GDPR and CCPA. Most data privacy laws require organizations to demonstrate how they collect, store, process, and delete personal data. Data lineage helps organizations demonstrate compliance with these laws by providing an audit trail of how personal data is collected, processed, and shared across the organization.
What are the benefits of data lineage?
The benefits of data lineage can be summarized as follows: enhanced data quality, improved data governance, easier debugging, and improved compliance.
Enhanced data quality: Maintaining the quality of your data throughout its entire lifecycle is challenging. Data lineage uncovers the cause of data quality issues, such as incomplete data sources. Using data observability tools like Metaplane enhances the detection and alerting of issues.
Improved data governance: Data lineage provides transparency and traceability in data movement and transformations. This clear, detailed view of how data flows allows organizations to understand their data and apply governance policies more effectively.
For instance, financial data is often a mix of raw transactional data and aggregated metrics. Data lineage enables engineers to trace the path of each type of data. By knowing the origin and transformations of the data, you can distinguish between transactional data and aggregated metrics.
Easier debugging: Data quality issues are inevitable. Data lineage enables you to find the root cause of the data quality issue, a technique called root cause analysis (RCA). With RCA, you can swiftly identify the root cause of the broken pipeline or data issue and easily fix it.
With data lineage, you can visually see which downstream models, nodes, and exposures are impacted by big upstream changes such as source or model renaming or removals. When you know the exact affected components, it’s easier to debug as you reduce the time investigating the potential issue.
Improved compliance: Non-compliance with data and privacy laws can be costly. Data lineage helps data and legal teams keep on top of data compliance laws because it provides a roadmap of your data, which you can audit.
For example, in financial sectors, data lineage tracks transactional data, which consists of sensitive details about financial transactions. This can include account numbers, payment details, and amounts. This data is subject to strict access controls, encryption and monitoring.
Data lineage provides a clear overview of data handling and processing practices. This overview and visibility are essential for regulatory reviews and audits. By quickly identifying and resolving compliance issues, organizations can improve their compliance.
Why data lineage matters in dbt
Data lineage in dbt provides visibility into how data flows, transforms, and is used across your analytics pipeline. dbt lets you define dependencies between models and tag columns, views, tables, and logic, which establishes a clear and traceable flow of data throughout the pipeline.
dbt automates data lineage tracking using its dependency graph, which maps the relationships between different data assets, such as sources, models, snapshots, and metrics.
dbt has become increasingly popular in analytics engineering because analysts can run the tool themselves and add it to a CI/CD pipeline, letting them trace lineage on their own and freeing up valuable data engineering time.
Data lineage in dbt enables the following:
Clear visualization of dependencies
dbt automatically builds a Directed Acyclic Graph (DAG) that represents the relationships between models, tables, and transformations. This visual representation lets data teams see which datasets depend on each other, where the data originates, and how it flows.
Dataset dependency helps you understand how changing one model affects downstream datasets. Knowing where the data originates and how it flows enables you to trace data back to its raw source and verify the data to improve data accuracy.
For instance, if you rename a column and this leads to a break in data, you can quickly identify whether the issue originated from upstream changes in the raw data source or if it took place during the transformation.
Easier debugging and maintenance
Data lineage in dbt leads to easier debugging and maintenance. For instance, in dbt you can:
- Pinpoint where errors occur: For example, if a report is missing data, you can easily identify whether the issue lies in the source data, a transformation, or a downstream dependency.
- Test specific parts of the pipeline: You can isolate a problematic node in the lineage and run targeted tests and validate data quality without needing to rerun the entire pipeline.
Simplified collaboration
Analysts can explore and modify models without needing deep analytics engineering knowledge, since working with lineage in dbt only requires SQL, YAML, and Jinja templating. Data teams can also add meta fields and documentation to nodes in the DAG to provide context for analysts and stakeholders.
For instance, you can:
- Describe tables, columns, and transformations: Provide context on what each dataset represents and how it is derived.
- Enable discovery: New team members or stakeholders can explore the lineage and understand the relationships without needing further explanations.
For example, an e-commerce data team can document that the column `total_revenue` in the `Sales Summary` table is derived by summing `purchase_amount` from the `Transactions` table, so anyone can verify the source and transformations of sales data before presenting findings to stakeholders.
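As a rough sketch, that documentation might live in a schema YAML file like the one below (the `sales_summary` model name and the `owner` meta field are illustrative):

```yaml
version: 2

models:
  - name: sales_summary
    description: "One row per customer with aggregated sales figures."
    meta:
      owner: analytics_team
    columns:
      - name: total_revenue
        description: "Sum of purchase_amount from the transactions table."
```

These descriptions and meta fields surface in dbt's generated documentation and lineage graph, so stakeholders can answer questions about a column without reading the underlying SQL.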
How data lineage works in dbt
To understand data lineage in dbt, we need to put it into perspective. Let’s imagine we’re on the data team at an e-commerce startup and we’re mapping out data lineage.
Here are a few key components you need to know to understand how data lineage works in dbt.
- Dependency graph
- Models
- Sources
- Snapshots
- Metrics
- Exposures
We’ll provide further details about each component below.
Dependency graph in dbt
The dependency graph is at the core of how dbt manages data lineage. It visually maps how each component in a dbt project interacts with others. This graph is automatically generated when you define dependencies using dbt's built-in syntax, such as `ref()` and `source()`.
For instance, in our e-commerce startup, the graph shows how `ecommerce_data.orders_export` and `ecommerce_data.sales_record` sources are processed by `stg_orders_export` and `stg_sales_record` staging models, which then feed into higher-level transformation models like `orders_and_sales_summary` and `profitability_summary`.
What does the dependency graph show?
- Sources: Raw data tables or files.
- Models: SQL (or Python) files that clean, prepare, or aggregate data.
- Snapshots: Historical records of source data.
- Metrics: Reusable business calculations.
- Exposures: End-user outputs like dashboards or reports.
Each dbt object (a model, source, snapshot, metric, or exposure) is a node in the graph, and the dependencies between them are represented as directed edges. This provides a clear visual of how data flows and transforms.
Models
A model is essentially a SQL (or Python) file that contains the logic for transforming raw data into a more analytical or business-friendly format. Models let dbt know how to build a specific data set. When you run dbt, it takes the SQL (or Python) code and executes it against your data warehouse to create tables or views.
These models help structure data by applying consistent logic, ensuring accuracy and clarity in analytics.
- Staging models: Prepare raw data by standardizing formats and applying basic transformations.
- Transformation models: Build on staging models to create meaningful datasets for analysis, like aggregations or calculated metrics.
The importance of model naming and dependencies
Using clear, consistent naming conventions ensures that the lineage graph remains understandable and organized. Models should be named to reflect their role in the pipeline (e.g., `stg_` for staging, `fct_` for facts, and `dim_` for dimensions). Additionally, setting proper dependencies between models ensures accurate lineage tracking and reduces the risk of errors.
For example:
- Sources:
- ecommerce_data.orders_export – Raw order data from Shopify.
- ecommerce_data.sales_record – Physical store sales data.
- Staging Models:
- stg_orders_export: Cleans and prepares raw order data.
- stg_sales_record: Standardizes the sales data.
- Transformation Models:
- orders_and_sales_summary: Combines cleaned orders and sales data to give a unified view.
- profitability_summary: Calculates profitability metrics from the combined data.
This hierarchy demonstrates how data flows through the pipeline, providing clear lineage.
Snapshots
Snapshots are historical records of your source data that let you look back at previous versions of the data in those tables. Snapshots automatically track and store changes in your data. For example, if a customer’s order address changes, the snapshot stores the old address alongside the new one, along with a timestamp of when the change occurred.
Snapshots (e.g., `orders_snapshot`) automatically appear in the graph when referenced by models.
Metrics
Metrics in dbt are business calculations that are executed on your transformed data. These metrics are typically calculated using SQL queries. For example, you could define a metric like "Total Revenue" by aggregating the sales data in your models and summing up the sales amount over a specific period.
Metrics add logical layers to the graph by enabling reusable business logic. For example, a metric like `avg_order_value` can be included in multiple models.
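As a rough sketch, `avg_order_value` could be declared in YAML like this, using the pre-Semantic Layer metrics spec from older dbt versions (newer versions define metrics through MetricFlow, so the exact field names depend on your dbt version):

```yaml
version: 2

metrics:
  - name: avg_order_value
    label: Average order value
    model: ref('stg_orders_export')
    calculation_method: average
    expression: order_total
    timestamp: payment_date
    time_grains: [day, week, month]
```

Because the metric references `stg_orders_export` with `ref()`, it appears in the lineage graph as a node that depends on that staging model.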
Exposures
Exposures are used to define and track the downstream use of your data, particularly how your transformed data is used by external tools or processes, like BI tools or other reporting tools. Exposures are defined in `properties.yml` files.
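For instance, a minimal exposure for the dashboards in our e-commerce example might look like this (the dashboard name, URL, and owner details are illustrative):

```yaml
version: 2

exposures:
  - name: profitability_dashboard
    type: dashboard
    maturity: high
    url: https://metabase.example.com/dashboard/42
    description: "Revenue and profit by customer for the leadership team."
    depends_on:
      - ref('orders_and_sales_summary')
      - ref('profitability_summary')
    owner:
      name: Data Team
      email: data-team@example.com
```

Once defined, the exposure appears as a terminal node in the lineage graph, making it clear exactly which models feed the dashboard.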
How dbt builds lineage
dbt builds data lineage by defining relationships between data sources and transformation models. These relationships help dbt track the flow of data from its origin to its final output. The two primary functions used to define data lineage in dbt are `source()` and `ref()`.
Through source()
When you define a source in dbt using the `source()` function, dbt understands it as the origin of the data pipeline. For example:
```sql
SELECT
    NAME AS customer_name,
    TOTAL AS order_total
FROM {{ source('ecommerce_data', 'orders_export') }}
```
In this case, `ecommerce_data.orders_export` is the source node, and any model referencing this source becomes a dependent node in the graph.
Through ref()
The `ref()` function is used to define dependencies between dbt models. When one model references another using `ref()`, dbt creates a direct relationship in the graph. For example:
```sql
SELECT
    customer_name,
    SUM(order_total) AS total_revenue
FROM {{ ref('stg_orders_export') }}
GROUP BY customer_name
```
Here the transformation model depends on the staging model `stg_orders_export` and dbt ensures that `stg_orders_export` is executed before the transformation model.
How to implement data lineage in dbt
Here’s how you implement data lineage in dbt and ensure data quality.
Set up the project structure
To implement lineage in dbt, start by organizing your project into folders:
- Sources: Store YAML files defining your raw data sources.
- Staging: Contain models for data cleaning and preparation.
- Transformations: Include models for business logic and aggregations.
- Snapshots: Track historical changes in source data.
- Metrics: Define reusable metrics for analysis.
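As a minimal sketch (the project and folder names here are illustrative), your `dbt_project.yml` can point dbt at these folders and set sensible default materializations for each one:

```yaml
name: ecommerce_analytics
version: '1.0.0'
profile: ecommerce

model-paths: ["models"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]

models:
  ecommerce_analytics:
    staging:
      +materialized: view
    transformations:
      +materialized: table
```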
Define sources
Defining sources is a crucial step in establishing data lineage. Sources represent the raw data tables in your database. Define them in `sources.yml` to track lineage back to the source.
This is critical for data transparency, understanding how data flows through your models, and ensuring traceability in case of errors or changes.
For example:
```yaml
version: 2

sources:
  - name: ecommerce_data
    schema: RAW_DB
    tables:
      - name: orders_export
      - name: sales_record
```
- name: `ecommerce_data` is the name of the source.
- schema: `RAW_DB` is the schema in the database where the source tables reside. This points to the location of the raw data, helping dbt know where to find and reference it.
- tables: Lists the specific raw tables within the schema. Here, `orders_export` and `sales_record` are the two tables in the `RAW_DB` schema that represent the raw data we’re working with.
Defining these sources in your dbt project tells dbt where the data originates. This helps create a map of your data pipeline, which dbt uses to automatically generate lineage diagrams, showing how these raw data tables (sources) are used and transformed in subsequent models.
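You can optionally extend the same `sources.yml` with freshness checks, so that lineage starts from sources you know are up to date (the `_loaded_at` column and the thresholds below are assumptions):

```yaml
version: 2

sources:
  - name: ecommerce_data
    schema: RAW_DB
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders_export
      - name: sales_record
```

Running `dbt source freshness` then reports whether each source table has been loaded recently enough to trust the models downstream of it.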
Build staging models
Staging models clean and prepare data for further transformations. In dbt, staging models are typically materialized as views or tables and handle basic operations like renaming columns, type casting, filtering out invalid records, or handling missing values. For example:
```sql
{{ config(materialized='view') }}
SELECT
    NAME AS customer_name,
    TOTAL AS order_total,
    PAID_AT AS payment_date
FROM {{ source('ecommerce_data', 'orders_export') }}
```
`{{ config(materialized='view') }}` tells dbt to create a view instead of a table, which is a lightweight way to expose the transformed data without physically storing it.
The query selects and renames the columns (`NAME`, `TOTAL`, and `PAID_AT`) from the orders_export source to create more readable, business-friendly column names.
`source('ecommerce_data', 'orders_export')` refers to the raw data from the `orders_export` table in the `ecommerce_data` source, ensuring that the pipeline’s lineage is traced back to the original data source.
Staging models standardize and structure the raw data, making it ready for more complex transformations.
Create transformation models
Transformation models take the clean data prepared by the staging models and apply advanced calculations, business rules, or aggregations to generate the final datasets for reporting or analysis.
Transformation models let engineers and analysts define complex queries that combine and reshape data into key performance metrics or reporting datasets.
For example, the following SQL transformation model calculates `total_revenue` and `total_profit` by aggregating order and sales data.
```sql
{{ config(materialized='table') }}
SELECT
    orders.customer_name,
    SUM(orders.order_total) AS total_revenue,
    SUM(sales.profit) AS total_profit
FROM {{ ref('stg_orders_export') }} AS orders
LEFT JOIN {{ ref('stg_sales_record') }} AS sales
    ON orders.customer_name = sales.customer_name
GROUP BY orders.customer_name
```
`{{ config(materialized='table') }}` tells dbt to materialize this transformation model as a table, meaning dbt will create a physical table in the database that stores the result of this transformation.
`{{ ref('stg_orders_export') }}` and `{{ ref('stg_sales_record') }}` functions are used to reference other dbt models that have been defined earlier in the project. These references ensure that the transformation model depends on the staging models (`stg_orders_export` and `stg_sales_record`), creating a directed relationship in the data lineage.
The query also performs aggregations, like summing the `order_total` and profit columns to calculate total revenue and total profit, respectively. It groups the results by `customer_name` to get the total revenue and profit per customer.
In this example, dbt uses `ref()` to link the models and automatically ensure that the staging models are run before the transformation model, based on their dependencies. This structure helps maintain an organized data pipeline, and dbt will track the relationships between all models in the lineage graph, allowing you to trace how the final transformed data is derived from its raw source.
Use snapshots for historical tracking
Snapshots capture changes in source data over time. When using snapshots, dbt compares the current state of the source data against the previous state to detect any changes and records those changes for future analysis.
```sql
{% snapshot orders_snapshot %}
{{
    config(
        target_schema='DBT_MAKITI_ANALYTICS',
        target_database='RAW_DB',
        unique_key='id',
        strategy='timestamp',
        updated_at='PAID_AT'
    )
}}
SELECT * FROM {{ source('ecommerce_data', 'orders_export') }}
{% endsnapshot %}
```
The `{% snapshot orders_snapshot %}` tag defines the snapshot, where `orders_snapshot` is the name of the snapshot model. The `config()` block defines how dbt should handle the snapshot. The important parameters here are:
- `target_schema` and `target_database`: These define where the snapshot data should be stored.
- `unique_key`: Defines the unique identifier for each record, in this case, `id`, ensuring that each record can be tracked and identified uniquely.
- `strategy`: The timestamp strategy is used here, meaning dbt will track changes based on a timestamp column, such as `PAID_AT`, that indicates when the record was last updated.
- `updated_at`: This defines the column (`PAID_AT`) that dbt will monitor for changes. If the timestamp in this column changes, dbt will record it as a new version of the record in the snapshot.
The `{{ source('ecommerce_data', 'orders_export') }}` function references the raw data from the `orders_export` table in the `ecommerce_data` source. This ensures that dbt is tracking the raw source data for historical changes.
Snapshots allow you to track data at specific moments in time, which is important for compliance regulations that require a historical record of changes in data.
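dbt adds `dbt_valid_from` and `dbt_valid_to` columns to every snapshot, so you can query a record's history directly. A quick sketch, using the `target_database` and `target_schema` from the config above and an illustrative order id:

```sql
SELECT
    id,
    PAID_AT,
    dbt_valid_from,
    dbt_valid_to
FROM RAW_DB.DBT_MAKITI_ANALYTICS.orders_snapshot
WHERE id = 1001
ORDER BY dbt_valid_from
```

The row whose `dbt_valid_to` is null is the current version of the record; earlier rows show what the data looked like before each change.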
Add tests for validation
Adding tests for validation ensures data integrity and accuracy. Tests can be defined within the `schema.yml` file for each model.
```yaml
version: 2

models:
  - name: orders_and_sales_summary
    columns:
      - name: customer_name
        tests:
          - not_null
          - unique
```
In the above example, two tests are defined for the `customer_name` column:
- not_null: Ensures that there are no null values in the `customer_name` column, which could affect the accuracy of downstream analysis.
- unique: Ensures that each value in the `customer_name` column is unique and there is no duplicated data.
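Beyond these generic tests, you can also add singular tests: standalone SQL files in your `tests/` folder that return the rows violating an assertion. A minimal sketch (the file name and rule are illustrative):

```sql
-- tests/assert_no_negative_order_totals.sql
-- dbt marks this test as failing if the query returns any rows
SELECT *
FROM {{ ref('stg_orders_export') }}
WHERE order_total < 0
```

Because the test references `stg_orders_export` with `ref()`, it is attached to that node in the lineage graph and runs whenever you call `dbt test`.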
Generate lineage graphs
Generating lineage graphs provides a visual representation of the relationships and dependencies between the different components in the data pipeline. Lineage graphs enable you to easily understand how data flows through your transformations and how different models, sources, snapshots, and metrics are connected.
Run the following commands to generate and serve lineage graphs:
- `dbt docs generate`
- `dbt docs serve`
The interactive graph will show the relationships between your sources, models, snapshots, and metrics.
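You can also explore the same lineage from the command line using dbt's node selection syntax, for example:
- `dbt ls --select +orders_and_sales_summary` lists the model together with everything upstream of it.
- `dbt ls --select orders_and_sales_summary+` lists the model together with everything downstream of it.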
Continually test models
Regularly testing dbt models helps maintain the accuracy and reliability of the data pipeline. By periodically running `dbt test` and reviewing the lineage graphs, you ensure that the transformations and dependencies in your pipeline remain consistent and error-free as your data or logic evolves.
Use the `dbt test` command to validate the data against predefined tests, such as not-null or uniqueness constraints. This proactive approach protects data integrity and helps catch errors early.
After testing, revisit the lineage graphs to confirm that changes made to one part of the pipeline do not affect downstream dependencies. Based on the results of your tests, refine your models, adjust logic, or update dependencies.
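If you keep the `manifest.json` artifact from your last production run, you can also scope this work to what actually changed (the artifact path below is illustrative):
- `dbt test --select state:modified+ --state path/to/prod_artifacts` runs tests only for modified models and everything downstream of them.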
Limitations of dbt’s native lineage tracking
dbt's native lineage offers clear advantages for analytics engineering teams, but it also has some limitations:
Limited scope
dbt's lineage only focuses on the transformations defined within its framework. dbt begins tracking lineage where raw data sources are defined using the `source()` function, meaning that if a data pipeline transforms data before it is loaded into the warehouse, dbt will not capture or display these processes.
Likewise, dbt does not provide insights into how the transformed data is used by downstream systems like BI tools. This lack of visibility can create blind spots, particularly for teams that want a holistic view of their organization’s data lifecycle.
No real-time monitoring
Lineage tracking in dbt is static and lacks real-time monitoring. The lineage graph and documentation reflect the state of the pipeline only after a `dbt run` or `dbt docs generate` command; dbt does not capture real-time changes in the data flow or model dependencies.
If a source table is altered or if a downstream dependency breaks, these changes will not appear in the lineage graph until the next manual update is triggered. Engineers lack the ability to detect issues or anomalies as they occur in the pipeline.
Lack of granularity
dbt’s column-level lineage is only available to dbt Cloud Enterprise accounts that use dbt Explorer. Other account holders cannot track how individual columns within their models are derived, transformed, or related across the pipeline.
For example, suppose the e-commerce analytics team questions the source or transformation logic behind a field like `total_revenue`. In that case, dbt's lineage graph can only point to the model it resides in, not the exact upstream columns or transformations used to calculate it.
Limited data observability
dbt’s lineage capabilities primarily focus on transformations defined within its framework, offering little visibility into the broader data ecosystem. In modern data workflows, data flows through various tools and systems, including ETL platforms, data warehouses, and BI tools.
dbt does not provide a unified view across these interconnected tools and systems, which is crucial for most data teams. This limitation can prevent teams from understanding how data flows across their entire pipeline, making it harder to diagnose issues or optimize their data.
Implement data lineage automatically with Metaplane
Implementing data lineage manually in dbt is effective, but it's also time-consuming and difficult to maintain. Metaplane, on the other hand, easily integrates with your dbt pipeline and other systems and tools in your data stack. Once configured, you can automatically track and visualize how data flows across your entire pipeline, offering enhanced data observability.
You can also track lineage down to the column level, providing deeper insights into the flow and transformation of your data.
With this level of visibility, you can quickly identify issues, detect anomalies, and ensure the accuracy and reliability of your data across all systems, without manual tracking or constant maintenance.
Get better visibility and better decision making with data lineage in dbt
Data lineage is a key aspect of data management that provides a clear, traceable map of how data moves and transforms across systems. dbt allows teams to easily visualize and manage the flow of data throughout their analytics pipeline.
To capture data lineage, you can use dbt, which automatically generates detailed documentation and visualizations of how data flows through each transformation, making it easier to track dependencies and spot potential issues.
Beyond the technical ease data lineage brings to engineering teams, it benefits every team in an organization from both a compliance and a business perspective. Ultimately, data lineage fosters better governance and transparency.
However, data lineage is most valuable when visualized and integrated into all aspects of your tools and systems in your data stack, ensuring that every stakeholder has access to accurate, trustworthy data for informed decision-making.
Start for free or book a demo to make your dbt data lineage easier and more comprehensive with Metaplane.