Data CI/CD for dbt Core
If you're using dbt Core and want more visibility into how your next PR will impact downstream objects, this feature was made for you!
There’s a good chance you’ve heard of dbt. After all, it’s been one of the most widely adopted tools in the data world over the past few years. dbt helps data teams transform their data in powerful ways using familiar engineering workflows.
There are two versions of dbt:
- dbt Core: The original open source offering, which gives you the functionality of dbt but requires you to host and manage it yourself.
- dbt Cloud: The more fully featured offering, which provides an intuitive UI, managed infrastructure, and other benefits.
If you’re looking for more information on which you should pick, we’d recommend reading advice from dbt themselves.
Because Metaplane’s product roadmap has always been driven by customer feedback, our first dbt integration was for dbt Cloud, which most of our customers used at the time. But we’ve also been hard at work adding support for the growing number of customers who use dbt Core. This began with job duration monitors, which alert you to anomalous job runtimes that can cause downstream issues, and with the inclusion of models in our column-level lineage graphs, so that you can understand who and what a data quality issue affects.
With those features in place, we turned our attention to the end goal of every data team: preventing issues before they happen.
Proactive incident prevention
We’ve found, from conversations with our customers, that teams who use dbt are what we think of as “high-leverage”: they’re called upon to create numerous models to clean, merge, and automate calculations on their data. Thousands of objects in the warehouse, hundreds of models, and dozens of requests to create and update models are common indicators of a high-leverage team. The outputs of those models might feed further models and objects, and then downstream data products, ranging from BI dashboards to data that powers customer experiences.
This can lead to another problem, one that isn’t directly listed in the analytics engineering job description: model management.
Over time, you’ll often find data teams having quick internal conversations about the purpose of various models: Who was this made for? When was this last updated?
And, most importantly: If we updated this now, would it affect anything else that we’re using?
Those questions drove the creation of what we call our Data CI/CD tool. At the heart of this feature is our GitHub application, built to fit into your typical workflow and warn you about the downstream impact of any changes you’re making to your models. It includes two separate tools:
- Impact Analysis: You’ll be able to see how many downstream tables, and how many objects in your business intelligence tool, would be affected by a change. This gives an early indication of how important the model is in the context of your data stack.
- Test Previews: Seeing impacted tables is only the first step, since many of your models will affect other tables in your warehouse. Test Previews test your proposed changes against your production data across several data quality metrics, providing additional certainty that your changes won’t dramatically alter your established models and tables.
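To make the idea behind Impact Analysis concrete, here is a minimal sketch of counting downstream objects by walking a lineage graph. This is only an illustration, not Metaplane’s implementation: the `lineage` map, the node names, and the `downstream_of` helper are all hypothetical, and a real lineage graph would come from your warehouse and BI metadata rather than a hand-written dictionary.

```python
from collections import deque


def downstream_of(changed, children):
    """Return every node reachable downstream of `changed` (breadth-first walk)."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: each key maps to the objects that read from it directly.
lineage = {
    "model.stg_orders": ["model.orders", "model.order_items"],
    "model.orders": ["model.revenue_daily", "dashboard.sales_overview"],
    "model.order_items": ["model.revenue_daily"],
    "model.revenue_daily": ["dashboard.finance_kpis"],
}

# Suppose a PR edits the staging model; collect everything that depends on it.
impacted = downstream_of("model.stg_orders", lineage)
tables = sorted(n for n in impacted if n.startswith("model."))
dashboards = sorted(n for n in impacted if n.startswith("dashboard."))
print(f"{len(tables)} downstream models, {len(dashboards)} BI objects")
```

A model near the top of the graph, like the staging model here, touches far more downstream objects than one at the bottom, which is the kind of signal an impact summary on a PR is meant to surface.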
We’re very excited to announce that our Data CI/CD tools are now available as part of our dbt Core integration! With this feature, more data teams will be able to join the ranks of teams like Upright, who had this to say:
❝Metaplane’s GitHub application has been indispensable as a tool for additional validation as we change code in our dbt instance. It makes it super simple to understand the downstream impact of any sort of change, and when paired with lineage, we can go directly to analysts to tell them ‘X change has Y impact’ to prevent accidental issues.❞
What else does Metaplane do?
Of course, Data CI/CD isn’t all that Metaplane can do. Beyond job duration monitors and column-level lineage graphs that map upstream objects and data ingestion tools, Metaplane helps teams find issues within the data itself, so that everyone can trust the data that powers your business.
This means that you’ll not only be able to understand when something has gone wrong with your dbt projects, but also receive alerts when raw upstream data or model outputs show unexpected results. If you’re interested in getting started, we recommend signing up for a free account. And our team is always on standby if you’d like to talk about how to optimize your setup.