Introducing Snowflake Table and Column Usage Analytics
It’s easier than ever to ETL data, but harder than ever to understand who or what is using this data. Metaplane now monitors table and column level usage analytics for Snowflake customers so you can better understand how critical data is used, what should be tested, and how to prioritize data quality issues.
With the rise of cloud data warehouses and tools throughout the data pipeline, it’s easier than ever to extract, transform, and load data to be operationalized. But with this ease of storing data comes new complexities around answering a question every data engineer has a tough time answering:
❓How are downstream people and tools using our data?
We’re excited to release table and column usage analytics, a new feature for Snowflake users that helps data teams answer this question by including usage metadata throughout the data engineering lifecycle. By parsing warehouse query history, Metaplane collects metadata like how frequently a table or column is queried and which user and role ran the query.
There are endless use cases, but we’d like to highlight several that our customers have found most helpful: understanding data usage over time, running downstream impact analysis, reducing data debt, testing your most critical data assets, and providing actionable context when things go wrong.
Understanding Data Usage Over Time
When we asked our customers who or what was using their data, they couldn’t tell us with much confidence. And it makes sense why - to get to these answers definitively, a data team would need to read through all of their ad-hoc queries, transformations, and BI tools or invest engineering resources to automate collecting, aggregate, and displaying metadata like warehouse query logs.
With Metaplane, a data engineer can view a table usage report and at a glance confirm if their assumptions about the use of this table was correct, or whether there were any unexpected usage trends. For example, while you would expect to see your core fact and dimension tables used frequently, you shouldn't see a “temporary” table being queried an increasing amount of times in the past seven days.
In order to understand your data usage trends, Metaplane aggregates this metadata over time so you can understand changes in your data. For example, you can view and sort table usage by the number of queries run in the past seven days. Our customers use this to understand when sudden increases or decreases in usage occur. Changes in trends like this are often indicators that something is broken or data is being used in a new way. In both cases, the data team should be made aware as soon as possible so they can fix any issues or prioritize making changes to their models.
Here are some ways our customers use data usage over time:
- Understand how your team is using staging and production data environments
- Confirm known knowns about your data and ensure this is consistent over time
- Find unknown unknowns about your data, like persistent temporary tables with increased usage over time
- Track usage over time to understand if something broke or if new models need to be built due to increased usage
A high level view of data usage is not only helpful in building awareness of trends - it also empowers data engineers to move faster and make less mistakes.
Running Downstream Impact Analysis
As a data engineer, it’s painful to update or remove a model and only find out about broken data from your downstream consumers. We believe data engineers can't be expected to never break things if there aren't easy ways to grok downstream impact. That's where usage metadata can help.
Not all tables and columns are created equal - Metaplane shows you how frequently tables and columns are queried at both a high and low level. Using this context can help you start new projects or make changes to existing data models.
For example, you can view high level usage data about tables and columns to understand how frequently they are queried. This is a great starting point to assess the importance of the data that could be impacted by changes you'd like to make. You can then dig even deeper to see common queries by role and user and further assess how important a particular data asset is. This high and low level context is helpful in planning work, making less mistakes, and communicating changes to interested parties.
In addition to usage, Metaplane customers often use the warehouse to BI data lineage tool to build a full picture of downstream dependencies.
Here are the most common ways to leverage data usage to perform downstream impact analysis:
- Find critical tables or columns that could be impacted by changing a dbt model
- Find BI assets that could be impacted by changing or removing a table or column
- Find which roles and users are most commonly using a table or column
Reducing Data Debt
The most productive data teams proactively reduce data debt and the surface area they need to maintain. It’s challenging to prioritize addressing tech debt because there are always more important data products to build or fires to fight.
Metaplane automates the discovery of rarely used or poorly performing data assets so data engineers only need to spend minutes to plan work for reducing data debt. By sorting tables or columns by the least number of queries, you can immediately scope and prioritize removing them. We're excited about the opportunity to make addressing data debt a reality for every data team.
Here are some ideas on reducing data debt with usage metadata:
- Find tables and columns that can be removed because they are no longer queried
- Find tables and columns that have poor performance and should be refactored
Testing Your Most Critical Data Assets
Metaplane makes it easy to blanket your warehouse with essential baseline tests like freshness and row counts. But how do data teams ensure their most critical tables and columns have the appropriate testing coverage and continuous monitoring? Enter table and column usage.
Metaplane can suggest which tables and columns should be tested based on how frequently they are used. If a critical table is queried more frequently by a reverse ETL tool or BI dashboards, you may want to include finer grain column tests like a suite of distribution tests, or write a custom SQL test to monitor computed metadata. Conversely, a table that is rarely queried most likely does not need all of its columns tested as rigorously.
Here are the most common ways to use data usage to test your data assets:
- See suggested tests based on table and column usage
- Automatically add tests to critical tables and columns based on usage
Actionable Context When Things Go Wrong
Whenever we build new product at Metaplane, we always try to incorporate it into the data engineering workflow. Table and column usage is no exception - one of the most helpful pieces of metadata to consider when a data quality issue occurs is: how important is this problem? One of the most effective ways to answer this question is to understand how frequently this data is being used.
That’s why Metaplane includes this usage data with every alert. When you receive a notification, we’ll include the upstream and downstream lineage, and also include the usage data so data engineers can understand what might have caused the issue and how impactful this issue will be so they can easily prioritize it.
Table and column usage analytics is a new piece of metadata that Metaplane can now collect and monitor. We’re excited to include this metadata throughout the product and provide helpful context in data engineering workflows like planning changes to the warehouse, creating models, or debugging data, to name a few.
We’d love to hear from you - do you have any use cases for table and column usage that we didn’t mention? Get in touch on Twitter and let us know!
Table of contents