Data Quality Considerations for Data Stack Additions
As teams continue to level up their data stack to break down silos, move faster, and generally make data more available to end users, new tools are often needed to replace old ones or legacy homegrown processes. To protect your new investments while accelerating implementation, we’ll cover a few common areas of data quality issues and how Metaplane helps you monitor those areas.
Cloud warehouse migrations
The scenario we’ll cover is moving from a relational database that serves both transactions and analytics to a cloud-based analytics warehouse/lakehouse. Teams often do this to reduce infrastructure management overhead, scale out storage, improve query speeds without the fear of table locks, and increase capacity for new analytics projects. Note that in most setups, you’ll want to keep your database for recording transactions, resulting in ongoing replication to your new warehouse/lakehouse. To achieve this, you’ll want to watch out for several areas:
- Database object inventory - The question of which objects to import will come up fairly early in your migration. To answer it, you’ll want the full list of schemas and tables in your current database, along with the query usage associated with those objects, so that you can create a checklist of actively used items and prioritize them. As you discover objects and fields that are no longer in use, you can also retroactively clean up your existing database while building stakeholder relationships.
- Schema changes in the source database - While your engineering team may know to avoid unnecessary schema changes, for fast-moving businesses those changes are occasionally a necessity. Tracking schema changes in the source database alerts you when your replication process needs updating to accommodate new tables and fields, especially those with analytical value.
- Shifting the culture to proactive incident management - Overhauling your data warehousing strategy is a significant investment of time and effort. It’s always beneficial to show progress to build trust in the new direction, and in a data warehouse, replicating existing analytics and building new reports is a common way to do so. Accurate data is needed to drive successful insights, and data quality monitoring that shows users the expected bounds of field values and update cadences is a great way to prove the accuracy of the data behind a report.
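The object inventory and schema-change tracking described above can be sketched with a few lines of code. Here is a minimal, hedged illustration using SQLite’s catalog as a stand-in for your database’s information schema (the `orders` table and its columns are hypothetical, and production databases would expose this through `information_schema` instead):

```python
import sqlite3

def snapshot_schema(conn):
    """Return {table: [column, ...]} for every user table in the database."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: [c[1] for c in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}

def diff_schemas(before, after):
    """Summarize added/removed tables and columns between two snapshots."""
    changes = []
    for table in after.keys() - before.keys():
        changes.append(f"added table {table}")
    for table in before.keys() - after.keys():
        changes.append(f"removed table {table}")
    for table in before.keys() & after.keys():
        for col in set(after[table]) - set(before[table]):
            changes.append(f"added column {table}.{col}")
        for col in set(before[table]) - set(after[table]):
            changes.append(f"removed column {table}.{col}")
    return sorted(changes)

# Example: a hypothetical source table gains a column between snapshots.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
before = snapshot_schema(conn)
conn.execute("ALTER TABLE orders ADD COLUMN discount REAL")
after = snapshot_schema(conn)
print(diff_schemas(before, after))  # → ['added column orders.discount']
```

Taking periodic snapshots like this also doubles as your object inventory: the `before` dictionary is exactly the checklist of schemas, tables, and fields you’d prioritize during a migration.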
Business intelligence tools
Whether you’re moving from queries exported as CSVs or from another tool, the core goals of business intelligence tools are to improve data visibility and allow more users to find insights. No matter which tool you select, there’ll be a learning curve when it comes to creating reports and dashboards. When you move beyond replicating existing reports and power users begin creating new dashboards, data consumers will begin validating the results. If they find inaccuracies, the two most common culprits are:
- The user is wrong - In this case, “user” can refer both to the person creating the dashboard and to a stakeholder using it. A dashboard creator may be using a silently deprecated field or may have accidentally used an incorrect join type, while a dashboard user might have misunderstood a field definition or applied an incorrect filter.
- The data is wrong - When this happens, you’ll need to take some time to root cause the issue, which includes taking an inventory of the reference objects used to build a report or dashboard, and parsing queries and scripts to understand how data was generated upstream. This may delay your implementation timeline.
Column level lineage, showing how fields from tables and views in your warehouse are used in business intelligence workbooks, helps ensure that you’re using the right fields and accelerates troubleshooting when needed. Continuous data quality monitoring, on the other hand, ensures that values are accurate and up to date, eliminating “wrong data” as a potential culprit.
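One way to picture column level lineage is as a graph you can walk from a dashboard field back to its raw sources. Below is a minimal sketch with entirely hypothetical column names, not Metaplane’s actual implementation:

```python
from collections import deque

# Hypothetical lineage: each key depends on the listed upstream columns.
LINEAGE = {
    "dashboard.revenue":     ["mart.orders.total"],
    "mart.orders.total":     ["staging.orders.amount", "staging.orders.tax"],
    "staging.orders.amount": ["raw.orders.amount"],
    "staging.orders.tax":    ["raw.orders.tax"],
}

def upstream_columns(column, lineage=LINEAGE):
    """Breadth-first walk of the lineage graph to find every column feeding `column`."""
    seen, queue = set(), deque([column])
    while queue:
        for parent in lineage.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(upstream_columns("dashboard.revenue")))
```

A traversal like this is what makes troubleshooting fast: when a dashboard number looks wrong, the set of upstream columns is exactly the inventory you’d check for deprecated fields, bad joins, or stale loads.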
Data ingestion tools
Whether you’re attempting to “replicate” (pun intended) how an existing pipeline behaves, or handle new sources, you’ll want to ensure that your replication tool is creating an accurate copy of your source system. Some of the common areas to watch out for are:
- Freshness issues - To ensure that your pipeline is updating data on the schedule that you’ve configured, you’ll want to establish monitors that alert you when an object isn’t updated within an expected timeframe, allowing for latency stemming from extraction, formatting (e.g. JSON parsing to a tabular structure), and merging (e.g. from a staging table).
- Row count issues - Partial data loads can result from broken connections (e.g. database connection interruptions), inaccurate flat files (e.g. exported by a third party), or faulty API call pagination (e.g. by timestamp). Monitoring row counts over time helps you understand your expected update volumes, which can change throughout the day depending both on the cadence of your replication schedule and on source data updates.
- Null values - When you’re merging several tables, each with its own conditional statements applied, null values can lurk as silent data bugs. Imagine a scenario where you’re joining on a field that wasn’t accurately replicated: it’s helpful to know which object(s) in your joins contain null values, as well as which fields have historically contained them (e.g. a deprecated field used alongside its replacement, with each covering a different time frame).
- Schema changes - This is especially common in ERP, CRM, and database sources, as they often have custom-named fields and will likely add more. It’s helpful to track the history of schema changes, not only to give upstream teams better context on how their changes have impacted downstream dependencies, but also to proactively plan for incorporating schema changes into your analytics strategy.
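The freshness, row count, and null value checks above can each be reduced to a simple threshold function. This is a hedged sketch of the idea (the windows, sigma multiplier, and sample data are illustrative choices, not Metaplane’s actual thresholds, which are learned from your data):

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def freshness_ok(last_updated, max_age, now=None):
    """True if the object updated within the expected window (incl. pipeline latency)."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age

def row_count_ok(history, latest, sigmas=3.0):
    """Compare the latest load volume against the historical mean ± N standard deviations."""
    mu, sd = mean(history), stdev(history)
    return abs(latest - mu) <= sigmas * sd

def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

# Illustrative usage: a partial load of 400 rows against a ~1,000-row history.
history = [1000, 1040, 980, 1010, 990]
print(row_count_ok(history, 400))   # partial load, flag it
print(row_count_ok(history, 1020))  # within normal variation
```

In practice the interesting work is in choosing the thresholds: a static window like `max_age` breaks when load cadence varies through the day, which is why learned, per-object bounds are preferable.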
As you can imagine, you’ll also want to understand your data flow from landing tables to production business intelligence dashboards, to help rule out your new ingestion tool as a potential cause.
Data transformation tools
In this scenario, we’ll focus on in-warehouse transformations (i.e. modeling). The approach is similar to your warehouse validation queries and business intelligence report creation: you’ll want to guarantee the accuracy of the data used in your new and updated models with the data quality metrics mentioned above. One additional consideration is execution latency, especially for data products that require near real-time insights. You’ll see latency not only in the initial ETL process, but also when you’re orchestrating several transformations in sequence, each with its own distinct runtime. As you can imagine, you’ll want to track runtime over your models’ lifetime so that you can course correct if/when you notice deteriorating queries that would impact downstream users.
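Tracking runtime over a model’s lifetime can start as simply as comparing a recent window of runs against the model’s baseline. A minimal sketch, where the window size and tolerance multiplier are illustrative assumptions:

```python
from statistics import mean

def runtime_regressing(runtimes_sec, window=5, tolerance=1.2):
    """Flag a model whose recent average runtime exceeds its baseline by `tolerance`x.

    `runtimes_sec` is the model's run history in seconds, oldest first; the
    last `window` runs are compared against everything before them.
    """
    baseline, recent = runtimes_sec[:-window], runtimes_sec[-window:]
    return mean(recent) > tolerance * mean(baseline)

# A model that historically ran in ~60s but has crept past 90s gets flagged.
print(runtime_regressing([60] * 10 + [95, 100, 105, 110, 115]))  # → True
print(runtime_regressing([60] * 15))                             # → False
```

Catching the creep early matters most for chained transformations, where each model’s slowdown compounds into the total latency your downstream users experience.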
How Metaplane supports evolving data stacks
As a leader in the data observability space, Metaplane was built to alert data teams when data quality incidents occur, whether those are inaccurate values in the data itself, or an unexpected error in data loads or modeling. To support your goals, we’ve created Metaplane with:
- Schema change detection - In both transactional databases and cloud lakehouses, you’ll be able to view a history of schema changes and send those events to notification channels. To facilitate conversations, we recommend creating channels that include both those responsible for schema changes, such as software engineers or business system administrators, and those reliant on data from those system(s).
- Data quality monitors - Our monitors alert you when freshness, volume, nullness rates, uniqueness rates, numeric distributions, and other data quality metrics move outside of an acceptable threshold. These thresholds are created by feeding your data profile(s) into our machine learning models, which also update them over time, so that the alerts you see reflect actual incidents.
- Column level lineage - Metaplane automatically generates and updates column level lineage from your ingestion tools, through your warehouse, down to business intelligence tools, for your data team to more quickly root cause issues with new tool(s) and understand their impact, including affected data consumers.
If you’re implementing a new tool, you can create a free Metaplane account and integrate your new tool(s) within 15 minutes, so that you can spend less time worrying about data quality and more time proving the value of your investments.