Data Quality Considerations for Data Stack Additions

As teams continue to level up their data stack to break down silos, move faster, and generally make data more available to end users, new tools are often needed to replace old ones or legacy homegrown processes. To protect your new investments while accelerating implementation, we’ll cover a few common areas where data quality issues crop up and how Metaplane helps you monitor them.

August 17, 2023

Cloud warehouse migrations

The scenario we’ll cover is moving from a relational database that serves both transactions and analytics to a cloud-based analytics warehouse/lakehouse. Teams often do this to reduce their infrastructure management burden, scale out storage, improve query speeds without the fear of table locks, and increase capacity for new analytics projects. Note that in most setups you’ll keep your existing database for recording transactions, which means ongoing replication to your new warehouse/lakehouse. During and after the migration, you’ll want to watch out for several areas:

  • Database object inventory - The question of which objects to import will come up fairly early in your migration. To answer it, you’ll want a full list of the schemas and tables in your current database, along with the query usage associated with those objects, so you can build a checklist of what’s actively used and prioritize it (see the sketch after this list). As you discover objects and fields that are no longer in use, you can also clean up your existing database and build stakeholder relationships along the way.
  • Schema changes in the source database - Even if your engineering team knows to avoid unnecessary schema changes, in fast-moving businesses those changes are occasionally a necessity. Tracking schema changes in the source database alerts you when your replication process may need updating to accommodate new tables and fields, especially ones with analytical value.
  • Shifting the culture to proactive incident management - Overhauling your data warehousing strategy is a significant investment of time and effort. Showing progress early builds trust in the new direction, and in a data warehouse, replicating existing analytics and building new reports is a common way to do so. Accurate data is needed to drive successful insights, and data quality monitoring that shows your users the expected bounds of fields and update cadences is a great way to prove the accuracy of the data behind a report.
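
To make the inventory step concrete, here’s a minimal sketch assuming a Postgres source database queried with psycopg2. The DSN, the columns pulled from pg_stat_user_tables, and the “migrate vs. review” split are illustrative placeholders, not a prescribed process.

```python
# Minimal inventory sketch for a Postgres source database (illustrative only).
# Lists schemas and tables alongside scan counts so you can flag unused objects.
import psycopg2

INVENTORY_SQL = """
    SELECT schemaname,
           relname,
           seq_scan + COALESCE(idx_scan, 0) AS total_scans,
           n_live_tup AS approx_rows
    FROM pg_stat_user_tables
    ORDER BY total_scans DESC;
"""

def inventory(dsn: str):
    """Return (schema, table, total_scans, approx_rows) tuples, most-used first."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(INVENTORY_SQL)
        return cur.fetchall()

if __name__ == "__main__":
    # Hypothetical DSN; the migrate/review split is a starting point, not a rule.
    for schema, table, scans, rows in inventory("postgresql://localhost/source_db"):
        status = "migrate" if scans > 0 else "review before migrating"
        print(f"{schema}.{table}: {scans} scans, ~{rows} rows -> {status}")
```

Keep in mind that these statistics counters only reflect activity since the last stats reset, so pair the output with stakeholder conversations before deprioritizing anything.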

Business intelligence tools

Whether you’re moving from queries exported as CSVs or from another tool, the core goals of business intelligence tools are to improve data visibility and allow more users to find insights. No matter which tool you select, there’ll be a learning curve when it comes to creating reports and dashboards. Once you move beyond replicating existing reports and power users begin creating new dashboards, your data consumers effectively become your quality assurance. If they find inaccuracies, the two most common culprits are:

  • The user is wrong - In this case, “user” can refer both to the person creating the dashboard and to another stakeholder using it. A dashboard creator may have used a silently deprecated field or an incorrect join type, while a dashboard user might have misunderstood a field definition or applied an incorrect filter.
  • The data is wrong - When this happens, you’ll need to take some time to root cause the issue, which includes taking an inventory of the reference objects used to build a report or dashboard, and parsing queries and scripts to understand how data was generated upstream. This may delay your implementation timeline.

Column level lineage, which shows how fields from tables and views in your warehouse are used in business intelligence workbooks, helps ensure that you’re using the right fields and accelerates troubleshooting when needed. Continuous data quality monitoring, on the other hand, ensures that values are accurate and up to date, eliminating “wrong data” as a potential culprit.
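
Metaplane generates this lineage automatically, but as a simplified illustration of the underlying idea (not Metaplane’s internals), the sketch below uses the open-source sqlglot parser to pull the column references out of a hypothetical dashboard query so they can be checked against warehouse fields.

```python
# Illustrative only: extract the column references from a BI query so they can
# be checked against warehouse fields. Uses the open-source sqlglot parser.
import sqlglot
from sqlglot import exp

def referenced_columns(sql: str) -> set[str]:
    """Return 'table_or_alias.column' strings referenced in a SQL statement."""
    tree = sqlglot.parse_one(sql)
    return {
        f"{col.table or '<unqualified>'}.{col.name}"
        for col in tree.find_all(exp.Column)
    }

# Hypothetical dashboard query.
query = """
    SELECT o.order_id, o.total, c.region
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    WHERE o.created_at >= '2023-01-01'
"""
print(sorted(referenced_columns(query)))
# ['c.id', 'c.region', 'o.created_at', 'o.customer_id', 'o.order_id', 'o.total']
```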

Data ingestion tools

Whether you’re attempting to “replicate” (pun intended) how an existing pipeline behaves, or handle new sources, you’ll want to ensure that your replication tool is creating an accurate copy of your source system. Some of the common areas to watch out for are:

  • Freshness issues - To ensure that your pipeline is updating data on the schedule you’ve configured, you’ll want to establish monitors that alert you when an object isn’t updated within an expected timeframe, allowing for latency stemming from extraction, formatting (e.g. JSON parsing to a tabular structure), and merging (e.g. from a staging table). A hand-rolled version of this check is sketched after this list.
  • Row count issues - Partial data loads can occur from broken connections (e.g. database connection interruptions), inaccurate flat files (e.g. exported by a third party), or faulty API call pagination (e.g. by timestamp). Monitoring row counts over time helps you understand your expected update volumes, which can change throughout the day depending on both the cadence of your replication schedule and source data updates.
  • Null values - When you’re merging several tables, with conditional statements applied to each, null values can exist as silent data bugs. Imagine a scenario where you’re joining on a field that wasn’t accurately replicated: it’s helpful to know which objects in your joins contain null values, as well as which fields have historically had them (e.g. a deprecated field that’s being used alongside its replacement, with each covering a different time frame).
  • Schema changes - This is especially common in ERP, CRM, and database sources, as they often have custom-named fields and are likely to add more. Tracking the history of schema changes helps you give upstream teams better context on how their changes have impacted downstream dependencies, and lets you proactively plan for incorporating those changes into your analytics strategy.
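
A data observability tool automates these monitors, but to make the freshness and row count checks above concrete, here’s a minimal hand-rolled sketch. The connection string, table name, SLA, and expected row count range are all hypothetical, and it assumes an updated_at column stored as a timezone-aware UTC timestamp.

```python
# Hand-rolled freshness and row count checks for a hypothetical landing table.
# An observability tool automates this; the sketch only shows the idea.
from datetime import datetime, timedelta, timezone
from sqlalchemy import create_engine, text

TABLE = "raw.salesforce_accounts"      # hypothetical landing table
FRESHNESS_SLA = timedelta(hours=2)     # replication cadence plus a buffer
EXPECTED_ROWS = (95_000, 105_000)      # would normally come from profiled history

engine = create_engine("snowflake://...")  # placeholder connection string

with engine.connect() as conn:
    last_update = conn.execute(text(f"SELECT MAX(updated_at) FROM {TABLE}")).scalar()
    row_count = conn.execute(text(f"SELECT COUNT(*) FROM {TABLE}")).scalar()

# Assumes updated_at is a timezone-aware UTC timestamp.
lag = datetime.now(timezone.utc) - last_update
if lag > FRESHNESS_SLA:
    print(f"ALERT: {TABLE} last updated {lag} ago (SLA {FRESHNESS_SLA})")

if not EXPECTED_ROWS[0] <= row_count <= EXPECTED_ROWS[1]:
    print(f"ALERT: {TABLE} row count {row_count} outside expected range {EXPECTED_ROWS}")
```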

As you can imagine, you’ll also want to understand your data flow from landing tables to production business intelligence dashboards, to help rule out your new ingestion tool as a potential cause.

Transformation tools

In this scenario, we’ll focus on in-warehouse transformations (i.e. modeling). The approach is similar to your warehouse validation queries and business intelligence report creation: you’ll want to guarantee the accuracy of the data used in your new and updated models with the data quality metrics mentioned above. One additional consideration is execution latency, especially for data products that require near real-time insights. Latency shows up not only in the initial ETL process, but also when you’re orchestrating several transformations in sequence, each with its own distinct runtime. You’ll want to track runtime over your models’ lifetime so that you can course correct if and when you notice deteriorating queries that would impact downstream users.
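
Assuming you already capture per-run durations somewhere (an orchestrator’s metadata, dbt run artifacts, or a simple run-log table), a small trend check like the sketch below can flag models whose runtimes are drifting upward. The model names, history, and 1.5x tolerance are illustrative.

```python
# Flag models whose recent runtimes have drifted well above their baseline.
# Assumes you already collect per-run durations (e.g. from orchestrator logs).
from statistics import mean

def runtime_regressions(run_log: dict[str, list[float]],
                        recent_n: int = 5,
                        tolerance: float = 1.5) -> list[str]:
    """run_log maps model name -> chronological runtimes in seconds."""
    flagged = []
    for model, runtimes in run_log.items():
        if len(runtimes) <= recent_n:
            continue  # not enough history to establish a baseline
        baseline = mean(runtimes[:-recent_n])
        recent = mean(runtimes[-recent_n:])
        if recent > tolerance * baseline:
            flagged.append(f"{model}: {baseline:.0f}s -> {recent:.0f}s")
    return flagged

# Hypothetical history: fct_orders has slowed down, dim_customers has not.
history = {
    "fct_orders":    [120, 118, 125, 130, 190, 210, 220, 205, 230],
    "dim_customers": [40, 42, 39, 41, 40, 43, 41, 40, 42],
}
print(runtime_regressions(history))
```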

How Metaplane supports evolving data stacks

As a leader in the data observability space, Metaplane was built to alert data teams when data quality incidents occur, whether those are inaccurate values in the data itself, or an unexpected error in data loads or modeling. To support your goals, we’ve created Metaplane with:

  • Schema change detection - In both transactional databases and cloud lakehouses, you’ll be able to view a history of schema changes and send those events to notification channels (a simplified illustration of the idea follows this list). To facilitate conversations, we recommend creating channels that include both the people responsible for schema changes, such as software engineers or business system administrators, and those who rely on data from those systems.
  • Data quality monitors - Our monitors alert you when freshness, volume, nullness rates, uniqueness rates, numeric distributions, and other data quality metrics move outside an acceptable threshold. These thresholds are created by feeding your data profiles into our machine learning models, which continue to update them over time, so that the alerts you see reflect actual incidents.
  • Column level lineage - Metaplane automatically generates and updates column level lineage from your ingestion tools, through your warehouse, down to your business intelligence tools, so your data team can more quickly root cause issues with new tools and understand their impact, including which data consumers are affected.
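
As a simplified illustration of the schema change detection idea (not Metaplane’s implementation), the sketch below diffs two hypothetical snapshots of information_schema.columns and turns the differences into events you could route to a shared notification channel.

```python
# Simplified illustration of schema change detection: diff two snapshots of
# information_schema.columns taken at different times.
def snapshot_diff(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Snapshots map 'schema.table.column' -> data type."""
    events = []
    for col in current.keys() - previous.keys():
        events.append(f"ADDED   {col} ({current[col]})")
    for col in previous.keys() - current.keys():
        events.append(f"DROPPED {col} ({previous[col]})")
    for col in previous.keys() & current.keys():
        if previous[col] != current[col]:
            events.append(f"RETYPED {col}: {previous[col]} -> {current[col]}")
    return events

# Hypothetical snapshots taken a day apart.
yesterday = {"crm.accounts.id": "bigint", "crm.accounts.tier": "varchar"}
today     = {"crm.accounts.id": "bigint", "crm.accounts.tier_v2": "varchar"}
for event in snapshot_diff(yesterday, today):
    print(event)  # e.g. send to a shared Slack channel instead of stdout
```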

If you’re implementing a new tool, you can create a free Metaplane account and integrate your new tools within 15 minutes, so that you can spend less time manually checking data quality and more time proving the value of your investments.
