How to Set Up Data Quality Tests

If you’re ready to move beyond one-off unit tests to ensure data quality, read on to understand how to establish best practices and processes that bridge the gap between the business and data.

May 10, 2023

If you’re in software or data engineering, you’re likely familiar with the concept of unit testing, which you’ve probably implemented in your data pipelines through stored procedures, a workflow in your DAG, open source solutions, or even dbt tests. These tests can span a wide range of data quality dimensions and metrics, including freshness, consistency, accuracy, and completeness. As the title suggests, your work in ensuring data quality isn’t done after a handful of tests. From hundreds of our users, we’ve collected input to form some best practices for setting up data quality tests.

What are Data Quality Tests?

Data quality tests refer to the processes and procedures implemented to assess the reliability, accuracy, consistency, completeness, and relevance of data within a given system or database. These tests aim to identify any potential issues or anomalies that may impact the integrity or usefulness of the data. Data quality tests can involve various techniques, such as data profiling, data validation, and data cleansing, which help organizations ensure that their data meets predefined quality standards. By conducting data quality tests, organizations can identify and rectify data errors, improve data integrity, enhance decision-making processes, and maintain high-quality data that is reliable and fit for purpose.

For the purposes of this blog post, we’ll focus primarily on tests that you can create with SQL and run within the data warehouse.

10 Examples of Data Quality Tests and When to Apply Them

There are more ways to test data quality than we can cover here, but below are 10 of the most popular data quality checks, when you’d apply each one, and a sketch of what each might look like in SQL.

Completeness Test

This test ensures that all required fields within a dataset are populated. It checks for missing values or NULL values in critical columns.

When you’d use this: When importing customer data from various sources, ensuring all necessary fields like name, email, and address are populated for each customer record.
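
For instance, a minimal completeness check might look like this, assuming a hypothetical customers table with name, email, and address columns (any result above zero means required fields are missing):

-- Count rows missing any required field
SELECT COUNT(*) AS incomplete_rows
FROM customers
WHERE name IS NULL
   OR email IS NULL
   OR address IS NULL;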

Consistency Test

This test verifies that data across different sources or systems aligns and is consistent. It compares data values for specific attributes or key identifiers to identify any discrepancies or conflicts.

When you’d use this: When integrating data from multiple systems, verifying that the customer IDs across different systems match to avoid any conflicting or duplicate records.
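
A sketch of this check, assuming hypothetical crm_customers and billing_customers tables that should share customer IDs (any rows returned are IDs missing from the billing system):

-- Customer IDs present in the CRM but absent from billing
SELECT crm.customer_id
FROM crm_customers AS crm
LEFT JOIN billing_customers AS billing
  ON crm.customer_id = billing.customer_id
WHERE billing.customer_id IS NULL;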

Accuracy Test

This test validates the accuracy of data by comparing it against trusted sources or external references. For example, matching customer addresses with postal databases to ensure correct formatting and validity.

When you’d use this: When importing financial data, cross-referencing it with verified sources like bank statements to ensure the accuracy of transaction amounts and account balances.
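
As a sketch, assuming hypothetical transactions and bank_statements tables keyed by transaction_id, this surfaces amounts that disagree with the trusted source:

-- Transactions whose amounts don't match the bank statement feed
SELECT t.transaction_id, t.amount, b.amount AS bank_amount
FROM transactions AS t
JOIN bank_statements AS b
  ON t.transaction_id = b.transaction_id
WHERE t.amount <> b.amount;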

Integrity Test

This test examines the relationships between different data elements within a dataset. It checks for referential integrity, ensuring that foreign keys match primary keys and that there are no orphaned or invalid records.

When you’d use this: When migrating data from one database to another, checking that all foreign key relationships between tables are maintained and that there are no orphaned records.
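
A minimal referential integrity check, assuming hypothetical orders and customers tables where orders.customer_id should always reference an existing customer:

-- Orders pointing at a customer that doesn't exist (orphaned records)
SELECT o.order_id, o.customer_id
FROM orders AS o
LEFT JOIN customers AS c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;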

Validity Test

This test assesses the data against predefined rules or constraints. It ensures that data conforms to the expected format, data types, and allowable values. For instance, validating email addresses or checking if numeric values fall within defined ranges.

When you’d use this: When receiving user input for a form, validating that the provided phone numbers adhere to the expected format (e.g., a specific number of digits or a valid area code).
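
A sketch of a format check, assuming a hypothetical form_submissions table and a warehouse that supports REGEXP_LIKE (regex syntax varies by warehouse):

-- Phone numbers that aren't exactly 10 digits
SELECT user_id, phone_number
FROM form_submissions
WHERE NOT REGEXP_LIKE(phone_number, '^[0-9]{10}$');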

Deduplication Test

This test identifies and removes duplicate records within a dataset. It helps maintain data integrity by ensuring that each unique entity is represented only once.

When you’d use this: When merging customer databases from different departments, identifying and removing duplicate records based on unique identifiers like customer ID or email address.
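
To find candidates for removal, a sketch assuming a hypothetical merged_customers table deduplicated on email:

-- Emails that appear on more than one record
SELECT email, COUNT(*) AS occurrences
FROM merged_customers
GROUP BY email
HAVING COUNT(*) > 1;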

Timeliness Test

This test evaluates the freshness of data by comparing the data's timestamp against predefined thresholds. It helps identify any delays or discrepancies in data updates.

When you’d use this: When analyzing stock market data, ensuring that the data is up-to-date by comparing the timestamp of the latest data point against a predefined threshold.
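
A freshness check sketch, assuming a hypothetical stock_quotes table and Snowflake-style DATEADD syntax (date functions vary by warehouse); a returned row means the data is stale:

-- Returns a row only if the newest quote is older than 15 minutes
SELECT MAX(quote_timestamp) AS latest_quote
FROM stock_quotes
HAVING MAX(quote_timestamp) < DATEADD(minute, -15, CURRENT_TIMESTAMP);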

Uniqueness Test

This test ensures that specific attributes or combinations of attributes are unique within a dataset. It detects any violations of uniqueness constraints, such as duplicate primary keys.

When you’d use this: When importing data from external sources, verifying that each product SKU (stock keeping unit) is unique to prevent inventory and sales discrepancies.
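
A compact uniqueness check, assuming a hypothetical products table with a sku column:

-- A nonzero result means at least one SKU is duplicated
SELECT COUNT(*) - COUNT(DISTINCT sku) AS duplicate_skus
FROM products;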

Data Profiling

This test involves analyzing the structure and characteristics of the data. It provides insights into data distribution, statistics, and patterns, helping to identify anomalies, outliers, or data quality issues.

When you’d use this: Before conducting data analysis on a large dataset, profiling the data to identify any missing values, outliers, skewed distributions, or inconsistent formatting.
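
A single-query profile sketch, assuming a hypothetical orders table with an order_amount column:

-- Volume, null count, range, and mean in one pass
SELECT
  COUNT(*) AS row_count,
  COUNT(*) - COUNT(order_amount) AS null_count,
  MIN(order_amount) AS min_value,
  MAX(order_amount) AS max_value,
  AVG(order_amount) AS mean_value
FROM orders;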

Data Consistency Test

This test checks for consistency across related data entities or tables. It verifies that data attributes or values align across different datasets or data sources, ensuring harmonization and accuracy.

When you’d use this: When combining customer data from different regions, validating that attributes like country, state, or postal code align consistently across datasets to ensure accurate geographical analysis.
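
As a sketch, assuming a hypothetical regional_customers table combining all regions, this flags state codes that map to more than one country:

-- State codes appearing under multiple countries signal inconsistency
SELECT state_code, COUNT(DISTINCT country_code) AS country_count
FROM regional_customers
GROUP BY state_code
HAVING COUNT(DISTINCT country_code) > 1;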

How to Set Up Data Quality Testing

Data quality tests can be implemented anywhere along your ETL process, typically through stored procedures running one-off SQL statements within your data warehouse.

Let’s imagine that we’re a data team at WordleIndustries, responsible for sales analytics and for answering questions like: how do we generate more revenue, and how can we do it faster? In that scenario, ensuring things like schema consistency and high data accuracy is important for producing correspondingly accurate data model outputs.
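
One common pattern is to bundle several assertions into a single scheduled statement. A sketch, assuming a hypothetical sales table with customer_id and revenue columns; an empty result means every check passed:

-- Each branch asserts one rule and returns a row only on failure
SELECT 'null_customer_id' AS failed_check, COUNT(*) AS failing_rows
FROM sales
WHERE customer_id IS NULL
HAVING COUNT(*) > 0
UNION ALL
SELECT 'negative_revenue' AS failed_check, COUNT(*) AS failing_rows
FROM sales
WHERE revenue < 0
HAVING COUNT(*) > 0;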

Identify Important Dimensions and Track Metrics

Start by understanding which dimensions of data quality you want to improve, such as timeliness. Once these are identified, track metrics associated with these dimensions, like the number of data staleness issues flagged by data stakeholders.

Confirm That You’re Measuring What Matters

Ensure that the metrics you establish align with business outcomes that stakeholders care about. Identify how data drives business goals, which data assets serve those goals, and which quality issues affect those assets. Establishing metrics around these helps you focus on the most impactful uses of data within your organization. For example, the staleness-issue count above may not accurately represent timeliness if the sales reps using your data haven’t historically been filing tickets.

Make Metrics Actionable

The data quality metrics you provide should be easily understandable and applicable to stakeholders' jobs. They could be served up in an Objectives and Key Results (OKRs) report as a snapshot of data health or used to demonstrate how changes at one team’s level affect users downstream. Establishing SLAs should be done in tandem with the creators and consumers of data. In our sales use case, we’d want to make sure that both sales and operations leaders are aware of our metrics and triage processes.

Prevent Issues with Recurring Check-Ins

Periodic data quality assessments and OKRs around quality can help flag data issues early on. This is easier said than done the first time your organization tries it, but it’s the first step toward implementing automated data contracts that confirm agreed-upon data quality standards going forward.

How to build a data observability solution

If you want to learn more about how you can build ongoing monitoring for yourself, we detail the 4 pillars that Metaplane was built on here.
