The Root Causes of Data Quality Issues
What causes poor data quality? On the surface, every case can look unique, but most data quality issues trace back to one of five fundamental causes. Understanding those causes and their impact is what lets you rely on your data to drive informed decisions and better business outcomes.
It’s 2024, but data quality problems are more widespread than ever. Why?
Data teams are centralizing, transforming, and exposing an increasing amount of data from an increasing number of data sources that serve an increasing number of stakeholders in an increasing number of use cases.
The volume isn’t a problem on its own. The problem is that more data means more surface area to maintain. As maintenance overhead increases, data breakdowns become more frequent and severe, the downstream functions that depend on data falter, and stakeholder trust in the data erodes.
These breakdowns may appear varied and unique, but they usually stem from a handful of fundamental causes.
Understanding Data Quality
Before we get into the nitty-gritty of data quality issues, let’s make sure we’re all on the same page when it comes to "data quality."
Businesses operate to increase revenue, reduce costs, and mitigate risks. They leverage data as a representation of the real world to make better decisions, enhance execution, and improve products and services. That’s why we define data quality as follows:
💡 Data quality is the degree to which data meets the needs of these business use cases or an internal standard that operationalizes these use cases.
Five Main Causes of Data Quality Issues
After careful observation and analysis, these are the five most common root causes of data quality issues we’ve found.
Input Errors
Input errors occur when the incoming data doesn't conform to your expectations. These errors can stem from a variety of factors, including:
- Human error (e.g. an account executive might accidentally enter a negative value into a contract size field in Salesforce)
- Misunderstandings of data input requirements (e.g. a financial analyst inputs billing frequency in a "Monthly, Quarterly" format rather than as separate entries for each billing cycle)
- System glitches (e.g. a malfunctioning inventory system interprets "0" as the letter "O")
But even if each error seems isolated, the impact of inaccurate data is not just confined to the immediate field or record. It often ripples through the entire data ecosystem, exacerbating issues like duplicate data, where multiple records of the same entity exist due to inconsistent data entry practices. For instance, if an account executive inputs "CompanyX" in one instance and "Company X" in another due to a lack of standardized naming conventions, the system might treat these as distinct entities.
Something as small as a missing space inflates data volumes and complicates datasets, leading to flawed analyses, misguided business decisions, and ultimately, financial or reputational damage. To address this issue, techniques such as double entry, input validation at the source, secondary checks by another party, and feedback mechanisms can be helpful.
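To make one of those techniques concrete, here’s a minimal sketch of input validation at the source in Python. The field names (company_name, contract_size, billing_frequency) and the allowed values are hypothetical, not a reference to any specific CRM schema:

```python
import re

# Hypothetical validation for an incoming CRM record before it lands in the warehouse.
# Field names and allowed values are illustrative assumptions, not a real schema.
ALLOWED_BILLING = {"Monthly", "Quarterly", "Annual"}

def validate_record(record: dict) -> list[str]:
    """Return human-readable validation errors for a single record."""
    errors = []

    # Catch the negative-contract-size typo at the point of entry.
    if record.get("contract_size", 0) < 0:
        errors.append("contract_size must be non-negative")

    # Reject combined values like "Monthly, Quarterly" so each billing cycle is a separate entry.
    if record.get("billing_frequency") not in ALLOWED_BILLING:
        errors.append(f"billing_frequency must be one of {sorted(ALLOWED_BILLING)}")

    # Collapse stray whitespace so "Company  X" and "Company X" don't become separate entities.
    record["company_name"] = re.sub(r"\s+", " ", record.get("company_name", "")).strip()

    return errors

# Usage: surface the errors to whoever entered the data instead of letting them propagate.
issues = validate_record({"company_name": "Company  X", "contract_size": -500,
                          "billing_frequency": "Monthly, Quarterly"})
print(issues)
```

The same checks can live in the source system itself (e.g. Salesforce validation rules) or in the ingestion layer; the point is to reject or flag bad values before they fan out downstream.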
Infrastructure Failures
Infrastructure failures happen when your systems do not perform as expected. And when they do occur, they can disrupt carefully established data governance protocols.
For example, say Salesforce is down, or there’s an upstream sync delay stemming from a server or vendor software outage. You’ll probably see a few downstream problems, such as:
- Data inconsistency: During Salesforce downtime, any data entered or updated in connected systems might not be reflected in Salesforce, leading to inconsistencies across platforms. When the system comes back online, reconciling these discrepancies can be time-consuming and error-prone, compromising data integrity.
- Workflow interruptions: Many organizations rely on Salesforce for customer relationship management, sales tracking, and other essential functions. An outage not only halts these activities but also disrupts workflow automation that depends on real-time data from Salesforce.
- Data loss: In the event of an upstream sync delay due to a server or vendor software outage, there's a risk of data loss. Transactions or interactions occurring during the outage may not be captured or synchronized properly, leading to incomplete data (or even missing data) that can affect analytics, reporting, and decision-making.
Common solutions for these issues include redundancy and failover systems, halting pipelines as soon as an issue is detected, and automated backfills once the systems recover.
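As a rough sketch of what “halt instead of loading partial data” can look like, here’s a hedged Python example. fetch_salesforce_batch is a hypothetical placeholder for your extraction step, and the retry counts and delays are arbitrary:

```python
import time

def fetch_salesforce_batch():
    # Placeholder for the real extraction call; here it simulates an outage.
    raise ConnectionError("Salesforce API unavailable")

def sync_with_retry(max_attempts: int = 3, base_delay: float = 2.0):
    """Retry the upstream sync with exponential backoff; halt the pipeline if it keeps failing."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_salesforce_batch()
        except ConnectionError as exc:
            if attempt == max_attempts:
                # Interrupt the pipeline rather than silently writing incomplete data;
                # the failed window can be backfilled once the source recovers.
                raise RuntimeError(f"Sync failed after {attempt} attempts: {exc}") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```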
Incorrect Transformations
Incorrect transformations occur when code doesn't perform as expected—whether that’s due to unexpected data or business assumption changes.
For instance, if a transformation script is designed to categorize customer feedback based on predefined sentiment scores, an unexpected change in the machine learning model that produces those scores could render the script ineffective, misclassifying feedback and skewing analysis results. Similarly, business assumption changes, such as redefining what constitutes a 'high-value customer' without updating the corresponding transformation code, can lead to inaccurate customer segmentation, impacting marketing strategies and resource allocation.
Techniques inspired by good software engineering practices, such as unit tests and regression tests on code and data, can help prevent these issues.
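For example, a transformation like the sentiment categorization above can be unit tested the same way application code is. The function and thresholds below are made up for illustration; the idea is simply to pin business assumptions down in tests so a silent change breaks the build instead of the dashboard:

```python
# Hypothetical transformation: the thresholds are business assumptions worth pinning down.
def categorize_sentiment(score: float) -> str:
    if score >= 0.6:
        return "positive"
    if score <= 0.4:
        return "negative"
    return "neutral"

# Run with pytest; these boundary cases fail loudly if the thresholds ever drift.
def test_categorize_sentiment_boundaries():
    assert categorize_sentiment(0.9) == "positive"
    assert categorize_sentiment(0.6) == "positive"
    assert categorize_sentiment(0.5) == "neutral"
    assert categorize_sentiment(0.1) == "negative"
```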
Invalid Assumptions
Invalid assumptions often stem from changes in upstream dependencies concerning structure, content, or semantics. These dependencies are foundational when it comes to how data assets are collected, stored, and interpreted—they’re the bedrock for how data pipelines are built. So when these foundational elements change without adjustments in downstream processes, those once-valid assumptions are no longer valid.
Consider this: an e-commerce platform modifies the structure of its database, such as changing the format of product IDs from numerical to alphanumeric. If downstream analytics tools continue to assume product IDs are purely numerical, the mismatch will cause errors in data processing and analysis, leading to incorrect product tracking or sales reporting down the line.
Changes in content, like the introduction of new categories or discontinuation of certain products, can also render previous assumptions invalid. If data analysis scripts aren’t updated to reflect these content changes, they’ll continue allocating resources to discontinued products or overlook new categories, skewing sales forecasts and inventory management.
Data contracts, better CI/CD processes, code tests, and data tests can help mitigate the impacts of these issues.
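One lightweight way to do that is to turn the assumption itself into a data test. The sketch below assumes a Python check running in CI or on a schedule; the “product IDs are numeric” rule mirrors the e-commerce example above and is purely illustrative:

```python
import re

NUMERIC_ID = re.compile(r"^\d+$")

def assert_product_ids_numeric(product_ids: list[str]) -> None:
    """Fail loudly if the 'product IDs are numeric' assumption stops holding upstream."""
    violations = [pid for pid in product_ids if not NUMERIC_ID.match(pid)]
    if violations:
        raise AssertionError(
            f"{len(violations)} product IDs violate the numeric-ID contract, e.g. {violations[:3]}"
        )

assert_product_ids_numeric(["10234", "10235"])        # passes
# assert_product_ids_numeric(["10236", "SKU-10237"])  # would raise AssertionError
```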
Ontological Misalignment
This is not a technical issue—it’s a human one. Different teams within an organization often have different definitions of metrics or entities (e.g. "active user," "customer lifetime value," or "successful transaction"). All this divergence stems from one simple thing: the lack of a unified framework or common language across the organization. It’s usually a result of organizational silos, and it leads to different formats, inconsistent interpretations, and poor data quality.
For example, the marketing team might define an "active user" as someone who opens their app at least once a week, while the product team considers an active user to be someone who completes a specific action within the app in the same timeframe. This discrepancy can lead to conflicting reports on user engagement which makes it challenging to assess the effectiveness of user retention strategies.
That’s why establishing organization-wide data governance protocols, including standardized definitions for all key metrics and entities, is crucial. Regular cross-departmental meetings to discuss and align on these definitions help keep everyone on the same page. Plus, they empower teams and foster a more cohesive, data-informed culture. Win-win.
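Codifying the agreed-upon definition in one shared place helps too. Here’s a minimal sketch, assuming event names and a seven-day window that are invented for the example; the point is that marketing and product import the same function rather than each re-deriving "active user":

```python
from datetime import datetime, timedelta

# Assumed, illustrative definition agreed on by both teams.
QUALIFYING_EVENTS = {"app_open", "completed_core_action"}
ACTIVE_WINDOW = timedelta(days=7)

def is_active_user(events: list[dict], as_of: datetime) -> bool:
    """A user is active if they performed a qualifying event within the trailing window."""
    cutoff = as_of - ACTIVE_WINDOW
    return any(
        e["name"] in QUALIFYING_EVENTS and e["timestamp"] >= cutoff
        for e in events
    )
```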
Knowable vs. Unknowable, Controllable vs. Uncontrollable
Understanding these causes can help teams categorize data quality issues on a spectrum of knowable and unknowable, controllable and uncontrollable. While it's impossible to prevent or eliminate everything, focusing on what is under your control and what is knowable can significantly reduce data quality issues and improve data quality management.
For example, incorrect transformations due to bad dbt PRs being merged are knowable and controllable by the team, while input errors from an upstream team are unknowable but controllable with the right constraints. Some things, however, are not in your control, such as a third-party system delay or an API change from a data vendor. Identifying which issues to eliminate and which to detect early and mitigate is crucial, and these decisions should be based on the cause of the issue and where in the data's provenance it sits.
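One way to make this triage explicit is a simple matrix that maps each known failure mode to whether it is knowable and controllable, and therefore whether to eliminate it or detect and mitigate it. The entries below are illustrative, not exhaustive:

```python
# Illustrative triage map; adapt the categories and strategies to your own stack.
ISSUE_MATRIX = {
    "bad dbt PR merged":        {"knowable": True,  "controllable": True,  "strategy": "eliminate via CI tests"},
    "upstream input errors":    {"knowable": False, "controllable": True,  "strategy": "constrain via validation"},
    "vendor API change":        {"knowable": False, "controllable": False, "strategy": "detect via monitoring"},
    "third-party system delay": {"knowable": False, "controllable": False, "strategy": "detect via freshness alerts"},
}

for issue, profile in ISSUE_MATRIX.items():
    print(f"{issue}: {profile['strategy']}")
```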
Understand the Root Causes, Start Improving Data Quality
Data quality issues may have various symptoms, but understanding their root causes is the first step toward addressing them. Adopting a systematic and structured approach and learning from other communities can reduce the number and severity of these issues. The journey towards high-quality data is ongoing, but with the right strategies, it's a journey well worth taking.
About Metaplane
Metaplane is a market leader in Data Observability, helping you find and prevent current and future data incidents to increase trust in your data team. We provide data quality monitoring that integrates with your entire data stack, from transactional databases to warehouses and lakes to BI tools. All you need to do is pick the tables to deploy tests on, and we’ll automatically create and update thresholds using your historical data and metadata. Metaplane also comes with issue resolution and prevention features, such as column-level lineage to trace an incident to its root, as well as Data CI/CD to prevent merging changes that would break your models.
Get your free account today or reach out for best practices before getting started.