The Root Causes of Data Quality Issues
This concise guide will illuminate the five key causes of data quality issues and equip you with practical strategies to address them. With a deep understanding of data quality and its impacts, you can leverage your data effectively, driving informed decision-making and optimal business outcomes.
In our increasingly data-centric world, maintaining high-quality data is paramount for businesses across industries. However, it's no secret that data quality issues are rampant and can cause severe problems for businesses. Although these issues may appear varied and unique, they often stem from a few fundamental causes.
Understanding Data Quality
Before we delve into the causes of data quality issues, it's essential to understand what we mean by "data quality." Businesses operate to increase revenue, reduce costs, and mitigate risks. They leverage data as a representation of the real world to make better decisions, enhance execution, and improve products and services. Therefore, we can define data quality as the degree to which data meets the needs of these business use cases or an internal standard that operationalizes these use cases.
Five Main Causes of Data Quality Issues
Through careful observation and analysis, we've identified five primary causes of data quality problems.
Input errors occur when the incoming data doesn't conform to your expectations. For instance, an account executive might accidentally enter a negative value into a contract size field in Salesforce. To address this issue, techniques such as double entry, input validation at the source, secondary checks by another party, and feedback mechanisms can be helpful.
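As a minimal sketch of input validation at the source, the check below rejects a negative contract size before it enters the system. The field names (`contract_size`) and record shape are illustrative assumptions, not a real Salesforce schema:

```python
def validate_contract(record: dict) -> list[str]:
    """Return a list of validation errors for an incoming contract record.

    Field name "contract_size" is illustrative, not a real Salesforce field.
    """
    errors = []
    size = record.get("contract_size")
    if size is None:
        errors.append("contract_size is missing")
    elif size < 0:
        errors.append(f"contract_size must be non-negative, got {size}")
    return errors

# A valid record passes; the account executive's typo is caught at entry time.
assert validate_contract({"contract_size": 50000}) == []
assert validate_contract({"contract_size": -50000}) == [
    "contract_size must be non-negative, got -50000"
]
```

A secondary check by another party can reuse the same function on records already in the system, giving you a feedback loop on how often bad input slips through.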
Infrastructure failures happen when your systems do not perform as expected. For example, Salesforce could be down, or there could be an upstream sync delay stemming from a server outage or vendor software outage. Solutions for these issues can include redundancy and failover systems, interrupts upon detecting an issue, and automated backfills.
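One common building block for surviving transient outages is retry with exponential backoff. The sketch below assumes a hypothetical `fetch` callable standing in for a sync job; on final failure it re-raises so an alerting or backfill process can take over:

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0):
    """Call `fetch` (a stand-in for a sync job), retrying on transient
    connection failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # escalate: trigger an alert or an automated backfill
            time.sleep(base_delay * 2 ** attempt)
```

Redundancy and failover go beyond what a snippet can show, but the pattern is the same: detect the failure, interrupt the pipeline, and recover automatically rather than silently delivering stale data.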
Incorrect transformations occur when code doesn't have the intended effect, perhaps due to unexpected data or business assumption changes. Techniques inspired by good software engineering practices, such as unit tests and regression tests on code and data, can help prevent these issues.
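To make the distinction between a test on code and a test on data concrete, here is a toy transformation with both. The `net_revenue` function and the `[0, 1]` discount assumption are invented for illustration:

```python
def net_revenue(gross: float, discount_rate: float) -> float:
    """Toy transformation: apply a fractional discount to gross revenue."""
    return gross * (1 - discount_rate)


# Unit test on code: a known input must yield a known output.
assert net_revenue(100.0, 0.2) == 80.0

# Test on data: the business assumption that discounts stay within [0, 1].
rows = [{"gross": 100.0, "discount_rate": 0.2},
        {"gross": 250.0, "discount_rate": 0.0}]
assert all(0 <= r["discount_rate"] <= 1 for r in rows)
```

The first assertion catches a bad code change in CI; the second catches the moment the incoming data stops matching the assumption the code was written under.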
Invalid assumptions often stem from changes in upstream dependencies concerning structure, content, or semantics. Data contracts, better CI/CD processes, code tests, and data tests can help to mitigate the impacts of these issues.
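A lightweight way to make such assumptions explicit is to encode your side of the contract and check incoming payloads against it. The column names and types below are hypothetical:

```python
# Our side of a hypothetical data contract: required columns and their types.
EXPECTED_COLUMNS = {"order_id": int, "amount": float}


def check_contract(row: dict) -> list[str]:
    """Flag structural drift in an upstream payload (missing columns or
    changed types) before it silently breaks downstream transformations."""
    problems = []
    for col, typ in EXPECTED_COLUMNS.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            problems.append(
                f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}"
            )
    return problems


# An upstream change that starts sending IDs as strings is caught immediately.
assert check_contract({"order_id": 1, "amount": 9.99}) == []
assert check_contract({"order_id": "1", "amount": 9.99}) == [
    "order_id: expected int, got str"
]
```

Running a check like this in CI, or at the boundary where upstream data lands, turns a silent semantic break into a loud, attributable failure.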
Inconsistent definitions are not a technical issue but a human one. Different teams within an organization may define the same metric or entity differently, leading to conflicting interpretations and low data quality. Knowledge management, building data dictionaries or glossaries, and regular communication can help ensure everyone is on the same page.
Knowable vs. Unknowable, Controllable vs. Uncontrollable
Understanding these causes helps teams place data quality issues on two spectra: knowable versus unknowable, and controllable versus uncontrollable. While it's impossible to prevent or eliminate everything, focusing on what is knowable and under your control can significantly reduce data quality issues.
For example, incorrect transformations due to bad dbt PRs being merged are knowable and controllable by the team, while input errors from an upstream team are unknowable but controllable with the right constraints. Some things, however, are not in your control, such as a third-party system delay or an API change from a data vendor. Deciding which issues to eliminate outright and which to detect early and mitigate is crucial, and those decisions should be based on the cause of the issue and where it sits in the data's provenance.
Data quality issues may have various symptoms, but understanding their root causes can significantly help in addressing them. By adopting a systematic and structured approach and learning from other communities, we can reduce the number and severity of these issues. The journey towards better data quality is ongoing, but with the right strategies, it's a journey well worth taking.
Interested in watching this verbally explained, with accompanying images? Check out the full video here.
Metaplane is a market leader in data observability, helping you find and prevent data incidents to increase trust in your data team. We provide data quality monitoring that integrates with your entire data stack, from transactional databases to warehouses and lakes to BI tools. All you need to do is pick the tables to deploy tests on, and we'll automatically create and update thresholds using your historical data and metadata. Metaplane also includes issue resolution and prevention features, such as column-level lineage to trace an incident back to its root, and data CI/CD to prevent merging changes that would break your models.