Data Quality Fundamentals: What It Is, Why It Matters, and How You Can Improve It
In 2021, problems with Zillow’s machine learning algorithm led to more than $300 million in losses. About a year earlier, limitations on table rows caused Public Health England to underreport 16,000 COVID-19 infections. And of course, there’s the classic cautionary tale of the Mars Climate Orbiter, a $125 million spacecraft lost in space because of a discrepancy between metric and imperial measurement units.
Data quality clearly has a profound impact on the success of an organization. No wonder data teams are in such high demand. But supplying quality data that stakeholders can trust isn’t easy. It requires data teams to master key data quality concepts and practices. Only then can they deliver on their mandate.
In this article, we highlight the fundamentals of data quality, from what it is and why it matters to how to measure and improve it.
What is data quality?
Data quality is the degree to which data serves an external use case or meets an internal standard. Ten dimensions of data quality exist across two categories: intrinsic dimensions, which are independent of use cases, and extrinsic dimensions, which are dependent on use cases. Intrinsic dimensions include accuracy, completeness, consistency, freshness, as well as privacy and security. Extrinsic dimensions include relevance, reliability, timeliness, usability, and validity.
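The two-category taxonomy above can be sketched as a simple data structure. This is an illustrative sketch only; the name DATA_QUALITY_DIMENSIONS is ours, and treating “privacy and security” as a single intrinsic dimension is an assumption that keeps the total at ten, matching the article’s count.

```python
# A minimal sketch of the ten dimensions across two categories.
# Grouping "privacy and security" as one dimension is an assumption
# made so the intrinsic and extrinsic lists total ten.
DATA_QUALITY_DIMENSIONS = {
    "intrinsic": [  # independent of use cases
        "accuracy", "completeness", "consistency",
        "freshness", "privacy and security",
    ],
    "extrinsic": [  # dependent on use cases
        "relevance", "reliability", "timeliness",
        "usability", "validity",
    ],
}

assert sum(len(dims) for dims in DATA_QUALITY_DIMENSIONS.values()) == 10
```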
Why is data quality important?
Data quality matters because it significantly impacts your business performance. High-quality data helps you make better decisions and take more effective actions, leading to increased revenue, decreased costs, and reduced risk. Low-quality data has the opposite effect, resulting in poor profitability and an increased risk that your business will fold prematurely.
Data quality is especially important for product-led growth (PLG) companies, which rely on high-quality data to determine their product roadmaps and deliver exceptional customer experiences.
The top data quality challenges
Data quality problems stem from both machine and human errors. On the machine side, data teams often struggle with software sprawl and data proliferation, and they frequently lack the metadata they need to do their jobs efficiently and effectively. On the human side, data creators inevitably make typos and other data entry errors that reduce data quality, while data teams themselves often lack context around business metrics, which makes it tough to spot problems in the data. Finally, it’s difficult for data leaders to hire experienced team members, leaving them with reduced capacity for addressing data quality issues.
How to measure your data quality
To measure your data quality, start from your use cases and work toward concrete metrics:
- Identify your use cases. Does your organization use data for decision-making purposes, to fuel go-to-market operations, or to train a machine learning algorithm?
- Identify your pain points. Do you struggle with slow dashboards or stale tables? Perhaps it’s something bigger, like a lack of trust in your company’s data across the organization.
- Connect your biggest challenges to one or more of the ten dimensions of data quality. Ask yourself: Of the causes of recent trouble, which data quality dimensions are relevant, and how can they be measured?
- Measure the relevant data quality metrics. Take data accuracy as an example: it can be measured by the degree to which your data matches a reference set, corroborates with other data, passes rules and thresholds that classify data errors, or can be verified by humans.
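One of the accuracy measurements mentioned above, matching data against a reference set, can be sketched in a few lines. This is a minimal illustration, not a standard API; the function name, the `key` and `field` parameters, and the sample records are all hypothetical.

```python
def accuracy_against_reference(records, reference, key="id", field="value"):
    """Share of records whose `field` matches a trusted reference set.

    `records` and `reference` are lists of dicts keyed by `key`.
    Only records that appear in the reference set are checked.
    """
    ref = {r[key]: r[field] for r in reference}
    checked = [r for r in records if r[key] in ref]
    if not checked:
        return 0.0
    matches = sum(1 for r in checked if r[field] == ref[r[key]])
    return matches / len(checked)


# Hypothetical example: two of three records agree with the reference.
records = [
    {"id": 1, "value": "NY"},
    {"id": 2, "value": "CA"},
    {"id": 3, "value": "TX"},
]
reference = [
    {"id": 1, "value": "NY"},
    {"id": 2, "value": "WA"},
    {"id": 3, "value": "TX"},
]
print(accuracy_against_reference(records, reference))  # 2 of 3 match
```

In practice the same idea is applied at the warehouse level, comparing a table against a source-of-truth dataset rather than in-memory dicts.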
How to manage data quality incidents
No matter the people, processes, and technology at your disposal, data quality issues will inevitably crop up. To manage these incidents, you can follow our six-step data quality issue management process:
- Preparation is about getting ready for the data quality incidents you’ll inevitably deal with in the future.
- Identification is about gathering evidence that proves a data quality incident exists and documenting its severity, impact, and root cause.
- Containment is about preventing the incident from escalating.
- Eradication is about resolving the problem.
- Recovery is about returning systems to a normal state following the incident.
- Learning is about analyzing both what went well and what could be done differently in the future.
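The six steps above form an ordered lifecycle, which can be modeled as a simple state machine. This is a sketch under our own naming assumptions; the enum and helper function are illustrative, not part of any incident-management tool.

```python
from enum import Enum
from typing import Optional


class IncidentStage(Enum):
    """The six stages of the data quality incident process, in order."""
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LEARNING = 6


def next_stage(stage: IncidentStage) -> Optional[IncidentStage]:
    """Advance an incident to the next stage, or None after learning."""
    order = list(IncidentStage)
    i = order.index(stage)
    return order[i + 1] if i + 1 < len(order) else None
```

Encoding the lifecycle this way makes it easy to track where each open incident sits and to report on how long incidents spend in each stage.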
How to improve your data quality
Achieving superior data quality requires a combination of people, processes, and technology. You must be committed to best practices, like regularly conducting data quality checks and having a streamlined process for responding to data quality incidents. It also requires you to be proactive. For example, providing data creators with data entry training is a must, as is periodically conducting data quality audits. Finally, you’ll want to have the data infrastructure required to monitor your data quality as it ebbs and flows over time. Carried out consistently, these practices can help you improve your data quality.
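Two of the checks mentioned above, table freshness and field completeness, are simple enough to sketch directly. This is a hedged illustration: the function names, the 24-hour threshold, and the sample rows are assumptions, and production monitoring would typically run such checks against warehouse tables on a schedule.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_updated, max_age_hours=24, now=None):
    """Return True if the data was updated within the allowed window.

    The 24-hour default is illustrative; tune the threshold per table.
    """
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= timedelta(hours=max_age_hours)


def check_completeness(rows, required_fields):
    """Fraction of rows in which every required field is populated."""
    if not rows:
        return 0.0
    ok = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    return ok / len(rows)
```

Tracking these numbers over time, rather than checking them once, is what lets you see data quality ebb and flow and catch regressions early.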