How to Use Machine Learning for Robust Data Quality Checks

Data quality is paramount in the era of data-driven decision-making. Explore the advantages of leveraging machine learning for data quality checks, understand its limitations, and learn about the tools and techniques that can streamline your data quality assurance processes.

Kevin Hu, PhD

and

May 16, 2023

Kevin Hu, PhD

Co-founder / Data and ML

May 16, 2023

How to Use Machine Learning for Robust Data Quality Checks

In the realm of data analytics, the adage "Garbage in, garbage out" holds true. High-quality data is crucial for accurate algorithms and reliable machine learning models. However, manual data quality checks can be time-consuming and error-prone. Enter machine learning—an innovative and scalable approach to performing robust data quality checks.

Three Examples of Data Quality Checks

To provide a comprehensive understanding of machine learning for data quality checks, let's explore a few common data quality issues and their solutions using SQL snippets that run on Snowflake. For now, let's assume that we're concerned with a `REVENUE` table that contained revenue by transaction (although we know that, realistically, a single table like this wouldn't actually exist).

1. Delayed Data: This check is to ensure that the `REVENUE` table is updated regularly. If the table hasn't been updated recently, it might indicate a problem with the data pipeline.

```sql

SELECT TABLE_SCHEMA, TABLE_NAME, LAST_ALTERED

FROM INFORMATION_SCHEMA.TABLES

WHERE TABLE_NAME = 'REVENUE'

ORDER BY LAST_ALTERED ASC;

```

2. Non-Unique Primary Keys: This updated SQL snippet will return the percentage of unique `transaction_id`s in the `REVENUE` table.

```sql

SELECT

(COUNT(DISTINCT transaction_id) / COUNT(*)::float) * 100 AS percentage_unique

FROM REVENUE;

```

In this SQL snippet, `COUNT(DISTINCT transaction_id)` returns the number of unique `transaction_id`s, and `COUNT(*)` returns the total number of rows in the `REVENUE` table. The ratio of these two numbers, multiplied by 100, gives the percentage of unique `transaction_id`s.

3. Spike in the Mean of a Number: This updated SQL snippet will return the average revenue for the last day.

```sql

SELECT

DATE_TRUNC('day', transaction_date) AS date,

AVG(amount) AS mean_revenue

FROM REVENUE

WHERE DATE_TRUNC('day', transaction_date) = (

SELECT MAX(DATE_TRUNC('day', transaction_date))

FROM REVENUE

)

GROUP BY DATE_TRUNC('day', transaction_date)

```

In this SQL snippet, `DATE_TRUNC('day', transaction_date)` is used to group the revenue by day, and `AVG(amount)` calculates the average revenue for each day. The `WHERE` clause restricts the results to the last day, which is determined by the subquery `SELECT MAX(DATE_TRUNC('day', transaction_date)) FROM REVENUE`.

Why Use Machine Learning for Data Quality?

Machine learning offers transformative capabilities in maintaining data quality. By leveraging annotated data, accounting for trends and seasonality, and providing inherent scalability, machine learning significantly reduces the need for manual threshold setting and adjustments. It also allows for the detection of "unknown unknowns" - anomalies or issues that were not previously anticipated or identified.

Consider a dataset with strong seasonality patterns. Traditional rule-based approaches may struggle to differentiate between genuine anomalies and seasonal fluctuations. Machine learning, on the other hand, can effectively learn these patterns, enhancing the precision of anomaly detection.

Moreover, machine learning models can adapt to changes in data over time. This is particularly useful when dealing with dynamic datasets where patterns and trends evolve. With machine learning, there's no need to write new code or adjust thresholds every time the data changes. The model can learn from the new data and adjust its predictions accordingly.

Let's examine an SQL snippet that illustrates this concept using Snowflake:

```sql

‍
SELECT date, COUNT(*) as row_count

‍
FROM sales_data

‍
GROUP BY date

‍
HAVING row_count < (

‍
SELECT AVG(row_count) - 3*STDDEV_POP(row_count)

‍
FROM (

‍
SELECT COUNT(*) as row_count

‍
FROM sales_data

‍
GROUP BY date

‍
))

```

This query aims to identify days with abnormally low sales. However, it overlooks seasonal patterns, potentially generating false alarms during traditionally slow periods. A machine learning model, on the other hand, could learn these seasonal patterns and adjust its anomaly detection accordingly, reducing the likelihood of false alarms.

Pictoral representation of a Metaplane freshness monitor adapting to seasonal patterns.

Recognizing the Limitations of Machine Learning

While machine learning is a powerful asset, it may not always be necessary or ideal for every scenario. For instance, when you only want to monitor a small set of tables or metrics for known issues, the complexity and overhead of a machine learning solution might outweigh the benefits.

Machine learning models can be harder to develop and deploy than manual thresholds or simple rule-based checks. They require a significant amount of data to train effectively, and the process of training, validating, and tuning these models can be time-consuming and resource-intensive.

Moreover, machine learning models can suffer from overfitting if not properly trained and validated, leading to less reliable predictions. Overfitting occurs when a model learns the training data too well, to the point where it performs poorly on new, unseen data because it's too focused on the specific details and noise in the training set.

Additionally, machine learning models often act as a black box, making it difficult to understand why a particular prediction was made. This lack of interpretability can be a problem in scenarios where understanding the reasoning behind a decision is important.

Understanding the nature of your data and choosing the right tool for the task is crucial. In some cases, a simple rule-based system or manual threshold might be more efficient and effective. Always consider the trade-offs between complexity, interpretability, development time, and performance when deciding whether to use machine learning for data quality checks. But if you do decide to deploy ML, make sure to consider the following.

Considerations When Deploying ML

Incorporating User Feedback: An Iterative Approach to Model Refinement

Machine learning models, while powerful, are not infallible and require constant refinement. User feedback is instrumental in this iterative process as it provides unique insights that purely quantitative metrics or automated checks may miss. Users bring crucial domain knowledge to the table, offering context and nuances that models lack. Encourage a feedback loop with your users, perhaps through a simple feedback form or a custom user interface. Each piece of feedback helps your model become more aligned with the real-world complexities it seeks to capture, enhancing its effectiveness. Remember, model refinement is a continuous, adaptive process—it's about learning from both your data and your users, ensuring your models remain robust and relevant in their data quality checks.

Real example of a Metaplane Slack Alert that accepts user feedback.

Evaluating Model Performance: Metrics Matter

When applying machine learning to data quality checks, it's essential to evaluate model performance with appropriate metrics. For anomaly detection problems, precision (how many of the flagged anomalies are true anomalies), recall (what proportion of actual anomalies were detected), and the F1 score (the harmonic mean of precision and recall) can be particularly helpful. By monitoring these metrics, you can understand the trade-off between false positives and missed anomalies, and adjust your model accordingly. Employing methods such as cross-validation can provide a more robust evaluation of your model's performance and its ability to generalize.

Addressing Oversampling: A Key to Data Relevance

Oversampling, or retrieving data more frequently than its update rate, can overburden your data infrastructure and result in redundant information that offers no fresh insights. To mitigate oversampling, understand the nature and update frequency of your data, and align your data retrieval strategy accordingly. Implement mechanisms to identify and manage oversampling, such as a data processing step that removes duplicate data points based on a unique identifier or a combination of attributes. For instance, using pandas, you can drop duplicate rows in a DataFrame (`df.drop_duplicates(subset=['unique_id'], keep='last', inplace=True)`), thereby cleansing your data of unnecessary redundancy and maintaining the accuracy of subsequent analyses and machine learning models.

Handling Imbalanced Data: A Balancing Act

In data quality checks, true anomalies are often rare compared to the majority of 'normal' data points, leading to imbalanced datasets. Machine learning models trained on such data can become biased towards the majority class, resulting in missed anomalies. To address this, techniques such as Synthetic Minority Over-sampling Technique (SMOTE) or adjusting class weights can help balance the scales and improve the model's ability to detect rare, but critical, anomalies.

Utilizing Open-Source Tools: The Prophet Example

Several open-source tools can assist in harnessing machine learning for data quality checks. One such tool is Prophet, a time series forecasting procedure that incorporates additive models with non-linear trends, seasonal effects, and holiday effects.

png — Example of predictions with confidence intervals from Prophet.

Let's revisit the example of detecting anomalies in daily sales data using Prophet:

```python

from fbprophet import Prophet

‍

# Assume df is a pandas DataFrame with 'date' and 'sales' columns

model = Prophet(daily_seasonality=True)

model.fit(df)

‍

future = model.make_future_dataframe(periods=0)

forecast = model.predict(future)

‍

# Anomalies are where the actual sales fall outside the Prophet model's uncertainty intervals

df['anomaly'] = (df['sales'] < forecast['yhat_lower']) | (df['sales'] > forecast['yhat_upper'])

```

This approach enables a more sophisticated data quality check without the need to manually handle seasonality.

Utilizing Commercial Offerings: Machine Learning with Metaplane

While open-source tools provide flexibility, they often require expertise and time for effective implementation. If you're looking for an all-in-one solution that combines the power of machine learning with user-friendliness, Metaplane is an excellent choice.

Metaplane offers machine learning-based data quality checks out of the box. It seamlessly integrates into your data stack, continuously monitoring your datasets for anomalies, ensuring data completeness, and tracking the overall health of your data over time. This continuous monitoring allows for real-time detection of data issues, enabling you to address them promptly and maintain the integrity of your data.

In addition to its machine learning capabilities, Metaplane also provides a user-friendly interface that makes it easy to set up and manage your data quality checks. You can easily configure the tool to monitor specific metrics or tables, set custom thresholds, and receive alerts when anomalies are detected. This makes Metaplane a suitable choice for both technical and non-technical users.

Moreover, Metaplane supports a wide range of data sources and can handle both structured and unstructured data. This makes it a versatile tool that can adapt to your specific data needs.

With Metaplane, you can focus on extracting valuable insights from your data while the tool takes care of rigorous data quality checks. This allows you to spend less time worrying about data quality and more time leveraging your data to drive decision-making and business growth.

Summary

In the era of data-driven decision-making, maintaining data quality is paramount. Machine learning provides a robust tool for automating data quality checks, but it's not a one-size-fits-all solution. It's essential to understand your data, choose the right tools, and employ appropriate techniques. However, machine learning models can be complex to develop and deploy, especially when monitoring a small set of tables or metrics for known issues.

User feedback is invaluable in refining machine learning models, and a robust data infrastructure is crucial for maintaining data accuracy. The ultimate goal is to shift from reacting to data quality issues to preventing them, ensuring the reliability and precision of data-driven decisions in areas like BI dashboards, data-driven decision making, and operationalization.