Machine Learning should be data-centric, not model-centric. Here’s why.

As an industry, we resort to tweaking machine learning models themselves, rather than optimizing and refining the data we use to train and deploy these models. Here's why we need to put data quality over model complexity.

Kevin Hu, PhD

January 12, 2024

Co-founder / Data and ML

We’ve all heard the age-old adage “garbage in, garbage out”—your output is only as good as your input. It’s simple and universally understood. But for some reason, when it comes to machine learning (ML), that universal understanding goes out the window.

For years, we’ve been obsessing over the glamor of data models, constantly fine-tuning our ML algorithms to chase that elusive ‘perfect’ model. But here's the thing: this model-centric approach is fundamentally flawed because it causes data teams to overlook the very foundation of effective ML—the data itself.

Model-centric motivation

Until recently, the typical ML model creation steps would be:

Gather data
Clean it
Try multiple models
Tune the model parameters
Push it to production
Monitor the model (if at all)

Because data is the starting point, a simpler model trained on high-quality, well-curated data can outperform a complex algorithm fueled by subpar data. Yet data teams have prioritized the later, model-focused steps (trying and tuning models) for two big reasons:

1. Making models more complex and sophisticated is viewed as more “elite” work.

There's an undeniable allure to complexity in the ML community. Developing cutting-edge models carries a sense of "Oh là là" because there’s a perception that it’s more intellectually glamorous.

This appeal, which tends to garner more recognition and admiration from peers, creates a self-perpetuating cycle where complexity drives more attention, resources, and talent towards increasingly sophisticated models—often at the expense of addressing fundamental data quality issues.

2. Data engineering training and education are heavily skewed toward novel modeling techniques.

Most published papers herald a new breakthrough in model design that promises to set the next big benchmark in algorithmic performance because, historically, it’s a direct route to publication, funding, and professional recognition. That’s why academia, as a whole, tends to prioritize model development as a marker of intellectual achievement.

Meanwhile, the meticulous and often less visible work of dataset creation and curation doesn't align as neatly with the current metrics of academic success, so it’s underemphasized (even though datasets like MNIST, ImageNet, and CIFAR have had an enormous impact on computer vision and on ML as a whole).

Data-centric ML shifts the focus back to the input

When we’re too focused on the models themselves (such as in a model-centric framework), we neglect data quality, widening the gap between what we think should happen in theory and what actually happens in practice.

For instance, say you’re a data engineer for a hospital. Your team is working on an ML model to predict which patients are likely to be readmitted to the hospital, but you’re using a model-centric strategy. You’re more focused on refining the model than on the quality of the data you’re ingesting. What’s the worst that could happen, right?

Well, as it turns out, your data is incomplete and inconsistent from the start—some patient records are missing important information, the data isn’t standardized across different departments, and it isn’t fully representative of all types of patients.

When you finally deploy this model to production, it doesn’t accurately predict readmissions. Not surprising, because the data it learned from was incomplete and flawed. And now, all the time and resources you spent on this project are wasted because the model is basically useless for your hospital. Or worse, you may not catch that the data was bad and start making misinformed strategic decisions because of it.
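A quick audit before training can surface the kinds of problems in this scenario. The sketch below is illustrative only: the records, the field names (`department`, `age`, `discharge_date`), and the helper functions are made up for this example, not taken from a real hospital schema.

```python
from collections import Counter

# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"patient_id": 1, "department": "Cardiology", "age": 67, "discharge_date": "2023-11-02"},
    {"patient_id": 2, "department": "cardiology", "age": None, "discharge_date": "2023-11-05"},
    {"patient_id": 3, "department": "CARD", "age": 54, "discharge_date": None},
    {"patient_id": 4, "department": "Oncology", "age": 71, "discharge_date": "2023-11-09"},
]


def missing_rate(rows, field):
    """Fraction of rows where `field` is absent or None."""
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def distinct_spellings(rows, field):
    """Count raw values of a categorical field to surface inconsistent labels."""
    return Counter(row[field] for row in rows if row.get(field) is not None)


age_missing = missing_rate(records, "age")
discharge_missing = missing_rate(records, "discharge_date")
department_labels = distinct_spellings(records, "department")

print(f"age missing: {age_missing:.0%}")
print(f"discharge_date missing: {discharge_missing:.0%}")
print(f"department labels: {dict(department_labels)}")
```

Here three different spellings stand in for what is probably one cardiology department, which is exactly the kind of cross-department inconsistency that a model will happily learn from without complaint.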

No algorithm can fully compensate for poor data quality, and better models alone rarely give your organization a lasting competitive advantage in the market. The proprietary edge comes from data. That’s arguably why Google could open-source TensorFlow and Meta could open-source PyTorch: the modeling frameworks weren’t the scarce asset. The data was.

Two important distinctions

ML is inherently data-driven, not data-centric. Models extract patterns and insights from the data they're fed—that’s a given. The problem is that they don’t care whether that data is good or bad.

That’s why the shift to data-centricity is so important: it means you don’t just use any data to drive decisions. Instead, this framework places data quality and integrity at the forefront of the entire process. It focuses on data cleansing, pre-processing, balancing, and augmentation rather than hyperparameter selection and architectural changes.
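To make "balancing" concrete, here is a minimal sketch of one common technique, random oversampling of the under-represented class. The `oversample_minority` helper and the `readmitted` label are hypothetical names chosen for illustration, and oversampling is only one of several ways to balance a dataset.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Randomly duplicate rows of smaller classes until each matches the largest class."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical, heavily imbalanced training labels.
rows = [{"readmitted": 0}] * 8 + [{"readmitted": 1}] * 2
balanced = oversample_minority(rows, "readmitted")
counts = Counter(row["readmitted"] for row in balanced)
print(counts)
```

If you try something like this, apply it to the training split only, so that duplicated rows don't leak into the data you evaluate on.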

And while ML applications often benefit more from better data than from better algorithms, there are two distinctions to keep in mind.

1. Data quality is determined by the use case at hand. There’s no such thing as perfect data quality in a vacuum. Real-world data tends to be messy and plagued with issues, which range from missing values and inconsistencies to more serious problems like bias across groups in the data. That bias already hurts model performance, and it is worse still, and often illegal, when it affects protected groups.

2. Data quality is the result of continual improvement. Data engineers should be responsible for identifying which objects should have monitors, deciding what types of monitors those objects need, and setting up those monitors. From there, they can use column-level lineage graphs to trace an issue upstream to understand where and how to focus resolution efforts. By monitoring the metadata this way, they’ll have a pulse on the current state of data quality and how it changes over time.
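As a rough idea of what a monitor can look like, the sketch below flags a metadata value (here, a table's daily row count) that falls far outside its recent history. The `is_anomalous` function, the z-score threshold of 3, and the sample numbers are all assumptions for illustration; production monitors typically use more robust methods and account for seasonality.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than `z_threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table.
row_count_history = [10_120, 10_080, 10_200, 10_150, 10_095, 10_170, 10_110]

print(is_anomalous(row_count_history, 10_140))  # a typical day
print(is_anomalous(row_count_history, 4_300))   # a sudden drop in volume
```

The same check could be pointed at other metadata, such as a column's null rate or the time since a table was last updated.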

Data-centric ML starts with turning your dimensions into metrics

Truly data-centric ML shifts the focus toward enhancing and enriching the data used to train the algorithms themselves, prioritizing intrinsic and extrinsic data quality dimensions. Intrinsic dimensions are based on the inherent properties of the data, whereas extrinsic dimensions are based on how well the data fits a particular purpose or use case.

*Visualization of data quality dimensions as they apply to the data stack*

You then operationalize these dimensions into meaningful metrics by defining specific, measurable criteria for each dimension. For instance, for accuracy, you might measure the percentage of records in a database that are verified against a trusted source. Or, for reliability, you might track the frequency of errors or discrepancies in data over a certain period.

The key is to collect these metrics consistently over time, which will enable you to identify trends, pinpoint areas for improvement, and track progress toward your data quality goals. At the end of it all, you’ll have more performant and reliable ML models that steer you toward more ethical and equitable ML practices.
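Here is a minimal sketch of turning the accuracy dimension into a metric, assuming you have a trusted source to verify against. The tables, the `customer_id` and `country` fields, and the `accuracy_rate` helper are hypothetical; the point is only that the dimension becomes a number you can log on a schedule.

```python
def accuracy_rate(records, trusted, key, field):
    """Share of records whose `field` matches the trusted source, among records the source covers."""
    trusted_by_key = {row[key]: row[field] for row in trusted}
    checked = [r for r in records if r[key] in trusted_by_key]
    if not checked:
        return None  # nothing could be verified
    matches = sum(1 for r in checked if r[field] == trusted_by_key[r[key]])
    return matches / len(checked)


# Hypothetical warehouse table and trusted reference data.
warehouse = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "DE"},
    {"customer_id": 3, "country": "FR"},
    {"customer_id": 4, "country": "US"},
]
trusted_source = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "DE"},
    {"customer_id": 3, "country": "ES"},
    {"customer_id": 4, "country": "US"},
]

accuracy = accuracy_rate(warehouse, trusted_source, "customer_id", "country")

# Append each measurement to a log so trends are visible over time.
metric_log = []
metric_log.append({"date": "2024-01-12", "metric": "country_accuracy", "value": accuracy})
print(metric_log)
```

Running the same measurement on a schedule and appending to the log is what lets you see whether accuracy is trending up or down, rather than only knowing its value today.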

Your ML model can only be as good as your data is

Your ML model's success is directly linked to the data fueling it. No amount of advanced algorithm design or complex model architecture can fix problems rooted in the data itself.

The future of ML is data-centric. A data-centric framework focuses on improving the data, not the model. Data-centric ML understands that real-world data is often far from perfect; it's usually messy, has missing pieces, and can carry biases. But by prioritizing data quality, you give your models the best possible foundation to train on.

At Metaplane, we recognize the importance of data quality in the success of machine learning. With advanced data observability tools capable of identifying and addressing data quality issues in real time, our solutions enable businesses to refine their datasets, ensuring that their ML models are trained on high-quality, reliable data. Turn your data into a powerful asset for competitive advantage. Sign up for free, or book a demo to learn more.
