Data Quality Is Still the Boring Problem That Kills Exciting AI Projects

Everyone wants to talk about models. Nobody wants to talk about the data feeding them. Twenty years in, and this hasn't changed.

All views expressed here are my own and do not represent the views of my employer.

I've killed more AI projects over data quality than over any other single factor. Not model performance. Not talent shortages. Not budget constraints. Data.

And I don't mean "our data is messy" in the generic sense that every CIO acknowledges over drinks at a conference. I mean specific, lethal problems: training sets that silently drifted from the distribution they were validated on. Feature stores that nobody maintained after the team that built them moved on. Ground truth labels that were wrong 8% of the time, rare enough to slip past a spot check but frequent enough to make the model unreliable.

I started my career in data operations, long before anyone was calling it "data engineering." I've spent twenty years watching organizations invest millions in models and pennies in the data those models consume. Every AI cycle — from business intelligence to predictive analytics to machine learning to GenAI to agentic systems — has repeated the same mistake. And every cycle, we act surprised when the models don't work.

Why GenAI made this worse, not better

There's a popular belief that large language models are more forgiving of data quality issues because they're pretrained on massive, diverse corpora. The model has "seen everything," the argument goes, so your specific data doesn't need to be perfect.

This is dangerously wrong, for two reasons.

First, the moment you fine-tune, build a RAG pipeline, or give an agent access to your enterprise data, the quality of *that* data matters enormously. A RAG pipeline that retrieves outdated policy documents will produce confidently wrong answers. An agent that queries a customer database with duplicate records can act on the wrong customer. The LLM's general intelligence doesn't save you from bad enterprise data. It actually makes the problem harder to detect, because the outputs sound plausible even when they're based on garbage.

Second, GenAI is creating *new* data quality problems. Synthetic data poisoning training sets. Model outputs being fed back into data pipelines as if they were ground truth. Embeddings that silently degrade as the underlying documents change. These are failure modes that didn't exist two years ago, and most data governance frameworks haven't caught up.

I worked with a financial services firm that built a GenAI-powered research assistant for their analysts. The RAG pipeline pulled from an internal document store that included three years of market research reports. Nobody noticed that the store also contained 400+ draft documents that were never finalized: early versions with preliminary numbers, speculative conclusions, and in some cases analysis that was explicitly contradicted by the final published versions. The assistant surfaced these drafts with the same confidence as the final reports. An analyst used one of the draft conclusions in a client presentation.

That's a failure mode that wouldn't have happened with traditional ML, because traditional ML doesn't generate plausible-sounding narrative around bad data. It just produces a wrong number that someone might catch. GenAI produces a wrong story that sounds right.

The four data problems that kill the most projects

Across the programs I've led and advised on, four data quality issues come up over and over. None of them are technically hard to fix. All of them are organizationally hard to fix, which is why they persist.

1. Nobody owns the data end-to-end

The data engineering team produces the pipeline. The data science team consumes it. Neither team feels responsible for the quality of what flows through it. When the model underperforms, the data scientists blame the data. The data engineers say they delivered what was specified. Everyone is technically correct and practically useless.

The organizations that solve this create what I call data product teams — small, cross-functional groups that own a dataset end-to-end, from production through consumption. They're accountable for both the pipeline and the outcomes it feeds. The CDO's job isn't to own all the data — it's to establish the standards and incentives that make these teams work. The best CDOs I've worked with spend less time on data catalogs and more time making sure the people who produce data understand how it's being consumed, and the people who consume data have a direct line to the people who produce it.

2. Data quality is measured at ingest, not at consumption

Most organizations validate data when it enters the warehouse or lake. That's necessary but not sufficient. Data can be perfect at ingest and broken by the time a model consumes it — because a join introduced duplicates, a transformation dropped edge cases, or a downstream system changed its schema without telling anyone.

Quality has to be measured at the point of consumption, not just at the point of entry. This is an obvious statement that almost nobody acts on.
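To make that concrete, here is a minimal sketch of a consumption-time check, written in Python with only the standard library. It looks at the rows exactly as they would be handed to a model and reports the three failures mentioned above: duplicated keys from a join, fields that disappeared after a schema change, and nulls left behind by a transformation. The function name `check_at_consumption`, the field names, and the sample rows are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def check_at_consumption(rows, key, required_fields):
    """Return a list of human-readable issues found in `rows`.

    `rows` is a list of dicts as handed to the model, i.e. after all
    joins and transformations, not as the data looked at ingest.
    """
    issues = []

    # Schema check: an upstream change may have renamed or dropped a field.
    for field in required_fields:
        missing = sum(1 for row in rows if field not in row)
        if missing:
            issues.append(f"{field}: missing from {missing} row(s)")

    # Duplicate check: a join can fan out and repeat the same key.
    counts = Counter(row.get(key) for row in rows)
    dupes = [k for k, n in counts.items() if n > 1 and k is not None]
    if dupes:
        issues.append(f"{key}: duplicate values {dupes}")

    # Null check: a transformation may have blanked out edge cases.
    for field in required_fields:
        nulls = sum(1 for row in rows if field in row and row[field] is None)
        if nulls:
            issues.append(f"{field}: null in {nulls} row(s)")

    return issues


# Illustrative rows only; not real data.
rows = [
    {"customer_id": 1, "balance": 120.0},
    {"customer_id": 2, "balance": None},
    {"customer_id": 2, "balance": 75.5},  # duplicate introduced by a join
    {"customer_id": 3},                   # field dropped upstream
]
issues = check_at_consumption(
    rows, key="customer_id", required_fields=["customer_id", "balance"]
)
for issue in issues:
    print(issue)
```

The design choice that matters is where the check runs, not what it checks: the same three tests at ingest would have passed, because the damage happens in between.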

3. Data drift is invisible until the model breaks

The distribution of your production data changes constantly. Customer behavior shifts. Market conditions evolve. Regulatory requirements update. If your model was trained on data from six months ago and nobody is monitoring whether today's data still looks like that training set, you're flying blind.

I've seen models that had been in production for eighteen months while performing well below acceptable thresholds. Nobody knew, because accuracy was measured once, at training time, and never checked again.

The minimum viable approach is simpler than people think. Pick your five most critical features. Compute basic distribution statistics (mean, standard deviation, min, max, null rate) on your training data. Then run the same statistics on your production data weekly. If any statistic moves more than two training standard deviations away from its training baseline, flag it for review. Null rate is a proportion, so give it its own tolerance. That's it. You can build this in a few days with any standard data pipeline tool. It won't catch everything, but it will catch the catastrophic drifts that silently destroy model performance. Ninety percent of the organizations I work with don't even do this much.
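As a rough sketch of that weekly check for a single feature, here is one way it could look in Python using only the standard library. Two things here are my assumptions rather than part of the rule itself: "two standard deviations" is read as a shift measured in units of the training standard deviation, and null rate gets a separate absolute tolerance of five percentage points. The function names `profile` and `drift_flags` and the sample values are illustrative.

```python
import statistics


def profile(values):
    """Basic distribution statistics for one feature; None marks a null.

    Assumes at least one non-null value is present.
    """
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
        "min": min(present),
        "max": max(present),
        "null_rate": 1 - len(present) / len(values),
    }


def drift_flags(baseline, current, z_threshold=2.0, null_rate_tolerance=0.05):
    """Return the names of statistics that moved too far from the baseline."""
    flags = []
    limit = z_threshold * baseline["std"]

    # Mean, std, min and max are compared in units of the training std.
    for stat in ("mean", "std", "min", "max"):
        if abs(current[stat] - baseline[stat]) > limit:
            flags.append(stat)

    # Null rate is a proportion, so it gets its own absolute tolerance
    # (an assumption of this sketch).
    if abs(current["null_rate"] - baseline["null_rate"]) > null_rate_tolerance:
        flags.append("null_rate")

    return flags


# Illustrative values only; not real data.
training = [100, 102, 98, 101, 99, 100, None, 103, 97, 100]
production = [110, 112, None, 109, None, 111, 113, 108, None, 110]

baseline = profile(training)
flags = drift_flags(baseline, profile(production))
print(flags)
```

Run this per feature, store the baseline once at training time, and send any non-empty list of flags to a human for review. Thresholds like these are starting points to tune, not universal constants.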

4. Metadata is an afterthought

Ask a data team what a field in their dataset means, and half the time you'll get a different answer from each person. Ask when the data was last updated, what its source system is, what business rules were applied during transformation — and you'll get shrugs.

This is the metadata problem, and it's the unglamorous foundation that everything else sits on. Without it, you can't debug data issues. You can't trace model errors back to data causes. You can't govern anything effectively. And yet it's consistently the last thing anyone invests in, because it doesn't demo well.
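For a sense of how little it takes to start, here is a hedged sketch of a per-field metadata record in Python, covering the questions above: what the field means, where it comes from, when it was last updated, and what business rules were applied. The class `FieldMetadata`, the helper `metadata_gaps`, and the example field are hypothetical. One assumption to note: `None` means "nobody has documented this", while an empty list of rules means "documented, and none were applied".

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class FieldMetadata:
    """The minimum someone should be able to answer about a field."""
    name: str
    definition: Optional[str] = None
    source_system: Optional[str] = None
    last_updated: Optional[date] = None
    business_rules: Optional[List[str]] = None  # None = undocumented


def metadata_gaps(meta):
    """Return the names of attributes that are still undocumented."""
    gaps = []
    for f in fields(meta):
        value = getattr(meta, f.name)
        if value is None or value == "":
            gaps.append(f.name)
    return gaps


# Illustrative record only; the field and source names are made up.
churn = FieldMetadata(
    name="churn_flag",
    definition="Customer cancelled within 90 days of renewal date",
    source_system="crm_export",
)
gaps = metadata_gaps(churn)
print(gaps)
```

A report of gaps like this, run across the fields your most important models consume, turns "metadata is an afterthought" into a list someone can be assigned to close.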

What actually works

After twenty years of fighting this fight, I've come to believe that data quality is fundamentally an organizational problem, not a technical one. The tools exist. The techniques are well understood. What's missing is the institutional will to treat data as a first-class asset rather than a byproduct of operations.

The organizations that get this right do three things differently:

They make data quality a production SLA, not a best practice. If your model has an uptime SLA, so should the data feeding it. Same rigor. Same consequences for violations. Same executive visibility when something breaks.

They invest in data operations as a permanent function, not a project. Data quality isn't something you fix once. It degrades continuously. The organizations that maintain it treat DataOps the same way they treat DevOps — as an ongoing discipline with dedicated staff, tooling, and accountability.

They connect data quality to business outcomes, not technical metrics. Nobody at the board level cares about your data completeness score. They care that the fraud detection model missed $12 million in losses last quarter because the transaction data had a 72-hour lag nobody knew about. Tie data quality to business impact and you'll get the investment you need.
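A data SLA can start as something very small. The sketch below, in Python with only the standard library, checks one dimension: freshness. It compares the timestamp of the newest record against an agreed maximum lag and reports a breach, which is the kind of check that would have surfaced a 72-hour lag on day one. The function name `freshness_breach`, the four-hour SLA, and the timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def freshness_breach(latest_record_time, now, max_lag):
    """Return the observed lag if it exceeds the SLA, otherwise None."""
    lag = now - latest_record_time
    return lag if lag > max_lag else None


# Illustrative timestamps only; not real data.
sla = timedelta(hours=4)  # hypothetical freshness SLA
now = datetime(2024, 1, 4, 0, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)

breach = freshness_breach(latest, now, sla)
if breach is not None:
    hours = breach.total_seconds() / 3600
    print(f"Freshness SLA breached: newest record is {hours:.0f} hours old")
```

What makes it an SLA rather than a script is what happens next: the breach pages someone, gets counted, and shows up in the same report as model uptime.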

If I had one piece of advice for a CDO with a fresh data quality budget, it would be this: don't start with tooling. Start with the five models or AI systems that matter most to your business. Trace their data lineage from source to consumption. Find the three places where quality degrades most. Fix those specific problems. Then build the monitoring to make sure they don't come back. You'll learn more about your actual data quality challenges in that exercise than in any enterprise-wide assessment, and you'll have measurable results to show for it within 90 days. The tooling can come later, once you know what you're actually solving for.

Prakul Sharma is the AI & Insights Practice Leader at a Big 4 consulting firm with 20+ years of experience in AI, data engineering, and data operations. He started his career in the data trenches and has never left. He writes weekly at prakulsharma.ai.