The Data Problem Nobody Wants to Talk About Before the AI Demo
The demo looked great. The model surfaced exactly the right insights, in natural language, from a dataset that had been cleaned, joined, deduplicated, and formatted by a data scientist who spent three weeks on it. Leadership was impressed. The PM got approval to move to production by Q3.
Nobody in the room asked where the data comes from at 2am on a Tuesday.
That question — the boring one, the one that kills the energy in the room — is where most AI programs actually live or die. Not model selection. Not accuracy benchmarks. Not which vendor has the best API. The data infrastructure underneath the model is either ready for what AI demands, or it isn't. In most organizations, it isn't. And the people who know that are the last ones asked.
The demo dataset is a lie
"Lie" is strong. Call it a carefully constructed fiction. The dataset used for the demo was curated. Someone chose which records to include. Someone resolved the naming conflicts between the CRM and the ERP. Someone decided how to handle the 14% of records with missing fields. Someone standardized date formats across three source systems.
None of that happens automatically in production.
In production, the pipeline ingests whatever the source systems produce. That means schema changes nobody communicated. Null values in fields that were always populated during the demo period. Duplicate records that the dedup logic doesn't catch because the matching rules were tuned on the curated set. Timestamps in four different time zones because three teams made three different decisions in 2019 and nobody documented any of them.
The model doesn't know the difference. It processes what it receives. When it received clean data, it produced clean outputs. When it receives what the warehouse actually produces, it will produce something else. The model didn't get worse. The data got real.
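To make those failure modes concrete, here is a minimal sketch of the kind of check a pipeline could run on every incoming batch before the model sees it. It assumes records arrive as Python dicts with ISO 8601 timestamp strings. The field names, the 5% null tolerance, and the check_batch function are all hypothetical, chosen for illustration rather than taken from any real system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical contract for one feed; field names are illustrative only.
EXPECTED_FIELDS = {"customer_id", "email", "created_at"}
MAX_NULL_RATE = 0.05  # assumed tolerance, not a recommendation


def check_batch(records):
    """Return a list of human-readable problems found in a batch of dict records."""
    if not records:
        return ["empty batch"]
    problems = []

    # Schema drift: fields that appeared or disappeared without notice.
    for i, rec in enumerate(records):
        keys = set(rec)
        if keys != EXPECTED_FIELDS:
            missing = sorted(EXPECTED_FIELDS - keys)
            extra = sorted(keys - EXPECTED_FIELDS)
            problems.append(f"record {i}: schema drift, missing={missing}, extra={extra}")

    # Null rate per expected field (a missing key counts as a null).
    for field in sorted(EXPECTED_FIELDS):
        nulls = sum(1 for rec in records if rec.get(field) in (None, ""))
        rate = nulls / len(records)
        if rate > MAX_NULL_RATE:
            problems.append(f"{field}: null rate {rate:.0%} exceeds {MAX_NULL_RATE:.0%}")

    # Duplicates on the business key that dedup logic may have missed.
    counts = Counter(
        rec["customer_id"] for rec in records if rec.get("customer_id") is not None
    )
    dupes = [key for key, n in counts.items() if n > 1]
    if dupes:
        problems.append(f"duplicate customer_id values: {dupes}")

    # Timestamps that cannot be parsed, or that carry no UTC offset.
    for i, rec in enumerate(records):
        raw = rec.get("created_at")
        if not raw:
            continue
        try:
            ts = datetime.fromisoformat(raw)
        except (ValueError, TypeError):
            problems.append(f"record {i}: unparseable created_at {raw!r}")
            continue
        if ts.tzinfo is None:
            problems.append(f"record {i}: created_at has no time zone")

    return problems
```

Run against the curated demo set, a check like this returns an empty list. Run against the 2am Tuesday feed, it is likely to return something longer, and that list is the conversation the demo skipped.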
The timeline problem
Here's the pattern I've watched play out more than once. The AI team does a proof of concept. It works. Leadership sets a launch date — usually tied to a board meeting, a quarterly review, or a competitor announcement. The PM builds backward from that date.
Then the data engineering team gets the requirements: a reliable, documented, governed data pipeline that delivers consistent, quality-checked inputs on a schedule. They look at the current state of the warehouse and estimate six months to build it right. Maybe nine.
The launch date is in four.
This is the moment where the project's fate is sealed, even though it won't be visible for months. The PM can't push the date — it came from above. The data team can't build faster — the work is what it is. So someone proposes a compromise: "We'll build the pipeline incrementally. Ship with what we have. Improve it in production."
That compromise means the model launches on data that isn't ready. Not catastrophically wrong — just inconsistent enough that the outputs drift. Accurate on Mondays, off on Thursdays. Reliable for the East Coast division, unreliable for the West. Good enough that nobody raises the alarm in week one. Bad enough that by month three, the business unit stops trusting the outputs and builds a spreadsheet workaround.
The AI project is now officially a success that nobody uses.
What the data team knows
Data engineers know what the infrastructure can actually support. They see the distance between what leadership assumes and what exists. And in most organizations, nobody asks them until the architectural drawings are done.
I've sat in rooms where a data engineering lead tried to explain that the warehouse wasn't ready — that the source systems had undocumented dependencies, that the data quality checks didn't exist yet, that the lineage tracking was manual and incomplete. The response was polite acknowledgment followed by zero changes to the timeline.
This isn't malice. It's incentive structure. The PM is measured on delivery dates. The executive sponsor is measured on innovation metrics. The data engineer is measured on uptime and pipeline reliability. Nobody's bonus depends on "told the truth about readiness in a planning meeting."
So the data team does what data teams always do: they work nights and weekends to duct-tape something that sort of works, they document the known issues in a Confluence page nobody reads, and they wait for the call when something breaks in production.
The governance vacuum underneath
The data problem isn't just an engineering problem. It's a governance problem. Who owns the definition of a "customer" when the CRM, the billing system, and the support platform each define it differently? Who decides which source is authoritative when two systems disagree? Who monitors data quality in the pipeline after launch — not the model's accuracy, but the inputs to the model?
In most organizations I've assessed, the answer is: nobody, explicitly. Data governance exists as a concept, maybe even as a team. But the connection between data governance and AI governance is either informal or absent. The AI team assumes the data is governed. The data governance team doesn't know what the AI pipeline consumes. The gap between those two assumptions is where production failures live.
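One way to make "nobody, explicitly" into "somebody, explicitly" is to write the answers down where the pipeline can read them. The sketch below is an assumption about how that could look, not a description of any particular governance tool: the FieldPolicy class, the resolve function, the field names, and the steward roles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldPolicy:
    owner: str         # named person or role accountable for the definition
    precedence: tuple  # source systems, most authoritative first


# Hypothetical policies; systems and roles are illustrative only.
POLICIES = {
    "customer.email": FieldPolicy(
        owner="crm-data-steward", precedence=("crm", "billing", "support")
    ),
    "customer.status": FieldPolicy(
        owner="billing-data-steward", precedence=("billing", "crm")
    ),
}


def resolve(field, values_by_source):
    """Pick the value from the most authoritative source that supplied one.

    Returns (value, source, owner) so the decision can be traced later.
    """
    policy = POLICIES.get(field)
    if policy is None:
        # No owner on record: fail loudly instead of guessing.
        raise LookupError(f"no owner or precedence defined for {field!r}")
    for source in policy.precedence:
        value = values_by_source.get(source)
        if value is not None:
            return value, source, policy.owner
    raise ValueError(f"no authoritative source supplied a value for {field!r}")
```

The design choice that matters here is the LookupError. A field the AI pipeline consumes but nobody has claimed stops the run, which forces the two teams to have the conversation before production does it for them.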
The unsexy fix
The fix isn't a better model. It isn't a bigger budget. It's the willingness to answer three questions honestly before the demo becomes a production commitment:
What does the data actually look like in production? Not the curated set. The real pipeline output, with all its inconsistencies, gaps, and undocumented transformations. Run the model on that. See what happens.
Who owns data quality for the AI pipeline? Not "the data team" but a specific person, with authority to halt a deployment if the inputs aren't ready. If nobody has that authority, data quality is optional, and optional quality is no quality.
What's the gap between the demo timeline and the data timeline? If the data won't be ready when the model is, that's the constraint that matters. Everything else is negotiable. That isn't.
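The three questions can be written down as a gate that runs before anyone commits to a date. This is a sketch under stated assumptions, not a process recommendation: the Readiness fields and the launch_blockers function are hypothetical names, and any dates fed into it are whatever the two teams honestly estimate.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Readiness:
    production_run_done: bool     # question 1: model evaluated on real pipeline output
    quality_owner: Optional[str]  # question 2: named person who can halt a deployment
    data_ready: date              # question 3: data team's estimate for the pipeline
    launch: date                  # question 3: the date leadership committed to


def launch_blockers(r):
    """Return the unanswered questions as a list; an empty list means go."""
    blockers = []
    if not r.production_run_done:
        blockers.append("model has not been run on real pipeline output")
    if not r.quality_owner:
        blockers.append("no named owner with authority to halt deployment")
    gap_days = (r.data_ready - r.launch).days
    if gap_days > 0:
        blockers.append(f"data pipeline is ready {gap_days} days after launch")
    return blockers
```

Nothing in this is clever. Its only job is to put the data timeline and the launch date in the same record, so the gap between them is a number someone has to look at instead of a feeling the data team carries alone.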
These aren't exciting questions. They don't demo well. Nobody gets a conference talk out of "we delayed our AI launch because the data wasn't ready." But the organizations that ask them ship AI that works in production, not just in the boardroom.
The ones that don't ask build something impressive that slowly becomes something nobody trusts.