I've Watched Three AI Programs Fail the Same Way


After years close to enterprise AI programs — shaping them, assessing them, watching them succeed and fail — I've stopped being surprised by the failures. Not because I've become cynical. Because the failures repeat. Different organizations, different industries, different models, same three patterns.

Every failed AI program I've been close to exhibited at least one. Most exhibited all three. And in every case, the organization believed the program was working right up until the moment it became undeniable that it wasn't.

Pattern 1: The model optimizes for the wrong thing

A large services organization built a model to predict customer churn. The success metric was prediction accuracy — how well the model identified customers likely to leave. The model scored well. Leadership was pleased. The retention team started acting on the predictions.

Six months in, the retention team noticed something strange. They were reaching out to the customers the model flagged, offering incentives to stay. Some stayed. The churn number improved. But revenue didn't. The customers the model identified as high churn risk were, disproportionately, low-value accounts. The model was technically correct — these customers were likely to leave. But saving them didn't move the number that mattered.

The problem wasn't accuracy — it was the objective. The model optimized for predicting churn. The business needed to optimize for retaining valuable customers. Those are different problems, and nobody caught the gap because the success metric — prediction accuracy — was green.
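
To make the gap concrete, here is a minimal sketch, assuming each account carries a predicted churn probability and an annual revenue figure. The account names, the numbers, and the `expected_revenue_at_risk` helper are hypothetical illustrations, not data from the program described. It compares a call list ranked by churn probability with one ranked by probability times revenue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    name: str
    churn_probability: float  # model output, 0.0 to 1.0
    annual_revenue: float     # hypothetical account value


def expected_revenue_at_risk(account: Account) -> float:
    """Revenue expected to be lost if nobody intervenes."""
    return account.churn_probability * account.annual_revenue


def top_k(accounts, key, k):
    """Return the names of the k highest-ranked accounts under `key`."""
    return [a.name for a in sorted(accounts, key=key, reverse=True)[:k]]


# Illustrative numbers only.
accounts = [
    Account("small-a", 0.90, 1_000),
    Account("small-b", 0.85, 1_500),
    Account("large-c", 0.40, 200_000),
    Account("large-d", 0.30, 150_000),
]

by_probability = top_k(accounts, lambda a: a.churn_probability, k=2)
by_value_at_risk = top_k(accounts, expected_revenue_at_risk, k=2)

print(by_probability)    # ['small-a', 'small-b']
print(by_value_at_risk)  # ['large-c', 'large-d']
```

Both rankings use the same predictions. Only the objective changes, and with capacity for two calls the two lists do not share a single account.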

What it looked like from the outside: Dashboards showed improving churn prediction and increasing retention interventions. The program appeared to be delivering value.

What was actually happening: The retention team was spending its limited capacity on the wrong customers. High-value accounts that were quietly leaving weren't being flagged because their behavioral patterns didn't match the model's learned definition of "likely to churn." The model was right about what it was asked. It was never asked the right question.

The moment it became undeniable: A quarterly business review showed retention activity up 40% and net revenue retention flat. The CFO asked why. The answer required admitting that the AI program had been optimizing for the wrong metric for two quarters.

The fix wasn't technical. It was a conversation that should have happened before the model was built: what does the organization actually value, and does the model's objective capture that? In my experience, that conversation either happens in the first week or it happens after the first failure. Rarely in between.

Pattern 2: Nobody asked the operators

A financial institution deployed an AI-assisted document review system. The system was designed by the ML engineering team in partnership with the compliance leadership. It was technically sophisticated — good model architecture, solid training data, reasonable accuracy benchmarks. It was also, from the perspective of the analysts who actually reviewed documents every day, nearly unusable.

The interface presented results in a format that didn't match the analysts' workflow. The confidence scores were calibrated for data scientists, not for compliance professionals who needed a clear "review this" or "skip this." The system flagged documents for review but didn't explain why, which meant analysts had to reconstruct the model's reasoning before they could act — adding time to a process that was supposed to save time.
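
One way to close that gap, sketched under assumptions: translate the raw score into the two actions the analysts needed and attach the strongest reasons to every flag. The `triage` function, the 0.3 threshold, and the signal names are hypothetical. A real cutoff would have to be set with the analysts and validated against documents they have already reviewed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    action: str          # "review this" or "skip this"
    reasons: tuple = ()  # strongest signals behind a flag, if any


def triage(risk_score, signals, review_threshold=0.3, top_n=2):
    """Map a model score to an analyst-facing action.

    `signals` maps a human-readable signal name to its contribution to
    the score. How those contributions are computed is outside this sketch.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score < review_threshold:
        return Triage("skip this")
    strongest = sorted(signals.items(), key=lambda kv: kv[1], reverse=True)
    return Triage("review this", tuple(name for name, _ in strongest[:top_n]))


result = triage(
    0.72,
    {
        "unusual counterparty": 0.41,
        "missing signature": 0.25,
        "amount over limit": 0.06,
    },
)
print(result.action)   # review this
print(result.reasons)  # ('unusual counterparty', 'missing signature')
```

The point of the design is that the analyst never sees a bare number: every flag arrives with an action and the reasons for it.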

The analysts weren't consulted during design. They were consulted during user acceptance testing (UAT), by which point the architecture was fixed and their feedback could only influence surface-level adjustments. The engineering team had built what the requirements document described. The requirements document described what compliance leadership thought the analysts needed. Nobody had asked the analysts.

What it looked like from the outside: The system was deployed on schedule. The team reported processing volume metrics that showed the system was handling the expected throughput. Leadership reported to the board that the AI-assisted review was operational.

What was actually happening: Analysts were running the AI system and then re-doing the review manually because they didn't trust the outputs and couldn't interpret the confidence scores. The AI hadn't replaced the manual process. It had been added on top of it. Processing time per document actually increased. Analysts developed informal workarounds — heuristics for when to trust the system and when to ignore it — that were undocumented and inconsistent across the team.

The moment it became undeniable: An internal audit found that analysts were spending more time per review than before the system was deployed. When asked, the analysts said — diplomatically but clearly — that they'd been raising concerns since month one but the feedback hadn't reached anyone who could change the design.

The lesson is not "ask users for feedback." Every organization says it does that. The lesson is that operator input has to happen before the architecture is set, not after. If the people who will use the system every day aren't shaping the design, the system will be shaped by people who don't do the work. And it will show.

Pattern 3: Ship and forget

A technology company deployed a content classification model that performed well at launch. Accuracy was above 90%. False positive rates were acceptable. The model was integrated into a production workflow and the team moved on to the next project.

Twelve months later, accuracy had degraded to 74%. Nobody noticed for four months because nobody was monitoring accuracy in production. The team had set up monitoring for system availability — uptime, latency, throughput — but not for model performance. The model was reliably returning results. The results were just increasingly wrong.
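
Monitoring model performance, as opposed to availability, can start small. The sketch below assumes that a sample of production predictions receives a human-verified label after the fact. The `AccuracyMonitor` class, the window size, and the alert threshold are hypothetical choices for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions."""

    def __init__(self, window=500, alert_below=0.85):
        # True where the prediction matched the verified label.
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy, or None until at least one label has arrived."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Alert only on a full window so a few early misses do not page anyone.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.alert_below


monitor = AccuracyMonitor(window=4, alert_below=0.85)
labeled_sample = [
    ("spam", "spam"),
    ("news", "news"),
    ("news", "ads"),
    ("spam", "spam"),
]
for predicted, actual in labeled_sample:
    monitor.record(predicted, actual)

print(monitor.accuracy())      # 0.75
print(monitor.should_alert())  # True
```

A check like this depends on a steady trickle of verified labels, which is a process commitment more than a technical one.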

The degradation had a clear cause: the content being classified had evolved. New categories of content emerged that didn't exist in the training data. User behavior shifted. The distribution of inputs drifted from the distribution the model was trained on. This is commonly called model drift (or data drift, when the cause is a shift in the inputs), and it's not a bug. It's a certainty. Every model deployed in a changing environment will drift. The only question is how fast, and whether anyone is watching.
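
Drift in the inputs can be measured without any labels. One common measure is the population stability index (PSI), which compares the category shares seen at training time with those seen in production. The sketch below is a minimal version. The category names and counts are invented, and the 0.2 cutoff is a widely used rule of thumb rather than anything taken from the case described.

```python
import math


def population_stability_index(expected_counts, actual_counts, epsilon=1e-6):
    """PSI between two categorical distributions given as {category: count}."""
    expected_total = sum(expected_counts.values())
    actual_total = sum(actual_counts.values())
    if expected_total == 0 or actual_total == 0:
        raise ValueError("both distributions need at least one observation")

    psi = 0.0
    for category in set(expected_counts) | set(actual_counts):
        # Floor each share at epsilon so a category missing on one side
        # (for example, a brand-new content type) does not produce log(0).
        e = max(expected_counts.get(category, 0) / expected_total, epsilon)
        a = max(actual_counts.get(category, 0) / actual_total, epsilon)
        psi += (a - e) * math.log(a / e)
    return psi


# Illustrative counts only.
training = {"news": 500, "reviews": 300, "ads": 200}
production = {"news": 350, "reviews": 250, "ads": 150, "short-video": 250}

psi = population_stability_index(training, production)
print(psi > 0.2)  # True
```

A category that did not exist at training time dominates the score here, which matches the failure described above: the model was never shown the content it was later asked to classify.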

What it looked like from the outside: The system was operational. Uptime was 99.7%. Throughput metrics were stable. The system appeared in the company's quarterly report as an example of successful AI deployment.

What was actually happening: The model was confidently misclassifying an increasing percentage of content. Downstream processes that depended on accurate classification were making decisions based on bad inputs. Users who noticed inconsistencies attributed them to edge cases rather than systemic drift. Nobody had the data to see the trend because nobody was tracking the trend.

The moment it became undeniable: A customer-facing incident traced back to a misclassification. The post-mortem revealed that accuracy had been declining for months. The retraining that could have been a routine maintenance task became an emergency remediation project — with executive attention, a timeline, and a "lessons learned" document.

The same problem wearing three masks

These aren't three separate failure modes. They're three symptoms of the same underlying condition: the organization treated AI as a technology project instead of a governance challenge.

Technology projects have a finish line. You build it, you ship it, you move on. Governance challenges don't end at deployment. They require ongoing attention: Are we measuring the right things? Are the right people involved? Is the system still performing as intended?

The churn model failed because nobody governed the alignment between the model's objective and the business objective. The document review system failed because nobody governed the design process to include the people who'd operate it. The classification model failed because nobody governed the system after it shipped.

In all three cases, the teams were competent. The technology worked. The organizations had smart people and reasonable budgets. What they didn't have was a structure that asked the hard questions continuously — not just at launch, but at every stage of the system's life.

This is what I mean when I say AI governance isn't a compliance exercise. It's an operating discipline. Compliance asks "did we check the box?" Governance asks "is this still working the way we intended, and how would we know if it wasn't?"

The organizations that build that discipline into their AI programs — not as an afterthought, not as a separate workstream, but as part of how the program operates — are the ones whose AI investments survive contact with production. The ones that don't build it will keep failing in the same three ways, convinced each time that the problem was technical.

It wasn't.