Stop Measuring AI by Accuracy. Start Measuring It by What Happens When It's Wrong.

"The model is 95% accurate."

It shows up in slide decks, vendor pitches, and quarterly reviews. It's always delivered with confidence. It's always received with approval. And it almost always obscures the question that actually matters.

Ninety-five percent accuracy on a playlist recommendation engine means one in twenty songs isn't quite right. The user skips it. Nobody's harmed. Ninety-five percent accuracy on a loan approval system means one in twenty applicants gets the wrong decision. Some of them are denied credit they should have received. Some of them are approved for loans they can't afford. The consequences land on real people, and "95% accurate" doesn't tell you anything about who bears the cost of the 5%.

Accuracy is a model metric. It tells you how often the system is right. It tells you nothing about what happens when it's wrong.

The moment the number stopped being reassuring

Consider a team that deploys a model informing eligibility decisions for a high-volume process. Thousands of decisions per day. The model is 96% accurate — above the threshold the organization set, above what the previous rules-based system achieved. By every metric the team reports, the system is performing well.

Then someone does the math differently.

Four percent error rate. Thousands of decisions per day. That means dozens of wrong decisions every day. Hundreds every week. Over a quarter, thousands of people receive an outcome they shouldn't have. Some are denied something they're entitled to. Some receive something they shouldn't have.
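The arithmetic is easy to run for your own volumes. A minimal sketch, assuming a hypothetical 2,000 decisions per day (the scenario only says "thousands"), a seven-day week, and a 91-day quarter:

```python
def expected_wrong_decisions(decisions_per_day: int, error_rate: float, days: int) -> float:
    """Expected number of wrong decisions over a period of `days`."""
    return decisions_per_day * error_rate * days


# Hypothetical volume; the scenario only says "thousands of decisions per day".
DECISIONS_PER_DAY = 2_000
ERROR_RATE = 0.04  # the complement of 96% accuracy

per_day = expected_wrong_decisions(DECISIONS_PER_DAY, ERROR_RATE, days=1)
per_week = expected_wrong_decisions(DECISIONS_PER_DAY, ERROR_RATE, days=7)
per_quarter = expected_wrong_decisions(DECISIONS_PER_DAY, ERROR_RATE, days=91)

print(f"per day: {per_day:.0f}, per week: {per_week:.0f}, per quarter: {per_quarter:.0f}")
```

At that assumed volume, the 4% works out to 80 wrong decisions a day, 560 a week, and 7,280 a quarter.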

The model is 96% accurate. It is also producing wrong outcomes at industrial scale. Both of those statements are true simultaneously. The first one makes it into the executive dashboard. The second one doesn't — until a patterns-of-harm analysis forces the question.

This is exactly the kind of question bank examiners ask when they evaluate model risk. The FDIC's guidance on third-party risk and model governance points the same way: the interagency model risk guidance it adopted defines model risk as the potential for adverse consequences from decisions based on incorrect or misused model outputs. The practical question is who bears the cost when the model is wrong. Not "how often is it wrong" but who absorbs the consequence. If the answer is "the customer, and they have no recourse," the regulatory exposure isn't in the accuracy number. It's in the harm the accuracy number hides.

Nobody in this scenario asked what "wrong" meant for the people on the receiving end of that 4%. Nobody categorized the errors by severity. Nobody assessed whether the errors fell disproportionately on specific populations. The accuracy metric answered the question the data science team was asking — is the model performing well? It didn't answer the question the organization should have been asking — what is the impact when it doesn't?

Accuracy is the wrong denominator

The problem with accuracy as a governance metric is that it treats all errors as equal. A false positive and a false negative count the same in an accuracy calculation. A wrong answer on a low-stakes decision and a wrong answer on a life-altering decision count the same. The metric doesn't distinguish between errors that cause inconvenience and errors that cause harm.

This isn't a flaw in how accuracy is calculated. It's a flaw in how accuracy is used. Accuracy is a useful engineering metric for model development. It tells the data science team whether the model is learning, whether changes improve performance, whether the model generalizes to held-out data. It's the right tool for building models.

It's the wrong tool for governing them.

Governance requires understanding the consequences of failure, not just the frequency of failure. A 99%-accurate model that screens medical images, and whose 1% of errors includes missed malignancies, is a fundamentally different risk from a 90%-accurate model that recommends articles. The accuracy numbers suggest the first model is better. The consequences say the second one is safer.
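Accuracy and the share of malignancies a model misses are separate quantities, and when the condition is rare they can diverge sharply. A minimal sketch with made-up counts (10,000 images, 100 of them malignant; these are illustrative assumptions, not data from any real screening system):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    true_pos: int   # malignant and flagged
    false_neg: int  # malignant and missed
    false_pos: int  # benign but flagged
    true_neg: int   # benign and cleared

    def accuracy(self) -> float:
        total = self.true_pos + self.false_neg + self.false_pos + self.true_neg
        return (self.true_pos + self.true_neg) / total

    def miss_rate(self) -> float:
        """Share of actual malignancies the model failed to flag."""
        return self.false_neg / (self.true_pos + self.false_neg)


# Two hypothetical models scored on the same 10,000 images, 100 of them malignant.
model_a = ConfusionCounts(true_pos=50, false_neg=50, false_pos=50, true_neg=9_850)
model_b = ConfusionCounts(true_pos=99, false_neg=1, false_pos=99, true_neg=9_801)

for name, m in (("A", model_a), ("B", model_b)):
    print(f"model {name}: accuracy={m.accuracy():.2%}, missed malignancies={m.miss_rate():.0%}")
```

Both hypothetical models are 99% accurate. One misses half of the malignancies and the other misses one in a hundred, a difference the accuracy figure cannot show.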

What to measure instead

The governance question isn't "how often is this model right?" It's a set of harder questions that accuracy doesn't answer.

What happens when it's wrong? Map the downstream consequences of each error type. A false positive on a fraud detection system means a legitimate customer gets blocked. A false negative means a fraudulent transaction goes through. These are different consequences with different severities and different remediation paths. Governance needs to know both the frequency and the impact of each.
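One way to keep frequency and impact side by side is to record them per error type. The sketch below uses invented weekly figures for a fraud model. The counts, the per-error costs, and the choice of a dollar amount as a stand-in for severity are all assumptions for illustration, and many harms will not reduce to a single number this cleanly.

```python
# Hypothetical weekly figures; every number here is an illustrative assumption.
ERROR_PROFILE = {
    "false_positive": {
        "consequence": "legitimate customer blocked",
        "count": 400,
        "cost_per_error": 15.0,
    },
    "false_negative": {
        "consequence": "fraudulent transaction goes through",
        "count": 40,
        "cost_per_error": 900.0,
    },
}


def impact_by_error_type(profile: dict) -> dict:
    """Total estimated cost for each error type: frequency times per-error cost."""
    return {etype: row["count"] * row["cost_per_error"] for etype, row in profile.items()}


impact = impact_by_error_type(ERROR_PROFILE)
total_errors = sum(row["count"] for row in ERROR_PROFILE.values())
total_cost = sum(impact.values())

for etype, row in ERROR_PROFILE.items():
    share_of_errors = row["count"] / total_errors
    share_of_cost = impact[etype] / total_cost
    print(f"{etype} ({row['consequence']}): "
          f"{share_of_errors:.0%} of errors, {share_of_cost:.0%} of cost")
```

With these made-up inputs, false negatives are about 9% of the errors and about 86% of the cost, which is the kind of asymmetry a single accuracy figure flattens.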

Who bears the cost? Errors are rarely distributed evenly. A model trained on data that underrepresents a population will perform worse for that population. The aggregate accuracy number can look strong while specific groups experience significantly higher error rates. If you're only reporting the aggregate, you're not governing — you're averaging away the harm.
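Breaking the error rate out by group is a small change to the reporting code. A minimal sketch on synthetic data, where the group labels "A" and "B" and all counts are invented for illustration:

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Error rate per group from an iterable of (group, was_correct) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, was_correct in records:
        totals[group] += 1
        if not was_correct:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Synthetic decisions: group A is 90% of the volume, group B is 10%.
records = (
    [("A", True)] * 882 + [("A", False)] * 18
    + [("B", True)] * 78 + [("B", False)] * 22
)

overall_error_rate = sum(1 for _, ok in records if not ok) / len(records)
by_group = error_rates_by_group(records)

print(f"overall error rate: {overall_error_rate:.0%}")
for group, rate in sorted(by_group.items()):
    print(f"group {group}: {rate:.0%}")
```

In this synthetic example the aggregate is 96% accurate, while group B sees a 22% error rate against 2% for group A.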

What's the blast radius? A model that advises a human decision-maker has a smaller blast radius than a model that makes autonomous decisions. A model that affects one customer at a time has a smaller blast radius than a model that affects a cohort simultaneously. The same error rate produces different levels of organizational risk depending on how many people are affected and how quickly.

Is there a fallback? When the model is wrong, what happens next? Is there a human review? An appeal process? An automatic correction mechanism? Or does the wrong decision simply stand? The existence and quality of the fallback is a governance factor that accuracy doesn't capture.

The dashboard nobody builds

Every AI program has a model performance dashboard. Accuracy, precision, recall, F1 — the standard metrics, tracked over time, reported to leadership. These dashboards are necessary. They're not sufficient.

The dashboard nobody builds is the impact dashboard. How many wrong decisions were made this week? What was the severity distribution? Which populations were most affected? What was the remediation cost? How many of the errors were caught by downstream processes and how many reached the end user?

This dashboard is harder to build because it requires connecting model outputs to real-world outcomes. It requires knowing not just that the model predicted X, but that the prediction resulted in action Y, and that action Y affected person Z in a measurable way. Most organizations don't have this feedback loop instrumented. They know what the model predicted. They don't systematically track what happened as a result.
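What the instrumentation might look like depends heavily on the systems involved, but the core is a record that ties each prediction to the action taken and the eventual outcome. A minimal sketch, in which the `DecisionRecord` fields, the severity labels, and `summarize_impact` are all hypothetical names chosen for illustration:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    prediction: str                 # what the model predicted (X)
    action_taken: str               # what was done as a result (Y)
    was_correct: Optional[bool]     # None until the real-world outcome is known
    severity: Optional[str] = None  # e.g. "low" or "high"; set only for errors
    reached_end_user: bool = True   # False if a downstream check caught the error


def summarize_impact(records):
    """Summary for one reporting period, built from outcomes rather than predictions."""
    errors = [r for r in records if r.was_correct is False]
    return {
        "decisions": len(records),
        "outcome_unknown": sum(1 for r in records if r.was_correct is None),
        "wrong_decisions": len(errors),
        "severity_distribution": dict(Counter(r.severity for r in errors)),
        "caught_downstream": sum(1 for r in errors if not r.reached_end_user),
        "reached_end_user": sum(1 for r in errors if r.reached_end_user),
    }


# A handful of made-up records to show the shape of the output.
week = [
    DecisionRecord("d1", "approve", "approved", was_correct=True),
    DecisionRecord("d2", "deny", "denied", was_correct=False, severity="high"),
    DecisionRecord("d3", "deny", "held for review", was_correct=False,
                   severity="low", reached_end_user=False),
    DecisionRecord("d4", "approve", "approved", was_correct=None),
]

summary = summarize_impact(week)
print(summary)
```

The `outcome_unknown` count is deliberate: decisions whose real-world result has not been observed yet are neither right nor wrong, and reporting them separately shows how much of the feedback loop is still missing.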

Building this feedback loop is the difference between reporting accuracy and governing impact. It's the difference between telling leadership "the model is 95% accurate" and telling them "the model produced 47 incorrect denials this week, disproportionately affecting applicants in three ZIP codes, and 12 of those denials have been escalated."

The first statement is a metric. The second is governance.

The shift

The organizations that govern AI well don't stop measuring accuracy. They stop treating it as the primary indicator of whether a system is working. They add the questions that accuracy can't answer: what breaks when this is wrong, who gets hurt, and how would we know?

These aren't technical questions. They're organizational ones. They require data scientists, business owners, risk teams, and compliance working together to define what failure means in human terms, not just statistical ones.

Ninety-five percent accuracy is a number. What happens in the other 5% is a governance program.