
Stop Measuring AI by Accuracy. Start Measuring It by What Happens When It's Wrong.

Ninety-five percent accuracy on a loan approval system means one in twenty applicants gets the wrong decision. Accuracy tells you nothing about who bears the cost of the 5%.

"The model is 95% accurate."

I've heard this sentence in more executive briefings than I can count. It's always delivered with confidence. It's always received with approval. And it almost always obscures the question that actually matters.

Ninety-five percent accuracy on a playlist recommendation engine means one in twenty songs isn't quite right. The user skips it. Nobody's harmed. Ninety-five percent accuracy on a loan approval system means one in twenty applicants gets the wrong decision. Some of them are denied credit they should have received. Some of them are approved for loans they can't afford. The consequences land on real people, and "95% accurate" doesn't tell you anything about who bears the cost of the 5%.

Accuracy is a model metric. It tells you how often the system is right. It tells you nothing about what happens when it's wrong.

The moment the number stopped being reassuring

I watched a team deploy a model that informed eligibility decisions for a high-volume process. Thousands of decisions per day. The model was 96% accurate — above the threshold the organization had set, above what the previous rules-based system achieved. By every metric the team reported, the system was performing well.

Then someone did the math differently.

Four percent error rate. Thousands of decisions per day. That meant dozens of wrong decisions every day. Hundreds every week. Over a quarter, thousands of people received an outcome they shouldn't have. Some were denied something they were entitled to. Some received something they shouldn't have.
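The arithmetic is worth writing down, because it is the step most dashboards skip. Here is a minimal sketch. The daily volume of 1,000 is my own conservative stand-in for "thousands", and the seven-day week and 13-week quarter are likewise assumptions, not figures from the deployment.

```python
# Illustrative assumptions, not figures from the deployment described above:
# the text gives only "thousands of decisions per day" and a 4% error rate.
DECISIONS_PER_DAY = 1_000   # assumed, conservative stand-in for "thousands"
ERROR_RATE = 0.04           # the complement of 96% accuracy
DAYS_PER_WEEK = 7           # assumes the process runs every day
WEEKS_PER_QUARTER = 13      # assumed quarter length


def expected_wrong_decisions(decisions_per_day: int, error_rate: float, days: int) -> int:
    """Expected number of wrong decisions over a period, rounded to whole decisions."""
    return round(decisions_per_day * error_rate * days)


per_day = expected_wrong_decisions(DECISIONS_PER_DAY, ERROR_RATE, 1)
per_week = expected_wrong_decisions(DECISIONS_PER_DAY, ERROR_RATE, DAYS_PER_WEEK)
per_quarter = expected_wrong_decisions(
    DECISIONS_PER_DAY, ERROR_RATE, DAYS_PER_WEEK * WEEKS_PER_QUARTER
)

print(per_day, per_week, per_quarter)  # 40 280 3640
```

Even at the low end of "thousands", the same 4% becomes dozens a day, hundreds a week, and thousands a quarter.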

The model was 96% accurate. It was also producing wrong outcomes at industrial scale. Both of those statements were true simultaneously. The first one made it into the executive dashboard. The second one didn't — until a patterns-of-harm analysis forced the question.

Nobody had asked what "wrong" meant for the people on the receiving end of that 4%. Nobody had categorized the errors by severity. Nobody had assessed whether the errors fell disproportionately on specific populations. The accuracy metric answered the question the data science team was asking — is the model performing well? It didn't answer the question the organization should have been asking — what is the impact when it doesn't?

Accuracy is the wrong denominator

The problem with accuracy as a governance metric is that it treats all errors as equal. A false positive and a false negative count the same in an accuracy calculation. A wrong answer on a low-stakes decision and a wrong answer on a life-altering decision count the same. The metric doesn't distinguish between errors that cause inconvenience and errors that cause harm.
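To make that concrete, here is a small sketch with two made-up confusion matrices that share the same accuracy but distribute their errors very differently. The counts and the cost weights (a false negative costing 20 times a false positive) are hypothetical, chosen only to show the effect. In practice the weights would come from the business and risk owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of true/false positives and negatives for one model."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn

    def accuracy(self) -> float:
        # Accuracy only asks whether a decision was right, not how it was wrong.
        return (self.tp + self.tn) / self.total

    def weighted_cost(self, fp_cost: float, fn_cost: float) -> float:
        # Weighs each error type by an assumed cost per occurrence.
        return self.fp * fp_cost + self.fn * fn_cost


# Hypothetical models: both make 50 errors in 1,000 decisions.
model_a = ConfusionCounts(tp=90, fp=45, tn=860, fn=5)   # mostly false positives
model_b = ConfusionCounts(tp=50, fp=5, tn=900, fn=45)   # mostly false negatives

# Assumed cost weights, for illustration only.
FP_COST, FN_COST = 1, 20

for name, m in (("A", model_a), ("B", model_b)):
    print(name, m.accuracy(), m.weighted_cost(FP_COST, FN_COST))
# A 0.95 145
# B 0.95 905
```

Both models report 95% accuracy. Under these assumed weights, one carries roughly six times the cost of the other.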

This isn't a flaw in how accuracy is calculated. It's a flaw in how accuracy is used. Accuracy is a useful engineering metric for model development. It tells the data science team whether the model is learning, whether changes improve performance, whether the model generalizes to held-out data. It's the right tool for building models.

It's the wrong tool for governing them.

Governance requires understanding the consequences of failure, not just the frequency of failure. A 99%-accurate model that screens medical images, where the remaining 1% of errors includes missed malignancies, poses a fundamentally different risk from a 90%-accurate model that recommends articles. The accuracy numbers suggest the first model is better. The consequences say the second one is safer.

What to measure instead

The governance question isn't "how often is this model right?" It's a set of harder questions that accuracy doesn't answer.

What happens when it's wrong? Map the downstream consequences of each error type. A false positive on a fraud detection system means a legitimate customer gets blocked. A false negative means a fraudulent transaction goes through. These are different consequences with different severities and different remediation paths. Governance needs to know both the frequency and the impact of each.

Who bears the cost? Errors are rarely distributed evenly. A model trained on data that underrepresents a population will often perform worse for that population. The aggregate accuracy number can look strong while specific groups experience significantly higher error rates. If you're only reporting the aggregate, you're not governing; you're averaging away the harm.
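A rough sketch of what disaggregating looks like, using synthetic records. The group labels "A" and "B", the record layout, and the counts are all invented for illustration. A real analysis would also need to decide which groupings matter and whether the sample sizes support the comparison.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: error rate} from (group, predicted, actual) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Synthetic data: group B is underrepresented (100 of 1,000 decisions).
records = (
    [("A", True, True)] * 873      # group A, correct
    + [("A", True, False)] * 27    # group A, wrong
    + [("B", True, True)] * 77     # group B, correct
    + [("B", False, True)] * 23    # group B, wrong
)

overall_error = sum(p != a for _, p, a in records) / len(records)
by_group = error_rates_by_group(records)

print(overall_error)  # 0.05
print(by_group)       # {'A': 0.03, 'B': 0.23}
```

The aggregate says 95% accurate. Group B, a tenth of the volume, is absorbing nearly half of the errors.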

What's the blast radius? A model that advises a human decision-maker has a smaller blast radius than a model that makes autonomous decisions. A model that affects one customer at a time has a smaller blast radius than a model that affects a cohort simultaneously. The same error rate produces different levels of organizational risk depending on how many people are affected and how quickly.

Is there a fallback? When the model is wrong, what happens next? Is there a human review? An appeal process? An automatic correction mechanism? Or does the wrong decision simply stand? The existence and quality of the fallback is a governance factor that accuracy doesn't capture.

The dashboard nobody builds

Every AI program I've seen has a model performance dashboard. Accuracy, precision, recall, F1 — the standard metrics, tracked over time, reported to leadership. These dashboards are necessary. They're not sufficient.

The dashboard nobody builds is the impact dashboard. How many wrong decisions were made this week? What was the severity distribution? Which populations were most affected? What was the remediation cost? How many of the errors were caught by downstream processes and how many reached the end user?

This dashboard is harder to build because it requires connecting model outputs to real-world outcomes. It requires knowing not just that the model predicted X, but that the prediction resulted in action Y, and that action Y affected person Z in a measurable way. Most organizations don't have this feedback loop instrumented. They know what the model predicted. They don't systematically track what happened as a result.

Building this feedback loop is the difference between reporting accuracy and governing impact. It's the difference between telling leadership "the model is 95% accurate" and telling them "the model produced 47 incorrect denials this week, disproportionately affecting applicants in three ZIP codes, and 12 of those denials have been escalated."
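As a sketch of the minimum such a feedback loop has to produce, assume each decision can eventually be joined to an outcome record. The `DecisionOutcome` fields, the severity labels, and the sample data below are hypothetical. The hard part in practice is the join itself, which this example takes as given.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionOutcome:
    """One model decision joined to what actually happened (hypothetical schema)."""
    decision_id: str
    correct: bool
    severity: str            # assumed labels: "none", "low", "high"
    segment: str             # e.g. a ZIP code or other population grouping
    caught_downstream: bool  # error was intercepted before reaching the end user
    escalated: bool


def impact_summary(outcomes):
    """Summarize wrong decisions by severity, segment, reach, and escalation."""
    wrong = [o for o in outcomes if not o.correct]
    return {
        "wrong_decisions": len(wrong),
        "by_severity": dict(Counter(o.severity for o in wrong)),
        "by_segment": dict(Counter(o.segment for o in wrong)),
        "reached_end_user": sum(not o.caught_downstream for o in wrong),
        "escalated": sum(o.escalated for o in wrong),
    }


# Made-up sample week.
outcomes = [
    DecisionOutcome("d1", True, "none", "zip-1", False, False),
    DecisionOutcome("d2", False, "high", "zip-1", False, True),
    DecisionOutcome("d3", False, "low", "zip-2", True, False),
    DecisionOutcome("d4", False, "high", "zip-1", False, False),
    DecisionOutcome("d5", True, "none", "zip-3", False, False),
]

summary = impact_summary(outcomes)
print(summary)
```

None of these fields come from the model. They come from whatever system records what happened after the model spoke, which is why this dashboard is an organizational project rather than a data science one.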

The first statement is a metric. The second is governance.

The shift

The organizations that govern AI well don't stop measuring accuracy. They stop treating it as the primary indicator of whether a system is working. They add the questions that accuracy can't answer: what breaks when this is wrong, who gets hurt, and how would we know?

These aren't technical questions. They're organizational ones. They require data scientists, business owners, risk teams, and compliance working together to define what failure means in human terms, not just statistical ones.

Ninety-five percent accuracy is a number. What happens in the other 5% is a governance program.