What the Military Taught Me About Evaluating Systems Under Pressure

Before I was a bank examiner or an AI PM, I was an OC/T and Drill Sergeant. The gap between military evaluation standards and enterprise AI evaluation is enormous.

Before I was a bank examiner or an AI PM, I was an Observer Controller/Trainer (OC/T) and a Drill Sergeant in the United States Army. The job was straightforward: evaluate units and individuals performing under stress, against defined standards, and tell them the truth about what you observed.

That job shaped every evaluation I've done since, and it shapes how I think organizations should be evaluating their AI systems. The gap between what the Army requires and what enterprise AI programs accept is enormous.

Define the standard before you test

In the Army, the evaluation criteria exist before the exercise begins. The standard is published. The conditions are specified. The evaluator knows what "go" looks like and what "no-go" looks like before the first round is fired. There is no ambiguity about what success means, and there is no renegotiating the standard after the fact because the unit had a rough day.

Most enterprise AI evaluations work in the opposite direction. The team builds the model, runs it against a test set, gets a number, and then decides whether the number is good enough. The success criteria are set after the results are in. This is not evaluation. This is rationalization.

If you can't define what "good enough" looks like before you run the test, you don't have a standard. You have a hope.
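One way to picture a standard that exists before the test is a set of criteria frozen in code and committed before any results come in. The sketch below is illustrative only: the metric names, the thresholds, and the Standard class are hypothetical, not taken from any particular program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the standard cannot be edited after it is defined
class Standard:
    """One pre-registered criterion: metric name, threshold, and direction."""
    metric: str
    threshold: float
    higher_is_better: bool = True

    def is_go(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


# Hypothetical standards, written down before the first test run.
STANDARDS = (
    Standard("accuracy", 0.95),
    Standard("p95_latency_ms", 300.0, higher_is_better=False),
)


def evaluate(results: dict[str, float]) -> dict[str, str]:
    """Return GO or NO-GO per metric; a metric that was never measured is a NO-GO."""
    verdicts = {}
    for s in STANDARDS:
        observed = results.get(s.metric)
        passed = observed is not None and s.is_go(observed)
        verdicts[s.metric] = "GO" if passed else "NO-GO"
    return verdicts


verdicts = evaluate({"accuracy": 0.93, "p95_latency_ms": 280.0})
print(verdicts)  # {'accuracy': 'NO-GO', 'p95_latency_ms': 'GO'}
```

The point of the frozen dataclass is small but real: nobody gets to lower the accuracy threshold to 0.93 after the number comes back.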

Observe behavior under realistic conditions

An OC/T evaluates units during force-on-force exercises, lane training, live-fire events — conditions designed to approximate the stress and friction of actual operations. You don't evaluate a platoon by watching them walk through a rehearsal in daylight with no opposing force. You evaluate them when it's dark, when communications break down, when the plan falls apart and they have to adapt.

Enterprise AI evaluation is almost entirely the rehearsal version. Curated test sets. Clean inputs. Controlled environments. The model performs well against the data it was designed to perform well against. Then it ships to production, where the inputs are messy, the edge cases are real, and the users don't behave like the test harness assumed they would.

If you haven't tested the system under conditions that resemble production — with real data, real users, real failure modes — you haven't evaluated it. You've confirmed your own assumptions.
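A toy illustration of the rehearsal problem, under heavy assumptions: toy_model below is a deliberately brittle stand-in for a real model, and both test sets are invented. The same system is scored once on curated inputs and once on inputs that look more like what users actually type.

```python
def toy_model(text: str) -> str:
    """Stand-in for a real model: a brittle, case-sensitive keyword rule."""
    return "refund" if "refund" in text else "other"


def accuracy(cases: list[tuple[str, str]]) -> float:
    """Fraction of (text, expected_label) cases the model gets right."""
    correct = sum(1 for text, label in cases if toy_model(text) == label)
    return correct / len(cases)


# Curated set: clean phrasing the model was built around (hypothetical data).
curated = [
    ("I want a refund", "refund"),
    ("refund my order", "refund"),
    ("where is my package", "other"),
    ("change my address", "other"),
]

# Production-like set: shouting, typos, and paraphrase (hypothetical data).
production_like = [
    ("I want a REFUND", "refund"),
    ("pls refnd my order", "refund"),
    ("money back now", "refund"),
    ("where is my package", "other"),
]

print(f"curated:         {accuracy(curated):.0%}")          # 100%
print(f"production-like: {accuracy(production_like):.0%}")  # 25%
```

The curated score is not wrong. It is just an answer to a question nobody in production is asking.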

Document findings without diplomacy

The After Action Review is the Army's post-exercise debrief, and it is one of the most honest organizational processes I've ever participated in. The structure is simple: What was supposed to happen? What actually happened? Why was there a difference? What are we going to do about it?

The critical element is the second question. Not the plan. Not the intent. What actually happened. And the OC/T's job is to answer that question with evidence — observed actions, times, locations, outcomes — not with opinion softened by politics.

In an AAR, a lieutenant doesn't get to reframe a failed breach as "a learning opportunity." The OC/T observed the breach. The breach failed. Here's why. Here's what the standard required. Here's the gap. Now the unit decides how to fix it.

Compare this to how most AI programs conduct post-deployment reviews. The findings are filtered through layers of organizational sensitivity. "The model experienced some accuracy degradation in certain edge cases" means "the model was wrong 30% of the time for an entire customer segment and nobody caught it for two months." The language is designed to protect relationships, not to surface truth.

You cannot improve what you refuse to accurately describe.
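Describing it accurately can start with refusing to report only the aggregate. The sketch below uses made-up records and hypothetical segment names to show how an overall error rate that sounds tolerable can sit on top of a segment where the model is wrong 30% of the time.

```python
from collections import defaultdict


def error_rates_by_segment(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each segment to its error rate, given (segment, was_correct) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for segment, was_correct in records:
        totals[segment] += 1
        if not was_correct:
            errors[segment] += 1
    return {seg: errors[seg] / totals[seg] for seg in totals}


# Hypothetical outcomes: 80 retail cases (2 wrong), 20 small-business cases (6 wrong).
records = (
    [("retail", True)] * 78
    + [("retail", False)] * 2
    + [("small_business", True)] * 14
    + [("small_business", False)] * 6
)

overall_error = sum(1 for _, ok in records if not ok) / len(records)
by_segment = error_rates_by_segment(records)

print(f"overall: wrong {overall_error:.0%} of the time")
for segment, rate in by_segment.items():
    print(f"{segment}: wrong {rate:.1%} of the time")
```

With these numbers the overall line reads 8%, while the small_business line reads 30.0%. The first is what ends up in the slide deck. The second is what actually happened.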

Hold units accountable for remediation

In the Army, a "no-go" on a critical task means the unit retrains and retests. There is no "acknowledged finding" that sits in a tracking spreadsheet for three quarters. The remediation has a timeline, an owner, and a retest. If the unit fails the retest, that's a different conversation — one that involves the commander, not just the training officer.

In enterprise AI, findings from evaluations routinely go into backlogs where they compete with feature work for prioritization. A known model deficiency might sit unaddressed for months because the team is busy building the next model. There's no forcing function. There's no retest. There's no one asking "did you fix the thing we found?"

This is the difference between an evaluation culture and a reporting culture. Evaluation cultures use findings to drive remediation. Reporting cultures use findings to demonstrate that someone looked.
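A forcing function does not have to be elaborate. As a minimal sketch, assuming a hypothetical Finding record (the field names and status labels are mine, not from any tracking tool), the rule is that a finding closes only on a passed retest, and a failed retest escalates rather than going back into the backlog.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Finding:
    """A single evaluation finding with an owner, a deadline, and a retest result."""
    description: str
    owner: str
    due: date
    retest_passed: Optional[bool] = None  # None means the retest has not happened yet

    def status(self, today: date) -> str:
        if self.retest_passed is True:
            return "closed"    # only a passed retest closes a finding
        if self.retest_passed is False:
            return "escalate"  # failed retest goes up the chain
        return "overdue" if today > self.due else "open"


finding = Finding(
    description="High error rate on one customer segment",  # hypothetical example
    owner="model-team-lead",
    due=date(2024, 3, 1),
)

print(finding.status(today=date(2024, 2, 15)))  # open
print(finding.status(today=date(2024, 3, 15)))  # overdue

finding.retest_passed = False
print(finding.status(today=date(2024, 3, 15)))  # escalate
```

Notice what is missing: there is no "acknowledged" status. A finding is open, overdue, escalated, or closed by evidence.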

The gap

The Army builds evaluation systems for conditions where failure has immediate, physical consequences. That discipline — standards-first, realistic conditions, honest findings, mandatory remediation — produces units that can perform under pressure because they've been tested under pressure and held accountable for the results.

Enterprise AI evaluation is, in most organizations, designed to produce comfort. The test sets are curated, the findings are diplomatic, and the remediation is optional. The evaluation exists to confirm that the program is going well, not to discover where it's failing.

Not every AI program needs to run like a military exercise. But the discipline translates: define the standard before you test, observe under realistic conditions, document what actually happened without softening it, and make remediation non-optional.

The organizations that adopt even half of that rigor will find problems earlier, fix them faster, and build systems that hold up under pressure — not just under demo conditions.

The ones that don't will keep producing evaluations that tell leadership exactly what leadership wants to hear. Right up until the system fails in production and everyone discovers that the evaluation never tested for the thing that broke.