What 'Production-Ready' Actually Means for AI

The gap between 'it works' and 'it's production-ready' is where AI projects go to stall. Nobody agreed on what 'ready' meant before the building started.

"It works in my notebook" is a statement of fact, not a claim of readiness. The notebook is a controlled environment. The data is static. The inputs are curated. The person running it understands every assumption baked into every cell. None of those conditions exist in production.

Production is where the inputs are whatever the upstream system sends at 3am. Where the user provides text in a format nobody anticipated. Where the dependency that was always available is suddenly not. Where the person investigating a failure at 6am has never seen the code and needs to understand what went wrong in ten minutes, not ten hours.

The gap between "it works" and "it's production-ready" is where AI projects go to stall. I've watched that gap consume months — not because the model was wrong, but because nobody agreed on what "ready" meant before the building started.

The definition gap

Data scientists define "done" differently than operations engineers. This isn't a failure of either group. It's a vocabulary problem with real consequences.

For a data scientist, "done" often means: the model performs well on the evaluation set, the accuracy meets the target, the outputs are reasonable. The notebook runs end to end. The results are reproducible. That's a legitimate and rigorous definition of done — for the modeling phase.

For an operations engineer, "done" means something entirely different: the system handles unexpected inputs without crashing. It scales to production load. It logs enough to debug but not so much it fills the disk. It can be deployed, rolled back, monitored, and restarted by someone who didn't build it. There's documentation. There are runbooks. There are alerts.

Both definitions are correct. They just describe different things. And when the project plan says "model development: 8 weeks, deployment: 2 weeks," it's treating the gap between those definitions as a two-week task. It's usually a two-month task, minimum.

What production-ready actually requires

I've started keeping a mental checklist, built from watching what breaks and what doesn't. Production-ready AI isn't a quality bar for the model. It's a quality bar for the system.

It handles inputs the training data didn't include. Not gracefully in the academic sense — gracefully in the operational sense. When the model receives an input outside its expected distribution, it doesn't crash, it doesn't hallucinate confidently, and it doesn't silently return a garbage result that looks plausible. It flags the anomaly, returns a safe default or a low-confidence indicator, and logs enough detail for someone to investigate later.
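Here is a minimal sketch of that behavior, assuming a numeric feature vector and a set of per-feature ranges recorded at training time. The names guarded_predict, Prediction, bounds, and safe_default are hypothetical, and a range check is only one simple stand-in for "outside the expected distribution"; a real system would likely need something richer.

```python
import logging
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("inference")  # hypothetical logger name


@dataclass
class Prediction:
    value: Optional[float]  # model output, or the safe default when flagged
    low_confidence: bool    # True when the guard refused to trust the input
    reason: str = ""        # plain-language explanation for whoever investigates


def guarded_predict(
    features: Sequence[float],
    model: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],  # assumed: (min, max) per feature from training
    safe_default: Optional[float] = None,
) -> Prediction:
    """Score the input only if every feature falls inside its training range."""

    def reject(reason: str) -> Prediction:
        # Log enough detail to investigate later, then return the safe default.
        logger.warning("input flagged: %s", reason)
        return Prediction(safe_default, True, reason)

    if len(features) != len(bounds):
        return reject(f"expected {len(bounds)} features, got {len(features)}")

    for i, (x, (lo, hi)) in enumerate(zip(features, bounds)):
        if not isinstance(x, (int, float)) or math.isnan(x) or not (lo <= x <= hi):
            return reject(f"feature {i}={x!r} is outside the training range [{lo}, {hi}]")

    return Prediction(model(features), False)
```

The point of the shape is that the caller always gets a Prediction back, never an exception and never an unflagged guess, so downstream code can decide what to do with a low-confidence answer.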

It degrades predictably. When a dependency fails — a feature store is slow, an upstream service returns errors, a data source is stale — the system doesn't cascade into an unrecoverable state. It degrades in a way someone designed. Maybe it falls back to a simpler model. Maybe it returns cached results. Maybe it stops serving predictions and returns an explicit "unavailable." The key word is "designed." If nobody designed the failure mode, the failure mode is whatever happens to happen, and that's usually bad.
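One way to make the failure mode a design decision rather than an accident is to write the fallback order down as code. The sketch below assumes each tier is a plain callable that either returns a value, returns None when it has no answer, or raises. The names predict_with_fallbacks and Unavailable, and the tier labels in the usage note, are hypothetical.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("inference")  # hypothetical logger name


class Unavailable(Exception):
    """Raised when every designed fallback has been tried and none could answer."""


def predict_with_fallbacks(
    request: dict,
    tiers: Sequence[Tuple[str, Callable[[dict], Optional[float]]]],
) -> Tuple[str, float]:
    """Try each tier in order and return (tier_name, value) from the first that answers."""
    for name, handler in tiers:
        try:
            value = handler(request)
        except Exception as exc:
            # A failing tier is expected here; record it and move to the next one.
            logger.warning("tier %s failed: %r", name, exc)
            continue
        if value is not None:
            return name, value
        logger.info("tier %s had no answer", name)
    # The last resort is an explicit refusal, not an arbitrary crash.
    raise Unavailable("no tier could serve this request")
```

A caller might pass tiers such as [("primary", full_model), ("cache", cached_result), ("simple_model", baseline_model)]. Returning the tier name alongside the value lets a dashboard show how often the system is running degraded.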

Someone who didn't build it can operate it. In my experience, this is the test that most "production-ready" ML systems fail. Can the on-call engineer (who has never seen the model code, doesn't know what XGBoost is, and is handling this alert at 2am) figure out what's wrong and what to do about it? If the answer requires understanding the model's internals, it's not production-ready. It's a research project with an uptime requirement.

Someone who didn't build it can tell when it's broken. Not through model metrics — through operational signals. Response latency spiked. Error rate crossed a threshold. Output distribution shifted from the baseline. Confidence scores dropped. These are things an operations dashboard can show without requiring ML expertise to interpret.
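"Output distribution shifted from the baseline" can be reduced to a single number that a dashboard can plot. One common choice is the population stability index (PSI), which compares the share of outputs falling into each bin now against the share at baseline. This is a sketch under assumptions: the function name, the ten equal-width bins, and the small floor used to avoid log(0) are my choices, not a standard.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """PSI = sum over bins of (current_share - baseline_share) * ln(current_share / baseline_share)."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp values outside the baseline range
        # Floor each share so an empty bin does not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = shares(baseline), shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

Identical samples score 0.0, and the score grows as the two distributions diverge. Alert thresholds around 0.1 and 0.25 are often quoted as rules of thumb, but treat any threshold as something to calibrate against your own system's history.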

There's a definition of "fail." This sounds obvious. It isn't. For a traditional service, "fail" means it returns errors or stops responding. For an ML system, "fail" can mean: it's returning results, the results look normal, but the results are wrong. Defining what "wrong" means — and making that definition measurable and monitorable — is one of the hardest parts of productionizing AI. Teams that skip this step ship systems that fail silently for months.
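A definition of "fail" becomes monitorable once it is written as explicit thresholds. The sketch below is one hypothetical shape for that. The field names and the metrics dictionary keys are assumptions, and labeled_accuracy assumes you eventually receive ground-truth labels for some sample of predictions, which not every system does.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FailureDefinition:
    max_error_rate: float        # fraction of requests that returned an error
    max_p95_latency_ms: float    # 95th percentile response time
    max_output_drift: float      # drift score of outputs against a baseline
    min_labeled_accuracy: float  # accuracy on predictions with delayed true labels


def breaches(defn: FailureDefinition, metrics: Dict[str, float]) -> List[str]:
    """Return the name of every condition that currently counts as 'failing'."""
    checks = [
        ("error_rate", metrics["error_rate"] > defn.max_error_rate),
        ("p95_latency_ms", metrics["p95_latency_ms"] > defn.max_p95_latency_ms),
        ("output_drift", metrics["output_drift"] > defn.max_output_drift),
        ("labeled_accuracy", metrics["labeled_accuracy"] < defn.min_labeled_accuracy),
    ]
    return [name for name, failed in checks if failed]
```

The first two checks are what a traditional service would alert on. The last two are the ones that catch a system that is responding normally and returning wrong answers.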

There's a runbook. When the system fails — per the definition above — there's a documented procedure for what to do. Not "page the data scientist." A real runbook: check these logs, look at these metrics, try these remediation steps, escalate to this person if those don't work. The runbook is the bridge between the people who built the model and the people who keep it running.

Why this gap exists

The gap isn't about competence. Data scientists are rigorous about the things they're trained to be rigorous about: model performance, statistical validity, experimental design. Operations engineers are rigorous about the things they're trained to be rigorous about: reliability, observability, incident response.

The gap exists because productionizing AI requires both sets of rigor simultaneously, and most organizations don't create the context for that to happen. The data science team builds in notebooks and hands off a model artifact. The engineering team wraps it in a service and deploys it. The assumptions that were in the data scientist's head — what inputs are valid, what performance degradation looks like, what the model is sensitive to — don't survive the handoff. They were never written down in a format that operations can use.

The fix is agreement, not process

More process doesn't close this gap. Another stage gate doesn't close it. What closes it is a shared definition of "production-ready" that both the people building the model and the people operating the system agree to before anyone starts writing code.

That definition should be specific to the system. A fraud detection model has different production-readiness requirements than a recommendation engine. But the categories are consistent: input handling, failure modes, operability, observability, documentation.

Write the definition at the start of the project. Make it a contract between the ML team and the operations team. Review it at every milestone. Don't let the model move to deployment until both teams sign off — not just that the model performs well, but that the system is ready to run in production without the people who built it standing behind it.
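The contract does not have to live in code, but sketching it as data shows how little is needed to make the sign-off checkable. Everything here is hypothetical: the Criterion type, the "ml" and "ops" signer labels, and the blockers function. Only the five categories come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set

CATEGORIES = (
    "input handling",
    "failure modes",
    "operability",
    "observability",
    "documentation",
)
REQUIRED_SIGNERS = {"ml", "ops"}  # assumed team labels


@dataclass
class Criterion:
    category: str
    statement: str  # a specific, testable requirement for this system
    signed_off_by: Set[str] = field(default_factory=set)


def blockers(criteria: Sequence[Criterion]) -> List[str]:
    """List everything still standing between the model and deployment."""
    problems = []
    covered = {c.category for c in criteria}
    for category in CATEGORIES:
        if category not in covered:
            problems.append(f"no criterion for {category}")
    for c in criteria:
        missing = REQUIRED_SIGNERS - c.signed_off_by
        if missing:
            problems.append(f"{c.statement!r} lacks sign-off from {sorted(missing)}")
    return problems
```

An empty list from blockers means every category has at least one criterion and both teams have signed every one of them. Reviewing that list at each milestone is one lightweight way to keep the agreement visible.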

The teams I've seen do this well don't ship faster. They ship fewer times. They don't launch and then spend three months stabilizing. They launch and it works, because "works" was defined before the first line of code.

That's what production-ready means. Not "it works." It works, and we agreed on what "works" means, and we built the system to prove it continuously.