Model monitoring in production: what to actually watch

2026-03-14 · Mikael Laakso


"Monitoring" in ML is a word that covers everything from CloudWatch CPU graphs to PhD-grade distributional drift detection. Most teams get the extremes right and the middle wrong. Here's the middle.

Four layers of monitoring

Layer              Question it answers                            Tools
1. Infrastructure  Is the service running?                        Prometheus, Datadog
2. Serving         Is the model responding correctly?             Logs, tracing
3. Input data      Is the incoming data what the model expects?   Drift detection
4. Output quality  Are predictions still good?                    Delayed labels, feedback

Most failures we see are not at layer 1. They're at layer 3 (silent data drift) or layer 4 (quality degradation the model doesn't know about).

Signals worth owning

Prediction distribution shift. Compare the distribution of model outputs today vs last week. Use PSI (Population Stability Index) or a simple two-sample Kolmogorov–Smirnov test. Sudden shifts usually mean the input changed.
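
A minimal PSI sketch in plain Python, assuming baseline and current are lists of model scores from the two windows (the function name and quantile-binning choice are illustrative, not from this post):

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between two score samples.
    Bin edges come from baseline quantiles; eps avoids log(0)."""
    eps = 1e-6
    baseline = sorted(baseline)
    # quantile-based bin edges taken from the baseline window
    edges = [baseline[int(i * (len(baseline) - 1) / bins)] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            # number of edges below x = index of x's bin
            counts[sum(1 for e in edges if x > e)] += 1
        return [c / len(sample) + eps for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

A common rule of thumb is that PSI above ~0.25 signals a major shift, which is why the alert schema below uses that threshold.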

Input feature drift. Same, but for input features. Critical for features coming from upstream services that might silently change their semantics.

Latency percentiles. p50, p95, p99. Always all three. A model that goes from 100ms to 400ms p99 with stable p50 has a tail-latency problem nobody noticed yet.
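
A nearest-rank percentile over raw latency samples is enough to see the effect. In this made-up example, 2% of requests degrade to 400 ms while p50 and p95 stay flat:

```python
def percentile(samples, q):
    """Nearest-rank percentile; good enough for a monitoring dashboard."""
    xs = sorted(samples)
    idx = min(len(xs) - 1, round(q / 100 * (len(xs) - 1)))
    return xs[idx]

# 980 requests at ~100 ms, 20 slow outliers at ~400 ms
latencies_ms = [100] * 980 + [400] * 20
p50, p95, p99 = (percentile(latencies_ms, q) for q in (50, 95, 99))
# p50 and p95 stay at 100; p99 jumps to 400
```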

Error rate by segment. Aggregate accuracy can hide a 20% drop on a specific user cohort. Monitor by segment (geo, tier, product area).
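
A sketch of segment-level error tracking, assuming delayed labels have been joined back to predictions (the record shape here is an assumption for illustration):

```python
from collections import defaultdict

def error_rate_by_segment(records):
    """records: iterable of (segment, correct: bool) pairs.
    Returns {segment: error_rate}."""
    totals, errors = defaultdict(int), defaultdict(int)
    for seg, correct in records:
        totals[seg] += 1
        if not correct:
            errors[seg] += 1
    return {seg: errors[seg] / totals[seg] for seg in totals}

# aggregate error is ~14%, which hides a 50% error rate on one cohort:
records = ([("eu", True)] * 90 + [("eu", False)] * 10
           + [("apac", True)] * 5 + [("apac", False)] * 5)
rates = error_rate_by_segment(records)
# rates["eu"] is 0.10, rates["apac"] is 0.50
```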

Signals that are mostly noise

  • Overall accuracy on unlabeled traffic. Without ground truth it's a proxy at best.
  • Feature statistics with no reference window. "Mean shifted from 0.42 to 0.44" — is that bad? You need a baseline to compare.
  • Weekly alerts with no action attached. If there's no runbook, the alert is just anxiety.

Detection vs diagnosis

Monitoring tells you something's wrong. Diagnosis is a separate activity. We maintain a "diagnostic pack" for each model: a script that, given a time window, outputs:

  • Top-N features by PSI
  • Top-N input segments by error rate change
  • Sample of predictions in the affected window
  • Link to the training data vs current distribution comparison

This takes the "what changed" question from a 3-hour exploration to a 10-minute readout.
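
The first item in the pack can be sketched as a simple ranking pass. This version uses a crude normalized mean shift as the drift score instead of full PSI, and the data shapes are assumptions, not the actual diagnostic script:

```python
import statistics

def top_drifted_features(baseline, current, n=3):
    """baseline/current: {feature: [values]} for the two windows.
    Rank features by |mean shift| / baseline stdev (a crude
    stand-in for per-feature PSI)."""
    scores = {}
    for feat, base_vals in baseline.items():
        cur_vals = current.get(feat, [])
        if not cur_vals:
            continue  # feature missing in the window: worth its own alert
        sd = statistics.pstdev(base_vals) or 1.0
        shift = abs(statistics.mean(cur_vals) - statistics.mean(base_vals))
        scores[feat] = shift / sd
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The point is not the statistic; it's that the readout is pre-ranked, so the on-call engineer starts at the most suspicious feature instead of exploring.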

Example alert schema

alert: prediction_distribution_shift
model: churn-predictor-v7
measure: psi
threshold: 0.25
window: 1h
baseline: 7d
runbook: https://docs/runbooks/churn-drift
owner: ml-core@mlpipeline-cloud.com

Every field is there for a reason. Especially runbook and owner.
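
A minimal loader for that flat schema, sketched without a YAML dependency (a real system would use a proper YAML parser and typed validation):

```python
REQUIRED = {"alert", "model", "measure", "threshold",
            "window", "baseline", "runbook", "owner"}

def parse_alert(text):
    """Parse 'key: value' lines into a dict and check required fields."""
    alert = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")  # first ':' only, so URLs survive
        alert[key.strip()] = value.strip()
    missing = REQUIRED - alert.keys()
    if missing:
        raise ValueError(f"alert config missing fields: {sorted(missing)}")
    alert["threshold"] = float(alert["threshold"])
    return alert
```

Failing loudly on a missing runbook or owner at config-load time is exactly how those fields stay populated.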

What to do when it fires

  1. Don't retrain immediately. Retraining on drifted data often makes things worse. Diagnose first.
  2. Check upstream. 70% of "model drift" we've investigated turned out to be an upstream pipeline bug.
  3. Shadow the new model. When you do retrain, shadow-deploy before switching traffic.
  4. Document. Every time an alert fires, log whether it was real and what the cause was. That log becomes the knowledge base for tuning thresholds.
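
Step 3 can be as simple as running the candidate on a copy of each request and logging, never serving, its output. The interfaces here are hypothetical (models as callables, a list-like log sink):

```python
def serve_with_shadow(request, live_model, shadow_model, log):
    """Return the live model's answer; run the shadow model on the
    same input and log both outputs for offline comparison."""
    live = live_model(request)
    try:
        shadow = shadow_model(request)
    except Exception:
        shadow = None  # a shadow failure must never affect the response
    log.append({"request": request, "live": live, "shadow": shadow})
    return live
```

Once the logged live/shadow pairs agree (or the shadow wins on delayed labels), switching traffic is a config change rather than a leap of faith.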

Conclusion

Good ML monitoring is less about fancy statistics and more about owning the layers of the stack, having clear runbooks, and resisting the urge to retrain first and think later. Most production ML pain comes from not knowing what broke — not from not having a clever detection algorithm.

