Shadow deployments for ML: catching regressions before they hit users
2026-02-22 · Mikael Laakso
Canary deployments work well for services. For ML models, they have one awkward property: the canary's predictions go to real users, so a bad canary affects someone. Shadow deployment sidesteps this by routing a copy of production traffic to the new model without using its output.
The pattern
┌─── [Model v7] ─── served to user
request ──────┤
└─── [Model v8] ─── logged, compared, discarded
Both models see the same input. Only v7's prediction is used. v8's prediction is logged alongside v7's. You now have a side-by-side comparison on real production traffic without any user-visible risk.
What to compare
| Metric | Reveals |
|---|---|
| Prediction agreement rate | How often models disagree |
| Confidence distribution | Is new model more/less certain? |
| Latency distribution | Production-load performance |
| Error rate | Hidden bugs under real inputs |
| Resource use | CPU/GPU/memory under load |
| Segment-level metrics | Differences on specific cohorts |
The last one is what usually saves us. An aggregate 2% improvement can hide a 15% regression on one user segment.
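A minimal sketch of that segment-level check, computed over the logged comparisons. The record shape (`segment`, `primary`, `shadow` keys) is an assumption for illustration, not the post's actual log schema:

```python
from collections import defaultdict

def agreement_by_segment(comparisons):
    """Agreement rate per segment from logged comparison records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for c in comparisons:
        totals[c["segment"]] += 1
        if c["primary"] == c["shadow"]:
            hits[c["segment"]] += 1
    return {seg: hits[seg] / totals[seg] for seg in totals}

# Toy data: aggregate agreement is 75%, but one cohort sits at 50%.
rows = [
    {"segment": "free", "primary": 1, "shadow": 1},
    {"segment": "free", "primary": 0, "shadow": 0},
    {"segment": "enterprise", "primary": 1, "shadow": 0},
    {"segment": "enterprise", "primary": 1, "shadow": 1},
]
# agreement_by_segment(rows) → {"free": 1.0, "enterprise": 0.5}
```

Breaking agreement out per cohort is exactly how the hidden 15% regression shows up.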
Implementation
Simplest version: a teeing serving layer.
```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):  # typed body instead of a raw Request
    id: str
    features: dict

@app.post("/predict")
async def predict(req: PredictRequest):
    primary = await model_v7.predict(req.features)
    # Fire-and-forget: the shadow call never blocks the response
    asyncio.create_task(shadow_and_log(req, primary))
    return primary

async def shadow_and_log(req, primary_pred):
    try:
        shadow_pred = await model_v8.predict(req.features)
        await log_comparison(req.id, primary_pred, shadow_pred)
    except Exception:
        # Shadow failures are counted, never raised into the primary path
        metrics.increment("shadow.error", tags={"model": "v8"})
```
Two things matter here:

1. Shadow must not affect primary latency. If v8 blocks on I/O, it should not slow v7. Fire-and-forget or a separate worker pool.
2. Shadow errors must not affect primary. Caught, logged, never raised up the primary path.
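The "separate worker pool" option can be sketched with a bounded `asyncio.Queue`: when the queue is full, shadow samples are dropped instead of ever blocking the request path. `shadow_and_log` is the handler from the serving snippet; the queue size is an arbitrary choice:

```python
import asyncio

SHADOW_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=1000)

def submit_shadow(req, primary_pred):
    """Enqueue a shadow sample; drop it (never block) if the queue is full."""
    try:
        SHADOW_QUEUE.put_nowait((req, primary_pred))
    except asyncio.QueueFull:
        pass  # losing a shadow sample is fine; slowing the primary path is not

async def shadow_worker():
    """Long-running consumer task started at app startup."""
    while True:
        req, primary_pred = await SHADOW_QUEUE.get()
        try:
            await shadow_and_log(req, primary_pred)  # handler from above
        except Exception:
            pass  # shadow errors stay inside the shadow path
        finally:
            SHADOW_QUEUE.task_done()
```

The bounded queue also gives you backpressure for free: if v8 is slow, you shadow less traffic rather than accumulating unbounded pending tasks.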
How long to shadow
Depends on traffic volume and seasonality. Rule of thumb:

- Statistical significance: until you have at least 10k comparison samples per relevant segment.
- Temporal coverage: at least one full business cycle (for us, usually a week — Monday looks different from Saturday).
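Those two rules combine into a rough duration calculator. A sketch with illustrative numbers; the function and its arguments are mine, not part of any real tooling:

```python
import math

def shadow_days_needed(daily_requests, sample_rate, smallest_segment_share,
                       min_samples=10_000, min_days=7):
    """Days of shadowing until the smallest segment reaches min_samples,
    floored at one full business cycle (min_days)."""
    per_day = daily_requests * sample_rate * smallest_segment_share
    return max(min_days, math.ceil(min_samples / per_day))

# e.g. 200k req/day, shadowing 25% of traffic, smallest cohort is 2%:
shadow_days_needed(200_000, 0.25, 0.02)  # → 10
```

Note the binding constraint is almost always the smallest segment you care about, not total traffic.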
Pitfalls
- Non-idempotent features. If feature generation has side effects (logging, calls to external APIs), running it twice can double-count. Use read-only feature fetch for shadow.
- Cost. You're doubling inference cost. On GPU-heavy models, shadowing at 100% traffic is expensive. Sample 10–25% instead.
- Delayed labels. Prediction agreement ≠ quality. You still need delayed ground truth to evaluate which model is actually better.
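For the cost pitfall, one way to sample shadow traffic is to hash the request id, so the decision is deterministic per request (retries shadow the same way) rather than a coin flip. `SHADOW_SAMPLE_RATE` and `should_shadow` are names I'm inventing for the sketch:

```python
import hashlib

SHADOW_SAMPLE_RATE = 0.2  # shadow ~20% of traffic to cap inference cost

def should_shadow(request_id: str) -> bool:
    """Deterministic per-request sampling via a stable hash of the id."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # First byte is uniform over 0..255; compare against the rate threshold.
    return digest[0] < SHADOW_SAMPLE_RATE * 256

# In the handler: only tee the request when should_shadow(req.id) is True.
```

`hashlib` rather than the built-in `hash()` because the latter is salted per process, which would make sampling decisions inconsistent across replicas.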
Shadow vs canary vs A/B
| Pattern | User impact | Label needed for decision | Use when |
|---|---|---|---|
| Shadow | none | for ground truth only | pre-promotion validation |
| Canary | small % affected | yes, from canary segment | catch live regressions early |
| A/B | measured cohort | yes | compare two live models |
Shadow is pre-promotion. Canary is post-promotion. Don't skip either.
Conclusion
Shadow deployment is the cheapest bug-catcher in an ML team's toolkit. It doesn't replace offline eval or A/B tests — it's the layer between them, and it catches the bugs that only show up in production traffic patterns.