Shadow deployments for ML: catching regressions before they hit users
2026-02-22 · Mikael Laakso
Canary deployments work well for services. For ML models, they have one awkward property: the canary's predictions go to real users, so a bad canary affects someone. Shadow deployment sidesteps this by routing a copy of production traffic to the new model without using its output.
The pattern
┌─── [Model v7] ─── served to user
request ──────┤
└─── [Model v8] ─── logged, compared, discarded
Both models see the same input. Only v7's prediction is used. v8's prediction is logged alongside v7's. You now have a side-by-side comparison on real production traffic without any user-visible risk.
What to compare
| Metric | Reveals |
|---|---|
| Prediction agreement rate | How often models disagree |
| Confidence distribution | Is new model more/less certain? |
| Latency distribution | Production-load performance |
| Error rate | Hidden bugs under real inputs |
| Resource use | CPU/GPU/memory under load |
| Segment-level metrics | Differences on specific cohorts |
The last one is what usually saves us. An aggregate 2% improvement can hide a 15% regression on one user segment.
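A minimal sketch of that segment-level check, computed over the logged comparisons. The record shape (`segment`, `primary`, `shadow` keys) is an assumption for illustration, not the post's actual log schema:

```python
from collections import defaultdict

def agreement_by_segment(comparisons):
    """Agreement rate per segment from logged comparison records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for c in comparisons:
        totals[c["segment"]] += 1
        if c["primary"] == c["shadow"]:
            hits[c["segment"]] += 1
    return {seg: hits[seg] / totals[seg] for seg in totals}

# Toy data: aggregate agreement is 75%, but one cohort sits at 50%.
rows = [
    {"segment": "free", "primary": 1, "shadow": 1},
    {"segment": "free", "primary": 0, "shadow": 0},
    {"segment": "enterprise", "primary": 1, "shadow": 0},
    {"segment": "enterprise", "primary": 1, "shadow": 1},
]
# agreement_by_segment(rows) → {"free": 1.0, "enterprise": 0.5}
```

Breaking agreement out per cohort is exactly how the hidden 15% regression shows up.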
Implementation
Simplest version: a teeing serving layer.
```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):  # typed body instead of a raw Request
    id: str
    features: dict

@app.post("/predict")
async def predict(req: PredictRequest):
    primary = await model_v7.predict(req.features)
    # Fire-and-forget: the shadow call never blocks the response
    asyncio.create_task(shadow_and_log(req, primary))
    return primary

async def shadow_and_log(req, primary_pred):
    try:
        shadow_pred = await model_v8.predict(req.features)
        await log_comparison(req.id, primary_pred, shadow_pred)
    except Exception:
        # Shadow failures are counted, never raised into the primary path
        metrics.increment("shadow.error", tags={"model": "v8"})
```
Two things matter here:

1. Shadow must not affect primary latency. If v8 blocks on I/O, it should not slow v7. Fire-and-forget or a separate worker pool.
2. Shadow errors must not affect primary. Caught, logged, never raised up the primary path.
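The "separate worker pool" option can be sketched with a bounded `asyncio.Queue`: when the queue is full, shadow samples are dropped instead of ever blocking the request path. `shadow_and_log` is the handler from the serving snippet; the queue size is an arbitrary choice:

```python
import asyncio

SHADOW_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=1000)

def submit_shadow(req, primary_pred):
    """Enqueue a shadow sample; drop it (never block) if the queue is full."""
    try:
        SHADOW_QUEUE.put_nowait((req, primary_pred))
    except asyncio.QueueFull:
        pass  # losing a shadow sample is fine; slowing the primary path is not

async def shadow_worker():
    """Long-running consumer task started at app startup."""
    while True:
        req, primary_pred = await SHADOW_QUEUE.get()
        try:
            await shadow_and_log(req, primary_pred)  # handler from above
        except Exception:
            pass  # shadow errors stay inside the shadow path
        finally:
            SHADOW_QUEUE.task_done()
```

The bounded queue also gives you backpressure for free: if v8 is slow, you shadow less traffic rather than accumulating unbounded pending tasks.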
How long to shadow
Depends on traffic volume and seasonality. Rule of thumb:

- Statistical significance: until you have at least 10k comparison samples per relevant segment.
- Temporal coverage: at least one full business cycle (for us, usually a week — Monday looks different from Saturday).
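Those two rules combine into a rough duration calculator. A sketch with illustrative numbers; the function and its arguments are mine, not part of any real tooling:

```python
import math

def shadow_days_needed(daily_requests, sample_rate, smallest_segment_share,
                       min_samples=10_000, min_days=7):
    """Days of shadowing until the smallest segment reaches min_samples,
    floored at one full business cycle (min_days)."""
    per_day = daily_requests * sample_rate * smallest_segment_share
    return max(min_days, math.ceil(min_samples / per_day))

# e.g. 200k req/day, shadowing 25% of traffic, smallest cohort is 2%:
shadow_days_needed(200_000, 0.25, 0.02)  # → 10
```

Note the binding constraint is almost always the smallest segment you care about, not total traffic.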
Pitfalls
- Non-idempotent features. If feature generation has side effects (logging, calls to external APIs), running it twice can double-count. Use read-only feature fetch for shadow.
- Cost. You're doubling inference cost. On GPU-heavy models, shadowing at 100% traffic is expensive. Sample 10–25% instead.
- Delayed labels. Prediction agreement ≠ quality. You still need delayed ground truth to evaluate which model is actually better.
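For the cost pitfall, one way to sample shadow traffic is to hash the request id, so the decision is deterministic per request (retries shadow the same way) rather than a coin flip. `SHADOW_SAMPLE_RATE` and `should_shadow` are names I'm inventing for the sketch:

```python
import hashlib

SHADOW_SAMPLE_RATE = 0.2  # shadow ~20% of traffic to cap inference cost

def should_shadow(request_id: str) -> bool:
    """Deterministic per-request sampling via a stable hash of the id."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # First byte is uniform over 0..255; compare against the rate threshold.
    return digest[0] < SHADOW_SAMPLE_RATE * 256

# In the handler: only tee the request when should_shadow(req.id) is True.
```

`hashlib` rather than the built-in `hash()` because the latter is salted per process, which would make sampling decisions inconsistent across replicas.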
Shadow vs canary vs A/B
| Pattern | User impact | Label needed for decision | Use when |
|---|---|---|---|
| Shadow | none | for ground truth only | pre-promotion validation |
| Canary | small % affected | yes, from canary segment | catch live regressions early |
| A/B | measured cohort | yes | compare two live models |
Shadow is pre-promotion. Canary is post-promotion. Don't skip either.
Conclusion
Shadow deployment is the cheapest bug-catcher in an ML team's toolkit. It doesn't replace offline eval or A/B tests — it's the layer between them, and it catches the bugs that only show up in production traffic patterns.