Data drift detection: PSI vs Kolmogorov–Smirnov in practice

2026-01-20 · Emma Schmidt

Data drift is one of those topics where every team picks a test based on what the previous team used. We actually sat down and compared Population Stability Index (PSI) and the Kolmogorov–Smirnov (KS) test on our real features. Here's what we found.

Quick refresher

PSI bins the reference and current distributions, then sums (p_current − p_ref) × log(p_current / p_ref) across bins. Interpretation is conventional: <0.1 stable, 0.1–0.25 slight drift, >0.25 significant drift.
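For concreteness, here is one way to compute it — a minimal sketch, not a standard library function; the quantile-based binning and the eps floor for empty bins are our choices:

```python
import numpy as np

def compute_psi(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples, binned on reference quantiles."""
    # Quantile edges so each bin holds roughly equal reference mass
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range current values
    p_ref = np.histogram(reference, bins=edges)[0] / len(reference)
    p_cur = np.histogram(current, bins=edges)[0] / len(current)
    # eps floor guards against log(0) when a bin is empty
    p_ref = np.clip(p_ref, eps, None)
    p_cur = np.clip(p_cur, eps, None)
    return float(np.sum((p_cur - p_ref) * np.log(p_cur / p_ref)))
```

Quantile edges (rather than equal-width bins) keep the reference proportions away from zero, which makes the log term better behaved.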

KS computes the max distance between two empirical CDFs. Non-parametric, returns a statistic and a p-value.

from scipy.stats import ks_2samp

stat, pvalue = ks_2samp(reference_sample, current_sample)

Our setup

  • Features tracked: 214 (numerical, categorical, hashed)
  • Reference window: 30 days
  • Current window: rolling 1 day, updated hourly
  • Labeled drift events: 42 (manually identified by domain experts over 6 months)

For each event, we checked: did PSI catch it? Did KS catch it? How fast?

Results

Metric                          PSI            KS
Recall (of 42 known events)     0.81           0.88
Precision at threshold          0.72           0.54
Median detection delay          4h             2h
Works on categorical            yes (native)   no (needs encoding)
Sensitive to sample size        moderate       high
Threshold interpretability      high           low

When KS wins

KS is more sensitive to subtle shifts, especially in the tails. In two incidents, KS flagged a drift 6+ hours before PSI crossed its threshold. For features where tail behavior matters (e.g. latency, transaction size), KS catches the real thing earlier.

When PSI wins

PSI has better precision at conventional thresholds. KS with a naive p-value cutoff (p < 0.05) produces a flood of alerts at high traffic volumes — as sample size grows, the test degenerates into "everything is statistically different from everything." PSI's thresholds are traffic-independent.
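The sample-size effect is easy to reproduce. In this simulation (the numbers are illustrative, not from our production data), a mean shift of 0.05 standard deviations — practically negligible — is flagged as highly significant once the windows are large:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 200_000)
# Current window with a tiny, practically irrelevant mean shift
current = rng.normal(0.05, 1.0, 200_000)

stat, pvalue = ks_2samp(reference, current)
print(f"KS stat={stat:.4f}, p-value={pvalue:.2e}")
# At this volume the p-value collapses far below 0.05 even though the
# KS statistic itself stays tiny — which is why we threshold on the
# statistic, not the p-value.
```

Thresholding on the KS statistic (an effect size) instead of the p-value is the usual mitigation.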

PSI also handles categorical features natively, which sidesteps the encoding step that makes KS annoying for those features.
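A categorical variant just treats each category as its own bin over the union of observed categories (compute_psi_categorical is our name, not a library function; the eps floor for unseen categories is our choice):

```python
import math
from collections import Counter

def compute_psi_categorical(reference, current, eps=1e-6):
    """PSI over category frequencies; unseen categories get an eps floor."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for cat in set(ref_counts) | set(cur_counts):
        p_ref = max(ref_counts[cat] / len(reference), eps)
        p_cur = max(cur_counts[cat] / len(current), eps)
        psi += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return psi
```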

Hybrid approach

We ended up using both:

from scipy.stats import ks_2samp

def check_drift(feature, reference, current):
    """Return drift signals for one feature's reference/current samples."""
    signals = {}
    if is_numerical(feature):  # feature metadata decides the branch
        psi = compute_psi(reference, current, bins=10)
        ks = ks_2samp(reference, current)
        signals["psi"] = psi
        signals["ks_stat"] = ks.statistic
        # Alert on: PSI > 0.25 AND ks_stat > 0.08 (dual condition)
    else:
        signals["psi"] = compute_psi_categorical(reference, current)
    return signals

Requiring both tests to fire before alerting cut our false-positive rate by 63% while missing only 2 of the 42 real events.

What neither test catches

  • Label drift. Both measure input distribution. If inputs stay stable but the relationship between input and label changes, you need a separate monitor on prediction quality.
  • Multivariate drift. Both are univariate. If two features are individually stable but their joint distribution has shifted, you need something like domain-classifier drift detection.
  • Semantic drift. If "category_id=7" used to mean "electronics" and now means "apparel" — statistically identical, semantically broken. Only catchable with contract tests.
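For the multivariate case, here is a minimal domain-classifier sketch — plain-NumPy logistic regression, all names ours, not the tool we run in production. Label reference rows 0 and current rows 1, train a classifier to tell them apart, and read drift off its held-out accuracy: near 0.5 means the windows are indistinguishable. Pairwise product features are added so the linear model can see correlation changes, not just marginals:

```python
import numpy as np

def domain_classifier_score(reference, current, epochs=200, lr=0.1):
    """Held-out accuracy of a classifier separating reference (0) from
    current (1); ~0.5 => no detectable multivariate drift."""
    X = np.vstack([reference, current]).astype(float)
    y = np.r_[np.zeros(len(reference)), np.ones(len(current))]
    # Standardize, then append pairwise products to expose joint shifts
    X = (X - X.mean(0)) / (X.std(0) + 1e-9)
    if X.shape[1] > 1:
        prods = np.column_stack([X[:, i] * X[:, j]
                                 for i in range(X.shape[1])
                                 for j in range(i + 1, X.shape[1])])
        prods = (prods - prods.mean(0)) / (prods.std(0) + 1e-9)
        X = np.hstack([X, prods])
    # Shuffle and hold out 20% for the accuracy estimate
    idx = np.random.default_rng(0).permutation(len(X))
    X, y = X[idx], y[idx]
    n_train = int(0.8 * len(X))
    Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):  # full-batch gradient descent on log-loss
        p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))
        w -= lr * (Xtr.T @ (p - ytr)) / len(Xtr)
        b -= lr * (p - ytr).mean()
    p_test = 1.0 / (1.0 + np.exp(-(Xte @ w + b)))
    return float(((p_test > 0.5) == (yte == 1)).mean())
```

In practice you would swap in any off-the-shelf classifier and report held-out AUC; the resulting scalar slots into the same signals dict as PSI and KS.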

Conclusion

PSI and KS are complementary. Use PSI for a stable baseline with interpretable thresholds, layer KS on top for tail sensitivity, and require dual confirmation to tame false positives. Neither replaces domain-aware contract tests on the data itself.
