Data drift detection: PSI vs Kolmogorov–Smirnov in practice
2026-01-20 · Emma Schmidt
Data drift is one of those topics where every team picks a test based on what the previous team used. We actually sat down and compared Population Stability Index (PSI) and the Kolmogorov–Smirnov (KS) test on our real features. Here's what we found.
Quick refresher
PSI bins the reference and current distributions, then sums (p_current − p_ref) × log(p_current / p_ref) across bins. Interpretation is conventional: <0.1 stable, 0.1–0.25 slight drift, >0.25 significant drift.
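A minimal sketch of that computation (quantile binning on the reference window and the `eps` smoothing are implementation choices here, not the only option):

```python
import numpy as np

def compute_psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from quantiles of the reference window, so each
    reference bin carries roughly equal mass; `eps` keeps empty bins
    from producing log(0).
    """
    reference = np.asarray(reference)
    current = np.asarray(current)

    # Quantile-based edges from the reference distribution.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    # Route out-of-range current values into the outer bins.
    current = np.clip(current, edges[0], edges[-1])

    p_ref = np.histogram(reference, bins=edges)[0] / len(reference)
    p_cur = np.histogram(current, bins=edges)[0] / len(current)

    # Floor probabilities so log() and division stay defined.
    p_ref = np.clip(p_ref, eps, None)
    p_cur = np.clip(p_cur, eps, None)
    return float(np.sum((p_cur - p_ref) * np.log(p_cur / p_ref)))
```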
KS computes the max distance between two empirical CDFs. Non-parametric, returns a statistic and a p-value.
```python
from scipy.stats import ks_2samp

stat, pvalue = ks_2samp(reference_sample, current_sample)
```
Our setup
- Features tracked: 214 (numerical, categorical, hashed)
- Reference window: 30 days
- Current window: rolling 1 day, updated hourly
- Labeled drift events: 42 (manually identified by domain experts over 6 months)
For each event, we checked: did PSI catch it? Did KS catch it? How fast?
Results
| Metric | PSI | KS |
|---|---|---|
| Recall (of 42 known events) | 0.81 | 0.88 |
| Precision (at conventional thresholds) | 0.72 | 0.54 |
| Median detection delay | 4h | 2h |
| Works on categorical | yes (native) | no (needs encoding) |
| Sensitive to sample size | moderate | high |
| Threshold interpretability | high | low |
When KS wins
KS is more sensitive to subtle shifts, especially in the tails. In two incidents, KS flagged drift 6+ hours before PSI crossed its threshold. For features where tail behavior matters (e.g. latency, transaction size), KS catches the real shift earlier.
When PSI wins
PSI has better precision at conventional thresholds. KS with a naive p-value cutoff (p < 0.05) produces a flood of alerts at high traffic volume: as sample size grows, the test degenerates into "everything is statistically different from everything." PSI's thresholds are traffic-independent.
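That blow-up is easy to reproduce on synthetic data (the sample sizes below are illustrative, not our production volumes):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# A practically negligible shift: the mean moves by 0.05 standard deviations.
reference_small = rng.normal(0.0, 1.0, size=1_000)
current_small = rng.normal(0.05, 1.0, size=1_000)
reference_large = rng.normal(0.0, 1.0, size=200_000)
current_large = rng.normal(0.05, 1.0, size=200_000)

p_small = ks_2samp(reference_small, current_small).pvalue
p_large = ks_2samp(reference_large, current_large).pvalue

# At modest volume KS usually shrugs; at high volume the same tiny
# shift is declared "significant" and the p-value collapses toward zero.
print(f"n=1k:   p={p_small:.3f}")
print(f"n=200k: p={p_large:.2e}")
```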
PSI also handles categorical features natively, which is most of the annoyance with KS.
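A minimal sketch of the categorical variant (category frequencies instead of bins; the `eps` floor for categories missing from one window is one possible smoothing choice):

```python
import math
from collections import Counter

def compute_psi_categorical(reference, current, eps=1e-6):
    """PSI over category frequencies.

    Categories absent from one window are floored at `eps`
    so log(0) never occurs.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    psi = 0.0
    for cat in set(ref_counts) | set(cur_counts):
        p_ref = max(ref_counts[cat] / len(reference), eps)
        p_cur = max(cur_counts[cat] / len(current), eps)
        psi += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return psi
```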
Hybrid approach
We ended up using both:
```python
def check_drift(feature, reference, current):
    signals = {}
    if is_numerical(feature):
        psi = compute_psi(reference, current, bins=10)
        ks = ks_2samp(reference, current)
        signals["psi"] = psi
        signals["ks_stat"] = ks.statistic
        # Alert on: PSI > 0.25 AND ks_stat > 0.08 (dual condition)
    else:
        signals["psi"] = compute_psi_categorical(reference, current)
    return signals
```
Requiring both tests to fire before alerting cut our false-positive rate by 63% while missing only 2 of the 42 real events.
What neither test catches
- Label drift. Both tests measure the input distribution. If inputs stay stable but the relationship between input and label changes, you need a separate monitor on prediction quality.
- Multivariate drift. Both are univariate. If two features are individually stable but their joint distribution has shifted, you need something like domain-classifier drift detection.
- Semantic drift. If "category_id=7" used to mean "electronics" and now means "apparel" — statistically identical, semantically broken. Only catchable with contract tests.
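A sketch of the domain-classifier idea on synthetic data: two features with identical N(0, 1) marginals whose correlation flips sign between windows. Univariate PSI/KS see nothing per feature, but a classifier trained to tell the windows apart does. The model choice and sample sizes here are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def domain_classifier_auc(reference, current, seed=0):
    """Train a classifier to distinguish reference rows from current rows.

    AUC near 0.5 means the windows are indistinguishable (no drift);
    AUC well above 0.5 means the joint distribution has shifted.
    """
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
cov_ref = [[1.0, 0.9], [0.9, 1.0]]    # features strongly positively correlated
cov_cur = [[1.0, -0.9], [-0.9, 1.0]]  # same marginals, correlation flipped
reference = rng.multivariate_normal([0, 0], cov_ref, size=2000)
current = rng.multivariate_normal([0, 0], cov_cur, size=2000)

auc = domain_classifier_auc(reference, current)  # well above 0.5
```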
Conclusion
PSI and KS are complementary. Use PSI for a stable baseline with interpretable thresholds, layer KS on top for tail sensitivity, and require dual confirmation to tame false positives. Neither replaces domain-aware contract tests on the data itself.