ML pipeline best practices: reproducibility as a first-class citizen

2026-03-28 · Sofia Lindqvist

After building ML platforms at three different companies, I've seen the same mistakes repeat. Here are the seven principles that changed how my team designs pipelines.

1. Pipelines are code, data is a parameter

A pipeline with s3://prod-data/latest/ hard-coded into it is not reusable, not testable, and not portable. Every pipeline should take its input datasets as parameters. Dev, staging, and prod are just different parameter bindings of the same graph.
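A minimal sketch of what "different parameter bindings of the same graph" looks like in practice. All names here (PipelineParams, build_pipeline, the bucket paths) are illustrative, not a real framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineParams:
    raw_events_uri: str
    features_out_uri: str

def build_pipeline(params: PipelineParams) -> list[str]:
    # The graph itself never mentions an environment; it only
    # references the parameter bindings it was given.
    return [
        f"featurize --in {params.raw_events_uri} --out {params.features_out_uri}",
        f"train --features {params.features_out_uri}",
    ]

dev = PipelineParams("s3://dev-data/events", "s3://dev-data/features")
prod = PipelineParams("s3://prod-data/events", "s3://prod-data/features")

# Same graph, different bindings: only the URIs change between environments.
dev_steps = build_pipeline(dev)
prod_steps = build_pipeline(prod)
```

The point is that nothing in build_pipeline can even express "prod"; the environment lives entirely in the bindings.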

2. Every step is idempotent

Re-running a step with the same inputs must produce byte-identical outputs (or declare the non-determinism up front, e.g. with a seed). This is the foundation of both caching and reproducibility. If step 7 fails, re-running from step 5 must give the same state.
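Here is a sketch of what "declare the non-determinism up front" means: the step takes an explicit seed, so a re-run with the same inputs produces identical output. The step itself is hypothetical:

```python
import hashlib
import random

def sample_step(rows: list[str], seed: int) -> list[str]:
    rng = random.Random(seed)  # seeded local RNG, never the global one
    return rng.sample(rows, k=len(rows) // 2)

rows = [f"row-{i}" for i in range(100)]
out1 = sample_step(rows, seed=42)
out2 = sample_step(rows, seed=42)
assert out1 == out2  # same inputs + same seed => identical output

# Hashing the serialized output makes "byte-identical" checkable in CI.
digest = hashlib.sha256("\n".join(out1).encode()).hexdigest()
```

If a step can't be made deterministic (GPU nondeterminism, external API calls), the seed or the snapshot of the external state becomes one more versioned input.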

3. Inputs and outputs are typed and versioned

steps:
  - name: featurize
    inputs:
      raw_events: s3://.../events@v12
    outputs:
      features:
        schema: features-v3
        partition: date

No "whatever columns happen to be there". A schema registry plus contract tests catch 80% of breakage before it hits training.
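A contract test against the declared schema can be as boring as a dict comparison. The schema name features-v3 comes from the config above; the in-memory registry and column set here are stand-ins for a real schema registry:

```python
# Stand-in for a real schema registry; keys are schema names from step configs.
SCHEMA_REGISTRY = {
    "features-v3": {"user_id": "int64", "session_len": "float64", "date": "str"},
}

def check_contract(schema_name: str, actual_columns: dict[str, str]) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    expected = SCHEMA_REGISTRY[schema_name]
    violations = []
    for col, dtype in expected.items():
        if col not in actual_columns:
            violations.append(f"missing column: {col}")
        elif actual_columns[col] != dtype:
            violations.append(f"dtype mismatch on {col}: {actual_columns[col]} != {dtype}")
    return violations

# An upstream job silently dropped a column: the contract test fails
# at featurize time, not three steps later inside training.
bad = {"user_id": "int64", "date": "str"}
violations = check_contract("features-v3", bad)
```

Run this as the first thing each step does on its inputs and the last thing it does on its outputs.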

4. Caching is cheap, recomputation is expensive

Hash (step_code_version, input_artifact_hashes, params_hash) and skip the step if output already exists. On our main training pipeline, cache hit rate averages 64%, saving ~9 hours per full rerun.
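The cache key described above can be sketched in a few lines. Stable serialization is the whole trick; the example values are illustrative:

```python
import hashlib
import json

def cache_key(step_code_version: str,
              input_artifact_hashes: list[str],
              params: dict) -> str:
    payload = json.dumps(
        {
            "code": step_code_version,
            "inputs": sorted(input_artifact_hashes),  # order-independent
            "params": params,
        },
        sort_keys=True,  # stable serialization => stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("featurize@9f3c2e1", ["sha256:aaa", "sha256:bbb"], {"window": 7})
k2 = cache_key("featurize@9f3c2e1", ["sha256:bbb", "sha256:aaa"], {"window": 7})
assert k1 == k2  # same logical inputs => cache hit, step skipped
```

If the output artifact for this key already exists in the store, the step is a no-op. Note that the code version is part of the key: changing the step's code invalidates its cache, which is exactly what you want.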

5. Metadata is as important as data

For every run we persist: git SHA, container digest, input artifact IDs, output artifact IDs, metrics, runtime, user. Three years later, someone will ask "why did model v4.2 predict X?" — and you need the receipts.
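Concretely, the run record is just a flat document per execution. Field names follow the list above; all values here are illustrative:

```python
import json

run_record = {
    "run_id": "r-20260328-1412",
    "git_sha": "9f3c2e1",
    "container_digest": "sha256:ab12cd34",
    "input_artifact_ids": ["events@v12"],
    "output_artifact_ids": ["features@v13"],
    "metrics": {"auc": 0.91},
    "runtime_s": 312,
    "user": "sofia",
}

# An append-only JSONL file (or table) is a perfectly boring,
# queryable store for these records.
line = json.dumps(run_record, sort_keys=True)
```

Three years later, answering "why did model v4.2 predict X?" is a filter over these records, not an archaeology dig.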

6. Failure is a first-class output

Don't raise and crash. Emit a structured failure record:

{
    "run_id": "r-20260328-1412",
    "step": "train_model",
    "status": "failed",
    "error_type": "OOM",
    "attempt": 2,
    "retryable": true,
    "diagnostics": {...}
}

Your alerting, retry, and triage logic all read this same record. No more regex-parsing stack traces.
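A sketch of retry logic reading that same record, rather than grepping logs. The retry budget of 3 is an illustrative policy, not a recommendation:

```python
MAX_ATTEMPTS = 3  # illustrative retry budget

def should_retry(record: dict) -> bool:
    return (
        record["status"] == "failed"
        and record["retryable"]
        and record["attempt"] < MAX_ATTEMPTS
    )

record = {
    "run_id": "r-20260328-1412",
    "step": "train_model",
    "status": "failed",
    "error_type": "OOM",
    "attempt": 2,
    "retryable": True,
}
assert should_retry(record)       # attempt 2 of 3: retry
record["attempt"] = 3
assert not should_retry(record)   # budget exhausted: escalate to a human
```

Alerting and triage read the same fields: error_type feeds dashboards, retryable gates the scheduler, diagnostics goes to the on-call engineer.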

7. Local execution must work

A pipeline that only runs in Kubernetes is a pipeline where iteration is measured in hours. Every step must be runnable on a laptop with a subset of the data — ideally via mlpipeline run --step featurize --sample 1%. This is a 10x productivity multiplier for debugging.
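A rough sketch of the entry point such a CLI implies. mlpipeline is hypothetical (it is the command named above, not a real tool), and the seeded sampling ties back to principle 2:

```python
import argparse
import random

def parse_args(argv: list[str]) -> argparse.Namespace:
    p = argparse.ArgumentParser(prog="mlpipeline")
    sub = p.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("--step", required=True)
    run.add_argument("--sample", default="100%")
    return p.parse_args(argv)

def sample(rows: list, pct: float, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded: local runs stay reproducible too
    return rng.sample(rows, k=max(1, int(len(rows) * pct / 100)))

args = parse_args(["run", "--step", "featurize", "--sample", "1%"])
subset = sample(list(range(10_000)), float(args.sample.rstrip("%")))
```

The key design choice: sampling is a front-door parameter of the pipeline, not a hacked-up local fork of the code, so the laptop run exercises exactly the code that ships.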

The meta-principle

When in doubt, make it boring. Fancy DAG abstractions and clever magic decorators age badly. A pipeline that's easy to read at 3am when it's broken is worth more than one that's elegant on slides.
