MLOps stacks in 2025: a comparison of five approaches we evaluated

2025-11-18 · Sofia Lindqvist

When we were rebuilding our ML platform this summer, we spent six weeks seriously evaluating five different architectural approaches. Here's what each looked like, what it cost, and where each one breaks down.

Option 1: Monolithic platform (SageMaker / Vertex / Azure ML)

One vendor, one console, one API. Experiment tracking, training, serving, monitoring — all in the box.

Pros: fastest time-to-first-model. Integrations "just work" within the ecosystem. Cons: vendor lock-in. Costs climb fast at scale. You hit the customization ceiling quickly (e.g. when you need custom training operators).

Good fit for: early-stage teams, teams without dedicated platform engineers.

Option 2: Kubeflow + MLflow + custom glue

Self-hosted. Kubeflow for pipelines, MLflow for experiments, custom serving.

Pros: no vendor lock-in, full control. Cons: Kubeflow operational complexity is real — budget one full-time platform engineer. Upgrades are painful.

Good fit for: teams with 3+ infra engineers who can own the platform.
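The "custom glue" in this stack tends to be small internal services sitting between the pipeline layer and the tracker. Here's an illustrative sketch (all names hypothetical — this is not a Kubeflow or MLflow API): a minimal registry shim that records which run produced which model artifact, backed by a local JSON file.

```python
import json
from pathlib import Path

class ModelRegistryShim:
    """Hypothetical glue service: maps model names to the run that
    produced them. In a real stack this would sit between the pipeline
    (Kubeflow) and the experiment tracker (MLflow)."""

    def __init__(self, store: Path):
        self.store = store

    def register(self, model_name: str, run_id: str, artifact_uri: str) -> None:
        # Record that `run_id` produced `artifact_uri` for `model_name`.
        entries = self._load()
        entries[model_name] = {"run_id": run_id, "artifact_uri": artifact_uri}
        self.store.write_text(json.dumps(entries, indent=2))

    def latest(self, model_name: str) -> dict:
        # Look up the most recently registered version of a model.
        return self._load()[model_name]

    def _load(self) -> dict:
        if self.store.exists():
            return json.loads(self.store.read_text())
        return {}
```

Each of these shims is trivial on its own; the ops burden comes from owning a dozen of them plus the Kubeflow control plane underneath.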

Option 3: Managed best-of-breed (Databricks + Weights & Biases + Seldon + ...)

One tool per job, each from a specialist vendor.

Pros: each piece is best-in-class. Flexible. Cons: the integration burden falls on you. Identity, billing, and data movement across 4+ vendors are non-trivial. A SOC2 audit is 5x harder.

Good fit for: mature ML orgs with product requirements no single vendor covers.

Option 4: Open-core managed (Metaflow, Prefect, ZenML)

One orchestration tool (managed or self-hosted), pluggable backends for compute and storage.

Pros: good balance of control and ops simplicity. Clean abstractions. Cons: smaller ecosystem. Some features (e.g. advanced monitoring) need external tools.

Good fit for: teams with pipeline-heavy workloads.
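The "pluggable backend" idea can be sketched in plain Python. This is illustrative only — not the Metaflow, Prefect, or ZenML API — but it shows the pattern: pipeline code declares steps against an abstract interface, and the execution backend (local process, Kubernetes job, cloud batch) is chosen at run time without touching the steps.

```python
from abc import ABC, abstractmethod
from typing import Callable

class Backend(ABC):
    """Abstract execution backend; pipeline code never depends on a
    concrete one."""
    @abstractmethod
    def run(self, name: str, fn: Callable[[], object]) -> object: ...

class LocalBackend(Backend):
    def run(self, name, fn):
        # A real backend would submit fn to Kubernetes, AWS Batch, etc.
        print(f"[local] running step {name}")
        return fn()

class Pipeline:
    def __init__(self, backend: Backend):
        self.backend = backend
        self.steps: list[tuple[str, Callable]] = []

    def step(self, name: str):
        # Decorator that registers a function as a pipeline step.
        def register(fn):
            self.steps.append((name, fn))
            return fn
        return register

    def run(self):
        results = {}
        for name, fn in self.steps:
            results[name] = self.backend.run(name, fn)
        return results

# Pipeline code stays the same whichever backend is plugged in.
pipe = Pipeline(LocalBackend())

@pipe.step("train")
def train():
    return {"accuracy": 0.91}
```

Swapping `LocalBackend()` for a cluster-backed implementation is a one-line change — which is exactly the control/simplicity balance this option trades on.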

Option 5: Build your own on Kubernetes primitives

Argo Workflows, plain Kubernetes, Prometheus, a few internal services.

Pros: maximum flexibility, no unnecessary abstractions. Cons: you're building an MLOps platform. That's a multi-quarter project.

Good fit for: FAANG-scale teams with existing platform investment.

Comparison table

Option               | Setup time | Ongoing ops | Vendor lock | Ceiling
---------------------|------------|-------------|-------------|------------
Monolithic platform  | 1 week     | low         | high        | medium
Kubeflow stack       | 2 months   | high        | low         | high
Best-of-breed        | 1 month    | medium      | medium      | high
Open-core (Metaflow) | 2 weeks    | low         | low         | medium-high
Build your own       | 6+ months  | high        | none        | very high

What we chose

Option 4 (managed open-core) as the orchestration backbone, with a managed experiment tracker, a managed feature store, and self-hosted inference on Kubernetes. This matched our team size (8 ML engineers, 2 platform) and gave us customization on serving — our main differentiator — while outsourcing the less interesting pieces.

Recommendations

  1. Be honest about team size. Option 2 or 5 with fewer than 2 platform engineers is a recipe for burnout.
  2. Start simpler, migrate later. Over-engineering an MLOps platform for 5 models is a classic trap.
  3. Lock-in is overrated as a risk at small scale. It becomes real when you're spending >€50k/month with one vendor. Below that, velocity matters more.
  4. Instrumentation matters more than tools. A well-instrumented pipeline on option 1 is more useful than a fragmented stack on option 3.
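Point 4 can be made concrete in a few lines of stdlib Python (hypothetical names, no particular tracking tool assumed): a decorator that records duration and output size for every pipeline step. A wrapper like this travels with you across any of the five stacks.

```python
import time
from functools import wraps

# Minimal instrumentation sink: in a real system this would be a
# metrics backend, not an in-process list.
METRICS: list[dict] = []

def instrumented(step_name: str):
    """Wrap a pipeline step so it reports duration and output row count,
    regardless of which orchestrator runs it."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            METRICS.append({
                "step": step_name,
                "seconds": time.perf_counter() - start,
                "rows_out": len(result) if hasattr(result, "__len__") else None,
            })
            return result
        return wrapper
    return decorate

@instrumented("featurize")
def featurize(rows):
    return [r * 2 for r in rows]
```

The tooling underneath can change; the habit of measuring every step is what actually pays off.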

Conclusion

There is no "best" MLOps stack — only "best for your team size, stage, and problem shape". Match the stack to the organization, not the other way around.

