Feature store: build vs buy, six months in

2025-10-25 · Mikael Laakso

The feature store debate is one of those infrastructure questions where "it depends" is the honest answer — but teams tend to pick a side before examining what depends on what. We went through both options and have opinions now.

Why you (might) need a feature store

Three problems a feature store solves:

  1. Training/serving skew. Features computed differently in training and inference cause silent quality degradation. A feature store enforces one definition.
  2. Feature reuse. Same feature (e.g. "user 7-day session count") used by three teams. Compute once, use everywhere.
  3. Point-in-time correctness. When building training data, each row needs feature values "as of" that row's timestamp. Non-trivial to implement correctly.

If you're solving one of these, you probably want a feature store. If none — you probably don't.
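Point-in-time correctness is the one on this list that teams most often get subtly wrong. A minimal sketch of a correct as-of join using pandas' `merge_asof` (the table and column names here are illustrative, not from our actual pipelines):

```python
import pandas as pd

# Label events (training rows) and a feature's value history.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-01-10"]),
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-01-01"]),
    "session_count_7d": [3, 8, 5],
})

# Point-in-time join: for each label row, take the latest feature value
# whose timestamp is <= the row's timestamp -- never a future value.
# merge_asof requires both frames sorted on the join key.
labels = labels.sort_values("event_ts")
features = features.sort_values("feature_ts")
training = pd.merge_asof(
    labels, features,
    left_on="event_ts", right_on="feature_ts",
    by="user_id", direction="backward",
)
```

Here user 1's row at 2025-01-20 correctly picks up the value 8 (effective 2025-01-15), while the earlier row at 2025-01-05 gets 3. A naive join on `user_id` alone would leak the later value into the earlier training row.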

The build phase (months 1–6)

We built on top of Spark + Redis + a custom metadata service. Roughly 2 engineer-quarters of effort to reach parity with baseline requirements.

What went well:

  • Tight integration with our existing pipelines
  • Custom offline/online consistency checks baked in
  • No extra vendor bill

What went poorly:

  • Point-in-time joins. Turns out getting this right across 30+ feature views with different freshness SLAs is hard. We shipped 3 correctness bugs to production.
  • Schema evolution. Adding a new feature to an existing view was straightforward; changing a type was a minor migration nightmare.
  • Online store hot path. Redis worked, but latency tails under load (p99 > 200ms) needed a full round of optimization.
  • Ongoing ops load. Every week someone had to firefight something.
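The consistency checks we baked in were conceptually simple: recompute a sample of features offline and diff them against what the online store serves. A minimal sketch (the key format and tolerance handling are illustrative, not our production code):

```python
def check_consistency(offline: dict, online: dict, tol: float = 0.0) -> list:
    """Return (key, offline_value, online_value) tuples for every feature
    whose online value is missing or differs from the offline recompute
    by more than tol."""
    mismatches = []
    for key, off_val in offline.items():
        on_val = online.get(key)
        if on_val is None or abs(off_val - on_val) > tol:
            mismatches.append((key, off_val, on_val))
    return mismatches

# Example: user 42's 7-day session count drifted between the two paths.
offline_sample = {"user:42:session_count_7d": 8.0}
online_sample = {"user:42:session_count_7d": 5.0}
bad = check_consistency(offline_sample, online_sample)
```

The check itself is cheap; the expensive part is keeping the offline recompute honest as feature definitions evolve, which is exactly where our three correctness bugs came from.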

The buy phase (months 7–12)

We migrated to a managed feature platform. ~6 weeks of migration + training.

What improved immediately:

  • Point-in-time correctness tested by someone else
  • Online store p99 <40ms out of the box
  • Schema tools that understand feature engineering, not just databases
  • UI that ML engineers could use without platform help

What we gave up:

  • Some integration flexibility. Custom feature types needed workarounds.
  • Direct visibility into internal behavior. Logs are fine; sometimes you want to look inside.

Cost comparison

| Axis | Self-built (6 mo) | Managed (6 mo) |
| --- | --- | --- |
| Engineering cost | ~€180k (2 FTE-quarters initial + 0.5 FTE ops) | ~€30k (migration) + 0.1 FTE ops |
| Infra cost | ~€12k (compute, Redis, storage) | ~€36k (vendor bill) |
| Correctness incidents | 3 | 0 |
| Feature ship velocity | 2–3 days | Same-day |
| Total 6-month cost | ~€192k + 3 incidents | ~€66k |

The managed option saved money even before counting incident cost.

When build makes more sense

  • You have unusual feature types (embeddings at huge scale, geospatial, graph-derived) that managed tools don't handle well.
  • You're big enough that the managed bill would exceed €500k/year.
  • You have an existing platform team with available capacity and strong domain knowledge.

When buy makes more sense

  • You're a team of 4–20 ML engineers shipping multiple models.
  • Point-in-time correctness matters (most supervised learning setups).
  • You'd rather ship models than operate a feature store.

The middle path

Some teams build a thin abstraction over a managed backend — custom feature definitions, vendor-native offline/online. This gets you migration flexibility (change vendors later) without the cost of building the actual hard parts.
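A sketch of what that thin abstraction can look like: feature definitions live in your own code against a small backend interface, and the vendor sits behind an adapter. Everything below (class names, the in-memory stand-in) is hypothetical, assuming a key-value online store:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FeatureView:
    """Vendor-neutral feature definition owned by our codebase."""
    name: str
    entities: List[str]      # e.g. ["user_id"]
    features: List[str]      # e.g. ["session_count_7d"]
    ttl_seconds: int = 86400

class FeatureStoreBackend:
    """Interface every vendor adapter must implement."""
    def get_online(self, view: FeatureView, entity_key: dict) -> dict:
        raise NotImplementedError

class InMemoryBackend(FeatureStoreBackend):
    """Stand-in for a managed vendor; a real adapter would call its SDK."""
    def __init__(self):
        self._data: Dict[tuple, dict] = {}

    def write(self, view: FeatureView, entity_key: dict, values: dict):
        self._data[(view.name, tuple(sorted(entity_key.items())))] = values

    def get_online(self, view: FeatureView, entity_key: dict) -> dict:
        return self._data[(view.name, tuple(sorted(entity_key.items())))]

registry: Dict[str, FeatureView] = {}

def register(view: FeatureView):
    registry[view.name] = view

# Usage: definitions stay vendor-neutral; swapping vendors means writing
# one new adapter, not touching every feature definition.
sessions = FeatureView("user_sessions", ["user_id"], ["session_count_7d"])
register(sessions)
backend = InMemoryBackend()
backend.write(sessions, {"user_id": 1}, {"session_count_7d": 8})
row = backend.get_online(sessions, {"user_id": 1})
```

The hard parts (point-in-time joins, materialization, the online hot path) stay with the vendor; only the definition layer is yours.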

Conclusion

If we had to do it again, we would have bought on day one. The learning was valuable, but the three correctness incidents we shipped to production were not a good price for that learning. For most teams, feature stores are now firmly in the "buy" column.
