Materialize Only When Forced

#resources #resources/programming #resources/programming/architecture

Materialize only when forced

Defer materialized caches and denormalized tables until real query latency forces them. Document the path; don't build the path. The cost of premature materialization is real: schema migrations, freshness bugs, sync gaps between source and cache, code paths that the original query no longer matches.

The rule

Compute on demand first. If the query is bounded — touches a small set of rows, uses indexes that already exist, returns quickly on real data — leave it on demand. Add a materialized cache when a benchmark on a representative dataset shows the on-demand query is slow enough to bother a user.

Don't add caches for hypothetical scale. Don't add caches because "we might need them later." A documented-but-deferred cache is cheaper than a built-but-stale one.

What "documented" means

When you decide not to materialize, write down what the cache would look like if you did. Table name, columns, refresh strategy, invalidation trigger. The note serves two purposes:

The next contributor doesn't reinvent the analysis
When latency finally forces the decision, the path forward is short — you're implementing the documented design, not designing it under pressure

Mxr's AI email layer does this explicitly. recipient_reply_latency_buckets, expertise_index, collaborator_patterns, entities, entity_mentions — all documented as future materialization options. None implemented. The trigger for implementing each is a real, measured query slow on a real mailbox.

Why this is restraint, not laziness

Materialized caches are infrastructure. Once they exist:

Every write path that touches the source must also touch the cache
Stale caches produce wrong answers without throwing errors
Migrations get more complex because there are two schemas to evolve in lockstep
The query layer has to know about both the source and the cache

That cost is acceptable when query latency makes it acceptable. Before that, it's just complexity hunting a justification.

The complement: when you should materialize

The signal is concrete:

A real user-visible operation takes too long on real data
The slow path is the same query every time (caching helps), not a different query each time
The source data changes predictably enough that invalidation has a clean trigger

When that's true, materialize. Pick the structure you documented. Move quickly.

Where this generalizes

This is "premature optimization is the root of all evil" applied specifically to data layout. It applies anywhere:

Read replicas (don't add one until the primary is contended)
Search indexes (don't add one until grep is too slow)
Denormalized views (don't add one until the join is hot)
Computed columns (don't add one until the function is in every hot query)

The reverse is harder. Removing materialized infrastructure once it exists is painful because something somewhere depends on it. Better to delay the addition than reverse it later.