Architecture as risk management

A junior engineer opens a pull request adding a new NotificationService. The diff looks fine. The class is clean, the tests pass, and the dependency injection wires up.

I leave one comment: what risk is this absorbing?

He asks me what I mean.

The frame

Every architecture decision spends part of a finite budget. Add a service, you spend complexity to absorb the risk that two callers will need to diverge later. Add a queue, you spend operational surface to absorb the risk that the producer will outpace the consumer. Add a microservice, you spend a network hop to absorb the risk that two teams will need to deploy independently.

If the risk is real, the spend is good architecture. If the risk is imagined, the spend is over-engineering. Same diagram, different verdict.

The mistake juniors make is to treat architecture as a checklist of patterns. The mistake seniors make is to treat it as a portfolio of preferences. Both miss the question: which mistakes will my team be free to make later because of this choice, and is that the trade I want?

Three real choices, three answers

A shared UI library on a multi-product platform. I built a shared component library and pushed every product surface to consume it. The cost: every change to a button ripples across three teams. The risk it absorbs: design-system drift. With three product surfaces and four contributors, you’d have three button styles in six months without it. With it, you have one button and an argument when a designer wants to change it. The argument is the feature.

A three-layer data platform. A side project I’ve been building has Collection (scrapers, API clients, file watchers), Processing (normalisers, enrichers, validators), and Platform (REST API, admin UI). They run in separate processes and communicate over a message bus. The cost: more deployment surface, three log streams, two extra failure modes. The risk it absorbs: a bad scraper cannot take down the API. When a scraper crashes - and they do, often - the processing queue drains naturally and the API stays up. Without that split, every Cloudflare-403 from one of forty source sites would page me.
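A minimal sketch of that isolation, under loud assumptions: everything runs in one process here, and Python's standard queue.Queue stands in for the real message bus. The function names, the store dictionary and the blocked.example source are hypothetical, not the project's code. The point it illustrates is only that a scraper failure stays inside the Collection layer while the Platform layer keeps answering from what was already processed.

```python
import queue


def flaky_scraper(source):
    """Hypothetical scraper: one source always fails, mimicking a 403."""
    if source == "blocked.example":
        raise RuntimeError("403 from upstream")
    return {"source": source, "payload": f"data from {source}"}


def run_collection(sources, bus):
    """Collection layer: publish what succeeds, record what fails."""
    failed = []
    for source in sources:
        try:
            bus.put(flaky_scraper(source))
        except RuntimeError:
            # The failure is contained here; nothing downstream sees it.
            failed.append(source)
    return failed


def run_processing(bus, store):
    """Processing layer: drain whatever is currently on the bus."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            return
        store[message["source"]] = message["payload"].upper()


def api_get(store, source):
    """Platform layer: reads only from the store, never from a scraper."""
    return store.get(source)


bus = queue.Queue()   # stand-in for the message bus (assumption)
store = {}            # stand-in for the platform's database (assumption)

failed = run_collection(["a.example", "blocked.example", "b.example"], bus)
run_processing(bus, store)
```

After this runs, failed holds only the blocked source, and api_get still serves the other two. In the real split the three functions would live in separate processes, which is exactly where the extra deployment surface and log streams come from.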

A national-scale platform we kept boring. We did NOT add microservices. One Angular app, one .NET API, one SQL Server, one team of seven. People assumed a national-scale system needed a service mesh. It needed predictable on-call. The risk microservices would have absorbed is “two teams need to deploy independently.” There was one team. The trade was clear: no microservices, no Kubernetes, no event sourcing. Boring tech that the on-call rotation could understand at 3am. The platform has been in production at the ministry level since 2021.

That third project is the example I show to engineers who want to add complexity. Three of the four things we skipped - service mesh, microservices, event sourcing - would have been “good engineering” by every checklist. They would have absorbed risks the team did not have.

Over-engineering as fake risk reduction

The trap is real because it feels exactly like good engineering. Adding a queue feels safer than not. Adding a service feels more decoupled. Adding a layer feels more “clean.” Each addition can be defended with a plausible-sounding risk story.

The check: name the risk concretely. Not “we might need to scale.” Not “what if requirements change.” Concretely. “By July we’ll have three teams pushing notifications and they’ll deadlock on the schema if we don’t separate them now.” If you can’t name it that concretely, you’re spending the budget on anxiety.
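One way to make that check mechanical, as a sketch only: treat a risk as concrete when it names who is affected, when it bites, and what breaks. That three-field split is my assumption for illustration, not a standard, and the RiskStatement class is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RiskStatement:
    """Hypothetical record of the risk a design choice claims to absorb."""
    who: Optional[str]      # which team or caller is affected
    when: Optional[str]     # a date or trigger, e.g. "by July"
    failure: Optional[str]  # what concretely breaks

    def missing(self):
        """Return the names of the fields left empty."""
        return [name for name in ("who", "when", "failure")
                if not getattr(self, name)]

    def is_concrete(self):
        """A risk counts as concrete only when no field is missing."""
        return not self.missing()


vague = RiskStatement(who=None, when=None, failure="we might need to scale")
named = RiskStatement(
    who="three teams pushing notifications",
    when="by July",
    failure="deadlock on the shared schema",
)
```

Here vague.missing() reports the who and the when, which is the conversation worth having in review; named passes because every field is filled in.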

How to talk about it on a team

I run system design reviews instead of writing process docs. The first question I ask, every time, is the same one: what risk does this absorb?

If the answer is “future flexibility,” we keep talking until someone can name the risk behind it. If the answer is “we’re hiring two more devs in March and the current schema won’t survive their first feature,” we approve the spend.

Take

Pick the smallest structure that absorbs the change you actually expect. Postpone what you don’t need yet. Name what you do.