01 Writing · 38 essays

Notes on building
things that don't fall over.

Long-form essays on systems engineering, on-call practice, and the unglamorous decisions that make or break a platform. Roughly one a month, no newsletters disguised as articles.

38 essays · since 2017
2026 · 04 · 22 The cost of a graceful degradation When the fallback quietly becomes the only path anyone trusts. Systems 9 min · 2,840 words

Every team I've worked with eventually ships a "graceful" fallback path. The intent is honest: when the primary path is unhealthy, route around it, return a cached value, drop a feature, anything to keep the user from seeing a blank page. The implementation is usually small — a few hundred lines, an extra timeout, a circuit breaker.

And then, twelve months later, the fallback is the only path anyone trusts.

The pattern

The slide is always the same. A sharp, easy-to-spot incident in month one — the fallback fires, the team praises itself, the postmortem ships. Then the fallback fires again, this time for a reason no one can immediately explain. It works, so the on-call moves on. Six weeks later it's firing daily. Eight weeks later it's firing during deploys. By month nine the team has built tooling to monitor the fallback, not the primary.

You've now built a second system, with second-class observability, that runs in production more often than the one you intended to ship. Your happy path has withered.

Why it happens

Three forces, all rational:

  • The fallback is cheap to extend. It already returns "something". Adding a small caveat is one PR.
  • The fallback owns the incident. When it fires, the on-call sees it. The primary path silently doing fine doesn't get a Slack ping.
  • The fallback is forgiving. Latency budgets, freshness, consistency — all a notch lower. Engineers naturally prefer the codepath that doesn't punish them.

What I do now

Two architectural choices, made early, that have stopped this for me:

  1. Treat the fallback as a separate product. Different SLO, different dashboards, different on-call rotation if you can afford it. If it's worth shipping, it's worth owning.
  2. Alert when the fallback ratio creeps up. Not when it fires — that's noisy. When the rolling 7-day ratio of fallback-served requests exceeds, say, 1%. That's the early warning that the primary is rotting.
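
The ratio alert in point 2 can be sketched in a few lines. This is a minimal illustration, not a production monitor: the class name `FallbackRatioMonitor`, its methods, and the idea of feeding it one pair of counts per day are all assumptions made for the example. The 7-day window and 1% threshold are the figures from the text above.

```python
from collections import deque


class FallbackRatioMonitor:
    """Hypothetical helper: tracks the share of fallback-served requests
    over a rolling window of daily counts."""

    def __init__(self, window_days: int = 7, threshold: float = 0.01):
        self.threshold = threshold
        # A deque with maxlen discards the oldest day once the window is full.
        self.days = deque(maxlen=window_days)

    def record_day(self, total: int, fallback: int) -> None:
        """Store one day's request counts (all requests, fallback-served)."""
        if total < 0 or fallback < 0 or fallback > total:
            raise ValueError("counts must satisfy 0 <= fallback <= total")
        self.days.append((total, fallback))

    def ratio(self) -> float:
        """Fallback-served requests divided by all requests in the window."""
        total = sum(t for t, _ in self.days)
        fallback = sum(f for _, f in self.days)
        return fallback / total if total else 0.0

    def should_alert(self) -> bool:
        """True only when the rolling ratio exceeds the threshold,
        regardless of how many individual fallback events occurred."""
        return self.ratio() > self.threshold
```

Six days at 0.5% fallback traffic keep `should_alert()` false. A seventh day at 5% pushes the window ratio just over 1% and flips it to true. Weighting by request count rather than averaging daily percentages is a design choice here: it stops a quiet Sunday from skewing the figure.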
"A graceful degradation is a contract with your future self. Read the contract."

None of this is novel. It's just hard to do when the fallback is winning every postmortem.

Originally drafted on a flight to Berlin
Comments welcome at contact
Cite as: Nellis, J. (2026). The cost of a graceful degradation
2026 · 03 · 11 Why I stopped paginating streams Cursor pagination is a beautiful hack — until your stream becomes the source of truth. Engineering 6 min · 1,720 words

2026 · 02 · 04 Small teams and the 100-rule Why a 5-person team beats a 12-person team at almost everything that matters in year one. Leadership 5 min · 1,420 words

2026 · 01 · 19 Tracing as a design tool Distributed tracing isn't only for incidents — it's the cheapest way to interview your own architecture. Observability 11 min · 3,420 words

2025 · 12 · 02 On being legible to your CFO A short field guide to translating engineering trade-offs into financial ones, without losing either. Leadership 7 min · 2,040 words

2025 · 10 · 28 The cheapest queue is no queue Six anti-patterns I keep seeing in event-driven architectures. Most start as cleverness. Systems 8 min · 2,360 words

2025 · 09 · 15 Reading notes — Designing Data-Intensive Applications, again Fourth re-read in eight years. The chapters that aged the best, and the one I now skip. Reading notes 12 min · 3,800 words

2025 · 08 · 04 Why I keep a paper runbook A heretical practice that's saved the day twice in three years. Yes, paper. Yes, in 2025. Engineering 4 min · 1,180 words

2025 · 07 · 11 What "production" means Working through a definition that's actually load-bearing, not just a slack-channel name. Systems 6 min · 1,860 words