Three principles for resilient infrastructure
Infrastructure fails. This is not a design flaw to be eliminated; it is a property of complex systems to be managed. The question is not whether your infrastructure will fail, but what it will do when it does.
Most infrastructure is designed for the happy path. The unhappy paths — partial failures, cascading degradations, split-brain scenarios, timeout accumulations — are addressed reactively, after they produce incidents. This is understandable. Failure modes are, by definition, harder to anticipate than normal operation. But it is expensive. Each incident is an unplanned redesign.
The three principles below do not prevent failure. They change what failure looks like.
Principle 1: Design for graceful degradation, not binary availability
The traditional framing — a system is either "up" or "down" — obscures the design space. Most systems can operate in degraded modes that are better than complete unavailability, if those modes are designed explicitly.
Consider a content delivery system. Under normal operation, it serves personalised, dynamically assembled content. Under degraded operation — perhaps because the personalisation service is unavailable — it could serve cached, non-personalised content. Under further degradation, it could serve static pages. Each step down reduces capability but preserves some value.
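This degradation ladder can be made concrete. A minimal sketch in Python, where the personaliser, the cache key, and the static page are hypothetical stand-ins for the real dependencies:

```python
class ServiceUnavailable(Exception):
    """Raised when a dependency cannot serve the request."""

def get_content(user_id, personalise, cache, static_pages):
    """Return the richest content the current failure state allows."""
    try:
        return personalise(user_id)        # normal: personalised content
    except ServiceUnavailable:
        pass                               # degrade rather than fail outright
    cached = cache.get("generic_page")     # degraded: cached generic content
    if cached is not None:
        return cached
    return static_pages["home"]            # last resort: a static page
```

Each fallback branch here is an explicit design decision: the function names its degraded modes instead of letting an unhandled exception decide for it.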
The practice of designing these degraded modes explicitly has several benefits:
- It forces clarity about which dependencies are truly critical versus merely convenient
- It creates natural failure isolation boundaries that limit blast radius
- It produces systems that operators can reason about under pressure
The design work happens before deployment. The question to ask at every dependency boundary: what does this component do if its dependency is unavailable? If the answer is "fail," that is a valid choice — but it should be an explicit choice, not a default.
Principle 2: Circuit breaking and timeout budgets
Uncontrolled latency is more dangerous than immediate failure. A dependency that fails fast gives you an error you can handle. A dependency that hangs — that accepts your request and never returns — consumes resources indefinitely and can propagate failure throughout the call tree.
The circuit breaker pattern addresses this. A circuit breaker tracks failure rates for a given dependency. When failures exceed a threshold, it opens the circuit: subsequent requests fail immediately rather than waiting for the dependency. This:
- Releases threads/connections that would otherwise be held waiting
- Gives the dependency time to recover without continued load
- Provides clear signal that a problem exists
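A minimal sketch of the pattern, assuming consecutive-failure counting and a fixed cooldown; the thresholds and names are illustrative, not any particular library's API:

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds to wait before a probe
        self.clock = clock
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen()        # fail fast; hold no resources
            self.opened_at = None          # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the circuit
            raise
        self.failures = 0                  # any success closes the circuit
        return result
```

Production implementations typically add per-endpoint state, error-rate windows rather than consecutive counts, and metrics on state transitions.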
The complementary mechanism is explicit timeout budgets. Every network call should have a deadline, and deadlines should propagate through call chains. If a request has a 2-second total budget, and 800ms have elapsed by the time it reaches a particular service, that service should not make its next call with a statically configured 1-second timeout. It should derive the timeout from the remaining 1200ms budget, or fail immediately if the remaining budget is insufficient.
# Illustrative timeout propagation (times in seconds)
import time

def time_remaining(deadline):
    return deadline - time.monotonic()

def call_service_a(deadline):
    if time_remaining(deadline) < 0.1:
        raise InsufficientBudget()         # too little budget left to try
    # Reserve a small margin for this service's own processing
    return service_a.call(timeout=time_remaining(deadline) - 0.05)

def call_service_b(deadline):
    if time_remaining(deadline) < 0.1:
        raise InsufficientBudget()
    return service_b.call(timeout=time_remaining(deadline) - 0.05)

request_deadline = time.monotonic() + 2.0  # 2-second total budget
This is mechanical to implement. It is rarely implemented. The result of not implementing it is incidents where a slow downstream dependency causes cascading timeouts throughout a healthy system.
Principle 3: Understand your failure modes before you operate
The most valuable resilience investment is not technical. It is epistemic. Systems that are well understood fail more predictably and are recovered faster than systems that are not.
Understanding failure modes requires:
Failure mode documentation — for each component, what are the ways it can fail? What does each failure mode look like from the outside? What is the expected blast radius? This documentation should live with the component and be updated when the component changes.
Runbook discipline — for each class of failure, what are the first three diagnostic steps? What are the recovery actions? Runbooks written after incidents are better than none; runbooks written before are better still. The writing process itself surfaces gaps in understanding.
Regular failure testing — controlled failure injection in production or production-equivalent environments. Not chaos for its own sake, but structured hypothesis testing: "we believe this system degrades gracefully when dependency X is unavailable; let us verify this belief." Beliefs that are not tested are assumptions, and assumptions accumulate interest.
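A structured hypothesis test of this kind can be small. In the sketch below, the service, the injectable fault, and the cached fallback are all hypothetical; the point is the shape of the test, not the specifics:

```python
class FakeDependencyX:
    """Test double for dependency X with injectable unavailability."""
    def __init__(self, available=True):
        self.available = available

    def fetch(self, key):
        if not self.available:
            raise TimeoutError("dependency X unavailable")
        return f"fresh:{key}"

def get_homepage(dep_x, cache):
    """System under test: expected to fall back to cache when X is down."""
    try:
        return dep_x.fetch("homepage")
    except TimeoutError:
        return cache.get("homepage", "static-fallback")

def test_degrades_gracefully_when_x_is_unavailable():
    # Hypothesis: with X unavailable, users see cached or static content,
    # never an unhandled error.
    down = FakeDependencyX(available=False)
    assert get_homepage(down, {"homepage": "cached"}) == "cached"
    assert get_homepage(down, {}) == "static-fallback"
```

The same test against a production-equivalent environment, with real fault injection in place of the test double, converts the belief into verified behaviour.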
The goal is not to prevent surprise — surprise is inevitable in complex systems. The goal is to reduce the space of surprises and to ensure that the team's response to any given surprise is competent rather than improvisational.
Resilient infrastructure is not expensive infrastructure. It is thoughtful infrastructure. The thoughtfulness happens before the incident, when it is cheap. The alternative is purchasing that thoughtfulness in the middle of the night, under pressure, when it is very expensive indeed.