Designing for Failure
Topic 74

Resilience

Links get cut by backhoes, line cards die, providers have regional outages, and BGP sessions flap. Resilient design starts from the premise that every component will eventually fail and asks one question of each: when this dies, what else dies with it? The goal is never to prevent failure — that is impossible — but to ensure that no single failure turns into an outage. That demands redundant paths, components that do not share fate, and a system that degrades gracefully instead of falling over.

Redundancy is only worth what its independence is worth. Two links that follow the same conduit, two routers fed by the same power strip, two regions behind one DNS provider — these look redundant on the diagram and fail together in the field. This topic is the synthesis of the whole course's reliability thread: anycast (Topic 23), multi-region failure domains, health-checked load balancing, and the single-DNS trap all reduce to the same discipline of eliminating shared fate.

Eliminating Single Points of Failure

A single point of failure is any component whose loss takes down the whole path. Find them by walking the request end to end and asking, at each hop, "what is there exactly one of?" One uplink, one router, one firewall, one provider, one DNS zone — each is a SPOF until it is doubled. The subtle ones live where the diagram hides them: two fibers in the same trench fail to the same backhoe, two instances in the same rack fail to the same top-of-rack switch.

True redundancy requires independence at every layer the components share — physical path, power, provider, and control plane. Two links from the same carrier riding the same physical route are one link for failure purposes, no matter how the contract reads. The honest test is correlation: if a single event can take out both copies, you have one component drawn twice, not two.
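The correlation test can be put in numbers. The sketch below is a minimal illustration, assuming both links have the same standalone availability and that their failures are either fully independent or fully correlated. Real pairs fall somewhere between the two, and the function names are hypothetical.

```python
def independent_pair(a: float) -> float:
    """Availability of two links with independent failures: down only if both are down."""
    return 1 - (1 - a) ** 2


def shared_fate_pair(a: float) -> float:
    """Availability of two links that always fail together: no better than one link."""
    return a


a = 0.99  # assumed standalone availability of each link
print(f"independent: {independent_pair(a):.4f}")
print(f"shared fate: {shared_fate_pair(a):.4f}")
```

Under these assumptions the independent pair gains two extra nines, while the shared-fate pair gains nothing for twice the cost.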

Multipath and ECMP

Redundant paths held purely as standby waste half your capacity and hide a failover bug until the day you need it. Equal-cost multipath (ECMP) uses them all at once: when several routes to a destination share the same cost, the router hashes each flow's 5-tuple to one of them and load-balances across the set. A failed next hop simply drops out of the hash set, and the surviving paths absorb the traffic — no failover event, no standby to wake up.

The catch is that hashing is per flow, not per packet: ECMP pins each connection to one path to avoid reordering, so a single elephant flow rides one link and cannot exceed that link's capacity. Anycast (Topic 23) is the same idea at internet scale: many sites advertise one address, BGP steers each client to the site with the best route from its network (usually, though not always, the closest one), and a withdrawn route shifts traffic to the next site automatically.

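# Two equal-cost next hops to the same prefix. Linux hashes each flow
# to one of them and skips a next hop it has marked dead.
# By default the hash covers source and destination address only;
# policy 1 hashes the full 5-tuple.
sysctl -w net.ipv4.fib_multipath_hash_policy=1
ip route add 10.20.0.0/16 \
    nexthop via 10.0.0.1 dev eth0 weight 1 \
    nexthop via 10.0.0.2 dev eth1 weight 1
ip route get 10.20.5.7   # shows which next hop a lookup for this destination selects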
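The per-flow selection described above can be sketched in a few lines. This is an illustration only, not how any particular router implements it: real devices use their own hardware hash functions, and `pick_next_hop` and the sample addresses are hypothetical.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow 5-tuple to one live next hop.

    The same flow always lands on the same hop while the set is unchanged.
    """
    if not next_hops:
        raise ValueError("no live next hops")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# (src ip, src port, dst ip, dst port, protocol)
flow = ("10.1.1.5", 40000, "10.20.5.7", 443, "tcp")
hops = ["10.0.0.1", "10.0.0.2"]

first = pick_next_hop(flow, hops)
assert first == pick_next_hop(flow, hops)  # per-flow stickiness

# The chosen hop dies: it drops out of the set and the flow moves to a survivor.
survivors = [h for h in hops if h != first]
print(first, "->", pick_next_hop(flow, survivors))
```

Plain modulo hashing like this can also move flows that were on a healthy path when the set changes size. Some implementations offer resilient or consistent hashing to limit that churn.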

Failure Domains and Blast Radius

A failure domain is the set of things that fail together. A zone is a failure domain bounded by shared power and cooling; a region bounds shared zones; a provider bounds everything you run on it. Designing for failure means choosing how wide your blast radius is allowed to be and then spreading replicas across domains so that losing any one still leaves a majority of your capacity. Three zones at roughly a third each keep two-thirds of capacity and a quorum when one is lost; two zones at 50% each keep only half and no majority.
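The three-zone versus two-zone comparison is simple arithmetic. The sketch below assumes capacity and replicas are counted in equal units per domain and always models the loss of the largest domain; both function names are hypothetical.

```python
def surviving_share(shares):
    """Fraction of total capacity left after the largest failure domain is lost."""
    total = sum(shares)
    return (total - max(shares)) / total


def keeps_quorum(replicas_per_domain):
    """True if a strict majority of replicas survives the loss of the largest domain."""
    total = sum(replicas_per_domain)
    return total - max(replicas_per_domain) > total / 2


print(surviving_share([1, 1, 1]), keeps_quorum([1, 1, 1]))  # three equal zones
print(surviving_share([1, 1]), keeps_quorum([1, 1]))        # two equal zones
```

An uneven layout such as `[3, 1, 1]` fails the quorum check even with three zones, which is why the shares need to be roughly equal.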

The trap is the hidden shared dependency that collapses two domains into one. Replicas in two regions that both resolve through a single DNS provider share that provider's failure domain — when it goes down, both "independent" regions go dark at once. The single-DNS SPOF and the single-upstream SPOF are the same mistake: a global dependency silently spanning the domains you thought you had separated.
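One way to surface a hidden shared dependency is to write down what each site relies on, follow those dependencies transitively, and intersect the results. The inventory below is entirely hypothetical, and `transitive_deps` is an illustrative helper rather than part of any tool.

```python
def transitive_deps(component, depends_on):
    """Everything a component relies on, directly or indirectly (iterative DFS)."""
    seen, stack = set(), [component]
    while stack:
        for dep in depends_on.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Hypothetical inventory: two regions that look independent on the diagram.
depends_on = {
    "region-a": ["dns-provider-1", "transit-x", "power-a"],
    "region-b": ["dns-provider-1", "transit-y", "power-b"],
    "transit-x": ["fiber-route-1"],
    "transit-y": ["fiber-route-2"],
}

shared = transitive_deps("region-a", depends_on) & transitive_deps("region-b", depends_on)
print(shared)  # {'dns-provider-1'}
```

Anything in the intersection is a single failure domain spanning both regions. The check is only as good as the inventory, so dependencies nobody recorded stay hidden.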

Graceful Degradation

Failover only counts if it has been exercised. A standby that has never taken live traffic is a hypothesis, not a backup — health checks must actively probe each member and pull failing ones out of rotation faster than clients notice, and connection draining must let in-flight requests finish before an instance is removed. Without draining, every deploy and every failover resets live connections, turning a routine event into a visible blip.
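The interplay of health checks and draining can be sketched as a small state machine. This is a simplified model under stated assumptions: the threshold of three consecutive failed probes is arbitrary, there is no drain timeout or recovery path, and all names are hypothetical.

```python
from dataclasses import dataclass

FAIL_THRESHOLD = 3  # assumed: consecutive failed probes before eviction starts


@dataclass
class Member:
    name: str
    state: str = "active"   # active -> draining -> removed
    failures: int = 0       # consecutive failed probes
    in_flight: int = 0      # requests currently being served


def _remove_if_drained(m: Member) -> None:
    if m.state == "draining" and m.in_flight == 0:
        m.state = "removed"


def record_probe(m: Member, ok: bool) -> None:
    """Track consecutive probe failures; begin draining at the threshold."""
    m.failures = 0 if ok else m.failures + 1
    if m.state == "active" and m.failures >= FAIL_THRESHOLD:
        m.state = "draining"  # no new requests from here on
    _remove_if_drained(m)


def accepts_new_requests(m: Member) -> bool:
    return m.state == "active"


def request_finished(m: Member) -> None:
    """An in-flight request completed; leave rotation once none remain."""
    m.in_flight -= 1
    _remove_if_drained(m)


m = Member("web-1", in_flight=2)
for _ in range(FAIL_THRESHOLD):
    record_probe(m, ok=False)
print(m.state, accepts_new_requests(m))  # draining False
request_finished(m)
request_finished(m)
print(m.state)                           # removed
```

A production load balancer would normally also cap how long draining may last, since a member that is truly dead will never finish its in-flight requests.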

Graceful degradation means the system sheds load instead of toppling. When a backend pool loses half its members, the survivors should serve a reduced but correct service — shedding non-essential work, serving stale cache, returning a fast error rather than a slow hang — not pile every retry onto the remaining capacity until it too falls over. The standby must also be sized for full load: failover to a member that buckles the instant it inherits production traffic is no failover at all.
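Load shedding can be as simple as an admission check in front of each request. The sketch below is one possible policy, not a standard algorithm: it assumes requests are tagged essential or non-essential, and the 20% reserve and the `admit` function are illustrative choices.

```python
def admit(in_flight: int, capacity: int, essential: bool, reserve: float = 0.2) -> bool:
    """Decide whether to accept a request.

    Essential work is admitted up to full capacity. Non-essential work is shed
    once load reaches the reserved headroom, so it cannot crowd out essentials.
    """
    if in_flight >= capacity:
        return False  # fail fast instead of queueing into a slow hang
    if essential:
        return True
    return in_flight < capacity * (1 - reserve)


print(admit(50, 100, essential=False))   # True: plenty of headroom
print(admit(85, 100, essential=False))   # False: shed non-essential work
print(admit(85, 100, essential=True))    # True: essentials still served
print(admit(100, 100, essential=True))   # False: full, return a fast error
```

A rejected request should get an immediate error the client can back off from. Retrying it instantly would recreate the pile-on the check is meant to prevent.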

Active-Active vs Active-Passive

Active-active: all paths carry load
Every member serves live traffic via ECMP or anycast. Full capacity in use, eventless failover when one drops, but state must stay consistent and members must run with N+1 headroom (below ~50% each in a two-member pool) so the survivors can absorb the rest.
Active-passive: warm standby ready
A standby takes over only when the primary fails. Simpler, no split-brain on stateful systems, but you pay for idle capacity, failover takes seconds to minutes, and the path is untested until the moment it matters.

Active-active sends live traffic to every path or replica at once via ECMP or anycast. You get full use of all capacity and instant, eventless failover when one member drops out — but every member must hold consistent state, and you must run continuously below the point where losing one overloads the rest (N+1 sizing).

Active-passive keeps a warm standby that takes over only when the primary fails. It is simpler to reason about and avoids split-brain on stateful systems — but you pay for idle capacity, failover takes seconds to minutes, and the path is untested until the moment it matters. Choose active-active when you can size for N+1 and tolerate distributed state; active-passive when correctness under a single writer outweighs the idle cost.
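N+1 sizing reduces to one ratio. The sketch below assumes identical members and traffic that redistributes evenly across survivors; the function names are hypothetical.

```python
def max_safe_utilization(n: int) -> float:
    """Highest per-member utilization at which n-1 members can still carry the load."""
    if n < 2:
        raise ValueError("N+1 sizing needs at least two members")
    return (n - 1) / n


def survivor_utilization(n: int, utilization: float) -> float:
    """Per-member utilization after one of n equally loaded members fails."""
    if n < 2:
        raise ValueError("no survivors to absorb the load")
    return utilization * n / (n - 1)


for n in (2, 3, 4):
    print(n, f"{max_safe_utilization(n):.0%}")

print(f"{survivor_utilization(2, 0.60):.0%}")  # two members at 60% each
```

The ceiling rises as the pool grows, from 50% with two members to 75% with four, which is one reason larger active-active pools waste less capacity on headroom.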

Common Mistakes
  • Two "redundant" links that share one conduit, one carrier, or one power feed — a single backhoe or PDU failure takes out both, so the pair is one link for availability math despite costing twice as much.
  • Failover paths that are never tested until the incident that needs them, where the standby config has silently drifted, the DNS TTL is too long, or the standby was never actually wired up — and the failover fails.
  • A single DNS provider or single upstream transit as the hidden SPOF spanning every region — both "independent" sites resolve or egress through it, so its outage darkens all of them at once.
  • Standby or surviving capacity that cannot absorb full load: a two-member active-active pool run at 60% each means the survivor is asked for 120% of its capacity the instant its partner dies, so failover trades one outage for another.
  • No connection draining on deploys or failover, so removing an instance resets every in-flight request — turning routine rollouts and health-check evictions into user-visible connection errors.
Best Practices
  • Walk the request end to end and double every component there is exactly one of — uplink, router, firewall, DNS zone, provider — verifying the two copies share no conduit, power, or control plane.
  • Run redundant paths active-active with ECMP or anycast so capacity is used and failover is continuous, and size to N+1 so the loss of any single member never overloads the survivors.
  • Spread replicas across at least three failure domains at roughly equal share, so losing one zone drops a third of capacity rather than half, and a quorum survives.
  • Use a second DNS provider and a second transit upstream for anything global, because a single one of either is a SPOF that no amount of regional redundancy behind it can fix.
  • Exercise failover on a schedule with game days and enable connection draining everywhere, so the standby is proven under real traffic and in-flight requests finish before an instance leaves rotation.
Comparable concepts
  • Anycast path redundancy (Topic 23)
  • Multi-AZ / multi-region failure domains
  • N+1 / quorum sizing

Knowledge Check

You run two backend instances active-active, each steadily running at 60% of its own capacity. What is the consequence when one instance fails?

  • The survivor is asked for 120% of its capacity and overloads
  • Traffic stalls for the full DNS time-to-live window while every client has to re-resolve over to the survivor
  • The survivor absorbs all traffic with no degradation, since both were already active
  • Both instances enter split-brain and serve conflicting state to clients

Why does ECMP pin each flow to a single path rather than spraying a flow's packets across all available paths?

  • Per-packet spraying would reorder a flow's segments across unequal-latency paths
  • Because the candidate paths must all have strictly identical bandwidth or the route hash refuses to balance them at all
  • So each router can retransmit any packet it drops on a given path
  • Because anycast already balances within the flow, so ECMP must not touch it

Two application replicas run in separate cloud regions but both resolve through one DNS provider. What does that arrangement actually achieve?

  • Region redundancy undone by a shared DNS failure domain
  • Full independence, since the two separate regions share no underlying infrastructure with each other at all
  • Automatic ECMP balancing between the regions for every client request
  • Anycast-grade resilience, because shared DNS implies a shared anycast pool
