High Availability and DR
Topic 61

Reliability, Recovery

High availability is staying up through failures; disaster recovery is getting back up after a large one. They are different problems with different mechanisms, and conflating them is a common and costly mistake. Both come down to handling failure at each level — Pod, node, zone, region — deliberately.

The recurring lesson: availability is designed in at every failure domain, and a recovery plan you have never tested is not a plan.

Failure Domains

Handle each failure domain in turn:
  • Pod: multiple replicas
  • Node: spread across nodes
  • Zone: spread across zones
  • Region: multi-region + DR

Failures happen at nested levels, and you handle each: a Pod fails (run multiple replicas with a controller); a node fails (spread replicas across nodes with anti-affinity/topology spread); a zone fails (spread across zones, the bar for real HA); a region fails (multi-region, the realm of DR). Each level up costs more and is needed by fewer workloads. The mistake is stopping too low — three replicas on one node survive a Pod crash but not a node failure.

Failure domain    How to survive it
Pod               Multiple replicas via a controller
Node              Anti-affinity / topology spread across nodes
Zone              Spread replicas across availability zones
Region            Multi-region deployment + DR plan
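
The failure-domain reasoning above can be made concrete with a small sketch. It is illustrative only: the `Replica` class and the `survived_domains` function are hypothetical names, not part of any Kubernetes API, and the check assumes the service needs just one running replica to count as "up".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Where one replica runs. Field values are illustrative labels."""
    node: str
    zone: str
    region: str


def survived_domains(replicas):
    """Return the failure domains whose single-unit loss leaves a replica running."""
    survived = []
    # A single Pod crash is survived when at least one other replica exists.
    if len(replicas) > 1:
        survived.append("pod")
    # Losing one node, zone, or region is survived only when the replicas
    # occupy more than one distinct unit at that level.
    keys = {
        "node": lambda r: (r.region, r.zone, r.node),
        "zone": lambda r: (r.region, r.zone),
        "region": lambda r: r.region,
    }
    for domain, key in keys.items():
        if len({key(r) for r in replicas}) > 1:
            survived.append(domain)
    return survived


# Three replicas on one node: only a Pod crash is survived.
same_node = [Replica("n1", "a", "r1")] * 3
print(survived_domains(same_node))

# One replica per zone in a single region: Pod, node, and zone failures are survived.
spread = [Replica("n1", "a", "r1"), Replica("n2", "b", "r1"), Replica("n3", "c", "r1")]
print(survived_domains(spread))
```

The first case is the "stopping too low" mistake: the replica count looks healthy, but every replica shares the same node, zone, and region.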

Designing for HA

In-cluster HA combines the tools from earlier chapters: enough replicas, topology spread across zones (Topic 26), PodDisruptionBudgets so operations don't breach availability (Topic 30), readiness probes so traffic only hits healthy Pods (Topic 29), and a highly available control plane with quorum etcd (Topic 03). None of these alone is HA; together they let a workload survive routine failures and operations without an outage.
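
As a rough illustration of why the control plane needs "quorum etcd", the sketch below works through the majority arithmetic that etcd's consensus relies on. The function names are made up for this example.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for size in (1, 3, 4, 5):
    print(f"{size} members: quorum {quorum(size)}, tolerates {tolerated_failures(size)} failure(s)")
```

A four-member cluster tolerates no more failures than a three-member one, which is why odd sizes such as 3 or 5 are the usual choice.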

Backup and Restore

DR rests on backups you can actually restore. There are two layers: etcd snapshots for cluster state (Topic 50), and Velero (or an equivalent) for Kubernetes objects and PersistentVolume data, which can also migrate workloads between clusters. The non-negotiable discipline is rehearsing the restore. A backup whose restore has never been tested often fails when it matters: on a permission, a version mismatch, or a missing dependency discovered under pressure.
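
One way to make a restore rehearsal checkable is to compare an inventory of objects taken before the backup with what exists after the restore. The sketch below assumes you already have per-kind object counts from both clusters; how you collect them is up to you, and `restore_gaps` is a hypothetical helper, not a Velero or etcd feature.

```python
def restore_gaps(expected: dict[str, int], restored: dict[str, int]) -> dict[str, int]:
    """Return {kind: missing_count} for every kind with fewer objects after the restore."""
    gaps = {}
    for kind, want in expected.items():
        have = restored.get(kind, 0)
        if have < want:
            gaps[kind] = want - have
    return gaps


# Illustrative counts only: inventory before backup vs. after a rehearsal restore.
before = {"Deployment": 4, "Secret": 10, "PersistentVolumeClaim": 3}
after = {"Deployment": 4, "Secret": 8}
print(restore_gaps(before, after))
```

An empty result is necessary but not sufficient: matching counts say nothing about whether the restored data is readable, so a rehearsal should also exercise the application against the restored volumes.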

RTO, RPO, and Testing

DR is quantified by two numbers. RTO (recovery time objective) is how long recovery may take. RPO (recovery point objective) is how much data loss is acceptable, usually expressed as a window of time; what you actually achieve is bounded by backup frequency and replication lag. For example, backups taken every six hours can lose up to six hours of data. These numbers drive the design: a tight RPO needs near-continuous replication; a tight RTO needs warm standby, not cold restore. And the whole thing is only real if exercised: game days that deliberately fail components (and occasionally a zone) are how you find the gaps before an incident does. Define RTO/RPO explicitly, then build and test to meet them.
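
A minimal sketch of checking a design against its targets follows. It assumes backup-only recovery, where the worst-case data loss is roughly the interval between backups (a failure just before the next backup runs), and it ignores the time a backup itself takes. The function name and the example figures are illustrative.

```python
from datetime import timedelta


def meets_objectives(rpo, rto, backup_interval, measured_restore_time):
    """Compare a backup schedule and a rehearsed restore time against RPO and RTO."""
    return {
        # Worst case: everything written since the last backup is lost.
        "rpo_met": backup_interval <= rpo,
        # Use a restore time measured in a rehearsal, not an estimate.
        "rto_met": measured_restore_time <= rto,
    }


result = meets_objectives(
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
    backup_interval=timedelta(hours=6),
    measured_restore_time=timedelta(hours=3),
)
print(result)
```

Here the restore is fast enough, but six-hourly backups cannot satisfy a one-hour RPO, so the design would need more frequent backups or replication.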

High availability vs disaster recovery

High availability (HA): stay up through Pod/node/zone failures and routine ops. Mechanisms: replicas, spread, PDBs, HA control plane.

Disaster recovery (DR): recover after a major loss (region, data). Mechanisms: backups, restore, failover, measured by RTO/RPO.

Common Mistakes
  • Conflating HA and DR — they are different problems needing different mechanisms.
  • Spreading replicas across nodes but not zones, so a zone failure still takes the service down.
  • Backups that have never been restored, which fail when actually needed.
  • Leaving RTO and RPO undefined, so the recovery design has no target to meet.
  • Assuming a multi-region cluster topology is DR without a tested failover and data plan.
Best Practices
  • Handle each failure domain deliberately — replicas, topology spread across zones, HA control plane.
  • Combine replicas + topology spread + PDBs + readiness probes for real in-cluster HA.
  • Back up both etcd (cluster state) and app/PV data (Velero), and rehearse the restore regularly.
  • Define RTO and RPO explicitly and design backup/replication to meet them.
  • Run game days that fail components and zones to find gaps before an incident does.
Related
  • etcd and backup: cluster-state recovery (Topic 50)
  • PDBs / topology spread: the HA building blocks (Topics 30, 26)
  • Multi-cluster and multi-region: region-level resilience (Topic 60)

Knowledge Check

What is the difference between HA and DR?

  • HA stays up through failures (replicas, spread, PDBs); DR recovers after a major loss, measured by RTO/RPO
  • They are simply two interchangeable names for the very same goal, just achieved with slightly different tooling and vendor terminology
  • HA exclusively protects your stored data while DR exclusively protects your compute and running workloads
  • HA depends entirely on backups and restore, while DR depends entirely on replicas and topology spread

Three replicas all land on one node. What failure does this NOT survive?

  • A node failure — they survive a Pod crash but not the node going down
  • A single one of the three Pods crashing and then being recreated by the controller on that same node
  • A container restart triggered by a failing liveness probe on one of the three Pods on that node
  • A readiness probe failure pulling one of the three Pods out of the Service endpoints for a while

What do RTO and RPO define?

  • RTO = acceptable recovery time; RPO = acceptable data loss — together they drive the DR design
  • The number of replicas and nodes you should provision so the workload has enough capacity headroom
  • Two Kubernetes object kinds you declare in a manifest
  • Per-container CPU and memory resource limits
