High Availability and DR
High availability is staying up through failures; disaster recovery is getting back up after a large one. They are different problems with different mechanisms, and conflating them is a common and costly mistake. Both come down to handling failure at each level — Pod, node, zone, region — deliberately.
The recurring lesson: availability is designed in at every failure domain, and a recovery plan you have never tested is not a plan.
Failure Domains
Failures happen at nested levels, and you handle each: a Pod fails (run multiple replicas with a controller); a node fails (spread replicas across nodes with anti-affinity/topology spread); a zone fails (spread across zones, the bar for real HA); a region fails (multi-region, the realm of DR). Each level up costs more and is needed by fewer workloads. The mistake is stopping too low — three replicas on one node survive a Pod crash but not a node failure.
| Failure domain | How to survive it |
|---|---|
| Pod | Multiple replicas via a controller |
| Node | Anti-affinity / topology spread across nodes |
| Zone | Spread replicas across availability zones |
| Region | Multi-region deployment + DR plan |
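The "stopping too low" mistake can be checked mechanically. The following is a minimal sketch, assuming a hypothetical list of replica placements (the node and zone names are invented for illustration); it asks whether at least one replica remains after the worst single failure in a given domain.

```python
from collections import Counter


def survives_single_failure(placements, domain):
    """Return True if at least one replica remains after losing any
    single instance of `domain` ("node" or "zone")."""
    if not placements:
        return False
    # Count how many replicas sit in each node (or zone).
    per_domain = Counter(p[domain] for p in placements)
    # The worst case is losing the domain that holds the most replicas.
    worst_loss = max(per_domain.values())
    return len(placements) - worst_loss >= 1


# Hypothetical placements, for illustration only.
same_node = [{"node": "node-a", "zone": "zone-1"}] * 3
same_zone = [
    {"node": "node-a", "zone": "zone-1"},
    {"node": "node-b", "zone": "zone-1"},
    {"node": "node-c", "zone": "zone-1"},
]
across_zones = [
    {"node": "node-a", "zone": "zone-1"},
    {"node": "node-b", "zone": "zone-2"},
    {"node": "node-c", "zone": "zone-3"},
]

if __name__ == "__main__":
    for name, placement in [("same_node", same_node),
                            ("same_zone", same_zone),
                            ("across_zones", across_zones)]:
        print(name,
              "node:", survives_single_failure(placement, "node"),
              "zone:", survives_single_failure(placement, "zone"))
```

Three replicas on one node fail the node check, and three nodes in one zone pass the node check but fail the zone check. The sketch only asks whether any replica survives; a real check would also ask whether the survivors can carry the load.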
Designing for HA
In-cluster HA combines the tools from earlier chapters: enough replicas, topology spread across zones (Topic 26), PodDisruptionBudgets so that voluntary disruptions such as node drains don't breach availability (Topic 30), readiness probes so traffic only hits healthy Pods (Topic 29), and a highly available control plane backed by an etcd cluster that maintains quorum (Topic 03). None of these alone is HA; together they let a workload survive routine failures and operations without an outage.
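To show how these pieces fit together in one workload, here is a rough sketch that builds a Deployment and a PodDisruptionBudget as plain Python dictionaries mirroring the manifest fields. The app name, image, label, probe path, and port are all hypothetical placeholders, and the values chosen (three replicas, `minAvailable: 2`) are illustrative rather than recommendations.

```python
import json

# Hypothetical label shared by the Deployment, its Pods, and the PDB.
APP_LABELS = {"app": "web"}


def ha_deployment(name="web", image="example.com/web:1.0", replicas=3):
    """Deployment with zone-level topology spread and a readiness probe."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": APP_LABELS},
            "template": {
                "metadata": {"labels": APP_LABELS},
                "spec": {
                    # Keep the per-zone replica count within 1 of each other.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": APP_LABELS},
                    }],
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Traffic is routed only once this probe succeeds.
                        # Path and port are assumptions about the app.
                        "readinessProbe": {
                            "httpGet": {"path": "/healthz", "port": 8080},
                            "periodSeconds": 5,
                        },
                    }],
                },
            },
        },
    }


def disruption_budget(name="web-pdb", min_available=2):
    """PDB that keeps at least `min_available` Pods up during voluntary disruptions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": APP_LABELS},
        },
    }


if __name__ == "__main__":
    for manifest in (ha_deployment(), disruption_budget()):
        print(json.dumps(manifest, indent=2))
```

The HA control plane is not represented here because it is a property of the cluster, not of the workload manifest.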
Backup and Restore
DR rests on backups you can actually restore. Two layers: etcd snapshots for cluster state (Topic 50) and Velero (or equivalent) for Kubernetes objects and PersistentVolume data, which can also migrate workloads between clusters. The non-negotiable discipline is rehearsing the restore — a backup whose restore has never been tested routinely fails when it matters, on a permission, a version mismatch, or a missing dependency discovered under pressure.
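One way to make the rehearsal routine is to script it. The sketch below only assembles the command lines for the two backup layers and the restore step; it does not run them. The snapshot path, backup name, and namespace are hypothetical, and a real `etcdctl` invocation would usually also need endpoint and TLS flags for the cluster in question.

```python
import shlex


def etcd_snapshot_cmd(snapshot_path):
    """Command to save an etcd snapshot (cluster state).
    Real clusters typically also need --endpoints and TLS flags."""
    return ["etcdctl", "snapshot", "save", snapshot_path]


def velero_backup_cmd(backup_name, namespace):
    """Command to back up one namespace's objects with Velero."""
    return ["velero", "backup", "create", backup_name,
            "--include-namespaces", namespace]


def velero_restore_cmd(backup_name):
    """Command to restore from a named Velero backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


def rehearsal_plan(snapshot_path, backup_name, namespace):
    """Ordered steps for one backup-and-restore rehearsal."""
    return [
        etcd_snapshot_cmd(snapshot_path),
        velero_backup_cmd(backup_name, namespace),
        velero_restore_cmd(backup_name),
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    plan = rehearsal_plan("/var/backups/etcd-snapshot.db", "nightly-shop", "shop")
    for step in plan:
        print(shlex.join(step))
```

In a rehearsal the restore step would normally target a scratch cluster rather than production, so that permission, version, and dependency problems surface without risk.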
RTO, RPO, and Testing
DR is quantified by two numbers. RTO (recovery time objective) is how long recovery may take; RPO (recovery point objective) is how much data loss is acceptable, set by backup frequency and replication. These drive the design: a tight RPO needs near-continuous replication; a tight RTO needs warm standby, not cold restore. And the whole thing is only real if exercised — game days that deliberately fail components (and occasionally a zone) are how you find the gaps before an incident does. Define RTO/RPO explicitly, then build and test to meet them.
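The relationship between the targets and the design can be sketched as a simple check. This assumes a simplified model in which periodic backups are the only protection, so the worst-case data loss equals one full backup interval, and the restore time is whatever was measured in the last rehearsal. The class and function names are illustrative, and the numbers are made up.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrTargets:
    rto: timedelta  # how long recovery may take
    rpo: timedelta  # how much data loss is acceptable


def worst_case_data_loss(backup_interval):
    """With periodic backups only, a failure just before the next
    backup loses up to one full interval of data (simplified model)."""
    return backup_interval


def meets_targets(targets, backup_interval, measured_restore_time):
    """Compare a backup schedule and a rehearsed restore time to the targets."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= targets.rpo,
        "rto_met": measured_restore_time <= targets.rto,
    }


if __name__ == "__main__":
    # Hypothetical targets: recover within 1 hour, lose at most 15 minutes.
    targets = DrTargets(rto=timedelta(hours=1), rpo=timedelta(minutes=15))

    # Nightly backups with a 3-hour cold restore miss both targets.
    print(meets_targets(targets, timedelta(hours=24), timedelta(hours=3)))

    # 5-minute replication lag with a 30-minute warm failover meets both.
    print(meets_targets(targets, timedelta(minutes=5), timedelta(minutes=30)))
```

The measured restore time is only meaningful if it comes from an actual exercise; an estimate that has never been tested tends to be optimistic.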
Key Terms
High availability (HA): staying up through Pod, node, and zone failures and routine operations, using replicas, topology spread, PDBs, and an HA control plane.
Disaster recovery (DR): recovering after a major loss (a region, or data), using backups, restore, and failover, measured by RTO and RPO.
Common Mistakes
- Conflating HA and DR — they are different problems needing different mechanisms.
- Spreading replicas across nodes but not zones, so a zone failure still takes the service down.
- Backups that have never been restored, which fail when actually needed.
- Leaving RTO and RPO undefined, so the recovery design has no target to meet.
- Assuming a multi-region cluster topology is DR without a tested failover and data plan.
Best Practices
- Handle each failure domain deliberately — replicas, topology spread across zones, HA control plane.
- Combine replicas + topology spread + PDBs + readiness probes for real in-cluster HA.
- Back up both etcd (cluster state) and app/PV data (Velero), and rehearse the restore regularly.
- Define RTO and RPO explicitly and design backup/replication to meet them.
- Run game days that fail components and zones to find gaps before an incident does.
Knowledge Check
What is the difference between HA and DR?
- HA stays up through failures (replicas, spread, PDBs); DR recovers after a major loss, measured by RTO/RPO
- They are simply two interchangeable names for the very same goal, just achieved with slightly different tooling and vendor terminology
- HA exclusively protects your stored data while DR exclusively protects your compute and running workloads
- HA depends entirely on backups and restore, while DR depends entirely on replicas and topology spread
Three replicas all land on one node. What failure does this NOT survive?
- A node failure — they survive a Pod crash but not the node going down
- A single one of the three Pods crashing and then being recreated by the controller on that same node
- A container restart triggered by a failing liveness probe on one of the three Pods on that node
- A readiness probe failure pulling one of the three Pods out of the Service endpoints for a while
What do RTO and RPO define?
- RTO = acceptable recovery time; RPO = acceptable data loss — together they drive the DR design
- The number of replicas and nodes you should provision so the workload has enough capacity headroom
- Two Kubernetes object kinds you declare in a manifest
- Per-container CPU and memory resource limits