Topic 61

High Availability and DR

Reliability · Recovery

High availability is staying up through failures; disaster recovery is getting back up after a large one. They are different problems with different mechanisms, and conflating them is a common and costly mistake. Both come down to handling failure at each level — Pod, node, zone, region — deliberately.

The recurring lesson: availability is designed in at every failure domain, and a recovery plan you have never tested is not a plan.

Failure Domains

Handle each failure domain in turn

Pod (multiple replicas) → Node (spread across nodes) → Zone (spread across zones) → Region (multi-region + DR)

Failures happen at nested levels, and you handle each: a Pod fails (run multiple replicas with a controller); a node fails (spread replicas across nodes with anti-affinity/topology spread); a zone fails (spread across zones, the bar for real HA); a region fails (multi-region, the realm of DR). Each level up costs more and is needed by fewer workloads. The mistake is stopping too low — three replicas on one node survive a Pod crash but not a node failure.

Failure domain	How to survive it
Pod	Multiple replicas via a controller
Node	Anti-affinity / topology spread across nodes
Zone	Spread replicas across availability zones
Region	Multi-region deployment + DR plan
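The table can be turned into a quick check. The sketch below is illustrative only: the Placement type, the survived_domains helper, and all Pod, node, zone, and region names are hypothetical, and "survive" here means only that at least one replica is left running, not that the remaining capacity is sufficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one replica runs. All names used below are hypothetical."""
    pod: str
    node: str
    zone: str
    region: str


# Failure domains, ordered from smallest to largest.
DOMAINS = ("pod", "node", "zone", "region")


def survived_domains(replicas):
    """Return the domains whose single loss still leaves a replica running."""
    survived = []
    for domain in DOMAINS:
        distinct = {getattr(r, domain) for r in replicas}
        # Losing one node/zone/region removes every replica inside it,
        # so survival needs replicas in at least two distinct values.
        if len(distinct) >= 2:
            survived.append(domain)
    return survived


same_node = [
    Placement("web-0", "node-a", "zone-1", "region-x"),
    Placement("web-1", "node-a", "zone-1", "region-x"),
    Placement("web-2", "node-a", "zone-1", "region-x"),
]
across_zones = [
    Placement("web-0", "node-a", "zone-1", "region-x"),
    Placement("web-1", "node-b", "zone-2", "region-x"),
    Placement("web-2", "node-c", "zone-3", "region-x"),
]

print(survived_domains(same_node))     # ['pod']
print(survived_domains(across_zones))  # ['pod', 'node', 'zone']
```

The first placement is the "three replicas on one node" case from the text: it survives a Pod crash and nothing larger.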

Designing for HA

In-cluster HA combines the tools from earlier chapters: enough replicas, topology spread across zones (Topic 26), PodDisruptionBudgets so voluntary disruptions such as node drains never take down too many Pods at once (Topic 30), readiness probes so traffic only reaches healthy Pods (Topic 29), and a highly available control plane whose etcd cluster keeps quorum (Topic 03). None of these alone is HA; together they let a workload survive routine failures and operations without an outage.
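Two of these pieces reduce to simple arithmetic, sketched below. The function names are made up for illustration. The quorum rule is the standard majority rule used by etcd, and the PodDisruptionBudget case assumes an integer minAvailable rather than a percentage.

```python
def quorum(members: int) -> int:
    """Majority needed for an etcd cluster of the given size to accept writes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while the cluster still has quorum."""
    return members - quorum(members)


def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB with an integer minAvailable permits right now."""
    return max(0, healthy_pods - min_available)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
# 1 1 0
# 2 2 0
# 3 2 1
# 4 3 1
# 5 3 2

print(disruptions_allowed(healthy_pods=3, min_available=2))  # 1
print(disruptions_allowed(healthy_pods=2, min_available=2))  # 0
```

The output shows why odd member counts are the usual choice: a fourth member raises the quorum size without raising the number of failures tolerated.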

Backup and Restore

DR rests on backups you can actually restore. Two layers: etcd snapshots for cluster state (Topic 50) and Velero (or equivalent) for Kubernetes objects and PersistentVolume data, which can also migrate workloads between clusters. The non-negotiable discipline is rehearsing the restore — a backup whose restore has never been tested routinely fails when it matters, on a permission, a version mismatch, or a missing dependency discovered under pressure.
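One way to make the rehearsal routine is to script it. The sketch below only builds and prints the command lines; it does not run them. The backup name, namespaces, and snapshot path are hypothetical. A real etcdctl call also needs endpoint and certificate options for your cluster, which are omitted here. Restoring into a scratch namespace is one possible rehearsal approach, not the only one.

```python
import shlex


def etcd_snapshot_cmd(snapshot_path):
    """Command to save an etcd snapshot (connection flags omitted)."""
    return ["etcdctl", "snapshot", "save", snapshot_path]


def velero_backup_cmd(backup_name, namespace):
    """Command to back up one namespace with Velero."""
    return ["velero", "backup", "create", backup_name,
            "--include-namespaces", namespace]


def velero_rehearsal_restore_cmd(backup_name, namespace, scratch_namespace):
    """Command to restore the backup into a scratch namespace for a drill."""
    return ["velero", "restore", "create",
            "--from-backup", backup_name,
            "--namespace-mappings", f"{namespace}:{scratch_namespace}"]


def rehearsal_plan(backup_name, namespace, scratch_namespace, snapshot_path):
    """Ordered steps for one backup-and-restore rehearsal."""
    return [
        etcd_snapshot_cmd(snapshot_path),
        velero_backup_cmd(backup_name, namespace),
        velero_rehearsal_restore_cmd(backup_name, namespace, scratch_namespace),
    ]


# Hypothetical names, for illustration only.
plan = rehearsal_plan("shop-drill", "shop", "shop-restore-test",
                      "/backup/etcd-snapshot.db")
for cmd in plan:
    print(shlex.join(cmd))
```

Printing the plan first keeps the drill reviewable; the same argument lists could later be passed to subprocess.run once the omitted connection options are filled in.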

RTO, RPO, and Testing

DR is quantified by two numbers. RTO (recovery time objective) is how long recovery may take; RPO (recovery point objective) is how much data loss is acceptable, set by backup frequency and replication. These drive the design: a tight RPO needs near-continuous replication; a tight RTO needs warm standby, not cold restore. And the whole thing is only real if exercised — game days that deliberately fail components (and occasionally a zone) are how you find the gaps before an incident does. Define RTO/RPO explicitly, then build and test to meet them.
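The link between these targets and the design can be checked with a little arithmetic. This is a minimal sketch under simplifying assumptions: backups run on a fixed interval, replication is ignored, and RTO is judged against the slowest measured restore drill. All function names and figures are illustrative.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Worst-case data loss with periodic backups.

    A failure just before the next backup completes loses everything
    written since the previous backup started.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval, rpo, backup_duration=timedelta(0)):
    """True if the backup schedule can satisfy the recovery point objective."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


def meets_rto(drill_durations, rto):
    """True only if restores were rehearsed and the slowest one fits the RTO."""
    return bool(drill_durations) and max(drill_durations) <= rto


# Hourly backups that take 5 minutes cannot meet a 15-minute RPO.
print(meets_rpo(timedelta(hours=1), rpo=timedelta(minutes=15),
                backup_duration=timedelta(minutes=5)))                # False

# Two rehearsed restores, both under a 1-hour RTO.
print(meets_rto([timedelta(minutes=40), timedelta(minutes=55)],
                rto=timedelta(hours=1)))                              # True

# No drills on record: the target is treated as unmet.
print(meets_rto([], rto=timedelta(hours=1)))                          # False
```

Treating an empty drill history as a failure mirrors the point above: an RTO that has never been measured is a guess, not a result.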

High availability vs disaster recovery

High availability (HA) — stay up through Pod/node/zone failures and routine ops — replicas, spread, PDBs, HA control plane.

Disaster recovery (DR) — recover after a major loss (region, data) — backups, restore, failover, measured by RTO/RPO.

Common Mistakes

Conflating HA and DR — they are different problems needing different mechanisms.
Spreading replicas across nodes but not zones, so a zone failure still takes the service down.
Backups that have never been restored, which fail when actually needed.
Leaving RTO and RPO undefined, so the recovery design has no target to meet.
Assuming a multi-region cluster topology is DR without a tested failover and data plan.

Best Practices

Handle each failure domain deliberately — replicas, topology spread across zones, HA control plane.
Combine replicas + topology spread + PDBs + readiness probes for real in-cluster HA.
Back up both etcd (cluster state) and app/PV data (Velero), and rehearse the restore regularly.
Define RTO and RPO explicitly and design backup/replication to meet them.
Run game days that fail components and zones to find gaps before an incident does.
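Before a game day, a dry run on paper can predict which zone losses should hurt. The sketch below assumes a hypothetical replica count per zone and a minimum number of replicas needed to carry the load; both the helper name and the numbers are illustrative.

```python
def zone_failure_gaps(replicas_per_zone, min_replicas):
    """Map each zone whose loss drops capacity below the minimum
    to the number of replicas that would remain."""
    total = sum(replicas_per_zone.values())
    gaps = {}
    for zone, count in replicas_per_zone.items():
        remaining = total - count
        if remaining < min_replicas:
            gaps[zone] = remaining
    return gaps


# Uneven spread: losing zone-1 leaves only 2 of the 4 replicas needed.
print(zone_failure_gaps({"zone-1": 4, "zone-2": 1, "zone-3": 1},
                        min_replicas=4))   # {'zone-1': 2}

# Even spread with headroom: any single zone can fail.
print(zone_failure_gaps({"zone-1": 2, "zone-2": 2, "zone-3": 2},
                        min_replicas=4))   # {}
```

The game day then tests whether the real system behaves as this prediction says it should.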

Related

etcd and backup: cluster-state recovery (Topic 50)
PDBs / topology spread: the HA building blocks (Topics 30, 26)
Multi-cluster and multi-region: region-level resilience (Topic 60)

Knowledge Check

What is the difference between HA and DR?

HA stays up through failures (replicas, spread, PDBs); DR recovers after a major loss, measured by RTO/RPO
They are simply two interchangeable names for the very same goal, just achieved with slightly different tooling and vendor terminology
HA exclusively protects your stored data while DR exclusively protects your compute and running workloads
HA depends entirely on backups and restore, while DR depends entirely on replicas and topology spread

Three replicas all land on one node. What failure does this NOT survive?

A node failure — they survive a Pod crash but not the node going down
A single one of the three Pods crashing and then being recreated by the controller on that same node
A container restart triggered by a failing liveness probe on one of the three Pods on that node
A readiness probe failure pulling one of the three Pods out of the Service endpoints for a while

What do RTO and RPO define?

RTO = acceptable recovery time; RPO = acceptable data loss — together they drive the DR design
The number of replicas and nodes you should provision so the workload has enough capacity headroom
Two Kubernetes object kinds you declare in a manifest
Per-container CPU and memory resource limits
