Reliability & Disaster Recovery

A reliable system stays up when individual components fail; a disaster-recovery plan brings it back when many fail at once. The two are related but distinct, and both turn the Well-Architected Reliability principles into concrete AWS choices.

Reliability work without targets is wishful thinking — define RTO and RPO first, then let the architecture follow.

Targets, Multi-AZ, and Multi-Region

RTO (Recovery Time Objective: how long the workload can be down) and RPO (Recovery Point Objective: how much recent data you can afford to lose) drive everything. An RTO measured in hours fits backup-and-restore; an RTO of minutes with an RPO of seconds requires warm standby or active-active. Different workloads in the same account can have different targets; applying the strictest targets everywhere is expensive and unnecessary.

Multi-AZ is the floor for production: compute spread across at least two AZs, Multi-AZ databases, and at least one reader in a different AZ. Multi-Region protects against the loss of a whole Region, which is rare; warm standby is the practical DR default, with active-active reserved for genuine global scale. Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication (CRR), Route 53 health checks, and KMS multi-Region keys support multi-Region designs.

Multi-Region DR: From Backup-and-Restore to Active-Active

  • Backup & Restore: snapshots copied cross-Region; restore after a disaster. RTO hours.
  • Pilot Light: core always running in the second Region, scaled up on failover. RTO tens of minutes.
  • Warm Standby: scaled-down full copy always on. RTO single-digit minutes. The practical default.
  • Active-Active: full copies serve traffic everywhere. RTO immediate; hardest data model.

The spectrum runs from cheaper with slower recovery (snapshots plus a manual restore) to costlier with instant recovery (both Regions live; needs conflict resolution).
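
The mapping from targets to strategy can be sketched as a small lookup. This is only an illustration: the `RecoveryTargets` type, the `suggest_dr_strategy` function, and the exact cut-offs (one minute, ten minutes, one hour) are assumptions chosen to mirror the tiers described here, not AWS-defined thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Per-workload targets (hypothetical type), both in seconds."""
    rto_seconds: float  # how long the workload may be down
    rpo_seconds: float  # how much recent data may be lost


# Illustrative cut-offs only (assumptions), loosely following the four tiers.
ONE_MINUTE = 60
TEN_MINUTES = 10 * ONE_MINUTE
ONE_HOUR = 60 * ONE_MINUTE


def suggest_dr_strategy(targets: RecoveryTargets) -> str:
    """Return the cheapest strategy whose typical recovery profile fits the targets."""
    if targets.rto_seconds < ONE_MINUTE:
        # Near-immediate recovery: every Region must already be serving traffic.
        return "active-active"
    if targets.rto_seconds < TEN_MINUTES or targets.rpo_seconds < ONE_MINUTE:
        # Single-digit-minute RTO, or an RPO of seconds, needs a live replica.
        return "warm standby"
    if targets.rto_seconds < ONE_HOUR:
        # Tens of minutes: core is running, the rest is scaled up on failover.
        return "pilot light"
    # An hour or more of tolerated downtime: restore from cross-Region snapshots.
    return "backup-and-restore"
```

For example, `suggest_dr_strategy(RecoveryTargets(rto_seconds=300, rpo_seconds=5))` returns `"warm standby"`, while a batch workload with a four-hour RTO and a one-day RPO lands on `"backup-and-restore"`. Keeping the targets per workload in data like this makes it easier to avoid applying the strictest tier everywhere.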

Automatic Recovery and Application Resilience

Auto-scaling and health checks make recovery automatic at the AZ level: Auto Scaling Groups and ECS Service Auto Scaling replace unhealthy instances, ALB and Route 53 health checks remove failing targets, and stateless design (session state in ElastiCache/DynamoDB, files in S3) makes instances interchangeable.
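
How a health check takes a failing target out of rotation can be illustrated with a small state machine. This is a simplified model, not the ALB or Route 53 implementation: the `TargetHealth` class and its threshold values are assumptions, although the idea of requiring several consecutive results before changing state matches how such checks are commonly configured.

```python
class TargetHealth:
    """Tracks one target; state flips only after enough consecutive contrary checks."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 3) -> None:
        # Threshold values are illustrative assumptions.
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health-check result and return the resulting state."""
        if check_passed == self.healthy:
            # Result agrees with the current state: reset the contrary streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


def routable(targets: dict) -> list:
    """Names of targets that should currently receive traffic."""
    return sorted(name for name, state in targets.items() if state.healthy)
```

A single failed check does not remove a target, which avoids flapping on one slow response; two in a row do. Recovery is deliberately slower than removal so that a target is not returned to service on one lucky check.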

The application has its part too: retries with exponential backoff, timeouts on every call, circuit breakers for flaky downstreams, idempotency (SQS/SNS are at-least-once), graceful degradation, and dead-letter queues.
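
Two of these patterns, retries with exponential backoff and idempotent message handling, can be sketched in a few lines. The names `retry_with_backoff` and `IdempotentHandler` are hypothetical, the default delays are arbitrary, and the in-memory set of seen message IDs stands in for what would need to be a durable store (for example a DynamoDB table) in a real consumer.

```python
import random
import time
from typing import Callable, Set, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The cap doubles each attempt (0.1s, 0.2s, 0.4s, ...) up to max_delay;
            # a random delay below the cap keeps many clients from retrying in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


class IdempotentHandler:
    """Skips messages whose ID was already processed (at-least-once delivery)."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        # Assumption: in-memory only; duplicates after a restart would not be caught.
        self._seen: Set[str] = set()

    def handle(self, message_id: str, body: str) -> bool:
        """Process the message once; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Mark as seen only after success, so a failed message can be redelivered.
        self._seen.add(message_id)
        return True
```

In practice the retry wrapper would catch only errors known to be transient, and a message that keeps failing after its retries is what a dead-letter queue is for.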

Tested Backups and Practiced Failures

A backup you have never restored is wishful thinking. Use AWS Backup as the umbrella service that centralizes backups, set retention deliberately, copy backups cross-Region and cross-account, and schedule restore drills. Use S3 versioning and Object Lock for data that must not be overwritten or deleted.
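
These checks are easy to automate against whatever backup inventory you keep. The sketch below is not an AWS Backup API call: `BackupRecord` is a hypothetical inventory row, `audit_backup` is an illustrative helper, and the 90-day drill interval is an assumed policy rather than a recommendation from AWS.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class BackupRecord:
    """Hypothetical inventory row describing one workload's backups."""
    workload: str
    source_region: str
    source_account: str
    copy_region: Optional[str]         # where the backup copy lives, if any
    copy_account: Optional[str]        # which account holds the copy, if any
    last_restore_test: Optional[date]  # date of the last successful restore drill


def audit_backup(
    record: BackupRecord,
    today: date,
    max_drill_age: timedelta = timedelta(days=90),  # assumed policy
) -> List[str]:
    """Return a list of findings; an empty list means the record passes."""
    findings: List[str] = []
    if record.copy_region is None or record.copy_region == record.source_region:
        findings.append("no cross-Region copy")
    if record.copy_account is None or record.copy_account == record.source_account:
        findings.append("no cross-account copy")
    if record.last_restore_test is None:
        findings.append("restore never tested")
    elif today - record.last_restore_test > max_drill_age:
        findings.append("restore drill overdue")
    return findings
```

Running a check like this on a schedule turns "test your restores" from an intention into something that produces a visible list of gaps.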

Real reliability comes from practicing failure: game days, Region-failover drills (annually at minimum), restore drills, and AWS Fault Injection Service. The operational side — SLOs, actionable-only alerts, blameless retros, IaC — is what keeps the architecture working over time.

The four multi-Region DR strategies

Backup & restore / Pilot light — cheapest, RTO hours to tens of minutes — for workloads that tolerate longer recovery.

Warm standby — a scaled-down full version always running, single-digit-minute RTO — the practical DR default.

Active-active — full versions everywhere, immediate recovery — for genuine global scale, hardest data model.

Common Mistakes
  • Doing reliability work without defined RTO and RPO targets, so effort is unfocused and possibly mis-sized.
  • Running single-AZ production — the same failure profile as a single data center, for small savings.
  • Running an Aurora/DocumentDB/Neptune cluster with no reader in a second AZ, slowing failover.
  • Treating untested backups as a recovery plan — a backup never restored is not a backup.
  • Keeping backups only in the same Region and account as the data, defeating DR and ransomware protection.
  • Building multi-Region DR and never drilling the failover, so it fails for real when it is needed.
Best Practices
  • Define RTO and RPO per workload before designing for them.
  • Make Multi-AZ the floor for production compute, databases, and caches.
  • Use warm standby as the multi-Region DR default; active-active only for genuine global scale.
  • Make recovery automatic with auto-scaling, health checks, and stateless design.
  • Add application resilience: retries with backoff, timeouts, circuit breakers, idempotency, dead-letter queues.
  • Back up cross-Region and cross-account, test restores, and drill failures (game days, Region failover) on a schedule.
Comparable services
  • GCP: Cloud DR planning guide, regional/multi-regional resources
  • Azure: Azure Site Recovery, availability zones/regions

Knowledge Check

What two targets should drive reliability and DR design?

  • RTO (how long down) and RPO (how much data loss) — defined per workload
  • CPU and memory utilization, measured continuously across the whole fleet of instances
  • Cost and headcount
  • Latency and throughput only

What is the reliability floor for a production workload on AWS?

  • Multi-AZ — compute, databases, and caches spread across at least two Availability Zones
  • Multi-Region active-active, with full live capacity running in two or more Regions at once
  • A single large instance with frequent snapshots
  • Single-AZ with a backup

Why are application-level patterns like idempotency and dead-letter queues part of reliability?

  • SQS/SNS deliver at-least-once and components fail; idempotency, retries, and DLQs make the system survive partial failures
  • They exist mainly to reduce the AWS bill by cutting the number of SQS requests and Lambda invocations a workload is charged for
  • They replace the need for Multi-AZ, so an application that uses idempotency and dead-letter queues can safely run in a single Availability Zone
  • They are only relevant to security, acting as controls that block unauthorized access to the messages flowing through SQS and SNS

What makes a backup actually a backup?

  • Testing the restore — plus copying it cross-Region and cross-account
  • Storing it in the same account and Region as the source data for the fastest possible access
  • Taking it once and never verifying it
  • Keeping it only in CloudWatch Logs
