Service 64

Reliability & Disaster Recovery

Architecture · Reliability · Practices

A reliable system stays up when individual components fail; a disaster-recovery plan brings it back when many fail at once. The two are related but distinct, and both turn the Well-Architected Reliability principles into concrete AWS choices.

Reliability work without targets is wishful thinking — define RTO and RPO first, then let the architecture follow.

Targets, Multi-AZ, and Multi-Region

RTO (recovery time objective: how long the workload can be down) and RPO (recovery point objective: how much recent data you can afford to lose) drive everything. An RTO measured in hours fits backup-and-restore; an RTO of minutes with an RPO of seconds requires warm standby or active-active. Different workloads in the same account can have different targets; applying the strictest targets everywhere is expensive and unnecessary.
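As a rough illustration of how a target constrains a design, the sketch below records per-workload targets and checks whether a periodic backup schedule can meet an RPO. It assumes the worst-case data loss for periodic backups is one full backup interval; the workload names and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Targets:
    """Recovery targets for one workload (illustrative values only)."""
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Simplifying assumption: a failure just before the next backup loses
    # everything written since the previous one, i.e. one full interval.
    return backup_interval


def meets_rpo(targets: Targets, backup_interval: timedelta) -> bool:
    """True if periodic backups at this interval stay within the RPO."""
    return worst_case_data_loss(backup_interval) <= targets.rpo


# Hypothetical workloads with different targets in the same account.
WORKLOADS = {
    "internal-reporting": Targets(rto=timedelta(hours=8), rpo=timedelta(hours=24)),
    "checkout-api": Targets(rto=timedelta(minutes=5), rpo=timedelta(seconds=30)),
}

if __name__ == "__main__":
    daily = timedelta(hours=24)
    for name, targets in WORKLOADS.items():
        print(name, "daily snapshots meet RPO:", meets_rpo(targets, daily))
```

Under these assumptions, daily snapshots are enough for the reporting workload but not for the checkout workload, which would need continuous replication to approach an RPO of seconds.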

Multi-AZ is the floor for production: compute spread across at least two AZs, Multi-AZ databases, and, for cluster databases such as Aurora, at least one reader in a different AZ. Multi-Region handles the rare loss of a whole Region; warm standby is the practical DR default, with active-active reserved for genuine global scale. Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication (CRR), Route 53 health checks, and KMS multi-Region keys support multi-Region designs.
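The Multi-AZ floor can be checked mechanically against an inventory. The following is a minimal sketch, assuming a hypothetical inventory that maps each tier to the AZ of every resource in it; it does not call any AWS API.

```python
from collections import Counter


def single_az_tiers(inventory: dict[str, list[str]], minimum_azs: int = 2) -> list[str]:
    """Return the tiers whose resources span fewer than `minimum_azs` distinct AZs."""
    return sorted(
        tier for tier, azs in inventory.items() if len(set(azs)) < minimum_azs
    )


def az_distribution(azs: list[str]) -> dict[str, int]:
    """Count resources per AZ, useful for spotting lopsided placement."""
    return dict(Counter(azs))


# Hypothetical inventory: tier name -> AZ of each resource in that tier.
INVENTORY = {
    "web": ["us-east-1a", "us-east-1b", "us-east-1a"],
    "database": ["us-east-1a", "us-east-1b"],
    "cache": ["us-east-1a"],
}

if __name__ == "__main__":
    print("Tiers below the Multi-AZ floor:", single_az_tiers(INVENTORY))
    print("Web tier distribution:", az_distribution(INVENTORY["web"]))
```

In this made-up inventory only the cache tier sits in a single AZ, so it is the one flagged.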

Multi-Region DR — From Backup-and-Restore to Active-Active

Backup & Restore

Snapshots copied cross-Region; restore after a disaster. RTO hours.

Pilot Light

Core always running in the second Region, scaled up on failover. RTO tens of minutes.

Warm Standby

Scaled-down full copy always on. RTO single-digit minutes. The practical default.

Active-Active

Full copies serve traffic everywhere. RTO immediate; hardest data model.

Cheaper · slower recovery: snapshots plus a manual restore.

Costlier · instant recovery: both Regions live; needs conflict resolution.
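The spectrum above can be read as a lookup from RTO to strategy. The sketch below is one way to express that, with cutoffs assumed for illustration from the RTO ranges given for each strategy (hours, tens of minutes, single-digit minutes, immediate). The function name and exact thresholds are hypothetical, and a real decision would also weigh RPO, cost, and the data model.

```python
from datetime import timedelta


def suggest_dr_strategy(rto: timedelta) -> str:
    """Map an RTO target to the cheapest strategy that plausibly meets it.

    The cutoffs are illustrative assumptions, not AWS-defined limits.
    """
    if rto >= timedelta(hours=1):
        return "backup-and-restore"  # RTO measured in hours
    if rto >= timedelta(minutes=10):
        return "pilot-light"  # RTO in the tens of minutes
    if rto >= timedelta(minutes=1):
        return "warm-standby"  # RTO in single-digit minutes
    return "active-active"  # effectively immediate recovery


if __name__ == "__main__":
    for rto in (timedelta(hours=4), timedelta(minutes=30),
                timedelta(minutes=5), timedelta(seconds=0)):
        print(rto, "->", suggest_dr_strategy(rto))
```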

Automatic Recovery and Application Resilience

Auto-scaling and health checks make recovery automatic at the AZ level: Auto Scaling Groups and ECS Service Auto Scaling replace unhealthy instances, ALB and Route 53 health checks remove failing targets, and stateless design (session state in ElastiCache/DynamoDB, files in S3) makes instances interchangeable.

The application has its part too: retries with exponential backoff, timeouts on every call, circuit breakers for flaky downstreams, idempotency (SQS/SNS are at-least-once), graceful degradation, and dead-letter queues.
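Two of these patterns are small enough to sketch directly: retries with exponential backoff, and an idempotent handler for at-least-once delivery. This is a minimal illustration, not a production implementation. The names `retry_with_backoff` and `IdempotentHandler` are hypothetical, the jitter scheme is one common choice among several, and the in-memory set of seen IDs stands in for a durable store (for example a database table), which a real system would need.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles each attempt, capped at max_delay;
            # a random delay below the ceiling avoids synchronized retries.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")  # the loop always returns or raises


class IdempotentHandler:
    """Process each message ID once, even if the message is delivered repeatedly."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # in-memory stand-in for a durable store

    def handle(self, message: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self._seen:
            return False
        self._handler(message)
        self._seen.add(message_id)  # record only after the handler succeeds
        return True
```

A caller would wrap each downstream call, for example `retry_with_backoff(lambda: client.fetch(order_id))` where `client.fetch` is whatever call the application makes, and route every queue message through `IdempotentHandler.handle` so that a redelivered message does not repeat its side effect. Messages that still fail after the retries are the ones a dead-letter queue is meant to catch.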

Tested Backups and Practiced Failures

A backup you have never restored is wishful thinking. Use AWS Backup as the central service for backup plans, set retention deliberately, copy backups cross-Region and cross-account, and schedule restore drills. Use S3 versioning and Object Lock for data that must not be overwritten or deleted.
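These rules are easy to turn into a periodic audit. The sketch below assumes a hypothetical inventory record per workload (it is not an AWS Backup API type) and a 90-day limit on the age of the last restore test, which is an illustrative choice rather than a recommendation from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical inventory record describing where a backup copy lives."""
    workload: str
    source_region: str
    source_account: str
    copy_region: str
    copy_account: str
    last_restore_test: Optional[date]  # None means the restore was never tested


def backup_findings(
    record: BackupRecord,
    today: date,
    max_test_age: timedelta = timedelta(days=90),  # assumed drill cadence
) -> list[str]:
    """List the ways a backup falls short of the practices described above."""
    findings = []
    if record.copy_region == record.source_region:
        findings.append("no cross-Region copy")
    if record.copy_account == record.source_account:
        findings.append("no cross-account copy")
    if record.last_restore_test is None:
        findings.append("restore never tested")
    elif today - record.last_restore_test > max_test_age:
        findings.append("restore test is stale")
    return findings


if __name__ == "__main__":
    record = BackupRecord(
        workload="orders-db",  # hypothetical workload and account IDs
        source_region="us-east-1", source_account="111111111111",
        copy_region="us-east-1", copy_account="111111111111",
        last_restore_test=None,
    )
    print(backup_findings(record, today=date(2024, 6, 1)))
```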

Real reliability comes from practicing failure: game days, Region-failover drills (annually at minimum), restore drills, and AWS Fault Injection Service. The operational side — SLOs, actionable-only alerts, blameless retros, IaC — is what keeps the architecture working over time.

The four multi-Region DR strategies

Backup & restore / Pilot light — cheapest, RTO hours to tens of minutes — for workloads that tolerate longer recovery.

Warm standby — a scaled-down full version always running, single-digit-minute RTO — the practical DR default.

Active-active — full versions everywhere, immediate recovery — for genuine global scale, hardest data model.

Common Mistakes

Doing reliability work without defined RTO and RPO targets, so effort is unfocused and possibly mis-sized.
Running single-AZ production — the same failure profile as a single data center, for small savings.
Running an Aurora/DocumentDB/Neptune cluster with no reader in a second AZ, slowing failover.
Treating untested backups as a recovery plan — a backup never restored is not a backup.
Keeping backups only in the same Region and account as the data, defeating DR and ransomware protection.
Building multi-Region DR and never drilling the failover, so it fails for real when it is needed.

Best Practices

Define RTO and RPO per workload before designing for them.
Make Multi-AZ the floor for production compute, databases, and caches.
Use warm standby as the multi-Region DR default; active-active only for genuine global scale.
Make recovery automatic with auto-scaling, health checks, and stateless design.
Add application resilience: retries with backoff, timeouts, circuit breakers, idempotency, dead-letter queues.
Back up cross-Region and cross-account, test restores, and drill failures (game days, Region failover) on a schedule.

Comparable services
GCP: Cloud DR planning guide, regional/multi-regional resources
Azure: Azure Site Recovery, availability zones/regions

Knowledge Check

What two targets should drive reliability and DR design?

RTO (how long down) and RPO (how much data loss) — defined per workload
CPU and memory utilization, measured continuously across the whole fleet of instances
Cost and headcount
Latency and throughput only

What is the reliability floor for a production workload on AWS?

Multi-AZ — compute, databases, and caches spread across at least two Availability Zones
Multi-Region active-active, with full live capacity running in two or more Regions at once
A single large instance with frequent snapshots
Single-AZ with a backup

Why are application-level patterns like idempotency and dead-letter queues part of reliability?

SQS/SNS deliver at-least-once and components fail; idempotency, retries, and DLQs make the system survive partial failures
They exist mainly to reduce the AWS bill by cutting the number of SQS requests and Lambda invocations a workload is charged for
They replace the need for Multi-AZ, so an application that uses idempotency and dead-letter queues can safely run in a single Availability Zone
They are only relevant to security, acting as controls that block unauthorized access to the messages flowing through SQS and SNS

What makes a backup actually a backup?

Testing the restore — plus copying it cross-Region and cross-account
Storing it in the same account and Region as the source data for the fastest possible access
Taking it once and never verifying it
Keeping it only in CloudWatch Logs
