Reliability & Disaster Recovery
A reliable system stays up when individual components fail; a disaster-recovery plan brings it back when many fail at once. The two are related but distinct, and both turn the Well-Architected Reliability principles into concrete AWS choices.
Reliability work without targets is wishful thinking — define RTO and RPO first, then let the architecture follow.
Targets, Multi-AZ, and Multi-Region
RTO (how long the workload can be down) and RPO (how much data it can lose) drive everything. An RTO of an hour fits backup-and-restore; an RTO of minutes with an RPO of seconds requires warm standby or active-active. Different workloads in the same account can have different targets; applying the strictest targets everywhere is expensive and unnecessary.
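As a rough illustration of how targets can drive the choice, the sketch below maps an RTO/RPO pair to one of the strategies named in this lesson. The `Targets` class, the `suggest_strategy` function, and the exact thresholds are assumptions made for this example, loosely based on the ranges quoted in this lesson; they are not AWS guidance. Treating a seconds-level RPO as needing at least warm standby is a simplification.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Targets:
    """Recovery targets for one workload (hypothetical helper type)."""
    rto: timedelta  # how long the workload can be down
    rpo: timedelta  # how much data, measured in time, it can lose


def suggest_strategy(t: Targets) -> str:
    """Return a DR strategy name for the given targets.

    All thresholds below are illustrative assumptions.
    """
    if t.rto < timedelta(minutes=1):
        # Near-immediate recovery: only full capacity everywhere fits.
        return "active-active"
    if t.rto < timedelta(minutes=10) or t.rpo < timedelta(minutes=1):
        # Single-digit-minute RTO, or a seconds-level RPO.
        return "warm standby"
    if t.rto < timedelta(hours=1):
        # Tens of minutes of downtime are acceptable.
        return "pilot light"
    # An hour or more of downtime is acceptable.
    return "backup and restore"


if __name__ == "__main__":
    reporting = Targets(rto=timedelta(hours=4), rpo=timedelta(hours=24))
    checkout = Targets(rto=timedelta(minutes=5), rpo=timedelta(seconds=30))
    print("reporting:", suggest_strategy(reporting))
    print("checkout:", suggest_strategy(checkout))
```

Keeping the targets in a per-workload structure like this makes it harder to fall into the strictest-everywhere trap: each workload carries its own numbers, and the strategy follows from them.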
Multi-AZ is the floor for production: compute across at least two AZs, Multi-AZ databases, and, for cluster databases such as Aurora, at least one reader in a different AZ. Multi-Region handles the rare loss of a whole Region; warm standby is the practical DR default, with active-active reserved for genuine global scale. Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication (CRR), Route 53 health checks, and KMS multi-Region keys support it.
Automatic Recovery and Application Resilience
Auto-scaling and health checks make recovery automatic at the AZ level: Auto Scaling Groups and ECS Service Auto Scaling replace unhealthy instances, ALB and Route 53 health checks remove failing targets, and stateless design (session state in ElastiCache/DynamoDB, files in S3) makes instances interchangeable.
The application has its part too: retries with exponential backoff, timeouts on every call, circuit breakers for flaky downstreams, idempotency (standard SQS queues and SNS deliver at-least-once, so consumers can see the same message twice), graceful degradation, and dead-letter queues.
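Two of these patterns can be sketched in a few lines of plain Python. The example below is a minimal illustration, not a production implementation: `TransientError`, `retry_with_backoff`, and `IdempotentConsumer` are hypothetical names, the delay values are arbitrary, and the in-memory set stands in for the durable store a real consumer would need.

```python
import random
import time
from typing import Callable, Set, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on TransientError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")


class IdempotentConsumer:
    """Process each message ID at most once, even if it is delivered twice."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        # Assumption: in-memory only. A real consumer needs a durable store.
        self._seen: Set[str] = set()

    def handle(self, message_id: str, body: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        self._seen.add(message_id)
        return True
```

The `sleep` parameter exists so the backoff can be exercised in tests without real waiting. Recording the message ID only after the handler succeeds means a crash mid-handler leads to a retry, so the handler itself should still be safe to run twice.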
Tested Backups and Practiced Failures
A backup you have never restored is wishful thinking. Use AWS Backup as the central service, set retention deliberately, copy backups cross-Region and cross-account, and schedule restore drills. Use S3 versioning and Object Lock for data you cannot afford to have overwritten or deleted.
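One way to keep these rules from drifting is to check them mechanically. The sketch below audits a backup inventory against the points above. The `BackupRecord` fields, the `audit` function, and the 90-day drill interval are all assumptions for illustration; the records would have to be populated from whatever inventory you actually keep, and no AWS API is called here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical inventory entry for one backed-up resource."""
    name: str
    source_region: str
    source_account: str
    copy_region: Optional[str]         # Region holding the backup copy, if any
    copy_account: Optional[str]        # account holding the backup copy, if any
    last_restore_test: Optional[date]  # date of the last successful restore drill


def audit(
    record: BackupRecord,
    today: date,
    max_drill_age: timedelta = timedelta(days=90),  # assumed drill interval
) -> List[str]:
    """Return a list of findings; an empty list means the record passes."""
    findings = []
    if record.copy_region is None or record.copy_region == record.source_region:
        findings.append("no cross-Region copy")
    if record.copy_account is None or record.copy_account == record.source_account:
        findings.append("no cross-account copy")
    if record.last_restore_test is None:
        findings.append("restore never tested")
    elif today - record.last_restore_test > max_drill_age:
        findings.append("restore drill overdue")
    return findings
```

A check like this could run on a schedule and feed an alert, so that a missing copy or a skipped restore drill is noticed before the backup is needed.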
Real reliability comes from practicing failure: game days, Region-failover drills (annually at minimum), restore drills, and AWS Fault Injection Service. The operational side — SLOs, actionable-only alerts, blameless retros, IaC — is what keeps the architecture working over time.
Backup & restore / Pilot light — cheapest, RTO hours to tens of minutes — for workloads that tolerate longer recovery.
Warm standby — a scaled-down full version always running, single-digit-minute RTO — the practical DR default.
Active-active — full versions everywhere, immediate recovery — for genuine global scale, hardest data model.
- Doing reliability work without defined RTO and RPO targets, so effort is unfocused and possibly mis-sized.
- Running single-AZ production — the same failure profile as a single data center, for small savings.
- Running an Aurora/DocumentDB/Neptune cluster with no reader in a second AZ, slowing failover.
- Treating untested backups as a recovery plan — a backup never restored is not a backup.
- Keeping backups only in the same Region and account as the data, defeating DR and ransomware protection.
- Building multi-Region DR and never drilling the failover, so it fails for real when it is needed.
- Define RTO and RPO per workload before designing for them.
- Make Multi-AZ the floor for production compute, databases, and caches.
- Use warm standby as the multi-Region DR default; active-active only for genuine global scale.
- Make recovery automatic with auto-scaling, health checks, and stateless design.
- Add application resilience: retries with backoff, timeouts, circuit breakers, idempotency, dead-letter queues.
- Back up cross-Region and cross-account, test restores, and drill failures (game days, Region failover) on a schedule.
Knowledge Check
What two targets should drive reliability and DR design?
- RTO (how long down) and RPO (how much data loss) — defined per workload
- CPU and memory utilization, measured continuously across the whole fleet of instances
- Cost and headcount
- Latency and throughput only
What is the reliability floor for a production workload on AWS?
- Multi-AZ — compute, databases, and caches spread across at least two Availability Zones
- Multi-Region active-active, with full live capacity running in two or more Regions at once
- A single large instance with frequent snapshots
- Single-AZ with a backup
Why are application-level patterns like idempotency and dead-letter queues part of reliability?
- SQS/SNS deliver at-least-once and components fail; idempotency, retries, and DLQs make the system survive partial failures
- They exist mainly to reduce the AWS bill by cutting the number of SQS requests and Lambda invocations a workload is charged for
- They replace the need for Multi-AZ, so an application that uses idempotency and dead-letter queues can safely run in a single Availability Zone
- They are only relevant to security, acting as controls that block unauthorized access to the messages flowing through SQS and SNS
What makes a backup actually a backup?
- Testing the restore — plus copying it cross-Region and cross-account
- Storing it in the same account and Region as the source data for the fastest possible access
- Taking it once and never verifying it
- Keeping it only in CloudWatch Logs