Chapter Seven

Scaling, Availability, and Reliability

What happens when a system gets too busy, or when part of it breaks? This chapter explains how the cloud handles load and failure, from the basic choice between a bigger machine and more machines to the mechanisms that keep systems running when things go wrong.

4 topics

Every system eventually runs into the same two questions: what do you do when there is too much work for it to handle, and what do you do when part of it stops working? The cloud does not magic these problems away — but it gives you tools for handling them that were previously out of reach for most organizations.

Four topics trace the answers. First, the core scaling decision: make one machine bigger, or add more machines? Then the cloud feature that automates scaling as demand changes: elasticity and auto-scaling. Then the design approach that survives failures: high availability through redundancy. And finally the safety net for the worst cases: backups and disaster recovery.

The right tool for the situation

Need more power for a growing workload? → Scale up (bigger machine) or scale out (more machines)

Traffic rises and falls unpredictably? → Elasticity: auto-scale to match demand

Must keep running if one component fails? → Redundancy across availability zones

Need to survive a major outage or mistake? → Backups and a disaster recovery plan