Chapter Seven
Scaling, Availability, and Reliability
What happens when a system gets too busy, or when part of it breaks? This chapter explains how the cloud handles load and failure, from the basic choice between one bigger machine and more machines to the mechanisms that keep systems running when things go wrong.
Every system eventually runs into the same two questions: what do you do when there is too much work for it to handle, and what do you do when part of it stops working? The cloud does not make these problems disappear, but it gives you tools for handling them that were previously out of reach for most organizations.
Four topics trace the answers. First, the core scaling decision: make one machine bigger, or add more machines? Then the cloud feature that automates that decision: elasticity and auto-scaling. Then the design approach that survives failures: high availability through redundancy. And finally the safety net for the worst cases: backups and disaster recovery.