Topic 32

Backups and Disaster Recovery

Concept

Redundancy keeps a system running through ordinary failures — a machine dies, another takes over. But redundancy alone cannot save you from a different category of problem: a bad software deployment that corrupts data on every replica simultaneously, a staff member who accidentally deletes a critical database, ransomware that encrypts everything it can reach, or a rare but real event that takes an entire cloud region offline.

For those situations, you need two separate things: backups — periodic copies of your data kept somewhere safe — and a disaster recovery plan — a tested procedure for getting your whole system running again after a major event.

Think of them like insurance and a fire drill. You hope never to use either, but the day you do, having them is everything. A business that has never run a recovery drill before a real disaster is likely to find that recovery takes far longer than expected.

What Backups Actually Do

A backup is a copy of your data taken at a point in time and stored separately from the live system. "Separately" matters: a backup sitting in the same storage bucket as the data it protects will be lost in the same disaster.

For important systems, good backup practice usually has three parts. Store copies separately from the live system: in another region, in another account or project, or in an immutable (object-lock) location. Keep multiple versions so you can roll back to a point before the problem happened. Test restores regularly, because a backup you have never tested may not work when you need it.
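The practices above (separate storage, multiple versions, tested restores) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library and temporary directories, not a real backup tool. The file name `orders.txt`, the `separate_location` directory, and all function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def take_backup(live_file: Path, backup_dir: Path, version: int) -> Path:
    """Copy the live file into a separate directory as a numbered version."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{live_file.name}.v{version}"
    shutil.copy2(live_file, target)
    return target


def restore_matches(backup_file: Path, expected_digest: str) -> bool:
    """Test restore: copy the backup to a scratch location and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        shutil.copy2(backup_file, restored)
        return sha256_of(restored) == expected_digest


def demo() -> tuple[bool, str]:
    """Run the scenario in temporary directories and report the outcome."""
    with tempfile.TemporaryDirectory() as root:
        live = Path(root) / "live" / "orders.txt"
        # Stand-in for another region, account, or object-lock location.
        backups = Path(root) / "separate_location"
        live.parent.mkdir()

        live.write_text("order 1\n")
        v1 = take_backup(live, backups, 1)
        v1_digest = sha256_of(live)

        live.write_text("order 1\norder 2\n")
        take_backup(live, backups, 2)

        live.write_text("")            # the bad change: data wiped
        take_backup(live, backups, 3)  # the newest backup now holds the bad state

        restore_ok = restore_matches(v1, v1_digest)
        last_good = (backups / "orders.txt.v2").read_text()
        return restore_ok, last_good


restore_ok, last_good = demo()
print(restore_ok)        # True
print(repr(last_good))   # 'order 1\norder 2\n'
```

Version 3 is a faithful copy of the empty file, which is why keeping only the most recent backup would not have helped here. Version 2 is the one worth restoring.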

Cloud providers offer automated backup services for most of their managed databases and storage products, but these usually must be turned on and configured. The assumption that "the cloud backs up my data automatically" is one of the most dangerous beliefs in cloud computing. Read the documentation for every service you depend on.

Why Redundancy Is Not a Backup

This is the most important distinction in this topic. Redundancy protects against hardware failure: if a machine dies, a live replica takes over smoothly. But the replica is a real-time mirror — it reflects every change immediately.

If a developer accidentally runs a command that deletes all the records in a table, that deletion is replicated to every copy in milliseconds. The redundant system is perfectly healthy; it is just running on perfectly empty data. Redundancy cannot help you here. Only a backup — a copy taken before the deletion — can restore what was lost.
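The scenario above can be modelled with a toy example. This sketch assumes a made-up `ReplicatedTable` class in which every change to the primary is mirrored to the replicas. It is not how any particular database implements replication, only an illustration of why a mirror cannot undo a mistake.

```python
import copy


class ReplicatedTable:
    """Toy model: a primary table with replicas that mirror every change."""

    def __init__(self, rows: dict[int, str], replica_count: int = 2) -> None:
        self.primary = dict(rows)
        self.replicas = [dict(rows) for _ in range(replica_count)]

    def delete_all(self) -> None:
        """An accidental 'delete everything' is applied to every copy."""
        self.primary.clear()
        for replica in self.replicas:
            replica.clear()

    def snapshot(self) -> dict[int, str]:
        """A backup: a point-in-time copy held outside the replicated system."""
        return copy.deepcopy(self.primary)

    def restore(self, backup: dict[int, str]) -> None:
        """Load the backup into the primary and every replica."""
        self.primary = dict(backup)
        self.replicas = [dict(backup) for _ in self.replicas]


table = ReplicatedTable({1: "alice", 2: "bob"})
backup = table.snapshot()   # taken before the mistake
table.delete_all()          # the mistake reaches every replica
rows_after_delete = [len(table.primary)] + [len(r) for r in table.replicas]
table.restore(backup)
rows_after_restore = [len(table.primary)] + [len(r) for r in table.replicas]

print(rows_after_delete)    # [0, 0, 0]
print(rows_after_restore)   # [2, 2, 2]
```

All three copies are healthy and in agreement after the deletion. They simply agree on an empty table until the snapshot is restored.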

Disaster Recovery: More Than Just Backups

Disaster recovery (DR) is the complete plan for getting a system running again after a major outage. Having backups is part of a DR plan, but it is only one part. A full DR plan answers questions like: where will you run the system if your primary region is down? How long will it take to restore from backup? Who is responsible for each step? Has that procedure been tested recently?

A disaster recovery plan that exists only as a document but has never been rehearsed is much weaker than one the team has practiced. The cloud makes it practical to test recovery regularly — you can spin up a restored environment, verify it works, and tear it down, all without touching the production system.

Two Important Dials: RPO and RTO

Every disaster recovery plan is tuned by two measures, usually stated in plain language even though they have technical names.

The first is RPO — Recovery Point Objective. It answers: how much data can we afford to lose? If your RPO is one hour, then your backups must be taken at least every hour, so that the worst case is losing up to one hour of data. If your RPO is 24 hours (backups once a day), you might lose an entire day of transactions.
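The RPO arithmetic can be written out as a short sketch. The function names and timestamps below are hypothetical, and the check deliberately ignores how long a backup takes to complete, which a real plan would also need to account for.

```python
from datetime import datetime, timedelta


def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written after the last backup is lost when restoring from it."""
    return failure - last_backup


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case is a failure just before the next backup: about one interval lost."""
    return backup_interval <= rpo


rpo = timedelta(hours=1)

# Hypothetical timestamps, for illustration only.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 2, 45)

print(data_lost(last_backup, failure))               # 0:45:00
print(schedule_meets_rpo(timedelta(hours=1), rpo))   # True
print(schedule_meets_rpo(timedelta(hours=24), rpo))  # False
```

With hourly backups the 45 minutes lost in this example stays inside the one-hour RPO. A daily schedule could not guarantee that.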

The second is RTO — Recovery Time Objective. It answers: how quickly must we be back up? If your RTO is two hours, then your DR plan and infrastructure must be capable of getting the system running within two hours of a disaster. A four-hour restore-from-backup process would fail that target.
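An RTO check amounts to adding up how long each recovery step takes and comparing the total with the target. The sketch below uses invented step names and durations. Only the two-hour RTO and the four-hour restore come from the example above, and a real plan would use timings measured in a drill.

```python
from datetime import timedelta


def estimated_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Total recovery time if the steps run one after another."""
    return sum(steps.values(), timedelta())


def meets_rto(steps: dict[str, timedelta], rto: timedelta) -> bool:
    """The plan meets the target only if the whole sequence fits inside the RTO."""
    return estimated_recovery_time(steps) <= rto


rto = timedelta(hours=2)

# Hypothetical durations, for illustration only.
slow_plan = {
    "declare disaster and assemble team": timedelta(minutes=20),
    "provision replacement infrastructure": timedelta(minutes=30),
    "restore data from backup": timedelta(hours=4),
    "verify and switch traffic": timedelta(minutes=25),
}
# Same plan with a faster restore step, everything else unchanged.
fast_plan = {**slow_plan, "restore data from backup": timedelta(minutes=40)}

print(estimated_recovery_time(slow_plan), meets_rto(slow_plan, rto))  # 5:15:00 False
print(estimated_recovery_time(fast_plan), meets_rto(fast_plan, rto))  # 1:55:00 True
```

The restore itself is only one step. Even the faster plan spends more than an hour on everything around it, which is why the timings need to come from a rehearsal and not from the restore speed alone.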

Tighter RPO and RTO are more expensive: more frequent backups, more stand-by infrastructure. The right values depend on what the business can tolerate — and both are ultimately business decisions, not purely technical ones.

Backup and Recovery Flow

Live Data (system in use)
→ Regular Backup (stored separately from live data)
→ Disaster Strikes (outage or data loss)
→ Restore & Resume (system back online)

Common Confusions

"Redundancy is the same as a backup." Redundancy protects against hardware failure by keeping a live copy. Backups protect against data loss — deletions, corruptions, ransomware — by keeping a historical copy taken before the problem happened. They solve different problems and you need both.
"The cloud backs up my data automatically." Many cloud services do not enable automatic backups by default, or back up only some data. You must read the documentation for each service and configure the backup policy yourself. Never assume.
"Disaster recovery is just having backups." Backups are one input. DR is the full plan: tested procedures, people who know their roles, infrastructure ready to receive a restore, and a target recovery time you have actually verified is achievable.

Why It Matters

"Do we have backups?" and "What is our DR plan?" are among the first questions asked after any serious incident — and they are questions every manager and team lead should be able to answer before an incident happens.
RPO and RTO are the business dials that translate risk tolerance into engineering requirements. Understanding them lets you participate in those decisions, not just receive the results.
The distinction between redundancy and backups is widely misunderstood, even by people who work in technology. Getting it right puts you ahead of a large share of the conversation.

Knowledge Check

A team has redundancy across three availability zones but no backups. What risk remains?

Traffic between zones travels unencrypted and could be intercepted
Running in multiple zones creates duplicate billing and wastes money
A deleted record or corrupted data replicates to every zone at once
Load balancers stop working correctly without their own backups

In plain terms, what does RPO (Recovery Point Objective) measure?

How much data you can afford to lose, measured in time
How quickly the system must be running again after an outage
How many backup copies are required to meet compliance rules
How quickly a replacement machine boots up after the original fails

What does a disaster recovery plan add beyond just having backups?

A weekly automated test confirming that each backup is stored correctly and can be restored
The full tested plan for getting the whole system running again after a major outage
Extra backup copies stored in a third-party facility outside the cloud
Automatic real-time replication of every change to a standby region
