High Availability & DR
High availability keeps a workload running through component and datacenter failures; disaster recovery brings it back after a larger loss, such as a whole region. They are different problems with different mechanisms, and conflating them — assuming a highly available design is also disaster-recoverable, or vice versa — is how teams discover a gap during the incident that exposes it.
Every design here starts from two numbers: the recovery point objective (how much data you can afford to lose) and the recovery time objective (how long recovery may take). Set RPO and RTO first; they decide whether you need zone redundancy, region pairing, continuous replication, or just backups — and how much you will spend.
Availability Zones
Availability zones are physically separate datacenters within a region. Spreading a workload's instances and data across zones is the baseline for high availability: zones within a region are close enough for low-latency synchronous replication, so the workload can fail over automatically when a datacenter fails, at modest added cost. Most production workloads should be zone-redundant before any cross-region design is considered.
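To make the effect of spreading instances concrete, here is a minimal Python sketch. It is a toy model, not an Azure API: the function names, the zone labels "1", "2", "3", and the round-robin placement are all assumptions chosen for illustration. It shows how many instances remain after the worst single-zone failure.

```python
from collections import Counter


def spread_across_zones(instance_count, zones=("1", "2", "3")):
    """Assign instances to zones round-robin (hypothetical placement policy)."""
    return [zones[i % len(zones)] for i in range(instance_count)]


def surviving_instances(placement, failed_zone):
    """Count the instances still running if one zone is lost."""
    return sum(1 for zone in placement if zone != failed_zone)


# Six instances spread over three zones: two per zone.
placement = spread_across_zones(6)
per_zone = Counter(placement)

# Worst case over every possible single-zone failure.
worst_case = min(surviving_instances(placement, z) for z in per_zone)
print(per_zone)
print(worst_case)  # 4 of 6 instances survive any single-zone failure

# The same six instances in one zone do not survive that zone failing.
single_zone = spread_across_zones(6, zones=("1",))
print(surviving_instances(single_zone, "1"))  # 0
```

The comparison at the end is the point: the single-zone placement loses everything in the same failure that the zone-spread placement rides out with two thirds of its capacity.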
Paired Regions and DR
For disaster recovery, a second region provides somewhere to recover to when the primary is lost. Azure region pairs offer sequential platform updates and, for some services, built-in replication. The DR mechanism — backup restore, asynchronous replication, or continuous replication with failover — follows from the RTO and RPO you set, not the other way around.
RPO and RTO
RPO is the maximum acceptable data loss, measured as a time window; RTO is the maximum acceptable downtime. A near-zero RPO demands continuous or synchronous replication and costs accordingly; an RPO of hours may be met by periodic backups. Stating these numbers per workload turns DR from a vague aspiration into a design with a known cost.
| Mechanism | Typical RPO | Typical RTO |
|---|---|---|
| Backups (Azure Backup) | Hours | Hours |
| Async replication | Minutes | Minutes–hours |
| Site Recovery (continuous async) | Seconds–minutes | Minutes |
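The table can be read as a lookup: given a workload's RPO and RTO, pick the least expensive mechanism whose typical figures fit. The sketch below is one way to express that in Python. The numeric durations are placeholder assumptions standing in for the table's qualitative entries ("Hours", "Minutes", "Seconds–minutes"), and the cost ranking is likewise an assumption; substitute measured values for a real workload.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Mechanism:
    name: str
    typical_rpo: timedelta
    typical_rto: timedelta
    relative_cost: int  # ASSUMPTION: illustrative ranking, 1 = cheapest


# ASSUMPTION: these durations are placeholders, not published Azure figures.
MECHANISMS = [
    Mechanism("Backups (Azure Backup)", timedelta(hours=12), timedelta(hours=8), 1),
    Mechanism("Async replication", timedelta(minutes=15), timedelta(hours=1), 2),
    Mechanism("Site Recovery (continuous async)", timedelta(minutes=1), timedelta(minutes=15), 3),
]


def cheapest_mechanism(rpo, rto):
    """Return the lowest-cost mechanism meeting both targets, or None."""
    candidates = [
        m for m in MECHANISMS
        if m.typical_rpo <= rpo and m.typical_rto <= rto
    ]
    return min(candidates, key=lambda m: m.relative_cost, default=None)


# A workload that tolerates a day of loss and a day of downtime.
relaxed = cheapest_mechanism(timedelta(hours=24), timedelta(hours=24))
print(relaxed.name)  # Backups (Azure Backup)

# A workload with tight targets.
tight = cheapest_mechanism(timedelta(minutes=5), timedelta(minutes=30))
print(tight.name)  # Site Recovery (continuous async)
```

Selecting on cost only after filtering on RPO and RTO keeps the targets in charge of the design, and it avoids paying for tighter recovery than the workload needs.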
Backup vs Replication vs Site Recovery
These three are not interchangeable. Azure Backup retains recovery points for restore and ransomware resilience. Replication (built into many data services) keeps a near-current copy for low data loss. Azure Site Recovery orchestrates failover of whole VMs to another region for the lowest RTO. A complete strategy usually uses backups for retention and replication or Site Recovery for fast recovery — they solve different parts of the problem.
High Availability — Survives component and datacenter failures within a region — zone redundancy, automatic failover. Keeps the workload running.
Disaster Recovery — Recovers after a larger loss such as a region outage — backups, replication, or Site Recovery to a second region. Brings the workload back.
Common Pitfalls
- Conflating HA and DR — assuming a zone-redundant design also survives a region loss, or that backups alone provide high availability.
- Designing the mechanism before setting RPO and RTO, so the solution does not match the actual tolerance for loss and downtime.
- Running production single-zone when zone redundancy was the cheap baseline for surviving a datacenter failure.
- Treating backups as a DR plan for low-RTO workloads, when restore takes far longer than the RTO allows.
- Building a multi-region DR design and never testing a failover, so it fails when it is finally needed.
- Paying for near-zero RPO continuous replication on a workload whose real tolerance was hours.
Best Practices
- Set RPO and RTO per workload first, then choose the mechanism that meets them.
- Make production workloads zone-redundant as the baseline before considering cross-region DR.
- Use a second region for DR sized to the RTO/RPO — backups for relaxed targets, replication or Site Recovery for tight ones.
- Combine Azure Backup (retention) with replication or Site Recovery (fast recovery) — they solve different problems.
- Test failovers regularly; an untested DR plan is an assumption, not a capability.
- Match the spend to the requirement — do not buy near-zero RPO for a workload that tolerates hours.
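The advice to test failovers regularly is easier to act on when each drill is scored against the stated targets. The following Python sketch assumes a drill records three timestamps; the `DrillResult` type, its field names, and the sample times are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    """Timestamps captured during a failover drill (hypothetical record)."""
    outage_start: datetime
    last_recovery_point: datetime
    service_restored: datetime


def evaluate_drill(result, rpo, rto):
    """Compare measured data loss and downtime with the RPO and RTO."""
    data_loss = result.outage_start - result.last_recovery_point
    downtime = result.service_restored - result.outage_start
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Sample drill: last recovery point 10 minutes before the outage,
# service restored 45 minutes after it.
drill = DrillResult(
    outage_start=datetime(2024, 1, 10, 12, 0),
    last_recovery_point=datetime(2024, 1, 10, 11, 50),
    service_restored=datetime(2024, 1, 10, 12, 45),
)
report = evaluate_drill(drill, rpo=timedelta(minutes=15), rto=timedelta(minutes=30))
print(report["rpo_met"])  # True: 10 minutes of loss is within a 15-minute RPO
print(report["rto_met"])  # False: 45 minutes of downtime exceeds a 30-minute RTO
```

In this sample the design meets its RPO but misses its RTO, which is the kind of gap that only a drill reveals before a real incident does.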
Knowledge Check
What two numbers should drive a high-availability and DR design?
- RPO (acceptable data loss) and RTO (acceptable downtime)
- The total vCPU count and the amount of memory provisioned per VM
- The number of regions and the number of availability zones used
- The peak ingress and egress network bandwidth the app sustains
How do high availability and disaster recovery differ?
- HA keeps a workload running through datacenter failures within a region; DR recovers it after a larger loss such as a region outage
- They are the same resilience concept under two different marketing names, so a zone-redundant design already covers both of them equally
- HA protects the data tier while DR protects the compute tier
- HA requires a paired second region, whereas DR runs in one region
Why are backups alone insufficient for a low-RTO workload?
- Restoring from backup takes far longer than a tight RTO allows; replication or Site Recovery is needed for fast recovery
- Recovery points cannot be replicated to a paired second region for safekeeping, so the only copy sits in the primary region
- Backups offer no protection against a ransomware encryption event
- Azure Backup can only capture database workloads, nothing else