High Availability & DR

Reliability

High availability keeps a workload running through component and datacenter failures; disaster recovery brings it back after a larger loss, such as a whole region. They are different problems with different mechanisms, and conflating them — assuming a highly available design is also disaster-recoverable, or vice versa — is how teams discover a gap during the incident that exposes it.

Every design here starts from two numbers: the recovery point objective (how much data you can afford to lose) and the recovery time objective (how long recovery may take). Set RPO and RTO first; they decide whether you need zone redundancy, region pairing, continuous replication, or just backups — and how much you will spend.

Availability Zones

Availability zones are physically separate datacenters within a region. Spreading a workload's instances and data across zones is the baseline for high availability: the workload survives a datacenter failure through low-latency synchronous replication and automatic failover, at modest added cost. Most production workloads should be zone-redundant before any cross-region design is considered.
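
As a rough illustration of why spreading matters, the sketch below places instances round-robin across zones and counts how many remain after the worst single-zone loss. The zone labels and both helper functions are hypothetical and are not part of any Azure API; in practice the platform handles placement when a resource is configured as zone-redundant.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin; returns one zone label per instance."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def survivors_after_zone_loss(placement):
    """Worst-case number of instances left after losing any single zone."""
    if not placement:
        return 0
    per_zone = Counter(placement)
    return len(placement) - max(per_zone.values())


# Hypothetical zone labels, used only for illustration.
zone_redundant = spread_across_zones(6, ["1", "2", "3"])
single_zone = spread_across_zones(6, ["1"])

print(survivors_after_zone_loss(zone_redundant))  # 4
print(survivors_after_zone_loss(single_zone))     # 0
```

With three zones, losing one still leaves two thirds of the instances serving traffic; with a single zone, the same failure takes out everything.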

Paired Regions and DR

For disaster recovery, a second region provides somewhere to recover to when the primary is lost. Azure region pairs offer sequential platform updates and, for some services, built-in replication. The DR mechanism — backup restore, asynchronous replication, or continuous replication with failover — follows from the RTO and RPO you set, not the other way around.

RPO and RTO

RPO is the maximum acceptable data loss, measured as a time window; RTO is the maximum acceptable downtime. A near-zero RPO demands continuous or synchronous replication and costs accordingly; an RPO of hours may be met by periodic backups. Stating these numbers per workload turns DR from a vague aspiration into a design with a known cost.
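
To make the backup case concrete, here is a minimal sketch of the arithmetic, assuming recovery points are taken at a fixed interval and each one takes a known time to become restorable. Under that assumption the worst case is a failure just before the next recovery point completes, so the data lost is roughly the interval plus the backup duration. The function names and sample values are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Upper bound on lost data when failure hits just before the next backup completes.

    Assumption: fixed-interval recovery points, each restorable only once
    backup_duration has elapsed after it started.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval, rpo, backup_duration=timedelta(0)):
    """True if the worst-case loss of this schedule stays within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Daily backups cannot satisfy a 4-hour RPO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=4)))  # False

# Hourly backups that take 30 minutes give a 1.5-hour worst case, within 4 hours.
print(meets_rpo(timedelta(hours=1), rpo=timedelta(hours=4),
                backup_duration=timedelta(minutes=30)))        # True
```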

Mechanism                          Typical RPO       Typical RTO
Backups (Azure Backup)             Hours             Hours
Async replication                  Minutes           Minutes–hours
Site Recovery (continuous async)   Seconds–minutes   Minutes

DR Mechanisms: Cheaper and Slower vs Costlier and Faster
  • Backups: Restore from retained recovery points. Hours of RPO and RTO.
  • Async replication: A near-current copy in a second region. Minutes of RPO.
  • Site Recovery: Continuous async replication + orchestrated failover. Low RPO, lowest RTO.
The spectrum runs from lower cost with higher RPO/RTO (acceptable when you can lose hours and recover slowly) to higher cost with near-zero RPO/RTO (required when downtime and data loss must be minimal).
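
The selection logic implied by the comparison above can be sketched as a lookup that walks the mechanisms from cheapest to costliest and returns the first one whose typical RPO and RTO fit the targets. The numeric bounds below are illustrative placeholders standing in for "hours", "minutes", and "seconds to minutes"; they are assumptions, not documented service figures, and should be replaced with values measured for the workload.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Mechanism:
    name: str
    typical_rpo: timedelta
    typical_rto: timedelta


# Ordered cheapest first. All durations are assumed placeholders.
MECHANISMS = [
    Mechanism("Backups (Azure Backup)", timedelta(hours=24), timedelta(hours=24)),
    Mechanism("Async replication", timedelta(minutes=15), timedelta(hours=4)),
    Mechanism("Site Recovery (continuous async)", timedelta(minutes=5), timedelta(minutes=30)),
]


def cheapest_mechanism(rpo: timedelta, rto: timedelta) -> Optional[Mechanism]:
    """Return the cheapest mechanism whose typical RPO and RTO meet the targets."""
    for mechanism in MECHANISMS:
        if mechanism.typical_rpo <= rpo and mechanism.typical_rto <= rto:
            return mechanism
    return None  # No listed mechanism is tight enough for these targets.


choice = cheapest_mechanism(rpo=timedelta(hours=1), rto=timedelta(hours=8))
print(choice.name)  # Async replication
```

Walking the list in cost order is what keeps the result from over-buying: a workload that tolerates a day of loss and downtime stops at backups and never reaches the costlier options.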

Backup vs Replication vs Site Recovery

These three are not interchangeable. Azure Backup retains recovery points for restore and ransomware resilience. Replication (built into many data services) keeps a near-current copy for low data loss. Azure Site Recovery orchestrates failover of whole VMs to another region for the lowest RTO. A complete strategy usually uses backups for retention and replication or Site Recovery for fast recovery — they solve different parts of the problem.

HA vs DR

High Availability — Survives component and datacenter failures within a region — zone redundancy, automatic failover. Keeps the workload running.

Disaster Recovery — Recovers after a larger loss such as a region outage — backups, replication, or Site Recovery to a second region. Brings the workload back.

Common Mistakes
  • Conflating HA and DR — assuming a zone-redundant design also survives a region loss, or that backups alone provide high availability.
  • Designing the mechanism before setting RPO and RTO, so the solution does not match the actual tolerance for loss and downtime.
  • Running production single-zone when zone redundancy was the cheap baseline for surviving a datacenter failure.
  • Treating backups as a DR plan for low-RTO workloads, when restore takes far longer than the RTO allows.
  • Building a multi-region DR design and never testing a failover, so it fails when it is finally needed.
  • Paying for near-zero RPO continuous replication on a workload whose real tolerance was hours.
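
Several of these mistakes can be caught by a simple review of a workload's stated plan. The sketch below is one hypothetical way to encode such a review; the `WorkloadPlan` fields, the mechanism labels, and the thresholds (a 4-hour backup restore time, a 180-day failover test window, a 1-hour cutoff for "relaxed" RPO) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class WorkloadPlan:
    name: str
    rpo: Optional[timedelta]
    rto: Optional[timedelta]
    zone_redundant: bool
    mechanism: str  # assumed labels: "backup", "async_replication", "site_recovery"
    last_failover_test: Optional[date]


def review_plan(plan: WorkloadPlan, today: date,
                backup_restore_time: timedelta = timedelta(hours=4),
                max_test_age: timedelta = timedelta(days=180)) -> List[str]:
    """Return findings for the plan; an empty list means none of these checks fired."""
    findings = []
    if plan.rpo is None or plan.rto is None:
        findings.append("RPO and RTO are not set")
    if not plan.zone_redundant:
        findings.append("production workload is single-zone")
    if plan.mechanism == "backup" and plan.rto is not None and plan.rto < backup_restore_time:
        findings.append("backup restore is slower than the RTO")
    if plan.mechanism == "site_recovery" and plan.rpo is not None and plan.rpo >= timedelta(hours=1):
        findings.append("continuous replication exceeds the stated RPO need")
    if plan.mechanism != "backup" and (
        plan.last_failover_test is None
        or today - plan.last_failover_test > max_test_age
    ):
        findings.append("failover has not been tested recently")
    return findings


orders = WorkloadPlan("orders", timedelta(minutes=15), timedelta(hours=1),
                      zone_redundant=False, mechanism="backup", last_failover_test=None)
for finding in review_plan(orders, today=date(2024, 1, 1)):
    print(finding)
# production workload is single-zone
# backup restore is slower than the RTO
```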

Best Practices
  • Set RPO and RTO per workload first, then choose the mechanism that meets them.
  • Make production workloads zone-redundant as the baseline before considering cross-region DR.
  • Use a second region for DR sized to the RTO/RPO — backups for relaxed targets, replication or Site Recovery for tight ones.
  • Combine Azure Backup (retention) with replication or Site Recovery (fast recovery) — they solve different problems.
  • Test failovers regularly; an untested DR plan is an assumption, not a capability.
  • Match the spend to the requirement — do not buy near-zero RPO for a workload that tolerates hours.

Comparable services
  • AWS: Multi-AZ / Multi-Region, Elastic DR
  • GCP: Zonal/Regional resources, Backup and DR

Knowledge Check

What two numbers should drive a high-availability and DR design?

  • RPO (acceptable data loss) and RTO (acceptable downtime)
  • The total vCPU count and the amount of memory provisioned per VM
  • The number of regions and the number of availability zones used
  • The peak ingress and egress network bandwidth the app sustains

How do high availability and disaster recovery differ?

  • HA keeps a workload running through datacenter failures within a region; DR recovers it after a larger loss such as a region outage
  • They are the same resilience concept under two different marketing names, so a zone-redundant design already covers both of them equally
  • HA protects the data tier while DR protects the compute tier
  • HA requires a paired second region, whereas DR runs in one region

Why are backups alone insufficient for a low-RTO workload?

  • Restoring from backup takes far longer than a tight RTO allows; replication or Site Recovery is needed for fast recovery
  • Recovery points cannot be replicated to a paired second region for safekeeping, so the only copy sits in the primary region
  • Backups offer no protection against a ransomware encryption event
  • Azure Backup can only capture database workloads, nothing else
