Multi-Region Architecture

Resilience

Multi-region architecture spreads a workload across two or more Azure regions to survive a region-wide outage or to serve users with low latency worldwide. It is the most powerful resilience and reach pattern, and also the most expensive and complex, so it should be justified by a real requirement (an availability target or a global user base) rather than adopted by default.

The core decision is active-passive versus active-active, and it cascades into everything else: how data is replicated, how traffic is routed, and how much complexity the team takes on. Most workloads do not need multi-region; zone redundancy within one region meets their availability target far more cheaply. Reach for multi-region when the requirement genuinely demands it.
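Whether a requirement "genuinely demands" a second region is easier to judge with the arithmetic written out. The sketch below converts an availability target into a yearly downtime budget and estimates the combined availability of several regions. It assumes regional failures are independent and failover is instant and flawless, which real systems only approximate; the function names are illustrative, and no figure here is an Azure SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year implied by an availability target in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Availability when any one of N regions can serve the workload.

    Assumption: regions fail independently and failover is perfect, so the
    workload is down only when every region is down at the same time.
    """
    if regions < 1:
        raise ValueError("at least one region is required")
    return 1.0 - (1.0 - per_region) ** regions


def needs_more_than_one_region(target: float, single_region: float) -> bool:
    """True when a single region's (hypothetical) availability misses the target."""
    return single_region < target
```

Under these assumptions a 99.9% target allows about 525.6 minutes of downtime a year and a 99.99% target about 52.56 minutes. If a zone-redundant single region already clears the target, the comparison comes out against multi-region.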

Active-Passive vs Active-Active

In active-passive, one region serves traffic and a second stands ready to take over on failure. It is simpler, at the price of a failover step and some recovery time. In active-active, all regions serve traffic simultaneously, which gives the lowest latency and fastest recovery but forces hard problems: data consistency across regions, conflict resolution, and testing that every region actually works under live traffic. Active-passive is the common, pragmatic choice; active-active is for the workloads that truly need it.
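The "failover step" in active-passive can be pictured as a small state machine. The following is a minimal sketch, not any Azure service's behaviour: the class name, the threshold of three consecutive failed probes, and the region labels are all assumptions chosen for illustration, and failback is deliberately left as a manual decision.

```python
class ActivePassiveController:
    """Tracks which region is active in a two-region active-passive pair."""

    def __init__(self, primary: str, secondary: str, failure_threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        # Hypothetical setting: consecutive failed probes needed before failover.
        self.failure_threshold = failure_threshold
        self.active = primary
        self._consecutive_failures = 0

    def record_probe(self, primary_healthy: bool) -> str:
        """Feed one health-probe result for the primary; return the active region."""
        if self.active != self.primary:
            # Already failed over. This sketch never fails back automatically.
            return self.active
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self.active = self.secondary
        return self.active
```

Requiring several consecutive failures avoids failing over on a single dropped probe, at the cost of a longer detection time. That trade-off is part of the recovery time the paragraph above mentions.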

Data Replication

Data is the hard part of multi-region. Stateless compute is easy to run in two places; data is not. Asynchronous replication is cheaper but gives a non-zero RPO, because writes not yet replicated when the primary fails are lost; synchronous replication avoids that loss but adds write latency and is limited by distance; Cosmos DB's multi-region writes handle distribution but require a conflict-resolution policy. The data tier's replication model, and the RPO it implies, usually dictates the whole multi-region design.
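Two of the ideas above can be made concrete in a few lines: the data lost to asynchronous replication lag on failover, and one possible conflict-resolution rule for multi-region writes. This is a conceptual sketch only. The `Write` record, the function names, and the tie-break on region name are assumptions for illustration; last-writer-wins is shown as one common policy, not as the implementation of any particular database.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: float  # hypothetical commit time, in seconds
    region: str


def writes_lost_on_failover(primary_log: Sequence[Write],
                            replicated_up_to: float) -> List[Write]:
    """Writes committed on the primary that the secondary never received.

    With asynchronous replication, everything newer than the last replicated
    timestamp is lost if the primary fails; that window is the effective RPO.
    """
    return [w for w in primary_log if w.timestamp > replicated_up_to]


def last_writer_wins(a: Write, b: Write) -> Write:
    """Resolve two conflicting writes to the same key.

    The later timestamp wins; the region name breaks exact ties so that
    every region picks the same winner. The losing write is discarded.
    """
    if a.key != b.key:
        raise ValueError("conflict resolution applies to writes on the same key")
    return max(a, b, key=lambda w: (w.timestamp, w.region))
```

Last-writer-wins is simple and deterministic, but it silently drops the losing update, so it suits data where overwriting is acceptable and not data where both updates must be merged.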

Global Routing

Traffic is steered across regions by Front Door (Layer 7, edge-terminated, with health-based failover and caching) or Traffic Manager (DNS-level routing). Front Door fits HTTP applications needing edge performance; Traffic Manager fits non-HTTP endpoints or pure DNS direction. The routing layer is what makes a region failure or a latency-based decision invisible to the user.
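The routing decision itself can be modelled without reference to either product. The sketch below shows two selection rules over health-probed endpoints: priority order (typical for active-passive) and lowest latency (typical for active-active). It is a conceptual model under assumed names and fields, not the Front Door or Traffic Manager API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    region: str
    priority: int      # lower number is preferred (hypothetical field)
    latency_ms: float  # latency measured from the caller (hypothetical field)
    healthy: bool      # result of the latest health probe


def route_by_priority(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Pick the healthy endpoint with the best priority; None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


def route_by_latency(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Pick the healthy endpoint with the lowest latency; None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.latency_ms) if healthy else None
```

In both rules an unhealthy region simply drops out of the candidate set, which is how a region failure stays invisible to the user as long as another region is healthy.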

Cost and Complexity

Multi-region roughly doubles infrastructure and adds the ongoing cost of replication, cross-region traffic, and operational complexity — every deployment, test, and incident now spans regions. That cost is worth paying for a genuine global-availability or latency requirement and wasteful otherwise. Confirm that zone redundancy cannot meet the target before committing to multi-region.
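A rough cost model makes the "roughly doubles" claim easy to test against a real budget. The sketch below is an assumption-laden estimate: the function name, the `standby_fraction` parameter (for a scaled-down standby region), and every input are hypothetical, and it ignores the operational cost of deployments, tests, and incidents that span regions.

```python
def multi_region_monthly_cost(single_region_infra: float,
                              regions: int,
                              standby_fraction: float = 1.0,
                              replication: float = 0.0,
                              cross_region_traffic: float = 0.0) -> float:
    """Estimate monthly cost for a multi-region deployment.

    The first region is charged in full; each additional region is charged at
    standby_fraction of that cost (1.0 means a full-size copy). Replication
    and cross-region traffic are added on top.
    """
    if regions < 1:
        raise ValueError("at least one region is required")
    if not 0.0 <= standby_fraction <= 1.0:
        raise ValueError("standby_fraction must be between 0 and 1")
    infra = single_region_infra * (1 + (regions - 1) * standby_fraction)
    return infra + replication + cross_region_traffic
```

With a full-size second region the infrastructure term is exactly twice the single-region figure before replication and traffic are added, which is the doubling described above.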

Active-passive vs active-active

Active-passive: One region serves, a second fails over on outage. Simpler and cheaper, with a failover step and recovery time. The common choice.

Active-active: All regions serve simultaneously. Lowest latency and fastest recovery, but hard data-consistency and conflict problems. For workloads that truly need it.

Common Mistakes

Going multi-region when zone redundancy within one region would meet the availability target far more cheaply.
Choosing active-active without solving cross-region data consistency and conflict resolution, producing subtle data corruption.
Treating data replication as an afterthought when it is the hard part that dictates the whole design.
Building the multi-region topology and never testing a regional failover, so it fails when needed.
Underestimating the doubled infrastructure, replication, and operational cost of running in two regions.
Routing with the wrong layer — Traffic Manager where edge caching and L7 failover from Front Door were needed, or vice versa.

Best Practices

Confirm zone redundancy cannot meet the requirement before committing to multi-region.
Default to active-passive unless the workload genuinely needs active-active's latency and recovery.
Design the data-replication model and its RPO first — it dictates the rest of the architecture.
Route with Front Door for HTTP edge performance and health-based failover, or Traffic Manager for DNS-level direction.
Test regional failover regularly so the design is a proven capability, not an assumption.
Account for the doubled infrastructure, replication, and operational cost before committing.

Comparable services

AWS Multi-Region architecture
GCP Multi-region / global resources

Knowledge Check

When is multi-region architecture the wrong choice?

When zone redundancy within one region would meet the availability target far more cheaply
When the workload serves a global user base spread across several continents and time zones
When an availability target requires surviving a full region outage
When network latency to distant users genuinely matters to them

What is the hardest part of a multi-region design?

Data — replication, consistency, and conflict resolution across regions; stateless compute is the easy part
Choosing the right VM sizes and instance families for each region
Configuring the DNS records and health probes for Front Door or Traffic Manager routing across all the regions
Tagging every resource consistently for cost allocation per region

What is the main trade-off of active-active versus active-passive?

Active-active gives the lowest latency and fastest recovery but forces hard cross-region data-consistency and conflict problems
Active-active is the simpler and cheaper option to build and operate, with no added data-consistency or conflict-resolution cost at all
Active-passive serves live traffic from every region at the same time
Active-passive cannot survive a region outage even with a standby ready
