Topic 60

Multi-Cluster and Multi-Region

Topology, Scale

One cluster is enough for most workloads, but not all. Multi-cluster and multi-region topologies exist for blast-radius isolation, geographic proximity, and regulatory boundaries — and they bring their own hard problems in networking, identity, and data.

The temptation is to reach for many clusters too early. This topic is about when the complexity is justified, the shapes it takes, and which parts are genuinely difficult.

Why More Than One Cluster

The legitimate reasons are specific. Blast radius: a cluster is a failure domain, so separating prod from non-prod, or critical from experimental, limits how far a problem spreads. Geography: serving users from a nearby region cuts latency, and a region is also a disaster boundary. Regulation: data-residency rules may require workloads and data to stay in a specific country. Scale: a single cluster has practical size limits (the Kubernetes documentation cites support for clusters of up to 5,000 nodes and 150,000 Pods). Absent one of these, a single well-run cluster with namespaces is simpler and usually better.
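The decision rule in this section can be written down as a minimal sketch. The `Driver` enum and the `recommend_topology` function are hypothetical names introduced for illustration only; they encode the rule "default to one cluster unless a named driver applies" and nothing more.

```python
from enum import Enum


class Driver(Enum):
    """The four drivers named in this section (names are illustrative)."""
    BLAST_RADIUS = "blast radius"
    GEOGRAPHY = "geography"
    REGULATION = "regulation"
    SCALE = "scale"


def recommend_topology(drivers: set[Driver]) -> str:
    """Recommend a single cluster unless at least one concrete driver applies."""
    if not drivers:
        return "single cluster with namespaces"
    reasons = ", ".join(sorted(d.value for d in drivers))
    return f"multiple clusters (justified by: {reasons})"


print(recommend_topology(set()))
print(recommend_topology({Driver.REGULATION, Driver.GEOGRAPHY}))
```

The point of the sketch is that the justification is explicit: if the set of drivers is empty, the answer is one cluster.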

Topologies

Figure: hub-and-spoke fleet, in which a management cluster governs many workload clusters.

Common shapes: per-region clusters serving local users (with global traffic routing in front); per-environment or per-team clusters for isolation; and hub-and-spoke, where a management cluster runs shared tooling (GitOps, observability, policy) that governs many workload clusters. Most real fleets are some combination. The key design choice is what is global (traffic entry, identity, config source) versus what is per-cluster (workloads, data).
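One way to make the global versus per-cluster split concrete is to model a fleet as data. This is a sketch under stated assumptions: the class names, field names, hostnames, and cluster names below are all hypothetical and do not correspond to any Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadCluster:
    """A spoke: everything here is per-cluster."""
    name: str
    region: str
    environment: str  # e.g. "prod" or "staging"
    workloads: list[str] = field(default_factory=list)


@dataclass
class Fleet:
    """A hub-and-spoke fleet: global concerns are defined once."""
    traffic_entry: str       # global: where user traffic enters
    identity_provider: str   # global: one identity source for all clusters
    config_source: str       # global: the repository the fleet is reconciled from
    management_cluster: str  # the hub running shared tooling
    spokes: list[WorkloadCluster] = field(default_factory=list)

    def clusters_in(self, region: str) -> list[str]:
        """Names of workload clusters located in the given region."""
        return [c.name for c in self.spokes if c.region == region]


# Hypothetical example values, chosen only for illustration.
fleet = Fleet(
    traffic_entry="global-lb.example.com",
    identity_provider="sso.example.com",
    config_source="git@example.com:platform/fleet-config.git",
    management_cluster="mgmt",
    spokes=[
        WorkloadCluster("prod-eu", "eu-west", "prod", ["checkout"]),
        WorkloadCluster("prod-us", "us-east", "prod", ["checkout"]),
        WorkloadCluster("staging-eu", "eu-west", "staging", ["checkout"]),
    ],
)
print(fleet.clusters_in("eu-west"))  # ['prod-eu', 'staging-eu']
```

This fleet combines two shapes at once: per-region clusters (eu-west, us-east) and per-environment clusters (prod, staging), all governed by one hub.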

The Hard Parts

Multi-cluster is hard in three places. Networking: cross-cluster service discovery and connectivity need a multi-cluster mesh or fleet networking, because in-cluster DNS by default only resolves Services inside its own cluster. Identity and config: keeping RBAC, policy, and workloads consistent across clusters demands GitOps as the single source of truth (Topic 43), since drift across a fleet is far worse than in one cluster. Data: this is the genuinely hard one. Replicating stateful data across regions involves real latency and consistency trade-offs that Kubernetes does not solve for you.
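To illustrate what drift across a fleet means, the sketch below compares a desired state (as it might be declared in Git) with the state observed in each cluster. It is a simplification, not how any particular GitOps tool works: the `find_drift` function, the keys, and the version strings are hypothetical, and a real controller would compare live API objects and reconcile them rather than just report.

```python
def find_drift(
    desired: dict[str, str],
    actual_by_cluster: dict[str, dict[str, str]],
) -> dict[str, list[str]]:
    """Return, per cluster, the sorted keys whose live value differs from the desired one."""
    drift: dict[str, list[str]] = {}
    for cluster, actual in actual_by_cluster.items():
        # Check keys present on either side so missing and extra items both count.
        keys = set(desired) | set(actual)
        changed = sorted(k for k in keys if desired.get(k) != actual.get(k))
        if changed:
            drift[cluster] = changed
    return drift


# Hypothetical desired state and per-cluster observations.
desired = {"rbac/viewer": "v3", "policy/no-privileged": "v1"}
actual_by_cluster = {
    "prod-eu": {"rbac/viewer": "v3", "policy/no-privileged": "v1"},
    "prod-us": {"rbac/viewer": "v2", "policy/no-privileged": "v1"},  # stale RBAC
    "staging-eu": {"rbac/viewer": "v3"},                             # policy missing
}
print(find_drift(desired, actual_by_cluster))
# {'prod-us': ['rbac/viewer'], 'staging-eu': ['policy/no-privileged']}
```

With three clusters the report is easy to read by eye; with dozens, only an automated comparison against a single source of truth keeps it manageable.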

Concern	Approach
Traffic routing	Global LB / DNS in front of regional clusters
Cross-cluster networking	Multi-cluster mesh / fleet networking
Config & policy consistency	GitOps as single source of truth
Stateful data	Hardest — region-aware replication, explicit trade-offs
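The traffic-routing row can be sketched as a selection problem: send each client to the lowest-latency region that is currently healthy. The `pick_region` function, region names, and latency figures are assumptions made for illustration; a real global load balancer or DNS service makes this decision using its own health checks and measurements.

```python
def pick_region(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Return the lowest-latency healthy region for one client."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies from one European client to each regional cluster.
client_latency = {"eu-west": 20.0, "us-east": 95.0, "ap-south": 160.0}

print(pick_region(client_latency, {"eu-west", "us-east", "ap-south"}))  # eu-west
print(pick_region(client_latency, {"us-east", "ap-south"}))             # us-east
```

The second call shows the failover case: when the nearest region is unhealthy, traffic moves to the next nearest one, which only works if that region can actually serve the request, including its data.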

Data Gravity

The recurring constraint is data gravity: compute is easy to spread across clusters and regions, but data is heavy — replicating it costs latency, money, and consistency. Many multi-region designs fail not on the Kubernetes layer but on naive assumptions about synchronously replicating a database across continents. Decide the data strategy first (which region is authoritative, what consistency you need, how failover works); the cluster topology follows from it, not the other way around.
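A back-of-the-envelope calculation shows why synchronous replication across continents is costly. The numbers are assumptions chosen for illustration (a 2 ms local commit and a 150 ms intercontinental round trip), and the model is deliberately simple: it assumes each synchronous write waits for exactly one round trip to the remote replica, whereas real replication protocols may need more.

```python
def sync_write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float) -> float:
    """Latency of one write that waits for the local commit plus one remote round trip."""
    return local_commit_ms + cross_region_rtt_ms


def max_serial_writes_per_second(write_latency_ms: float) -> float:
    """Upper bound on back-to-back writes issued by a single connection."""
    return 1000.0 / write_latency_ms


LOCAL_COMMIT_MS = 2.0        # assumed local commit time
CROSS_REGION_RTT_MS = 150.0  # assumed intercontinental round-trip time

local_only = max_serial_writes_per_second(LOCAL_COMMIT_MS)
synchronous = max_serial_writes_per_second(
    sync_write_latency_ms(LOCAL_COMMIT_MS, CROSS_REGION_RTT_MS)
)
print(f"local only:  {local_only:.1f} serial writes/s")   # 500.0
print(f"synchronous: {synchronous:.1f} serial writes/s")  # 6.6
```

Under these assumptions each write becomes 76 times slower (152 ms instead of 2 ms). Asynchronous replication avoids that wait but accepts that the remote region may lag behind, which is exactly the consistency trade-off the data strategy has to decide.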

Single cluster vs many clusters

Single cluster: namespaces provide separation, and it is the simplest option to operate. The right default for most workloads.

Multiple clusters: provide blast-radius, geographic, regulatory, or scale boundaries, at the cost of networking, consistency, and data complexity.

Common Mistakes

Going multi-cluster before a concrete reason (blast radius, geo, regulation, scale) justifies it.
Replicating stateful data naively across regions and discovering the latency/consistency cost late.
Letting cluster configs drift instead of governing the fleet with GitOps.
Assuming cross-cluster service discovery works like in-cluster DNS — it needs a mesh or fleet networking.
Designing the cluster topology before the data strategy.

Best Practices

Default to one well-run cluster with namespaces unless a specific driver requires more.
Decide the data strategy (authoritative region, consistency, failover) first; topology follows.
Govern the fleet with GitOps so config and policy stay consistent across clusters.
Use a multi-cluster mesh or fleet networking for cross-cluster connectivity, not bare DNS.
Treat each cluster as a failure domain and place workloads to bound blast radius.

Related

GitOps: keeps a fleet's config consistent (Topic 43)
Service mesh: multi-cluster connectivity (Topic 24)
High availability and DR: the reliability side of topology (Topic 61)

Knowledge Check

Which is a legitimate reason to run multiple clusters?

Blast-radius isolation, geographic proximity, regulatory data residency, or scale limits
A general preference for owning more infrastructure and standing up more control planes
To avoid having to learn and use namespaces
Because a single cluster cannot run more than one application at a time

What is the genuinely hard part of multi-region architecture?

Replicating stateful data across regions, with its latency and consistency trade-offs
Running identical stateless Deployments in each region and scaling them with an HPA
Installing and configuring a CNI plugin in each regional cluster
Creating the right set of namespaces in every region

How should config and policy be kept consistent across a cluster fleet?

GitOps as the single source of truth, reconciled into each cluster
Manually running kubectl apply against each cluster in turn whenever config changes
Periodically copying the etcd datastore between clusters
A single shared NodePort Service in front of the fleet
