
Module 4.2: Multi-Cluster and Multi-Region Architectures

Complexity: [COMPLEX]

Time to Complete: 3 hours

Prerequisites: Module 4.1: Managed vs Self-Managed Kubernetes

Track: Cloud Architecture Patterns

After completing this module, you will be able to:

  • Design multi-cluster architectures for fault isolation, regulatory compliance, and team autonomy across regions
  • Implement cross-cluster service discovery and traffic routing using service mesh or DNS-based approaches
  • Configure cluster federation patterns for workload placement, failover, and capacity management
  • Evaluate single-cluster vs multi-cluster tradeoffs for latency, blast radius, and operational complexity

October 4, 2021. Facebook (now Meta).

At 15:39 UTC, a routine maintenance command issued to Facebook’s backbone routers went wrong. The command was intended to assess the capacity of the backbone network. Instead, it disconnected every Facebook data center from the internet simultaneously. Not gradually. Not region by region. All at once.

BGP routes for Facebook, Instagram, WhatsApp, and Oculus were withdrawn from the global routing table. DNS servers, now unreachable, started returning SERVFAIL. Within minutes, 3.5 billion people lost access to the services they used for communication, business, and (in some countries) emergency coordination. Facebook’s own engineers couldn’t access internal tools to diagnose the problem because those tools ran on the same infrastructure. They had to physically drive to data centers and manually reconfigure routers.

The outage lasted nearly six hours. Revenue impact: approximately $65 million. Market cap loss during the outage: $47 billion. WhatsApp-dependent businesses in India, Brazil, and Southeast Asia lost an entire day of commerce.

The root cause wasn’t a hardware failure or a cyberattack. It was a single global control plane for the backbone network, where one bad command could reach every region simultaneously. There was no blast radius containment. No regional isolation. No independent failure domain that could keep operating while the rest recovered.

This module teaches you how to design architectures where that can’t happen. You’ll learn to think in failure domains, route traffic across regions, manage state across distance, and build systems where the worst-case scenario is a regional degradation — not a global outage.


Failure Domains: The Foundation of Multi-Cluster Design


Before you can design a multi-cluster architecture, you need to understand failure domains — the boundaries within which a failure is contained.

Think of failure domains like bulkheads on a ship. A breach in one compartment doesn’t sink the ship because the bulkheads contain the flooding. In cloud infrastructure, failure domains work the same way: a failure within one domain shouldn’t propagate to others.

CLOUD FAILURE DOMAIN HIERARCHY
═══════════════════════════════════════════════════════════════
Level 0: Pod
  Blast radius: Single container group
  Example: OOMKilled pod, CrashLoopBackOff
  Recovery: Automatic (kubelet restart, ReplicaSet replacement)

Level 1: Node
  Blast radius: All pods on one machine
  Example: Hardware failure, kernel panic, disk full
  Recovery: Minutes (pod rescheduling to healthy nodes)

Level 2: Availability Zone (AZ)
  Blast radius: All resources in one data center
  Example: Power outage, network partition, cooling failure
  Recovery: Automatic if workloads span AZs (anti-affinity)

Level 3: Cluster
  Blast radius: All workloads in one Kubernetes cluster
  Example: etcd corruption, control plane outage, bad admission webhook
  Recovery: Requires second cluster (failover)

Level 4: Region
  Blast radius: All resources in one geographic region
  Example: Major natural disaster, regional network partition
  Recovery: Requires multi-region deployment

Level 5: Cloud Provider
  Blast radius: All resources on one provider
  Example: Global provider outage (rare but catastrophic)
  Recovery: Requires multi-cloud deployment

┌─────────────────────────────────────────────────────────┐
│ REGION: us-east-1 │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ AZ: us-east-1a │ │ AZ: us-east-1b │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ Cluster: prod │ │ │ │ Cluster: prod │ │ │
│ │ │ ┌────┐┌────┐ │ │ │ │ ┌────┐┌────┐ │ │ │
│ │ │ │Node││Node│ │ │ │ │ │Node││Node│ │ │ │
│ │ │ └────┘└────┘ │ │ │ │ └────┘└────┘ │ │ │
│ │ └───────────────┘ │ │ └───────────────┘ │ │
│ └─────────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ REGION: eu-west-1 │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ AZ: eu-west-1a │ │ AZ: eu-west-1b │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ Cluster: prod │ │ │ │ Cluster: prod │ │ │
│ │ │ ┌────┐┌────┐ │ │ │ │ ┌────┐┌────┐ │ │ │
│ │ │ │Node││Node│ │ │ │ │ │Node││Node│ │ │ │
│ │ │ └────┘└────┘ │ │ │ │ └────┘└────┘ │ │ │
│ │ └───────────────┘ │ │ └───────────────┘ │ │
│ └─────────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Level 2 failure (AZ): Lose one box above. Others survive.
Level 4 failure (Region): Lose top half. Bottom half survives.

Pause and predict: If a bad Kubernetes mutating admission webhook is deployed and blocks all pod creation across your environment, what level of failure domain does this represent? How would you recover?

| Strategy | Protects Against | Cost Multiplier | Complexity |
| --- | --- | --- | --- |
| Multi-AZ (single cluster) | Node/AZ failures | 1x (just spread pods) | Low |
| Multi-cluster (same region) | Cluster-level failures | 1.5-2x | Medium |
| Multi-region | Regional failures | 2-3x | High |
| Multi-cloud | Provider-level failures | 3-5x | Very High |

Most organizations should start with multi-AZ, move to multi-cluster when they need blast radius isolation between teams or environments, and go multi-region only for tier-1 services that require geographic redundancy or compliance with data residency laws.

Multi-cloud is almost never worth the complexity unless regulation demands it (banking, government) or you’re genuinely concerned about provider lock-in at a strategic level.


Once you have clusters in multiple regions, you need to route users to the right one. This is where things get architecturally interesting.

DNS-based routing is the simplest approach: use weighted or latency-based DNS records to direct traffic.

DNS-BASED MULTI-REGION ROUTING
═══════════════════════════════════════════════════════════════
User in New York User in London
│ │
▼ ▼
DNS Query: DNS Query:
api.example.com api.example.com
│ │
▼ ▼
Route 53 (latency-based routing)
│ │
▼ ▼
Returns: 52.1.2.3 Returns: 18.4.5.6
(us-east-1 NLB) (eu-west-1 NLB)
│ │
▼ ▼
US Cluster EU Cluster
# AWS Route 53: Latency-based routing
# Create one latency record per region in an existing hosted zone
# Record for US region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "us-east-1",
        "Region": "us-east-1",
        "AliasTarget": {
          "HostedZoneId": "Z26RNL4JYFTOTI",
          "DNSName": "us-nlb-1234.elb.us-east-1.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
# Record for EU region
# (AliasTarget.HostedZoneId must be the NLB zone ID for eu-west-1, not the ALB one)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "eu-west-1",
        "Region": "eu-west-1",
        "AliasTarget": {
          "HostedZoneId": "Z2IFOLAFXWLO4F",
          "DNSName": "eu-nlb-5678.elb.eu-west-1.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'

DNS Routing Trade-offs:

| Advantage | Disadvantage |
| --- | --- |
| Simple to implement | DNS TTL creates stale routing (clients cache) |
| Works with any backend | Failover speed limited by TTL (30s-300s typical) |
| Provider-native health checks | Client DNS resolvers may ignore TTL |
| Low cost | No connection draining during failover |

Stop and think: If you use DNS routing with a 5-minute TTL, and your active region goes down, what exactly is the user experience for the next 5 minutes? How does this impact your RTO?
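
One rough way to reason about the question above is to treat the worst-case failover window as health-check detection time plus the record TTL. The sketch below is a back-of-the-envelope model, not a Route 53 API: the function name and parameters are hypothetical, and it assumes resolvers honor the TTL (some do not, which makes the real window longer).

```python
def worst_case_failover_seconds(check_interval_s: float,
                                failure_threshold: int,
                                ttl_s: float) -> float:
    """Upper bound on how long a client may keep resolving to a dead region.

    Assumes the health checker needs `failure_threshold` consecutive failed
    checks, and that a client cached the record just before it was withdrawn.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + ttl_s


# Hypothetical settings: 30s checks, 3 failures to mark unhealthy.
for ttl in (300, 60):
    window = worst_case_failover_seconds(30, 3, ttl)
    print(f"TTL {ttl}s -> up to {window}s of requests sent to the failed region")
```

Under these assumed settings, lowering the TTL from 300s to 60s shrinks the window considerably, at the cost of more DNS queries. Whatever the window is, it counts against your RTO.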

Cloud providers offer global load balancers that use Anycast IP addresses. A single IP address is advertised from multiple locations, and BGP routing sends users to the nearest one.

GLOBAL LOAD BALANCER (ANYCAST)
═══════════════════════════════════════════════════════════════
User in Tokyo User in Sao Paulo
│ │
▼ ▼
Same IP: 34.120.0.1 Same IP: 34.120.0.1
│ │
▼ ▼
BGP routes to BGP routes to
nearest PoP nearest PoP
(Tokyo PoP) (Sao Paulo PoP)
│ │
▼ ▼
Google Front End Google Front End
(TLS termination) (TLS termination)
│ │
▼ ▼
asia-northeast1 southamerica-east1
GKE Cluster GKE Cluster
# GKE: Multi-cluster Ingress with Anycast
# First, register clusters in a fleet
# Then create a MultiClusterIngress resource
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: api-ingress
  namespace: production
  annotations:
    networking.gke.io/static-ip: "34.120.0.1"
spec:
  template:
    spec:
      backend:
        serviceName: api-multicluster-svc
        servicePort: 443
---
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: api-multicluster-svc
  namespace: production
spec:
  template:
    spec:
      selector:
        app: api-server
      ports:
        - name: https
          protocol: TCP
          port: 443
          targetPort: 8443
  clusters:
    - link: "us-east1/production-us"
    - link: "europe-west1/production-eu"
    - link: "asia-northeast1/production-asia"

Global LB vs DNS Routing:

| Factor | DNS Routing | Global LB (Anycast) |
| --- | --- | --- |
| Failover speed | 30-300 seconds (TTL) | 5-30 seconds (BGP convergence) |
| TLS termination | At each cluster’s ingress | At edge PoP (closer to user) |
| DDoS protection | You configure per-region | Built into edge network |
| Cost | Low (~$1/million queries) | Higher ($18-50/month + per-GB) |
| Provider lock-in | Low (DNS is portable) | High (provider-specific) |
| Health checking | DNS-level (binary: up/down) | Request-level (HTTP status, latency) |

For east-west traffic (service-to-service) rather than north-south (user-to-service), a multi-cluster service mesh provides fine-grained routing.

MULTI-CLUSTER SERVICE MESH
═══════════════════════════════════════════════════════════════
Cluster: us-east-1 Cluster: eu-west-1
┌──────────────────────┐ ┌──────────────────────┐
│ │ │ │
│ ┌──────┐ ┌──────┐ │ │ ┌──────┐ ┌──────┐ │
│ │ App │─▶│ Cart │ │ │ │ App │─▶│ Cart │ │
│ │ (v2) │ │ Svc │ │ │ │ (v2) │ │ Svc │ │
│ └──┬───┘ └──────┘ │ │ └──┬───┘ └──────┘ │
│ │ │ │ │ │
│ ┌──▼───┐ │ mTLS │ ┌──▼───┐ │
│ │ Pay │ │◀─────────▶│ │ Pay │ │
│ │ Svc │ │ Gateway │ │ Svc │ │
│ └──────┘ │ │ └──────┘ │
│ │ │ │
│ Istio Control Plane │ │ Istio Control Plane │
│ (local to cluster) │ │ (local to cluster) │
└──────────────────────┘ └──────────────────────┘
│ │
└──────── Shared Root CA ────────────┘
# Istio: Locality-aware load balancing
# Prefer local cluster, failover to remote
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east-1
            to: eu-west-1
          - from: eu-west-1
            to: us-east-1
      warmupDurationSecs: "30s"
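
Locality failover only kicks in once outlier detection ejects unhealthy local endpoints, so the `consecutive5xxErrors` threshold matters. The following is a simplified model of that one setting, not Envoy's actual algorithm: the `OutlierDetector` class and pod names are hypothetical, and ejection timers (`interval`, `baseEjectionTime`) and ejection limits are deliberately left out.

```python
from collections import defaultdict


class OutlierDetector:
    """Ejects an endpoint after N consecutive 5xx responses (simplified)."""

    def __init__(self, consecutive_5xx_errors: int):
        self.threshold = consecutive_5xx_errors
        self.streak = defaultdict(int)  # endpoint -> current run of 5xx responses
        self.ejected = set()

    def record(self, endpoint: str, status: int) -> None:
        if 500 <= status <= 599:
            self.streak[endpoint] += 1
            if self.streak[endpoint] >= self.threshold:
                self.ejected.add(endpoint)
        else:
            # Any non-5xx response resets the streak.
            self.streak[endpoint] = 0

    def healthy(self, endpoints):
        return [e for e in endpoints if e not in self.ejected]


local_pods = ["us-pod-1", "us-pod-2"]
aggressive = OutlierDetector(consecutive_5xx_errors=1)
tolerant = OutlierDetector(consecutive_5xx_errors=3)

# A transient blip: every local pod returns a single 503, then recovers.
for detector in (aggressive, tolerant):
    for pod in local_pods:
        detector.record(pod, 503)
        detector.record(pod, 200)

print("threshold=1:", aggressive.healthy(local_pods))
print("threshold=3:", tolerant.healthy(local_pods))
```

With a threshold of 1, the blip empties the local pool and all traffic shifts to the remote region. With a threshold of 3, the same blip changes nothing.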

When you go from one cluster to many, your deployment tooling must evolve. Manually applying manifests to 15 clusters is a recipe for configuration drift and missed deployments.

ArgoCD’s ApplicationSet controller lets you define a template that generates Application resources for every cluster.

# Centralized GitOps for multi-cluster
# One ApplicationSet generates Applications for all clusters
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payment-service
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    # Generate one Application per cluster
    - clusters:
        selector:
          matchLabels:
            env: production
        values:
          revision: main
    - clusters:
        selector:
          matchLabels:
            env: staging
        values:
          revision: staging
  template:
    metadata:
      name: 'payment-{{.name}}'
    spec:
      project: production
      source:
        repoURL: https://github.com/company/k8s-manifests.git
        targetRevision: '{{.values.revision}}'
        path: 'apps/payment-service/overlays/{{.metadata.labels.env}}'
      destination:
        server: '{{.server}}'
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
        retry:
          limit: 3
          backoff:
            duration: 5s
            factor: 2
            maxDuration: 3m

RECOMMENDED REPOSITORY STRUCTURE
═══════════════════════════════════════════════════════════════
k8s-manifests/
├── apps/
│   ├── payment-service/
│   │   ├── base/                    # Shared across all clusters
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   ├── hpa.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/
│   │       ├── staging/             # Staging-specific overrides
│   │       │   ├── replicas.yaml    # Lower replica count
│   │       │   ├── resources.yaml   # Smaller resource limits
│   │       │   └── kustomization.yaml
│   │       ├── production/          # Production overrides
│   │       │   ├── replicas.yaml    # Higher replica count
│   │       │   ├── resources.yaml   # Larger resource limits
│   │       │   ├── pdb.yaml         # PodDisruptionBudget
│   │       │   └── kustomization.yaml
│   │       └── production-eu/       # Region-specific overrides
│   │           ├── configmap.yaml   # EU-specific config (endpoints)
│   │           └── kustomization.yaml
│   └── cart-service/
│       ├── base/
│       └── overlays/
├── infrastructure/
│   ├── cert-manager/
│   ├── external-dns/
│   ├── istio/
│   └── monitoring/
└── clusters/                        # Cluster-specific bootstrapping
    ├── us-east-1-prod/
    ├── eu-west-1-prod/
    └── staging/

The key principle: base manifests should work identically across all clusters. Differences (replica counts, resource limits, region-specific endpoints) live in overlays. If you find yourself maintaining entirely different manifests per cluster, your architecture has diverged too far.


Here’s the hard truth: stateful workloads are the primary reason multi-region architecture is difficult. Stateless services can run anywhere — they just need the right configuration. But databases, queues, and caches hold data that must be consistent (or at least eventually consistent) across regions.

The CAP theorem says you cannot have all three of the following simultaneously across regions:

  • Consistency: Every read receives the most recent write
  • Availability: Every request receives a response
  • Partition tolerance: The system continues operating despite network partitions

Since network partitions between regions are inevitable (they happen several times per year on every cloud provider), you must choose between consistency and availability during a partition.

CAP THEOREM: YOUR TWO CHOICES DURING A PARTITION
═══════════════════════════════════════════════════════════════
Choice 1: CP (Consistency + Partition Tolerance)
During partition: Refuse writes to the partitioned region
Result: Some users get errors, but data is never wrong
Use for: Financial transactions, inventory counts, user accounts
Tools: CockroachDB, Google Spanner, etcd
Region A Region B
┌──────────┐ X ┌──────────┐
│ Write OK │ ──X── │ Write │
│ │ X │ REJECTED │
│ Primary │ network │ Standby │
└──────────┘ partition└──────────┘
Choice 2: AP (Availability + Partition Tolerance)
During partition: Accept writes in both regions, reconcile later
Result: All users can write, but data may temporarily conflict
Use for: Shopping carts, user preferences, social media posts
Tools: DynamoDB Global Tables, Cassandra, CRDTs
Region A Region B
┌──────────┐ X ┌──────────┐
│ Write OK │ ──X── │ Write OK │
│ │ X │ │
│ Replica │ network │ Replica │
└──────────┘ partition└──────────┘
│ reconcile when │
└───── partition heals ─┘
(conflict resolution needed)
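
The two choices can be reduced to a single decision rule for each side of the partition. This is a deliberately simplified model, assuming a majority-quorum rule for CP (as quorum-based systems such as etcd use) and "any reachable replica" for AP. The function name and replica counts are illustrative only.

```python
def accepts_write(mode: str, reachable_replicas: int, total_replicas: int) -> bool:
    """Decide whether one side of a partition may accept a write."""
    if mode == "CP":
        # Only the side that can still reach a strict majority keeps writing.
        return reachable_replicas > total_replicas // 2
    if mode == "AP":
        # Any side with at least one live replica keeps writing; reconcile later.
        return reachable_replicas >= 1
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical layout: 3 replicas total, 2 in Region A and 1 in Region B.
# During the partition each region only reaches its own replicas.
for mode in ("CP", "AP"):
    region_a = accepts_write(mode, reachable_replicas=2, total_replicas=3)
    region_b = accepts_write(mode, reachable_replicas=1, total_replicas=3)
    print(f"{mode}: Region A writes={region_a}, Region B writes={region_b}")
```

Under CP, Region B rejects writes until the partition heals. Under AP, both regions accept writes and the conflict-resolution burden moves to the merge step.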

Pause and predict: If you use active-active database replication across the Atlantic (80ms latency) and require strong consistency, what happens to the response time of a simple HTTP POST request that writes to the database?

| Pattern | How It Works | Latency | Consistency | Complexity |
| --- | --- | --- | --- | --- |
| Single-region primary + read replicas | All writes go to one region; other regions read from replicas | Writes: low in primary, high elsewhere | Strong (reads may lag) | Low |
| Active-active with conflict resolution | Both regions accept writes; conflicts resolved by last-write-wins or custom logic | Low everywhere | Eventual | High |
| Consensus-based (Spanner, CockroachDB) | Distributed consensus across regions for every write | Higher (cross-region round trip) | Strong | Medium (database handles it) |
| Event sourcing + CQRS | Write events to a log; each region builds its own read model | Writes: low; reads: eventual | Eventual (tunable) | High |

War Story: The Shopping Cart That Bought Two Couches


An e-commerce company ran active-active across US and EU regions. A customer in transit (flying from New York to London) started shopping on the US cluster, added a couch to their cart, then continued browsing after landing in London (now hitting the EU cluster). The cart replication had a 2-second lag.

In those 2 seconds, a background process in the US cluster ran a “cart reminder” campaign that duplicated the cart for A/B testing. When the EU cluster reconciled, it merged the original cart, the test cart, and the customer’s continued browsing. The customer saw two couches in their cart, assumed it was a quantity they’d set, and placed the order.

The fix: CRDTs (Conflict-free Replicated Data Types) for cart state, where add/remove operations are commutative and idempotent. Merging two replicas always produces the same correct result regardless of order.
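
To make the merge property concrete, here is a minimal sketch of one CRDT, an observed-remove set (OR-Set), used as a cart. It is not the company's implementation. The `ORSetCart` class is hypothetical, and it tracks only whether an item is in the cart, not quantities (a per-item counter CRDT would be needed for those).

```python
from dataclasses import dataclass, field


@dataclass
class ORSetCart:
    """Observed-remove set: merge is commutative, associative, and idempotent."""
    replica_id: str
    adds: set = field(default_factory=set)     # tags: (item, replica_id, seq)
    removes: set = field(default_factory=set)  # tombstoned tags
    _seq: int = 0

    def add(self, item: str) -> None:
        self._seq += 1
        self.adds.add((item, self.replica_id, self._seq))

    def remove(self, item: str) -> None:
        # Tombstone only the add-tags this replica has observed so far.
        self.removes |= {tag for tag in self.adds if tag[0] == item}

    def merge(self, other: "ORSetCart") -> None:
        # Set union cannot double-count, so replaying a merge is harmless.
        self.adds |= other.adds
        self.removes |= other.removes

    def items(self) -> list:
        return sorted({tag[0] for tag in self.adds - self.removes})


us, eu = ORSetCart("us"), ORSetCart("eu")
us.add("couch")   # customer adds a couch in the US
eu.merge(us)      # replication reaches the EU
eu.add("lamp")    # customer keeps shopping in the EU
eu.merge(us)      # the same US state is delivered a second time
us.merge(eu)
print(us.items(), eu.items())
```

Because merging is a set union, receiving the same replica state twice, or in a different order, leaves both carts with one couch and one lamp.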


Clusters need to communicate. Services in Cluster A need to call services in Cluster B. This requires cross-cluster networking that’s secure, observable, and performant.

APPROACH 1: VPC PEERING + DNS
═══════════════════════════════════════════════════════════════
Simple. Each cluster's services are exposed via internal LBs.
Services discover each other through DNS.
Cluster A (VPC 10.1.0.0/16) Cluster B (VPC 10.2.0.0/16)
┌───────────────────────┐ ┌───────────────────────┐
│ payment-svc │ │ inventory-svc │
│ → Internal NLB │───VPC───▶ │ → Internal NLB │
│ 10.1.50.23 │ Peering │ 10.2.50.44 │
└───────────────────────┘ └───────────────────────┘
DNS: inventory.internal.company.com → 10.2.50.44
Pros: Simple, no service mesh needed
Cons: No mTLS by default, limited traffic management
APPROACH 2: MULTI-CLUSTER SERVICE MESH
═══════════════════════════════════════════════════════════════
Service mesh spans clusters. Automatic mTLS, traffic shifting,
observability across cluster boundaries.
Cluster A Cluster B
┌───────────────────────┐ ┌───────────────────────┐
│ ┌─────┐ ┌─────────┐│ │┌─────────┐ ┌─────┐ │
│ │ App │──▶│ Envoy ││───mTLS────▶││ Envoy │──▶│ Svc │ │
│ │ │ │ Sidecar ││ ││ Sidecar │ │ │ │
│ └─────┘ └─────────┘│ │└─────────┘ └─────┘ │
│ │ │ │
│ Istio Control Plane │ │ Istio Control Plane │
└───────────────────────┘ └───────────────────────┘
Shared trust domain (common root CA)
Pros: mTLS everywhere, traffic policies, observability
Cons: Complexity, mesh overhead, operational burden
APPROACH 3: GATEWAY API + MULTI-CLUSTER
═══════════════════════════════════════════════════════════════
Kubernetes Gateway API with multi-cluster extensions.
The emerging standard approach.
Cluster A Cluster B
┌───────────────────────┐ ┌───────────────────────┐
│ ┌─────┐ │ │ ┌─────┐ │
│ │ App │──▶ Gateway ───│───TLS─────▶│──▶ Gateway ──▶│ Svc │ │
│ └─────┘ │ │ └─────┘ │
└───────────────────────┘ └───────────────────────┘
Pros: Standard API, growing ecosystem, simpler than full mesh
Cons: Still maturing, fewer features than service mesh
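
VPC peering (Approach 1) and most mesh topologies assume that cluster networks do not overlap. A small pre-flight check with Python's standard `ipaddress` module can catch this before any infrastructure is built. The function name and the third CIDR below are hypothetical; the first two ranges come from the diagram above.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs: dict) -> list:
    """Return every pair of named CIDR blocks that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


planned = {
    "cluster-a-vpc": "10.1.0.0/16",
    "cluster-b-vpc": "10.2.0.0/16",
}
print(find_overlaps(planned))  # no overlaps between the two VPCs above

# A hypothetical third VPC carved out of Cluster B's range.
planned["cluster-c-vpc"] = "10.2.128.0/17"
print(find_overlaps(planned))
```

Running a check like this in CI against the fleet's declared CIDRs is one way to keep a new region from blocking peering later.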

When you operate 5, 10, or 50 clusters, you need tooling to manage them as a fleet rather than individually.

| Tool | Provider | Approach | Best For |
| --- | --- | --- | --- |
| Cluster API | CNCF | Declarative cluster lifecycle via K8s CRDs | Multi-cloud, self-managed |
| Rancher | SUSE | Central management console | Mixed environments |
| GKE Fleet | Google | Native GKE multi-cluster | GKE-only shops |
| EKS Connector | AWS | Register external clusters into EKS console | AWS-centric with some non-EKS |
| Azure Arc | Microsoft | Extend Azure management to any K8s | Azure-centric with hybrid |
| ArgoCD | CNCF | GitOps-based config management | GitOps-native teams |

# Cluster API: Declarative cluster lifecycle management
# Define a cluster like any other Kubernetes resource
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-eu
  namespace: fleet
  labels:
    env: production
    region: eu-west-1
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-eu-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: production-eu
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: production-eu
  namespace: fleet
spec:
  region: eu-west-1
  sshKeyName: fleet-key
  network:
    vpc:
      cidrBlock: "10.2.0.0/16"
    subnets:
      - availabilityZone: eu-west-1a
        cidrBlock: "10.2.1.0/24"
        isPublic: false
      - availabilityZone: eu-west-1b
        cidrBlock: "10.2.2.0/24"
        isPublic: false
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-eu-control-plane
  namespace: fleet
spec:
  replicas: 3
  version: v1.35.0
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSMachineTemplate
      name: production-eu-cp
  # kubeadmConfigSpec (required in a real manifest) is omitted here for brevity

The beauty of Cluster API is that adding a new cluster is a kubectl apply. Upgrading a cluster’s Kubernetes version is changing the version field. The controllers handle the rest — draining nodes, upgrading control planes, rolling worker machines.


  • Google’s internal container orchestrator, Borg, manages cells with a median size of about 10,000 machines, and some are much larger. But even Google doesn’t run one giant cluster. They use a “cell” architecture where each Borg cell is an independent failure domain. When they designed Kubernetes for the outside world, they made the same architectural assumption: clusters are failure domains, and you’ll run many of them.

  • Cross-region network latency is bounded by the speed of light. US East to EU West is approximately 80ms round trip. US East to Asia Pacific is approximately 200ms. No amount of engineering can reduce this below the physical limit. This is why consensus-based databases like Spanner achieve strong consistency at the cost of write latency: every write must wait for a cross-region round trip to achieve quorum.

  • AWS had 28 documented service disruptions in us-east-1 between 2017 and 2024, making it statistically the least reliable major region. Despite this, it remains the most popular region because it was the first, has the most services, and many companies hardcoded it into their infrastructure before multi-region was common. If us-east-1 is one of your regions, pairing it with a second region is prudent.

  • Kubernetes SIG Multicluster has been working on the MCS (Multi-Cluster Services) API since 2020. The ServiceExport and ServiceImport resources define a standard way to expose services across clusters. As of 2026, this API is in beta and supported by GKE, Istio, and Submariner, which makes cross-cluster service discovery a first-class Kubernetes concept rather than a vendor-specific extension.
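
The speed-of-light bound mentioned above can be estimated with a short calculation. This is a sketch under stated assumptions: a fiber refractive index of about 1.47, and a rough straight-line distance of 5,500 km between the US East and EU West regions. Both figures are approximations chosen for illustration, not measurements.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_REFRACTIVE_INDEX = 1.47  # assumed typical value for optical fiber


def min_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round-trip time over fiber for a given path length."""
    speed_in_fiber_km_s = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return 2 * distance_km / speed_in_fiber_km_s * 1000


# Assumed straight-line distance; real cable routes are longer.
print(f"{min_rtt_ms(5500):.1f} ms")
```

Under these assumptions the floor is roughly 54 ms. Observed round trips near 80ms are higher because cables do not follow straight lines and routers add processing time.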


| Mistake | Why It Happens | How to Fix It |
| --- | --- | --- |
| Going multi-region for all services | “Everything must be highly available” | Tier your services. Only tier-1 services need multi-region. Tier-2 can be single-region with fast recovery |
| Active-active without conflict resolution | “We’ll figure out conflicts later” | Design your data model for multi-region BEFORE deploying. Use CRDTs, event sourcing, or consensus databases |
| Ignoring cross-region data transfer costs | Transfer fees are hidden in the bill | At $0.02/GB, a chatty service sending 1TB/month cross-region costs $240/yr just in transfer. Profile your traffic first |
| Same configuration across all regions | “They should be identical” | Regions differ: instance types, pricing, available AZs, compliance requirements. Use Kustomize overlays per region |
| No cluster-level health checking | Routing layer doesn’t know a cluster is unhealthy | Implement deep health checks (not just TCP) at the global LB or DNS level. Check actual application health |
| Single ArgoCD managing all clusters | Central point of failure for deployments | Run ArgoCD per-cluster or per-region. Use ApplicationSets from a hub cluster, but each cluster’s ArgoCD is independent |
| Testing failover only in production | “We’ll do a DR drill someday” | Schedule quarterly DR drills. Simulate region failure by withdrawing traffic. If you’ve never tested failover, it doesn’t work |
| Assuming cloud provider handles everything | “EKS is multi-AZ, so we’re fine” | Multi-AZ protects against AZ failure, not cluster or region failure. You still need multi-cluster for full resilience |
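
The transfer-cost arithmetic in the table is easy to rerun for your own traffic profile. The sketch below uses the table's $0.02/GB rate as its default and treats 1 TB as 1,000 GB. The function name and the 50 TB figure are hypothetical, and actual rates vary by provider and region pair.

```python
def cross_region_transfer_cost(gb_per_month: float, usd_per_gb: float = 0.02):
    """Return (monthly_usd, yearly_usd) for cross-region data transfer."""
    monthly = gb_per_month * usd_per_gb
    return monthly, monthly * 12


for label, gb in (("1 TB/month", 1_000), ("50 TB/month", 50_000)):
    monthly, yearly = cross_region_transfer_cost(gb)
    print(f"{label}: ${monthly:,.2f}/month, ${yearly:,.2f}/year")
```

The cost scales linearly with volume, which is why profiling traffic before enabling cross-region replication or chatty service-to-service calls is worth the effort.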

1. A company's payment service runs in us-east-1. They want to add eu-west-1 for disaster recovery. Should they choose active-active or active-passive, and why?

For a payment service, active-passive is usually the safer choice. Payments require strong consistency — you cannot risk processing the same payment twice or losing a payment due to conflict resolution between regions. Active-passive means all payment writes go to us-east-1 (primary), with eu-west-1 as a hot standby that receives replicated data but doesn’t serve write traffic.

During a failover, eu-west-1 is promoted to primary. This involves brief downtime (seconds to minutes depending on replication lag), but the data is consistent. Active-active payments would require either distributed consensus (adding latency to every transaction) or eventual consistency (risking double-charges or lost payments).

The exception: if the company uses a consensus database like Spanner or CockroachDB, active-active with strong consistency is possible, but each write incurs cross-region latency.

2. Your team is debating how to route traffic between a Tokyo and a London cluster. One engineer suggests Route 53 latency records, while another advocates for Google Cloud's Global Load Balancer (Anycast). If a complete regional outage occurs in Tokyo, how will the failover experience differ between these two approaches?

With DNS-based routing (Route 53), failover depends on cached DNS answers expiring, so users can keep hitting the dead Tokyo endpoint until the TTL runs out, and for longer if their resolvers ignore the TTL. An Anycast-based global load balancer does not depend on client-side DNS caching: clients keep using the same IP address, and the provider’s edge network steers traffic away from Tokyo, through backend health checks or, if a point of presence itself is lost, through BGP reconvergence. Traffic moves to the next nearest healthy region (London) within seconds. This provides a much faster, more deterministic failover experience that isn’t at the mercy of client-side ISP caching behaviors.

3. Your CTO returns from a conference and mandates that the new Kubernetes platform must run simultaneously across AWS (EKS) and Azure (AKS) to "avoid vendor lock-in." As the lead architect, explain why this multi-cloud approach might actually decrease overall system reliability and delivery speed.

Running a true multi-cloud Kubernetes environment forces you to rely on the “lowest common denominator” of features or build complex abstractions to hide provider differences, dramatically slowing down feature delivery. Your team must maintain duplicate expertise in two entirely different IAM models, networking stacks, and storage classes, which doubles the operational burden and surface area for misconfigurations. Because system reliability is heavily dependent on deep expertise and proven operational runbooks, splitting the team’s focus across two cloud providers typically results in more outages, not fewer. Unless you have strict regulatory requirements or massive leverage to negotiate vendor pricing, a multi-region deployment on a single cloud provider offers far better resilience for a fraction of the engineering cost.

4. A service mesh is configured for locality-aware load balancing. Traffic should prefer local pods, fail over to the same region, then fail over to remote regions. If the outlier detection threshold is set too aggressively (e.g., ejecting after a single 5xx error), what cascading failure could this trigger during a minor transient network blip?

If outlier detection is too aggressive, a minor transient error can cause healthy local pods to be immediately ejected from the load balancing pool. This forces the service mesh to shift that traffic to the next locality (same-region or remote-region), artificially increasing the load on those fallback pods. The sudden surge in traffic to the fallback pods can cause them to overload and throw their own 5xx errors, leading to their ejection as well. This creates a cascading failure where traffic violently oscillates between regions, turning a brief localized blip into a widespread system degradation. To prevent this, outlier detection must require multiple consecutive errors before ejection.

5. During Black Friday, a backhoe severs a major fiber line, causing a hard network partition between your US and EU clusters. For your multi-region shopping cart service, which CAP theorem trade-off (CP or AP) should you have designed for, and what is the user experience during this partition?

For a shopping cart service, you should strictly design for an AP (Availability + Partition Tolerance) architecture because refusing a customer’s ability to add items to their cart directly translates to lost revenue. During the network partition, users in both the US and EU will continue to see a fast, responsive site and can add or remove items from their carts without errors. The trade-off is that the data will temporarily become inconsistent between the two regions, meaning a user somehow accessing both regions simultaneously would see different cart states. Once the partition heals, the system must use conflict resolution mechanisms, like Conflict-free Replicated Data Types (CRDTs), to merge the carts seamlessly in the background.

6. Your platform team wants to simplify GitOps by deploying a single, centralized ArgoCD instance in a "management cluster" to deploy applications to 15 production clusters globally. What is the critical architectural flaw in this design when a regional disaster strikes?

Placing a single ArgoCD instance in a centralized management cluster creates a massive single point of failure for your entire global deployment pipeline. If the region hosting that management cluster goes offline, you completely lose the ability to deploy emergency hotfixes or configuration changes to the remaining 14 healthy clusters exactly when you might need them most. Instead, a resilient architecture runs ArgoCD per-cluster or per-region so that each instance has local autonomy. This decentralized approach guarantees that each cluster can continue to sync state from Git independently, preserving your ability to manage healthy regions during a localized outage.

7. A product team is excited to make their stateless microservice "multi-region" and immediately begins writing Terraform for a new EKS cluster in eu-west-1. As the platform architect, what foundational architectural decisions must they finalize before writing any infrastructure code?

Before writing any infrastructure code, the team must first define their actual Recovery Time Objective (RTO) and Recovery Point Objective (RPO) to determine if the extreme cost and complexity of a multi-region deployment is even justified. If it is justified, they must finalize their data replication strategy for any stateful dependencies (like databases or caches), as data gravity dictates the entire traffic routing and failover architecture. Additionally, they must carefully plan their network IP address space (CIDR blocks) to ensure there is no overlap between regions, which would permanently block VPC peering or service mesh integration. Jumping straight to infrastructure code without these decisions usually results in a multi-region setup that either fails to replicate data correctly or cannot route traffic during a real outage.


Hands-On Exercise: Design a DR Architecture for a Payment Service


You’re the lead architect for a fintech company. Your tier-1 payment processing service needs to survive a full regional outage. Design a complete multi-region architecture.

The payment service:

  • Processes 2,000 transactions per second at peak
  • Has a PostgreSQL database (currently single-region, primary + 2 read replicas)
  • Uses Redis for session management and rate limiting
  • Integrates with 3 external payment processors (Stripe, Adyen, PayPal)
  • Current region: us-east-1
  • Target second region: eu-west-1
  • RTO (Recovery Time Objective): 5 minutes
  • RPO (Recovery Point Objective): 0 (no data loss)

Decide between active-passive and active-active for the payment service. Document your reasoning.

Solution

Recommended: Active-Passive with Hot Standby

Reasoning:

  • RPO of 0 (no data loss) rules out simple async replication for the database
  • Payment processing requires strong consistency (the same payment must never be processed twice)
  • Active-active with strong consistency is possible (CockroachDB/Spanner) but adds write latency to every transaction
  • Active-passive with synchronous replication to a hot standby achieves RPO=0 with a single writer. Each commit waits for the standby's acknowledgment (one cross-region round trip), but there is no multi-region write coordination to operate

Architecture:

  • us-east-1: Active (serves all traffic)
  • eu-west-1: Hot standby (receives synchronous replication, ready to promote)
  • Global load balancer with health checks on us-east-1
  • Automated failover triggers promotion of eu-west-1 when us-east-1 is unhealthy

The 5-minute RTO is achievable because:

  • Database promotion: ~30 seconds (synchronous replica, no data replay needed)
  • DNS/LB failover: ~10-60 seconds (Anycast or low-TTL DNS)
  • Application warmup: ~60-120 seconds (connection pools, caches)
  • Total: ~2-4 minutes, within the 5-minute RTO
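
The budget above can be checked with simple arithmetic. This sketch takes the upper bound of each estimate and assumes the phases run strictly one after another, which is a pessimistic simplification; the variable names are illustrative.

```bash
# Worst-case duration of each failover phase, in seconds (upper bounds from the list above).
rto_budget=300          # 5-minute RTO
db_promotion=30
dns_failover_max=60
app_warmup_max=120

# Assumes the phases are sequential; in practice some of them overlap.
worst_case=$(( db_promotion + dns_failover_max + app_warmup_max ))
headroom=$(( rto_budget - worst_case ))

echo "worst case: ${worst_case}s, headroom: ${headroom}s"
```

The remaining headroom is what is left for failure detection, which the list does not count. A detection window of about 30 seconds still fits inside it.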

Draw the database architecture. Address: Where is the primary? How does replication work? What happens to Redis during failover?

Solution
DATA LAYER ARCHITECTURE
═══════════════════════════════════════════════════════════════

us-east-1 (ACTIVE)                   eu-west-1 (STANDBY)
┌──────────────────────┐             ┌──────────────────────┐
│                      │             │                      │
│  PostgreSQL Primary  │──sync──▶    │  PostgreSQL Standby  │
│  (RDS Multi-AZ)      │   repl      │  (RDS Cross-Region)  │
│         │            │             │         │            │
│         │ async      │             │         │ async      │
│         ▼            │             │         ▼            │
│  Read Replica x2     │             │  Read Replica x1     │
│  (for read traffic)  │             │  (warm, not serving) │
│                      │             │                      │
│  Redis Primary       │             │  Redis Primary       │
│  (ElastiCache)       │             │  (ElastiCache)       │
│  - Sessions          │             │  - Pre-warmed        │
│  - Rate limits       │             │  - Empty on failover │
│  - Idempotency keys  │             │  - Rebuilt from DB   │
└──────────────────────┘             └──────────────────────┘

FAILOVER SEQUENCE:
  1. Health check detects us-east-1 failure
  2. Global LB stops sending traffic to us-east-1
  3. RDS promotes eu-west-1 standby to primary
  4. Application pods in eu-west-1 connect to local (now primary) DB
  5. Redis in eu-west-1 rebuilds rate limit counters from DB
  6. Global LB sends all traffic to eu-west-1
  7. New read replicas provisioned in eu-west-1

REDIS STRATEGY:
  Redis is treated as ephemeral. Sessions can be regenerated
  (force re-authentication -- acceptable for 5-min RTO).
  Rate limit counters are rebuilt from recent transaction history.
  Idempotency keys are stored in BOTH Redis and PostgreSQL --
  Redis for fast lookup, PostgreSQL as source of truth.

Key decisions:

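  • Synchronous replication for PostgreSQL achieves RPO=0 at the cost of ~80ms additional write latency (cross-Atlantic round trip). This is acceptable for a payment service where correctness matters more than milliseconds. Note that managed cross-region read replicas on RDS replicate asynchronously, so a strict RPO of 0 requires confirming that your database setup can commit synchronously to the remote standby (for example, self-managed PostgreSQL with synchronous_standby_names configured).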
  • Redis is NOT replicated cross-region. It’s cheaper and simpler to rebuild session state and rate limit counters from the database after failover. Trying to replicate Redis cross-region adds complexity with little benefit for a 5-minute RTO scenario.
  • Idempotency keys must survive failover. Store them in PostgreSQL (replicated) and cache in Redis (local). During failover, the PostgreSQL replica has all idempotency keys, preventing duplicate payment processing.

How does traffic reach the correct region? What health checks determine failover? How do you prevent split-brain?

Solution
# Route 53 Health Check for us-east-1 cluster
# Checks the actual payment processing capability, not just TCP
# Health check endpoint: GET /healthz/deep
# Returns 200 only if: API server up, DB writable, Redis reachable

# Primary record (us-east-1) -- failover routing policy
# Route 53 configuration:
#   Record name:    payments.example.com
#   Type:           A (Alias to NLB)
#   Routing policy: Failover
#   Failover type:  Primary
#   Health check:   payments-us-east-1-deep
#   Target:         us-east-1 NLB

# Secondary record (eu-west-1) -- failover routing policy
#   Record name:    payments.example.com
#   Type:           A (Alias to NLB)
#   Routing policy: Failover
#   Failover type:  Secondary
#   Health check:   payments-eu-west-1-deep
#   Target:         eu-west-1 NLB
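
As a rough illustration, the two failover records described above could be expressed as a Route 53 change batch. Everything specific in this sketch is a hypothetical placeholder: the file name, the SetIdentifier values, the health check IDs, and the NLB hosted zone IDs and DNS names. Only the record name and the primary/secondary roles come from the configuration above.

```bash
# Write an illustrative change batch; replace every placeholder before real use.
cat > failover-records.json <<'EOF'
{
  "Comment": "Failover routing for payments.example.com (placeholder values)",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "payments.example.com",
        "Type": "A",
        "SetIdentifier": "payments-us-east-1",
        "Failover": "PRIMARY",
        "HealthCheckId": "00000000-0000-0000-0000-000000000001",
        "AliasTarget": {
          "HostedZoneId": "ZNLBUSEAST1EXAMPLE",
          "DNSName": "payments-nlb.us-east-1.example.invalid",
          "EvaluateTargetHealth": true
        }
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "payments.example.com",
        "Type": "A",
        "SetIdentifier": "payments-eu-west-1",
        "Failover": "SECONDARY",
        "HealthCheckId": "00000000-0000-0000-0000-000000000002",
        "AliasTarget": {
          "HostedZoneId": "ZNLBEUWEST1EXAMPLE",
          "DNSName": "payments-nlb.eu-west-1.example.invalid",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}
EOF

# Applying it would use the Route 53 CLI (left commented out so the sketch has no side effects):
# aws route53 change-resource-record-sets \
#   --hosted-zone-id Z1234567890 \
#   --change-batch file://failover-records.json
```

Note that the HostedZoneId inside AliasTarget refers to the load balancer's zone, not to the zone that holds payments.example.com.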

Split-brain prevention:

  • The database is the source of truth, not the routing layer
  • Only ONE PostgreSQL instance accepts writes at a time (enforced by RDS)
  • If both regions somehow receive traffic simultaneously, idempotency keys in PostgreSQL prevent duplicate processing
  • Fencing: after failover, the old primary’s write credentials are revoked, so a recovered us-east-1 database cannot accept stale writes
  • Route 53 failover routing is deterministic — primary is always preferred when healthy

Health check design: The deep health check endpoint must verify:

  1. API server is responding (basic liveness)
  2. PostgreSQL primary is writable (execute a test write)
  3. Redis is reachable (SET/GET test key)
  4. At least one payment processor is reachable
  5. Certificate is valid (not about to expire)

If ANY of these checks fails, the endpoint returns 503. Once Route 53 records enough consecutive failures to mark the region unhealthy, it fails over.
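
The all-or-nothing rule can be sketched as a small aggregator that runs each probe and reports a single status code. The function name and the probe stubs below are hypothetical; real probes would perform the test write, the Redis SET/GET, and so on.

```bash
# Run each named check; print 200 only if every one succeeds, otherwise 503.
deep_health_status() {
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo 503
      return 1
    fi
  done
  echo 200
}

# Placeholder probes (assumptions): each should exit 0 on success, non-zero on failure.
check_api()         { true; }   # basic liveness
check_db_writable() { true; }   # e.g. a test write against the primary
check_redis()       { true; }   # e.g. SET/GET on a test key
check_processor()   { true; }   # at least one payment processor reachable
check_certificate() { true; }   # certificate not close to expiry

deep_health_status check_api check_db_writable check_redis check_processor check_certificate
```

Stopping at the first failure keeps the endpoint fast when a dependency is down, which matters because each probe adds to the response time that the health checker measures.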

Write a step-by-step runbook for both automated and manual failover scenarios.

Solution

Automated Failover (health check triggered):

  1. Route 53 health check fails for us-east-1 (3 consecutive failures, 10s intervals = 30s detection)
  2. Route 53 automatically returns eu-west-1 NLB IP for payments.example.com
  3. DNS TTL (60 seconds) expires; clients begin hitting eu-west-1
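  4. Failover automation promotes the eu-west-1 standby to primary (triggered by database monitoring that runs separately from the Route 53 health check, ~30s). A cross-region replica is not promoted on its own; it needs a monitoring-driven trigger or a manual command
  5. eu-west-1 application pods keep using their regional database endpoint, which now resolves to a writable primary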
  6. eu-west-1 Redis warms up (rate limit counters from recent transactions table, ~15s)
  7. PagerDuty alert fires: “PAYMENT SERVICE: Automated failover to eu-west-1 complete”
  8. On-call engineer verifies: transaction success rate, latency, error rates

Manual Failover (planned maintenance or engineer-triggered):

# Step 1: Verify eu-west-1 readiness
kubectl --context eu-west-1 get pods -n payments
# Expect: All pods Running, health checks passing

# Step 2: Scale down us-east-1 to drain traffic gracefully
kubectl --context us-east-1 scale deployment payment-api --replicas=0 -n payments
# Wait for in-flight requests to complete (watch active connections)

# Step 3: Promote eu-west-1 database
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier payments-eu-west-1

# Step 4: Update Route 53 to point to eu-west-1
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch file://failover-to-eu.json

# Step 5: Verify
curl -s https://payments.example.com/healthz/deep | jq .
# Expect: {"status": "ok", "region": "eu-west-1", "db": "primary"}

# Step 6: Monitor for 15 minutes
# Watch: transaction success rate, p99 latency, error rate

Failback procedure (returning to us-east-1):

  1. Establish new replication from eu-west-1 (now primary) to us-east-1 (now standby)
  2. Wait for replication lag to reach 0
  3. Execute manual failover procedure in reverse
  4. Re-establish original replication direction
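
Step 2 of the failback procedure is a polling loop in practice. The sketch below assumes a hypothetical get_lag_seconds helper that prints the current replication lag as a whole number of seconds; in a real setup it might query the pg_stat_replication view on the primary. Here it is a stub so the example runs on its own.

```bash
# Hypothetical stub: replace with a real lag query for your database.
get_lag_seconds() { echo 0; }

# Poll until lag reaches 0. Arguments: max attempts, seconds between attempts.
# Returns 0 when caught up, 1 on timeout, 2 if the lag query itself fails.
wait_for_zero_lag() {
  max_attempts=$1
  interval=$2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    lag=$(get_lag_seconds) || return 2
    if [ "$lag" -eq 0 ]; then
      echo "replication caught up after $attempt check(s)"
      return 0
    fi
    sleep "$interval"
    attempt=$(( attempt + 1 ))
  done
  return 1
}

wait_for_zero_lag 30 10 || echo "lag did not reach 0; do not proceed with failback"
```

Gating the reverse failover on this loop's exit status prevents switching back while us-east-1 is still missing committed transactions.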

Success criteria:

  • Chose and justified an architecture pattern (active-active vs active-passive)
  • Designed data replication strategy with RPO=0 guarantee
  • Addressed Redis state management during failover
  • Designed health checks that verify actual service capability
  • Included split-brain prevention mechanism
  • Created both automated and manual failover runbooks
  • Failover achieves RTO of 5 minutes or less

Module 4.3: Cloud IAM Integration for Kubernetes — Your clusters are designed for resilience, but who gets to access them? We’ll explore how cloud IAM integrates with Kubernetes to provide identity-based access without ever passing secrets around.