
Module 10.7: Multi-Cloud Service Mesh (Istio Multi-Cluster)

Complexity: [COMPLEX] | Time to Complete: 3h | Prerequisites: Kubernetes Networking, Service Mesh Basics, Hybrid Cloud Architecture (Module 10.4)

After completing this module, you will be able to:

  • Deploy Istio multi-cluster mesh across EKS, GKE, and AKS clusters with cross-cloud service discovery
  • Configure mTLS-encrypted communication between services running in different cloud providers
  • Implement traffic management policies (fault injection, traffic shifting, circuit breaking) across multi-cloud meshes
  • Design multi-cloud mesh architectures that balance latency, security, and operational complexity trade-offs

In October 2023, a ride-sharing company operated two Kubernetes clusters: their primary on AWS in us-east-1 and a disaster recovery cluster on GCP in us-central1. When AWS us-east-1 experienced a partial outage affecting their EKS control plane, their failover plan — a manual DNS switch to GCP — took 43 minutes to execute. During those 43 minutes, the booking service was down for 12 million users. Post-incident analysis revealed that the DNS failover was slow because it required three teams to coordinate: the platform team to verify GCP was ready, the networking team to switch DNS records, and the on-call lead to approve the switch. They estimated the outage cost $3.8 million in lost bookings and $1.2 million in rider credits issued to angry customers.

After the incident, the company implemented Istio multi-cluster across both environments. With Istio’s locality-aware load balancing, traffic from users near us-east-1 went to AWS by default, while traffic near us-central1 went to GCP. When a service on AWS became unhealthy, Istio automatically routed traffic to GCP — no DNS changes, no human coordination, no 43-minute outage. The next partial AWS outage (two months later) resulted in zero user-visible downtime. Traffic shifted to GCP within 8 seconds.

This module teaches you how to build that capability. You will learn Istio’s multi-cluster topologies, how to establish trust across clusters using SPIFFE/SPIRE, how to configure cross-cloud routing and failover, and how to troubleshoot mTLS in a multi-cluster environment.


Istio supports multiple ways to connect clusters. The choice depends on your network topology and your trust model.

Primary-Remote: one cluster runs the full Istio control plane (the “primary”), while other clusters connect as “remotes” that share that control plane.

flowchart LR
subgraph Primary["PRIMARY CLUSTER (AWS)"]
direction TB
Istiod["Istiod (Control Plane)<br>Service Registry (both clusters)"]
SvcA["Svc-A sidecar"]
SvcB["Svc-B sidecar"]
SvcC["Svc-C sidecar"]
end
subgraph Remote["REMOTE CLUSTER (GCP)"]
direction TB
NoCP["(No Istiod)<br>Proxies connect to primary's Istiod"]
SvcC2["Svc-C sidecar"]
SvcD["Svc-D sidecar"]
SvcE["Svc-E sidecar"]
end
SvcC2 -.->|Connects to| Istiod
SvcD -.->|Connects to| Istiod
SvcE -.->|Connects to| Istiod
  • Pros: Simple, single control plane to manage
  • Cons: Primary is SPOF for config distribution
  • Best for: Active-passive, DR scenarios
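As a rough sketch of how a remote attaches to a primary (hedged: this assumes the primary already exposes Istiod through an east-west gateway, and $CTX_PRIMARY, $CTX_REMOTE, and $DISCOVERY_ADDRESS are placeholders you must set for your environment):

```shell
# Sketch only: join a remote cluster to the primary's control plane.
# DISCOVERY_ADDRESS is assumed to be the primary's east-west gateway address.
istioctl install --context "$CTX_REMOTE" \
  --set profile=remote \
  --set values.global.remotePilotAddress="$DISCOVERY_ADDRESS" \
  --set values.global.meshID=company-mesh \
  --set values.global.multiCluster.clusterName=remote1 -y

# Give the primary's Istiod API access to the remote cluster so it can
# discover endpoints there
istioctl create-remote-secret --context "$CTX_REMOTE" --name=remote1 | \
  kubectl --context "$CTX_PRIMARY" apply -f -
```

Note the direction of the secret: it is generated against the remote and applied to the primary, because in this topology only the primary's Istiod needs to watch the other cluster.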

Multi-Primary: each cluster runs its own Istio control plane, and the clusters share service discovery information.

flowchart LR
subgraph Cluster1["CLUSTER-1 (AWS)"]
direction TB
Istiod1["Istiod-1 (Local CP)<br>Knows about Cluster-1 & 2 svcs"]
SvcA1["Svc-A sidecar"]
SvcB1["Svc-B sidecar"]
SvcC["Svc-C sidecar"]
end
subgraph Cluster2["CLUSTER-2 (GCP)"]
direction TB
Istiod2["Istiod-2 (Local CP)<br>Knows about Cluster-1 & 2 svcs"]
SvcA2["Svc-A sidecar"]
SvcB2["Svc-B sidecar"]
SvcD["Svc-D sidecar"]
end
Istiod1 <-->|Shares service discovery| Istiod2
  • Svc-A and Svc-B exist in BOTH clusters (multi-region)
  • Svc-C only in Cluster-1, Svc-D only in Cluster-2
  • Both Istiods know about ALL services across both clusters
  • Pros: No SPOF, each cluster independent if other fails
  • Cons: More complex, config must be synchronized
  • Best for: Active-active, multi-region production
| Feature | Primary-Remote | Multi-Primary |
| --- | --- | --- |
| Control plane redundancy | No (primary is SPOF) | Yes (each cluster has its own) |
| Config distribution | All proxies connect to primary | Each cluster’s proxies connect locally |
| Cross-cluster latency impact | Remote proxies add latency for config | Only data plane cross-cluster calls add latency |
| Complexity | Lower | Higher |
| Network requirement | Remote must reach primary’s Istiod | Cross-cluster pod connectivity (or east-west gateway) |
| Best for | DR, dev/test, hub-spoke | Active-active production, multi-region |

For mTLS to work across clusters, every sidecar proxy must be able to verify the certificates presented by proxies in the other clusters. This requires a shared root of trust.

Stop and think: If Cluster 1 and Cluster 2 have completely different, self-signed root CAs, what exact error would a client sidecar proxy throw when attempting an mTLS handshake with a server proxy in the other cluster?
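One way to check your answer empirically, outside Istio entirely: sign a leaf certificate with one self-signed root and try to verify it against another (all names below are throwaway examples):

```shell
cd "$(mktemp -d)"
# Two unrelated self-signed roots, as if each cluster ran its own Istio CA
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=cluster1-ca" \
  -keyout ca1.key -out ca1.crt
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=cluster2-ca" \
  -keyout ca2.key -out ca2.crt
# A workload certificate issued by cluster1's CA
openssl req -new -newkey rsa:2048 -nodes -subj "/CN=payment-service" \
  -keyout svc.key -out svc.csr
openssl x509 -req -days 1 -in svc.csr -CA ca1.crt -CAkey ca1.key \
  -CAcreateserial -out svc.crt
# Cluster1's trust store accepts it...
openssl verify -CAfile ca1.crt svc.crt
# ...cluster2's rejects it ("unable to get local issuer certificate"),
# which a sidecar surfaces as a failed TLS handshake / certificate verify error
openssl verify -CAfile ca2.crt svc.crt || true
```

The second `verify` is exactly the check the server-side proxy performs during the mTLS handshake; with disjoint roots it can never succeed.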

flowchart TD
Root["Shared Root CA<br>(offline, HSM)"]
Int1["Intermediate CA (Cluster1)"]
Int2["Intermediate CA (Cluster2)"]
Int3["Intermediate CA (Cluster3)"]
Root --> Int1
Root --> Int2
Root --> Int3
W1["Workload Certs"]
W2["Workload Certs"]
W3["Workload Certs"]
Int1 --> W1
Int2 --> W2
Int3 --> W3

All workload certs chain to the SAME root CA. Therefore: Cluster1 trusts Cluster2’s certificates.

# Generate a root CA certificate (in production, use a hardware security module)
mkdir -p /tmp/istio-certs
# Root CA (shared across all clusters)
openssl req -new -newkey rsa:4096 -x509 -sha256 \
-days 3650 -nodes \
-subj "/O=Company Inc./CN=Root CA" \
-keyout /tmp/istio-certs/root-key.pem \
-out /tmp/istio-certs/root-cert.pem
# Intermediate CA for Cluster 1
openssl req -new -newkey rsa:4096 -nodes \
-subj "/O=Company Inc./CN=Cluster-1 Intermediate CA" \
-keyout /tmp/istio-certs/cluster1-ca-key.pem \
-out /tmp/istio-certs/cluster1-ca-csr.pem
openssl x509 -req -sha256 -days 1825 \
-CA /tmp/istio-certs/root-cert.pem \
-CAkey /tmp/istio-certs/root-key.pem \
-CAcreateserial \
-in /tmp/istio-certs/cluster1-ca-csr.pem \
-out /tmp/istio-certs/cluster1-ca-cert.pem \
-extfile <(echo -e "basicConstraints=CA:TRUE\nkeyUsage=critical,keyCertSign,cRLSign")
# Create cert chain for Cluster 1
cat /tmp/istio-certs/cluster1-ca-cert.pem /tmp/istio-certs/root-cert.pem \
> /tmp/istio-certs/cluster1-cert-chain.pem
# Repeat for Cluster 2 (different intermediate, same root)
openssl req -new -newkey rsa:4096 -nodes \
-subj "/O=Company Inc./CN=Cluster-2 Intermediate CA" \
-keyout /tmp/istio-certs/cluster2-ca-key.pem \
-out /tmp/istio-certs/cluster2-ca-csr.pem
openssl x509 -req -sha256 -days 1825 \
-CA /tmp/istio-certs/root-cert.pem \
-CAkey /tmp/istio-certs/root-key.pem \
-CAcreateserial \
-in /tmp/istio-certs/cluster2-ca-csr.pem \
-out /tmp/istio-certs/cluster2-ca-cert.pem \
-extfile <(echo -e "basicConstraints=CA:TRUE\nkeyUsage=critical,keyCertSign,cRLSign")
cat /tmp/istio-certs/cluster2-ca-cert.pem /tmp/istio-certs/root-cert.pem \
> /tmp/istio-certs/cluster2-cert-chain.pem
# Install certs as secrets in each cluster's istio-system namespace
kubectl --context cluster1 create namespace istio-system
kubectl --context cluster1 create secret generic cacerts -n istio-system \
--from-file=ca-cert.pem=/tmp/istio-certs/cluster1-ca-cert.pem \
--from-file=ca-key.pem=/tmp/istio-certs/cluster1-ca-key.pem \
--from-file=root-cert.pem=/tmp/istio-certs/root-cert.pem \
--from-file=cert-chain.pem=/tmp/istio-certs/cluster1-cert-chain.pem
kubectl --context cluster2 create namespace istio-system
kubectl --context cluster2 create secret generic cacerts -n istio-system \
--from-file=ca-cert.pem=/tmp/istio-certs/cluster2-ca-cert.pem \
--from-file=ca-key.pem=/tmp/istio-certs/cluster2-ca-key.pem \
--from-file=root-cert.pem=/tmp/istio-certs/root-cert.pem \
--from-file=cert-chain.pem=/tmp/istio-certs/cluster2-cert-chain.pem
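A quick sanity check before installing Istio (continuing the same terminal session: both intermediates must verify against the one shared root, and the root's fingerprint must be byte-identical everywhere you install it):

```shell
# Both intermediate CA certs should chain to the single shared root
for C in cluster1 cluster2; do
  openssl verify -CAfile /tmp/istio-certs/root-cert.pem \
    /tmp/istio-certs/${C}-ca-cert.pem
done
# Record the root fingerprint; it must match in every cluster's cacerts secret
openssl x509 -in /tmp/istio-certs/root-cert.pem -noout -fingerprint -sha256
```

If either `verify` fails here, fix the chain now; debugging it later through Envoy handshake errors is far more painful.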

For production environments, SPIFFE (Secure Production Identity Framework For Everyone) and SPIRE (SPIFFE Runtime Environment) provide a more robust identity system than Istio’s built-in CA.

flowchart TD
subgraph C1["Cluster 1"]
SS1["SPIRE Server<br>Trust Domain: company.com"]
SA1["SPIRE Agent (per node)<br>Issues SVIDs to sidecars"]
SS1 --> SA1
end
subgraph C2["Cluster 2"]
SS2["SPIRE Server<br>Trust Domain: company.com"]
SA2["SPIRE Agent (per node)<br>Issues SVIDs to sidecars"]
SS2 --> SA2
end
SS1 <-->|Federated Trust| SS2

Both clusters use the same trust domain (company.com), so SVIDs from Cluster 1 are trusted by Cluster 2 and vice versa. No shared CA signing key is needed: the SPIRE servers exchange trust bundles rather than sharing private key material.


# Set cluster contexts
CTX_CLUSTER1=kind-cluster1
CTX_CLUSTER2=kind-cluster2
# Create and label the istio-system namespace in each cluster for Istio topology awareness
kubectl --context $CTX_CLUSTER1 create namespace istio-system 2>/dev/null || true
kubectl --context $CTX_CLUSTER1 label namespace istio-system topology.istio.io/network=network1
kubectl --context $CTX_CLUSTER2 create namespace istio-system 2>/dev/null || true
kubectl --context $CTX_CLUSTER2 label namespace istio-system topology.istio.io/network=network2
# Install Istio on Cluster 1
cat <<'EOF' > /tmp/istio-cluster1.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: company-mesh
      multiCluster:
        clusterName: cluster1
      network: network1
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"
  components:
    ingressGateways:
      - name: istio-eastwestgateway
        label:
          istio: eastwestgateway
          app: istio-eastwestgateway
          topology.istio.io/network: network1
        enabled: true
        k8s:
          env:
            - name: ISTIO_META_REQUESTED_NETWORK_VIEW
              value: network1
          service:
            ports:
              - name: status-port
                port: 15021
                targetPort: 15021
              - name: tls
                port: 15443
                targetPort: 15443
              - name: tls-istiod
                port: 15012
                targetPort: 15012
              - name: tls-webhook
                port: 15017
                targetPort: 15017
EOF
istioctl install --context $CTX_CLUSTER1 -f /tmp/istio-cluster1.yaml -y
# Install Istio on Cluster 2 (similar but different cluster name and network)
cat <<'EOF' > /tmp/istio-cluster2.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: company-mesh
      multiCluster:
        clusterName: cluster2
      network: network2
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"
  components:
    ingressGateways:
      - name: istio-eastwestgateway
        label:
          istio: eastwestgateway
          app: istio-eastwestgateway
          topology.istio.io/network: network2
        enabled: true
        k8s:
          env:
            - name: ISTIO_META_REQUESTED_NETWORK_VIEW
              value: network2
          service:
            ports:
              - name: status-port
                port: 15021
                targetPort: 15021
              - name: tls
                port: 15443
                targetPort: 15443
              - name: tls-istiod
                port: 15012
                targetPort: 15012
              - name: tls-webhook
                port: 15017
                targetPort: 15017
EOF
istioctl install --context $CTX_CLUSTER2 -f /tmp/istio-cluster2.yaml -y
# Exchange remote secrets for cross-cluster discovery
istioctl create-remote-secret --context $CTX_CLUSTER1 --name=cluster1 | \
kubectl apply -f - --context $CTX_CLUSTER2
istioctl create-remote-secret --context $CTX_CLUSTER2 --name=cluster2 | \
kubectl apply -f - --context $CTX_CLUSTER1
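To confirm the exchange worked, each cluster should now hold a secret for its peer carrying the istio/multiCluster=true label (the label is the upstream convention for secrets produced by create-remote-secret; verify against your Istio version):

```shell
# Sketch: list the remote-cluster secrets each control plane will watch
for CTX in $CTX_CLUSTER1 $CTX_CLUSTER2; do
  echo "=== Remote secrets visible to $CTX ==="
  kubectl --context "$CTX" get secrets -n istio-system \
    -l "istio/multiCluster=true"
done
```

If a cluster shows no secret here, its Istiod has no credentials to discover the other cluster's endpoints, and cross-cluster routing will silently fall back to local-only.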

The east-west gateway handles cross-cluster traffic. Unlike the ingress gateway, which terminates north-south traffic from external clients, the east-west gateway passes mTLS connections between clusters through without terminating them.

# Expose services through the east-west gateway on both clusters
for CTX in $CTX_CLUSTER1 $CTX_CLUSTER2; do
kubectl --context $CTX apply -n istio-system -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: cross-network-gateway
spec:
  selector:
    istio: eastwestgateway
  servers:
    - port:
        number: 15443
        name: tls
        protocol: TLS
      tls:
        mode: AUTO_PASSTHROUGH
      hosts:
        - "*.local"
EOF
done

Pause and predict: If you configure a failover from us-east-1 to us-central1, but forget to define an outlierDetection policy in your DestinationRule, what behavior will you observe when us-east-1 endpoints start returning HTTP 500 errors?

Istio’s locality-aware load balancing routes traffic to the nearest healthy endpoint, falling back to remote endpoints when local ones fail.

# DestinationRule with locality failover
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east-1
            to: us-central1
          - from: us-central1
            to: us-east-1
        failoverPriority:
          - "topology.kubernetes.io/region"
          - "topology.kubernetes.io/zone"
      warmupDurationSecs: 30
---
# VirtualService for canary-style cross-cluster routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: production
spec:
  hosts:
    - payment-service.production.svc.cluster.local
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: payment-service.production.svc.cluster.local
            subset: cluster2-canary
          weight: 100
    - route:
        - destination:
            host: payment-service.production.svc.cluster.local
            subset: cluster1-primary
          weight: 90
        - destination:
            host: payment-service.production.svc.cluster.local
            subset: cluster2-secondary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-subsets
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  subsets:
    - name: cluster1-primary
      labels:
        topology.istio.io/cluster: cluster1
    - name: cluster2-secondary
      labels:
        topology.istio.io/cluster: cluster2
    - name: cluster2-canary
      labels:
        topology.istio.io/cluster: cluster2
        version: canary

mTLS failures are the most common and most frustrating class of problems in multi-cluster Istio. Here is a systematic troubleshooting guide.

| Symptom | Likely Cause | Diagnostic Command |
| --- | --- | --- |
| 503 between clusters | Root CA mismatch | istioctl proxy-config secret &lt;pod&gt; -o json |
| Connection reset | TLS version mismatch | istioctl proxy-status |
| Intermittent failures | Certificate expiry | openssl s_client -connect &lt;svc&gt;:443 |
| "upstream connect error" | East-west gateway not reachable | kubectl get svc istio-eastwestgateway -n istio-system |
| RBAC denied | Authorization policy too restrictive | istioctl analyze -n production |
# Step 1: Verify mesh-wide mTLS mode
kubectl get peerauthentication -A
# Step 2: Check if both clusters have the same root CA
for CTX in kind-cluster1 kind-cluster2; do
echo "=== $CTX Root CA ==="
kubectl --context $CTX get secret cacerts -n istio-system \
-o jsonpath='{.data.root-cert\.pem}' | base64 -d | \
openssl x509 -noout -subject -issuer -fingerprint
done
# Step 3: Verify cross-cluster service discovery
istioctl --context kind-cluster1 proxy-config endpoints \
$(kubectl --context kind-cluster1 get pod -n production -l app=frontend -o jsonpath='{.items[0].metadata.name}') \
--cluster "outbound|80||payment-service.production.svc.cluster.local"
# Step 4: Check proxy certificate chain
istioctl --context kind-cluster1 proxy-config secret \
$(kubectl --context kind-cluster1 get pod -n production -l app=frontend -o jsonpath='{.items[0].metadata.name}') \
-o json | jq '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain'
# Step 5: Test cross-cluster connectivity
kubectl --context kind-cluster1 exec -n production deploy/frontend -- \
curl -sI payment-service.production.svc.cluster.local:80
# Step 6: Check east-west gateway logs for errors
kubectl --context kind-cluster1 logs -n istio-system \
-l istio=eastwestgateway --tail=50
# Step 7: Run Istio diagnostics
istioctl --context kind-cluster1 analyze --all-namespaces

  1. Istio’s multi-cluster feature was first introduced in Istio 1.1 (March 2019) as an experimental feature. It took until Istio 1.7 (August 2020) — over a year later — before it was considered production-ready. The delay was primarily due to the complexity of certificate management across clusters. Early adopters reported spending more time debugging mTLS certificate chains than on any other aspect of the mesh.

  2. SPIFFE (Secure Production Identity Framework For Everyone) was created at Scytale, a company founded by former IETF and Kubernetes security engineers. SPIRE, the reference implementation, can issue over 10,000 SVIDs (SPIFFE Verifiable Identity Documents) per second per server. Netflix uses SPIFFE for workload identity across their entire fleet of over 1,000 microservices, replacing X.509 certificates that previously had to be manually distributed.

  3. Istio’s east-west gateway uses a feature called AUTO_PASSTHROUGH that allows the gateway to route mTLS traffic without terminating TLS. This means the gateway never sees the plaintext traffic — it examines only the SNI (Server Name Indication) in the TLS handshake to determine routing. This is both a security benefit (the gateway cannot be compromised to read traffic) and a performance benefit (no double TLS termination/origination).

  4. The locality load balancing feature in Istio was contributed by Google engineers who had built a similar system internally for Borg (Google’s predecessor to Kubernetes). The internal system at Google handles over 50 billion requests per second with locality awareness. When a Google data center goes offline, traffic automatically shifts to the nearest available data center within 5 seconds — the same capability that Istio brings to Kubernetes clusters.


| Mistake | Why It Happens | How to Fix It |
| --- | --- | --- |
| Different root CAs per cluster | Each cluster was set up independently, using Istio’s self-signed CA. Cross-cluster mTLS fails because certificates are not trusted. | Generate a shared root CA before installing Istio. Distribute intermediate CAs per cluster. All must chain to the same root. |
| East-west gateway not exposed | Gateway is deployed but its LoadBalancer service is internal-only or blocked by security groups. Cross-cluster traffic cannot reach the gateway. | Verify the east-west gateway has a reachable external IP. Check security groups/NSGs allow port 15443 between clusters. |
| No outlier detection configured | Locality failover is enabled but there is no mechanism to detect unhealthy endpoints. Istio keeps sending traffic to a failing cluster. | Always configure outlierDetection in DestinationRules. Set appropriate thresholds (e.g., 3 consecutive 5xx errors) and ejection times. |
| Remote secrets with wrong kubeconfig | The remote secret was generated with a kubeconfig that uses internal DNS names not reachable from the other cluster. | Ensure the API server address in remote secrets uses a reachable endpoint (public IP or DNS that resolves across clusters). |
| Strict mTLS without exception for health checks | Cloud load balancer health checks cannot do mTLS. Services become unhealthy because the LB cannot reach them. | Configure PeerAuthentication with permissive mode for health check ports, or use Istio’s built-in health check rewriting. |
| Authorization policies blocking cross-cluster traffic | AuthorizationPolicy specifies source principals using cluster-1 identities. Traffic from cluster-2 has different SPIFFE URIs and is denied. | Use trust-domain-aware principal patterns. In multi-cluster, use principals: ["cluster.local/ns/*/sa/*"] or specific trust domain aliases. |
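The health-check exception described in the table can be expressed with a port-level mTLS override. This is a sketch: the app label and port 8080 are placeholders for your own workload, and you should keep the exception as narrow as possible.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: backend-mtls
  namespace: production
spec:
  selector:
    matchLabels:
      app: backend          # placeholder: your workload's label
  mtls:
    mode: STRICT            # mesh traffic stays mutually authenticated
  portLevelMtls:
    8080:                   # placeholder: the port the cloud LB probes
      mode: PERMISSIVE      # health checks may connect without mTLS
```

Only the listed port accepts plaintext; every other port on the workload still requires mTLS.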

Question 1: Your organization is designing a multi-cluster mesh across two data centers (Active and Standby). The network team wants to minimize the number of control planes to manage, but the architecture board is concerned about single points of failure. How do Primary-Remote and Multi-Primary topologies differ in addressing these concerns, and which would you recommend for this specific Active/Standby scenario?

In the Primary-Remote topology, one cluster hosts the centralized control plane (Istiod), and the other cluster’s Envoy proxies connect across the network to retrieve their configurations. This significantly reduces management overhead since there is only one control plane to maintain, but introduces a single point of failure if the primary cluster experiences an outage. Multi-Primary addresses this by running an independent Istiod in every cluster, ensuring that each environment can continue operating autonomously even if network connectivity between data centers is lost. For an Active/Standby architecture, Primary-Remote is often recommended because the Standby data center is inherently dependent on the Active one anyway, and managing a single control plane simplifies disaster recovery workflows.

Question 2: Two development teams merge their independent Kubernetes clusters into a multi-cluster Istio mesh. They configure cross-cluster service discovery, but all cross-cluster traffic immediately fails with TLS handshake errors. Based on how mTLS establishes trust, what is the root cause of this failure, and how must they reconfigure their certificate authorities to fix it?

The root cause of the failure is that the two independent clusters are using different, unshared root Certificate Authorities (CAs). For mTLS to succeed, the Envoy proxy in the client cluster must be able to cryptographically verify the certificate presented by the server proxy in the destination cluster. This verification requires both proxies to share a common root of trust in their trust stores. To fix this, the teams must generate a single shared Root CA, use it to sign intermediate CA certificates for each specific cluster, and distribute those intermediate certificates to their respective Istio control planes. Once both clusters issue workload certificates derived from the same root, the TLS handshakes will successfully validate.

Question 3: Your multi-cluster Istio setup uses locality-aware load balancing. Service-A in us-east-1 calls Service-B, which has endpoints in both us-east-1 and eu-west-1. Under normal conditions, where does the traffic go? What if all us-east-1 endpoints for Service-B fail?

Under normal conditions, traffic goes to us-east-1 endpoints because locality-aware load balancing prefers the closest endpoints. The preference order dictates that requests stay within the same zone first, then the same region, and finally a different region. When all us-east-1 endpoints fail—detected via outlier detection rules such as consecutive 5xx errors—Istio ejects those local endpoints from the load balancing pool. It then falls back to the next available locality defined in your failover configuration, shifting traffic to the eu-west-1 endpoints. This failover process occurs transparently to Service-A, ensuring high availability while automatically reverting back to local endpoints once the us-east-1 instances recover.

Question 4: A junior engineer is confused about why they need an east-west gateway when they already have an ingress gateway routing external traffic into the mesh. How would you explain a scenario where the east-west gateway is explicitly required, and how does its traffic handling differ from the standard ingress gateway?

The standard ingress gateway is designed for north-south traffic, meaning it typically terminates client-facing TLS connections, applies HTTP routing rules, and forwards the requests to internal mesh services. However, when services in Cluster A need to communicate with services in Cluster B across a network boundary, they require an east-west gateway to bridge the two environments. Unlike the ingress gateway, the east-west gateway uses AUTO_PASSTHROUGH mode, which means it does not terminate the mTLS connection established by the client sidecar. Instead, it inspects the Server Name Indication (SNI) in the TLS handshake to identify the target service and routes the encrypted connection directly to the destination pod, preserving end-to-end encryption across clusters.
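You can observe this SNI-based routing from outside the mesh. This is a sketch: GATEWAY_IP is a placeholder for cluster2's east-west gateway address, and the servername follows Istio's internal outbound_.&lt;port&gt;_._.&lt;hostname&gt; SNI convention for AUTO_PASSTHROUGH traffic.

```shell
GATEWAY_IP=203.0.113.10   # placeholder: cluster2 east-west gateway address
openssl s_client -connect "$GATEWAY_IP:15443" \
  -servername "outbound_.80_._.payment-service.production.svc.cluster.local" \
  </dev/null 2>/dev/null | openssl x509 -noout -subject
# The certificate returned belongs to the target sidecar, not the gateway,
# showing the gateway routed on SNI alone. Without a valid client certificate
# the handshake then fails -- strict mTLS working as intended.
```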

Question 5: A developer reports that cross-cluster calls from Cluster-1 to Cluster-2 fail with "upstream connect error or disconnect/reset before headers." What is your troubleshooting process to isolate the cause of this connection failure?

This specific error indicates a connection-level failure between the Envoy proxy in Cluster-1 and the target in Cluster-2. Your first troubleshooting step should be verifying east-west gateway reachability to ensure Cluster-1 pods can successfully connect to Cluster-2’s gateway IP on port 15443. If network paths are clear, you must verify the remote secrets by running istioctl proxy-config endpoints on the source pod to confirm it has correctly populated endpoints for the target service. If endpoints are present but connections still fail, you should compare the root CA fingerprints across both clusters to rule out an mTLS trust mismatch. Finally, enabling debug logging on the source proxy can reveal detailed TLS handshake errors that pinpoint whether the issue is related to certificate validation or network timeouts.

Question 6: Your infrastructure team wants to deploy a Multi-Primary Istio mesh stretching across an AWS EKS cluster and an on-premises Kubernetes cluster connected via a standard site-to-site VPN. What specific network and reliability challenges will this scenario introduce, and how should you adapt your Istio configuration to mitigate them?

Running a multi-cluster mesh across a standard VPN introduces significant latency and reliability challenges, as VPN tunnels over the public internet often experience variable ping times and occasional packet loss. This added network friction can cause Istio’s outlier detection to falsely identify healthy on-premises endpoints as failing during temporary latency spikes, triggering unnecessary cross-cluster failovers. To mitigate this, you must carefully tune your outlier detection thresholds in your DestinationRules, using longer evaluation intervals and higher error count limits for cross-cluster traffic. Furthermore, the on-premises cluster must provide a stable, routable IP address for its east-west gateway that remains accessible through the VPN tunnel, which often requires complex NAT configuration. For true production reliability, migrating from a VPN to a dedicated private link like AWS Direct Connect is highly recommended.
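The tuning described above can be sketched as a DestinationRule with relaxed ejection thresholds (the values are illustrative, not prescriptive; tune them against your measured VPN latency and error rates):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-vpn-tuning
  namespace: production
spec:
  host: payment-service.production.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 10   # tolerate transient VPN-induced errors
      interval: 30s              # evaluate health over a longer window
      baseEjectionTime: 60s
      maxEjectionPercent: 50     # never eject the entire remote locality
```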


Hands-On Exercise: Multi-Cluster Service Discovery with Simulated Mesh


In this exercise, you will create two kind clusters, establish cross-cluster service discovery, and demonstrate locality-aware routing with failover.

What you will build:

flowchart LR
subgraph C1["cluster1 (kind)"]
direction TB
F["frontend"]
B1["backend (local)<br>Priority: local"]
F --> B1
end
subgraph C2["cluster2 (kind)"]
B2["backend (remote)<br>Failover target"]
end
C1 <-->|"Docker network"| C2
Solution
# Create two clusters
kind create cluster --name mesh-cluster1
kind create cluster --name mesh-cluster2
# Connect via Docker network for cross-cluster communication
docker network create mesh-net 2>/dev/null || true
docker network connect mesh-net mesh-cluster1-control-plane
docker network connect mesh-net mesh-cluster2-control-plane
echo "=== Cluster 1 ==="
kubectl --context kind-mesh-cluster1 get nodes
echo "=== Cluster 2 ==="
kubectl --context kind-mesh-cluster2 get nodes

Task 2: Deploy Services Across Both Clusters

Solution
# Deploy backend service on BOTH clusters (simulating multi-region)
for CTX in kind-mesh-cluster1 kind-mesh-cluster2; do
CLUSTER=$(echo $CTX | sed 's/kind-mesh-//')
kubectl --context $CTX create namespace production
cat <<EOF | kubectl --context $CTX apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  namespace: production
  labels:
    app: backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
        cluster: $CLUSTER
    spec:
      containers:
        - name: backend
          image: nginx:1.27.3
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
          # Custom response to identify which cluster served the request
          command: ["/bin/sh", "-c"]
          args:
            - |
              echo "server { listen 80; location / { return 200 'Response from $CLUSTER\n'; } }" > /etc/nginx/conf.d/default.conf
              nginx -g 'daemon off;'
---
apiVersion: v1
kind: Service
metadata:
  name: backend
  namespace: production
spec:
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 80
EOF
echo "Backend deployed on $CLUSTER"
done
# Deploy frontend ONLY on cluster1
cat <<'EOF' | kubectl --context kind-mesh-cluster1 apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: curlimages/curl:8.11.1
          command: ["sleep", "infinity"]
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
EOF
kubectl --context kind-mesh-cluster1 wait --for=condition=ready \
pod -l app=frontend -n production --timeout=60s
Solution
# From frontend on cluster1, call the local backend
echo "=== Testing local service call (cluster1 → cluster1 backend) ==="
kubectl --context kind-mesh-cluster1 exec -n production deploy/frontend -- \
curl -s backend.production.svc.cluster.local
# Verify that only cluster1 backend responds (no mesh yet)
for i in 1 2 3 4 5; do
RESPONSE=$(kubectl --context kind-mesh-cluster1 exec -n production deploy/frontend -- \
curl -s backend.production.svc.cluster.local)
echo " Request $i: $RESPONSE"
done
Solution
# Simulate "local backend failure" by scaling to 0
echo "=== Simulating local backend failure on cluster1 ==="
kubectl --context kind-mesh-cluster1 scale deployment backend \
-n production --replicas=0
# Wait for pods to terminate
kubectl --context kind-mesh-cluster1 wait --for=delete \
pod -l app=backend -n production --timeout=30s 2>/dev/null || true
# Verify backend is gone on cluster1
echo "Cluster1 backend pods: $(kubectl --context kind-mesh-cluster1 get pods -n production -l app=backend --no-headers 2>/dev/null | wc -l | tr -d ' ')"
echo "Cluster2 backend pods: $(kubectl --context kind-mesh-cluster2 get pods -n production -l app=backend --no-headers | wc -l | tr -d ' ')"
# In a real mesh, Istio would route to cluster2's backend
# For our simulation, let's demonstrate the concept
echo ""
echo "=== In a production Istio multi-cluster mesh: ==="
echo " 1. Frontend's Envoy proxy detects all cluster1 backend endpoints are gone"
echo " 2. Locality failover kicks in (configured via DestinationRule)"
echo " 3. Traffic automatically routes to cluster2's backend"
echo " 4. Frontend sees no errors -- just slightly higher latency"
echo " 5. When cluster1 backend recovers, traffic shifts back"
# Recover
echo ""
echo "=== Recovering cluster1 backend ==="
kubectl --context kind-mesh-cluster1 scale deployment backend \
-n production --replicas=2
kubectl --context kind-mesh-cluster1 wait --for=condition=ready \
pod -l app=backend -n production --timeout=60s
# Verify recovery
echo "Backend pods restored:"
kubectl --context kind-mesh-cluster1 get pods -n production -l app=backend
Solution
cat <<'SCRIPT' > /tmp/mesh-service-map.sh
#!/bin/bash
echo "============================================="
echo " MULTI-CLUSTER SERVICE MAP"
echo " $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "============================================="
for CTX in kind-mesh-cluster1 kind-mesh-cluster2; do
CLUSTER=$(echo $CTX | sed 's/kind-mesh-//')
echo ""
echo "--- Cluster: $CLUSTER ---"
for NS in $(kubectl --context $CTX get namespaces -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep -v '^kube-' | grep -v '^default$' | grep -v '^local-path-storage$'); do
SVCS=$(kubectl --context $CTX get services -n $NS --no-headers 2>/dev/null | wc -l | tr -d ' ')
if [ "$SVCS" -gt 0 ]; then
echo " Namespace: $NS"
kubectl --context $CTX get services -n $NS --no-headers 2>/dev/null | while read SVC_LINE; do
SVC_NAME=$(echo $SVC_LINE | awk '{print $1}')
SVC_TYPE=$(echo $SVC_LINE | awk '{print $2}')
SVC_PORT=$(echo $SVC_LINE | awk '{print $5}')
# Count endpoints
ENDPOINTS=$(kubectl --context $CTX get endpoints $SVC_NAME -n $NS -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null | wc -w | tr -d ' ')
echo " Service: $SVC_NAME (type=$SVC_TYPE, ports=$SVC_PORT, endpoints=$ENDPOINTS)"
done
fi
done
done
echo ""
echo "============================================="
echo " CROSS-CLUSTER SERVICE OVERLAP"
echo "============================================="
echo " (Services that exist in multiple clusters)"
# Find services that exist in both clusters
C1_SVCS=$(kubectl --context kind-mesh-cluster1 get services -n production -o jsonpath='{.items[*].metadata.name}' 2>/dev/null)
C2_SVCS=$(kubectl --context kind-mesh-cluster2 get services -n production -o jsonpath='{.items[*].metadata.name}' 2>/dev/null)
for SVC in $C1_SVCS; do
if echo "$C2_SVCS" | grep -qw "$SVC"; then
C1_EP=$(kubectl --context kind-mesh-cluster1 get endpoints $SVC -n production -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null | wc -w | tr -d ' ')
C2_EP=$(kubectl --context kind-mesh-cluster2 get endpoints $SVC -n production -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null | wc -w | tr -d ' ')
echo " $SVC: cluster1=$C1_EP endpoints, cluster2=$C2_EP endpoints"
fi
done
SCRIPT
chmod +x /tmp/mesh-service-map.sh
bash /tmp/mesh-service-map.sh
kind delete cluster --name mesh-cluster1
kind delete cluster --name mesh-cluster2
docker network rm mesh-net 2>/dev/null || true
rm -f /tmp/mesh-service-map.sh /tmp/istio-cluster1.yaml /tmp/istio-cluster2.yaml
  • I created two kind clusters simulating a multi-cloud mesh environment
  • I deployed the same service (backend) across both clusters
  • I verified local service communication works
  • I simulated a failover scenario by scaling down the local backend
  • I built a multi-cluster service map showing service overlap
  • I can explain the difference between Primary-Remote and Multi-Primary Istio
  • I can describe how a shared root CA enables cross-cluster mTLS

With services connected across clusters, it is time to manage the deployment lifecycle at enterprise scale. Head to Module 10.8: Enterprise GitOps & Platform Engineering to learn how Backstage, ArgoCD ApplicationSets, and multi-tenant repository strategies enable self-service platform engineering for large organizations.