Module 8.2: Hybrid Cloud Connectivity
Complexity: [ADVANCED] | Time: 60 minutes | Prerequisites: Datacenter Networking, Module 8.1: Multi-Site & Disaster Recovery
Why This Module Matters
A global retail company ran customer-facing applications on AWS EKS but kept inventory management on-premises due to latency requirements — warehouse scanners needed sub-5ms response times. For two years, their cloud and on-prem clusters operated as isolated islands with separate CI/CD pipelines, monitoring, service discovery, and network policies.
When a product launch required real-time inventory checks from the cloud storefront to the on-prem API, the team patched together public internet endpoints and manual firewall rules. It took six weeks and was fragile — response times varied 40-400ms. During Black Friday, a BGP route leak by an upstream ISP made the endpoint unreachable for 47 minutes. The storefront showed “out of stock” for items sitting in warehouses.
The company then invested three months in proper hybrid connectivity: a dedicated interconnect, WireGuard tunnels, Submariner for cross-cluster service discovery, and Istio for unified traffic management. The next Black Friday ran without incident. Inventory API latency was a consistent 8ms.
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement hybrid connectivity between on-premises and cloud Kubernetes clusters using dedicated interconnects and encrypted tunnels
- Configure Submariner or Cilium ClusterMesh for cross-cluster service discovery and pod-to-pod communication
- Design network architectures that provide consistent latency between on-premises and cloud workloads with proper failover
- Troubleshoot hybrid connectivity issues including BGP route leaks, tunnel MTU problems, and cross-cluster DNS resolution failures
What You’ll Learn
- VPN tunnel options for on-prem to cloud (WireGuard and IPsec)
- Dedicated interconnect services (Direct Connect, ExpressRoute, Cloud Interconnect)
- Submariner for multi-cluster Kubernetes networking
- Istio service mesh spanning cloud and on-prem clusters
- Consistent policy enforcement with OPA/Gatekeeper across environments
VPN Tunnels: WireGuard and IPsec
```
   On-Prem DC                           Cloud VPC
 ┌──────────────────┐                ┌──────────────────┐
 │ Pod CIDR:        │                │ Pod CIDR:        │
 │ 10.244.0.0/16    │   Encrypted    │ 10.100.0.0/16    │
 │                  │    Tunnel      │                  │
 │ ┌──────────────┐ │◄──────────────►│ ┌──────────────┐ │
 │ │ WireGuard GW │ │                │ │ WireGuard GW │ │
 │ │ 203.0.113.10 │ │                │ │ 198.51.100.5 │ │
 │ └──────────────┘ │                │ └──────────────┘ │
 └──────────────────┘                └──────────────────┘
```

| Factor | WireGuard | IPsec (IKEv2) |
|---|---|---|
| Code complexity | ~4,000 lines | ~400,000 lines |
| Performance | 1-3 Gbps per core | 0.5-1.5 Gbps per core |
| Latency overhead | ~0.5ms | ~1-2ms |
| Configuration | Simple (key pair, endpoint, allowed IPs) | Complex (certs, proposals, policies) |
| Cloud native support | Manual setup | Native (AWS/Azure VPN Gateway) |
| Key rotation | Automatic session rekey (~every 2 minutes) | Manual or via IKE rekey |
Pause and predict: WireGuard uses ~4,000 lines of code while IPsec uses ~400,000. Both encrypt traffic. Why would the smaller codebase matter for a security-critical component like a VPN tunnel?
WireGuard Configuration
This configuration creates an encrypted tunnel between the on-premises gateway and a cloud-side gateway. The AllowedIPs field acts as both an access control list and a routing table — only traffic destined for the specified CIDRs enters the tunnel.
```bash
# On the on-prem gateway node
apt-get install -y wireguard
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key

cat > /etc/wireguard/wg0.conf <<EOF
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = $(cat /etc/wireguard/private.key)
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -A FORWARD -o wg0 -j ACCEPT
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -D FORWARD -o wg0 -j ACCEPT

[Peer]
PublicKey = <CLOUD_GATEWAY_PUBLIC_KEY>
Endpoint = 198.51.100.5:51820
AllowedIPs = 10.100.0.0/16, 172.20.0.0/16
PersistentKeepalive = 25
EOF

systemctl enable --now wg-quick@wg0
```
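The cloud-side gateway needs a mirror-image configuration. A sketch under stated assumptions: the tunnel address 10.200.0.2/24, the placeholder public key, and the on-prem service CIDR 10.96.0.0/16 are illustrative; the endpoint 203.0.113.10 is the on-prem gateway address from the diagram above.

```bash
# On the cloud gateway node (mirror of the on-prem config).
# Assumes a key pair was generated here the same way as on-prem.
# <ONPREM_GATEWAY_PUBLIC_KEY> and the 10.96.0.0/16 service CIDR
# are placeholders -- substitute your real values.
cat > /etc/wireguard/wg0.conf <<EOF
[Interface]
Address = 10.200.0.2/24
ListenPort = 51820
PrivateKey = $(cat /etc/wireguard/private.key)

[Peer]
PublicKey = <ONPREM_GATEWAY_PUBLIC_KEY>
Endpoint = 203.0.113.10:51820
# Admit the on-prem pod and (assumed) service CIDRs into the tunnel
AllowedIPs = 10.244.0.0/16, 10.96.0.0/16
PersistentKeepalive = 25
EOF
systemctl enable --now wg-quick@wg0
```

Once both sides are up, `wg show` on either gateway should report a recent handshake for the peer.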
```bash
# Routes for cross-cluster communication. wg-quick installs these
# automatically from AllowedIPs; add them manually only if you
# manage wg0 without wg-quick.
ip route add 10.100.0.0/16 dev wg0
ip route add 172.20.0.0/16 dev wg0
```

Dedicated Interconnects
VPN tunnels run over the public internet. Dedicated interconnects provide private, low-latency connections.
```
  On-Prem DC       Colocation Meet-Me Room       Cloud Provider
 ┌────────┐        ┌───────────────┐             ┌──────────┐
 │ Router │───────►│ Cross Connect │────────────►│ Cloud    │
 └────────┘  Dark  └───────────────┘   Private   │ Router   │
             Fiber                     Peering   └──────────┘
                                      1-100 Gbps
```

| Feature | AWS Direct Connect | Azure ExpressRoute | GCP Cloud Interconnect |
|---|---|---|---|
| Bandwidth | 1, 10, 100 Gbps | 50 Mbps - 100 Gbps | 10, 100 Gbps |
| Latency | <5ms typical | <5ms typical | <5ms typical |
| Setup time | 2-4 weeks | 2-4 weeks | 1-3 weeks |
| Monthly cost (10G) | ~$2,200/port | ~$5,000/port | ~$1,700/port |
Use interconnect when: >1 Gbps sustained traffic, <5ms latency required, or compliance demands a private path. Use VPN for <100 Mbps, non-critical, or DR-only traffic.
Submariner: Multi-Cluster Networking
Submariner connects Kubernetes clusters so pods and services in one cluster can reach those in another, handling cross-cluster DNS, encrypted tunnels, and service discovery.
```
  Cluster A (On-Prem)                Cluster B (Cloud)
 ┌──────────────────────┐          ┌──────────────────────┐
 │ Gateway Engine       │ IPsec /  │ Gateway Engine       │
 │ (tunnel endpoint)   ◄┼─tunnel──►┼─(tunnel endpoint)    │
 │                      │          │                      │
 │ Lighthouse (DNS)    ◄┼─svc sync►┼─Lighthouse (DNS)     │
 │                      │          │                      │
 │ Pod: curl nginx.ns.  │          │ Pod: nginx (svc)     │
 │  svc.clusterset.local│          │                      │
 └──────────────────────┘          └──────────────────────┘
```

Stop and think: Submariner requires non-overlapping pod and service CIDRs between clusters. Both your on-prem and EKS clusters use the default 10.244.0.0/16 pod CIDR. What are your options, and which one avoids rebuilding either cluster?
Install Submariner
Submariner uses a broker (deployed on one cluster) for service discovery metadata exchange. Each cluster then joins the broker, establishing encrypted tunnels for pod-to-pod traffic and a Lighthouse DNS service for cross-cluster name resolution.
```bash
# Install subctl
curl -Ls https://get.submariner.io | VERSION=v0.18.0 bash

# Deploy broker and join clusters
kubectl config use-context on-prem-cluster
subctl deploy-broker
subctl join broker-info.subm --clusterid on-prem --natt=false --cable-driver libreswan

kubectl config use-context cloud-cluster
subctl join broker-info.subm --clusterid cloud --natt=true --cable-driver libreswan

# Export a service for cross-cluster access
subctl export service nginx-service -n production

# From the other cluster, reach it via:
# nginx-service.production.svc.clusterset.local
subctl show all
```

Requirements: non-overlapping pod/service CIDRs, gateway nodes with routable IPs, UDP ports 500 and 4500 open, supported CNIs (Calico, Flannel, Canal, OVN-Kubernetes).
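The non-overlap requirement is easy to check before joining clusters. A minimal bash sketch of an overlap test for two IPv4 CIDRs; the CIDR values at the bottom are illustrative:

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer
ip2int() {
  local IFS=. ; read -r a b c d <<< "$1"
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Return success (0) if the two CIDR ranges overlap
cidrs_overlap() {
  local net1=${1%/*} len1=${1#*/} net2=${2%/*} len2=${2#*/}
  local mask1=$(( 0xFFFFFFFF << (32 - len1) & 0xFFFFFFFF ))
  local mask2=$(( 0xFFFFFFFF << (32 - len2) & 0xFFFFFFFF ))
  local start1=$(( $(ip2int "$net1") & mask1 ))
  local start2=$(( $(ip2int "$net2") & mask2 ))
  local end1=$(( start1 | ~mask1 & 0xFFFFFFFF ))
  local end2=$(( start2 | ~mask2 & 0xFFFFFFFF ))
  (( start1 <= end2 && start2 <= end1 ))
}

cidrs_overlap 10.244.0.0/16 10.100.0.0/16   && echo overlap || echo ok   # ok
cidrs_overlap 10.244.0.0/16 10.244.128.0/17 && echo overlap || echo ok   # overlap
```

Run this over every pod and service CIDR pair across clusters before deployment; catching an overlap here is far cheaper than debugging it through a tunnel.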
Unified Service Mesh with Istio
Istio adds traffic management, observability, and mTLS security across clusters.
```
  On-Prem (Primary)                Cloud (Remote)
 ┌────────────────────┐          ┌────────────────────┐
 │ istiod             │─config──►│ istiod (remote)    │
 │ East-West GW      ◄┼──mTLS───►┼─East-West GW       │
 │ ┌────┐  ┌────┐     │          │ ┌────┐  ┌────┐     │
 │ │A v1│  │ B  │     │          │ │A v2│  │ C  │     │
 │ └────┘  └────┘     │          │ └────┘  └────┘     │
 └────────────────────┘          └────────────────────┘
   svc-A traffic: 80% on-prem (v1), 20% cloud (v2)
```

A shared root CA is required for cross-cluster mTLS. Without it, sidecars in different clusters cannot verify each other’s certificates and all cross-cluster traffic fails with 503 errors even though network connectivity works.
Pause and predict: Istio uses mTLS between sidecars in different clusters. Why does each cluster need a certificate derived from the same root CA? What symptom would you see if the root CAs were different?
Setting Up Multi-Cluster Istio
The shared root CA is the foundation of cross-cluster mTLS. Each cluster gets its own intermediate CA (derived from the shared root), so certificates can be validated across cluster boundaries.
```bash
# 1. Generate a shared root CA
mkdir -p certs
openssl req -new -x509 -nodes -days 3650 \
  -keyout certs/root-key.pem -out certs/root-cert.pem \
  -subj "/O=KubeDojo/CN=Root CA"

# 2. Create per-cluster intermediate CAs from the shared root
for CLUSTER in on-prem cloud; do
  openssl genrsa -out certs/${CLUSTER}-ca-key.pem 4096
  openssl req -new -key certs/${CLUSTER}-ca-key.pem \
    -out certs/${CLUSTER}-ca-csr.pem -subj "/O=KubeDojo/CN=${CLUSTER} CA"
  openssl x509 -req -days 3650 -CA certs/root-cert.pem -CAkey certs/root-key.pem \
    -set_serial "0x$(openssl rand -hex 8)" \
    -in certs/${CLUSTER}-ca-csr.pem -out certs/${CLUSTER}-ca-cert.pem
done

# 3. Install Istio on the primary cluster with the shared CA
kubectl create namespace istio-system
kubectl create secret generic cacerts -n istio-system \
  --from-file=ca-cert.pem=certs/on-prem-ca-cert.pem \
  --from-file=ca-key.pem=certs/on-prem-ca-key.pem \
  --from-file=root-cert.pem=certs/root-cert.pem \
  --from-file=cert-chain.pem=certs/on-prem-ca-cert.pem

istioctl install -y -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: kubedojo-mesh
      multiCluster:
        clusterName: on-prem
      network: on-prem-network
EOF
```

Cross-Cluster Traffic Routing
```yaml
# VirtualService for weighted routing between on-prem and cloud
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: svc-a
  namespace: production
spec:
  hosts:
  - svc-a.production.svc.cluster.local
  http:
  - route:
    - destination:
        host: svc-a.production.svc.cluster.local
        subset: on-prem
      weight: 80
    - destination:
        host: svc-a.production.svc.cluster.local
        subset: cloud
      weight: 20
```

Consistent Policy with OPA/Gatekeeper
When workloads span environments, policy drift is inevitable without enforcement.
```
  Git Repository (single source of truth)
  ├── no-privileged.yaml
  ├── allowed-registries.yaml
  └── require-resource-limits.yaml
              │
      ArgoCD syncs to both
        ┌─────┴─────┐
        ▼           ▼
     On-Prem      Cloud
    (identical)  (identical)
```

```yaml
# ConstraintTemplate: enforce allowed image registries
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedregistries
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sallowedregistries
      violation[{"msg": msg}] {
        container := input.review.object.spec.containers[_]
        not startswith(container.image, input.parameters.registries[_])
        msg := sprintf("Container '%v' uses image '%v' from unauthorized registry", [container.name, container.image])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
  name: allowed-registries
spec:
  enforcementAction: deny
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    registries:
    - "registry.internal.example.com/"
    - "gcr.io/distroless/"
    - "registry.k8s.io/"
```

Sync policies to all clusters via ArgoCD Applications pointing to the same Git repository.
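The Rego violation rule boils down to a prefix check on the image name. A quick local sanity test of the same logic, using the registry list from the constraint (the image names are illustrative):

```bash
# Mirror of the Rego rule's check: an image is allowed iff it starts
# with one of the configured registry prefixes.
allowed=("registry.internal.example.com/" "gcr.io/distroless/" "registry.k8s.io/")

image_allowed() {
  local image=$1 prefix
  for prefix in "${allowed[@]}"; do
    # Prefix match, same as Rego's startswith()
    [[ $image == "$prefix"* ]] && return 0
  done
  return 1
}

image_allowed "registry.k8s.io/pause:3.9"        && echo allow || echo deny   # allow
image_allowed "docker.io/library/nginx:1.27"     && echo allow || echo deny   # deny
```

Note the trailing slashes in the registry list: without them, a prefix like `registry.k8s.io` would also match `registry.k8s.io.evil.example.com/...`.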
Did You Know?
- WireGuard has been in the Linux kernel since 5.6 (March 2020). Linus Torvalds called it a “work of art” compared to IPsec. At ~4,000 lines of code versus IPsec’s ~400,000, its attack surface is dramatically smaller.
- AWS Direct Connect locations are not AWS datacenters. They are colocation facilities (Equinix, CoreSite). Your router connects to an AWS router via a physical fiber patch cable in a shared “meet-me room.”
- Submariner’s name references submarine cables connecting continents. Created by Rancher Labs (now SUSE), it is a CNCF Sandbox project supporting both IPsec and WireGuard as cable drivers.
- Istio’s locality-aware load balancing prefers local endpoints over remote ones automatically, reducing cross-cluster traffic by 60-80% in typical deployments.
Common Mistakes
| Mistake | Why It Happens | What To Do Instead |
|---|---|---|
| Overlapping pod CIDRs | Default CNIs use 10.244.0.0/16 | Plan unique CIDRs per cluster before deployment |
| Single VPN gateway | “We’ll add HA later” | Deploy gateways in active-passive pairs from day one |
| Ignoring MTU in tunnels | Encapsulation adds 50-70 bytes | Set MTU to 1400 on tunnel interfaces |
| No encryption between clusters | “Private network” | Always encrypt; even private networks can be compromised |
| No shared root CA for Istio | Each cluster auto-generates its own | Create shared root CA before installing Istio |
| Manual per-cluster policies | “Only two clusters” | Use GitOps; drift begins with the first manual change |
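The MTU value in the table can be derived rather than memorized. A sketch of the arithmetic for WireGuard over IPv4 (the overhead figures are the standard header sizes, stated here as assumptions rather than taken from this module):

```bash
# WireGuard-over-IPv4 encapsulation overhead per packet:
#   20 (outer IPv4) + 8 (UDP) + 32 (WireGuard header + auth tag) = 60 bytes
LINK_MTU=1500
OVERHEAD=$((20 + 8 + 32))
TUNNEL_MTU=$((LINK_MTU - OVERHEAD))
echo "max tunnel MTU: $TUNNEL_MTU"   # 1440; 1400 leaves headroom for PPPoE/IPv6 paths

# Apply (requires root):
#   ip link set dev wg0 mtu 1400
# Verify the path end-to-end with DF set (1372 = 1400 - 28 bytes ICMP/IP):
#   ping -M do -s 1372 10.200.0.2
```

If the oversized ping fails while smaller ones succeed, something on the path is dropping fragments and the interface MTU needs to come down further.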
Question 1
Your on-premises Kubernetes cluster uses pod CIDR 10.244.0.0/16. Your EKS cluster also uses the default 10.244.0.0/16. You connect them via WireGuard and developers report that cross-cluster service calls randomly fail. What is happening and how do you fix it?
Answer
The CIDR overlap causes routing ambiguity. When a pod on the on-prem cluster sends traffic to 10.244.50.3 (intending to reach a pod on the EKS cluster), the local routing table matches it to the local pod CIDR and routes it locally — it never enters the WireGuard tunnel. The same happens in reverse. Cross-cluster traffic is essentially impossible because both clusters claim ownership of the same IP range.
Fix options (in order of preference):
- Rebuild one cluster with a different CIDR (e.g., 10.100.0.0/16 for EKS). This is the cleanest solution but requires recreating the cluster and migrating workloads. For EKS, this means creating a new cluster with `--kubernetes-network-config serviceIpv4Cidr` and a custom VPC CNI configuration.
- Use Submariner with Globalnet, which assigns virtual global IPs from a non-overlapping range (e.g., 242.0.0.0/8). Submariner handles the NAT transparently, and cross-cluster DNS resolves to global IPs. This avoids rebuilding either cluster but adds complexity.
- NAT at the gateway (fragile, last resort). Configure SNAT/DNAT rules on the WireGuard gateways to translate pod IPs. This breaks source IP visibility, complicates network policy enforcement, and is operationally painful to maintain.
Prevention: Always plan unique pod and service CIDRs across all clusters before deployment. Document them in a central IPAM registry.
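The “local route wins” behavior is plain longest-prefix matching. A self-contained sketch that simulates the on-prem node’s route table (the table entries here are illustrative; on a real node they come from `ip route`):

```bash
# Simulated route table: the local CNI owns 10.244.0.0/16, so any
# destination in that range matches the local route and never enters
# the tunnel -- which is exactly the overlap failure mode.
route_for() {  # route_for <dst-ip> -> prints the matching device
  local dst=$1 best_len=-1 best_dev=none
  local routes=("10.244.0.0/16 cni0" "10.100.0.0/16 wg0")
  local IFS=. ; read -r a b c d <<< "$dst"
  local ip=$(( (a<<24)+(b<<16)+(c<<8)+d ))
  local r
  for r in "${routes[@]}"; do
    local net=${r%%/*} rest=${r#*/}
    local len=${rest%% *} dev=${rest#* }
    IFS=. read -r a b c d <<< "$net"
    local netint=$(( (a<<24)+(b<<16)+(c<<8)+d ))
    local mask=$(( len ? 0xFFFFFFFF << (32-len) & 0xFFFFFFFF : 0 ))
    if (( (ip & mask) == netint && len > best_len )); then
      best_len=$len; best_dev=$dev
    fi
  done
  echo "$best_dev"
}

route_for 10.244.50.3   # cni0 -- stays local, never reaches the remote cluster
route_for 10.100.7.9    # wg0
```

On a real gateway, `ip route get 10.244.50.3` shows the same decision the kernel would make.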
Question 2
Your on-premises to cloud VPN tunnel has 50ms RTT and 200 Mbps bandwidth. The database team wants to set up PostgreSQL streaming replication from the on-premises primary to a cloud replica for disaster recovery. What concerns should you raise, and what would you recommend instead?
Answer
Three critical concerns:
- Bandwidth saturation: A write-heavy PostgreSQL database generating 50-100 MB/s of WAL (Write-Ahead Log) data would consume 400-800 Mbps — far exceeding the 200 Mbps tunnel capacity. Replication lag would grow unbounded until the tunnel is upgraded or write volume decreases. This means the DR replica is perpetually behind, defeating the purpose.
- Latency impact on synchronous replication: Synchronous replication adds the full 50ms RTT to every transaction commit. For a workload doing 1,000 transactions/second, this adds 50 seconds of cumulative latency per second — transactions would queue up, causing application timeouts. Synchronous replication at 50ms RTT is impractical for any write-intensive workload.
- VPN reliability: VPN tunnels over the public internet have variable latency (50ms average but 200ms+ during congestion). Reconnections after tunnel drops cause replication lag spikes and potentially require WAL replay to catch up.
Recommendations: Upgrade to a Direct Connect or ExpressRoute (1-10 Gbps, <5ms latency) if synchronous replication is needed. If budget does not allow a dedicated interconnect, use asynchronous replication (accepting RPO of seconds to minutes) or consider logical replication (lower bandwidth, replicates only specific tables).
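The bandwidth concern is simple arithmetic. A sketch using the low-end numbers from the answer above (the WAL rate is illustrative):

```bash
# WAL volume vs. tunnel capacity
WAL_MB_PER_S=50                      # write-heavy primary, low end of 50-100 MB/s
TUNNEL_MBPS=200                      # VPN tunnel capacity, in megabits/s
NEEDED_MBPS=$((WAL_MB_PER_S * 8))    # megabytes/s -> megabits/s
echo "needed: ${NEEDED_MBPS} Mbps vs tunnel: ${TUNNEL_MBPS} Mbps"

# Un-shipped WAL accumulates at (needed - capacity) / 8 megabytes per second
echo "backlog growth: $(( (NEEDED_MBPS - TUNNEL_MBPS) / 8 )) MB/s"
```

Even at the low end, the link would need to carry twice its capacity, and the backlog grows by tens of megabytes every second the deficit persists.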
Question 3
Submariner is deployed between your on-premises and cloud clusters. A developer runs `curl nginx.production.svc.clusterset.local` from a pod on the on-premises cluster and gets a DNS resolution error. The nginx service is running fine on the cloud cluster. Walk through your debugging process.
Answer
Systematic debugging from network layer up to DNS:
- Check Submariner components are Running: `kubectl get pods -n submariner-operator`. If the gateway engine or Lighthouse pods are in CrashLoopBackOff, the tunnel or DNS integration is broken.
- Verify ServiceExport and ServiceImport: On the cloud cluster, check `kubectl get serviceexport nginx -n production`. On the on-premises cluster, check `kubectl get serviceimport -n submariner-operator`. If the ServiceImport does not exist, Submariner has not synced the service metadata across clusters.
- Check Lighthouse DNS integration: Verify the CoreDNS configmap includes the Lighthouse plugin: `kubectl get cm coredns -n kube-system -o yaml | grep lighthouse`. If missing, Lighthouse did not inject itself into the CoreDNS configuration.
- Check tunnel connectivity: Run `subctl show connections` — the status should show “connected” for the remote cluster. If “connecting” or “error,” check firewall rules for UDP ports 500 and 4500 (IPsec) or the WireGuard port.
- Test DNS directly: `kubectl exec -it test-pod -- nslookup nginx.production.svc.clusterset.local`. If this returns NXDOMAIN, the issue is DNS. If it resolves but curl fails, the issue is network connectivity through the tunnel.
- Check for CIDR overlap: If Globalnet is not enabled and pod CIDRs overlap, traffic cannot be routed correctly even if the tunnel is up.
Question 4
Why is a shared root CA necessary for Istio multi-cluster?
Answer
Istio uses mTLS between all sidecars. Cross-cluster, sidecar A presents a cert signed by Cluster A’s CA. Sidecar B must verify that cert. Without a shared root CA, Cluster B does not trust Cluster A’s CA, so the TLS handshake fails. Symptoms: ping works but Istio services return 503. Fix: generate one root CA, derive per-cluster intermediate CAs, distribute root-cert.pem to all clusters before installing Istio.
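You can confirm the trust relationship offline with openssl. A sketch that builds a root and two intermediates the same way the setup section does, then shows each intermediate verifying against the shared root (directory and file names are illustrative):

```bash
dir=$(mktemp -d)

# Shared root CA
openssl req -new -x509 -nodes -days 3650 \
  -keyout "$dir/root-key.pem" -out "$dir/root-cert.pem" \
  -subj "/O=KubeDojo/CN=Root CA"

# Two per-cluster intermediates signed by the same root
for c in on-prem cloud; do
  openssl genrsa -out "$dir/$c-ca-key.pem" 2048
  openssl req -new -key "$dir/$c-ca-key.pem" \
    -out "$dir/$c-ca-csr.pem" -subj "/O=KubeDojo/CN=$c CA"
  openssl x509 -req -days 3650 -CA "$dir/root-cert.pem" \
    -CAkey "$dir/root-key.pem" -set_serial "0x$(openssl rand -hex 8)" \
    -in "$dir/$c-ca-csr.pem" -out "$dir/$c-ca-cert.pem"
done

# Both intermediates chain to the one root, so a workload cert issued
# in either cluster can be validated in the other:
openssl verify -CAfile "$dir/root-cert.pem" "$dir/on-prem-ca-cert.pem"
openssl verify -CAfile "$dir/root-cert.pem" "$dir/cloud-ca-cert.pem"
```

If the clusters had been installed with independently generated roots, the same `openssl verify` calls would fail, which is the offline equivalent of the cross-cluster 503s.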
Hands-On Exercise: Cross-Cluster Service Discovery
Objective: Connect two kind clusters with Submariner and access a service across clusters.
```bash
# 1. Create clusters with unique CIDRs
cat <<EOF | kind create cluster --name cluster-a --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.10.0.0/16"
  serviceSubnet: "10.110.0.0/16"
nodes:
- role: control-plane
- role: worker
EOF

cat <<EOF | kind create cluster --name cluster-b --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.20.0.0/16"
  serviceSubnet: "10.120.0.0/16"
nodes:
- role: control-plane
- role: worker
EOF

# 2. Deploy Submariner
curl -Ls https://get.submariner.io | VERSION=v0.18.0 bash
kubectl config use-context kind-cluster-a
subctl deploy-broker
subctl join broker-info.subm --clusterid cluster-a --natt=false
kubectl config use-context kind-cluster-b
subctl join broker-info.subm --clusterid cluster-b --natt=false

# 3. Deploy and export a service on cluster-b
kubectl create namespace web
kubectl create deployment nginx --image=nginx:1.27 -n web --replicas=2
kubectl expose deployment nginx -n web --port=80
subctl export service nginx -n web

# 4. Test from cluster-a
kubectl config use-context kind-cluster-a
kubectl run test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://nginx.web.svc.clusterset.local
```

Success Criteria
- Two kind clusters with non-overlapping CIDRs
- Submariner broker deployed and both clusters joined
- nginx service exported from cluster-b
- curl from cluster-a reaches nginx on cluster-b
- `subctl show connections` shows “connected”
Next Module
Continue to Module 8.3: Cloud Repatriation & Migration to learn how to move workloads from cloud to on-premises, translating cloud services to their on-prem equivalents.