
Module 8.2: Hybrid Cloud Connectivity

Complexity: [ADVANCED] | Time: 60 minutes

Prerequisites: Datacenter Networking, Module 8.1: Multi-Site & Disaster Recovery


A global retail company ran customer-facing applications on AWS EKS but kept inventory management on-premises due to latency requirements — warehouse scanners needed sub-5ms response times. For two years, their cloud and on-prem clusters operated as isolated islands with separate CI/CD pipelines, monitoring, service discovery, and network policies.

When a product launch required real-time inventory checks from the cloud storefront to the on-prem API, the team patched together public internet endpoints and manual firewall rules. It took six weeks and was fragile — response times varied from 40ms to 400ms. During Black Friday, a BGP route leak by an upstream ISP made the endpoint unreachable for 47 minutes. The storefront showed “out of stock” for items sitting in warehouses.

The company then invested three months in proper hybrid connectivity: a dedicated interconnect, WireGuard tunnels, Submariner for cross-cluster service discovery, and Istio for unified traffic management. The next Black Friday ran without incident. Inventory API latency was a consistent 8ms.


After completing this module, you will be able to:

  1. Implement hybrid connectivity between on-premises and cloud Kubernetes clusters using dedicated interconnects and encrypted tunnels
  2. Configure Submariner or Cilium ClusterMesh for cross-cluster service discovery and pod-to-pod communication
  3. Design network architectures that provide consistent latency between on-premises and cloud workloads with proper failover
  4. Troubleshoot hybrid connectivity issues including BGP route leaks, tunnel MTU problems, and cross-cluster DNS resolution failures

This module covers:

  • VPN tunnel options for on-prem to cloud (WireGuard and IPsec)
  • Dedicated interconnect services (Direct Connect, ExpressRoute, Cloud Interconnect)
  • Submariner for multi-cluster Kubernetes networking
  • Istio service mesh spanning cloud and on-prem clusters
  • Consistent policy enforcement with OPA/Gatekeeper across environments

    On-Prem DC                           Cloud VPC
┌──────────────────┐                 ┌──────────────────┐
│ Pod CIDR:        │                 │ Pod CIDR:        │
│ 10.244.0.0/16    │    Encrypted    │ 10.100.0.0/16    │
│                  │     Tunnel      │                  │
│ ┌──────────────┐ │◄───────────────►│ ┌──────────────┐ │
│ │ WireGuard GW │ │                 │ │ WireGuard GW │ │
│ │ 203.0.113.10 │ │                 │ │ 198.51.100.5 │ │
│ └──────────────┘ │                 │ └──────────────┘ │
└──────────────────┘                 └──────────────────┘
| Factor | WireGuard | IPsec (IKEv2) |
|---|---|---|
| Code complexity | ~4,000 lines | ~400,000 lines |
| Performance | 1-3 Gbps per core | 0.5-1.5 Gbps per core |
| Latency overhead | ~0.5ms | ~1-2ms |
| Configuration | Simple (key pair, endpoint, allowed IPs) | Complex (certs, proposals, policies) |
| Cloud native support | Manual setup | Native (AWS/Azure VPN Gateway) |
| Key rotation | Built-in (every 2 minutes) | Manual or via IKE rekey |

Pause and predict: WireGuard uses ~4,000 lines of code while IPsec uses ~400,000. Both encrypt traffic. Why would the smaller codebase matter for a security-critical component like a VPN tunnel?

This configuration creates an encrypted tunnel between the on-premises gateway and a cloud-side gateway. The AllowedIPs field acts as both an access control list and a routing table — only traffic destined for the specified CIDRs enters the tunnel.

# On the on-prem gateway node
apt-get install -y wireguard
wg genkey | tee /etc/wireguard/private.key | wg pubkey > /etc/wireguard/public.key
cat > /etc/wireguard/wg0.conf <<EOF
[Interface]
Address = 10.200.0.1/24
ListenPort = 51820
PrivateKey = $(cat /etc/wireguard/private.key)
PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -A FORWARD -o wg0 -j ACCEPT
PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -D FORWARD -o wg0 -j ACCEPT
[Peer]
PublicKey = <CLOUD_GATEWAY_PUBLIC_KEY>
Endpoint = 198.51.100.5:51820
AllowedIPs = 10.100.0.0/16, 172.20.0.0/16
PersistentKeepalive = 25
EOF
systemctl enable --now wg-quick@wg0
# Routes for cross-cluster communication: wg-quick already installs a route
# for each AllowedIPs entry; add them manually only if you manage the
# interface without wg-quick. Point them at the tunnel device, not the
# local tunnel address:
ip route add 10.100.0.0/16 dev wg0
ip route add 172.20.0.0/16 dev wg0
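The tunnel only comes up once the cloud gateway carries a mirror-image configuration. The sketch below is an assumption about what that peer config would look like: the placeholder keys are illustrative, the on-prem service CIDR is assumed to be the kubeadm default 10.96.0.0/12, and the file is written to /tmp here purely for illustration (on a real gateway it would be /etc/wireguard/wg0.conf).

```shell
# Hypothetical cloud-side wg0.conf, mirroring the on-prem config:
# AllowedIPs lists the on-prem pod CIDR, the assumed on-prem service CIDR,
# and the tunnel subnet. PersistentKeepalive is usually only needed on the
# NAT'd side, so it is omitted here.
cat > /tmp/wg0-cloud.conf <<'EOF'
[Interface]
Address = 10.200.0.2/24
ListenPort = 51820
PrivateKey = <CLOUD_GATEWAY_PRIVATE_KEY>

[Peer]
PublicKey = <ONPREM_GATEWAY_PUBLIC_KEY>
Endpoint = 203.0.113.10:51820
AllowedIPs = 10.244.0.0/16, 10.96.0.0/12, 10.200.0.0/24
EOF
```

Note the symmetry: each side's AllowedIPs names the CIDRs that live on the *other* side of the tunnel, which is why the two configs are mirror images rather than copies.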

VPN tunnels run over the public internet. Dedicated interconnects provide private, low-latency connections.

 On-Prem DC        Colocation Meet-Me Room       Cloud Provider
┌────────┐         ┌──────────────┐              ┌──────────┐
│ Router │────────►│ Cross Connect│─────────────►│  Cloud   │
└────────┘  Dark   └──────────────┘   Private    │  Router  │
            Fiber                     Peering    └──────────┘
                                      1-100 Gbps
| Feature | AWS Direct Connect | Azure ExpressRoute | GCP Cloud Interconnect |
|---|---|---|---|
| Bandwidth | 1, 10, 100 Gbps | 50 Mbps - 100 Gbps | 10, 100 Gbps |
| Latency | <5ms typical | <5ms typical | <5ms typical |
| Setup time | 2-4 weeks | 2-4 weeks | 1-3 weeks |
| Monthly cost (10G) | ~$2,200/port | ~$5,000/port | ~$1,700/port |

Use interconnect when: >1 Gbps sustained traffic, <5ms latency required, or compliance demands a private path. Use VPN for <100 Mbps, non-critical, or DR-only traffic.


Submariner connects Kubernetes clusters so pods and services in one cluster can reach those in another, handling cross-cluster DNS, encrypted tunnels, and service discovery.

 Cluster A (On-Prem)                Cluster B (Cloud)
┌──────────────────────┐           ┌──────────────────────┐
│ Gateway Engine       │  IPsec /  │ Gateway Engine       │
│ (tunnel endpoint)   ◄┼──tunnel──►┼─ (tunnel endpoint)   │
│                      │           │                      │
│ Lighthouse (DNS)    ◄┼─svc sync─►┼─ Lighthouse (DNS)    │
│                      │           │                      │
│ Pod: curl nginx.ns.  │           │ Pod: nginx (svc)     │
│ svc.clusterset.local │           │                      │
└──────────────────────┘           └──────────────────────┘

Stop and think: Submariner requires non-overlapping pod and service CIDRs between clusters. Both your on-prem and EKS clusters use the default 10.244.0.0/16 pod CIDR. What are your options, and which one avoids rebuilding either cluster?

Submariner uses a broker (deployed on one cluster) for service discovery metadata exchange. Each cluster then joins the broker, establishing encrypted tunnels for pod-to-pod traffic and a Lighthouse DNS service for cross-cluster name resolution.

# Install subctl
curl -Ls https://get.submariner.io | VERSION=v0.18.0 bash
# Deploy broker and join clusters
kubectl config use-context on-prem-cluster
subctl deploy-broker
subctl join broker-info.subm --clusterid on-prem --natt=false --cable-driver libreswan
kubectl config use-context cloud-cluster
subctl join broker-info.subm --clusterid cloud --natt=true --cable-driver libreswan
# Export a service for cross-cluster access
subctl export service nginx-service -n production
# From the other cluster, reach it via:
# nginx-service.production.svc.clusterset.local
subctl show all

Requirements: non-overlapping pod/service CIDRs, gateway nodes with routable IPs, UDP ports 500 and 4500 open, supported CNIs (Calico, Flannel, Canal, OVN-Kubernetes).


Istio adds traffic management, observability, and mTLS security across clusters.

 On-Prem (Primary)               Cloud (Remote)
┌────────────────────┐          ┌────────────────────┐
│ istiod             │─config──►│ istiod (remote)    │
│ East-West GW      ◄┼──mTLS───►┼─ East-West GW      │
│ ┌────┐  ┌────┐     │          │ ┌────┐  ┌────┐     │
│ │A v1│  │ B  │     │          │ │A v2│  │ C  │     │
│ └────┘  └────┘     │          │ └────┘  └────┘     │
└────────────────────┘          └────────────────────┘
svc-A traffic: 80% on-prem (v1), 20% cloud (v2)

Pause and predict: Istio uses mTLS between sidecars in different clusters. Why does each cluster need a certificate derived from the same root CA? What symptom would you see if the root CAs were different?

The shared root CA is the foundation of cross-cluster mTLS. Each cluster gets its own intermediate CA (derived from the shared root), so certificates can be validated across cluster boundaries. Without a shared root, sidecars in different clusters cannot verify each other’s certificates, and all cross-cluster traffic fails with 503 errors even though network connectivity works.

# 1. Generate a shared root CA
mkdir -p certs
openssl req -new -x509 -nodes -days 3650 \
  -keyout certs/root-key.pem -out certs/root-cert.pem \
  -subj "/O=KubeDojo/CN=Root CA"
# 2. Create per-cluster intermediate CAs from the shared root.
#    The CA:TRUE basic constraint is required: without it the intermediate
#    cannot sign workload certificates that validate.
for CLUSTER in on-prem cloud; do
  openssl genrsa -out certs/${CLUSTER}-ca-key.pem 4096
  openssl req -new -key certs/${CLUSTER}-ca-key.pem \
    -out certs/${CLUSTER}-ca-csr.pem -subj "/O=KubeDojo/CN=${CLUSTER} CA"
  openssl x509 -req -days 3650 -CA certs/root-cert.pem -CAkey certs/root-key.pem \
    -set_serial "0x$(openssl rand -hex 8)" \
    -extfile <(printf 'basicConstraints=critical,CA:TRUE\nkeyUsage=critical,keyCertSign,cRLSign') \
    -in certs/${CLUSTER}-ca-csr.pem -out certs/${CLUSTER}-ca-cert.pem
done
# 3. Install Istio on the primary cluster with the shared CA.
#    cert-chain.pem is the intermediate certificate followed by the root.
cat certs/on-prem-ca-cert.pem certs/root-cert.pem > certs/on-prem-cert-chain.pem
kubectl create namespace istio-system
kubectl create secret generic cacerts -n istio-system \
  --from-file=ca-cert.pem=certs/on-prem-ca-cert.pem \
  --from-file=ca-key.pem=certs/on-prem-ca-key.pem \
  --from-file=root-cert.pem=certs/root-cert.pem \
  --from-file=cert-chain.pem=certs/on-prem-cert-chain.pem
istioctl install -y -f - <<EOF
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: kubedojo-mesh
      multiCluster:
        clusterName: on-prem
      network: on-prem-network
EOF
# VirtualService for weighted routing between on-prem and cloud
# (requires a DestinationRule defining the on-prem and cloud subsets)
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: svc-a
  namespace: production
spec:
  hosts:
    - svc-a.production.svc.cluster.local
  http:
    - route:
        - destination:
            host: svc-a.production.svc.cluster.local
            subset: on-prem
          weight: 80
        - destination:
            host: svc-a.production.svc.cluster.local
            subset: cloud
          weight: 20
EOF
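The subsets named by the VirtualService (on-prem, cloud) are not defined anywhere in it; a DestinationRule must declare them. A sketch along these lines would work, assuming the clusters were joined with the names on-prem and cloud (Istio attaches a topology.istio.io/cluster label to each endpoint, which subsets can match on):

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: svc-a
  namespace: production
spec:
  host: svc-a.production.svc.cluster.local
  subsets:
    # Each subset matches the endpoints contributed by one cluster
    - name: on-prem
      labels:
        topology.istio.io/cluster: on-prem
    - name: cloud
      labels:
        topology.istio.io/cluster: cloud
```

Without this, the VirtualService routes to subsets that resolve to no endpoints and requests fail with 503s.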

When workloads span environments, policy drift is inevitable without enforcement.

Git Repository (single source of truth)
├── no-privileged.yaml
├── allowed-registries.yaml
└── require-resource-limits.yaml
              │
      ArgoCD syncs to both
          ┌───┴───┐
          ▼       ▼
      On-Prem   Cloud
    (identical) (identical)
# ConstraintTemplate: enforce allowed image registries
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedregistries
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedregistries

        # Rego does not allow the wildcard `_` inside negation, so the
        # registry check lives in a helper rule that is then negated.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not image_allowed(container.image)
          msg := sprintf("Container '%v' uses image '%v' from unauthorized registry",
            [container.name, container.image])
        }

        image_allowed(image) {
          startswith(image, input.parameters.registries[_])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
  name: allowed-registries
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    registries:
      - "registry.internal.example.com/"
      - "gcr.io/distroless/"
      - "registry.k8s.io/"
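To see the constraint in action, a Pod pulling from a registry outside the allowed list is rejected at admission. The manifest below is an illustrative example (the name and image are arbitrary):

```yaml
# Denied by the allowed-registries constraint: docker.io is not in the list.
apiVersion: v1
kind: Pod
metadata:
  name: bad-registry-demo
spec:
  containers:
    - name: app
      image: docker.io/library/nginx:1.27
```

Applying it should return a denial whose message comes straight from the sprintf in the template, which makes violations easy to grep for in admission logs.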

Sync policies to all clusters via ArgoCD Applications pointing to the same Git repository.


  1. WireGuard has been in the Linux kernel since 5.6 (March 2020). Linus Torvalds called it a “work of art” compared to IPsec. At ~4,000 lines of code versus IPsec’s ~400,000, its attack surface is dramatically smaller.

  2. AWS Direct Connect locations are not AWS datacenters. They are colocation facilities (Equinix, CoreSite). Your router connects to an AWS router via a physical fiber patch cable in a shared “meet-me room.”

  3. Submariner’s name references submarine cables connecting continents. Created by Rancher Labs (now SUSE), it is a CNCF Sandbox project supporting both IPsec and WireGuard as cable drivers.

  4. Istio’s locality-aware load balancing prefers local endpoints over remote ones automatically, reducing cross-cluster traffic by 60-80% in typical deployments.


| Mistake | Why It Happens | What To Do Instead |
|---|---|---|
| Overlapping pod CIDRs | Default CNIs use 10.244.0.0/16 | Plan unique CIDRs per cluster before deployment |
| Single VPN gateway | “We’ll add HA later” | Deploy gateways in active-passive pairs from day one |
| Ignoring MTU in tunnels | Encapsulation adds 50-70 bytes | Set MTU to 1400 on tunnel interfaces |
| No encryption between clusters | “Private network” | Always encrypt; even private networks can be compromised |
| No shared root CA for Istio | Each cluster auto-generates its own | Create shared root CA before installing Istio |
| Manual per-cluster policies | “Only two clusters” | Use GitOps; drift begins with the first manual change |
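The 1400-byte MTU guidance can be derived rather than memorized. For WireGuard over IPv4 the overhead is fixed; the arithmetic below is a sketch (IPsec overhead varies with cipher and mode, which is where the 50-70 byte range comes from):

```shell
# WireGuard encapsulation overhead on an IPv4 underlay:
#   20 bytes outer IPv4 header
#    8 bytes UDP header
#   16 bytes WireGuard data header (type + receiver index + counter)
#   16 bytes Poly1305 authentication tag
OUTER_IP=20; UDP=8; WG_HDR=16; TAG=16
OVERHEAD=$((OUTER_IP + UDP + WG_HDR + TAG))
echo "overhead: $OVERHEAD bytes"            # 60
echo "max tunnel MTU: $((1500 - OVERHEAD))" # 1440
```

1440 is the ceiling for an IPv4 underlay; setting 1400 leaves headroom for an IPv6 outer header or additional encapsulation, which is why it is the safe blanket value.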

Your on-premises Kubernetes cluster uses pod CIDR 10.244.0.0/16. Your EKS cluster also uses the default 10.244.0.0/16. You connect them via WireGuard and developers report that cross-cluster service calls randomly fail. What is happening and how do you fix it?

Answer

The CIDR overlap causes routing ambiguity. When a pod on the on-prem cluster sends traffic to 10.244.50.3 (intending to reach a pod on the EKS cluster), the local routing table matches it to the local pod CIDR and routes it locally — it never enters the WireGuard tunnel. The same happens in reverse. Cross-cluster traffic is essentially impossible because both clusters claim ownership of the same IP range.

Fix options (in order of preference):

  1. Rebuild one cluster with a different CIDR (e.g., 10.100.0.0/16 for EKS). This is the cleanest solution but requires recreating the cluster and migrating workloads. For EKS, this means creating a new cluster with --kubernetes-network-config serviceIpv4Cidr and a custom VPC CNI configuration.

  2. Use Submariner with Globalnet, which assigns virtual global IPs from a non-overlapping range (e.g., 242.0.0.0/8). Submariner handles the NAT transparently, and cross-cluster DNS resolves to global IPs. This avoids rebuilding either cluster but adds complexity.

  3. NAT at the gateway (fragile, last resort). Configure SNAT/DNAT rules on the WireGuard gateways to translate pod IPs. This breaks source IP visibility, complicates network policy enforcement, and is operationally painful to maintain.

Prevention: Always plan unique pod and service CIDRs across all clusters before deployment. Document them in a central IPAM registry.

Your on-premises to cloud VPN tunnel has 50ms RTT and 200 Mbps bandwidth. The database team wants to set up PostgreSQL streaming replication from the on-premises primary to a cloud replica for disaster recovery. What concerns should you raise, and what would you recommend instead?

Answer

Three critical concerns:

  1. Bandwidth saturation: A write-heavy PostgreSQL database generating 50-100 MB/s of WAL (Write-Ahead Log) data would consume 400-800 Mbps — far exceeding the 200 Mbps tunnel capacity. Replication lag would grow unbounded until the tunnel is upgraded or write volume decreases. This means the DR replica is perpetually behind, defeating the purpose.

  2. Latency impact on synchronous replication: Synchronous replication adds the full 50ms RTT to every transaction commit, so a single synchronous session cannot exceed ~20 commits/second. Sustaining 1,000 transactions/second would require at least 50 concurrent sessions; in practice transactions queue up and applications time out. Synchronous replication at 50ms RTT is impractical for any write-intensive workload.

  3. VPN reliability: VPN tunnels over the public internet have variable latency (50ms average but 200ms+ during congestion). Reconnections after tunnel drops cause replication lag spikes and potentially require WAL replay to catch up.
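The bandwidth and commit-rate numbers in concerns 1 and 2 follow from quick arithmetic, sketched here with the figures from the scenario:

```shell
# Concern 1: WAL bandwidth vs tunnel capacity (MB/s -> Mbps is a factor of 8)
WAL_MBPS=100                         # 100 MB/s of WAL at the high end
echo "WAL needs $((WAL_MBPS * 8)) Mbps vs a 200 Mbps tunnel"

# Concern 2: one synchronous session commits at most once per RTT
RTT_MS=50
PER_SESSION=$((1000 / RTT_MS))       # 20 commits/s per session
echo "sessions needed for 1000 tx/s: $((1000 / PER_SESSION))"
```

The same arithmetic explains the recommendation: a 10 Gbps interconnect at <5ms RTT raises the per-session ceiling to ~200 commits/second and gives 12x headroom over the peak WAL rate.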

Recommendations: Upgrade to a Direct Connect or ExpressRoute (1-10 Gbps, <5ms latency) if synchronous replication is needed. If budget does not allow a dedicated interconnect, use asynchronous replication (accepting RPO of seconds to minutes) or consider logical replication (lower bandwidth, replicates only specific tables).

Submariner is deployed between your on-premises and cloud clusters. A developer runs curl nginx.production.svc.clusterset.local from a pod on the on-premises cluster and gets a DNS resolution error. The nginx service is running fine on the cloud cluster. Walk through your debugging process.

Answer

Systematic debugging from network layer up to DNS:

  1. Check Submariner components are Running: kubectl get pods -n submariner-operator. If the gateway engine or Lighthouse pods are in CrashLoopBackOff, the tunnel or DNS integration is broken.

  2. Verify ServiceExport and ServiceImport: On the cloud cluster, check kubectl get serviceexport nginx -n production. On the on-premises cluster, check kubectl get serviceimport -n submariner-operator. If the ServiceImport does not exist, Submariner has not synced the service metadata across clusters.

  3. Check Lighthouse DNS integration: Verify the CoreDNS configmap includes the Lighthouse plugin: kubectl get cm coredns -n kube-system -o yaml | grep lighthouse. If missing, Lighthouse did not inject itself into CoreDNS configuration.

  4. Check tunnel connectivity: Run subctl show connections — the status should show “connected” for the remote cluster. If “connecting” or “error,” check firewall rules for UDP ports 500 and 4500 (IPsec) or the WireGuard port.

  5. Test DNS directly: kubectl exec -it test-pod -- nslookup nginx.production.svc.clusterset.local. If this returns NXDOMAIN, the issue is DNS. If it resolves but curl fails, the issue is network connectivity through the tunnel.

  6. Check for CIDR overlap: If Globalnet is not enabled and pod CIDRs overlap, traffic cannot be routed correctly even if the tunnel is up.

Why is a shared root CA necessary for Istio multi-cluster?

Answer

Istio uses mTLS between all sidecars. Cross-cluster, sidecar A presents a cert signed by Cluster A’s CA. Sidecar B must verify that cert. Without a shared root CA, Cluster B does not trust Cluster A’s CA, so the TLS handshake fails. Symptoms: ping works but Istio services return 503. Fix: generate one root CA, derive per-cluster intermediate CAs, distribute root-cert.pem to all clusters before installing Istio.
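The trust relationship can be demonstrated with openssl alone. This is a self-contained sketch using throwaway names under a temp directory; RSA-2048 keys and 30-day validity are arbitrary choices for the demo:

```shell
set -e
DIR=$(mktemp -d)
# One shared root CA
openssl req -new -x509 -nodes -days 30 -newkey rsa:2048 \
  -keyout "$DIR/root.key" -out "$DIR/root.pem" -subj "/O=Demo/CN=Root CA"
# Two "per-cluster" intermediates signed by that root
for c in on-prem cloud; do
  openssl req -new -nodes -newkey rsa:2048 \
    -keyout "$DIR/$c.key" -out "$DIR/$c.csr" -subj "/O=Demo/CN=$c CA"
  openssl x509 -req -days 30 -CA "$DIR/root.pem" -CAkey "$DIR/root.key" \
    -set_serial "0x$(openssl rand -hex 8)" \
    -in "$DIR/$c.csr" -out "$DIR/$c.pem"
done
# Both intermediates chain to the shared root, so certs issued by either
# one can be validated by a peer holding only root.pem
openssl verify -CAfile "$DIR/root.pem" "$DIR/on-prem.pem" "$DIR/cloud.pem"
```

This mirrors what Istio's cacerts setup achieves: distributing only root-cert.pem to every cluster is enough for each side to validate certificates minted by the other side's intermediate.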


Hands-On Exercise: Cross-Cluster Service Discovery


Objective: Connect two kind clusters with Submariner and access a service across clusters.

# 1. Create clusters with unique CIDRs
cat <<EOF | kind create cluster --name cluster-a --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.10.0.0/16"
  serviceSubnet: "10.110.0.0/16"
nodes:
  - role: control-plane
  - role: worker
EOF
cat <<EOF | kind create cluster --name cluster-b --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.20.0.0/16"
  serviceSubnet: "10.120.0.0/16"
nodes:
  - role: control-plane
  - role: worker
EOF
# 2. Deploy Submariner
curl -Ls https://get.submariner.io | VERSION=v0.18.0 bash
kubectl config use-context kind-cluster-a
subctl deploy-broker
subctl join broker-info.subm --clusterid cluster-a --natt=false
kubectl config use-context kind-cluster-b
subctl join broker-info.subm --clusterid cluster-b --natt=false
# 3. Deploy and export a service on cluster-b
kubectl create namespace web
kubectl create deployment nginx --image=nginx:1.27 -n web --replicas=2
kubectl expose deployment nginx -n web --port=80
subctl export service nginx -n web
# 4. Test from cluster-a
kubectl config use-context kind-cluster-a
kubectl run test --rm -it --image=curlimages/curl --restart=Never -- \
curl -s http://nginx.web.svc.clusterset.local
  • Two kind clusters with non-overlapping CIDRs
  • Submariner broker deployed and both clusters joined
  • nginx service exported from cluster-b
  • curl from cluster-a reaches nginx on cluster-b
  • subctl show connections shows “connected”

Continue to Module 8.3: Cloud Repatriation & Migration to learn how to move workloads from cloud to on-premises, translating cloud services to their on-prem equivalents.