Module 10.4: Hybrid Cloud Architecture (On-Prem to Cloud)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design hybrid cloud architectures that securely connect on-premises Kubernetes clusters to cloud provider ecosystems.
- Implement site-to-site VPNs and dedicated connections (Direct Connect/ExpressRoute) to establish reliable hybrid network foundations.
- Evaluate unified identity federation mechanisms, such as Pinniped, to standardize authentication across disparate environments.
- Implement workload migration and data replication strategies that respect regulatory boundaries and latency constraints.
- Compare hybrid orchestration platforms like EKS Anywhere, Anthos, and Azure Arc for standardizing Kubernetes operations across infrastructure boundaries.
Why This Module Matters
In early 2023, a major European financial institution embarked on an ambitious initiative to migrate its high-frequency trading platform from its legacy Frankfurt data centers to Amazon Web Services. The project was conceived as a straightforward “lift and shift” to be completed over an 18-month timeline. However, six months into the architectural planning and initial deployment, the engineering team hit a catastrophic roadblock. They realized that their strict regulatory framework mandated that certain categories of raw market data could never leave physical servers located within the country’s borders.
To compound the issue, their on-premises Frankfurt data centers received low-latency market data feeds via dedicated, physical fiber cross-connects directly from the exchange. Attempting to shift the trading engine to the cloud introduced an unavoidable 3 to 8 milliseconds of network latency. In the realm of high-frequency trading, this microscopic delay was modeled to result in approximately $12 million per year in lost arbitrage opportunities. Furthermore, their core mainframe settlement system contained over two decades of immutable business logic; rewriting it for the cloud was projected to take five years minimum.
Stop and think: If the bank’s trading engine stayed on-premises but analytics moved to the cloud, how might they keep the data in sync without overwhelming the network?
Faced with these immutable constraints, the Chief Technology Officer made a pragmatic pivot: the institution would halt the full migration. Instead, they engineered a robust hybrid cloud architecture. The latency-sensitive trading engine and the legacy settlement system remained firmly on-premises. Meanwhile, customer-facing APIs, massive data analytics pipelines, machine learning workloads, and all newly developed microservices were deployed to Amazon EKS. This strategy required an extraordinarily resilient network fabric connecting the data center to the cloud, unified identity management spanning both domains, and sophisticated, asynchronous data replication techniques. This module teaches you how to design, build, and operate that exact hybrid architecture.
Connectivity: The Physical Foundation
The absolute bedrock of any hybrid cloud architecture is the network link connecting your physical data center to the cloud provider’s network edge. The mechanism you choose dictates your latency, bandwidth, reliability, and ultimately, which architectural patterns are viable.
Pause and predict: Given the 1.25 Gbps bandwidth limit of an AWS Site-to-Site VPN tunnel, how long would it take to transfer a 500GB database backup? What does this mean for disaster recovery planning?
Site-to-Site VPN
A Site-to-Site Virtual Private Network (VPN) creates a secure, IPsec-encrypted tunnel over the public internet. It connects your on-premises customer gateway router to the cloud provider’s virtual private gateway. Because it traverses the public internet, it is subject to the unpredictable routing paths and congestion of global ISPs.
```mermaid
flowchart LR
    subgraph OnPrem[On-Premises DC]
        direction TB
        K8s["K8s Nodes<br>10.1.0.0/16"]
        CGW["VPN Gateway<br>(Customer Gateway)"]
        K8s --- CGW
    end
    subgraph Cloud[AWS VPC]
        direction TB
        VGW["Virtual Private Gateway"]
        EKS["EKS Nodes<br>10.2.0.0/16"]
        VGW --- EKS
    end
    CGW <-->|"IPsec Tunnel<br>(2 tunnels for HA)"| VGW
```
VPNs are exceptional for getting started quickly. They require no physical infrastructure provisioning and can be instantiated via APIs in minutes. However, their variable latency makes them unsuitable for synchronous database replication or latency-sensitive microservice communication across the hybrid boundary.
```bash
# AWS: Create a Site-to-Site VPN connection
# Assumes VPC_ID already holds the ID of the target VPC.

# Step 1: Create a Customer Gateway (your on-premises router's public IP)
CGW_ID=$(aws ec2 create-customer-gateway \
  --type ipsec.1 \
  --public-ip 203.0.113.50 \
  --bgp-asn 65000 \
  --query 'CustomerGateway.CustomerGatewayId' --output text)

# Step 2: Create a Virtual Private Gateway and attach to VPC
VGW_ID=$(aws ec2 create-vpn-gateway \
  --type ipsec.1 \
  --amazon-side-asn 64512 \
  --query 'VpnGateway.VpnGatewayId' --output text)
aws ec2 attach-vpn-gateway --vpn-gateway-id "$VGW_ID" --vpc-id "$VPC_ID"

# Step 3: Create the VPN connection (2 tunnels automatically)
VPN_ID=$(aws ec2 create-vpn-connection \
  --type ipsec.1 \
  --customer-gateway-id "$CGW_ID" \
  --vpn-gateway-id "$VGW_ID" \
  --options '{"StaticRoutesOnly":false}' \
  --query 'VpnConnection.VpnConnectionId' --output text)

# Step 4: Download the configuration for your on-premises router
aws ec2 describe-vpn-connections \
  --vpn-connection-ids "$VPN_ID" \
  --query 'VpnConnections[0].CustomerGatewayConfiguration' \
  --output text > vpn-config.xml
```
Dedicated Connections (Direct Connect / ExpressRoute / Cloud Interconnect)
For production workloads, enterprises utilize dedicated connections like AWS Direct Connect, Azure ExpressRoute, or Google Cloud Interconnect. These services provide a private, physical fiber-optic link from your data center (or colocation facility) directly into the cloud provider’s edge routers.
```mermaid
flowchart LR
    subgraph OnPrem[On-Premises DC]
        direction TB
        K8s["K8s Nodes<br>10.1.0.0/16"]
        CC["Cross-Connect<br>(your cage)"]
        K8s --- CC
    end
    subgraph Cloud[Cloud Provider Edge Location]
        direction TB
        Router["Provider Router"]
        EKS["EKS/AKS/GKE Nodes"]
        Router --- EKS
    end
    CC <-->|"Dedicated Fiber<br>(private, not internet)"| Router
```
Dedicated connections bypass the public internet entirely. They offer highly predictable, consistent latency (often in the 1-5 millisecond range) and massive bandwidth capabilities. The trade-off is high cost and a provisioning timeline that spans weeks or months, as network engineers must physically run and splice cables in colocation cages.
Connectivity Comparison Matrix
| Feature | Site-to-Site VPN | Dedicated Connection |
|---|---|---|
| Bandwidth | Up to 1.25 Gbps/tunnel | 1-100 Gbps |
| Latency | 20-100ms (variable) | 1-5ms (consistent) |
| Reliability | Internet-dependent | SLA-backed (99.9-99.99%) |
| Encryption | Built-in (IPsec) | Optional (MACsec on 10/100G) |
| Cost | Low ($36/month base) | High ($1,600+/month for 1Gbps) |
| Setup time | Hours | Weeks to months |
| Use case | Dev/test, failover, low bandwidth | Production, latency-sensitive, high bandwidth |
| Kubernetes impact | Acceptable for API calls, config sync | Required for data replication, cross-cluster traffic |
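As a rough worked example of what the bandwidth row means in practice, the sketch below estimates how long a bulk transfer takes at a given link speed. The 500 GB backup size comes from the prompt earlier in this section; the `efficiency` parameter and its 60% value are assumptions for illustration, not measured figures.

```python
def transfer_time_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move size_gb (decimal gigabytes) over a link_gbps link.

    efficiency is the assumed fraction of line rate actually achieved
    (protocol overhead, congestion, encryption cost).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency must be in (0, 1]")
    size_gigabits = size_gb * 8  # bytes to bits
    return size_gigabits / (link_gbps * efficiency)


BACKUP_GB = 500

vpn_best_case = transfer_time_seconds(BACKUP_GB, 1.25)       # one tunnel at full line rate
vpn_assumed = transfer_time_seconds(BACKUP_GB, 1.25, 0.6)    # assumed 60% of line rate
dedicated_10g = transfer_time_seconds(BACKUP_GB, 10)         # 10 Gbps dedicated link

for label, seconds in [
    ("VPN, line rate", vpn_best_case),
    ("VPN, 60% efficiency (assumed)", vpn_assumed),
    ("Dedicated 10 Gbps", dedicated_10g),
]:
    print(f"{label}: {seconds / 60:.1f} minutes")
```

Even the best case for a single VPN tunnel is close to an hour for one backup, which is the kind of figure a recovery time objective has to absorb.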
Routing the Hybrid Network: Transit Gateways
As your hybrid footprint grows, managing point-to-point connections becomes an operational nightmare. Routing topologies become brittle. To solve this, cloud providers offer centralized routing hubs (AWS Transit Gateway, Azure Virtual WAN) that act as the single interconnection point for all on-premises and cloud networks.
```mermaid
flowchart TD
    TGW["Transit Gateway<br>(Central Hub)"]
    VPC1["VPC: EKS Prod Cluster<br>10.1.0.0/16"]
    VPC2["VPC: EKS Dev<br>10.2.0.0/16"]
    VPC3["VPC: Shared Services<br>10.3.0.0/16"]
    OnPrem["On-Premises via Direct Connect<br>10.0.0.0/8"]
    TGW <--> VPC1
    TGW <--> VPC2
    TGW <--> VPC3
    TGW <--> OnPrem
```
When implementing a transit hub, you must ensure that your Kubernetes pod and service CIDR blocks are non-overlapping across all environments and are actively advertised via BGP to the transit gateway.
```bash
# Assumes PROD_VPC_ID, PROD_SUBNET_1, PROD_SUBNET_2, DX_GW_ID, TGW_RT_ID,
# and DX_ATTACHMENT_ID are already set for your environment.

# Create Transit Gateway
TGW_ID=$(aws ec2 create-transit-gateway \
  --description "Hybrid-Hub" \
  --options "AmazonSideAsn=64512,AutoAcceptSharedAttachments=disable,DefaultRouteTableAssociation=disable,DefaultRouteTablePropagation=disable,DnsSupport=enable" \
  --query 'TransitGateway.TransitGatewayId' --output text)

# Attach VPCs
aws ec2 create-transit-gateway-vpc-attachment \
  --transit-gateway-id "$TGW_ID" \
  --vpc-id "$PROD_VPC_ID" \
  --subnet-ids "$PROD_SUBNET_1" "$PROD_SUBNET_2"

# Attach Direct Connect Gateway (prefixes advertised to on-premises)
aws directconnect create-direct-connect-gateway-association \
  --direct-connect-gateway-id "$DX_GW_ID" \
  --gateway-id "$TGW_ID" \
  --add-allowed-prefixes-to-direct-connect-gateway cidr=10.1.0.0/16 cidr=10.2.0.0/16 cidr=10.3.0.0/16

# Route on-prem traffic through Transit Gateway
aws ec2 create-transit-gateway-route \
  --transit-gateway-route-table-id "$TGW_RT_ID" \
  --destination-cidr-block 10.0.0.0/8 \
  --transit-gateway-attachment-id "$DX_ATTACHMENT_ID"
```
War Story: A global logistics company connected twelve distinct cloud VPCs and three physical on-premises data centers through a Transit Gateway. Their Kubernetes clusters operated flawlessly in isolation. However, cross-cluster service mesh communication failed randomly. After weeks of debugging, the root cause was identified: multiple VPCs utilized the exact same secondary CIDR ranges for pod IPs, provided by the VPC CNI. Transit Gateways cannot route overlapping CIDRs. They were forced to dismantle their infrastructure and redesign their entire IP address management plan, an agonizing process that proper upfront planning would have completely avoided.
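A check like the one that would have caught this problem early can be sketched with Python's standard `ipaddress` module. The range names and the reused `100.64.0.0/16` secondary range below are hypothetical; substitute the CIDRs from your own address plan.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR ranges that share at least one address."""
    networks = {name: ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical plan: two VPCs with distinct primary ranges but the same
# secondary range handed to pods.
plan = {
    "prod-vpc": "10.1.0.0/16",
    "dev-vpc": "10.2.0.0/16",
    "prod-pods": "100.64.0.0/16",
    "dev-pods": "100.64.0.0/16",
}

overlaps = find_overlaps(plan)
for a, b in overlaps:
    print(f"CONFLICT: {a} ({plan[a]}) overlaps {b} ({plan[b]})")
```

Running a check of this kind in CI against the address plan is one way to turn "proper upfront planning" into something enforceable.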
Unified Identity: Extending the Cloud Control Plane
Managing authentication separately for on-premises clusters and cloud clusters creates significant operational friction and security vulnerabilities. A true hybrid architecture requires a unified identity plane where a single set of credentials grants access everywhere based on centralized Role-Based Access Control (RBAC).
Stop and think: If your corporate Identity Provider goes down, what happens to developers trying to access the on-premises Kubernetes cluster via Pinniped? How would break-glass access work?
Identity Architecture Options
The goal is to federate identity from a central provider (IdP) to every Kubernetes cluster, regardless of its hosting location.
```mermaid
flowchart TD
    IdP["Identity Provider (IdP)<br>Central Source of Truth<br>(Azure AD, Okta, Google Workspace)"]
    subgraph Federation[OIDC Federation]
        direction LR
        CloudEKS["Cloud EKS<br>OIDC via IdP"]
        CloudAKS["Cloud AKS<br>Azure AD native"]
        OnPrem["On-Prem K8s<br>OIDC via Dex/Pinniped"]
    end
    IdP --> CloudEKS
    IdP --> CloudAKS
    IdP --> OnPrem
```
Pinniped: Unified Kubernetes Authentication
Pinniped is an open-source project designed to provide identity federation for any Kubernetes cluster. It bridges the gap between modern OIDC providers and on-premises clusters that lack native integrations. Pinniped operates via a Supervisor that integrates with your IdP and a Concierge that sits on the target clusters to validate the tokens.
```yaml
# Install Pinniped Supervisor (on a management cluster)
# This acts as the OIDC bridge between your IdP and Kubernetes clusters
apiVersion: config.supervisor.pinniped.dev/v1alpha1
kind: FederationDomain
metadata:
  name: company-federation
  namespace: pinniped-supervisor
spec:
  issuer: https://pinniped.internal.company.com
  tls:
    secretName: pinniped-tls-cert
---
# Connect Pinniped to your corporate IdP (e.g., Okta)
apiVersion: idp.supervisor.pinniped.dev/v1alpha1
kind: OIDCIdentityProvider
metadata:
  name: okta-idp
  namespace: pinniped-supervisor
spec:
  issuer: https://company.okta.com/oauth2/default
  authorizationConfig:
    additionalScopes:
      - groups
      - email
    allowPasswordGrant: false
  claims:
    username: email
    groups: groups
  client:
    secretName: okta-client-secret
---
# On each on-prem cluster, install Pinniped Concierge
apiVersion: authentication.concierge.pinniped.dev/v1alpha1
kind: JWTAuthenticator
metadata:
  name: company-jwt
spec:
  issuer: https://pinniped.internal.company.com
  audience: on-prem-cluster-1
  tls:
    certificateAuthorityData: <base64-encoded-ca-cert>
```
With Pinniped configured, developers utilize a standardized workflow to access any cluster using the Pinniped CLI plugin.
```bash
# Developer workflow (same for cloud and on-prem)
# Install the Pinniped CLI
brew install vmware-tanzu/pinniped/pinniped-cli

# Generate kubeconfig for an on-prem cluster
pinniped get kubeconfig \
  --kubeconfig-context on-prem-cluster-1 \
  > /tmp/on-prem-kubeconfig.yaml

# The kubeconfig triggers browser-based OIDC login
# Same Okta credentials work for cloud and on-prem clusters
kubectl --kubeconfig /tmp/on-prem-kubeconfig.yaml get nodes
```
Data Gravity: Replicating State Across Boundaries
Stateless applications are trivial to move between environments. Data, however, has immense gravity. Moving terabytes of stateful data across a WAN link is slow, expensive, and technically complex. Synchronous replication is rarely feasible across long distances due to the laws of physics dictating minimum latency.
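To make the physics concrete, the sketch below computes the lowest possible round-trip time over fiber for a given distance and the ceiling that puts on serialized synchronous commits. It assumes light travels through fiber at roughly two-thirds of its vacuum speed and ignores routing, queuing, and equipment delay, so real latency will be higher. The distances are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2 / 3  # approximation for light in glass fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a fiber path of distance_km."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


def max_serial_sync_commits_per_s(distance_km: float) -> float:
    """Upper bound on back-to-back commits on one connection when each
    commit must wait for a remote acknowledgement (one round trip)."""
    return 1000 / min_round_trip_ms(distance_km)


for km in (50, 600, 6000):  # illustrative fiber path lengths
    print(
        f"{km:>5} km: >= {min_round_trip_ms(km):.2f} ms RTT, "
        f"<= {max_serial_sync_commits_per_s(km):.0f} serial sync commits/s"
    )
```

Because this bound cannot be engineered away, the replication patterns that work across a hybrid boundary are mostly asynchronous.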
Data Replication Patterns
| Pattern | Use Case | Latency Tolerance | Tools |
|---|---|---|---|
| Active-Passive | DR, read replicas | Minutes | AWS DMS, Azure Site Recovery |
| Active-Active | Multi-region writes | Sub-second | CockroachDB, YugabyteDB, Cassandra |
| Event Streaming | Real-time sync | Seconds | Kafka MirrorMaker, Confluent Replicator |
| Batch Sync | Analytics, reporting | Hours | AWS DataSync, Rclone, rsync |
| Cache-Aside | Read-heavy, latency-sensitive | Milliseconds | Redis Enterprise, Hazelcast |
Cross-Environment Database Replication
To mitigate latency issues, you can implement asynchronous streaming replication. The on-premises database acts as the primary writer, and changes are streamed to a read-replica running in the cloud.
```yaml
# PostgreSQL streaming replication across hybrid boundary
# On-prem primary → Cloud read replica

# On the on-prem primary (postgresql.conf)
# wal_level = replica
# max_wal_senders = 5
# wal_keep_size = 1GB

# On the cloud replica (Kubernetes StatefulSet)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
  namespace: database
spec:
  serviceName: postgres-replica
  replicas: 1
  selector:
    matchLabels:
      app: postgres-replica
  template:
    metadata:
      labels:
        app: postgres-replica
    spec:
      containers:
        - name: postgres
          image: postgres:16.2
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          command:
            - bash
            - -c
            - |
              # Initialize as a streaming replica of the on-prem primary
              if [ ! -f "$PGDATA/PG_VERSION" ]; then
                pg_basebackup -h 10.0.50.100 -U replicator \
                  -D "$PGDATA" -Fp -Xs -P -R
              fi
              # The inline password is for illustration only; in practice,
              # load credentials from a Kubernetes Secret.
              # The replication slot 'cloud_replica' must already exist on the primary.
              exec postgres \
                -c primary_conninfo='host=10.0.50.100 port=5432 user=replicator password=secret' \
                -c primary_slot_name='cloud_replica'
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 500Gi
```
Kafka for Cross-Environment Event Streaming
For modern microservices, event streaming using tools like Kafka MirrorMaker provides an elegant solution to data replication. Events generated on-premises are automatically mirrored to cloud clusters, enabling decoupled architectures.
```yaml
# Kafka MirrorMaker 2 for hybrid event streaming
# Replicates topics from on-prem Kafka to cloud Kafka
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: hybrid-mirror
  namespace: kafka
spec:
  version: 3.7.0
  replicas: 3
  connectCluster: cloud-kafka
  clusters:
    - alias: onprem-kafka
      bootstrapServers: onprem-kafka-bootstrap.datacenter.internal:9093
      tls:
        trustedCertificates:
          - secretName: onprem-ca-cert
            certificate: ca.crt
      authentication:
        type: tls
        certificateAndKey:
          secretName: mirror-maker-cert
          certificate: tls.crt
          key: tls.key
    - alias: cloud-kafka
      bootstrapServers: kafka-bootstrap.kafka.svc:9092
      config:
        config.storage.replication.factor: 3
        offset.storage.replication.factor: 3
        status.storage.replication.factor: 3
  mirrors:
    - sourceCluster: onprem-kafka
      targetCluster: cloud-kafka
      sourceConnector:
        config:
          replication.factor: 3
          offset-syncs.topic.replication.factor: 3
          sync.topic.acls.enabled: false
          replication.policy.class: "org.apache.kafka.connect.mirror.IdentityReplicationPolicy"
      topicsPattern: "trading\\..*|settlement\\..*"
      groupsPattern: ".*"
```
Workload Migration Strategies: Shifting the Traffic
When you are ready to begin moving applications to your hybrid cloud environment, a “big bang” switch is highly discouraged. Instead, employ progressive traffic shifting to iteratively test and validate your cloud clusters.
Pause and predict: If you shift 1% of traffic to a new cloud cluster and monitor it for 24 hours, what specific metrics would tell you it is safe to increase the traffic to 10%?
Pattern 1: Weighted DNS Routing
DNS-level traffic shifting involves configuring multiple records for a single domain, weighting the responses so that only a fraction of users resolve to the new cloud ingress.
```yaml
# Example: AWS Route 53 Weighted Record via ExternalDNS annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.company.com
    external-dns.alpha.kubernetes.io/aws-weight: "10"  # 10% to cloud
    external-dns.alpha.kubernetes.io/set-identifier: "cloud-eks-cluster"
spec:
  rules:
    - host: api.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
```
While straightforward, DNS caching by ISPs and client browsers means traffic changes can be heavily delayed, making quick rollbacks difficult.
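One detail worth making explicit: a weighted record's share of DNS responses is its weight divided by the sum of all weights for that name, so a weight of 10 means 10% only if the other records add up to 90. The short sketch below works through that arithmetic. The `on-prem-dc` identifier is a hypothetical name for the on-premises record.

```python
def traffic_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of DNS responses each weighted record receives."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one record needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Intended canary: the on-prem record carries weight 90, the cloud record 10.
intended = traffic_shares({"on-prem-dc": 90, "cloud-eks-cluster": 10})

# Easy mistake: both records left at weight 10, which splits traffic evenly.
mistaken = traffic_shares({"on-prem-dc": 10, "cloud-eks-cluster": 10})

print(f"intended cloud share: {intended['cloud-eks-cluster']:.0%}")
print(f"mistaken cloud share: {mistaken['cloud-eks-cluster']:.0%}")
```

Checking the weights of every record set for the hostname before a rollout avoids sending far more traffic to the new cluster than planned.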
Pattern 2: Multi-Cluster Ingress
A vastly superior approach utilizes a multi-cluster ingress controller or a global load balancer. These tools terminate the connection centrally and distribute HTTP traffic deterministically based on dynamic rules.
```mermaid
flowchart TD
    DNS["api.company.com<br>(100% traffic)"]
    subgraph OnPrem["On-Premises Data Center"]
        direction TB
        IngressOP["Ingress Controller"]
        PodsOP["API Pods"]
        IngressOP --> PodsOP
    end
    subgraph Cloud["Cloud EKS Cluster"]
        direction TB
        IngressCloud["Ingress Controller"]
        PodsCloud["API Pods"]
        IngressCloud --> PodsCloud
    end
    DNS -->|90% traffic| IngressOP
    DNS -->|10% traffic| IngressCloud
```
This pattern eliminates the risks associated with DNS caching and allows for immediate, highly granular traffic shifting based on headers, paths, or geographic origins.
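One common way a central load balancer can make such a split deterministic is to hash a stable request attribute into a bucket. The sketch below illustrates the idea and does not reflect any particular product's implementation. The `route_request` function and the `user-N` identifiers are hypothetical.

```python
import hashlib


def route_request(stable_id: str, cloud_percent: int) -> str:
    """Map a stable identifier (user ID, session ID) to a backend.

    The same ID always lands in the same bucket, so a user does not
    bounce between environments from one request to the next.
    """
    digest = hashlib.sha256(stable_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "cloud" if bucket < cloud_percent else "on-prem"


ids = [f"user-{i}" for i in range(10_000)]  # hypothetical user IDs
cloud_share = sum(route_request(i, 10) == "cloud" for i in ids) / len(ids)
print(f"share routed to cloud at 10%: {cloud_share:.1%}")

# Raising the percentage only adds users to the cloud group; anyone already
# routed to cloud at 10% stays there at 50%.
stays_put = all(
    route_request(i, 50) == "cloud" for i in ids if route_request(i, 10) == "cloud"
)
print(f"10% cohort is a subset of the 50% cohort: {stays_put}")
```

Rolling back is a matter of setting the percentage to zero, which takes effect on the next request rather than after DNS caches expire.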
On-Premises Cloud Parity: EKS Anywhere and Anthos
To prevent operational silos, you want the experience of deploying to on-premises Kubernetes to be identical to deploying to the cloud. Platforms like Amazon EKS Anywhere and Google GKE Enterprise (Anthos) accomplish this by packaging the cloud provider’s Kubernetes distribution for local infrastructure, while Azure Arc attaches existing Kubernetes clusters to Azure’s management plane.
Hybrid Platform Comparison
| Feature | EKS Anywhere | GKE Enterprise (Anthos) | Azure Arc-enabled K8s | Rancher |
|---|---|---|---|---|
| Provider | AWS | Google | Microsoft | SUSE |
| On-prem infra | VMware, bare metal, Nutanix | VMware, bare metal | Any K8s cluster | Any K8s cluster |
| Cloud parity | EKS API compatible | GKE API compatible | AKS policy/monitoring | Cloud-agnostic |
| Management plane | Optional EKS connector | Mandatory GCP connection | Mandatory Azure connection | Self-hosted |
| Cost | Free (support extra) | Per-vCPU licensing | Free (extensions extra) | Free (Rancher Prime extra) |
| GitOps | Flux (built-in) | Config Sync (built-in) | GitOps with Flux | Fleet (built-in) |
| Best for | AWS-centric orgs | GCP-centric orgs | Azure-centric orgs | Multi-cloud, vendor-neutral |
EKS Anywhere Architecture
EKS Anywhere brings the EKS control plane to your VMware vSphere environments or bare-metal servers. It heavily leverages Cluster API for declarative provisioning and Flux for built-in GitOps.
Pause and predict: If the EKS Anywhere Management Cluster loses connectivity to the Workload Cluster, do the applications on the Workload Cluster stop running? Why or why not?
```mermaid
flowchart TD
    Admin["Admin Machine<br>- eksctl-anywhere CLI<br>- kubectl"]
    subgraph DC["ON-PREMISES DATA CENTER"]
        direction TB
        Mgmt["EKS Anywhere Management Cluster<br>- Cluster API (CAPI)<br>- Flux (GitOps)<br>- Curated Packages"]
        Workload["EKS Anywhere Workload Cluster<br>- CP-1, CP-2, CP-3<br>- Worker Nodes<br>Running on: VMware/Bare Metal"]
        Mgmt -->|manages| Workload
    end
    Admin --> Mgmt
    Workload -.->|"Optional: EKS Connector"| AWS["Visible in AWS Console"]
```
Deploying an EKS Anywhere cluster is entirely declarative. First, generate your target specifications.
```bash
# Create an EKS Anywhere cluster on VMware
# Step 1: Generate cluster configuration
eksctl anywhere generate clusterconfig hybrid-prod \
  --provider vsphere > cluster-config.yaml
```
The resulting configuration file contains all the necessary Cluster API definitions, defining network CIDRs, vCenter integration, and node pool sizes.
```yaml
# cluster-config.yaml (simplified)
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: hybrid-prod
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: 10.0.100.10
    machineGroupRef:
      kind: VSphereMachineConfig
      name: hybrid-prod-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: hybrid-prod-dc
  kubernetesVersion: "1.35"
  workerNodeGroupConfigurations:
    - count: 5
      machineGroupRef:
        kind: VSphereMachineConfig
        name: hybrid-prod-worker
      name: workers
  gitOpsRef:
    kind: FluxConfig
    name: hybrid-prod-flux
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: hybrid-prod-dc
spec:
  datacenter: dc-frankfurt
  server: vcenter.internal.company.com
  network: /dc-frankfurt/network/k8s-prod
  thumbprint: "AB:CD:EF:..."
  insecure: false
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: hybrid-prod-worker
spec:
  diskGiB: 100
  folder: /dc-frankfurt/vm/k8s
  memoryMiB: 16384
  numCPUs: 4
  osFamily: ubuntu
  resourcePool: /dc-frankfurt/host/cluster-1/Resources/k8s-pool
  template: /dc-frankfurt/vm/templates/ubuntu-2204-k8s-1.35
```
With the file modified to match your vSphere environment, the cluster creation takes place.
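```bash
# Step 2: Create the cluster
eksctl anywhere create cluster -f cluster-config.yaml

# Step 3: (Optional) Connect to AWS for visibility via the EKS Connector.
# This generates connector manifests that you then apply to the cluster with kubectl.
eksctl register cluster \
  --name hybrid-prod \
  --provider EKS_ANYWHERE \
  --region us-east-1

# Step 4: Install curated packages (same add-ons as EKS).
# harbor-config.yaml is a Package manifest, for example one produced by
# `eksctl anywhere generate package harbor --cluster hybrid-prod`.
eksctl anywhere create packages -f harbor-config.yaml
```
Latency Budget For Hybrid Operations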
When architecting hybrid systems, understanding latency tolerances is critical. Some operations fail spectacularly if network latency exceeds acceptable thresholds.
| Operation | VPN | Direct Connect |
|---|---|---|
| kubectl get pods | 50-150ms | 5-15ms |
| ArgoCD sync check | 50-150ms | 5-15ms |
| Cross-cluster service call | 40-120ms | 3-10ms |
| Database replication (streaming) | 40-120ms | 3-10ms |
| Prometheus remote write | 50-150ms | 5-15ms |
| Container image pull (1GB) | 8-25s | 0.8-2s |
| Velero backup (100GB) | 13-40min | 1.5-4min |
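Per-call figures like these compound when a request makes several cross-boundary calls in sequence. The sketch below multiplies the "Cross-cluster service call" ranges from the table by a call count and compares the result with a latency budget. The budget of 200 ms and the count of five sequential calls are assumptions for illustration.

```python
def chain_latency_ms(sequential_calls: int, rtt_ms: float, local_work_ms: float = 0.0) -> float:
    """Total time for calls that must run one after another across the link."""
    return sequential_calls * rtt_ms + local_work_ms


BUDGET_MS = 200  # assumed end-to-end budget for one user request
CALLS = 5        # assumed number of sequential cross-boundary calls

# (best, worst) per-call latency in ms, taken from the table above
LINK_CALL_MS = {"VPN": (40, 120), "Direct Connect": (3, 10)}

for link, (best, worst) in LINK_CALL_MS.items():
    best_total = chain_latency_ms(CALLS, best)
    worst_total = chain_latency_ms(CALLS, worst)
    verdict = "within" if worst_total <= BUDGET_MS else "over"
    print(f"{link}: {best_total:.0f}-{worst_total:.0f} ms, worst case {verdict} budget")
```

Under these assumptions the VPN path only meets the budget in its best case, which is why chatty call patterns across the boundary are usually redesigned into fewer, coarser calls.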
Unified Control Plane Patterns: Fleet Management
As your environment matures, managing dozens of hybrid clusters via independent scripts is a recipe for drift. Implementing a centralized GitOps and observability pipeline is the final piece of the hybrid architecture.
Stop and think: In a Hub-Spoke GitOps architecture, what happens if the network link between the Cloud Hub and the On-Prem Spoke goes down for 4 hours while developers are merging code to the main branch?
Pattern 1: Hub-Spoke with GitOps
A Hub-Spoke architecture centralizes GitOps operators (like ArgoCD) and monitoring aggregators on a primary “Hub” cluster in the cloud, syncing changes outward to the spoke clusters on-premises and pulling metrics backward.
```mermaid
flowchart TD
    subgraph Hub["HUB CLUSTER (cloud)"]
        direction TB
        Argo["ArgoCD (centralized)<br>├── ApplicationSet: on-prem clusters<br>├── ApplicationSet: cloud clusters<br>└── App of Apps: platform services"]
        Prom["Prometheus (federated)<br>├── remote_read: on-prem prometheus<br>└── remote_read: cloud prometheus"]
    end
    subgraph Spoke1["On-Prem Cluster (Spoke)"]
        direction TB
        Agent1["ArgoCD agent"]
        Prom1["Prometheus"]
    end
    subgraph Spoke2["Cloud EKS Cluster (Spoke)"]
        direction TB
        Agent2["ArgoCD agent"]
        Prom2["Prometheus"]
    end
    Argo -->|syncs| Agent1
    Argo -->|syncs| Agent2
    Prom -->|reads| Prom1
    Prom -->|reads| Prom2
```
Using ArgoCD ApplicationSets, you dynamically target clusters based on labels rather than maintaining individual app configurations per environment.
```yaml
# ArgoCD ApplicationSet for hybrid fleet management
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  template:
    metadata:
      name: 'platform-{{name}}'
    spec:
      project: platform
      source:
        repoURL: https://github.com/company/platform-services.git
        targetRevision: main
        path: 'overlays/{{metadata.labels.location}}'
      destination:
        server: '{{server}}'
        namespace: platform-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```
Did You Know?
- AWS Direct Connect has over 115 locations globally as of 2025, but provisioning a new connection still takes 2-12 weeks because it involves physical fiber cross-connects. Some enterprises maintain “dark fiber” connections (provisioned but unused circuits) specifically so they can activate new Direct Connect links in hours instead of weeks. These dark fiber circuits cost about $500/month in cross-connect fees alone.
- Google’s Anthos was rebranded to “GKE Enterprise” in 2023 after Google found that the “Anthos” name confused customers who did not associate it with Kubernetes. The per-vCPU pricing adds up quickly: the license alone can come to $2,900/month, on top of the infrastructure costs.
- EKS Anywhere was launched in 2021 as a free, open-source project. AWS makes money not from EKS Anywhere itself but from the “EKS Anywhere Enterprise Subscription” ($24,000/year per cluster for 24/7 support) and from workloads that eventually migrate to cloud EKS. Internal AWS metrics show that 68% of EKS Anywhere clusters also connect to at least one cloud EKS cluster within their first year.
- The average enterprise with a hybrid cloud strategy maintains connections to 2.3 cloud providers and 1.8 data centers simultaneously, according to a 2024 Flexera survey. The most common combination is AWS + Azure + one on-premises data center. The “single cloud” strategy that analysts predicted in 2018 has not materialized — instead, enterprises have become deliberately multi-cloud, though usually with one primary and one secondary provider.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| VPN as the sole production connection | Quick to set up. “We will upgrade to Direct Connect later.” Then production grows to depend on internet stability. | Use VPN for non-production and as a failover path. Direct Connect for production workloads. Design for this from day one. |
| Overlapping IP ranges between on-prem and cloud | On-prem uses 10.0.0.0/8 extensively. Cloud VPCs also default to 10.x. Pod CIDRs overlap because no one coordinated. | Centralized IPAM from the start. Reserve distinct ranges: on-prem 10.0-10.63, cloud 10.64-10.127, pods 10.128-10.191. Document and enforce. |
| Separate identity systems for cloud and on-prem K8s | Cloud K8s uses cloud-native auth. On-prem K8s uses static tokens or client certs. Different credentials, different RBAC, inconsistent access. | Deploy Pinniped or Dex as a unified OIDC bridge. One IdP, one login, consistent RBAC across all clusters. |
| Trying to do active-active across the hybrid boundary | Architect designs active-active database replication across 50ms VPN link. Application assumes single-digit-ms latency for distributed locks. | Be honest about latency constraints. Active-active across a WAN requires a database designed for geo-distribution: consensus-based distributed SQL (CockroachDB, YugabyteDB) or a store with built-in conflict resolution such as CRDTs. Every cross-site write still pays WAN latency, and not all workloads can tolerate this. |
| No local container registry on-prem | On-prem clusters pull images from cloud ECR/ACR/Artifact Registry across the WAN link. Slow pulls, failed deployments during network blips. | Deploy Harbor or a registry mirror on-prem. Pre-cache images. Set imagePullPolicy: IfNotPresent for on-prem workloads. |
| Managing on-prem clusters with SSH and scripts | ”We have always managed servers this way.” But Kubernetes clusters need declarative management, not imperative scripts. | Use GitOps (ArgoCD/Flux) for all clusters, including on-prem. Cluster API or EKS Anywhere for infrastructure lifecycle. No SSH management. |
| Ignoring DNS split-horizon | On-prem services use .internal.company.com. Cloud services use different domains. Cross-environment service discovery breaks. | Design a unified DNS strategy. Use CoreDNS forwarding, Route53 Resolver endpoints, or a service mesh for cross-environment service discovery. |
| No monitoring for the connection itself | Teams monitor applications but not the VPN/Direct Connect link. When the link degrades, everything breaks and no one knows why. | Monitor connection latency, packet loss, and bandwidth utilization. Alert when latency exceeds baseline by 2x. CloudWatch metrics for Direct Connect, custom probes for VPN. |
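The "alert when latency exceeds baseline by 2x" rule from the last row can be sketched as a small comparison of medians. The sample values below are made up, and how probe measurements are collected (ICMP, TCP connect time, CloudWatch) is left open.

```python
from statistics import median


def latency_alert(baseline_ms: list[float], recent_ms: list[float], factor: float = 2.0) -> bool:
    """True when the recent median latency exceeds factor x the baseline median.

    Medians are used so a single slow probe does not trigger the alert.
    """
    if not baseline_ms or not recent_ms:
        raise ValueError("both sample lists must be non-empty")
    return median(recent_ms) > factor * median(baseline_ms)


baseline = [4.0, 5.0, 5.0, 6.0, 5.0]   # hypothetical probe results from a normal week

healthy = latency_alert(baseline, [5.0, 6.0, 5.0])
degraded = latency_alert(baseline, [11.0, 12.0, 15.0])
print(f"healthy link alerts: {healthy}, degraded link alerts: {degraded}")
```

The same comparison works for packet loss or bandwidth utilization with a different threshold factor.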
Question 1: Your on-premises Kubernetes cluster needs to pull container images from Amazon ECR. The cluster connects to AWS via a site-to-site VPN. Image pulls take 90 seconds for a 500MB image. How would you improve this?
Several approaches can significantly improve this process. First, you should deploy an on-premises registry mirror (such as Harbor with a proxy cache) that pulls images from ECR once and serves them locally to all nodes. Subsequent pulls will happen at local-network speeds, eliminating the WAN latency. Second, you can implement an automated process to pre-pull images as part of the deployment pipeline, ensuring they are cached on the nodes before the new pods are scheduled. Third, consider optimizing the image size using multi-stage builds or distroless base images, which often reduce a 500MB footprint down to 50-100MB. Finally, if the business budget allows, upgrading to a Direct Connect circuit would drastically reduce the transfer time from 90 seconds to just a few seconds.
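The numbers in this scenario can be checked with a few lines of arithmetic. The sketch below derives the effective throughput implied by a 90-second pull of a 500 MB image and compares WAN traffic with and without a local mirror. The node count of 20 is an assumption for illustration.

```python
def effective_mbps(size_mb: float, seconds: float) -> float:
    """Throughput in megabits per second implied by a transfer."""
    return size_mb * 8 / seconds


def wan_traffic_gb(image_mb: float, nodes: int, mirrored: bool) -> float:
    """Data crossing the WAN when every node needs the image once.

    With a local mirror, only the mirror's first pull crosses the WAN.
    """
    pulls_over_wan = 1 if mirrored else nodes
    return image_mb * pulls_over_wan / 1000


NODES = 20  # assumed cluster size

throughput = effective_mbps(500, 90)
print(f"effective throughput: {throughput:.1f} Mbps")
print(f"WAN traffic without mirror: {wan_traffic_gb(500, NODES, mirrored=False):.1f} GB")
print(f"WAN traffic with mirror:    {wan_traffic_gb(500, NODES, mirrored=True):.1f} GB")
```

An effective rate of roughly 44 Mbps is far below the 1.25 Gbps tunnel limit, which suggests the internet path, not the tunnel cap, is the bottleneck in this scenario.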
Question 2: Your network engineering team is allocating IP ranges for a new hybrid cloud expansion. They suggest reusing the 10.244.0.0/16 range for pods in both the on-premises and AWS EKS clusters, arguing that the clusters are separate. Why will this cause a major outage when you deploy a multi-cluster service mesh?
In a single-cloud or fully isolated deployment, pod CIDRs only need to be routable within their local VPC or cluster network. However, in a hybrid architecture with a multi-cluster service mesh, traffic must be routed directly between pods across the transit gateway or VPN. If both the on-premises and cloud clusters use the exact same 10.244.0.0/16 pod CIDR, the underlying network routers will experience a conflict and cannot determine the correct destination for packets. Cross-cluster service calls, database connections initiated from pods, and centralized monitoring scrapes will typically fail once traffic hits that overlapping CIDR conflict. To prevent this, you must implement centralized IPAM that assigns unique, non-overlapping CIDR ranges to every cluster’s pod and service networks.
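An overlap check of this kind is easy to automate before any range is handed out. The following is a minimal sketch for IPv4 only, written in POSIX shell; the function names (`ip_to_int`, `cidr_bounds`, `cidrs_overlap`) are hypothetical, and a real IPAM tool would also track allocations and handle IPv6.

```shell
#!/bin/sh
# Sketch: detect overlapping IPv4 CIDR ranges before assigning them to clusters.

# ip_to_int A.B.C.D -> 32-bit integer (runs in a subshell so IFS changes stay local)
ip_to_int() (
    IFS=.
    set -- $1
    echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

# cidr_bounds A.B.C.D/N -> "first last" as integers
cidr_bounds() (
    base=$(ip_to_int "${1%/*}")
    prefix=${1#*/}
    size=$(( 1 << (32 - prefix) ))
    first=$(( base / size * size ))   # align to the network address
    echo "$first $(( first + size - 1 ))"
)

# cidrs_overlap CIDR_A CIDR_B -> exit 0 if the ranges share any address
cidrs_overlap() (
    set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
    # Two ranges overlap when each one starts before the other ends.
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
)

check() {
    if cidrs_overlap "$1" "$2"; then
        echo "$1 vs $2: OVERLAP"
    else
        echo "$1 vs $2: ok"
    fi
}

check 10.244.0.0/16 10.244.0.0/16   # the proposal from the question
check 10.244.0.0/16 10.245.0.0/16   # distinct pod ranges
check 10.96.0.0/12  10.112.0.0/12   # distinct service ranges
```

Running a check like this in CI against a list of all cluster, VPC, and data center ranges is one lightweight way to enforce the non-overlap rule.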
Question 3: Your company's CTO has mandated a unified Kubernetes strategy across AWS and your VMware-based on-premises data centers. The platform team is debating between using `kubeadm` to build a custom distribution versus adopting EKS Anywhere. What are the operational trade-offs they must consider before making this decision?
Opting for EKS Anywhere provides significant operational advantages, including declarative lifecycle management via Cluster API and pre-integrated tools like Flux for GitOps. It also ensures strict version compatibility with cloud-based EKS and provides curated, heavily tested add-ons right out of the box. However, this convenience comes with strict trade-offs, primarily a deep vendor dependency on AWS’s release cycles and limited support for underlying infrastructure (e.g., VMware or Bare Metal, but not Hyper-V). Conversely, using kubeadm offers complete architectural freedom and avoids vendor lock-in, but places the entire burden of engineering the cluster lifecycle, integrating add-ons, and building GitOps pipelines squarely on your platform team. Ultimately, the decision hinges on whether the organization prefers to buy a standardized operational model or build a highly customized one.
Question 4: Your company has a Direct Connect to AWS and an ExpressRoute to Azure. You want unified monitoring across all clusters. What architecture would you recommend?
The most robust approach is a federated Prometheus architecture with a highly available central aggregation layer. Deploy a local Prometheus instance on each cluster (on-premises, AWS, and Azure) to collect metrics and provide short-term buffering during network partitions. Because you have high-bandwidth dedicated connections available, you can use Thanos or Prometheus remote_write to ship these metrics to a central storage tier without saturating the network links. This central store, which handles long-term retention and global querying, should be placed in the cloud environment with the most reliable connectivity or in a managed service like Grafana Cloud. With this design, a dropped link does not stop collection: each local Prometheus keeps scraping, and remote_write retries from its write-ahead log once connectivity is restored. That buffer is finite (roughly two hours with default settings), so plan link redundancy and alerting around outages that may last longer.
Question 5: You have successfully connected your on-premises data center to AWS via Direct Connect. However, developers complain that they use their corporate Okta single sign-on for the EKS clusters, but must use static `kubeconfig` files with client certificates for the on-premises clusters. How does a tool like Pinniped solve this specific pain point?
Pinniped acts as a unified identity federation bridge that standardizes the authentication flow across any type of Kubernetes cluster. It features a Supervisor component that integrates directly with your corporate Identity Provider (like Okta) and a Concierge component installed on every target cluster to validate the resulting tokens. Instead of managing static certificates or setting up separate OIDC integrations for each on-premises cluster, administrators configure a single identity source. Developers can then use a single pinniped login command that triggers a familiar browser-based OIDC login flow. Ultimately, this ensures that the same corporate credentials and RBAC policies govern access across the entire hybrid fleet, dramatically reducing administrative overhead and improving security.
Question 6: Your startup is extending its on-premises development environment into the cloud to access specialized GPU nodes. The CTO wants to immediately order a 10Gbps Direct Connect circuit to link the environments. Under what specific conditions would you advise starting with a Site-to-Site VPN instead, and when would the Direct Connect become strictly necessary?
For an initial development environment expansion, a Site-to-Site VPN is generally the superior starting point because it can be provisioned in hours and costs a fraction of dedicated fiber. Because these are development workloads, occasional internet-induced latency spikes or minor packet loss will likely not cause business-impacting outages. You should advise starting with a VPN to rapidly unblock the engineering teams and validate the architectural patterns. A Direct Connect circuit becomes strictly necessary only when you transition to production workloads that require consistent single-digit millisecond latency, or when synchronous data replication and large-scale cross-cluster service mesh traffic saturate the VPN’s bandwidth. Ultimately, most mature enterprises maintain both, using Direct Connect for heavy production data and keeping the VPN as an automatic failover path.
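One way to ground the "single-digit millisecond" criterion is to measure the link rather than guess. The sketch below extracts the average round-trip time from a ping summary line and compares it to a target. It assumes the Linux iputils summary format (`rtt min/avg/max/mdev = ...`); the function names are hypothetical and the sample line is made up to resemble a VPN path.

```shell
#!/bin/sh
# Sketch: decide whether a measured link meets a latency target.

# avg_rtt_ms: read ping output on stdin, print the average RTT in ms.
# Assumes a summary line such as "rtt min/avg/max/mdev = a/b/c/d ms".
avg_rtt_ms() {
    awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# meets_latency_target AVG_MS TARGET_MS -> exit 0 when AVG_MS is below TARGET_MS
meets_latency_target() {
    awk -v avg="$1" -v target="$2" 'BEGIN { exit !(avg + 0 < target + 0) }'
}

# Made-up sample; in practice pipe real output, e.g. `ping -c 20 -q <peer>`.
sample='rtt min/avg/max/mdev = 21.304/24.871/31.002/4.377 ms'
avg=$(printf '%s\n' "$sample" | avg_rtt_ms)

if meets_latency_target "$avg" 10; then
    echo "avg ${avg}ms: within the 10ms target"
else
    echo "avg ${avg}ms: exceeds the 10ms target, consider a dedicated circuit"
fi
```

Average latency is only part of the picture. The mdev field in the same summary line indicates jitter, which is often what separates an internet VPN from a dedicated circuit.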
Hands-On Exercise: Simulate a Hybrid Cloud Architecture
In this intensive exercise, you will synthesize everything discussed in this module. You will stand up a simulated hybrid environment utilizing two kind clusters interconnected by a shared Docker network representing your VPN or Direct Connect.
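Before starting, it can help to confirm that the three CLIs this exercise relies on (docker, kind, and kubectl) are on your PATH. A minimal preflight sketch follows; the function name `missing_tools` is hypothetical.

```shell
#!/bin/sh
# Sketch: preflight check for the CLIs used in this exercise.

# missing_tools TOOL... -> prints each tool not found on PATH, one per line
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools docker kind kubectl)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Install these before continuing:" $missing
else
    echo "All required CLIs found."
fi
```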
What you will build:
flowchart LR
    subgraph OnPrem["On-Premises<br>kind cluster"]
        direction TB
        Backend["- App backend"]
        PG["- PostgreSQL"]
        PromOP["- Prometheus"]
    end
    subgraph Cloud["Cloud<br>kind cluster"]
        direction TB
        Frontend["- App frontend"]
        Argo["- ArgoCD (hub)"]
        PromCloud["- Prometheus"]
    end
    OnPrem <-->|"Docker<br>network"| Cloud

Task 1: Create the Hybrid Clusters
First, we must establish our distinct network environments and link them using a Docker bridge network.
Solution
# Create a shared Docker network (simulates VPN/Direct Connect)
docker network create hybrid-net 2>/dev/null || true

# Create the "on-premises" cluster
cat <<'EOF' > /tmp/onprem-cluster.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: onprem
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
nodes:
  - role: control-plane
  - role: worker
EOF

# Create the "cloud" cluster
cat <<'EOF' > /tmp/cloud-cluster.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: cloud
networking:
  podSubnet: "10.245.0.0/16"
  serviceSubnet: "10.112.0.0/12"
nodes:
  - role: control-plane
  - role: worker
EOF

kind create cluster --config /tmp/onprem-cluster.yaml
kind create cluster --config /tmp/cloud-cluster.yaml

# Connect both clusters to the shared network
docker network connect hybrid-net onprem-control-plane
docker network connect hybrid-net cloud-control-plane

echo "=== On-prem cluster ==="
kubectl --context kind-onprem get nodes
echo "=== Cloud cluster ==="
kubectl --context kind-cloud get nodes

Task 2: Deploy Workloads Simulating Hybrid Architecture
Next, we disperse our microservices across the boundary, deploying backend systems locally and frontends in the cloud.
Solution
# Deploy a backend service on the "on-prem" cluster
kubectl --context kind-onprem create namespace backend
cat <<'EOF' | kubectl --context kind-onprem apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-backend
  namespace: backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-backend
  template:
    metadata:
      labels:
        app: api-backend
    spec:
      containers:
        - name: api
          image: nginx:1.27.3
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-backend
  namespace: backend
spec:
  selector:
    app: api-backend
  ports:
    - port: 80
      targetPort: 80
EOF

# Deploy a frontend service on the "cloud" cluster
kubectl --context kind-cloud create namespace frontend
cat <<'EOF' | kubectl --context kind-cloud apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.27.3
          ports:
            - containerPort: 80
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: web-frontend
  namespace: frontend
spec:
  selector:
    app: web-frontend
  ports:
    - port: 80
      targetPort: 80
EOF

echo "=== On-prem workloads ==="
kubectl --context kind-onprem get pods -n backend
echo "=== Cloud workloads ==="
kubectl --context kind-cloud get pods -n frontend

Task 3: Test Cross-Cluster Connectivity
Demonstrate the routing capabilities by pinging the opposite cluster from within the control plane container.
Solution
# Get each control-plane container's IP on the shared hybrid-net network
# (simulates the Direct Connect path). The network is selected by name because
# the containers are also attached to kind's default network, and ranging over
# all networks would concatenate both addresses into one invalid string.
ONPREM_IP=$(docker inspect onprem-control-plane --format '{{(index .NetworkSettings.Networks "hybrid-net").IPAddress}}')
CLOUD_IP=$(docker inspect cloud-control-plane --format '{{(index .NetworkSettings.Networks "hybrid-net").IPAddress}}')

echo "On-prem cluster IP: $ONPREM_IP"
echo "Cloud cluster IP: $CLOUD_IP"

# Verify connectivity between clusters (simulates VPN tunnel)
docker exec cloud-control-plane ping -c 3 "$ONPREM_IP"
docker exec onprem-control-plane ping -c 3 "$CLOUD_IP"

echo ""
echo "Cross-cluster connectivity verified."
echo "In a real hybrid setup, this path would go through:"
echo " - Direct Connect (1-5ms latency)"
echo " - or VPN tunnel (20-100ms latency)"

Task 4: Implement Cross-Cluster Monitoring
Create a Prometheus configuration for each cluster that labels its metrics with the cluster identity and pushes them to a central receiver through remote_write. In this simulation the configuration is only stored in a ConfigMap; no Prometheus or Thanos instance is deployed.
Solution
# Deploy a simple monitoring ConfigMap on each cluster that
# simulates federated monitoring configuration

for CTX in kind-onprem kind-cloud; do
  CLUSTER_NAME=$(echo $CTX | sed 's/kind-//')
  kubectl --context $CTX create namespace monitoring 2>/dev/null || true

  cat <<EOF | kubectl --context $CTX apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-config
  namespace: monitoring
data:
  cluster-name: "${CLUSTER_NAME}"
  cluster-type: "$([ $CLUSTER_NAME = 'onprem' ] && echo 'on-premises' || echo 'cloud')"
  prometheus-config: |
    global:
      scrape_interval: 30s
      external_labels:
        cluster: ${CLUSTER_NAME}
        environment: $([ $CLUSTER_NAME = 'onprem' ] && echo 'datacenter' || echo 'aws')
    remote_write:
      - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
EOF
done

echo "=== On-prem monitoring config ==="
kubectl --context kind-onprem get configmap monitoring-config -n monitoring -o yaml | grep -A5 "external_labels"
echo "=== Cloud monitoring config ==="
kubectl --context kind-cloud get configmap monitoring-config -n monitoring -o yaml | grep -A5 "external_labels"

Task 5: Build a Hybrid Inventory Report
Write a script that queries both clusters in turn and prints a combined inventory report, showing that a single tool can observe both environments.
Solution
cat <<'SCRIPT' > /tmp/hybrid-inventory.sh
#!/bin/bash
echo "========================================"
echo " HYBRID CLOUD INVENTORY REPORT"
echo " $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "========================================"

for CTX in kind-onprem kind-cloud; do
  CLUSTER=$(echo $CTX | sed 's/kind-//')
  echo ""
  echo "--- Cluster: $CLUSTER ---"
  echo " Nodes: $(kubectl --context $CTX get nodes --no-headers | wc -l | tr -d ' ')"
  echo " Namespaces: $(kubectl --context $CTX get namespaces --no-headers | wc -l | tr -d ' ')"
  echo " Pods: $(kubectl --context $CTX get pods -A --no-headers | wc -l | tr -d ' ')"
  echo " Services: $(kubectl --context $CTX get services -A --no-headers | wc -l | tr -d ' ')"
  echo " Deployments: $(kubectl --context $CTX get deployments -A --no-headers | wc -l | tr -d ' ')"

  echo " Workload Namespaces:"
  for NS in $(kubectl --context $CTX get namespaces -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep -v '^kube-' | grep -v '^default$' | grep -v '^local-path-storage$'); do
    PODS=$(kubectl --context $CTX get pods -n $NS --no-headers 2>/dev/null | wc -l | tr -d ' ')
    if [ "$PODS" -gt 0 ]; then
      echo " $NS: $PODS pods"
    fi
  done
done

echo ""
echo "========================================"
echo " CONNECTIVITY STATUS"
echo "========================================"
# Select the hybrid-net address by name; the containers are also on kind's
# default network, and ranging over all networks would join both IPs together.
ONPREM_IP=$(docker inspect onprem-control-plane --format '{{(index .NetworkSettings.Networks "hybrid-net").IPAddress}}')
CLOUD_IP=$(docker inspect cloud-control-plane --format '{{(index .NetworkSettings.Networks "hybrid-net").IPAddress}}')
echo " On-prem IP: $ONPREM_IP"
echo " Cloud IP: $CLOUD_IP"

# Field 5 of the "/"-separated ping summary line is the average RTT.
LATENCY=$(docker exec cloud-control-plane ping -c 3 -q "$ONPREM_IP" 2>/dev/null | tail -1 | awk -F'/' '{print $5}')
echo " Cross-cluster latency: ${LATENCY}ms (Docker network, simulated)"
SCRIPT

chmod +x /tmp/hybrid-inventory.sh
bash /tmp/hybrid-inventory.sh

Clean Up
Tear down the clusters and the shared network when you finish testing to free up local resources.
kind delete cluster --name onprem
kind delete cluster --name cloud
docker network rm hybrid-net 2>/dev/null || true
rm /tmp/onprem-cluster.yaml /tmp/cloud-cluster.yaml /tmp/hybrid-inventory.sh

Success Criteria
- I implemented two interconnected kind clusters validating a hybrid scenario.
- I deployed workload resources across both simulated local and cloud endpoints.
- I validated direct Docker bridge network routing between subnets.
- I built a simulated monitoring profile with environment-aware Prometheus tags.
- I authored an aggregated inventory report capturing state from dual control planes.
- I evaluated Direct Connect routing constraints against VPN variability metrics.
- I comprehend how identity federation services operate across hybrid architectures.
Next Module
With hybrid connectivity firmly established, it is time to manage multiple clusters at scale using advanced administrative controls. Head to Module 10.5: Multi-Cloud Fleet Management (Azure Arc / GKE Fleet) to learn how powerful tooling permits unified oversight, fleet policy deployments, and unified lifecycle tracking across disparate cloud boundaries.
Sources
- AWS Site-to-Site VPN Tunnel Options: Useful for grounding tunnel behavior, bandwidth limits, and operational options discussed in the connectivity section.
- How AWS Transit Gateway Works: Explains route propagation, attachment behavior, and the overlapping-CIDR constraints behind hybrid transit design.
- What Is Azure Arc-enabled Kubernetes?: Covers what Azure Arc actually provides for attached clusters, including policy, monitoring, and GitOps capabilities.
- GKE Deployment Options: Shows how Google maps GKE enterprise features across Google Cloud, on-prem, and attached-cluster environments.