Module 5.2: Multi-Cluster Control Planes
Complexity: [ADVANCED] | Time: 55–65 minutes
Prerequisites: Module 5.1: Private Cloud Platforms, Module 1.3: Cluster Topology
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design a management-cluster architecture that provisions and upgrades workload clusters with Cluster API on private cloud and bare metal.
- Compare fleet GitOps controllers including Rancher Fleet, Argo CD ApplicationSets, Flux multi-cluster reconciliation, Karmada, and Open Cluster Management.
- Implement policy distribution and RBAC federation patterns that keep tenant boundaries intact across on-premises clusters.
- Evaluate multi-cluster networking primitives such as Submariner, Cilium ClusterMesh, and Istio multi-cluster service mesh.
- Diagnose failover, disaster recovery, and observability fan-in failure domains when a hub control plane or spoke agent degrades.
Why This Module Matters
Hypothetical scenario: a regional manufacturer operates twelve Kubernetes clusters across two private data centers. Development, staging, production, and edge factory lines each run on isolated clusters for blast-radius reasons, yet the platform team still receives one change request: deploy the same monitoring agents, ingress controllers, and security policies everywhere before the quarterly audit window closes. Without a deliberate multi-cluster control plane, engineers SSH into twelve API servers, paste twelve slightly different Helm values files, and hope nobody skipped the cluster running on the older minor version.
The first week looks successful because diligent operators finished the rollout manually. The second week exposes the real cost: a hotfix to a PodSecurity admission policy lands on nine clusters but not the three edge sites behind a firewall rule that blocks the CI runner. A certificate expires on a management ingress that only two teams use, and nobody notices until a GitOps sync fails with an opaque TLS error. Meanwhile, leadership asks for a single dashboard that shows aggregate pod crash rates, yet Prometheus in each cluster uses a different cluster label convention. The organization has many Kubernetes control planes but no control plane for the fleet.
On-premises multi-cluster operations differ from public-cloud landing zones because you own every failure domain: BMC networks, BGP fabrics, corporate identity systems, air-gapped registries, and change windows that forbid control-plane upgrades during plant shifts. Public-cloud vendors can hide fleet APIs behind managed control planes; on bare metal and private cloud you assemble the stack from Cluster API providers, GitOps controllers, policy engines, service-mesh or overlay networks, and observability pipelines. This module teaches that assembly as an engineering discipline: separate infrastructure lifecycle from application delivery, choose push versus pull reconciliation based on network reality, and treat the management cluster as production-critical infrastructure with its own backup, upgrade, and disaster-recovery plan.
The payoff is operational leverage. When the management plane is healthy, provisioning a new cluster becomes a merge request to a Git repository, not a three-day runbook. When policy distribution is centralized, auditors review one Kyverno bundle instead of twelve forked copies. When observability fan-in is designed upfront, incident response starts from one Thanos or Grafana endpoint instead of twelve browser tabs. The remainder of this module walks through each layer so you can reason about tradeoffs before committing to Rancher Fleet, Open Cluster Management, Karmada, or a combination.
What You’ll Learn
- Management versus workload cluster roles and the control-plane-of-control-planes pattern
- Cluster API placement on vSphere, OpenStack, and bare-metal Metal3 providers
- Fleet controllers: Rancher Fleet, Argo CD ApplicationSets, Flux, Karmada, and OCM
- Declarative lifecycle: provision, upgrade, cordon, and decommission clusters safely
- Policy distribution with OCM, Kyverno, and Gatekeeper at fleet scope
- RBAC and identity federation across clusters without shared etcd
- Multi-cluster networking with Submariner, Cilium ClusterMesh, and Istio
- Failover semantics, backup boundaries, and observability fan-in architecture
The Control Plane of Control Planes
Every production Kubernetes cluster already has a control plane: API server, scheduler, controller manager, and etcd (or an equivalent datastore). A fleet management cluster adds another layer that does not replace those components but orchestrates them. The fleet cluster runs controllers that speak to downstream API servers, distribute custom resources, or reconcile Git-defined desired state. Workload clusters continue to schedule pods locally; the management cluster schedules fleet objects such as Cluster, ManagedCluster, ApplicationSet, or Fleet GitRepo resources.
```mermaid
flowchart TD
    subgraph Mgmt["Management cluster (fleet control plane)"]
        CAPI["Cluster API controllers"]
        GitOps["Fleet / Argo CD / Flux"]
        OCM["OCM hub / Karmada apiserver"]
        Pol["Policy & identity controllers"]
    end
    subgraph DC1["Datacenter A workload clusters"]
        W1["prod-a"]
        W2["staging-a"]
    end
    subgraph DC2["Datacenter B workload clusters"]
        W3["prod-b"]
        W4["edge-factory"]
    end
    Mgmt -->|declarative lifecycle| W1
    Mgmt -->|declarative lifecycle| W2
    Mgmt -->|GitOps sync| W3
    Mgmt -->|agent + pull| W4
```

Pause and predict: if the management cluster etcd backup fails but every workload cluster is healthy, which user-facing symptoms appear first? Typical answers include inability to approve new cluster provisioning, stale GitOps status, failed policy placement, and broken CI pipelines that target the hub API, not immediate pod crashes in production applications. That asymmetry is why mature teams monitor the management plane with the same rigor as production workload clusters.
The management cluster should remain workload-light. Running customer applications on the same nodes that host fleet controllers invites resource contention during large sync waves or Cluster API machine reconciles. Platform teams commonly dedicate three to five control-plane nodes plus workers sized for controllers only, with taints that repel general scheduling. High availability for etcd (or the datastore backing your chosen fleet tool) is non-negotiable; losing quorum on the hub does not instantly stop running workloads downstream, but it freezes fleet-wide change until recovery.
Cluster API on Private Cloud and Bare Metal
Cluster API (CAPI) is the Kubernetes SIG project that applies declarative reconciliation to clusters themselves. You define a Cluster object, an infrastructure-specific cluster resource (for example VSphereCluster or Metal3Cluster), and machine templates; controllers create VMs or bare-metal hosts, bootstrap kubeadm, and join nodes. On private cloud, the vSphere and OpenStack infrastructure providers translate CAPI objects into VM lifecycle calls your virtualization team already operates. On bare metal, the Metal3 provider drives Ironic-backed provisioning over IPMI or Redfish, which pairs naturally with the PXE and BMC practices from earlier provisioning modules.
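The object layout described above can be sketched roughly as follows. This is an illustrative fragment only: the cluster name, namespace, label, and pod CIDR are hypothetical, and the API versions assume the v1beta1 contract, so check which versions your installed CAPI core and CAPV releases actually serve (newer contract versions change the shape of the reference fields).

```yaml
# Hypothetical names and CIDR; API versions assume the v1beta1 contract.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: staging-a
  namespace: clusters-staging
  labels:
    env: staging              # label later used by fleet tools for placement
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 10.40.0.0/16
  controlPlaneRef:            # who manages the control-plane machines
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: staging-a-control-plane
  infrastructureRef:          # provider-specific cluster resource
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VSphereCluster
    name: staging-a
```

Swapping the infrastructureRef kind (for example to Metal3Cluster) is what moves the same Cluster shape onto a different provider; the referenced KubeadmControlPlane and machine templates are separate objects not shown here.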
CAPI intentionally splits infrastructure lifecycle from application delivery. After a workload cluster reaches Ready, GitOps controllers deploy platform add-ons—CNI, CSI, ingress, monitoring—using the same patterns as single-cluster operations. Mixing application manifests into CAPI cluster templates is tempting for demos but creates tight coupling: upgrading a cluster Kubernetes minor version should not require editing application Helm charts embedded in machine bootstrap secrets.
A minimal management-cluster install uses clusterctl to initialize providers:
```bash
# Install clusterctl matching your target Kubernetes minor version
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.12.1/clusterctl-linux-amd64 -o clusterctl
chmod +x clusterctl && sudo mv clusterctl /usr/local/bin/

# Initialize core providers (Docker provider useful for local labs)
clusterctl init --infrastructure docker
kubectl get pods -n capi-system
kubectl get pods -n capd-system
```

Production on-premises stacks add CAPV (vSphere), CAPO (OpenStack), or CAPM3 (Metal3) instead of Docker. The Cluster object references a KubeadmControlPlane for rolling control-plane upgrades and MachineDeployment objects for workers, patterns you will deepen in Module 5.3: Cluster API on Bare Metal. For this module, remember that CAPI answers how nodes exist; fleet tools answer what runs on those nodes after they exist.
| Concern | CAPI responsibility | Fleet tool responsibility |
|---|---|---|
| VM or bare-metal host creation | Infrastructure provider | None |
| Kubernetes version on control plane | KubeadmControlPlane rollout | May trigger via GitOps after CAPI completes |
| Platform add-ons (CNI, CSI) | Bootstrap templates optionally | GitOps Application, HelmRelease, Bundle |
| Tenant application workloads | None | GitOps or CD pipelines targeting spoke |
Worked Example: Designing a Management Cluster on vSphere
Exercise scenario: a platform team must design a management-cluster architecture that provisions and upgrades workload clusters on VMware vSphere while keeping factory-edge sites on bare metal. They choose one management cluster in the primary datacenter running CAPI with CAPV for virtualized workload clusters and CAPM3 for metal edge sites. The design document lists three etcd nodes on fast storage, two worker nodes tainted for fleet=capi, and outbound Git access to an internal Gitea mirror.
When a developer requests a staging cluster, the flow is: a merge request adds a Cluster manifest to the clusters/staging/ directory; CAPI controllers create three control-plane VMs and three workers through vSphere templates; once the cluster is Ready, Rancher Fleet imports it with the label env=staging and applies the platform bundle (Cilium, Longhorn, ingress-nginx); only then do application pipelines receive kubeconfig credentials via OIDC. Upgrades follow the same Git trail: bump spec.version on the KubeadmControlPlane (or spec.topology.version on the Cluster when using ClusterClass), watch rolling VM replacement, then advance Fleet bundle pins after conformance tests pass.
This design deliberately keeps CAPM3 edge clusters off the same MachineDeployment template as vSphere staging clusters because BMC timing, firmware validation, and PXE networks differ from DRS-backed VM creation. Attempting one template for both surfaces would force edge exceptions into manual runbooks—the exact toil the management plane exists to eliminate. Documenting those boundaries in architecture reviews prevents “temporary” SSH steps from becoming permanent operational debt.
Fleet Management Approaches
Fleet controllers solve configuration drift across many API servers. They differ in reconciliation model (push versus pull), Kubernetes API surface (CRDs versus Git repositories), and opinionated UX (Rancher UI versus raw YAML). None replaces CAPI; they complement it.
Rancher Fleet
Rancher Fleet targets operators who want Git-driven bundles applied to cluster groups selected by labels. A GitRepo resource on the Fleet controller watches a repository and builds bundles; a downstream Fleet agent in each workload cluster pulls those bundles by polling the controller, and the controller does not initiate connections to spoke API servers. Fleet excels when platform teams already standardize on Rancher for authentication and cluster import, and when spokes can reach the management network (or a mirrored Git/Bundle endpoint) for agent polling.
Fleet bundles support diff comparisons, staged rollouts, and per-cluster customization through fleet.yaml overlays. On-premises teams often mirror Git repositories inside the corporate network so Fleet never depends on public GitHub webhooks during air-gapped maintenance windows.
Rancher integration matters for identity: operators authenticate to Rancher with corporate SSO, then receive scoped downstream cluster access without distributing separate kubeconfig files for every site. Fleet leverages those cluster registrations, so label hygiene on imported clusters (region, environment, compliance-tier) becomes the placement language for bundles. When a bundle fails on one cluster, Fleet surfaces per-cluster error states—platform engineers should treat a single red cluster in a fifty-cluster rollout as a blocking incident if that cluster is production.
Bundle design tips for on-premises: pin Helm chart versions explicitly, keep CRD upgrades in separate bundles from application charts, and use targetCustomizations for site-specific DNS names or storage classes. Testing bundles against a CAPD Docker cluster in CI catches YAML errors before they touch vSphere production.
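The bundle tips above might translate into a fleet.yaml along these lines. Treat it as a sketch: the chart mirror URL, the pinned version number, the region label, and the override values are placeholders chosen for illustration, not recommendations.

```yaml
# fleet.yaml (sketch): repo URL, version, labels, and values are placeholders.
defaultNamespace: ingress-nginx
helm:
  releaseName: ingress-nginx
  repo: https://charts.example.internal/ingress-nginx   # hypothetical internal mirror
  chart: ingress-nginx
  version: 4.11.3          # illustrative explicit pin; avoid floating versions
  values:
    controller:
      ingressClassResource:
        default: true
targetCustomizations:
  # Site-specific override applied only to clusters carrying this label.
  - name: edge-factory
    clusterSelector:
      matchLabels:
        region: edge
    helm:
      values:
        controller:
          service:
            type: NodePort
```

Keeping the shared defaults at the top and the per-site differences under targetCustomizations makes the diff between sites reviewable in one file.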
Argo CD ApplicationSets
Argo CD ApplicationSets generate one Argo Application per target using List, Cluster, Git, Matrix, or Pull Request generators. The Argo CD instance on the management cluster pushes manifests to each spoke API server using credentials stored as cluster secrets. ApplicationSets fit teams already invested in Argo CD who need multi-cluster promotion pipelines with familiar UI and RBAC.
The push model requires stable network paths from hub to spoke API servers on port 6443 (or via bastion tunnels). Factory edge clusters behind outbound-only firewalls may be poor candidates unless you deploy Argo CD agents or shift to pull-based alternatives.
ApplicationSet generators differ by data source: the List generator iterates a static list of parameter sets you define in the ApplicationSet spec; the Cluster generator reacts dynamically to clusters registered in Argo CD (new registrations can spawn Applications without editing the list); the Git generator reads files or directories from a Git repository; the Matrix generator takes the cross-product of two other generators (for example Cluster × Git directory). For on-premises promotion pipelines, combine a Cluster generator with Git paths (overlays/production, overlays/staging) via Matrix so one ApplicationSet tracks registered clusters and environment overlays. Embed maintenance window annotations on cluster secrets so sync windows respect local change freezes.
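A Matrix of the Cluster and Git generators could look roughly like the sketch below. The repository URL, the env label on cluster secrets, the project, and the target namespace are assumptions for illustration.

```yaml
# Sketch: repo URL, labels, project, and namespace are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-overlays
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          # One parameter set per registered cluster labeled env=production.
          - clusters:
              selector:
                matchLabels:
                  env: production
          # One parameter set per directory under overlays/production.
          - git:
              repoURL: https://git.example.internal/platform/config.git
              revision: main
              directories:
                - path: overlays/production/*
  template:
    metadata:
      name: '{{name}}-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.internal/platform/config.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: '{{server}}'
        namespace: platform
```

Registering another production cluster then yields new Applications without touching this manifest, which is the property the Cluster generator is chosen for.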
Operational caution: Argo CD stores cluster credentials as Kubernetes secrets on the hub. Protect those secrets with encryption at rest, rotate tokens when joining clusters leave the fleet, and audit argocd cluster list output during quarterly access reviews. A compromised hub secret equals compromised spoke API access for every registered cluster.
Flux Multi-Cluster
Flux traditionally reconciles Git sources inside each cluster. Multi-cluster patterns run a management Flux instance that dispatches KubeConfig secrets or uses Cluster API extension projects to fan out Kustomization objects. Flux’s pull-based DNA suits spokes that initiate connections to Git and OCI registries inside the corporate DMZ, reducing inbound firewall rules.
Flux v2’s composable controllers (source-controller, kustomize-controller, helm-controller) let platform teams split infrastructure platform charts from application teams’ repositories while enforcing OCI signing and policy at reconcile time.
Multi-cluster Flux often uses one management cluster to hold KubeConfig secrets referenced by Kustomization objects with spec.kubeConfig pointing at remote clusters—essentially push reconciliation executed by Flux controllers rather than Argo CD. Alternatively, each spoke runs its own Flux instance pulling from a central monorepo path (clusters/prod-west, clusters/edge-12), preserving pull semantics. The second pattern duplicates controllers but eliminates a single hub credential with access to every apiserver.
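The hub-dispatch pattern from the paragraph above can be sketched as a Kustomization that targets a remote cluster. The object names and the GitRepository are hypothetical; the secret name and key assume a CAPI-style kubeconfig secret living in the same namespace as the Kustomization.

```yaml
# Sketch: names are hypothetical; the Secret must exist in flux-system.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform-prod-west
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  path: ./clusters/prod-west
  sourceRef:
    kind: GitRepository
    name: platform
  kubeConfig:
    secretRef:
      name: prod-west-kubeconfig   # kubeconfig for the remote cluster
      key: value                   # key holding the kubeconfig data
```

In the per-spoke pull pattern the same Kustomization is applied inside each workload cluster with the kubeConfig block removed, so no hub secret ever holds remote credentials.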
For private registries, configure OCIRepository sources mirroring upstream charts into Harbor. Flux reconciliation then never depends on internet availability during maintenance, a common on-premises requirement. Combine with cosign verification policies so only signed platform bundles apply during automated syncs.
Karmada
Karmada provides a federated Kubernetes API on the management cluster. You create PropagationPolicy objects that schedule resources to member clusters based on placement rules, failover policies, and replica splitting. Karmada’s karmada-apiserver presents familiar kinds (Deployment, Service) while controllers propagate spec to selected clusters, which is useful when application teams resist learning fleet-specific CRDs but still need active-active or failover semantics across datacenters.
Karmada includes its own cluster registration agent (karmada-agent) and supports auto-propagation of native resources plus custom resources when configured. On-premises teams use it for geographically distributed services that need controlled replica distribution without operating a separate CD pipeline per cluster.
Propagation policies can split replicas across datacenters (replicaScheduling.weightPreference) or require certain labels on member clusters before scheduling. Failover policies move workloads when cluster health taints appear—pair those policies with realistic health checks that distinguish apiserver unreachable from node pressure. Without that distinction, transient network blips may trigger unnecessary failovers.
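A weighted replica split could be expressed along these lines. The Deployment name, namespace, member cluster names, and weights are hypothetical values picked for illustration.

```yaml
# Sketch: workload, namespace, cluster names, and weights are hypothetical.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: web-split
  namespace: shop
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: web
  placement:
    clusterAffinity:
      clusterNames:
        - prod-a
        - prod-b
    replicaScheduling:
      replicaSchedulingType: Divided        # split replicas instead of duplicating them
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - prod-a
            weight: 2
          - targetCluster:
              clusterNames:
                - prod-b
            weight: 1
```

With these weights the Deployment's replicas are divided between the two datacenters in roughly a 2:1 ratio, while the application team still authors an ordinary Deployment.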
Karmada does not replace observability or networking add-ons; it coordinates Kubernetes object placement. Plan ClusterMesh or Submariner separately if propagated Services must reach backends on multiple clusters.
Open Cluster Management
Open Cluster Management (OCM) uses a hub-and-spoke model with ManagedCluster registration and the work API to deliver manifest bundles. Klusterlet agents on spokes pull work from the hub, which suits restrictive outbound-only networks: factories initiate TLS to the management API instead of exposing every spoke apiserver to CI systems.
OCM integrates deeply with policy (Kubernetes Policy Controller), governance, and observability add-ons in the Red Hat Advanced Cluster Management ecosystem, but the open-source hub/spoke core stands alone. Platform engineers place Placement and ManifestWork resources to decide which clusters receive configuration—patterns aligned with policy distribution and compliance reporting later in this module.
| Tool | Reconciliation | Best on-premises fit | Primary CRD / object |
|---|---|---|---|
| Rancher Fleet | Pull bundles (downstream agents poll controller) | Rancher shops, label-selected clusters, spokes reach controller | GitRepo, Bundle |
| Argo ApplicationSets | Push Applications | Existing Argo CD, hub reachable to spokes | ApplicationSet |
| Flux | Pull per cluster or mgmt dispatch | DMZ-friendly, OCI/Git inside corp net | GitRepository, Kustomization |
| Karmada | Federated API propagation | Active-active app placement | PropagationPolicy |
| OCM | Hub assigns, agent pulls work | Outbound-only edge, governance focus | ManagedCluster, ManifestWork |
Decision Framework: Choosing Fleet Controllers
Platform architects rarely pick exactly one tool. A pragmatic on-premises reference architecture stacks CAPI for infrastructure, OCM for edge policy and inventory, and Argo CD ApplicationSets for application teams already standardized on Argo. The decision framework below avoids ideology and focuses on constraints you can verify in workshop meetings.
| If your constraint is… | Lean toward… | Why |
|---|---|---|
| Corporate network allows hub→spoke API on 6443 | Argo CD ApplicationSets (push) | Hub applies manifests to spoke API servers directly |
| Edge sites are outbound-only | OCM klusterlet pull | Agents initiate TLS to hub; no inbound firewall holes |
| Application teams want standard Deployment YAML without fleet CRDs | Karmada propagation | Federated API preserves familiar objects |
| Air-gapped Git mirrors and bundle diff UX | Rancher Fleet | Git-driven bundles; downstream agents pull from controller |
| Strict OCI artifact signing and Helm drift control | Flux with cluster-specific Kustomize overlays | Composable controllers and native OCI sources |
Stop and think: two teams argue—Team A wants Argo CD because they already run it for single-cluster apps; Team B insists OCM because half the factories cannot expose apiserver endpoints. The resolution is not a coin flip. Split responsibilities: OCM delivers platform policy and inventory on edge clusters; Argo CD on the hub continues to implement application distribution to datacenter clusters that permit push. Document network diagrams in the architecture review so auditors see why hybrid fleet models are intentional, not accidental drift.
Klusterlet registration uses bootstrap tokens or manual import YAML; after import, the hub shows cluster metadata, version, and addon status. ManifestWork objects apply raw manifests or kustomize overlays to selected placements—think of them as auditable batch kubectl applies driven by hub RBAC. For maintenance, ManagedCluster availability taints can pause policy propagation to clusters undergoing metal work.
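As a minimal sketch of the "auditable kubectl apply" idea, a ManifestWork on the hub might look like this. The managed cluster name (edge-factory) and the namespace being created are hypothetical.

```yaml
# Sketch: cluster name and payload are hypothetical.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: platform-namespace
  namespace: edge-factory     # hub namespace named after the managed cluster
spec:
  workload:
    manifests:
      # Raw manifest the klusterlet work agent applies on the spoke.
      - apiVersion: v1
        kind: Namespace
        metadata:
          name: platform-system
          labels:
            owner: platform-team
```

Because the object lives in the cluster's own namespace on the hub, hub RBAC on that namespace decides who may push configuration to that spoke.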
Troubleshooting Hub and Spoke Failures
When fleet operations degrade, classify symptoms before swapping tools. Hub API unavailable presents as failed CI pipelines, stuck ManifestWork statuses, and Argo CD sync errors referencing authentication, all while workloads on spokes continue running unless they need new secrets from the hub. Spoke agent disconnected presents as one cluster drifting while others remain current; OCM shows ManagedCluster conditions false, Fleet shows bundle errors for a subset, and Argo CD marks a cluster Unknown. Network partition between sites may look like agent disconnect but is actually firewall or BGP change; curl -k tests from spoke nodes to hub endpoints distinguish TLS issues from routing loss.
Credential rotation causes subtle drift: if cluster join tokens expire, agents reconnect only after secret refresh—automate rotation with short-lived credentials and alerts on age. CRD version skew between hub bundles and spoke Kubernetes versions surfaces as sync errors referencing unknown fields; maintain a compatibility matrix tying fleet bundle versions to allowed Kubernetes minors.
API server exhaustion on the hub during large sync waves is real on undersized management clusters. Rate-limit concurrent ApplicationSet syncs, shard Fleet GitRepos, or schedule upgrades in waves. Horizontal scaling of Argo CD repo-server or Fleet controller deployments helps, but etcd remains the ultimate bottleneck—watch etcd latency during fleet-wide Helm upgrades.
Document escalation: if hub restore exceeds RTO, execute documented break-glass kubeconfigs stored in vault for direct spoke access. Break-glass access should be rare, audited, and sufficient to stabilize workloads until hub recovery completes.
Lifecycle: Provision, Upgrade, and Decommission
Fleet operations span the entire cluster lifetime, not only day-one provisioning.
Provision: CAPI (or vendor installers like TKG on vSphere) creates the cluster API endpoint and worker nodes. GitOps registers the new cluster secret, Fleet imports the cluster, or OCM accepts a ManagedCluster join. Platform add-ons deploy in waves: CNI before anything else, CSI before stateful sets, ingress before public routes, monitoring before production traffic. Skipping the wave order produces familiar failure modes: Pods stuck in ContainerCreating because the CNI is never ready, or Prometheus crash loops because the storage class is missing.
Upgrade: Control-plane upgrades flow through KubeadmControlPlane rollouts or vendor supervisors; node upgrades use MachineDeployment surge settings. Application compatibility gates (CRD version skew, deprecated API removals) run in CI against a canary cluster before fleet-wide promotion. GitOps controllers should support pause labels on clusters during metal maintenance windows.
Decommission: Remove cluster entries from Fleet, Argo cluster secrets, OCM ManagedCluster, and Karmada membership before deleting infrastructure. Drain workloads, revoke OIDC clients, delete DNS records, and archive etcd backups per retention policy. Deleting VMs while GitOps still targets the cluster UUID causes orphaned object finalizers on the hub.
```bash
# Cordon and drain a node before hardware maintenance (Kubernetes-level safety)
kubectl cordon worker-7.example.internal
kubectl drain worker-7.example.internal \
  --ignore-daemonsets --delete-emptydir-data --grace-period=300

# After maintenance, uncordon before returning the node to scheduling
kubectl uncordon worker-7.example.internal
```

Document a cluster class matrix: which Kubernetes versions, CNI versions, and platform bundles are approved together. Without it, fleet controllers happily deploy charts to clusters that lack required APIs.
Integrating Cluster API Outputs with Fleet Controllers
The handoff between CAPI and fleet tools is where many implementations stumble. CAPI emits a kubeconfig secret when a workload cluster becomes ready, conventionally named <cluster-name>-kubeconfig in the same management-cluster namespace as the Cluster object. Fleet and Argo expect cluster registration objects referencing that kubeconfig; OCM uses import commands or ManagedCluster auto-import addons; Flux may reference KubeConfig secrets via Kustomization spec fields.
Automate registration: when CAPI sets the Cluster Ready condition, a small controller or CI job creates the Fleet cluster registration, Argo cluster secret, or OCM import manifest. This prevents the common gap where infrastructure exists but platform bundles never attach because nobody ran the manual import step. Store kubeconfig secrets encrypted and rotate them when control-plane certificates renew.
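For the Argo CD case, the registration step amounts to creating a labeled Secret on the hub. The sketch below shows the shape such an automation job might emit; the cluster name, server URL, env label, and the credential placeholders are hypothetical, and the token and CA data must come from your own secret-management flow.

```yaml
# Sketch: names, URL, and credential placeholders are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-staging-a
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # marks this Secret as a cluster registration
    env: staging                              # usable by ApplicationSet cluster selectors
type: Opaque
stringData:
  name: staging-a
  server: https://staging-a.example.internal:6443
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-CA-certificate>"
      }
    }
```

Generating this Secret from the Cluster Ready event, rather than by hand, is what closes the gap between "infrastructure exists" and "platform bundles attach".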
Version skew gates belong in the same automation: if CAPI upgrades a cluster from Kubernetes 1.34 to 1.35, pause Fleet bundles that require deprecated APIs until validation jobs pass. Git tags on platform repositories (platform-v1.35) communicate compatible bundle versions to cluster classes. Treat that integration as part of lifecycle design, not day-two polish.
Upgrade ordering for production: upgrade management cluster fleet controllers first, then CAPI providers, then workload control planes in waves, then platform bundles, then tenant applications. Reversing the order—application charts before CNI compatibility—creates outages that appear as mysterious DNS failures rather than explicit version errors.
Policy Distribution
Running Kyverno or Gatekeeper in every cluster manually does not scale. Fleet platforms distribute policy bundles as part of platform onboarding.
OCM’s policy framework attaches policies to PlacementRule or Placement decisions so only production clusters receive restrictive Pod Security levels, while labs stay permissive for training. Rancher Fleet bundles can include Kyverno ClusterPolicy manifests alongside Helm releases. Argo CD ApplicationSets excel at promoting the same policy kustomize overlay from staging clusters to production clusters after CI validation.
Design policies for fail-closed platform safety: require labels, block latest tags in production placements, enforce resource quotas via generated policies, and validate ingress TLS settings. Keep application-team policies in separate Git repositories to avoid merge contention with platform engineers.
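One of the controls listed above, blocking latest tags, could be written as a Kyverno ClusterPolicy similar to the following sketch. It assumes the kyverno.io/v1 API with the spec-level validationFailureAction field; newer Kyverno releases move the failure action under each rule's validate block, so adjust to the version you run.

```yaml
# Sketch: assumes kyverno.io/v1 with spec-level validationFailureAction.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # fail closed: reject instead of audit
  background: true
  rules:
    # An untagged image implicitly resolves to latest, so require a tag first.
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "An image tag is required."
        pattern:
          spec:
            containers:
              - image: "*:*"
    - name: disallow-latest
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Using the latest tag is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

Shipping this manifest only through production placements keeps lab clusters permissive while production fails closed.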
Sync waves matter: admission policies must exist before workloads that violate them, or CI will flap. Use GitOps sync hooks or OCM ManifestWork dependencies to order installation.
To implement policy distribution safely, start from a golden repository structured by blast radius:
```yaml
# Example OCM Policy placement (conceptual)
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: require-nonroot
  namespace: policies
spec:
  remediationAction: enforce
---
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: production-clusters
spec:
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            environment: production
```

Pair policy distribution with RBAC federation: the same IdP group platform-admins receives ClusterRoleBinding manifests generated for each newly registered cluster, while tenant groups receive namespace-scoped bindings only on clusters labeled with their cost center. Tenant boundaries fail when cluster labels are wrong, so treat label accuracy as part of policy implementation, not an afterthought.
Kyverno generate rules can stamp standard labels on every namespace, ensuring downstream policy engines and cost dashboards agree on ownership. Gatekeeper constraint templates versioned in Git provide admission-time validation that complements Kyverno mutate rules. The patterns should be boring and repeatable; exciting one-off kubectl edits are where audits find violations.
RBAC and Identity Federation
Each cluster maintains its own RBAC bindings, but humans should not hold twelve separate kubeconfig files with duplicated group memberships. Common patterns:
Central OIDC: Corporate IdP issues tokens; each cluster’s API server trusts the same issuer with different client IDs or audiences. Group claims map to ClusterRoleBinding objects generated by GitOps when a cluster joins the fleet.
Hub impersonation: OCM and Rancher provide proxy endpoints that honor hub RBAC while executing on spokes—convenient for support engineers if audited.
Certificate boundaries: Separate CAs for management versus workload clusters. Compromise of a workload cluster CA must not mint management-cluster admin certificates.
Avoid sharing service account tokens across clusters. Automation should use short-lived credentials scoped per cluster and per namespace.
For RBAC federation, generate ClusterRoleBinding manifests from a template parameterized by cluster name and OIDC group. GitOps commits those bindings when ManagedCluster labels match environment=production. Separate break-glass cluster-admin roles from daily operator roles; vault-stored credentials should require ticket IDs. Multi-cluster identity also spans corporate LDAP groups mapped differently per business unit—document which groups may access edge versus datacenter clusters to preserve tenant boundaries.
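The generated bindings might resemble the sketch below. The oidc: group prefix assumes the API server is configured with a matching OIDC groups prefix, and platform-operator is a hypothetical ClusterRole defined elsewhere for daily operations (deliberately not cluster-admin); the tenant group and namespace names are also placeholders.

```yaml
# Sketch: group names, prefix, role, and namespace are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-admins
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: "oidc:platform-admins"      # group claim from the corporate IdP
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: platform-operator             # hypothetical daily-operations role
---
# Tenant access stays namespace-scoped on clusters selected for that tenant.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-payments-edit
  namespace: payments
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: "oidc:tenant-payments"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                          # built-in aggregated role
```

Templating only the group and namespace names keeps every cluster's bindings identical in shape, which simplifies access reviews.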
Audit hub proxy sessions if using Rancher or OCM console: record who impersonated which spoke and when. Federation without auditing creates compliance gaps even when technical RBAC is correct.
Multi-Cluster Networking Primitives
Applications spanning clusters need L3/L4 connectivity or multi-cluster service discovery, not just GitOps configuration alignment.
Submariner (submariner.io) connects pod and service CIDRs across clusters with encrypted tunnels, enabling direct pod IP communication where routing allows. Submariner Lighthouse adds cross-cluster Kubernetes Service discovery via the Multi-Cluster Services (MCS) API and DNS, so exported services resolve across member clusters without requiring full pod routability. It suits on-premises datacenters with controlled BGP or static routes between sites. Operators must ensure pod and service CIDR plans do not overlap before joining clusters, or plan for Globalnet where renumbering is not possible.
Cilium ClusterMesh shares services between independent Cilium data planes using clustermesh APIs. Global services can load-balance backends in multiple clusters when both run Cilium with compatible versions. ClusterMesh assumes mutual TLS trust between etcd-visible identities, so plan certificate rotation carefully.
Istio multi-cluster provides L7 traffic management, mTLS, and failover via ServiceEntry and east-west gateways. It adds operational weight (control planes per cluster or primary-remote topologies) but delivers fine-grained traffic policies for brownfield enterprises already standardized on Istio.
    flowchart LR
      subgraph SiteA["Site A cluster"]
        PA["Pod 10.40.1.5"]
        SA["Service A"]
      end
      subgraph SiteB["Site B cluster"]
        PB["Pod 10.50.2.8"]
        SB["Service B"]
      end
      PA -->|Submariner / ClusterMesh / Istio GW| PB
      SA -.->|exported service| SB

Choose networking based on whether you need pod IP reachability (Submariner gateways), cross-cluster Service DNS/discovery (Submariner Lighthouse or Cilium ClusterMesh), or L7 policy with mTLS (Istio). Many teams implement only GitOps federation first, then discover application dependencies that require east-west connectivity, so budget time accordingly.
Submariner deploys gateway nodes that encapsulate traffic between pod CIDRs. Inside each cluster, the Submariner route agent steers traffic for remote CIDRs to the local gateway, so on-premises network teams mainly need to guarantee that gateway nodes can reach one another across sites, through any firewalls and NAT. Lighthouse watches ServiceExport objects and serves MCS DNS so clients resolve remote cluster Services by name. Globalnet covers clusters whose overlapping pod or service CIDRs cannot be renumbered. Latency-sensitive workloads should measure cross-site RTT before assuming pod-to-pod access is free.
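To make the export step concrete, here is a minimal sketch of the MCS API object that marks an existing Service for export, plus the clusterset DNS name that clients in other member clusters would resolve. The function names and the `payments` / `ledger-api` values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: a ServiceExport must carry the same name and namespace as the
# Service it exports.
render_service_export() {
  local namespace=$1 service=$2
  cat <<EOF
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: ${service}
  namespace: ${namespace}
EOF
}

# The clusterset-wide DNS name defined by the MCS API for an exported Service.
clusterset_dns_name() {
  local namespace=$1 service=$2
  echo "${service}.${namespace}.svc.clusterset.local"
}

# Example (hypothetical names):
# render_service_export payments ledger-api | kubectl apply -f -
# clusterset_dns_name payments ledger-api
```

Clients keep using the ordinary `cluster.local` name for the local Service; the `clusterset.local` name is the one that spans member clusters.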
Cilium ClusterMesh requires mutual trust between cluster etcd-visible identities and compatible Cilium versions. Enable clustermesh-apiserver with proper TLS rotation; stale certificates break service discovery silently while pods still run locally. Global services excel at spreading HTTP backends across datacenters when health checks propagate quickly.
Istio multi-cluster patterns split into single-network and multi-network topologies. East-west gateways bridge non-routable networks; primary-remote configurations reduce control-plane count but concentrate failure domains. Mesh upgrades demand coordinated revisions across clusters—schedule Istio upgrades in lockstep with fleet GitOps pause flags.
When evaluating these primitives, score each against operational headcount: Submariner adds network engineering touchpoints plus Lighthouse for MCS DNS, ClusterMesh adds Cilium expertise for global services, Istio adds L7 policy power with operational weight. Hybrid designs are valid—GitOps on OCM, Lighthouse or ClusterMesh for service discovery, Istio only on clusters needing advanced traffic policy.
Failover and Disaster Recovery Semantics
Fleet hub failure does not stop running workloads if spokes are healthy, but it blocks new deployments and may stall policy updates. Maintain etcd backups for the management cluster, documented restore drills, and optionally a warm standby management cluster in a secondary site. RTO for the hub is often measured in hours; RPO depends on etcd snapshot frequency.
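Because the hub's RPO is bounded by how recent the last usable snapshot is, a freshness check is a cheap guardrail. The sketch below assumes snapshots land as `*.db` files in a single directory; the function name, the file naming, and the example path are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: succeed only when at least one etcd snapshot in the directory is
# younger than the agreed RPO, expressed in minutes.
snapshot_within_rpo() {
  local dir=$1 rpo_minutes=$2 fresh
  fresh=$(find "$dir" -maxdepth 1 -type f -name '*.db' -mmin "-${rpo_minutes}" | head -n 1)
  [ -n "$fresh" ]
}

# Example (hypothetical path and a 60-minute RPO):
# snapshot_within_rpo /var/backups/etcd 60 || echo "etcd snapshot older than RPO"
```

A check like this only proves that a recent file exists; restore drills remain the only evidence that the snapshot is actually usable.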
Workload cluster failure triggers application DR: DNS or global load balancing shifts traffic to surviving clusters if data replication exists. Karmada failover policies can reschedule replicas when a cluster becomes unready. Without replicated stateful data, failover is compute-only.
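As an illustration of how placement is declared in Karmada, the sketch below renders a PropagationPolicy that selects one Deployment and lists the member clusters allowed to run it. The function name and the example namespace, Deployment, and cluster names are hypothetical, and failover-specific tuning is deliberately left out.

```shell
#!/usr/bin/env bash
# Sketch: render a Karmada PropagationPolicy for one Deployment.
# Usage: render_propagation_policy <namespace> <deployment> <cluster>...
render_propagation_policy() {
  local namespace=$1 deployment=$2 cluster
  shift 2
  cat <<EOF
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: ${deployment}-placement
  namespace: ${namespace}
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: ${deployment}
  placement:
    clusterAffinity:
      clusterNames:
EOF
  # One list entry per member cluster passed on the command line.
  for cluster in "$@"; do
    printf '        - %s\n' "$cluster"
  done
}

# Example (hypothetical names):
# render_propagation_policy shop checkout dc-east dc-west
```

Listing two or more clusters gives Karmada somewhere to reschedule replicas, but the application's data still has to be replicated by other means.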
Clarify active-active versus active-passive expectations with application owners. GitOps can keep configurations synchronized, but databases and object stores need their own replication contracts.
Run game days that simulate hub loss during business hours in a non-production fleet: disable management cluster apiserver load balancers, verify spokes continue serving traffic, measure time to restore etcd from backup, and confirm GitOps objects reconcile without manual object surgery. Capture gaps in runbooks when engineers reach for break-glass kubeconfigs because registration metadata was incomplete.
For workload-cluster failover, document whether DNS, load balancers, or service mesh traffic policies own user-visible cutover. Platform teams provide healthy clusters and networking paths; application teams confirm state replication RPO meets product requirements. Without that split, disaster recovery tests devolve into blaming Kubernetes when the database lag was the actual root cause.
Backup scope checklist:
| Asset | Backup target | Restore owner |
|---|---|---|
| Management etcd | Object storage with encryption | Platform SRE |
| CAPI management secrets | Sealed secrets / vault | Platform SRE |
| Workload etcd | Per-cluster Velero or etcd snapshot | Cluster team |
| Git repositories | Corporate Git + DR mirror | DevOps |
| Observability data | Thanos long-term storage | SRE |
Observability Fan-In
Twelve Prometheus instances are unusable for executive dashboards and slow for on-call engineers. Observability fan-in centralizes metrics, logs, and traces while preserving per-cluster isolation labels.
Common metrics pattern: Prometheus agents or lightweight scrapers on spokes remote-write to Thanos Receive or Grafana Mimir on the management site or a dedicated observability cluster. Mandatory labels include cluster, environment, site, and platform_version. OCM add-ons, enabled per spoke through ManagedClusterAddOn resources, can deploy collectors consistently.
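The label contract above is easy to lint. The following minimal sketch reports which mandatory labels are missing from a comma-separated `key=value` list; the function name, input format, and example values are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Mandatory fan-in labels named in the text above.
REQUIRED_LABELS="cluster environment site platform_version"

# Print each required label that is absent from a comma-separated
# key=value list such as: cluster=edge-07,environment=production
missing_labels() {
  local labels=",$1," name missing=""
  for name in $REQUIRED_LABELS; do
    case "$labels" in
      *",${name}="*) ;;                              # label present
      *) missing="${missing:+$missing }$name" ;;     # label absent
    esac
  done
  echo "$missing"
}

# Example (hypothetical values):
# missing_labels 'cluster=edge-07,site=plant-a'
```

An empty result means the label set satisfies the contract; anything else names the labels that would break fleet-wide queries and alert routing.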
HA deduplication differs by backend. Thanos deduplicates redundant Prometheus replicas at query time: the Thanos Querier merges series and drops duplicates using the replica external label configured on each Prometheus instance. Grafana Mimir deduplicates at ingestion time via its HA tracker: the distributor accepts samples from only one replica of an HA Prometheus pair, so duplicates never land in long-term storage. Design remote-write labels and HA pairing accordingly when spokes run redundant scrapers.
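To show what ingestion-time deduplication means in practice, here is a toy model of the idea rather than Mimir's implementation: for each cluster, samples are accepted only from the first replica seen. The input format and function name are assumptions, and the sketch omits the timeout-based failover that a real HA tracker performs.

```shell
#!/usr/bin/env bash
# Sketch of ingestion-time deduplication. Input lines are
# "<cluster> <replica> <timestamp> <value>". Output drops the replica column.
dedupe_at_ingest() {
  awk '
    # The first replica seen for a cluster becomes its elected leader.
    !($1 in elected) { elected[$1] = $2 }
    # Keep samples from the leader only; other replicas are discarded.
    elected[$1] == $2 { print $1, $3, $4 }
  '
}

# Example (hypothetical samples from an HA pair on one spoke):
# printf '%s\n' 'edge-07 prom-a 100 1' 'edge-07 prom-b 100 1' | dedupe_at_ingest
```

With query-time deduplication both replicas' samples would be stored and merged when read, which costs storage but needs no leader election at the ingest path.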
Logs use Fluent Bit or OpenTelemetry collectors forwarding to Loki with the same label contract. Traces aggregate in Tempo or Jaeger with tenant headers per business unit.
Alert routing should identify which cluster fired, not only which metric. On-call runbooks link from ManagedCluster name to network topology and escalation paths.
Test fan-in failure modes: if remote-write backs up, spokes must not OOM Prometheus. Use buffering, dropping rules, and cardinality limits on recording rules imported from a golden observability repository.
When you diagnose observability fan-in failure domains, walk the pipeline in order: scrape targets on the spoke, agent remote-write connectivity, receive path on Thanos or Mimir, query frontend caches, and Grafana datasource credentials. A spoke Prometheus that looks healthy locally may still fail fan-in if TLS intercept appliances rewrite certificates on the management network. Hub control plane degradation also stalls OCM addon upgrades that deploy collectors—symptoms resemble application outages but root cause sits in fleet registration.
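Walking the pipeline in order can be scripted as a list of stage checks where the first failure wins. The sketch below is one way to structure that; the function name and the `name=command` argument format are assumptions, and the checks themselves are supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch: run stage checks in pipeline order and report the first failure.
# Each argument is "name=command"; prints "none" when every stage passes.
first_failing_stage() {
  local stage name check
  for stage in "$@"; do
    name=${stage%%=*}      # text before the first "="
    check=${stage#*=}      # everything after the first "="
    if ! bash -c "$check" >/dev/null 2>&1; then
      echo "$name"
      return 1
    fi
  done
  echo "none"
}

# Example with hypothetical hostnames; substitute your own probes:
# first_failing_stage \
#   'spoke-scrape=curl -fsS http://prometheus.spoke.example:9090/-/healthy' \
#   'receive-path=curl -fsS https://metrics-receive.hub.example/-/ready' \
#   'grafana=curl -fsS https://grafana.hub.example/api/health'
```

Ordering the stages from spoke to hub means the reported name is the stage closest to the data source that is broken, which is usually where to start.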
Runbooks should state explicit failover steps: if the primary metrics store is unavailable, spokes buffer or drop with alert RemoteWriteBehind; if the hub is lost, spokes continue scraping locally until restore; if a single spoke agent dies, dashboards show a gap for one cluster label rather than global blindness. Disaster recovery drills must restore both etcd and object-storage buckets holding historical blocks, otherwise post-incident timelines lack data even when applications recovered quickly.
Did You Know?
- Cluster API is a subproject of Kubernetes SIG Cluster Lifecycle, not a CNCF incubating project in its own right, and its providers ship independently of Kubernetes minor releases.
- Open Cluster Management’s klusterlet uses a pull model so edge clusters behind outbound-only firewalls can receive ManifestWork without exposing their API servers to the corporate CI network.
- Cilium ClusterMesh requires non-overlapping pod and service CIDRs across member clusters, a planning constraint that must be enforced before the first cluster is provisioned.
- Losing management-cluster etcd quorum typically freezes fleet-wide GitOps and policy updates long before workload-cluster applications stop running—making hub monitoring easy to under-prioritize until the first failed upgrade window.
Platform teams that treat the management cluster as “just another dev cluster” usually learn this lesson during the first coordinated Kubernetes minor upgrade across a fleet. Elevate hub SLOs, backup verification, and etcd latency alerts to the same tier as production workload apiserver monitoring on every private cloud and bare-metal site you operate fleet-wide.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Running tenant workloads on the management cluster | Fleet syncs starve or CAPI machine rolls evict platform controllers | Taint management workers; keep only fleet/CAPI controllers |
| Hub push GitOps (e.g. Argo) without network path to spokes | Applications never sync; silent drift on edge clusters | Use OCM/Flux/Fleet pull agents or Argo CD cluster agents |
| Skipping CIDR planning before Submariner or ClusterMesh | Overlapping pod CIDRs require painful cluster rebuild | Document IPAM matrix before provisioning cluster zero |
| Embedding application charts inside CAPI bootstrap | Cluster upgrades break unrelated application releases | Separate CAPI templates from GitOps platform repo |
| One kubeconfig shared by all engineers | No audit trail; excessive blast radius on credential leak | OIDC per cluster plus hub proxy with RBAC |
| Upgrading all clusters same day | Correlated failure during deprecated API removal | Canary cluster per minor version; staged Fleet bundles |
| Observability without cluster label | Alerts cannot route; Thanos queries aggregate incorrectly | Enforce label injection via scrape config or addon |
| Deleting infrastructure before hub deregistration | Orphaned secrets and finalizers block GitOps objects | Ordered decommission runbook: detach, delete CRs, then VMs |
Question 1: Your private cloud team provisions clusters with Cluster API on vSphere. Application platform engineers want a single Git repository to deploy ingress and monitoring to every new cluster automatically. Which component owns each concern?
A) CAPI owns ingress Helm releases; Argo CD owns VM creation.
B) CAPI owns VM and Kubernetes node lifecycle; a fleet GitOps tool owns platform add-ons after the cluster API is ready.
C) Rancher Fleet owns etcd backups; CAPI owns application Deployments.
D) ApplicationSets own bare-metal BMC credentials; CAPM3 owns Prometheus rules.
View Answer
Correct Answer: B. Cluster API infrastructure providers translate Cluster and Machine objects into VMs and joined nodes. GitOps controllers—Fleet, Argo CD, Flux, or OCM ManifestWork—reconcile platform manifests once the API server is reachable. Option A reverses responsibilities. Option C confuses backup tooling with CAPI. Option D incorrectly assigns application and monitoring ownership to infrastructure providers.
Question 2: Twelve factory-edge clusters can initiate HTTPS outbound to a central management API but cannot receive inbound connections from the corporate network. Which fleet architecture fits best?
A) Argo CD ApplicationSets on the hub pushing to spoke API servers on port 6443.
B) Open Cluster Management with klusterlet agents pulling ManifestWork from the hub.
C) Karmada only, without any agents on spokes.
D) Running all workloads on the management cluster to avoid spokes.
View Answer
Correct Answer: B. OCM’s agent pull model matches outbound-only edge networks. Option A requires inbound API access from hub to spoke, which the scenario forbids. Option C fails because Karmada without agents means Push mode, in which the Karmada control plane connects to each member API server; outbound-only spokes would need Pull mode with a karmada-agent on each member. Option D defeats isolation requirements for edge factories.
Question 3: Platform engineering needs active-active replica placement across datacenters with a federated Kubernetes API surfaced to application teams. Which tool most directly provides that experience?
A) Submariner alone without GitOps.
B) Karmada propagation policies selecting member clusters.
C) A single-cluster autoscaler.
D) etcd snapshot scheduling only.
View Answer
Correct Answer: B. Karmada federates familiar workload APIs and propagates resources based on placement and failover policies. Submariner solves networking, not workload API federation. Options C and D do not address multi-cluster scheduling semantics.
Question 4: During a management-cluster etcd corruption event, workload clusters remain healthy. What is the most likely immediate impact?
A) All production pods terminate within seconds.
B) New GitOps syncs and cluster registrations stall while existing workloads continue running.
C) Cilium ClusterMesh automatically rebuilds etcd on spokes.
D) vSphere CSI stops mounting volumes globally.
View Answer
Correct Answer: B. Hub datastore loss freezes fleet control loops but does not directly stop pods that each spoke's kubelet is already running. Option A overstates impact. Options C and D reference unrelated components.
Question 5: Application teams need cross-cluster Kubernetes Service DNS/discovery. One site runs Cilium with non-overlapping pod CIDRs; another runs Submariner with Lighthouse enabled. Which statement is most accurate?
A) Only Cilium ClusterMesh can export Services across clusters; Submariner provides pod connectivity only.
B) Submariner Lighthouse (MCS DNS) and Cilium ClusterMesh both provide cross-cluster Service discovery; choose based on your CNI and existing Submariner deployment.
C) Deleting NetworkPolicies on both clusters enables cross-cluster DNS.
D) Sharing one kubeconfig file federates Service records automatically.
View Answer
Correct Answer: B. Cilium ClusterMesh global services and Submariner Lighthouse (Multi-Cluster Services API with MCS DNS) both address cross-cluster Service discovery. Option A ignores Lighthouse. Options C and D do not provide MCS or ClusterMesh semantics.
Question 6: Auditors require proof that production clusters received the latest Pod Security admission policy but staging clusters may lag one week. Which combination implements that policy distribution design?
A) Manually kubectl apply on each cluster without records.
B) OCM or Fleet placement targeting production labels plus staged Git promotion for staging.
C) Disable admission during audits.
D) Store policies only in a wiki.
View Answer
Correct Answer: B. Placement rules and staged Git repos provide evidence and differing schedules. Option A lacks auditable automation. Options C and D fail compliance goals.
Question 7: Platform leadership asks you to design a management-cluster architecture that upgrades workload clusters on vSphere with Cluster API while keeping Git history auditable. What is the strongest first deliverable?
A) A Git repository with Cluster, KubeadmControlPlane, and Fleet bundle manifests plus documented upgrade waves.
B) A spreadsheet of SSH passwords for each apiserver.
C) Deleting CAPI and installing kubectl on each node manually.
D) Running all workloads on the management cluster to reduce cluster count.
View Answer
Correct Answer: A. Design artifacts should capture declarative provisioning and upgrades in version-controlled manifests. CAPI clusters on private cloud map naturally to reviewed YAML. Option B is insecure and non-auditable. Option C abandons declarative lifecycle. Option D collapses isolation boundaries the architecture is meant to preserve.
Question 8: On-call reports that Thanos fan-in stopped while applications run fine. Which diagnostic path matches the observability failure domains?
A) Check remote-write errors on spoke agents, then hub receive endpoints, then object storage retention.
B) Immediately rebuild every workload cluster etcd.
C) Disable all NetworkPolicies cluster-wide.
D) Remove OIDC to simplify kubeconfig access.
View Answer
Correct Answer: A. Observability pipelines fail independently from workload scheduling; fan-in issues usually appear in agent buffers, receive ingress (Thanos Receive or Mimir distributor), or long-term storage—not in application Deployments. Option B risks unnecessary disaster scope. Options C and D do not address metrics recovery paths.
Hands-On Exercise: Explore a Local Management Plane
These exercises use kind clusters and public documentation commands so you can practice fleet concepts without a full private cloud lab. Commands assume kubectl, kind, and docker are installed. Across the three exercises you design a local management-cluster architecture, implement a minimal policy distribution object, and diagnose label contracts for observability fan-in.
Exercise 1: Initialize Cluster API on a kind management cluster
    kind create cluster --name capi-mgmt
    kubectl cluster-info --context kind-capi-mgmt

    curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.12.1/clusterctl-linux-amd64 -o /tmp/clusterctl
    chmod +x /tmp/clusterctl
    /tmp/clusterctl init --infrastructure docker

    kubectl wait --for=condition=Ready pod -l cluster.x-k8s.io/provider=cluster-api -n capi-system --timeout=180s
    kubectl get pods -n capi-system
    kubectl get pods -n capd-system

- kind management cluster created and API reachable.
- clusterctl init completed without provider pod crash loops.
- You can explain how this lab design maps to provisioning and upgrading workload clusters with Cluster API on private cloud.
Exercise 2: Implement a namespaced policy placeholder and inspect fleet CRDs
    kubectl api-resources | grep -E 'cluster.x-k8s.io|fleet.cattle.io|cluster.open-cluster-management.io'
    kubectl explain cluster --api-version=cluster.x-k8s.io/v1beta1 | head -20

    kubectl create namespace policy-lab --dry-run=client -o yaml | kubectl apply -f -
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tenant-boundary-contract
      namespace: policy-lab
    data:
      policy_distribution: "production-only-kyverno-bundle"
      rbac_federation: "oidc-group-platform-admins"
    EOF
    kubectl get configmap tenant-boundary-contract -n policy-lab

Open the following documentation in a browser and note whether reconciliation is push or pull for each tool:

- Rancher Fleet: https://fleet.rancher.io/
- Open Cluster Management: https://open-cluster-management.io/docs/getting-started/quick-start/

- CAPI Cluster kind explained successfully.
- Documented push versus pull behavior for Fleet and OCM.
- Created a policy distribution placeholder ConfigMap describing tenant boundaries and RBAC federation fields.
Exercise 3: Diagnose label gaps for observability fan-in
    kubectl create namespace observability-drill --dry-run=client -o yaml | kubectl apply -f -
    kubectl label namespace observability-drill platform.kubedojo.io/cluster-class=lab --overwrite

    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-scrape-contract
      namespace: observability-drill
    data:
      required_labels: "cluster,environment,site"
    EOF

    kubectl get configmap prometheus-scrape-contract -n observability-drill -o yaml

- Namespace labeled for cluster-class selection.
- ConfigMap documents required fan-in labels for future Prometheus agents.
- You can diagnose which missing label would break observability fan-in queries during a failover drill.
Learner Check
You are ready to continue when you can sketch a management cluster and at least two workload clusters on paper, labeling which components handle infrastructure lifecycle (Cluster API), application delivery (Fleet, Argo CD, Flux, Karmada, or OCM), policy distribution, identity, networking, and observability fan-in. Explain push versus pull reconciliation for your own network constraints without looking at notes.
Self-assessment prompts:
- Why should the management cluster stay workload-light?
- Which failure happens first when hub etcd is unavailable but spokes are healthy?
- When would you choose Submariner, Cilium ClusterMesh, or Istio multi-cluster?
- How do placement rules let you show auditors that stricter policies reach production clusters only?
If you can answer those with a concrete on-premises example from your organization’s datacenter or factory edge constraints, proceed to bare-metal Cluster API depth in the next module.
Before moving on, rehearse explaining the control-plane-of-control-planes pattern to a skeptical application developer in three sentences: what the management cluster does, what stays local on workload clusters, and why hub outages differ from application outages. That narrative skill prevents over-promising “single pane of glass” magic while still justifying investment in fleet automation, policy distribution, and observability fan-in for on-premises Kubernetes at scale.
Next Module
Continue to Module 5.3: Cluster API on Bare Metal for declarative provisioning with Metal3, MachineHealthCheck remediation, and GitOps-driven cluster classes on physical servers.
Sources
- https://cluster-api.sigs.k8s.io/
- https://cluster-api.sigs.k8s.io/user/quick-start.html
- https://github.com/kubernetes-sigs/cluster-api
- https://karmada.io/docs/
- https://open-cluster-management.io/
- https://open-cluster-management.io/docs/getting-started/quick-start/
- https://fleet.rancher.io/
- https://ranchermanager.docs.rancher.com/integrations-in-rancher/fleet
- https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/
- https://fluxcd.io/flux/installation/configuration/multitenancy/
- https://submariner.io/
- https://docs.cilium.io/en/stable/network/clustermesh/clustermesh/
- https://istio.io/latest/docs/setup/install/multicluster/
- https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/
- Cluster API documentation: declarative cluster lifecycle API and provider model.
- Karmada documentation: federated propagation and failover policies across member clusters.
- Open Cluster Management: hub-and-spoke registration and pull-based ManifestWork delivery.