Module 10.5: Multi-Cloud Fleet Management (Azure Arc / GKE Fleet)
Complexity: [COMPLEX] | Time to Complete: 2.5h | Prerequisites: Hybrid Cloud Architecture (Module 10.4), Kubernetes Multi-Cluster Basics
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure Azure Arc-enabled Kubernetes and GKE Fleet to register and manage clusters across clouds and on-premises
- Implement fleet-wide GitOps with Flux or Config Sync deployed consistently across all registered clusters
- Deploy centralized policy enforcement using Azure Policy for Arc or GKE Policy Controller across the entire fleet
- Design fleet topology strategies that balance central governance with team autonomy in multi-cluster environments
Why This Module Matters
In late 2023, a global retail company operated 73 Kubernetes clusters across three cloud providers and two data centers. Each cluster had its own deployment pipeline, its own monitoring stack, its own policy engine, and its own team responsible for upgrades. When a critical Kubernetes CVE (CVE-2023-5528) was announced, their security team needed to assess and patch every cluster. It took them 11 days to determine which clusters were affected, 6 weeks to patch all of them, and during that window they discovered that 9 clusters were running Kubernetes versions so old they were no longer receiving security patches at all. Nobody had noticed because nobody had a fleet-wide view.
The incident report identified a fundamental organizational failure: they had clusters but not a fleet. Each cluster was a pet, individually configured and managed. The company had no centralized inventory, no way to push configuration changes to all clusters simultaneously, and no unified view of compliance or health. Their CTO described it as “running 73 separate Kubernetes islands with bridges made of Slack messages and wiki pages.”
Fleet management tools solve this by treating your entire collection of Kubernetes clusters as a single manageable unit. Azure Arc, Google Fleet (GKE Enterprise), and open-source alternatives like Rancher Fleet provide centralized inventory, policy distribution, configuration management, and observability across clusters regardless of where they run. In this module, you will learn the reality of multi-cloud fleet management, how Azure Arc and Google Fleet work, how to centralize telemetry and policy, and how to implement multi-cloud GitOps at scale.
The Multi-Cloud Reality Check
Before diving into tools, you need an honest assessment of why enterprises end up multi-cloud and what it actually costs.
Why Enterprises Go Multi-Cloud
Most enterprises do not choose multi-cloud strategically. They end up there through:
- Acquisitions: Company A uses AWS, acquires Company B which uses Azure. Consolidation is estimated at 3 years but never happens.
- Best-of-breed selection: ML team chose GCP for Vertex AI, main platform team chose AWS for EKS, data team chose Azure for Synapse.
- Regulatory requirements: EU data must stay in a specific region that only one provider supports well.
- Vendor negotiation leverage: “We use AWS but we could switch to Azure” is only credible if you actually have Azure workloads.
- Shadow IT: A team started using a second cloud on a corporate credit card. By the time IT found out, there were production workloads running.
The Real Cost of Multi-Cloud
| Category | Single Cloud | Multi-Cloud (3 CSPs) |
|---|---|---|
| Platform team size | 5-8 engineers | 12-20 engineers |
| Training budget | $15K/year | $45-60K/year |
| Tooling licenses | $50K/year | $100-200K/year |
| Data transfer costs | Internal only | $50-200K/year |
| Compliance audits | 1 scope | 3 scopes |
| Incident complexity | Low | High (finger-pointing) |
| Negotiation leverage | Low | Medium |
Net effect: 2-3x operational cost for marginal benefit. Exception: If you genuinely use best-of-breed per provider.
Stop and think: Look at the table above. If your organization is operating in multiple clouds purely due to an un-merged acquisition, are you extracting any of the “best-of-breed” benefits, or are you just paying the multi-cloud operational tax?
Azure Arc for Kubernetes
Azure Arc extends Azure’s management plane to any Kubernetes cluster, regardless of where it runs. You can connect an EKS cluster, a GKE cluster, an on-premises kubeadm cluster, or even a Raspberry Pi cluster to Azure Arc and manage them all through the Azure portal and APIs.
How Azure Arc Works
```mermaid
flowchart TD
    subgraph Azure ["Azure Control Plane"]
        ARM["Azure Resource Manager\n(Azure Portal)"]
        Policy["Azure Policy Engine"]
        Monitor["Azure Monitor"]
        Defender["Microsoft Defender"]
    end

    subgraph Cluster ["Target Cluster"]
        subgraph Agent ["Arc Agent"]
            CC["cluster-connect\n(reverse proxy)"]
            CA["config-agent\n(GitOps/Flux)"]
            AP["azure-policy\n(Gatekeeper)"]
            OM["monitoring\n(omsagent)"]
        end
    end

    Agent -- "HTTPS (outbound only)\nNo inbound ports needed.\nNo VPN required." --> Azure
```

Connecting a Cluster to Azure Arc
```bash
# Prerequisites: Azure CLI with connectedk8s extension
az extension add --name connectedk8s
az extension add --name k8s-configuration

# Connect an EKS cluster to Azure Arc
# First, ensure your kubeconfig points to the target cluster
export KUBECONFIG=~/.kube/eks-production-config

az connectedk8s connect \
  --name eks-prod-us-east-1 \
  --resource-group rg-arc-fleet \
  --location eastus \
  --tags "provider=aws" "environment=production" "team=platform" \
  --distribution eks \
  --infrastructure aws

# Verify the connection
az connectedk8s show \
  --name eks-prod-us-east-1 \
  --resource-group rg-arc-fleet \
  --query '{Name:name, Status:connectivityStatus, Distribution:distribution, Infrastructure:infrastructure}'

# List all Arc-connected clusters
az connectedk8s list \
  --resource-group rg-arc-fleet \
  --query '[].{Name:name, Status:connectivityStatus, Provider:infrastructure, K8sVersion:kubernetesVersion}' \
  --output table
```

Azure Policy for Arc-Connected Clusters
Once connected, you can push Azure Policies to any cluster in your fleet:
```bash
# Assign a policy to enforce no privileged containers across ALL Arc clusters
az policy assignment create \
  --name deny-privileged-containers \
  --display-name "Deny privileged containers on all Arc clusters" \
  --policy "/providers/Microsoft.Authorization/policyDefinitions/95edb821-ddaf-4404-9732-666045e056b4" \
  --scope "/subscriptions/$SUB_ID/resourceGroups/rg-arc-fleet" \
  --params '{"effect": {"value": "deny"}}'

# This installs OPA Gatekeeper on every Arc-connected cluster
# and deploys the constraint automatically

# Check compliance across the fleet
az policy state list \
  --resource-group rg-arc-fleet \
  --filter "policyDefinitionName eq '95edb821-ddaf-4404-9732-666045e056b4'" \
  --query '[].{Resource:resourceId, Compliance:complianceState}' \
  --output table
```

Pause and predict: If you assign a “deny privileged containers” policy to your fleet, what happens to existing privileged pods that were deployed before the policy was assigned? Will they be terminated? (Hint: Think about how admission controllers work.)
GitOps with Arc (Flux)
```bash
# Deploy a GitOps configuration to all Arc clusters with a specific tag
az k8s-configuration flux create \
  --name platform-baseline \
  --cluster-name eks-prod-us-east-1 \
  --resource-group rg-arc-fleet \
  --cluster-type connectedClusters \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/company/fleet-config.git \
  --branch main \
  --kustomization name=platform path=./platform/base prune=true \
  --kustomization name=monitoring path=./monitoring/overlays/production prune=true \
  --kustomization name=policies path=./policies/production prune=true
```

Google Fleet (GKE Enterprise)
Google’s approach to fleet management is built around the concept of a “fleet” — a logical grouping of GKE and non-GKE clusters that share configuration and policies.
GKE Fleet Architecture
```mermaid
flowchart TD
    subgraph GCP ["GCP Fleet Host Project"]
        API["Fleet API\n(GCP Console)"]
        CS["Config Sync (GitOps)"]
        PC["Policy Controller (OPA)"]
        SM["Service Mesh (Istio/ASM)"]
        BA["Binary Authorization"]
    end

    subgraph Fleet ["Fleet Members"]
        subgraph GKE ["GKE Cluster\n(native fleet member)"]
            GKE_CS["Config Sync"]
            GKE_PC["Policy Ctrl"]
        end
        subgraph EKS ["EKS Cluster\n(attached via agent)"]
            EKS_CS["Config Sync"]
            EKS_PC["Policy Ctrl"]
        end
        subgraph OnPrem ["On-Prem K8s\n(attached)"]
            OP_CS["Config Sync"]
            OP_PC["Policy Ctrl"]
        end
    end

    GCP -->|Fleet Features:\napplied uniformly across all\nmembers regardless of where they run| Fleet
```

Registering Clusters in a Fleet
```bash
# Register a GKE cluster (automatic for GKE clusters in the fleet project)
gcloud container fleet memberships register gke-prod-us \
  --gke-cluster us-central1/gke-prod-us \
  --enable-workload-identity

# Register an external cluster (EKS, AKS, on-prem)
# First, generate a registration manifest
gcloud container fleet memberships register eks-prod-east \
  --context=arn:aws:eks:us-east-1:123456789012:cluster/eks-prod \
  --kubeconfig=/path/to/eks-kubeconfig \
  --enable-workload-identity \
  --public-issuer-url=https://oidc.eks.us-east-1.amazonaws.com/id/ABC123

# List fleet members
gcloud container fleet memberships list \
  --format="table(name, uniqueId, authority.workloadIdentityPool, state.code)"
```

Fleet-Wide Configuration with Config Sync
Config Sync is Google’s GitOps engine, similar to Flux or ArgoCD but tightly integrated with Fleet:
```yaml
# Applied once, syncs to all fleet members
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  sourceFormat: unstructured
  git:
    syncRepo: https://github.com/company/fleet-config.git
    syncBranch: main
    secretType: token
    policyDir: fleet-policies
  policyController:
    enabled: true
    templateLibraryInstalled: true
    referentialRulesEnabled: true
    logDeniesEnabled: true
    mutationEnabled: true
```

```bash
# Enable Config Sync for the entire fleet
gcloud beta container fleet config-management enable

# Apply configuration to all fleet members
gcloud beta container fleet config-management apply \
  --membership=gke-prod-us \
  --config=config-sync-config.yaml

# Check sync status across the fleet
gcloud beta container fleet config-management status \
  --format="table(Name, Status, Last_Synced_Token, Sync_Errors)"
```

Fleet-Wide Policy with Policy Controller
```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg}] {
          provided := {l | input.review.object.metadata.labels[l]}
          required := {l | l := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Missing required labels: %v", [missing])
        }
---
# fleet-policies/constraints/require-team-label.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels:
      - "team"
      - "cost-center"
```

Centralized Telemetry for Multi-Cloud Fleets
A fleet without centralized observability is a fleet in name only. You need a single place to see the health, performance, and compliance of every cluster.
Telemetry Architecture
```mermaid
flowchart TD
    subgraph Clusters ["Fleet Clusters"]
        AWS["AWS Cluster\nOTel Collector"]
        Azure["Azure Cluster\nOTel Collector"]
        GCP["GCP Cluster\nOTel Collector"]
        OnPrem["On-Prem Cluster\nOTel Collector"]
    end

    subgraph Hub ["CENTRAL TELEMETRY HUB"]
        Metrics["Metrics: Thanos/Cortex"]
        Logs["Logs: Loki/Elasticsearch"]
        Traces["Traces: Tempo/Jaeger"]
        Dashboards["Dashboards: Grafana"]
    end

    AWS --> Hub
    Azure --> Hub
    GCP --> Hub
    OnPrem --> Hub
```

OpenTelemetry Collector for Fleet Telemetry
```yaml
# Deploy on each cluster with cluster-specific labels
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: kubernetes-pods
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      resource:
        attributes:
          - key: cluster.name
            value: "${CLUSTER_NAME}"
            action: upsert
          - key: cluster.provider
            value: "${CLOUD_PROVIDER}"
            action: upsert
          - key: cluster.region
            value: "${CLUSTER_REGION}"
            action: upsert
          - key: cluster.environment
            value: "${ENVIRONMENT}"
            action: upsert
      batch:
        timeout: 30s
        send_batch_size: 1024

    exporters:
      otlphttp/metrics:
        endpoint: https://telemetry-hub.company.com/api/v1/push
        headers:
          Authorization: "Bearer ${TELEMETRY_TOKEN}"
      otlphttp/traces:
        endpoint: https://telemetry-hub.company.com/api/v1/traces
        headers:
          Authorization: "Bearer ${TELEMETRY_TOKEN}"

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [resource, batch]
          exporters: [otlphttp/metrics]
        traces:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [otlphttp/traces]
```

Stop and think: If your central telemetry hub goes down, what happens to the telemetry data generated by your 50 clusters? How should you configure your OTel Collectors to handle this scenario?
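One hedged answer to that question: give each collector's exporters a retry policy and a disk-backed sending queue, so short hub outages buffer telemetry locally instead of dropping it. A minimal sketch, assuming the collector-contrib distribution (which includes the `file_storage` extension); the buffer path and queue size are illustrative:

```yaml
# Sketch: survive hub outages by spilling the export queue to local disk
extensions:
  file_storage:
    directory: /var/lib/otelcol/buffer    # assumed writable hostPath/emptyDir

exporters:
  otlphttp/metrics:
    endpoint: https://telemetry-hub.company.com/api/v1/push
    retry_on_failure:
      enabled: true
      max_elapsed_time: 30m               # keep retrying for up to 30 minutes
    sending_queue:
      enabled: true
      storage: file_storage               # persist queued batches across restarts
      queue_size: 5000

service:
  extensions: [file_storage]
```

Data generated beyond the retry window or queue capacity is still lost, so size the buffer against your longest tolerable hub outage.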
Fleet Health Dashboard Query Examples
```promql
# Cluster count by provider and status
count by (cluster_provider, cluster_environment) (
  up{job="kubernetes-apiservers"}
)

# API server latency P99 across the fleet
histogram_quantile(0.99,
  sum by (cluster_name, le) (
    rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])
  )
)

# Node readiness across the fleet
sum by (cluster_name) (kube_node_status_condition{condition="Ready", status="true"})
/
sum by (cluster_name) (kube_node_info)

# Pod restart rate by cluster (anomaly detection)
sum by (cluster_name) (
  increase(kube_pod_container_status_restarts_total[1h])
) > 50
```

Multi-Cloud GitOps at Scale
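Queries like these can also back an alert for fleet members that go silent. A sketch of a Prometheus rule, where the rule name, lookback window, and thresholds are assumptions to tune for your scrape intervals:

```yaml
# Sketch: fire when a cluster that was reporting 15 minutes ago stops reporting
groups:
  - name: fleet-health
    rules:
      - alert: FleetClusterSilent
        expr: |
          count by (cluster_name) (up{job="kubernetes-apiservers"} offset 15m)
          unless
          count by (cluster_name) (up{job="kubernetes-apiservers"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Cluster {{ $labels.cluster_name }} has stopped reporting telemetry"
```

The `unless` form catches absence, which a simple `up == 0` alert cannot: a cluster that stops shipping metrics produces no samples at all.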
GitOps for a fleet of clusters requires more than a single ArgoCD instance. You need patterns that scale to dozens or hundreds of clusters.
ArgoCD ApplicationSet for Fleet-Wide Deployment
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-platform-services
  namespace: argocd
spec:
  generators:
    # Generate an Application for every cluster registered in ArgoCD
    - clusters:
        selector:
          matchExpressions:
            - key: environment
              operator: In
              values:
                - production
                - staging
  template:
    metadata:
      name: 'platform-{{name}}'
      labels:
        fleet-component: platform
    spec:
      project: fleet-platform
      source:
        repoURL: https://github.com/company/fleet-platform.git
        targetRevision: main
        path: 'clusters/{{metadata.labels.provider}}/{{metadata.labels.environment}}'
      destination:
        server: '{{server}}'
        namespace: platform-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        retry:
          limit: 5
          backoff:
            duration: 5s
            factor: 2
            maxDuration: 3m
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
```

Fleet Git Repository Structure
```
fleet-platform/
├── base/                            # Shared across all clusters
│   ├── monitoring/
│   │   ├── prometheus.yaml
│   │   ├── otel-collector.yaml
│   │   └── kustomization.yaml
│   ├── policy/
│   │   ├── kyverno-policies.yaml
│   │   └── kustomization.yaml
│   └── security/
│       ├── falco.yaml
│       └── kustomization.yaml
│
├── clusters/
│   ├── aws/
│   │   ├── production/
│   │   │   ├── kustomization.yaml   # patches for AWS prod
│   │   │   └── values-override.yaml
│   │   └── staging/
│   │       └── kustomization.yaml
│   ├── azure/
│   │   ├── production/
│   │   │   └── kustomization.yaml   # patches for Azure prod
│   │   └── staging/
│   │       └── kustomization.yaml
│   └── onprem/
│       └── production/
│           └── kustomization.yaml   # patches for on-prem
│
└── fleet-sync.yaml                  # ApplicationSet definition
```

```yaml
# Example overlay: clusters/aws/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/monitoring
  - ../../../base/policy
  - ../../../base/security
patches:
  - target:
      kind: ConfigMap
      name: otel-collector-config
    patch: |-
      - op: replace
        path: /data/config.yaml
        value: |
          # AWS-specific: export to CloudWatch as well
          exporters:
            awscloudwatchlogs:
              log_group_name: /eks/fleet-telemetry
              log_stream_name: ${CLUSTER_NAME}
  - target:
      kind: ClusterPolicy
      name: require-image-registry
    patch: |-
      - op: replace
        path: /spec/rules/0/validate/pattern/spec/containers/0/image
        value: "123456789012.dkr.ecr.*.amazonaws.com/*"
```

Did You Know?
- Azure Arc has connected over 26,000 Kubernetes clusters as of early 2025, including clusters running on AWS, GCP, and edge devices. Microsoft does not charge for the basic Arc connection — the revenue comes from extensions (Azure Policy, Monitoring, Defender) that cost $6-15 per vCPU/month. A 100-node cluster with 4 vCPUs per node running all extensions can cost $2,400-6,000/month in Arc extension fees alone.
- Google renamed “Anthos” to “GKE Enterprise” in 2023 partly because customers could not pronounce it consistently. The name “Anthos” came from the Greek word for “flower,” symbolizing growth and adaptation. Despite the rebrand, the underlying technology has been remarkably stable — Config Sync, Policy Controller, and Service Mesh (based on Istio) have remained the core pillars since the original Anthos launch in 2019.
- The average enterprise fleet grows by 35% per year in cluster count, according to a 2024 Datadog survey. Organizations with fleet management tooling grow faster (42%/year) than those without (21%/year) because the management overhead per cluster is lower, removing the friction that previously limited cluster creation. This suggests that fleet management tools do not just manage existing complexity — they enable more of it.
- Rancher, originally created by Rancher Labs and now owned by SUSE after a $600 million acquisition in 2020, manages over 180,000 clusters worldwide. Unlike Arc and Fleet which are tied to specific clouds, Rancher is fully vendor-neutral and self-hosted. Its “Fleet” component (confusingly sharing a name concept with Google’s Fleet) handles GitOps-based multi-cluster management and scales to thousands of clusters per management server.
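The per-cluster extension math in the first bullet is easy to sanity-check (the node count, vCPU count, and rates below are the figures quoted above):

```shell
# Back-of-the-envelope check of the quoted Arc extension fees
NODES=100
VCPUS_PER_NODE=4
RATE_LOW=6            # $/vCPU/month, low end of the quoted range
RATE_HIGH=15          # $/vCPU/month, high end
TOTAL_VCPUS=$((NODES * VCPUS_PER_NODE))
COST_LOW=$((TOTAL_VCPUS * RATE_LOW))
COST_HIGH=$((TOTAL_VCPUS * RATE_HIGH))
echo "Extension cost range: \$${COST_LOW}-\$${COST_HIGH}/month"
# prints: Extension cost range: $2400-$6000/month
```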
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Connecting clusters to Arc/Fleet without a strategy | “Let us just connect everything and see what happens.” Clusters flood in without naming conventions, tags, or ownership. | Define a fleet taxonomy first: naming conventions, required tags (provider, environment, team, region), grouping strategy. Then connect clusters. |
| Using fleet management as the sole management layer | “Arc/Fleet replaces our need for per-cluster tools.” But fleet tools provide a subset of cluster management features. | Fleet tools handle cross-cluster concerns (policy, config sync, observability). Per-cluster concerns (node management, storage, networking) still need per-cluster tools. |
| Same configuration for all clusters | “One size fits all — push the same policies and monitoring everywhere.” But dev clusters do not need production-grade monitoring, and on-prem clusters have different storage classes. | Use the base/overlay pattern: shared base configurations with per-environment and per-provider patches. Kustomize overlays are ideal for this. |
| Ignoring fleet telemetry costs | Centralized monitoring of 50 clusters generates enormous data volumes. Teams are surprised by a $30K/month observability bill. | Set retention policies, use sampling for traces, aggregate metrics at the cluster level before shipping. Calculate telemetry costs per cluster before enabling fleet-wide collection. |
| No cluster lifecycle management | Fleet management connects existing clusters but nobody plans for cluster creation, upgrades, or decommissioning at fleet scale. | Combine fleet management with Cluster API or Crossplane for lifecycle. Fleet tools manage what runs in clusters; lifecycle tools manage the clusters themselves. |
| Multi-cloud GitOps without abstractions | Git repo contains provider-specific configs scattered everywhere. Changing a policy means editing 3 different files for 3 providers. | Use the base/overlay pattern shown in this module. Provider-specific differences go in overlay patches, not in the base configuration. |
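The base/overlay fix called out in the table can be sketched with a tiny Kustomize overlay. The file path and patch value below are illustrative; the ConfigMap name reuses `fleet-monitoring-config` from the hands-on exercise later in this module:

```yaml
# Hypothetical staging overlay: same base, lighter monitoring retention
# (would live at clusters/aws/staging/kustomization.yaml)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/monitoring
patches:
  - target:
      kind: ConfigMap
      name: fleet-monitoring-config
    patch: |-
      - op: replace
        path: /data/retention
        value: "24h"   # staging does not need production-grade retention
```

The base never mentions staging; the overlay never repeats the base. That separation is what keeps a fleet-wide change to one file.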
Question 1: Your organization recently acquired a company that runs 15 EKS clusters. You currently manage 20 GKE clusters using Google Fleet. Your CTO asks if you should use Azure Arc or Google Fleet to unify management, noting they heard Arc requires inbound firewall ports. How do you evaluate the architectural differences to advise the CTO?
Azure Arc does not require inbound firewall ports; an agent installed inside the target cluster maintains an outbound-only HTTPS connection to the Azure control plane, so no inbound firewall rules or VPN are needed. Google Fleet also uses an agent for non-GKE clusters, but it is organized around a stronger concept of “fleet features” — capabilities like Config Sync and Policy Controller that are enabled at the fleet level and distributed to all members. Because you already use Google Fleet for 20 GKE clusters, attaching the 15 EKS clusters to your existing fleet is likely the most seamless approach. Fleet excels at homogeneous configuration distribution, whereas Arc would be the better fit if you were heavily invested in Azure management tools (like Azure Monitor or Defender) and wanted to enable extensions à la carte per cluster.
Question 2: Your company has 40 clusters: 25 on AWS, 10 on Azure, and 5 on-premises. You are tasked with selecting a fleet management platform and need to choose between Azure Arc and Google Fleet. What factors should drive the decision for your specific environment?
You need to consider several key factors to make an informed decision. First, evaluate your existing cloud investment: if the company already uses Azure AD for identity, Azure Monitor for observability, and Azure DevOps for CI/CD, Arc integrates seamlessly with these tools. Second, assess team expertise; Arc requires Azure knowledge, while Fleet requires GCP knowledge, so you should lean towards the ecosystem your team already understands. Third, analyze your feature requirements, such as whether you need Fleet’s built-in Istio service mesh or Arc’s integration with Windows containers. Finally, compare the total cost for your fleet size and consider vendor neutrality, potentially evaluating self-hosted options like Rancher or EKS Connector with ArgoCD if avoiding a third cloud dependency is a priority.
Question 3: Your platform team is struggling to manage GitOps deployments across 50 clusters spread across AWS, Azure, and on-premises environments. Engineers are currently copying and pasting entire configuration repositories for each cluster, leading to massive configuration drift. How should you restructure your GitOps approach to solve this multi-cloud configuration problem?
You should implement the base/overlay pattern using a tool like Kustomize to separate what you want to deploy from how it differs per environment. The base directory will contain configurations that are identical across all clusters, such as the standard monitoring stack, core policy definitions, and security tools. Overlay directories will contain patches specific to each cloud provider or environment, such as overriding the image registry URL for AWS (ECR) versus Azure (ACR). When you need to update a fleet-wide policy, you only change the base, and the change propagates to all clusters automatically. This completely eliminates the need to maintain multiple distinct copies of configuration files, ensuring consistency at scale while cleanly managing provider-specific deviations.
Question 4: A fleet-wide policy update is pushed via GitOps. It works on 39 out of 40 clusters. The 40th cluster (an on-premises kubeadm cluster running Kubernetes 1.28) rejects the policy. What happened and how do you handle it?
The most likely cause is a Kubernetes version incompatibility between the policy manifest and the older API server. If the policy uses an API version or feature only available in Kubernetes 1.30+ (such as a newer Gatekeeper constraint template format), the 1.28 cluster’s API server will reject the resource during the apply phase. This highlights a common fleet management challenge where version skew across clusters breaks unified deployments. To handle this, you should include version gates in your GitOps configuration, such as using ArgoCD ApplicationSets to filter which policies sync based on cluster version labels. Alternatively, you can use conditional patches in Kustomize overlays to adjust policies for older clusters, or implement automated pre-flight checks in your pipeline to validate manifests against target cluster versions before attempting to sync.
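One way to express that version gate in an ArgoCD ApplicationSet is a label selector over the cluster secrets. A sketch, where the `k8s-version-minor` label is a hypothetical convention your registration process would have to maintain on each cluster secret:

```yaml
# Sketch: skip the new policy bundle on clusters still below Kubernetes 1.30
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-policies-v2
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: k8s-version-minor         # assumed label on cluster secrets
              operator: NotIn
              values: ["27", "28", "29"]
  template:
    metadata:
      name: 'policies-v2-{{name}}'
    spec:
      project: fleet-platform
      source:
        repoURL: https://github.com/company/fleet-config.git
        targetRevision: main
        path: policies/v2
      destination:
        server: '{{server}}'
        namespace: gatekeeper-system
```

The older clusters would keep syncing a v1 policy path from a separate ApplicationSet until they are upgraded.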
Question 5: The CFO noticed a massive spike in cloud spending after your team enabled centralized monitoring for a 50-cluster fleet. The observability platform bill is now $40,000 per month. What are the primary drivers of these telemetry costs at fleet scale, and how can you architect a solution to reduce them?
The primary drivers of telemetry costs at fleet scale are the sheer volume of active metric time series and log ingestion data. A single Kubernetes cluster can generate up to 100,000 active metric time series and gigabytes of logs per day, which scales linearly and becomes prohibitively expensive when using managed services charging per metric or gigabyte. To reduce these costs, you should implement filtering at the source, shipping only golden signals (latency, traffic, errors, saturation) to the central hub rather than all available metrics. You can also aggregate per-pod metrics into per-deployment metrics before shipping and apply strict sampling rates to distributed traces (e.g., 1% for normal traffic, 100% for errors). Finally, setting aggressive retention tiers—keeping data hot for only 7 days and moving the rest to cold storage like S3—will drastically cut the ongoing storage costs.
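The scale of the problem is easy to quantify. The per-cluster series count below is the figure from the answer above; the raw trace volume is an assumed illustration:

```shell
# Rough fleet telemetry volume math
CLUSTERS=50
SERIES_PER_CLUSTER=100000            # active metric time series per cluster
TOTAL_SERIES=$((CLUSTERS * SERIES_PER_CLUSTER))
echo "Active metric series fleet-wide: $TOTAL_SERIES"
# prints: Active metric series fleet-wide: 5000000

TRACES_PER_DAY=2000000               # assumption: 2M raw traces/day fleet-wide
SAMPLED=$((TRACES_PER_DAY / 100))    # keep 1% of normal-traffic traces
echo "Traces shipped per day at 1% sampling: $SAMPLED"
# prints: Traces shipped per day at 1% sampling: 20000
```

Five million active series is what a managed vendor bills you for before any filtering, which is why dropping non-golden-signal metrics at the source matters more than any downstream tuning.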
Question 6: An executive mandates that since the company has adopted Azure Arc for fleet management, you must uninstall all provider-specific tools like `eksctl` and Terraform AWS providers to "simplify tooling". Why will this mandate cause operational failures, and what specific examples should you provide to push back?
This mandate will cause operational failures because fleet management tools handle cross-cluster concerns, not the deep, provider-specific cluster-internal operations required for lifecycle management. Fleet tools act as a coordination layer for policy, configuration, and observability, but they lack the APIs to interact with the underlying cloud infrastructure. For example, while Azure Arc can push an OPA Gatekeeper policy to an EKS cluster, it cannot configure Karpenter provisioners or adjust EKS managed node group scaling parameters, which still require AWS-specific tools. Similarly, Arc can deploy a Kubernetes StorageClass manifest, but it cannot provision or resize the underlying AWS EBS volumes or Azure Disks, meaning per-cluster management tools remain absolutely essential.
Hands-On Exercise: Build a Multi-Cluster Fleet with GitOps
In this exercise, you will create a fleet of three kind clusters simulating different environments, implement centralized GitOps, and build a fleet inventory and health dashboard.
What you will build:
```mermaid
flowchart TD
    subgraph Mgmt ["Management Cluster (ArgoCD)"]
        Argo["ArgoCD ApplicationSets\ndeploy platform services\nto all fleet members"]
    end

    AWS["fleet-aws-prod\n(kind, simulates AWS)"]
    Azure["fleet-azure-staging\n(kind, simulates Azure)"]
    OnPrem["fleet-onprem-prod\n(kind, simulates on-prem)"]

    Mgmt --> AWS
    Mgmt --> Azure
    Mgmt --> OnPrem
```

Task 1: Create the Fleet Clusters
Solution
```bash
# Create three clusters
for CLUSTER in fleet-mgmt fleet-aws-prod fleet-azure-staging; do
  kind create cluster --name $CLUSTER
done

# Verify all clusters are running
for CLUSTER in fleet-mgmt fleet-aws-prod fleet-azure-staging; do
  echo "=== $CLUSTER ==="
  kubectl --context kind-$CLUSTER get nodes
done
```

Task 2: Install ArgoCD on the Management Cluster
Solution
```bash
# Install ArgoCD on management cluster
kubectl --context kind-fleet-mgmt create namespace argocd
kubectl --context kind-fleet-mgmt apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for ArgoCD to be ready
kubectl --context kind-fleet-mgmt wait --for=condition=available \
  deployment/argocd-server -n argocd --timeout=120s

# Get the admin password
ARGOCD_PW=$(kubectl --context kind-fleet-mgmt -n argocd \
  get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d)
echo "ArgoCD admin password: $ARGOCD_PW"
```

Task 3: Register Fleet Clusters in ArgoCD
Solution
```bash
# Add fleet clusters to ArgoCD as declarative cluster Secrets (no argocd CLI needed).
# Note: the 127.0.0.1 server URLs in the host kubeconfig are unreachable from inside
# the management cluster; kind nodes share a docker network, so use the node names.
for CLUSTER_INFO in "fleet-aws-prod:aws:production:kind-fleet-aws-prod" "fleet-azure-staging:azure:staging:kind-fleet-azure-staging"; do
  NAME=$(echo $CLUSTER_INFO | cut -d: -f1)
  PROVIDER=$(echo $CLUSTER_INFO | cut -d: -f2)
  ENV=$(echo $CLUSTER_INFO | cut -d: -f3)
  CTX=$(echo $CLUSTER_INFO | cut -d: -f4)

  SERVER="https://${NAME}-control-plane:6443"

  # Create a ServiceAccount for ArgoCD in the target cluster
  kubectl --context $CTX create serviceaccount argocd-manager -n kube-system 2>/dev/null || true
  kubectl --context $CTX create clusterrolebinding argocd-manager \
    --clusterrole=cluster-admin --serviceaccount=kube-system:argocd-manager 2>/dev/null || true

  # Mint a bearer token for that ServiceAccount (works on Kubernetes 1.24+)
  TOKEN=$(kubectl --context $CTX -n kube-system create token argocd-manager)

  cat <<EOF | kubectl --context kind-fleet-mgmt apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: cluster-$NAME
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    provider: $PROVIDER
    environment: $ENV
type: Opaque
stringData:
  name: $NAME
  server: $SERVER
  config: |
    {
      "bearerToken": "$TOKEN",
      "tlsClientConfig": { "insecure": true }
    }
EOF

  echo "Registered cluster: $NAME (provider=$PROVIDER, env=$ENV)"
done

# Verify clusters are registered
kubectl --context kind-fleet-mgmt get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster
```

Task 4: Deploy Platform Services Across the Fleet
Solution
```bash
# Create platform baseline configmaps on each cluster via ArgoCD Applications
# Since we don't have a Git repo, we'll use a direct approach to demonstrate the pattern

for CTX in kind-fleet-aws-prod kind-fleet-azure-staging; do
  CLUSTER_NAME=$(echo $CTX | sed 's/kind-//')

  # Create platform namespace
  kubectl --context $CTX create namespace platform-system 2>/dev/null || true

  # Deploy fleet-standard configuration
  cat <<EOF | kubectl --context $CTX apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-identity
  namespace: platform-system
  labels:
    managed-by: fleet-management
data:
  cluster-name: "$CLUSTER_NAME"
  fleet-version: "1.0.0"
  managed-by: "fleet-mgmt-cluster"
  registered-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-monitoring-config
  namespace: platform-system
  labels:
    managed-by: fleet-management
data:
  scrape-interval: "30s"
  retention: "7d"
  external-labels: |
    cluster=$CLUSTER_NAME
    fleet=enterprise
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fleet-default-deny
  namespace: platform-system
  labels:
    managed-by: fleet-management
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to: []
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
EOF

  echo "Platform services deployed to $CLUSTER_NAME"
done
```

Task 5: Build a Fleet Inventory and Health Report
Solution
```bash
cat <<'SCRIPT' > /tmp/fleet-report.sh
#!/bin/bash
echo "============================================="
echo " FLEET INVENTORY & HEALTH REPORT"
echo " $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "============================================="

TOTAL_NODES=0
TOTAL_PODS=0
TOTAL_CLUSTERS=0
HEALTHY=0

for CTX in kind-fleet-aws-prod kind-fleet-azure-staging; do
  CLUSTER=$(echo $CTX | sed 's/kind-//')
  TOTAL_CLUSTERS=$((TOTAL_CLUSTERS + 1))

  echo ""
  echo "--- Cluster: $CLUSTER ---"

  # Fleet registration check
  FLEET_ID=$(kubectl --context $CTX get configmap fleet-identity -n platform-system -o jsonpath='{.data.cluster-name}' 2>/dev/null || echo "NOT REGISTERED")
  echo " Fleet ID: $FLEET_ID"

  # Node health
  NODES=$(kubectl --context $CTX get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
  READY_NODES=$(kubectl --context $CTX get nodes --no-headers 2>/dev/null | grep " Ready" | wc -l | tr -d ' ')
  echo " Nodes: $READY_NODES/$NODES ready"
  TOTAL_NODES=$((TOTAL_NODES + NODES))

  # Pod count
  PODS=$(kubectl --context $CTX get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
  echo " Running Pods: $PODS"
  TOTAL_PODS=$((TOTAL_PODS + PODS))

  # K8s version, read from node status
  # (`kubectl version --short` was removed in newer kubectl releases)
  VERSION=$(kubectl --context $CTX get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' 2>/dev/null)
  echo " Kubernetes Version: $VERSION"

  # Fleet services check
  FLEET_CONFIGS=$(kubectl --context $CTX get configmap -n platform-system -l managed-by=fleet-management --no-headers 2>/dev/null | wc -l | tr -d ' ')
  NETPOLS=$(kubectl --context $CTX get networkpolicy -n platform-system -l managed-by=fleet-management --no-headers 2>/dev/null | wc -l | tr -d ' ')

  if [ "$FLEET_CONFIGS" -ge 2 ] && [ "$NETPOLS" -ge 1 ]; then
    echo " Fleet Services: HEALTHY ($FLEET_CONFIGS configs, $NETPOLS netpols)"
    HEALTHY=$((HEALTHY + 1))
  else
    echo " Fleet Services: DEGRADED (configs=$FLEET_CONFIGS, netpols=$NETPOLS)"
  fi
done

echo ""
echo "============================================="
echo " FLEET SUMMARY"
echo "============================================="
echo " Total Clusters: $TOTAL_CLUSTERS"
echo " Healthy Clusters: $HEALTHY/$TOTAL_CLUSTERS"
echo " Total Nodes: $TOTAL_NODES"
echo " Total Running Pods: $TOTAL_PODS"
echo " Fleet Health: $(( (HEALTHY * 100) / TOTAL_CLUSTERS ))%"
echo "============================================="
SCRIPT

chmod +x /tmp/fleet-report.sh
bash /tmp/fleet-report.sh
```

Clean Up
```bash
kind delete cluster --name fleet-mgmt
kind delete cluster --name fleet-aws-prod
kind delete cluster --name fleet-azure-staging
docker network rm hybrid-net 2>/dev/null || true
rm /tmp/fleet-report.sh
```

Success Criteria
- I created three kind clusters simulating a multi-cloud fleet
- I installed ArgoCD on the management cluster
- I registered fleet clusters in ArgoCD with provider and environment labels
- I deployed standardized platform services across all fleet members
- I built a fleet inventory and health report
- I can explain the architectural differences between Azure Arc and Google Fleet
- I can describe the base/overlay pattern for multi-cloud GitOps
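To make the last criterion concrete, here is a minimal sketch of the base/overlay layout a fleet config repo might use: a shared base that every cluster receives, plus a per-environment overlay that patches only what differs. All paths and values here are hypothetical; `kubectl kustomize <dir>` would render an overlay into the manifests ArgoCD or Flux applies.

```shell
# Sketch: one shared base, per-environment overlays (hypothetical repo layout)
mkdir -p /tmp/fleet-config/platform/base /tmp/fleet-config/platform/overlays/production

# The base carries the fleet-wide default monitoring settings
cat <<'EOF' > /tmp/fleet-config/platform/base/fleet-monitoring-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-monitoring-config
  namespace: platform-system
data:
  scrape-interval: "30s"
  retention: "7d"
EOF

cat <<'EOF' > /tmp/fleet-config/platform/base/kustomization.yaml
resources:
  - fleet-monitoring-config.yaml
EOF

# The production overlay inherits the base and overrides only retention
cat <<'EOF' > /tmp/fleet-config/platform/overlays/production/kustomization.yaml
resources:
  - ../../base
patches:
  - target:
      kind: ConfigMap
      name: fleet-monitoring-config
    patch: |-
      - op: replace
        path: /data/retention
        value: "30d"
EOF

find /tmp/fleet-config -type f | sort
```

The point of the pattern: central platform teams own the base, while environment or cloud differences live in small, reviewable overlay patches instead of forked copies of every manifest.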
Next Module
Now that you can manage a fleet of clusters, it is time to learn how to provision them declaratively. Head to Module 10.6: Multi-Cloud Provisioning with Cluster API to learn how CAPI and its providers (CAPA, CAPZ, CAPG) let you create, upgrade, and scale Kubernetes clusters across any infrastructure using Kubernetes-native APIs.
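As a preview of what "Kubernetes-native APIs" means in practice: with CAPI, a cluster itself is just another declarative object you apply and reconcile. The sketch below only writes a minimal, hypothetical Cluster manifest to a file; every name, version, and the CIDR is a placeholder, and Module 10.6 covers the real workflow.

```shell
# Sketch: a CAPI cluster definition is an ordinary Kubernetes object.
# All values below are placeholders for illustration only.
cat <<'EOF' > /tmp/capi-cluster-preview.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: fleet-aws-prod-2
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:          # provided by the infra provider (CAPA for AWS)
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: fleet-aws-prod-2
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: fleet-aws-prod-2-control-plane
EOF
echo "wrote /tmp/capi-cluster-preview.yaml"
```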