Module 10.5: Multi-Cloud Fleet Management (Azure Arc / GKE Fleet)
Complexity: [COMPLEX] | Time to Complete: 2.5h | Prerequisites: Hybrid Cloud Architecture (Module 10.4), Kubernetes Multi-Cluster Basics
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure Azure Arc-enabled Kubernetes and GKE Fleet to register and manage clusters across clouds and on-premises
- Implement fleet-wide GitOps with Flux or Config Sync deployed consistently across all registered clusters
- Deploy centralized policy enforcement using Azure Policy for Arc or GKE Policy Controller across the entire fleet
- Design fleet topology strategies that balance central governance with team autonomy in multi-cluster environments
Why This Module Matters
In late 2023, a global retail company operated 73 Kubernetes clusters across three cloud providers and two data centers. Each cluster had its own deployment pipeline, its own monitoring stack, its own policy engine, and its own team responsible for upgrades. When a critical Kubernetes CVE (CVE-2023-5528) was announced, their security team needed to assess and patch every cluster. It took them 11 days to determine which clusters were affected, 6 weeks to patch all of them, and during that window they discovered that 9 clusters were running Kubernetes versions so old they were no longer receiving security patches at all. Nobody had noticed because nobody had a fleet-wide view.
The incident report identified a fundamental organizational failure: they had clusters but not a fleet. Each cluster was a pet, individually configured and managed. The company had no centralized inventory, no way to push configuration changes to all clusters simultaneously, and no unified view of compliance or health. Their CTO described it as “running 73 separate Kubernetes islands with bridges made of Slack messages and wiki pages.”
Fleet management tools solve this by treating your entire collection of Kubernetes clusters as a single manageable unit. Azure Arc, Google Fleet (GKE Enterprise), and open-source alternatives like Rancher Fleet provide centralized inventory, policy distribution, configuration management, and observability across clusters regardless of where they run. In this module, you will learn the reality of multi-cloud fleet management, how Azure Arc and Google Fleet work, how to centralize telemetry and policy, and how to implement multi-cloud GitOps at scale.
The Multi-Cloud Reality Check
Before diving into tools, you need an honest assessment of why enterprises end up multi-cloud and what it actually costs.
Why Enterprises Go Multi-Cloud
Most enterprises do not choose multi-cloud strategically. They end up there through:
- Acquisitions: Company A uses AWS, acquires Company B which uses Azure. Consolidation is estimated at 3 years but never happens.
- Best-of-breed selection: ML team chose GCP for Vertex AI, main platform team chose AWS for EKS, data team chose Azure for Synapse.
- Regulatory requirements: EU data must stay in a specific region that only one provider supports well.
- Vendor negotiation leverage: “We use AWS but we could switch to Azure” is only credible if you actually have Azure workloads.
- Shadow IT: A team started using a second cloud on a corporate credit card. By the time IT found out, there were production workloads running.
The Real Cost of Multi-Cloud
| Category | Single Cloud | Multi-Cloud (3 CSPs) |
|---|---|---|
| Platform team size | 5-8 engineers | 12-20 engineers |
| Training budget | $15K/year | $45-60K/year |
| Tooling licenses | $50K/year | $100-200K/year |
| Data transfer costs | Internal only | $50-200K/year |
| Compliance audits | 1 scope | 3 scopes |
| Incident complexity | Low | High (finger-pointing) |
| Negotiation leverage | Low | Medium |
Net effect: 2-3x operational cost for marginal benefit. Exception: If you genuinely use best-of-breed per provider.
Stop and think: Look at the table above. If your organization is operating in multiple clouds purely due to an un-merged acquisition, are you extracting any of the “best-of-breed” benefits, or are you just paying the multi-cloud operational tax?
Azure Arc for Kubernetes
Azure Arc extends Azure’s management plane to any Kubernetes cluster, regardless of where it runs. You can connect an EKS cluster, a GKE cluster, an on-premises kubeadm cluster, or even a Raspberry Pi cluster to Azure Arc and manage them all through the Azure portal and APIs.
How Azure Arc Works
```mermaid
flowchart TD
    subgraph Azure ["Azure Control Plane"]
        ARM["Azure Resource Manager\n(Azure Portal)"]
        Policy["Azure Policy Engine"]
        Monitor["Azure Monitor"]
        Defender["Microsoft Defender"]
    end

    subgraph Cluster ["Target Cluster"]
        subgraph Agent ["Arc Agent"]
            CC["cluster-connect\n(reverse proxy)"]
            CA["config-agent\n(GitOps/Flux)"]
            AP["azure-policy\n(Gatekeeper)"]
            OM["monitoring\n(omsagent)"]
        end
    end

    Agent -- "HTTPS (outbound only)\nNo inbound ports needed.\nNo VPN required." --> Azure
```

Connecting a Cluster to Azure Arc
```bash
# Prerequisites: Azure CLI with connectedk8s extension
az extension add --name connectedk8s
az extension add --name k8s-configuration

# Connect an EKS cluster to Azure Arc
# First, ensure your kubeconfig points to the target cluster
export KUBECONFIG=~/.kube/eks-production-config

az connectedk8s connect \
  --name eks-prod-us-east-1 \
  --resource-group rg-arc-fleet \
  --location eastus \
  --tags "provider=aws" "environment=production" "team=platform" \
  --distribution eks \
  --infrastructure aws

# Verify the connection
az connectedk8s show \
  --name eks-prod-us-east-1 \
  --resource-group rg-arc-fleet \
  --query '{Name:name, Status:connectivityStatus, Distribution:distribution, Infrastructure:infrastructure}'

# List all Arc-connected clusters
az connectedk8s list \
  --resource-group rg-arc-fleet \
  --query '[].{Name:name, Status:connectivityStatus, Provider:infrastructure, K8sVersion:kubernetesVersion}' \
  --output table
```

Azure Policy for Arc-Connected Clusters
Once connected, you can push Azure Policies to any cluster in your fleet:
```bash
# Assign a policy to enforce no privileged containers across ALL Arc clusters
az policy assignment create \
  --name deny-privileged-containers \
  --display-name "Deny privileged containers on all Arc clusters" \
  --policy "/providers/Microsoft.Authorization/policyDefinitions/95edb821-ddaf-4404-9732-666045e056b4" \
  --scope "/subscriptions/$SUB_ID/resourceGroups/rg-arc-fleet" \
  --params '{"effect": {"value": "deny"}}'

# This installs OPA Gatekeeper on every Arc-connected cluster
# and deploys the constraint automatically

# Check compliance across the fleet
az policy state list \
  --resource-group rg-arc-fleet \
  --filter "policyDefinitionName eq '95edb821-ddaf-4404-9732-666045e056b4'" \
  --query '[].{Resource:resourceId, Compliance:complianceState}' \
  --output table
```

Pause and predict: If you assign a “deny privileged containers” policy to your fleet, what happens to existing privileged pods that were deployed before the policy was assigned? Will they be terminated? (Hint: Think about how admission controllers work.)
GitOps with Arc (Flux)
```bash
# Deploy a GitOps configuration to all Arc clusters with a specific tag
az k8s-configuration flux create \
  --name platform-baseline \
  --cluster-name eks-prod-us-east-1 \
  --resource-group rg-arc-fleet \
  --cluster-type connectedClusters \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/company/fleet-config.git \
  --branch main \
  --kustomization name=platform path=./platform/base prune=true \
  --kustomization name=monitoring path=./monitoring/overlays/production prune=true \
  --kustomization name=policies path=./policies/production prune=true
```

Google Fleet (GKE Enterprise)
Google’s approach to fleet management is built around the concept of a “fleet” — a logical grouping of GKE and non-GKE clusters that share configuration and policies.
GKE Fleet Architecture
```mermaid
flowchart TD
    subgraph GCP ["GCP Fleet Host Project"]
        API["Fleet API\n(GCP Console)"]
        CS["Config Sync (GitOps)"]
        PC["Policy Controller (OPA)"]
        SM["Service Mesh (Istio/ASM)"]
        BA["Binary Authorization"]
    end

    subgraph Fleet ["Fleet Members"]
        subgraph GKE ["GKE Cluster\n(native fleet member)"]
            GKE_CS["Config Sync"]
            GKE_PC["Policy Ctrl"]
        end
        subgraph EKS ["EKS Cluster\n(attached via agent)"]
            EKS_CS["Config Sync"]
            EKS_PC["Policy Ctrl"]
        end
        subgraph OnPrem ["On-Prem K8s\n(attached)"]
            OP_CS["Config Sync"]
            OP_PC["Policy Ctrl"]
        end
    end

    GCP -->|Fleet Features:\napplied uniformly across all\nmembers regardless of where they run| Fleet
```

Registering Clusters in a Fleet
```bash
# Register a GKE cluster (automatic for GKE clusters in the fleet project)
gcloud container fleet memberships register gke-prod-us \
  --gke-cluster us-central1/gke-prod-us \
  --enable-workload-identity

# Register an external cluster (EKS, AKS, on-prem)
# First, generate a registration manifest
gcloud container fleet memberships register eks-prod-east \
  --context=arn:aws:eks:us-east-1:123456789012:cluster/eks-prod \
  --kubeconfig=/path/to/eks-kubeconfig \
  --enable-workload-identity \
  --public-issuer-url=https://oidc.eks.us-east-1.amazonaws.com/id/ABC123

# List fleet members
gcloud container fleet memberships list \
  --format="table(name, uniqueId, authority.workloadIdentityPool, state.code)"
```

Fleet-Wide Configuration with Config Sync
Config Sync is Google’s GitOps engine, similar to Flux or ArgoCD but tightly integrated with Fleet:
```yaml
# Applied once, syncs to all fleet members
apiVersion: configmanagement.gke.io/v1
kind: ConfigManagement
metadata:
  name: config-management
spec:
  sourceFormat: unstructured
  git:
    syncRepo: https://github.com/company/fleet-config.git
    syncBranch: main
    secretType: token
    policyDir: fleet-policies
  policyController:
    enabled: true
    templateLibraryInstalled: true
    referentialRulesEnabled: true
    logDeniesEnabled: true
    mutationEnabled: true
```

```bash
# Enable Config Sync for the entire fleet
gcloud beta container fleet config-management enable

# Apply configuration to all fleet members
gcloud beta container fleet config-management apply \
  --membership=gke-prod-us \
  --config=config-sync-config.yaml

# Check sync status across the fleet
gcloud beta container fleet config-management status \
  --format="table(Name, Status, Last_Synced_Token, Sync_Errors)"
```

Fleet-Wide Policy with Policy Controller
```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg}] {
          provided := {l | input.review.object.metadata.labels[l]}
          required := {l | l := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Missing required labels: %v", [missing])
        }
---
# fleet-policies/constraints/require-team-label.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels:
      - "team"
      - "cost-center"
```

Centralized Telemetry for Multi-Cloud Fleets
A fleet without centralized observability is a fleet in name only. You need a single place to see the health, performance, and compliance of every cluster.
Telemetry Architecture
```mermaid
flowchart TD
    subgraph Clusters ["Fleet Clusters"]
        AWS["AWS Cluster\nOTel Collector"]
        Azure["Azure Cluster\nOTel Collector"]
        GCP["GCP Cluster\nOTel Collector"]
        OnPrem["On-Prem Cluster\nOTel Collector"]
    end

    subgraph Hub ["CENTRAL TELEMETRY HUB"]
        Metrics["Metrics: Thanos/Cortex"]
        Logs["Logs: Loki/Elasticsearch"]
        Traces["Traces: Tempo/Jaeger"]
        Dashboards["Dashboards: Grafana"]
    end

    AWS --> Hub
    Azure --> Hub
    GCP --> Hub
    OnPrem --> Hub
```

OpenTelemetry Collector for Fleet Telemetry
```yaml
# Deploy on each cluster with cluster-specific labels
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: kubernetes-pods
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      resource:
        attributes:
          - key: cluster.name
            value: "${CLUSTER_NAME}"
            action: upsert
          - key: cluster.provider
            value: "${CLOUD_PROVIDER}"
            action: upsert
          - key: cluster.region
            value: "${CLUSTER_REGION}"
            action: upsert
          - key: cluster.environment
            value: "${ENVIRONMENT}"
            action: upsert
      batch:
        timeout: 30s
        send_batch_size: 1024

    exporters:
      otlphttp/metrics:
        endpoint: https://telemetry-hub.company.com/api/v1/push
        headers:
          Authorization: "Bearer ${TELEMETRY_TOKEN}"
      otlphttp/traces:
        endpoint: https://telemetry-hub.company.com/api/v1/traces
        headers:
          Authorization: "Bearer ${TELEMETRY_TOKEN}"

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [resource, batch]
          exporters: [otlphttp/metrics]
        traces:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [otlphttp/traces]
```

Stop and think: If your central telemetry hub goes down, what happens to the telemetry data generated by your 50 clusters? How should you configure your OTel Collectors to handle this scenario?
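One hedged answer to that question: give each collector's exporters a retry policy and a disk-backed sending queue, so short hub outages buffer telemetry locally instead of dropping it. A minimal sketch, assuming the collector-contrib distribution (which includes the `file_storage` extension); the buffer path and queue size are illustrative:

```yaml
# Sketch: survive hub outages by spilling the export queue to local disk
extensions:
  file_storage:
    directory: /var/lib/otelcol/buffer    # assumed writable hostPath/emptyDir

exporters:
  otlphttp/metrics:
    endpoint: https://telemetry-hub.company.com/api/v1/push
    retry_on_failure:
      enabled: true
      max_elapsed_time: 30m               # keep retrying for up to 30 minutes
    sending_queue:
      enabled: true
      storage: file_storage               # persist queued batches across restarts
      queue_size: 5000

service:
  extensions: [file_storage]
```

Data generated beyond the retry window or queue capacity is still lost, so size the buffer against your longest tolerable hub outage.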
Fleet Health Dashboard Query Examples
```promql
# Cluster count by provider and status
count by (cluster_provider, cluster_environment) (
  up{job="kubernetes-apiservers"}
)

# API server latency P99 across the fleet
histogram_quantile(0.99,
  sum by (cluster_name, le) (
    rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])
  )
)

# Node readiness across the fleet
sum by (cluster_name) (kube_node_status_condition{condition="Ready", status="true"})
/
sum by (cluster_name) (kube_node_info)

# Pod restart rate by cluster (anomaly detection)
sum by (cluster_name) (
  increase(kube_pod_container_status_restarts_total[1h])
) > 50
```

Multi-Cloud GitOps at Scale
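Queries like these can also back an alert for fleet members that go silent. A sketch of a Prometheus rule, where the rule name, lookback window, and thresholds are assumptions to tune for your scrape intervals:

```yaml
# Sketch: fire when a cluster that was reporting 15 minutes ago stops reporting
groups:
  - name: fleet-health
    rules:
      - alert: FleetClusterSilent
        expr: |
          count by (cluster_name) (up{job="kubernetes-apiservers"} offset 15m)
          unless
          count by (cluster_name) (up{job="kubernetes-apiservers"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Cluster {{ $labels.cluster_name }} has stopped reporting telemetry"
```

The `unless` form catches absence, which a simple `up == 0` alert cannot: a cluster that stops shipping metrics produces no samples at all.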
GitOps for a fleet of clusters requires more than a single ArgoCD instance. You need patterns that scale to dozens or hundreds of clusters.
ArgoCD ApplicationSet for Fleet-Wide Deployment
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-platform-services
  namespace: argocd
spec:
  generators:
    # Generate an Application for every cluster registered in ArgoCD
    - clusters:
        selector:
          matchExpressions:
            - key: environment
              operator: In
              values:
                - production
                - staging
  template:
    metadata:
      name: 'platform-{{name}}'
      labels:
        fleet-component: platform
    spec:
      project: fleet-platform
      source:
        repoURL: https://github.com/company/fleet-platform.git
        targetRevision: main
        path: 'clusters/{{metadata.labels.provider}}/{{metadata.labels.environment}}'
      destination:
        server: '{{server}}'
        namespace: platform-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        retry:
          limit: 5
          backoff:
            duration: 5s
            factor: 2
            maxDuration: 3m
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true
```

Fleet Git Repository Structure
```
fleet-platform/
├── base/                            # Shared across all clusters
│   ├── monitoring/
│   │   ├── prometheus.yaml
│   │   ├── otel-collector.yaml
│   │   └── kustomization.yaml
│   ├── policy/
│   │   ├── kyverno-policies.yaml
│   │   └── kustomization.yaml
│   └── security/
│       ├── falco.yaml
│       └── kustomization.yaml
│
├── clusters/
│   ├── aws/
│   │   ├── production/
│   │   │   ├── kustomization.yaml   # patches for AWS prod
│   │   │   └── values-override.yaml
│   │   └── staging/
│   │       └── kustomization.yaml
│   ├── azure/
│   │   ├── production/
│   │   │   └── kustomization.yaml   # patches for Azure prod
│   │   └── staging/
│   │       └── kustomization.yaml
│   └── onprem/
│       └── production/
│           └── kustomization.yaml   # patches for on-prem
│
└── fleet-sync.yaml                  # ApplicationSet definition
```

```yaml
# Example overlay: clusters/aws/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/monitoring
  - ../../../base/policy
  - ../../../base/security
patches:
  - target:
      kind: ConfigMap
      name: otel-collector-config
    patch: |-
      - op: replace
        path: /data/config.yaml
        value: |
          # AWS-specific: export to CloudWatch as well
          exporters:
            awscloudwatchlogs:
              log_group_name: /eks/fleet-telemetry
              log_stream_name: ${CLUSTER_NAME}
  - target:
      kind: ClusterPolicy
      name: require-image-registry
    patch: |-
      - op: replace
        path: /spec/rules/0/validate/pattern/spec/containers/0/image
        value: "123456789012.dkr.ecr.*.amazonaws.com/*"
```

Did You Know?
- Azure Arc has connected over 26,000 Kubernetes clusters as of early 2025, including clusters running on AWS, GCP, and edge devices. Microsoft does not charge for the basic Arc connection — the revenue comes from extensions (Azure Policy, Monitoring, Defender) that cost $6-15 per vCPU/month. A 100-node cluster with 4 vCPUs per node running all extensions can cost $2,400-6,000/month in Arc extension fees alone.
- Google renamed “Anthos” to “GKE Enterprise” in 2023 partly because customers could not pronounce it consistently. The name “Anthos” came from the Greek word for “flower,” symbolizing growth and adaptation. Despite the rebrand, the underlying technology has been remarkably stable — Config Sync, Policy Controller, and Service Mesh (based on Istio) have remained the core pillars since the original Anthos launch in 2019.
- The average enterprise fleet grows by 35% per year in cluster count, according to a 2024 Datadog survey. Organizations with fleet management tooling grow faster (42%/year) than those without (21%/year) because the management overhead per cluster is lower, removing the friction that previously limited cluster creation. This suggests that fleet management tools do not just manage existing complexity — they enable more of it.
- Rancher, originally created by Rancher Labs and now owned by SUSE after a $600 million acquisition in 2020, manages over 180,000 clusters worldwide. Unlike Arc and Fleet which are tied to specific clouds, Rancher is fully vendor-neutral and self-hosted. Its “Fleet” component (confusingly sharing a name concept with Google’s Fleet) handles GitOps-based multi-cluster management and scales to thousands of clusters per management server.
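The per-cluster extension math in the first bullet is easy to sanity-check (the node count, vCPU count, and rates below are the figures quoted above):

```shell
# Back-of-the-envelope check of the quoted Arc extension fees
NODES=100
VCPUS_PER_NODE=4
RATE_LOW=6            # $/vCPU/month, low end of the quoted range
RATE_HIGH=15          # $/vCPU/month, high end
TOTAL_VCPUS=$((NODES * VCPUS_PER_NODE))
COST_LOW=$((TOTAL_VCPUS * RATE_LOW))
COST_HIGH=$((TOTAL_VCPUS * RATE_HIGH))
echo "Extension cost range: \$${COST_LOW}-\$${COST_HIGH}/month"
# prints: Extension cost range: $2400-$6000/month
```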
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Connecting clusters to Arc/Fleet without a strategy | “Let us just connect everything and see what happens.” Clusters flood in without naming conventions, tags, or ownership. | Define a fleet taxonomy first: naming conventions, required tags (provider, environment, team, region), grouping strategy. Then connect clusters. |
| Using fleet management as the sole management layer | “Arc/Fleet replaces our need for per-cluster tools.” But fleet tools provide a subset of cluster management features. | Fleet tools handle cross-cluster concerns (policy, config sync, observability). Per-cluster concerns (node management, storage, networking) still need per-cluster tools. |
| Same configuration for all clusters | “One size fits all — push the same policies and monitoring everywhere.” But dev clusters do not need production-grade monitoring, and on-prem clusters have different storage classes. | Use the base/overlay pattern: shared base configurations with per-environment and per-provider patches. Kustomize overlays are ideal for this. |
| Ignoring fleet telemetry costs | Centralized monitoring of 50 clusters generates enormous data volumes. Teams are surprised by a $30K/month observability bill. | Set retention policies, use sampling for traces, aggregate metrics at the cluster level before shipping. Calculate telemetry costs per cluster before enabling fleet-wide collection. |
| No cluster lifecycle management | Fleet management connects existing clusters but nobody plans for cluster creation, upgrades, or decommissioning at fleet scale. | Combine fleet management with Cluster API or Crossplane for lifecycle. Fleet tools manage what runs in clusters; lifecycle tools manage the clusters themselves. |
| Multi-cloud GitOps without abstractions | Git repo contains provider-specific configs scattered everywhere. Changing a policy means editing 3 different files for 3 providers. | Use the base/overlay pattern shown in this module. Provider-specific differences go in overlay patches, not in the base configuration. |
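The base/overlay fix called out in the table can be sketched with a tiny Kustomize overlay. The file path and patch value below are illustrative; the ConfigMap name reuses `fleet-monitoring-config` from the hands-on exercise later in this module:

```yaml
# Hypothetical staging overlay: same base, lighter monitoring retention
# (would live at clusters/aws/staging/kustomization.yaml)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/monitoring
patches:
  - target:
      kind: ConfigMap
      name: fleet-monitoring-config
    patch: |-
      - op: replace
        path: /data/retention
        value: "24h"   # staging does not need production-grade retention
```

The base never mentions staging; the overlay never repeats the base. That separation is what keeps a fleet-wide change to one file.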
Question 1: Your organization recently acquired a company that runs 15 EKS clusters. You currently manage 20 GKE clusters using Google Fleet. Your CTO asks if you should use Azure Arc or Google Fleet to unify management, noting they heard Arc requires inbound firewall ports. How do you evaluate the architectural differences to advise the CTO?
Azure Arc does not require inbound firewall ports; an agent installed inside the target cluster maintains an outbound-only HTTPS connection to the Azure control plane, so no inbound firewall rules or VPN are needed. Google Fleet also uses an agent for non-GKE clusters, but it is organized around a stronger concept of “fleet features” — capabilities like Config Sync and Policy Controller that are enabled at the fleet level and distributed to all members. Because you already use Google Fleet for 20 GKE clusters, attaching the 15 EKS clusters to your existing fleet is likely the most seamless approach. Fleet excels at homogeneous configuration distribution, whereas Arc would be the better fit if you were heavily invested in Azure management tools (like Azure Monitor or Defender) and wanted to enable extensions à la carte per cluster.
Question 2: Your company has 40 clusters: 25 on AWS, 10 on Azure, and 5 on-premises. You are tasked with selecting a fleet management platform and need to choose between Azure Arc and Google Fleet. What factors should drive the decision for your specific environment?
You need to consider several key factors to make an informed decision. First, evaluate your existing cloud investment: if the company already uses Azure AD for identity, Azure Monitor for observability, and Azure DevOps for CI/CD, Arc integrates seamlessly with these tools. Second, assess team expertise; Arc requires Azure knowledge, while Fleet requires GCP knowledge, so you should lean towards the ecosystem your team already understands. Third, analyze your feature requirements, such as whether you need Fleet’s built-in Istio service mesh or Arc’s integration with Windows containers. Finally, compare the total cost for your fleet size and consider vendor neutrality, potentially evaluating self-hosted options like Rancher or EKS Connector with ArgoCD if avoiding a third cloud dependency is a priority.
Question 3: Your platform team is struggling to manage GitOps deployments across 50 clusters spread across AWS, Azure, and on-premises environments. Engineers are currently copying and pasting entire configuration repositories for each cluster, leading to massive configuration drift. How should you restructure your GitOps approach to solve this multi-cloud configuration problem?
You should implement the base/overlay pattern using a tool like Kustomize to separate what you want to deploy from how it differs per environment. The base directory will contain configurations that are identical across all clusters, such as the standard monitoring stack, core policy definitions, and security tools. Overlay directories will contain patches specific to each cloud provider or environment, such as overriding the image registry URL for AWS (ECR) versus Azure (ACR). When you need to update a fleet-wide policy, you only change the base, and the change propagates to all clusters automatically. This completely eliminates the need to maintain multiple distinct copies of configuration files, ensuring consistency at scale while cleanly managing provider-specific deviations.
Question 4: A fleet-wide policy update is pushed via GitOps. It works on 39 out of 40 clusters. The 40th cluster (an on-premises kubeadm cluster running Kubernetes 1.28) rejects the policy. What happened and how do you handle it?
The most likely cause is a Kubernetes version incompatibility between the policy manifest and the older API server. If the policy uses an API version or feature only available in Kubernetes 1.30+ (such as a newer Gatekeeper constraint template format), the 1.28 cluster’s API server will reject the resource during the apply phase. This highlights a common fleet management challenge where version skew across clusters breaks unified deployments. To handle this, you should include version gates in your GitOps configuration, such as using ArgoCD ApplicationSets to filter which policies sync based on cluster version labels. Alternatively, you can use conditional patches in Kustomize overlays to adjust policies for older clusters, or implement automated pre-flight checks in your pipeline to validate manifests against target cluster versions before attempting to sync.
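One way to express that version gate in an ArgoCD ApplicationSet is a label selector over the cluster secrets. A sketch, where the `k8s-version-minor` label is a hypothetical convention your registration process would have to maintain on each cluster secret:

```yaml
# Sketch: skip the new policy bundle on clusters still below Kubernetes 1.30
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-policies-v2
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: k8s-version-minor         # assumed label on cluster secrets
              operator: NotIn
              values: ["27", "28", "29"]
  template:
    metadata:
      name: 'policies-v2-{{name}}'
    spec:
      project: fleet-platform
      source:
        repoURL: https://github.com/company/fleet-config.git
        targetRevision: main
        path: policies/v2
      destination:
        server: '{{server}}'
        namespace: gatekeeper-system
```

The older clusters would keep syncing a v1 policy path from a separate ApplicationSet until they are upgraded.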
Question 5: The CFO noticed a massive spike in cloud spending after your team enabled centralized monitoring for a 50-cluster fleet. The observability platform bill is now $40,000 per month. What are the primary drivers of these telemetry costs at fleet scale, and how can you architect a solution to reduce them?
The primary drivers of telemetry costs at fleet scale are the sheer volume of active metric time series and log ingestion data. A single Kubernetes cluster can generate up to 100,000 active metric time series and gigabytes of logs per day, which scales linearly and becomes prohibitively expensive when using managed services charging per metric or gigabyte. To reduce these costs, you should implement filtering at the source, shipping only golden signals (latency, traffic, errors, saturation) to the central hub rather than all available metrics. You can also aggregate per-pod metrics into per-deployment metrics before shipping and apply strict sampling rates to distributed traces (e.g., 1% for normal traffic, 100% for errors). Finally, setting aggressive retention tiers—keeping data hot for only 7 days and moving the rest to cold storage like S3—will drastically cut the ongoing storage costs.
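The scale of the problem is easy to quantify. The per-cluster series count below is the figure from the answer above; the raw trace volume is an assumed illustration:

```shell
# Rough fleet telemetry volume math
CLUSTERS=50
SERIES_PER_CLUSTER=100000            # active metric time series per cluster
TOTAL_SERIES=$((CLUSTERS * SERIES_PER_CLUSTER))
echo "Active metric series fleet-wide: $TOTAL_SERIES"
# prints: Active metric series fleet-wide: 5000000

TRACES_PER_DAY=2000000               # assumption: 2M raw traces/day fleet-wide
SAMPLED=$((TRACES_PER_DAY / 100))    # keep 1% of normal-traffic traces
echo "Traces shipped per day at 1% sampling: $SAMPLED"
# prints: Traces shipped per day at 1% sampling: 20000
```

Five million active series is what a managed vendor bills you for before any filtering, which is why dropping non-golden-signal metrics at the source matters more than any downstream tuning.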
Question 6: An executive mandates that since the company has adopted Azure Arc for fleet management, you must uninstall all provider-specific tools like `eksctl` and Terraform AWS providers to "simplify tooling". Why will this mandate cause operational failures, and what specific examples should you provide to push back?
This mandate will cause operational failures because fleet management tools handle cross-cluster concerns, not the deep, provider-specific cluster-internal operations required for lifecycle management. Fleet tools act as a coordination layer for policy, configuration, and observability, but they lack the APIs to interact with the underlying cloud infrastructure. For example, while Azure Arc can push an OPA Gatekeeper policy to an EKS cluster, it cannot configure Karpenter provisioners or adjust EKS managed node group scaling parameters, which still require AWS-specific tools. Similarly, Arc can deploy a Kubernetes StorageClass manifest, but it cannot provision or resize the underlying AWS EBS volumes or Azure Disks, meaning per-cluster management tools remain absolutely essential.
Hands-On Exercise: Build a Multi-Cluster Fleet with GitOps
In this exercise, you will create a fleet of three kind clusters simulating different environments, implement centralized GitOps, and build a fleet inventory and health dashboard.
What you will build:
```mermaid
flowchart TD
    subgraph Mgmt ["Management Cluster (ArgoCD)"]
        Argo["ArgoCD ApplicationSets\ndeploy platform services\nto all fleet members"]
    end

    AWS["fleet-aws-prod\n(kind, simulates AWS)"]
    Azure["fleet-azure-staging\n(kind, simulates Azure)"]
    OnPrem["fleet-onprem-prod\n(kind, simulates on-prem)"]

    Mgmt --> AWS
    Mgmt --> Azure
    Mgmt --> OnPrem
```

Task 1: Create the Fleet Clusters
Solution
```bash
# Create three clusters
for CLUSTER in fleet-mgmt fleet-aws-prod fleet-azure-staging; do
  kind create cluster --name $CLUSTER
done

# Verify all clusters are running
for CLUSTER in fleet-mgmt fleet-aws-prod fleet-azure-staging; do
  echo "=== $CLUSTER ==="
  kubectl --context kind-$CLUSTER get nodes
done
```

Task 2: Install ArgoCD on the Management Cluster
Solution
```bash
# Install ArgoCD on management cluster
kubectl --context kind-fleet-mgmt create namespace argocd
kubectl --context kind-fleet-mgmt apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for ArgoCD to be ready
kubectl --context kind-fleet-mgmt wait --for=condition=available \
  deployment/argocd-server -n argocd --timeout=120s

# Get the admin password
ARGOCD_PW=$(kubectl --context kind-fleet-mgmt -n argocd \
  get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d)
echo "ArgoCD admin password: $ARGOCD_PW"
```

Task 3: Register Fleet Clusters in ArgoCD
Solution
```bash
# Add fleet clusters to ArgoCD as declarative cluster Secrets (no argocd CLI needed).
# Note: the 127.0.0.1 server URLs in the host kubeconfig are unreachable from inside
# the management cluster; kind nodes share a docker network, so use the node names.
for CLUSTER_INFO in "fleet-aws-prod:aws:production:kind-fleet-aws-prod" "fleet-azure-staging:azure:staging:kind-fleet-azure-staging"; do
  NAME=$(echo $CLUSTER_INFO | cut -d: -f1)
  PROVIDER=$(echo $CLUSTER_INFO | cut -d: -f2)
  ENV=$(echo $CLUSTER_INFO | cut -d: -f3)
  CTX=$(echo $CLUSTER_INFO | cut -d: -f4)

  SERVER="https://${NAME}-control-plane:6443"

  # Create a ServiceAccount for ArgoCD in the target cluster
  kubectl --context $CTX create serviceaccount argocd-manager -n kube-system 2>/dev/null || true
  kubectl --context $CTX create clusterrolebinding argocd-manager \
    --clusterrole=cluster-admin --serviceaccount=kube-system:argocd-manager 2>/dev/null || true

  # Mint a bearer token for that ServiceAccount (works on Kubernetes 1.24+)
  TOKEN=$(kubectl --context $CTX -n kube-system create token argocd-manager)

  cat <<EOF | kubectl --context kind-fleet-mgmt apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: cluster-$NAME
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    provider: $PROVIDER
    environment: $ENV
type: Opaque
stringData:
  name: $NAME
  server: $SERVER
  config: |
    {
      "bearerToken": "$TOKEN",
      "tlsClientConfig": { "insecure": true }
    }
EOF

  echo "Registered cluster: $NAME (provider=$PROVIDER, env=$ENV)"
done

# Verify clusters are registered
kubectl --context kind-fleet-mgmt get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster
```

Task 4: Deploy Platform Services Across the Fleet
Solution
```bash
# Create platform baseline configmaps on each cluster via ArgoCD Applications
# Since we don't have a Git repo, we'll use a direct approach to demonstrate the pattern

for CTX in kind-fleet-aws-prod kind-fleet-azure-staging; do
  CLUSTER_NAME=$(echo $CTX | sed 's/kind-//')

  # Create platform namespace
  kubectl --context $CTX create namespace platform-system 2>/dev/null || true

  # Deploy fleet-standard configuration
  cat <<EOF | kubectl --context $CTX apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-identity
  namespace: platform-system
  labels:
    managed-by: fleet-management
data:
  cluster-name: "$CLUSTER_NAME"
  fleet-version: "1.0.0"
  managed-by: "fleet-mgmt-cluster"
  registered-at: "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-monitoring-config
  namespace: platform-system
  labels:
    managed-by: fleet-management
data:
  scrape-interval: "30s"
  retention: "7d"
  external-labels: |
    cluster=$CLUSTER_NAME
    fleet=enterprise
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fleet-default-deny
  namespace: platform-system
  labels:
    managed-by: fleet-management
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to: []
      ports:
        - protocol: TCP
          port: 443
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
EOF

  echo "Platform services deployed to $CLUSTER_NAME"
done
```

Task 5: Build a Fleet Inventory and Health Report
Solution
```bash
cat <<'SCRIPT' > /tmp/fleet-report.sh
#!/bin/bash
echo "============================================="
echo " FLEET INVENTORY & HEALTH REPORT"
echo " $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "============================================="

TOTAL_NODES=0
TOTAL_PODS=0
TOTAL_CLUSTERS=0
HEALTHY=0

for CTX in kind-fleet-aws-prod kind-fleet-azure-staging; do
  CLUSTER=$(echo $CTX | sed 's/kind-//')
  TOTAL_CLUSTERS=$((TOTAL_CLUSTERS + 1))

  echo ""
  echo "--- Cluster: $CLUSTER ---"

  # Fleet registration check
  FLEET_ID=$(kubectl --context $CTX get configmap fleet-identity -n platform-system -o jsonpath='{.data.cluster-name}' 2>/dev/null || echo "NOT REGISTERED")
  echo " Fleet ID: $FLEET_ID"

  # Node health
  NODES=$(kubectl --context $CTX get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
  READY_NODES=$(kubectl --context $CTX get nodes --no-headers 2>/dev/null | grep " Ready" | wc -l | tr -d ' ')
  echo " Nodes: $READY_NODES/$NODES ready"
  TOTAL_NODES=$((TOTAL_NODES + NODES))

  # Pod count
  PODS=$(kubectl --context $CTX get pods -A --no-headers --field-selector=status.phase=Running 2>/dev/null | wc -l | tr -d ' ')
  echo " Running Pods: $PODS"
  TOTAL_PODS=$((TOTAL_PODS + PODS))

  # K8s version, read from node status
  # (`kubectl version --short` was removed in newer kubectl releases)
  VERSION=$(kubectl --context $CTX get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}' 2>/dev/null)
  echo " Kubernetes Version: $VERSION"

  # Fleet services check
  FLEET_CONFIGS=$(kubectl --context $CTX get configmap -n platform-system -l managed-by=fleet-management --no-headers 2>/dev/null | wc -l | tr -d ' ')
  NETPOLS=$(kubectl --context $CTX get networkpolicy -n platform-system -l managed-by=fleet-management --no-headers 2>/dev/null | wc -l | tr -d ' ')

  if [ "$FLEET_CONFIGS" -ge 2 ] && [ "$NETPOLS" -ge 1 ]; then
    echo " Fleet Services: HEALTHY ($FLEET_CONFIGS configs, $NETPOLS netpols)"
    HEALTHY=$((HEALTHY + 1))
  else
    echo " Fleet Services: DEGRADED (configs=$FLEET_CONFIGS, netpols=$NETPOLS)"
  fi
done

echo ""
echo "============================================="
echo " FLEET SUMMARY"
echo "============================================="
echo " Total Clusters: $TOTAL_CLUSTERS"
echo " Healthy Clusters: $HEALTHY/$TOTAL_CLUSTERS"
echo " Total Nodes: $TOTAL_NODES"
echo " Total Running Pods: $TOTAL_PODS"
echo " Fleet Health: $(( (HEALTHY * 100) / TOTAL_CLUSTERS ))%"
echo "============================================="
SCRIPT

chmod +x /tmp/fleet-report.sh
bash /tmp/fleet-report.sh
```

Clean Up
```bash
kind delete cluster --name fleet-mgmt
kind delete cluster --name fleet-aws-prod
kind delete cluster --name fleet-azure-staging
docker network rm hybrid-net 2>/dev/null || true
rm /tmp/fleet-report.sh
```

Success Criteria
- I created three kind clusters simulating a multi-cloud fleet
- I installed ArgoCD on the management cluster
- I registered fleet clusters in ArgoCD with provider and environment labels
- I deployed standardized platform services across all fleet members
- I built a fleet inventory and health report
- I can explain the architectural differences between Azure Arc and Google Fleet
- I can describe the base/overlay pattern for multi-cloud GitOps
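To make the last criterion concrete, here is a minimal sketch of the base/overlay layout a fleet config repo might use: a shared base that every cluster receives, plus a per-environment overlay that patches only what differs. All paths and values here are hypothetical; `kubectl kustomize <dir>` would render an overlay into the manifests ArgoCD or Flux applies.

```shell
# Sketch: one shared base, per-environment overlays (hypothetical repo layout)
mkdir -p /tmp/fleet-config/platform/base /tmp/fleet-config/platform/overlays/production

# The base carries the fleet-wide default monitoring settings
cat <<'EOF' > /tmp/fleet-config/platform/base/fleet-monitoring-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-monitoring-config
  namespace: platform-system
data:
  scrape-interval: "30s"
  retention: "7d"
EOF

cat <<'EOF' > /tmp/fleet-config/platform/base/kustomization.yaml
resources:
  - fleet-monitoring-config.yaml
EOF

# The production overlay inherits the base and overrides only retention
cat <<'EOF' > /tmp/fleet-config/platform/overlays/production/kustomization.yaml
resources:
  - ../../base
patches:
  - target:
      kind: ConfigMap
      name: fleet-monitoring-config
    patch: |-
      - op: replace
        path: /data/retention
        value: "30d"
EOF

find /tmp/fleet-config -type f | sort
```

The point of the pattern: central platform teams own the base, while environment or cloud differences live in small, reviewable overlay patches instead of forked copies of every manifest.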
Next Module
Now that you can manage a fleet of clusters, it is time to learn how to provision them declaratively. Head to Module 10.6: Multi-Cloud Provisioning with Cluster API to learn how CAPI and its providers (CAPA, CAPZ, CAPG) let you create, upgrade, and scale Kubernetes clusters across any infrastructure using Kubernetes-native APIs.
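As a preview of what "Kubernetes-native APIs" means in practice: with CAPI, a cluster itself is just another declarative object you apply and reconcile. The sketch below only writes a minimal, hypothetical Cluster manifest to a file; every name, version, and the CIDR is a placeholder, and Module 10.6 covers the real workflow.

```shell
# Sketch: a CAPI cluster definition is an ordinary Kubernetes object.
# All values below are placeholders for illustration only.
cat <<'EOF' > /tmp/capi-cluster-preview.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: fleet-aws-prod-2
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:          # provided by the infra provider (CAPA for AWS)
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: fleet-aws-prod-2
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: fleet-aws-prod-2-control-plane
EOF
echo "wrote /tmp/capi-cluster-preview.yaml"
```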