Module 7.5: Azure Kubernetes Fleet Manager & Multi-Cluster Operations
AKS Deep Dive | Complexity: [ADVANCED] | Time: 2.5h
As organizations scale their Kubernetes footprints, managing a single sprawling cluster often becomes untenable due to blast radius concerns, hard limits, or multi-region requirements. The natural evolution is multi-cluster architecture, but this introduces massive operational overhead: how do you coordinate upgrades, enforce policies, and distribute workloads consistently across dozens of clusters?
Enter Azure Kubernetes Fleet Manager (Fleet).
Fleet provides a centralized control plane to manage multiple AKS clusters (and Azure Arc-enabled Kubernetes clusters) as a single, cohesive entity. It solves the “n-cluster problem” by introducing fleet-level workload placement, coordinated multi-cluster upgrades, and unified governance.
Hands-On Exercise
Goal: Build a two-cluster AKS Fleet, propagate an application from the Fleet hub to both member clusters, observe reconciliation after drift, and define a staged multi-cluster upgrade strategy.
- Set the lab variables and install the Fleet CLI extension.

      export SUBSCRIPTION_ID=$(az account show --query id -o tsv)
      export GROUP=rg-aks-fleet-lab
      export FLEET=aks-fleet-lab
      export CLUSTER_EAST=aks-fleet-east
      export CLUSTER_WEST=aks-fleet-west
      export EAST_MEMBER=member-east
      export WEST_MEMBER=member-west
      export EAST_LOCATION=eastus
      export WEST_LOCATION=westus2
      export STRATEGY=safe-rollout
      az account set --subscription "${SUBSCRIPTION_ID}"
      az extension add --name fleet
      az extension update --name fleet

  Verification:
      az account show --query "{subscription:id,user:user.name}" -o table
      az extension show --name fleet --query version -o tsv

- Create a resource group and deploy two AKS clusters in different regions.

      az group create --name "${GROUP}" --location "${EAST_LOCATION}"
      az aks create \
        --resource-group "${GROUP}" \
        --name "${CLUSTER_EAST}" \
        --location "${EAST_LOCATION}" \
        --node-count 1 \
        --generate-ssh-keys
      az aks create \
        --resource-group "${GROUP}" \
        --name "${CLUSTER_WEST}" \
        --location "${WEST_LOCATION}" \
        --node-count 1 \
        --generate-ssh-keys

  Verification:
      az aks list --resource-group "${GROUP}" --query "[].{name:name,location:location,power:powerState.code}" -o table

- Create a Fleet hub and join both AKS clusters as Fleet members with separate update groups.

      az fleet create \
        --resource-group "${GROUP}" \
        --name "${FLEET}" \
        --location "${EAST_LOCATION}" \
        --enable-hub
      export EAST_ID=$(az aks show --resource-group "${GROUP}" --name "${CLUSTER_EAST}" --query id -o tsv)
      export WEST_ID=$(az aks show --resource-group "${GROUP}" --name "${CLUSTER_WEST}" --query id -o tsv)
      az fleet member create \
        --resource-group "${GROUP}" \
        --fleet-name "${FLEET}" \
        --name "${EAST_MEMBER}" \
        --member-cluster-id "${EAST_ID}" \
        --update-group stage1
      az fleet member create \
        --resource-group "${GROUP}" \
        --fleet-name "${FLEET}" \
        --name "${WEST_MEMBER}" \
        --member-cluster-id "${WEST_ID}" \
        --update-group stage2

  Verification:
      az fleet member list --resource-group "${GROUP}" --fleet-name "${FLEET}" -o table

- Authorize hub-cluster access and pull kubeconfig contexts for the hub and both members. The hub kubeconfig comes from az fleet get-credentials; the member kubeconfigs come from az aks get-credentials against each AKS cluster.

      export FLEET_ID="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${GROUP}/providers/Microsoft.ContainerService/fleets/${FLEET}"
      export IDENTITY=$(az ad signed-in-user show --query id -o tsv)
      az role assignment create \
        --role "Azure Kubernetes Fleet Manager RBAC Cluster Admin" \
        --assignee "${IDENTITY}" \
        --scope "${FLEET_ID}"
      az fleet get-credentials --resource-group "${GROUP}" --name "${FLEET}" --context "${FLEET}-hub" --overwrite-existing
      az aks get-credentials --resource-group "${GROUP}" --name "${CLUSTER_EAST}" --context "${EAST_MEMBER}-ctx" --overwrite-existing
      az aks get-credentials --resource-group "${GROUP}" --name "${CLUSTER_WEST}" --context "${WEST_MEMBER}-ctx" --overwrite-existing

  Verification:
      kubectl --context "${FLEET}-hub" get memberclusters
      kubectl --context "${FLEET}-hub" get memberclusters -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.fleet\.azure\.com/location}{"\n"}{end}'

- Deploy a sample namespace and application to the Fleet hub cluster.

      cat <<'EOF' | kubectl --context "${FLEET}-hub" apply -f -
      apiVersion: v1
      kind: Namespace
      metadata:
        name: fleet-demo
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: web
        namespace: fleet-demo
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: web
        template:
          metadata:
            labels:
              app: web
          spec:
            containers:
              - name: web
                image: mcr.microsoft.com/oss/nginx/nginx:1.25.5
                ports:
                  - containerPort: 80
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: web
        namespace: fleet-demo
      spec:
        selector:
          app: web
        ports:
          - port: 80
            targetPort: 80
      EOF

  Verification:
      kubectl --context "${FLEET}-hub" -n fleet-demo get deploy,svc,pods

- Create a ClusterResourcePlacement that propagates the namespace and its child resources to all Fleet members.

      cat <<'EOF' | kubectl --context "${FLEET}-hub" apply -f -
      apiVersion: placement.kubernetes-fleet.io/v1
      kind: ClusterResourcePlacement
      metadata:
        name: fleet-demo-all
      spec:
        resourceSelectors:
          - group: ""
            version: v1
            kind: Namespace
            name: fleet-demo
        policy:
          placementType: PickAll
      EOF

  Verification:
      kubectl --context "${FLEET}-hub" get clusterresourceplacement fleet-demo-all
      kubectl --context "${FLEET}-hub" describe clusterresourceplacement fleet-demo-all

- Confirm the workload exists on both member clusters, then create drift on one member and watch Fleet reconcile it.

      kubectl --context "${EAST_MEMBER}-ctx" -n fleet-demo get deploy,svc,pods
      kubectl --context "${WEST_MEMBER}-ctx" -n fleet-demo get deploy,svc,pods
      kubectl --context "${WEST_MEMBER}-ctx" -n fleet-demo delete deployment web
      sleep 20
      kubectl --context "${WEST_MEMBER}-ctx" -n fleet-demo get deployment web

  Verification:
      kubectl --context "${WEST_MEMBER}-ctx" -n fleet-demo get pods
      kubectl --context "${FLEET}-hub" describe clusterresourceplacement fleet-demo-all

- Define a staged Fleet update strategy so one member upgrades before the other.

      cat <<'EOF' > example-stages.json
      {
        "stages": [
          {
            "name": "stage-1-canary",
            "groups": [{"name": "stage1"}],
            "afterStageWaitInSeconds": 900
          },
          {
            "name": "stage-2-production",
            "groups": [{"name": "stage2"}]
          }
        ]
      }
      EOF
      az fleet updatestrategy create \
        --resource-group "${GROUP}" \
        --fleet-name "${FLEET}" \
        --name "${STRATEGY}" \
        --stages example-stages.json

  Verification:
      az fleet updatestrategy show --resource-group "${GROUP}" --fleet-name "${FLEET}" --name "${STRATEGY}" -o yaml
      az aks get-upgrades --resource-group "${GROUP}" --name "${CLUSTER_EAST}" -o table

Success criteria:
- The Fleet hub shows both member clusters as joined.
- fleet-demo-all reports as scheduled and applied from the hub.
- The fleet-demo namespace and web workload exist on both member clusters.
- Deleting the deployment from one member cluster results in Fleet recreating it.
- The Fleet update strategy exists and shows two ordered stages mapped to different update groups.
When to Adopt Fleet Manager
Before diving into the mechanics, it is crucial to understand when you actually need Fleet Manager. Multi-cluster architectures introduce complexity; you should not adopt them prematurely.
Use single-cluster (or a few independent clusters) when:
- You operate in a single region and haven’t hit AKS scalability limits (e.g., 5,000 nodes).
- Your team structure is simple, and blast radius concerns are satisfied by namespaces and RBAC.
- You prefer to manage multi-cluster deployments entirely through an external GitOps tool (like ArgoCD) without needing native Azure coordinated upgrades.
Adopt Azure Kubernetes Fleet Manager when:
- High Availability & Disaster Recovery: You run active-active or active-passive workloads across multiple Azure regions.
- Blast Radius Reduction: You intentionally split workloads across many smaller clusters rather than one massive cluster to minimize the impact of control plane failures or misconfigurations.
- Lifecycle Management at Scale: You need to orchestrate Kubernetes version upgrades across dozens of clusters in a safe, staged manner (e.g., Dev -> Staging -> Prod/Canary -> Prod/Main) without writing complex custom pipelines.
- Hybrid/Edge Footprint: You manage a mix of AKS and on-premises/edge clusters via Azure Arc and need a single pane of glass for policy and placement.
Pause and predict: If you have 50 AKS clusters across 3 regions, how would you upgrade them without Fleet Manager? You would likely need a complex CI/CD pipeline looping through clusters, checking health, and handling rollbacks. Fleet Manager moves this orchestration logic into the Azure platform itself.
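To make the comparison concrete, here is a minimal sketch of the hand-rolled loop described above. The cluster list and resource group names are hypothetical, and the script defaults to a dry run that only prints the `az aks upgrade` commands it would issue. A real pipeline would also need health checks, bake time, and failure handling between clusters.

```shell
#!/bin/sh
# Sketch of the manual approach: upgrade clusters one at a time.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
set -eu

DRY_RUN="${DRY_RUN:-1}"
TARGET_VERSION="1.35.2"   # version used in this module's examples
# Hypothetical "resource-group:cluster-name" pairs, in rollout order.
CLUSTERS="rg-dev:aks-dev-1 rg-prod:aks-prod-east-1 rg-prod:aks-prod-west-1"

run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_all() {
  for entry in ${CLUSTERS}; do
    group="${entry%%:*}"
    name="${entry##*:}"
    # Stop at the first failure so a bad upgrade does not reach later clusters.
    run az aks upgrade --resource-group "${group}" --name "${name}" \
      --kubernetes-version "${TARGET_VERSION}" --yes || return 1
    # Health checks and bake time between clusters would go here.
  done
}

upgrade_all
```

Everything this loop leaves as a comment (ordering, waiting, halting) is what Fleet Manager models as stages, groups, and update runs.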
Architecture and Topology
Fleet Manager operates on a hub-and-spoke topology.
- The Fleet (Hub): An Azure resource that acts as the centralized control plane. Under the hood, a Fleet resource with the “Hub cluster” feature enabled provisions a managed, headless Kubernetes control plane. You do not run user workloads directly on the Hub; it exists solely to store fleet-level custom resources (like placements and update runs) and API objects.
- Member Clusters (Spokes): Standard AKS clusters or Azure Arc-enabled clusters that are joined to the Fleet.
    graph TD
        Fleet[Fleet Manager Hub Control Plane]

        subgraph Region: East US
            ClusterA[AKS Member: app-east-1]
            ClusterB[AKS Member: app-east-2]
        end

        subgraph Region: West Europe
            ClusterC[AKS Member: app-west-1]
        end

        subgraph On-Premises
            ClusterD[Arc Member: factory-edge]
        end

        Fleet -->|FleetMember| ClusterA
        Fleet -->|FleetMember| ClusterB
        Fleet -->|FleetMember| ClusterC
        Fleet -->|FleetMember| ClusterD

        Admin((Platform Admin)) -->|kubectl apply <br> ClusterResourcePlacement| Fleet

Joining a Cluster to a Fleet
Clusters are joined to the Fleet by creating a FleetMember resource. This can be done via the Azure CLI, ARM templates, Bicep, or Terraform.
    # Create the Fleet resource (with a hub cluster)
    az fleet create \
      --resource-group my-fleet-rg \
      --name global-app-fleet \
      --enable-hub

    # Join an existing AKS cluster as a member
    az fleet member create \
      --resource-group my-fleet-rg \
      --fleet-name global-app-fleet \
      --name east-member-1 \
      --member-cluster-id /subscriptions/.../managedClusters/app-east-1

Once joined, the Fleet Hub has the necessary credentials and network line-of-sight to sync resources down to the member clusters.
Fleet-Level Workload Placement
The most powerful feature of Fleet Manager is the ability to deploy Kubernetes resources to the Hub, and have the Hub intelligently distribute them to the member clusters based on rules. This is achieved using the ClusterResourcePlacement Custom Resource Definition (CRD).
Instead of running kubectl apply against 10 different clusters, you authenticate to the Fleet Hub and apply your standard Kubernetes manifests (Deployments, Services, ConfigMaps, etc.). Then, you create a ClusterResourcePlacement to tell the Hub where those resources should go.
Placement Strategies
Fleet supports several placement policies, selected with the placementType field:
- PickAll: Distribute the resources to all member clusters, optionally filtering by cluster labels.
- PickFixed: Distribute the resources to a specific, hardcoded list of member cluster names.
- PickN: Distribute the resources to a specific number of clusters (e.g., “put this workload on exactly 3 clusters that have the label env=prod”).
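As a rough illustration of the two policies that need extra fields, the sketch below writes one PickN and one PickFixed manifest to local files. The placement names, the `env: prod` label, and the member cluster names are assumptions chosen for illustration; the namespace is the `fleet-demo` namespace from the hands-on exercise. The field names (`numberOfClusters`, `clusterNames`, `requiredDuringSchedulingIgnoredDuringExecution`) follow the placement API as I understand it, so check them against the CRD version installed on your hub.

```shell
#!/bin/sh
# Writes two example ClusterResourcePlacement manifests to the current directory.
set -eu

# PickN: place the namespace on 3 clusters that carry the (hypothetical) label env=prod.
cat > crp-pickn.yaml <<'EOF'
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: fleet-demo-pickn
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: fleet-demo
  policy:
    placementType: PickN
    numberOfClusters: 3
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  env: prod
EOF

# PickFixed: place the namespace on an explicit list of member clusters.
cat > crp-pickfixed.yaml <<'EOF'
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: fleet-demo-pickfixed
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: fleet-demo
  policy:
    placementType: PickFixed
    clusterNames:
      - member-east
      - member-west
EOF

# Apply from the hub context, for example:
#   kubectl --context "${FLEET}-hub" apply -f crp-pickn.yaml
```

Only one placement should normally select a given namespace at a time, so treat these as alternatives rather than a pair to apply together.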
Example: Propagating a Frontend App
Let’s say you have a frontend application deployed in the frontend-app namespace on the Hub cluster. You want to deploy this namespace (and everything in it) to all member clusters labeled region: westeurope.
    apiVersion: placement.kubernetes-fleet.io/v1beta1
    kind: ClusterResourcePlacement
    metadata:
      name: frontend-europe-placement
    spec:
      resourceSelectors:
        - group: ""
          version: v1
          kind: Namespace
          name: frontend-app
      policy:
        placementType: PickAll
        affinity:
          clusterAffinity:
            # Cluster selector terms are nested under this required-affinity field.
            requiredDuringSchedulingIgnoredDuringExecution:
              clusterSelectorTerms:
                - labelSelector:
                    matchLabels:
                      region: westeurope

When you apply this to the Hub, the Fleet controller packages the frontend-app namespace, the Deployments, Services, and ConfigMaps within it, and pushes them to the matching member clusters. It also monitors the member clusters to ensure the resources remain synchronized with the Hub’s desired state.
Stop and think: If you delete a Deployment directly on one of the member clusters, what happens? Because the Fleet Hub is the source of truth for placed resources, the Fleet controller will detect the drift and automatically recreate the Deployment on the member cluster to match the Hub’s state.
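Reconciliation is not instantaneous, so a fixed sleep can give a false negative when you test this. The helper below is a small sketch of a polling loop; `retry_until` is a hypothetical name, not part of kubectl or the Azure CLI, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
set -eu

retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    if "$@"; then
      return 0
    fi
    # Only wait if another attempt remains.
    if [ "${i}" -lt "${attempts}" ]; then
      sleep "${delay}"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example, using the contexts from the hands-on exercise:
#   retry_until 12 5 kubectl --context "${WEST_MEMBER}-ctx" -n fleet-demo get deployment web
```

The function returns non-zero if the resource never reappears, which makes it usable as a check in a script or pipeline step.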
Coordinated Multi-Cluster Upgrades
Upgrading Kubernetes versions (e.g., from v1.34 to v1.35) is stressful. Upgrading 50 clusters is a nightmare. Fleet Manager provides an orchestration engine for multi-cluster upgrades using Update Runs, Stages, and Groups.
Instead of upgrading clusters randomly or relying on external CI/CD loops, you model your rollout strategy natively in Azure.
- Update Groups: Logical groupings of clusters (e.g., dev-clusters, canary-clusters, prod-westeurope, prod-eastus).
- Update Stages: Ordered sets of Update Groups. Stages run one after another, and a stage starts only after the previous stage completes successfully; groups within the same stage upgrade in parallel. You can also configure bake times (wait periods) between stages.
- Update Runs: The actual execution of an upgrade, targeting a specific Kubernetes version (e.g., upgrade all clusters to v1.35.2).
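A cluster's update group is a property of its Fleet member resource. It can be set at join time with `--update-group`, as in the hands-on exercise, or changed later. The sketch below shows the later case using the fleet and member names from this module's examples; `canary-group` is an example group name, and the script defaults to a dry run that only prints the command.

```shell
#!/bin/sh
# Move an existing Fleet member into a different update group.
# DRY_RUN=1 (the default) only prints the command.
set -eu

DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# assign_group MEMBER_NAME GROUP_NAME
assign_group() {
  run az fleet member update \
    --resource-group my-fleet-rg \
    --fleet-name global-app-fleet \
    --name "$1" \
    --update-group "$2"
}

assign_group east-member-1 canary-group
```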
Defining an Update Strategy
You can define a reusable FleetUpdateStrategy. The --stages argument takes the path to a JSON file that describes the stages, so write the definition to a file first.

    cat <<'EOF' > stages.json
    {
      "stages": [
        {"name": "Stage1-Dev", "groups": [{"name": "dev-group"}], "afterStageWaitInSeconds": 3600},
        {"name": "Stage2-Canary", "groups": [{"name": "canary-group"}], "afterStageWaitInSeconds": 86400},
        {"name": "Stage3-Prod", "groups": [{"name": "prod-east"}, {"name": "prod-west"}]}
      ]
    }
    EOF

    az fleet updatestrategy create \
      --resource-group my-fleet-rg \
      --fleet-name global-app-fleet \
      --name safe-rollout-strategy \
      --stages stages.json

In this strategy:
- The dev-group upgrades first.
- The system waits 1 hour (3600 seconds) to allow for automated alerts to fire if something is broken.
- The canary-group upgrades.
- The system waits 24 hours (86400 seconds) for bake time.
- The production groups (prod-east and prod-west) upgrade concurrently.
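As a quick sanity check on the timeline, the configured waits alone put a floor on how soon production can begin. This is just arithmetic on the two afterStageWaitInSeconds values above; the time the upgrades themselves take is added on top.

```shell
#!/bin/sh
# Minimum bake time before the production stage can start.
set -eu

stage1_wait=3600    # wait after Stage1-Dev
stage2_wait=86400   # wait after Stage2-Canary

total_wait_seconds=$((stage1_wait + stage2_wait))
total_wait_hours=$((total_wait_seconds / 3600))

echo "Minimum wait before production: ${total_wait_seconds}s (${total_wait_hours}h)"
# prints "Minimum wait before production: 90000s (25h)"
```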
You trigger the upgrade by creating an Update Run that uses this strategy and then starting it:

    az fleet updaterun create \
      --resource-group my-fleet-rg \
      --fleet-name global-app-fleet \
      --name upgrade-to-1-35 \
      --upgrade-type Full \
      --kubernetes-version 1.35.2 \
      --update-strategy-name safe-rollout-strategy

    # Creating the run does not begin the rollout; start it explicitly.
    az fleet updaterun start \
      --resource-group my-fleet-rg \
      --fleet-name global-app-fleet \
      --name upgrade-to-1-35

If a stage fails (e.g., a cluster upgrade fails or workloads become unhealthy and trigger a halt), the Update Run pauses, preventing the bad update from cascading to your production clusters.
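While a run is in progress you will usually want to inspect it, and sometimes stop or restart it by hand. The sketch below wraps the `show`, `stop`, and `start` subcommands of `az fleet updaterun` for the run created above. It defaults to a dry run that only prints the command, and the wrapper function name is hypothetical. Exactly how a halted run resumes is worth confirming against the current Fleet documentation before you rely on it.

```shell
#!/bin/sh
# Inspect or control an update run. Usage: script.sh [show|stop|start]
# DRY_RUN=1 (the default) only prints the command.
set -eu

DRY_RUN="${DRY_RUN:-1}"
GROUP="my-fleet-rg"
FLEET="global-app-fleet"
RUN_NAME="upgrade-to-1-35"

run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# updaterun ACTION, where ACTION is show, stop, or start.
updaterun() {
  run az fleet updaterun "$1" \
    --resource-group "${GROUP}" \
    --fleet-name "${FLEET}" \
    --name "${RUN_NAME}"
}

updaterun "${1:-show}"
```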
GitOps and Policy at Scale
Fleet Manager works alongside Azure’s other at-scale management tools.
GitOps with Flux
While you can manually kubectl apply resources to the Fleet Hub, best practice is to manage the Hub’s state via GitOps. You can install the Flux v2 extension directly onto the Fleet Hub cluster.
- Commit your Kubernetes manifests and ClusterResourcePlacement YAMLs to a Git repository.
- Configure the Flux extension on the Fleet Hub to watch that repository.
- Flux syncs the resources to the Hub.
- Fleet Manager distributes the resources to the member clusters.
This provides a centralized GitOps workflow for a multi-cluster fleet, rather than having to install and configure Flux individually on every single spoke cluster.
Azure Policy
Azure Policy can be assigned at the resource group or subscription level containing your AKS clusters. When those clusters are Fleet members, assigning policy at a scope that contains all of them keeps enforcement consistent across the fleet. For instance, you can use Azure Policy to enforce that all clusters in a specific Update Group have the correct labels applied, or that certain privileged containers are blocked globally across the fleet.
Multi-Cluster Observability
To monitor a fleet, you must aggregate telemetry. The standard pattern is configuring all member AKS clusters to send their metrics and logs to a centralized Azure Monitor Workspace (for Managed Prometheus) and a centralized Log Analytics Workspace. Azure Managed Grafana can then connect to the Azure Monitor Workspace, allowing you to build dashboards that query metrics across the entire fleet, filtering by cluster name or region labels.
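One way to wire up the metrics half of this pattern is to enable Managed Prometheus on every member cluster against the same workspace. The sketch below assumes the member list and a workspace resource ID supplied through the hypothetical `MONITOR_WORKSPACE_ID` variable (left as a placeholder here), and it defaults to a dry run that only prints the `az aks update` commands.

```shell
#!/bin/sh
# Point every member cluster's managed Prometheus metrics at one shared workspace.
# DRY_RUN=1 (the default) only prints the commands.
set -eu

DRY_RUN="${DRY_RUN:-1}"
# Placeholder: supply the real Azure Monitor workspace resource ID.
MONITOR_WORKSPACE_ID="${MONITOR_WORKSPACE_ID:-<azure-monitor-workspace-resource-id>}"
# Hypothetical "resource-group:cluster-name" pairs.
MEMBERS="my-fleet-rg:app-east-1 my-fleet-rg:app-west-1"

run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

enable_metrics_all() {
  for entry in ${MEMBERS}; do
    run az aks update \
      --resource-group "${entry%%:*}" \
      --name "${entry##*:}" \
      --enable-azure-monitor-metrics \
      --azure-monitor-workspace-resource-id "${MONITOR_WORKSPACE_ID}"
  done
}

enable_metrics_all
```

Log collection to a shared Log Analytics Workspace is configured separately and is not covered by this sketch.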
Knowledge Check
Scenario 1
You are the platform engineer for an e-commerce company running 12 AKS clusters across 4 regions. You have defined a ClusterResourcePlacement on your Fleet Hub to deploy a new microservice to all 12 clusters. You commit the YAML to your Git repository, Flux syncs it to the Hub, but the microservice only appears on 3 of the clusters. You check the Hub, and the ClusterResourcePlacement status shows it successfully matched and applied to all 12 clusters.
What is the most likely cause of this discrepancy?
- A) The Fleet Manager controller is experiencing high latency and the rollout to the remaining 9 clusters is just delayed.
- B) The PickN placement strategy was accidentally configured to limit the deployment to 3 clusters.
- C) The workloads on the 9 missing clusters were deployed, but a local GitOps agent (like ArgoCD or Flux) installed directly on those member clusters immediately deleted or overwrote the Fleet-managed resources because they drifted from the local agent’s Git source.
- D) The Azure region hosting the 9 missing clusters does not support Fleet Manager.
Explanation
Correct Answer: C
If the Hub reports successful placement to all 12 clusters, it means the Fleet controller successfully communicated with the API servers of those member clusters and applied the manifests. However, if a member cluster has its own local GitOps controller (like ArgoCD) running, and that controller is configured to manage the same namespaces or resources, it will view the Fleet’s changes as drift. The local GitOps agent will immediately reconcile the cluster state back to its Git source, effectively deleting or undoing the resources placed by Fleet Manager. When using Fleet Manager for workload placement, you must ensure that local cluster controllers do not have conflicting management scopes. Answer B is incorrect because the scenario states the status showed it matched all 12 clusters. Answer D is incorrect as Fleet member clusters can be in any region.
Scenario 2
Your organization is preparing to upgrade its entire fleet of 40 AKS clusters from Kubernetes v1.34 to v1.35. You have created a FleetUpdateStrategy with three stages: Dev, Staging, and Production, with a 12-hour wait time between Staging and Production. During the Staging stage upgrade, one of the 5 clusters in the staging group fails its node image upgrade due to a custom daemonset blocking node drains.
How will the Fleet Manager Update Run behave in this situation?
- A) It will immediately rollback the failed staging cluster to v1.34, continue upgrading the other 4 staging clusters, and then proceed to the Production stage.
- B) It will halt the entire Update Run at the Staging stage. The Production stage will not begin until the failed cluster is remediated and the run is resumed.
- C) It will skip the failed cluster, mark the Staging stage as partially complete, wait the 12 hours, and then automatically start the Production stage.
- D) It will force-delete the blocking daemonset, retry the upgrade on the failed cluster, and proceed to Production.
Explanation
Correct Answer: B
Azure Kubernetes Fleet Manager’s update orchestration is designed for safety. If a cluster upgrade fails within a stage, the default behavior of the Update Run is to halt. It will not automatically proceed to the next stage (Production). This is the primary value proposition of stages: preventing a bad upgrade or systemic issue from cascading to your most critical environments. An administrator must investigate the failure on the specific staging cluster, resolve the issue (e.g., fix the pod disruption budgets or daemonset blocking the drain), and then resume the Update Run. Fleet Manager does not currently perform automatic cluster-level rollbacks of Kubernetes versions (Answer A), nor does it forcefully delete user workloads to bypass drain failures (Answer D).
Sources
- Azure Kubernetes Fleet Manager overview: Microsoft’s canonical reference for the Fleet hub + member-cluster model, supported topologies, and the “n-cluster problem” this module frames.
- Orchestrate cluster updates across clusters with Azure Kubernetes Fleet Manager: Authoritative source for FleetUpdateStrategy, staged Update Runs, and the halt-on-failure behavior referenced in Scenario 2.
- Propagate resources from a Fleet Manager hub cluster to member clusters: Describes ClusterResourcePlacement semantics and how the hub reconciles workload placement to members.
- Multi-cluster load balancing with Azure Kubernetes Fleet Manager: Reference for multi-cluster service discovery and cross-cluster traffic policies.
- Fleet Manager and GitOps (Flux/ArgoCD) coexistence: Context for the Scenario 1 conflict between Fleet-driven placement and a cluster-local GitOps controller reconciling to a different source of truth.
- Kubernetes 1.35 release notes: Upstream release cadence referenced in the v1.34 → v1.35 fleet upgrade example.