Module 3.5: Cluster API (CAPI)

Toolkit Track | Complexity: [COMPLEX] | Time: ~50 minutes

It was 2 AM when the Slack alert fired: “Production cluster unreachable.” A platform engineer at a mid-sized fintech had been manually upgrading their 50 Kubernetes clusters, one kubeadm upgrade at a time. Cluster 37 got a botched etcd migration—wrong flags, no rollback plan. The blast radius? Payment processing for 200,000 users, down for four hours. The post-mortem was brutal: “We treat clusters like pets. We need cattle.” That team adopted Cluster API. They never had another manual upgrade incident.

Cluster API (CAPI) brings the declarative, reconciliation-driven model of Kubernetes to Kubernetes itself. You define your clusters as YAML, apply them to a management cluster, and CAPI handles provisioning, scaling, upgrading, and deleting workload clusters across any infrastructure—AWS, Azure, GCP, vSphere, bare metal, even Docker on your laptop.

What You’ll Learn:

  • CAPI architecture: management clusters, workload clusters, providers
  • Core resources: Cluster, Machine, MachineDeployment, MachineSet
  • ClusterClass for templated, repeatable cluster creation
  • Hands-on with Docker provider (CAPD) for local testing
  • Infrastructure provider landscape and GitOps integration

Prerequisites:


After completing this module, you will be able to:

  • Deploy Cluster API management clusters and provision target workload clusters declaratively
  • Configure infrastructure providers (AWS, Azure, vSphere) for automated Kubernetes cluster lifecycle
  • Implement cluster upgrades, scaling, and machine health checks using Cluster API resources
  • Evaluate Cluster API’s declarative approach against Terraform for Kubernetes cluster fleet management

Managing one Kubernetes cluster is hard enough. Managing 10 is a full-time job. Managing 50+ without automation is a recipe for outages, configuration drift, and burnout. Cluster API treats clusters as disposable, reproducible infrastructure—the same way Deployments treat Pods. If a cluster drifts, reconcile it. If you need 20 more, declare them.

Did You Know?

  • Cluster API was started by the Kubernetes SIG Cluster Lifecycle team in 2018—the same group responsible for kubeadm. They saw teams reinventing cluster provisioning and wanted a universal, declarative solution.
  • CAPI manages over 15,000 production clusters at companies like Deutsche Telekom, Giant Swarm, and SUSE. It is an official Kubernetes subproject with providers for every major cloud and bare-metal platform.
  • The Docker provider (CAPD) creates real multi-node Kubernetes clusters using Docker containers as “machines.” This means you can test full cluster lifecycle operations—create, scale, upgrade, delete—on your laptop without spending a cent on cloud resources.
  • ClusterClass, introduced in CAPI v1beta1, lets you define a cluster “template” once and stamp out hundreds of consistent clusters. Think of it as a Helm chart, but for entire Kubernetes clusters.

The central idea: one Kubernetes cluster (the management cluster) runs CAPI controllers that provision and manage other Kubernetes clusters (the workload clusters).

CLUSTER API ARCHITECTURE
════════════════════════════════════════════════════════════════════
┌───────────────────────────────────────────────────────────────┐
│                      MANAGEMENT CLUSTER                       │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                  CAPI CORE CONTROLLERS                  │  │
│  │                                                         │  │
│  │  Cluster Controller        Machine Controller           │  │
│  │  MachineSet Controller     MachineDeployment Controller │  │
│  │  ClusterClass Controller                                │  │
│  └──────────────────────────┬──────────────────────────────┘  │
│                             │                                 │
│  ┌──────────────┐  ┌────────▼───────┐  ┌──────────────────┐   │
│  │ Bootstrap    │  │ Infrastructure │  │  Control Plane   │   │
│  │ Provider     │  │    Provider    │  │    Provider      │   │
│  │ (CABPK)      │  │ (CAPA/CAPZ/..) │  │      (KCP)       │   │
│  │              │  │                │  │                  │   │
│  │ Generates    │  │  Creates VMs/  │  │ Manages etcd +   │   │
│  │ cloud-init   │  │  networks/LBs  │  │ control plane    │   │
│  └──────────────┘  └────────────────┘  └──────────────────┘   │
│                             │                                 │
└─────────────────────────────┼─────────────────────────────────┘
                              │ provisions & manages
              ┌───────────────┼────────────────┐
              │               │                │
              ▼               ▼                ▼
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │   Workload   │ │   Workload   │ │   Workload   │
      │  Cluster A   │ │  Cluster B   │ │  Cluster C   │
      │ (prod-east)  │ │ (prod-west)  │ │  (staging)   │
      │              │ │              │ │              │
      │ CP: 3 nodes  │ │ CP: 3 nodes  │ │ CP: 1 node   │
      │ W:  50 nodes │ │ W:  30 nodes │ │ W:  5 nodes  │
      └──────────────┘ └──────────────┘ └──────────────┘
| Provider Type | Purpose | Examples |
| --- | --- | --- |
| Infrastructure | Creates machines, networks, load balancers | CAPA (AWS), CAPZ (Azure), CAPG (GCP), CAPV (vSphere), CAPD (Docker) |
| Bootstrap | Generates node boot scripts (cloud-init) | CABPK (kubeadm), Talos, MicroK8s |
| Control Plane | Manages control plane nodes and etcd | KubeadmControlPlane (KCP), Talos, k0s |

The beauty is composability: mix any infrastructure provider with any bootstrap provider. Want kubeadm-bootstrapped clusters on AWS? CAPA + CABPK. Want Talos Linux on bare metal? CAPM3 + Talos bootstrap.


CAPI introduces several CRDs that model the cluster lifecycle declaratively.

CAPI RESOURCE HIERARCHY
════════════════════════════════════════════════════════════════
Cluster (top-level, owns everything)
├── InfrastructureCluster (provider-specific: AWSCluster, DockerCluster)
├── KubeadmControlPlane (KCP) (manages control plane Machines)
│   └── Machine (one per control plane node)
│       └── InfraMachine (AWSMachine, DockerMachine)
└── MachineDeployment (manages worker nodes, like Deployment)
    └── MachineSet (like ReplicaSet)
        └── Machine (one per worker node)
            └── InfraMachine

Cluster is the root resource. It defines the cluster network CIDRs and references two provider-specific objects: a control plane resource (typically a KubeadmControlPlane) and an infrastructure cluster resource such as AWSCluster or DockerCluster.
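As a concrete sketch, a minimal Cluster object for the Docker provider might look like this (the name and CIDR block are illustrative, not from the source):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster                  # hypothetical cluster name
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]   # Pod CIDR, must match your CNI config
  controlPlaneRef:                  # who manages the control plane nodes
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: my-cluster-control-plane
  infrastructureRef:                # who creates the underlying infrastructure
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster
    name: my-cluster
```

Swapping DockerCluster for AWSCluster (and the matching machine templates) is all it takes to target a different platform — this is the composability described earlier.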

A MachineDeployment works exactly like a Kubernetes Deployment, but for cluster nodes instead of Pods. It manages MachineSets, which manage Machines, which map to actual VMs or containers. Each Machine references a bootstrap config (how to install Kubernetes) and an infrastructure machine template (what VM specs to use).

Scaling workers? Change replicas: 5 to replicas: 10 on the MachineDeployment. CAPI reconciles the difference—creating new Machines, joining them to the cluster automatically.
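A worker-pool MachineDeployment for the Docker provider might look roughly like this sketch (resource names are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-md-0             # hypothetical name
spec:
  clusterName: my-cluster
  replicas: 5                       # change to 10 and CAPI reconciles the difference
  selector:
    matchLabels: {}                 # CAPI fills in machine labels via webhook
  template:
    spec:
      clusterName: my-cluster
      version: v1.31.0              # Kubernetes version for these workers
      bootstrap:
        configRef:                  # how each node installs and joins K8s
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: my-cluster-md-0
      infrastructureRef:            # what kind of "machine" to create
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: my-cluster-md-0
```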


Creating individual Cluster, MachineDeployment, and KubeadmControlPlane objects for every cluster is verbose. ClusterClass solves this by defining a reusable template. A ClusterClass references templates for the control plane, infrastructure, and worker machine deployments. Once defined, you stamp out clusters with minimal YAML:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: team-payments-prod
spec:
  topology:
    class: standard-production
    version: v1.31.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
        - class: default-worker
          name: worker-pool
          replicas: 10

That is it. Dozens of underlying resources are created from one concise definition.
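The standard-production class referenced above would be defined once by a platform team. A skeleton looks roughly like the following (all template names are illustrative, shown for the Docker provider):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: standard-production
spec:
  controlPlane:
    ref:                            # template for the control plane itself
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: standard-production-control-plane
    machineInfrastructure:
      ref:                          # what machines control plane nodes run on
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: standard-production-control-plane
  infrastructure:
    ref:                            # template for the cluster-level infra
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerClusterTemplate
      name: standard-production-cluster
  workers:
    machineDeployments:
      - class: default-worker       # matched by spec.topology.workers above
        template:
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: standard-production-worker-bootstrap
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
              kind: DockerMachineTemplate
              name: standard-production-worker
```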


| Provider | Code | Infrastructure | Maturity |
| --- | --- | --- | --- |
| AWS | CAPA | EC2, ELB, VPC | Stable (v2+) |
| Azure | CAPZ | VMs, VMSS, AKS | Stable (v1+) |
| GCP | CAPG | GCE, GKE | Stable |
| vSphere | CAPV | vSphere VMs | Stable |
| Metal3 | CAPM3 | Bare-metal via Ironic | Stable |
| Docker | CAPD | Docker containers | Development/testing only |
| OpenStack | CAPO | OpenStack VMs | Stable |
| Hetzner | CAPH | Hetzner Cloud + bare-metal | Community |

Cross-reference: For lightweight distributions that CAPI can bootstrap, see K3s, Talos, and k0s. For understanding kubeadm (the default bootstrap provider), see the CKA track.


Hands-On: CAPI with Docker Provider (CAPD)

No cloud account needed. We use Docker containers as “machines” to experience the full cluster lifecycle locally.

Terminal window
# Install clusterctl (CAPI CLI)
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/latest/download/clusterctl-$(uname -s | tr '[:upper:]' '[:lower:]')-amd64 -o clusterctl
chmod +x clusterctl
sudo mv clusterctl /usr/local/bin/
# Verify
clusterctl version
# You also need: Docker, kind, kubectl
Terminal window
# Create a kind cluster to serve as management cluster
kind create cluster --name capi-management
# Initialize CAPI with Docker infrastructure provider
clusterctl init --infrastructure docker
# Watch providers come up
kubectl get pods -A --watch
# Wait until all pods in capi-system, capd-system, capi-kubeadm-* are Running
Terminal window
# Generate workload cluster manifest using Docker provider
clusterctl generate cluster my-workload \
  --flavor development \
  --kubernetes-version v1.31.0 \
  --control-plane-machine-count 1 \
  --worker-machine-count 2 \
  > my-workload-cluster.yaml
# Review what will be created
kubectl apply -f my-workload-cluster.yaml --dry-run=client
# Apply it
kubectl apply -f my-workload-cluster.yaml
Terminal window
# Watch cluster provisioning (this is the fun part)
clusterctl describe cluster my-workload
# Watch machines being created
kubectl get machines -w
# Check cluster status
kubectl get cluster my-workload -o yaml | grep -A 5 status
Terminal window
# Get the kubeconfig for the workload cluster
clusterctl get kubeconfig my-workload > my-workload.kubeconfig
# Use it
kubectl --kubeconfig my-workload.kubeconfig get nodes
# You should see 1 control-plane + 2 worker nodes
# Install a CNI (workload clusters need one)
kubectl --kubeconfig my-workload.kubeconfig apply -f \
  https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
Terminal window
# Scale from 2 to 4 workers
kubectl patch machinedeployment my-workload-md-0 \
  --type merge \
  -p '{"spec":{"replicas": 4}}'
# Watch new machines appear
kubectl get machines -w
Terminal window
# Delete the workload cluster
kubectl delete cluster my-workload
# CAPI handles teardown: drains nodes, deletes machines, cleans up infra
kubectl get machines -w  # Watch them disappear
# Delete the management cluster
kind delete cluster --name capi-management

Success Criteria: You created a workload cluster, accessed it, scaled workers, and deleted it—all declaratively from a management cluster.


| Operation | How | What Happens |
| --- | --- | --- |
| Create | kubectl apply -f cluster.yaml | CAPI provisions infrastructure, bootstraps nodes, forms cluster |
| Scale workers | Patch MachineDeployment replicas | New Machines created, joined to cluster |
| Scale control plane | Patch KubeadmControlPlane replicas (e.g., 1 to 3) | New CP nodes added, etcd membership updated |
| Upgrade | Patch version on KCP + MachineDeployment | Rolling replacement: new node up, old node drained and removed |
| Delete | kubectl delete cluster <name> | Nodes drained, machines deleted, infrastructure cleaned up |
| Repair | Automatic reconciliation | If a Machine fails health checks, CAPI replaces it |

The upgrade strategy deserves emphasis: CAPI does immutable infrastructure upgrades. It does not run kubeadm upgrade on existing nodes. Instead, it creates a new node at the target version, waits for it to be healthy, then drains and removes the old node. This is safer and more predictable than in-place upgrades.
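In practice, triggering such a rolling upgrade is a one-field change on the KubeadmControlPlane (resource names below are illustrative):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # hypothetical name
spec:
  replicas: 3
  version: v1.31.0                 # bumping this from v1.30.x triggers a rolling
                                   # replacement of all control plane Machines
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerMachineTemplate
      name: my-cluster-control-plane
```

Repeat the same version bump on each MachineDeployment to roll the workers, one minor version at a time per the Kubernetes skew policy.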


CAPI resources are just Kubernetes objects. This makes GitOps integration natural.

Store your cluster definitions in Git, point ArgoCD at the management cluster, and every cluster change goes through pull request review.

argocd/cluster-fleet-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-fleet
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/cluster-fleet
    targetRevision: main
    path: clusters
  destination:
    server: https://kubernetes.default.svc  # management cluster
    namespace: default
  syncPolicy:
    automated:
      prune: false  # IMPORTANT: never auto-delete clusters
      selfHeal: true

With this setup, adding a new cluster means committing a YAML file. Upgrading a fleet means bumping the version in Git. Every change is auditable, reviewable, and reversible.

Cross-reference: See ArgoCD and Flux for GitOps fundamentals.


| Dimension | Cluster API | Terraform | Crossplane |
| --- | --- | --- | --- |
| Model | Kubernetes-native CRDs | HCL declarative | Kubernetes-native CRDs |
| Reconciliation | Continuous (controller loop) | One-shot (apply) | Continuous (controller loop) |
| Scope | Kubernetes clusters only | Any infrastructure | Any cloud resource |
| State | Kubernetes etcd | Terraform state file | Kubernetes etcd |
| Drift detection | Automatic, continuous | Manual (plan) | Automatic, continuous |
| Cluster upgrades | Built-in rolling upgrades | Manual scripting | Not cluster-aware |
| Multi-cluster | First-class (fleet management) | Possible but manual | Can provision clusters but not manage lifecycle |
| Best for | Managing Kubernetes clusters at scale | General infrastructure | Cloud resources as K8s APIs |

When to use what:

  • CAPI: You manage multiple Kubernetes clusters and want declarative lifecycle management with rolling upgrades, health checks, and remediation.
  • Terraform: You need to provision the base infrastructure (VPCs, IAM, DNS) that CAPI clusters run on. Many teams use Terraform for foundations + CAPI for clusters.
  • Crossplane: You want to expose cloud resources (databases, queues, storage) as Kubernetes APIs for developers. Crossplane and CAPI complement each other well.

| Mistake | Why It's Wrong | What To Do Instead |
| --- | --- | --- |
| Running workloads on the management cluster | If the management cluster goes down, you lose the ability to manage all workload clusters | Keep the management cluster dedicated; run workloads on workload clusters only |
| No backup of management cluster | Losing the management cluster means losing cluster definitions and state | Back up etcd and CAPI resources regularly; consider Velero or clusterctl move |
| Skipping CNI installation on workload clusters | CAPI provisions nodes but does not install a CNI; Pods will stay Pending | Always install a CNI (Calico, Cilium) as a post-create step or via ClusterResourceSet |
| Upgrading multiple minor versions at once | CAPI follows the Kubernetes skew policy; jumping versions can break clusters | Upgrade one minor version at a time (e.g., 1.30 to 1.31, not 1.29 to 1.31) |
| Using CAPD (Docker provider) in production | CAPD is for development and testing only; Docker "machines" are not production-grade | Use a real infrastructure provider (CAPA, CAPZ, CAPV) for production |
| Ignoring MachineHealthChecks | Without health checks, failed nodes sit broken indefinitely | Define MachineHealthCheck resources to auto-replace unhealthy nodes |
| Setting prune: true in ArgoCD for clusters | ArgoCD could delete cluster resources if they disappear from Git | Always set prune: false for cluster fleet ArgoCD Applications |
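A MachineHealthCheck is itself just another CRD in the management cluster. A sketch for a worker pool (names and thresholds are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: my-cluster-worker-mhc      # hypothetical name
spec:
  clusterName: my-cluster
  maxUnhealthy: 40%                # circuit breaker: stop remediating if too
                                   # many Machines fail at once
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: my-cluster-md-0
  unhealthyConditions:             # replace a Machine whose Node reports
    - type: Ready                  # NotReady/Unknown for 5 minutes
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```

When a condition trips, CAPI deletes the Machine and its owning MachineSet creates a replacement, the same way a ReplicaSet replaces a failed Pod.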

Test your understanding of Cluster API concepts.

Q1: What is the role of the management cluster in CAPI?

Show Answer

The management cluster runs CAPI controllers (core, infrastructure, bootstrap, control plane providers) that provision and manage workload clusters. It is the “cluster that manages clusters.” It stores the desired state of all workload clusters as Kubernetes resources in its own etcd.

Q2: How does CAPI handle Kubernetes version upgrades differently from kubeadm upgrade?

Show Answer

CAPI uses an immutable infrastructure approach: it creates a new Machine at the target Kubernetes version, waits for it to become healthy, then cordons, drains, and deletes the old Machine. This is a rolling replacement, not an in-place upgrade. kubeadm upgrade modifies the existing node in place, which carries more risk of leaving the node in a broken state.

Q3: What are the three types of CAPI providers and what does each do?

Show Answer
  1. Infrastructure Provider - Creates the actual compute resources (VMs, networks, load balancers) on the target platform (AWS, Azure, vSphere, Docker, etc.).
  2. Bootstrap Provider - Generates the configuration (typically cloud-init scripts) that turns a bare machine into a Kubernetes node. The default is CABPK (kubeadm bootstrap).
  3. Control Plane Provider - Manages the lifecycle of control plane nodes, including etcd membership. The default is KubeadmControlPlane (KCP).

Q4: What problem does ClusterClass solve, and how?

Show Answer

ClusterClass solves the problem of verbose, repetitive cluster definitions. Without it, creating a cluster requires defining Cluster, KubeadmControlPlane, MachineDeployment, and all their infrastructure-specific counterparts individually. ClusterClass lets you define a reusable template once, then create clusters by referencing the class with only the parameters that differ (name, version, replica counts). It is analogous to a class/instance relationship in object-oriented programming.


Scenario: Your team manages three environments (dev, staging, prod) and wants to standardize cluster creation with CAPI.

Tasks:

  1. Set up a management cluster using kind and initialize CAPI with the Docker provider
  2. Create a workload cluster named dev-cluster with 1 control plane node and 1 worker
  3. Verify the cluster is healthy and install a CNI
  4. Scale the worker pool to 3 nodes
  5. Create a second cluster named staging-cluster with the same configuration
  6. Delete the dev-cluster and verify all resources are cleaned up

Success Criteria:

  • Management cluster running with CAPI controllers
  • dev-cluster provisioned and accessible via kubeconfig
  • Workers scaled from 1 to 3 (verified with kubectl get machines)
  • staging-cluster running alongside (two workload clusters managed simultaneously)
  • dev-cluster deleted cleanly (no orphaned Machines)

Bonus Challenge: Write a ClusterClass that encapsulates your cluster template, then create both clusters using spec.topology.class references instead of individual resources.


  • CAPI is the standard for declarative Kubernetes lifecycle management
  • Rancher and Gardener are alternative fleet management tools with their own provisioning models
  • EKS Anywhere, AKS Arc, and GKE On-Prem use CAPI under the hood for hybrid deployments
  1. Dedicate the management cluster — no application workloads, high availability
  2. Use ClusterClass — avoid copy-paste cluster definitions; template everything
  3. Define MachineHealthChecks — auto-replace nodes that fail health checks
  4. GitOps your fleet — store all cluster definitions in Git
  5. Back up the management cluster — use Velero or clusterctl move

Next Module: Module 7.1: Backstage - Build an Internal Developer Portal to give developers self-service access to these clusters.