Module 2.4: Declarative Bare Metal with Cluster API
Complexity: [COMPLEX] | Time: 60 minutes | Prerequisites: Module 2.3: Immutable OS, Cluster API
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement Cluster API with Metal3 or Sidero providers to declaratively provision bare-metal Kubernetes clusters
- Configure a bare-metal host inventory with BMC credentials, hardware profiles, and network templates
- Deploy new Kubernetes clusters using kubectl apply with version-controlled YAML manifests
- Design a GitOps-driven cluster lifecycle workflow that covers provisioning, scaling, and decommissioning
Why This Module Matters
A financial services company with 8 Kubernetes clusters across two datacenters managed their infrastructure with a combination of Ansible playbooks, shell scripts, and a shared spreadsheet tracking which server was in which cluster. Creating a new cluster took 3 days: 1 day to allocate servers (manually checking the spreadsheet), 1 day to PXE boot and install the OS, and 1 day to run kubeadm and configure networking. Decommissioning a cluster was worse — nobody was sure which servers could be safely wiped because the spreadsheet was 4 months out of date.
When they needed to spin up an emergency cluster for a regulatory audit, it took 5 days instead of the promised 1. The CTO asked: “Why can’t we create a cluster as easily as we create a pod?” The answer was that their bare metal had no declarative API — no equivalent of kubectl apply -f cluster.yaml.
Cluster API (CAPI) with bare metal providers (Metal3/Sidero) solves this by treating physical servers like cloud instances. You define a cluster in YAML, apply it, and the system provisions hardware, installs the OS, bootstraps Kubernetes, and joins nodes — all declaratively, all auditable, all version-controlled in Git.
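To make that concrete, here is a minimal sketch of what such a cluster manifest can look like. The name and CIDR blocks are illustrative; complete, provider-specific examples appear later in this module.

```yaml
# Minimal Cluster object: the top-level declarative description of a
# Kubernetes cluster. Values here are illustrative.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]   # pod network
    services:
      cidrBlocks: ["10.96.0.0/12"]    # service network
  # In a real manifest, controlPlaneRef and infrastructureRef point at
  # provider-specific objects (e.g., TalosControlPlane, MetalCluster).
```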
The Valet Parking Analogy
Without Cluster API, provisioning bare metal is like parking your own car in a multi-story garage: you walk around looking for a space, park, and try to remember where you left it. With Cluster API, it is valet parking: you hand over the keys (hardware inventory), say what you need (“3 control planes, 5 workers”), and the valet (CAPI) handles everything. When you are done, you get your car back (server released to the pool).
What You’ll Learn
- How Cluster API extends Kubernetes to manage infrastructure
- Metal3 (CAPM3): IPMI/Redfish-based bare metal provisioning
- Sidero: Talos-native bare metal management
- Hardware inventory and machine lifecycle
- GitOps-driven cluster lifecycle (create, upgrade, scale, delete)
- Multi-cluster management from a single management cluster
Cluster API Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                 CLUSTER API ON BARE METAL                   │
│                                                             │
│  ┌────────────────────────────────────────────┐             │
│  │            Management Cluster              │             │
│  │                                            │             │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐    │             │
│  │  │  CAPI    │ │ Bootstrap│ │  Infra   │    │             │
│  │  │Controller│ │ Provider │ │ Provider │    │             │
│  │  │          │ │ (Talos/  │ │ (Metal3/ │    │             │
│  │  │ Manages  │ │  kubeadm)│ │  Sidero) │    │             │
│  │  │ Cluster, │ │          │ │          │    │             │
│  │  │ Machine  │ │ Generates│ │Provisions│    │             │
│  │  │  CRDs    │ │ bootstrap│ │  bare    │    │             │
│  │  │          │ │  config  │ │  metal   │    │             │
│  │  └──────────┘ └──────────┘ └──────────┘    │             │
│  └──────────────────┬─────────────────────────┘             │
│                     │ Provisions                            │
│         ┌───────────▼───────────┐                           │
│         │   Workload Cluster    │                           │
│         │                       │                           │
│         │  ┌────┐ ┌────┐ ┌────┐ │                           │
│         │  │CP-1│ │CP-2│ │CP-3│ │                           │
│         │  └────┘ └────┘ └────┘ │                           │
│         │  ┌────┐ ┌────┐ ┌────┐ │                           │
│         │  │W-1 │ │W-2 │ │W-3 │ │                           │
│         │  └────┘ └────┘ └────┘ │                           │
│         └───────────────────────┘                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Key CRDs
| CRD | Purpose |
|---|---|
| Cluster | Defines a K8s cluster (name, version, networking) |
| Machine | Represents a single node (control plane or worker) |
| MachineDeployment | Manages a set of worker machines (like a Deployment for pods) |
| MachineHealthCheck | Auto-remediation for unhealthy nodes |
| BareMetalHost (Metal3) | Represents a physical server |
| ServerClass (Sidero) | Groups servers by hardware capabilities |
Metal3 (CAPM3)
Metal3 uses IPMI/Redfish to control bare metal servers. It integrates with Ironic (the OpenStack bare metal provisioner) to handle PXE boot, OS installation, and machine lifecycle.
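Ironic's inspection step is what produces the per-server hardware profile. A trimmed, illustrative excerpt of what a BareMetalHost status can look like after inspection (the values are hypothetical, and exact field names can vary across Metal3 API versions):

```yaml
# Illustrative BareMetalHost status after inspection. This is written
# by the controller, not authored by the user; values are hypothetical.
status:
  provisioning:
    state: available
  hardware:
    cpu:
      arch: x86_64
      count: 64
    ramMebibytes: 262144            # 256 GiB
    nics:
      - name: eno1
        mac: "aa:bb:cc:dd:ee:01"
    storage:
      - name: /dev/sda
        sizeBytes: 960197124096     # ~960 GB
```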
Metal3 Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       METAL3 STACK                          │
│                                                             │
│  Management Cluster                                         │
│  ┌──────────────────────────────────────────┐               │
│  │  ┌──────────────┐    ┌──────────────┐    │               │
│  │  │    CAPM3     │    │    Ironic    │    │               │
│  │  │ (controller) │    │ (provisioner)│    │               │
│  │  └──────┬───────┘    └──────┬───────┘    │               │
│  │         │                   │            │               │
│  │  ┌──────▼───────────────────▼──────┐     │               │
│  │  │       BareMetalHost CRDs        │     │               │
│  │  │                                 │     │               │
│  │  │  bmh-01: available              │     │               │
│  │  │  bmh-02: provisioned (cp-1)     │     │               │
│  │  │  bmh-03: provisioned (cp-2)     │     │               │
│  │  │  bmh-04: provisioning...        │     │               │
│  │  └─────────────────────────────────┘     │               │
│  └──────────────────────────────────────────┘               │
│                      │                                      │
│                      │ IPMI/Redfish                         │
│                      ▼                                      │
│  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐                     │
│  │ BMC  │  │ BMC  │  │ BMC  │  │ BMC  │                     │
│  │srv-01│  │srv-02│  │srv-03│  │srv-04│                     │
│  └──────┘  └──────┘  └──────┘  └──────┘                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Pause and predict: In the traditional workflow described in the war story above, creating a cluster took 3 days and involved a shared spreadsheet. With Cluster API, you define a cluster in YAML and kubectl apply it. What are the prerequisites that must be in place before this “kubectl apply” can actually provision physical servers? List at least three infrastructure components.
BareMetalHost Definition
The BareMetalHost CRD is how Metal3 knows about your physical servers. Each server gets a manifest that includes its BMC address and credentials — this is how CAPI can power-cycle the machine, set the boot device to PXE, and initiate OS installation without anyone touching the hardware:
```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: metal3
spec:
  online: true
  bootMACAddress: "aa:bb:cc:dd:ee:01"
  bmc:
    address: ipmi://10.0.100.10
    credentialsName: server-01-bmc-credentials
  rootDeviceHints:
    deviceName: /dev/sda
  # Hardware profile auto-detected during inspection
---
apiVersion: v1
kind: Secret
metadata:
  name: server-01-bmc-credentials
  namespace: metal3
type: Opaque
data:
  username: YWRtaW4=     # admin
  password: cGFzc3dvcmQ= # password
```
Machine Lifecycle States
```
┌─────────────────────────────────────────────────────────────┐
│                  BAREMETAL HOST LIFECYCLE                   │
│                                                             │
│  Registering → Inspecting → Available → Provisioning        │
│                                              │              │
│                                              ▼              │
│                                         Provisioned         │
│                                              │              │
│                                              ▼              │
│                                       Deprovisioning        │
│                                              │              │
│                                              ▼              │
│                                          Available          │
│                                      (ready for reuse)      │
│                                                             │
│  Registering:    BMC credentials verified                   │
│  Inspecting:     Hardware inventory (CPU, RAM, disks, NICs) │
│  Available:      Ready for cluster allocation               │
│  Provisioning:   PXE booting, OS installing                 │
│  Provisioned:    Running as K8s node                        │
│  Deprovisioning: Wiping disks, returning to pool            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Sidero (Talos-Native)
Sidero is Sidero Labs’ bare metal provider for Cluster API, designed specifically for Talos Linux. It is simpler than Metal3 (no Ironic dependency) and uses IPMI/Redfish directly.
Sidero vs Metal3
| Feature | Metal3 (CAPM3) | Sidero |
|---|---|---|
| OS support | Any (Ubuntu, Flatcar, etc.) | Talos Linux only |
| Provisioner | Ironic (complex, OpenStack heritage) | Built-in (simpler) |
| BMC protocol | IPMI, Redfish, iDRAC, iLO | IPMI, Redfish |
| Server discovery | Manual BareMetalHost CRDs | Auto-discovery via DHCP |
| Image delivery | Ironic Python Agent (IPA) | Talos PXE image |
| Complexity | Higher (Ironic is a large system) | Lower (fewer moving parts) |
| Maturity | Older, more tested | Newer, less battle-tested |
| Best for | Multi-OS environments | Talos-only environments |
Sidero Server Discovery
```bash
# Sidero auto-discovers servers when they PXE boot.
# Servers appear as Server CRDs automatically.

# Check discovered servers
kubectl get servers -n sidero-system
# NAME                                   HOSTNAME    BMC           ACCEPTED
# 00000000-0000-0000-0000-aabbccddeef1   server-01   10.0.100.10   false
# 00000000-0000-0000-0000-aabbccddeef2   server-02   10.0.100.11   false

# Accept a server into the pool
kubectl patch server 00000000-0000-0000-0000-aabbccddeef1 \
  --type merge -p '{"spec":{"accepted": true}}'
```

```yaml
# Group servers by capability
apiVersion: metal.sidero.dev/v1alpha2
kind: ServerClass
metadata:
  name: worker-large
spec:
  qualifiers:
    cpu:
      - manufacturer: "AMD"
        version: "EPYC.*"
    systemInformation:
      - manufacturer: "Dell Inc."
  selector:
    matchLabels:
      rack: "rack-a"
```
Creating a Cluster with Sidero
```yaml
# Define the workload cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: production-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalCluster
    name: production
---
# Control plane (3 nodes from 'control-plane' server class)
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: production-cp
spec:
  replicas: 3
  version: v1.35.0
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalMachineTemplate
    name: production-cp
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
---
# Worker machines (5 nodes from 'worker-large' server class)
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-workers
spec:
  replicas: 5
  clusterName: production
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: production
    spec:
      clusterName: production
      version: v1.35.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: production-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: MetalMachineTemplate
        name: production-workers
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: MetalMachineTemplate
metadata:
  name: production-workers
spec:
  template:
    spec:
      serverClassRef:
        apiVersion: metal.sidero.dev/v1alpha2
        kind: ServerClass
        name: worker-large
```

```bash
# Apply the cluster definition
kubectl apply -f production-cluster.yaml
```
```bash
# Watch the provisioning
kubectl get machines -w
# NAME                       PHASE
# production-cp-abc12        Provisioning
# production-cp-def34        Pending
# production-cp-ghi56        Pending
# production-workers-jkl78   Pending
# ...

# After ~10-15 minutes:
# production-cp-abc12        Running
# production-cp-def34        Running
# production-cp-ghi56        Running
# production-workers-jkl78   Running
# production-workers-mno90   Running
```
```bash
# Get the workload cluster kubeconfig
kubectl get secret production-kubeconfig -o jsonpath='{.data.value}' | base64 -d > production.kubeconfig
kubectl --kubeconfig production.kubeconfig get nodes
```
Stop and think: A worker node’s NVMe drive fails at 3 AM. With the traditional approach, an on-call engineer gets paged, SSHes into the node, cordons it, drains pods, and files a hardware ticket. With the MachineHealthCheck below, what happens instead? What is still a manual step even with full automation?
Machine Health Checks
Automatically remediate failed nodes by replacing them with fresh hardware from the pool:
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: production-worker-health
spec:
  clusterName: production
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: production-workers
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 5m
    - type: Ready
      status: Unknown
      timeout: 5m
  maxUnhealthy: "40%"       # Don't remediate if >40% are unhealthy (likely a systemic issue)
  nodeStartupTimeout: 10m
```
When a node is unhealthy for >5 minutes:
- CAPI marks the Machine for deletion
- The infrastructure provider deprovisions the bare metal host (wipes disk)
- The host returns to “Available” in the pool
- CAPI creates a new Machine, which provisions a new host
- The new node joins the cluster automatically
Pause and predict: Your team manages 5 Kubernetes clusters across 2 datacenters. Currently, cluster changes are made by running kubectl commands manually. What specific risks does this create, and how does the GitOps approach below eliminate each one?
GitOps-Driven Cluster Lifecycle
```
┌─────────────────────────────────────────────────────────────┐
│                  GITOPS CLUSTER LIFECYCLE                   │
│                                                             │
│  Git Repository                                             │
│  ├── clusters/                                              │
│  │   ├── production/                                        │
│  │   │   ├── cluster.yaml        (Cluster definition)       │
│  │   │   ├── control-plane.yaml  (TalosControlPlane)        │
│  │   │   ├── workers.yaml        (MachineDeployment)        │
│  │   │   └── health-checks.yaml  (MachineHealthCheck)       │
│  │   ├── staging/                                           │
│  │   │   └── ...                                            │
│  │   └── dev/                                               │
│  │       └── ...                                            │
│  └── infrastructure/                                        │
│      ├── servers.yaml        (BareMetalHost inventory)      │
│      └── server-classes.yaml (ServerClass definitions)      │
│                                                             │
│  ArgoCD/Flux watches → applies to management cluster        │
│  Management cluster → provisions workload clusters          │
│                                                             │
│  To create a cluster:  git commit + push                    │
│  To scale workers:     change replicas, git push            │
│  To upgrade K8s:       change version, git push             │
│  To delete a cluster:  remove YAML, git push                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Did You Know?
- Cluster API was created by Kubernetes SIG Cluster Lifecycle specifically because every cloud provider had built their own incompatible cluster management tooling. CAPI provides a single API that works across AWS, Azure, GCP, vSphere, and bare metal.
- Metal3 stands for “Metal Kubed” (Metal^3). It was created by Red Hat and is the bare metal infrastructure provider used by OpenShift’s Assisted Installer for on-premises deployments.
- Sidero was created by the same team that built Talos Linux (Sidero Labs). The name comes from the Greek word for “iron” — fitting for bare metal management.
- The largest known Cluster API deployment manages over 4,000 clusters across multiple infrastructure providers. Organizations like Deutsche Telekom and SAP use CAPI to manage their multi-cluster Kubernetes platforms at enterprise scale.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No management cluster HA | Management cluster dies = cannot manage anything | Run 3-node HA management cluster with etcd backup |
| BMC credentials in plain text | Security risk | Use Kubernetes secrets + external secrets operator |
| No server pool buffer | MachineHealthCheck tries to replace but no servers available | Maintain 2-3 spare servers in the pool |
| Skipping hardware inspection | Deploying on servers with failed RAM or disks | Always let CAPI inspect hardware before marking available |
| No disk wipe on deprovision | Previous tenant’s data visible to next | Enable secure erase in Metal3/Sidero deprovisioning |
| Single management cluster | Management cluster failure = total loss of control | Backup management cluster state; consider multi-site mgmt |
| Not using GitOps | Cluster definitions are imperative and unauditable | Store all CAPI YAMLs in Git; deploy via ArgoCD/Flux |
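The last row’s GitOps recommendation can be sketched as an Argo CD Application that syncs a cluster directory onto the management cluster. The repository URL and paths below are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-cluster
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/clusters.git  # hypothetical repo
    targetRevision: main
    path: clusters/production
  destination:
    server: https://kubernetes.default.svc   # the management cluster itself
    namespace: default
  syncPolicy:
    automated:
      prune: true      # removing YAML from Git removes the cluster objects
      selfHeal: true   # drift from Git is reverted automatically
```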
Question 1
What happens if the management cluster goes down? Can the workload clusters still function?
Answer
Yes, workload clusters continue to function normally. The management cluster only manages the lifecycle (creation, scaling, upgrades, health checks) of workload clusters. Once a workload cluster is provisioned, it operates independently — its control plane, workers, and workloads are self-contained.
However, you lose:
- Scaling: Cannot add/remove worker nodes
- Upgrades: Cannot trigger K8s or OS upgrades
- Auto-remediation: MachineHealthChecks stop working (unhealthy nodes are not replaced)
- New cluster creation: Cannot provision new clusters
Mitigation: Run the management cluster with 3-node HA, back up its etcd regularly, and consider a standby management cluster in a second datacenter.
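One hedged way to implement the “back up its etcd regularly” advice is a CronJob on the management cluster itself. This sketch assumes a kubeadm-style control plane with etcd certificates on the host filesystem; the image tag, paths, and schedule are illustrative:

```yaml
# Nightly etcd snapshot of the management cluster (sketch).
# Assumes kubeadm-style cert paths; adjust for your distribution.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              image: registry.k8s.io/etcd:3.5.12-0   # illustrative tag
              command:
                - /bin/sh
                - -c
                - >
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db
                  --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/kubernetes/pki/etcd/ca.crt
                  --cert=/etc/kubernetes/pki/etcd/server.crt
                  --key=/etc/kubernetes/pki/etcd/server.key
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd
          restartPolicy: OnFailure
```

Ship the snapshots off-host (object storage, second datacenter) so a management-cluster loss does not also lose the backups.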
Question 2
You need to upgrade Kubernetes from 1.34 to 1.35 on a 50-node production cluster managed by Cluster API. How does this work?
Answer
Rolling upgrade via CAPI:
1. Update the version field in the control plane and MachineDeployment YAMLs:

   ```yaml
   # TalosControlPlane — version is at spec.version
   spec:
     version: v1.35.0  # was v1.34.0
   ```

   ```yaml
   # MachineDeployment — version is at spec.template.spec.version
   spec:
     template:
       spec:
         version: v1.35.0  # was v1.34.0
   ```

2. Apply (or Git push if using GitOps). CAPI detects the version change.

3. CAPI performs a rolling update:
   - Creates a new Machine with v1.35.0
   - Waits for it to join the cluster and become Ready
   - Cordons and drains an old v1.34.0 Machine
   - Deletes the old Machine (hardware returns to pool)
   - Repeats until all machines are upgraded

4. Control plane upgrades first, then workers.
This is exactly like a Deployment rollout — CAPI manages Machine objects the same way the Deployment controller manages Pods. The maxSurge and maxUnavailable settings on MachineDeployment control the rollout speed.
Key consideration on bare metal: This requires spare servers in the pool. CAPI needs to provision a new machine before deprovisioning the old one (surge). If your pool has no spare servers, the upgrade blocks.
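The maxSurge and maxUnavailable knobs mentioned above live on the MachineDeployment spec, mirroring the Deployment API. A sketch (values illustrative):

```yaml
# Rollout tuning on a MachineDeployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # provision at most 1 extra machine (requires 1 spare server)
      maxUnavailable: 0  # never dip below the desired replica count
```

On bare metal, maxSurge directly determines how many spare servers the pool must hold during an upgrade.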
Question 3
Compare Metal3 and Sidero. When would you choose each?
Answer
Metal3:
- Supports any OS (Ubuntu, Flatcar, RHEL, etc.)
- Uses Ironic (OpenStack heritage) — more complex but battle-tested
- Backed by Red Hat, used in OpenShift bare metal deployments
- Better for multi-OS environments or organizations already using OpenStack
- More mature ecosystem and documentation
Sidero:
- Talos Linux only (no other OS support)
- Simpler architecture (no Ironic dependency)
- Auto-discovers servers via PXE (no manual BareMetalHost creation)
- ServerClass grouping for hardware-aware scheduling
- Best for all-Talos environments where simplicity is valued
Decision: If you chose Talos Linux in Module 2.3, use Sidero. If you need Ubuntu/Flatcar/RHEL, use Metal3.
Question 4
How do you handle a server with a failed disk in a Cluster API-managed cluster?
Answer
Automatic remediation via MachineHealthCheck:
1. The disk failure causes kubelet to report NotReady (or the node stops responding entirely).
2. MachineHealthCheck detects the Ready=False condition persisting beyond the timeout (e.g., 5 minutes).
3. CAPI marks the Machine for deletion.
4. The infrastructure provider (Metal3/Sidero):
   - Deprovisions the server (marks as “needs maintenance”)
   - Does NOT return the server to the available pool (bad hardware)
5. CAPI creates a new Machine, which is provisioned on a healthy server from the pool.
6. The new node joins the cluster and workloads are scheduled on it.
Manual steps still needed:
- Someone must physically replace the failed disk
- After repair, re-register the server (update BareMetalHost or re-PXE for Sidero)
- The server goes through inspection and returns to the available pool
This is why spare servers matter. If your pool is empty, the MachineHealthCheck cannot create a replacement, and the unhealthy machine stays in the cluster.
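On the Metal3 side, one hedged way to keep the host out of service while it awaits the disk swap is to power it down declaratively. The spec.online field is part of the BareMetalHost API, though the surrounding workflow and server name here are illustrative:

```yaml
# Patch fragment for the failed host: setting online: false tells the
# controller to power the machine off via its BMC until repair is done.
# One way to apply it (illustrative):
#   kubectl patch bmh server-04 -n metal3 --type merge -p '{"spec":{"online":false}}'
spec:
  online: false
```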
Hands-On Exercise: Cluster API with Docker (Simulation)
Task: Use Cluster API with the Docker provider to simulate the bare metal workflow.
Note: The Docker provider is CAPI’s testing/development provider. It creates “machines” as Docker containers. The workflow is identical to bare metal — only the infrastructure layer differs.
```bash
# Install clusterctl
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/latest/download/clusterctl-linux-amd64 -o clusterctl
chmod +x clusterctl && sudo mv clusterctl /usr/local/bin/

# Create a kind cluster as the management cluster
kind create cluster --name capi-mgmt

# Initialize CAPI with the Docker provider
clusterctl init --infrastructure docker

# Generate a workload cluster manifest
clusterctl generate cluster demo-cluster \
  --infrastructure docker \
  --kubernetes-version v1.35.0 \
  --control-plane-machine-count 1 \
  --worker-machine-count 2 \
  > demo-cluster.yaml

# Apply the cluster definition
kubectl apply -f demo-cluster.yaml

# Watch machines being provisioned
kubectl get machines -w

# Get the workload cluster kubeconfig
clusterctl get kubeconfig demo-cluster > demo.kubeconfig

# Verify the workload cluster
kubectl --kubeconfig demo.kubeconfig get nodes

# Scale workers
kubectl patch machinedeployment demo-cluster-md-0 \
  --type merge -p '{"spec":{"replicas": 4}}'

# Watch new machines appear
kubectl get machines -w

# Cleanup
kubectl delete cluster demo-cluster
kind delete cluster --name capi-mgmt
```
Success Criteria
- Management cluster created (kind)
- CAPI initialized with Docker provider
- Workload cluster provisioned (1 CP + 2 workers)
- kubeconfig retrieved and kubectl works against workload cluster
- Workers scaled from 2 to 4
- Cluster deleted cleanly (all machines deprovisioned)
Next Module
Continue to Module 3.1: Datacenter Network Architecture to learn about spine-leaf topology, VLANs, and network design for on-premises Kubernetes.