
Module 2.4: Declarative Bare Metal with Cluster API

Complexity: [COMPLEX] | Time: 60 minutes

Prerequisites: Module 2.3: Immutable OS, Cluster API


After completing this module, you will be able to:

  1. Implement Cluster API with Metal3 or Sidero providers to declaratively provision bare-metal Kubernetes clusters
  2. Configure a bare-metal host inventory with BMC credentials, hardware profiles, and network templates
  3. Deploy new Kubernetes clusters using kubectl apply with version-controlled YAML manifests
  4. Design a GitOps-driven cluster lifecycle workflow that covers provisioning, scaling, and decommissioning

A financial services company with 8 Kubernetes clusters across two datacenters managed their infrastructure with a combination of Ansible playbooks, shell scripts, and a shared spreadsheet tracking which server was in which cluster. Creating a new cluster took 3 days: 1 day to allocate servers (manually checking the spreadsheet), 1 day to PXE boot and install the OS, and 1 day to run kubeadm and configure networking. Decommissioning a cluster was worse — nobody was sure which servers could be safely wiped because the spreadsheet was 4 months out of date.

When they needed to spin up an emergency cluster for a regulatory audit, it took 5 days instead of the promised 1. The CTO asked: “Why can’t we create a cluster as easily as we create a pod?” The answer was that their bare metal had no declarative API — no equivalent of kubectl apply -f cluster.yaml.

Cluster API (CAPI) with bare metal providers (Metal3/Sidero) solves this by treating physical servers like cloud instances. You define a cluster in YAML, apply it, and the system provisions hardware, installs the OS, bootstraps Kubernetes, and joins nodes — all declaratively, all auditable, all version-controlled in Git.

The Valet Parking Analogy

Without Cluster API, provisioning bare metal is like parking your own car in a multi-story garage: you walk around looking for a space, park, and try to remember where you left it. With Cluster API, it is valet parking: you hand over the keys (hardware inventory), say what you need (“3 control planes, 5 workers”), and the valet (CAPI) handles everything. When you are done, you get your car back (server released to the pool).


  • How Cluster API extends Kubernetes to manage infrastructure
  • Metal3 (CAPM3): IPMI/Redfish-based bare metal provisioning
  • Sidero: Talos-native bare metal management
  • Hardware inventory and machine lifecycle
  • GitOps-driven cluster lifecycle (create, upgrade, scale, delete)
  • Multi-cluster management from a single management cluster

┌─────────────────────────────────────────────────────────────┐
│ CLUSTER API ON BARE METAL │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Management Cluster │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ CAPI │ │ Bootstrap│ │ Infra │ │ │
│ │ │Controller│ │ Provider │ │ Provider │ │ │
│ │ │ │ │ (Talos/ │ │ (Metal3/ │ │ │
│ │ │ Manages │ │ kubeadm)│ │ Sidero) │ │ │
│ │ │ Cluster, │ │ │ │ │ │ │
│ │ │ Machine │ │ Generates│ │ Provisions│ │ │
│ │ │ CRDs │ │ bootstrap│ │ bare │ │ │
│ │ │ │ │ config │ │ metal │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └──────────────────┬─────────────────────────┘ │
│ │ Provisions │
│ ┌───────────▼───────────┐ │
│ │ Workload Cluster │ │
│ │ │ │
│ │ ┌────┐ ┌────┐ ┌────┐│ │
│ │ │CP-1│ │CP-2│ │CP-3││ │
│ │ └────┘ └────┘ └────┘│ │
│ │ ┌────┐ ┌────┐ ┌────┐│ │
│ │ │W-1 │ │W-2 │ │W-3 ││ │
│ │ └────┘ └────┘ └────┘│ │
│ └───────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
CRD                    | Purpose
-----------------------|------------------------------------------------------------
Cluster                | Defines a K8s cluster (name, version, networking)
Machine                | Represents a single node (control plane or worker)
MachineDeployment      | Manages a set of worker machines (like a Deployment for pods)
MachineHealthCheck     | Auto-remediation for unhealthy nodes
BareMetalHost (Metal3) | Represents a physical server
ServerClass (Sidero)   | Groups servers by hardware capabilities

Metal3 uses IPMI/Redfish to control bare metal servers. It integrates with Ironic (the OpenStack bare metal provisioner) to handle PXE boot, OS installation, and machine lifecycle.

┌─────────────────────────────────────────────────────────────┐
│ METAL3 STACK │
│ │
│ Management Cluster │
│ ┌──────────────────────────────────────────┐ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ CAPM3 │ │ Ironic │ │ │
│ │ │ (controller) │ │ (provisioner)│ │ │
│ │ └──────┬───────┘ └──────┬───────┘ │ │
│ │ │ │ │ │
│ │ ┌──────▼──────────────────▼──────┐ │ │
│ │ │ BareMetalHost CRDs │ │ │
│ │ │ │ │ │
│ │ │ bmh-01: available │ │ │
│ │ │ bmh-02: provisioned (cp-1) │ │ │
│ │ │ bmh-03: provisioned (cp-2) │ │ │
│ │ │ bmh-04: provisioning... │ │ │
│ │ └────────────────────────────────┘ │ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ │ IPMI/Redfish │
│ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ BMC │ │ BMC │ │ BMC │ │ BMC │ │
│ │srv-01│ │srv-02│ │srv-03│ │srv-04│ │
│ └──────┘ └──────┘ └──────┘ └──────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

Pause and predict: In the traditional workflow described in the war story above, creating a cluster took 3 days and involved a shared spreadsheet. With Cluster API, you define a cluster in YAML and kubectl apply it. What are the prerequisites that must be in place before this “kubectl apply” can actually provision physical servers? List at least three infrastructure components.

The BareMetalHost CRD is how Metal3 knows about your physical servers. Each server gets a manifest that includes its BMC address and a reference to a Secret holding the BMC credentials. With these, Metal3 (through Ironic) can power-cycle the machine, set the boot device to PXE, and start OS installation without anyone touching the hardware:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-01
  namespace: metal3
spec:
  online: true
  bootMACAddress: "aa:bb:cc:dd:ee:01"
  bmc:
    address: ipmi://10.0.100.10
    credentialsName: server-01-bmc-credentials
  rootDeviceHints:
    deviceName: /dev/sda
  # Hardware profile auto-detected during inspection
---
apiVersion: v1
kind: Secret
metadata:
  name: server-01-bmc-credentials
  namespace: metal3
type: Opaque
data:
  username: YWRtaW4=      # admin
  password: cGFzc3dvcmQ=  # password
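The values under `data` in a Secret are base64-encoded, and a stray trailing newline (easy to introduce with `echo`) ends up inside the credential. A minimal sketch of producing the values, assuming the placeholder credentials from the manifest above; the helper name `b64` is made up for illustration:

```shell
# Encode a value for a Secret's `data` field.
# printf '%s' avoids the trailing newline that plain `echo` would add.
b64() {
  printf '%s' "$1" | base64
}

b64 admin      # YWRtaW4=
b64 password   # cGFzc3dvcmQ=
```

Alternatively, `kubectl create secret generic server-01-bmc-credentials -n metal3 --from-literal=username=... --from-literal=password=...` does the encoding for you.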
┌─────────────────────────────────────────────────────────────┐
│ BAREMETAL HOST LIFECYCLE │
│ │
│ Registering → Inspecting → Available → Provisioning │
│ │ │
│ ▼ │
│ Provisioned │
│ │ │
│ ▼ │
│ Deprovisioning │
│ │ │
│ ▼ │
│ Available │
│ (ready for reuse) │
│ │
│ Registering: BMC credentials verified │
│ Inspecting: Hardware inventory (CPU, RAM, disks, NICs) │
│ Available: Ready for cluster allocation │
│ Provisioning: PXE booting, OS installing │
│ Provisioned: Running as K8s node │
│ Deprovisioning: Wiping disks, returning to pool │
│ │
└─────────────────────────────────────────────────────────────┘
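The lifecycle in the diagram can be read as a small state machine. The sketch below is a toy model of exactly the transitions drawn above, not Metal3 code; the real controller has more states and error paths, and the function name `next_state` is hypothetical:

```shell
# Toy model of the lifecycle diagram: print the state that follows the given one.
next_state() {
  case "$1" in
    registering)    echo inspecting ;;
    inspecting)     echo available ;;
    available)      echo provisioning ;;
    provisioning)   echo provisioned ;;
    provisioned)    echo deprovisioning ;;
    deprovisioning) echo available ;;
    *)              echo "unknown state: $1" >&2; return 1 ;;
  esac
}

# Walk one host from registration through a full provision/deprovision cycle.
state=registering
for i in 1 2 3 4 5 6; do
  printf '%s -> ' "$state"
  state=$(next_state "$state")
done
echo "$state"
# registering -> inspecting -> available -> provisioning -> provisioned -> deprovisioning -> available
```

The point to notice is the loop at the end: a deprovisioned host lands back in the available state, which is what makes the pool reusable.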

Sidero is Sidero Labs’ bare metal provider for Cluster API, designed specifically for Talos Linux. It is simpler than Metal3 (no Ironic dependency) and uses IPMI/Redfish directly.

Feature          | Metal3 (CAPM3)                        | Sidero
-----------------|---------------------------------------|----------------------------
OS support       | Any (Ubuntu, Flatcar, etc.)           | Talos Linux only
Provisioner      | Ironic (complex, OpenStack heritage)  | Built-in (simpler)
BMC protocol     | IPMI, Redfish, iDRAC, iLO             | IPMI, Redfish
Server discovery | Manual BareMetalHost CRDs             | Auto-discovery via DHCP
Image delivery   | Ironic Python Agent (IPA)             | Talos PXE image
Complexity       | Higher (Ironic is a large system)     | Lower (fewer moving parts)
Maturity         | Older, more tested                    | Newer, less battle-tested
Best for         | Multi-OS environments                 | Talos-only environments
# Sidero auto-discovers servers when they PXE boot.
# Servers appear as Server CRDs automatically.

# Check discovered servers
kubectl get servers -n sidero-system
# NAME                                   HOSTNAME    BMC           ACCEPTED
# 00000000-0000-0000-0000-aabbccddeef1   server-01   10.0.100.10   false
# 00000000-0000-0000-0000-aabbccddeef2   server-02   10.0.100.11   false

# Accept a server into the pool
kubectl patch server 00000000-0000-0000-0000-aabbccddeef1 \
  --type merge -p '{"spec":{"accepted": true}}'
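With more than a handful of servers, accepting them one at a time gets tedious. A cautious sketch, assuming you have already reviewed the list of discovered IDs: the hypothetical helper `accept_commands` only prints the patch commands so they can be inspected before anything is run.

```shell
# Print one `kubectl patch` command per server ID read from stdin.
# Nothing is executed; pipe the output to `sh` only after reviewing it.
accept_commands() {
  while read -r id; do
    [ -n "$id" ] || continue
    printf "kubectl patch server %s --type merge -p '%s'\n" \
      "$id" '{"spec":{"accepted": true}}'
  done
}

# The two IDs from the example output above.
printf '%s\n' \
  00000000-0000-0000-0000-aabbccddeef1 \
  00000000-0000-0000-0000-aabbccddeef2 | accept_commands
```

Accepting a server lets Sidero manage it, including wiping its disks, so an explicit review step is a reasonable safeguard.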
# Group servers by capability
apiVersion: metal.sidero.dev/v1alpha2
kind: ServerClass
metadata:
  name: worker-large
spec:
  qualifiers:
    cpu:
      - manufacturer: "AMD"
        version: "EPYC.*"
    systemInformation:
      - manufacturer: "Dell Inc."
  selector:
    matchLabels:
      rack: "rack-a"
# Define the workload cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: production-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalCluster
    name: production
---
# Control plane (3 nodes from 'control-plane' server class)
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: production-cp
spec:
  replicas: 3
  version: v1.35.0
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: MetalMachineTemplate
    name: production-cp
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
---
# Worker machines (5 nodes from 'worker-large' server class)
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-workers
spec:
  replicas: 5
  clusterName: production
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: production
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: production
    spec:
      clusterName: production
      version: v1.35.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: TalosConfigTemplate
          name: production-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: MetalMachineTemplate
        name: production-workers
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: MetalMachineTemplate
metadata:
  name: production-workers
spec:
  template:
    spec:
      serverClassRef:
        apiVersion: metal.sidero.dev/v1alpha2
        kind: ServerClass
        name: worker-large
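Before applying a manifest like this, it is worth checking that the pool can actually satisfy it. The sketch below sums the `replicas` fields of a manifest and compares the total with a pool size. The helper name `required_servers`, the reduced stand-in manifest, and the pool count of 10 are all assumptions for illustration:

```shell
# Sum every `replicas:` value found in a manifest.
required_servers() {
  awk '$1 == "replicas:" { total += $2 } END { print total + 0 }' "$1"
}

# Stand-in for the cluster manifest, reduced to the fields that matter here.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
kind: TalosControlPlane
spec:
  replicas: 3
---
kind: MachineDeployment
spec:
  replicas: 5
EOF

needed=$(required_servers "$manifest")
in_pool=10   # hypothetical number of accepted, unallocated servers
echo "servers needed: $needed, in pool: $in_pool"
if [ "$needed" -gt "$in_pool" ]; then
  echo "not enough servers: some machines would wait for hardware"
fi
rm -f "$manifest"
```

The count ignores server classes; a fuller check would compare each template against the servers matching its ServerClass.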
# Apply the cluster definition
kubectl apply -f production-cluster.yaml
# Watch the provisioning
kubectl get machines -w
# NAME PHASE
# production-cp-abc12 Provisioning
# production-cp-def34 Pending
# production-cp-ghi56 Pending
# production-workers-jkl78 Pending
# ...
# After ~10-15 minutes:
# production-cp-abc12 Running
# production-cp-def34 Running
# production-cp-ghi56 Running
# production-workers-jkl78 Running
# production-workers-mno90 Running
# Get the workload cluster kubeconfig
kubectl get secret production-kubeconfig -o jsonpath='{.data.value}' | base64 -d > production.kubeconfig
kubectl --kubeconfig production.kubeconfig get nodes

Stop and think: A worker node’s NVMe drive fails at 3 AM. With the traditional approach, an on-call engineer gets paged, SSHes into the node, cordons it, drains pods, and files a hardware ticket. With the MachineHealthCheck below, what happens instead? What is still a manual step even with full automation?

Automatically remediate failed nodes by replacing them with fresh hardware from the pool:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: production-worker-health
spec:
  clusterName: production
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: production-workers
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 5m
    - type: Ready
      status: Unknown
      timeout: 5m
  maxUnhealthy: "40%"  # Don't remediate if >40% are unhealthy (likely a systemic issue)
  nodeStartupTimeout: 10m

When a node is unhealthy for >5 minutes:

  1. CAPI marks the Machine for deletion
  2. The infrastructure provider deprovisions the bare metal host (wipes disk)
  3. The host returns to “Available” in the pool
  4. CAPI creates a new Machine, which provisions a new host
  5. The new node joins the cluster automatically
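The `maxUnhealthy` guard decides whether these steps run at all. A small sketch of the arithmetic for the 5-worker cluster above, assuming the percentage is converted to a machine count by rounding down; the function name `remediation_allowed` is hypothetical:

```shell
# Decide whether remediation may proceed.
# Args: total machines, currently unhealthy machines, maxUnhealthy percentage.
remediation_allowed() {
  limit=$(( $1 * $3 / 100 ))   # integer division rounds down (assumption)
  [ "$2" -le "$limit" ]
}

for n in 1 2 3; do
  if remediation_allowed 5 "$n" 40; then
    echo "$n of 5 unhealthy: remediate"
  else
    echo "$n of 5 unhealthy: hold off, possible systemic issue"
  fi
done
```

With 5 workers and 40%, one or two unhealthy machines are replaced, while three at once pauses remediation so that a network or power problem does not trigger a wave of reprovisioning.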

Pause and predict: Your team manages 5 Kubernetes clusters across 2 datacenters. Currently, cluster changes are made by running kubectl commands manually. What specific risks does this create, and how does the GitOps approach below eliminate each one?

┌─────────────────────────────────────────────────────────────┐
│ GITOPS CLUSTER LIFECYCLE │
│ │
│ Git Repository │
│ ├── clusters/ │
│ │ ├── production/ │
│ │ │ ├── cluster.yaml (Cluster definition) │
│ │ │ ├── control-plane.yaml (TalosControlPlane) │
│ │ │ ├── workers.yaml (MachineDeployment) │
│ │ │ └── health-checks.yaml (MachineHealthCheck) │
│ │ ├── staging/ │
│ │ │ └── ... │
│ │ └── dev/ │
│ │ └── ... │
│ └── infrastructure/ │
│ ├── servers.yaml (BareMetalHost inventory) │
│ └── server-classes.yaml (ServerClass definitions) │
│ │
│ ArgoCD/Flux watches → applies to management cluster │
│ Management cluster → provisions workload clusters │
│ │
│ To create a cluster: git commit + push │
│ To scale workers: change replicas, git push │
│ To upgrade K8s: change version, git push │
│ To delete a cluster: remove YAML, git push │
│ │
└─────────────────────────────────────────────────────────────┘
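As a sketch of the "change replicas, git push" step: the helper below rewrites the replica count in a workers manifest. The function name `set_replicas`, the throwaway directory, and the branch name in the comments are assumptions for illustration, and the git commands are shown as comments because they depend on your repository:

```shell
# Rewrite the first `replicas:` line in a manifest, keeping its indentation.
set_replicas() {
  file=$1; count=$2
  awk -v n="$count" '
    !seen && $1 == "replicas:" { sub(/replicas:.*/, "replicas: " n); seen = 1 }
    { print }
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demo on a throwaway copy of a workers manifest.
workdir=$(mktemp -d)
cat > "$workdir/workers.yaml" <<'EOF'
kind: MachineDeployment
metadata:
  name: production-workers
spec:
  replicas: 5
EOF

set_replicas "$workdir/workers.yaml" 7
grep 'replicas:' "$workdir/workers.yaml"
rm -rf "$workdir"

# In the real repository the change would then go through review, e.g.:
#   git checkout -b scale-production-workers
#   git commit -am "production: scale workers from 5 to 7"
#   git push origin scale-production-workers
```

Once the change is merged, ArgoCD or Flux applies it to the management cluster, and the commit history doubles as the audit trail.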

  • Cluster API was created by Kubernetes SIG Cluster Lifecycle specifically because every cloud provider had built their own incompatible cluster management tooling. CAPI provides a single API that works across AWS, Azure, GCP, vSphere, and bare metal.

  • Metal3 stands for “Metal Kubed” (Metal^3). It was created by Red Hat and is the bare metal infrastructure provider used by OpenShift’s Assisted Installer for on-premises deployments.

  • Sidero was created by the same team that built Talos Linux (Sidero Labs). The name comes from the Greek word for “iron” — fitting for bare metal management.

  • The largest known Cluster API deployment manages over 4,000 clusters across multiple infrastructure providers. Organizations like Deutsche Telekom and SAP use CAPI to manage their multi-cluster Kubernetes platforms at enterprise scale.


Mistake                       | Problem                                                      | Solution
------------------------------|--------------------------------------------------------------|----------------------------------------------------------
No management cluster HA      | Management cluster dies = cannot manage anything             | Run 3-node HA management cluster with etcd backup
BMC credentials in plain text | Security risk                                                | Use Kubernetes secrets + external secrets operator
No server pool buffer         | MachineHealthCheck tries to replace but no servers available | Maintain 2-3 spare servers in the pool
Skipping hardware inspection  | Deploying on servers with failed RAM or disks                | Always let CAPI inspect hardware before marking available
No disk wipe on deprovision   | Previous tenant’s data visible to next                       | Enable secure erase in Metal3/Sidero deprovisioning
Single management cluster     | Management cluster failure = total loss of control           | Backup management cluster state; consider multi-site mgmt
Not using GitOps              | Cluster definitions are imperative and unauditable           | Store all CAPI YAMLs in Git; deploy via ArgoCD/Flux

What happens if the management cluster goes down? Can the workload clusters still function?

Answer

Yes, workload clusters continue to function normally. The management cluster only manages the lifecycle (creation, scaling, upgrades, health checks) of workload clusters. Once a workload cluster is provisioned, it operates independently — its control plane, workers, and workloads are self-contained.

However, you lose:

  • Scaling: Cannot add/remove worker nodes
  • Upgrades: Cannot trigger K8s or OS upgrades
  • Auto-remediation: MachineHealthChecks stop working (unhealthy nodes are not replaced)
  • New cluster creation: Cannot provision new clusters

Mitigation: Run the management cluster with 3-node HA, back up its etcd regularly, and consider a standby management cluster in a second datacenter.

You need to upgrade Kubernetes from 1.34 to 1.35 on a 50-node production cluster managed by Cluster API. How does this work?

Answer

Rolling upgrade via CAPI:

  1. Update the version field in the control plane and MachineDeployment YAMLs:

    # TalosControlPlane: version is at spec.version
    spec:
      version: v1.35.0 # was v1.34.0

    # MachineDeployment: version is at spec.template.spec.version
    spec:
      template:
        spec:
          version: v1.35.0 # was v1.34.0
  2. Apply (or Git push if using GitOps). CAPI detects the version change.

  3. CAPI performs a rolling update:

    • Creates a new Machine with v1.35.0
    • Waits for it to join the cluster and become Ready
    • Cordons and drains an old v1.34.0 Machine
    • Deletes the old Machine (hardware returns to pool)
    • Repeats until all machines are upgraded
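  4. Upgrade the control plane first, then the workers. CAPI only enforces this ordering when the cluster uses a managed topology (ClusterClass); with standalone control plane and MachineDeployment objects, as in this module, you sequence the two changes yourself.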

This is exactly like a Deployment rollout — CAPI manages Machine objects the same way the Deployment controller manages Pods. The maxSurge and maxUnavailable settings on MachineDeployment control the rollout speed.

Key consideration on bare metal: This requires spare servers in the pool. CAPI needs to provision a new machine before deprovisioning the old one (surge). If your pool has no spare servers, the upgrade blocks.
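When the pool has no spares, one option is to trade temporary capacity for progress: disable surge and let CAPI take one worker out at a time. A sketch of the corresponding patch for the v1beta1 MachineDeployment used in this module; the helper name `rollout_patch` is hypothetical, and the `kubectl` call is left as a comment because it needs a live management cluster:

```shell
# Build a merge patch that sets the MachineDeployment rollout strategy.
# Args: maxSurge, maxUnavailable.
rollout_patch() {
  printf '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":%s,"maxUnavailable":%s}}}}\n' "$1" "$2"
}

# No spare servers: never surge, remove one worker at a time instead.
rollout_patch 0 1

# Applied against the management cluster it would look like:
#   kubectl patch machinedeployment production-workers \
#     --type merge -p "$(rollout_patch 0 1)"
```

The cost is that the cluster runs one worker short for the duration of the upgrade, so check that workloads tolerate the reduced capacity first.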

Compare Metal3 and Sidero. When would you choose each?

Answer

Metal3:

  • Supports any OS (Ubuntu, Flatcar, RHEL, etc.)
  • Uses Ironic (OpenStack heritage) — more complex but battle-tested
  • Backed by Red Hat, used in OpenShift bare metal deployments
  • Better for multi-OS environments or organizations already using OpenStack
  • More mature ecosystem and documentation

Sidero:

  • Talos Linux only (no other OS support)
  • Simpler architecture (no Ironic dependency)
  • Auto-discovers servers via PXE (no manual BareMetalHost creation)
  • ServerClass grouping for hardware-aware scheduling
  • Best for all-Talos environments where simplicity is valued

Decision: If you chose Talos Linux in Module 2.3, use Sidero. If you need Ubuntu/Flatcar/RHEL, use Metal3.

How do you handle a server with a failed disk in a Cluster API-managed cluster?

Answer

Automatic remediation via MachineHealthCheck:

  1. The disk failure causes kubelet to report NotReady (or the node stops responding entirely).

  2. MachineHealthCheck detects the Ready=False condition persisting beyond the timeout (e.g., 5 minutes).

  3. CAPI marks the Machine for deletion.

  4. The infrastructure provider (Metal3/Sidero):

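    • Deprovisions the server and attempts to clean its disks
    • Do not count on the server being quarantined automatically: if cleaning fails, the host typically ends up in an error state, but if it succeeds, the host can return to the available pool with the bad disk still installed. Take it out of rotation until it is repaired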
  5. CAPI creates a new Machine, which is provisioned on a healthy server from the pool.

  6. The new node joins the cluster and workloads are scheduled on it.

Manual steps still needed:

  • Someone must physically replace the failed disk
  • After repair, re-register the server (update BareMetalHost or re-PXE for Sidero)
  • The server goes through inspection and returns to the available pool

This is why spare servers matter. If your pool is empty, the MachineHealthCheck cannot create a replacement, and the unhealthy machine stays in the cluster.
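A simple way to keep an eye on the buffer is to count hosts in the available state. The sketch below counts them from `<name> <state>` lines; the helper name `count_available` is hypothetical, and the sample input stands in for live output using the hosts from the Metal3 stack diagram:

```shell
# Count hosts in the "available" state.
# Expects lines of "<name> <state>" on stdin.
count_available() {
  awk '$2 == "available" { n++ } END { print n + 0 }'
}

# Sample input standing in for live output.
printf '%s\n' \
  'bmh-01 available' \
  'bmh-02 provisioned' \
  'bmh-03 provisioned' \
  'bmh-04 provisioning' | count_available    # prints 1

# Against a Metal3 management cluster, the input could come from:
#   kubectl get baremetalhosts -n metal3 \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.provisioning.state}{"\n"}{end}'
```

Alerting when this number drops below the 2-3 spares recommended earlier gives you time to add hardware before a remediation or upgrade stalls.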


Hands-On Exercise: Cluster API with Docker (Simulation)


Task: Use Cluster API with the Docker provider to simulate the bare metal workflow.

Note: The Docker provider is CAPI’s testing/development provider. It creates “machines” as Docker containers. The workflow is identical to bare metal — only the infrastructure layer differs.

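# Install clusterctl
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/latest/download/clusterctl-linux-amd64 -o clusterctl
chmod +x clusterctl && sudo mv clusterctl /usr/local/bin/

# Create a kind cluster as the management cluster.
# The Docker provider needs access to the host's Docker socket.
cat > kind-capi-mgmt.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /var/run/docker.sock
    containerPath: /var/run/docker.sock
EOF
kind create cluster --name capi-mgmt --config kind-capi-mgmt.yaml

# Initialize CAPI with the Docker provider.
# Its cluster template uses ClusterClass, which is gated behind CLUSTER_TOPOLOGY.
export CLUSTER_TOPOLOGY=true
clusterctl init --infrastructure docker

# Generate a workload cluster manifest (the Docker provider ships its
# template as the "development" flavor)
clusterctl generate cluster demo-cluster \
  --infrastructure docker \
  --flavor development \
  --kubernetes-version v1.35.0 \
  --control-plane-machine-count 1 \
  --worker-machine-count 2 \
  > demo-cluster.yaml

# Apply the cluster definition
kubectl apply -f demo-cluster.yaml

# Watch machines being provisioned (Ctrl-C to stop watching)
kubectl get machines -w

# Get the workload cluster kubeconfig
clusterctl get kubeconfig demo-cluster > demo.kubeconfig

# Verify the workload cluster (nodes stay NotReady until a CNI is installed)
kubectl --kubeconfig demo.kubeconfig get nodes

# Scale workers. With a managed topology, change the replica count on the
# Cluster object; edits made directly to the MachineDeployment are reverted.
kubectl patch cluster demo-cluster --type json \
  -p '[{"op":"replace","path":"/spec/topology/workers/machineDeployments/0/replicas","value":4}]'

# Watch new machines appear
kubectl get machines -w

# Cleanup
kubectl delete cluster demo-cluster
kind delete cluster --name capi-mgmt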
  • Management cluster created (kind)
  • CAPI initialized with Docker provider
  • Workload cluster provisioned (1 CP + 2 workers)
  • kubeconfig retrieved and kubectl works against workload cluster
  • Workers scaled from 2 to 4
  • Cluster deleted cleanly (all machines deprovisioned)

Continue to Module 3.1: Datacenter Network Architecture to learn about spine-leaf topology, VLANs, and network design for on-premises Kubernetes.