Module 5.3: Cluster API on Bare Metal
Цей контент ще не доступний вашою мовою.
Complexity:
[ADVANCED]| Time: 50 minutesPrerequisites: Module 5.2: Multi-Cluster Control Planes, Module 2.4: Declarative Bare Metal
Why This Module Matters
Section titled “Why This Module Matters”At Deutsche Telekom, managing the underlying infrastructure for edge computing meant operating thousands of distributed Kubernetes clusters across remote cell towers and localized data centers. Prior to adopting declarative infrastructure, their legacy provisioning model required an engineer to manually trigger PXE boots, install operating systems, execute kubeadm bootstrap commands, and configure overlay networks. This imperative workflow meant that provisioning a single remote cluster consumed three full days of engineering effort. When a critical node suffered a hardware failure at a remote edge site, the replacement process demanded up to eight hours of downtime and remote hands to physically identify the server, reconfigure BIOS parameters, reinstall the system, and rejoin the degraded cluster. Such delays incurred massive service degradation costs, frequently exceeding tens of thousands of dollars per incident in SLA penalties and lost operational efficiency.
The organization transformed its infrastructure lifecycle by adopting Cluster API (CAPI) paired with the Metal3 infrastructure provider (CAPM3). By treating bare-metal Kubernetes clusters as declarative resources—identical in concept to native Kubernetes Deployments or Services—the operations team defined their desired target state in Git. Once applied, CAPI and Metal3 handled the complex orchestration of communicating with Baseboard Management Controllers (BMCs) to power on servers, push operating system images, and orchestrate cluster joins autonomously.
The results of this transformation were immediate and profound. The laborious three-day cluster provisioning cycle was reduced to an unattended forty-five-minute operation. More importantly, node failures now trigger autonomous remediation. A dedicated controller detects the unhealthy node, evicts the workload, deprovisions the failed server, and dynamically provisions a replacement from an available pool of spare hardware. The multi-hour outage window shrank to fifteen minutes, eliminating the need for remote hands and drastically cutting the financial impact of hardware degradation.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Implement Cluster API with the Metal3 provider to declaratively provision bare-metal Kubernetes clusters.
- Design a multi-cluster lifecycle pipeline that orchestrates provisioning and zero-downtime chained upgrades through GitOps.
- Evaluate the state transitions of BareMetalHost resources during hardware inspection, provisioning, and decommissioning.
- Diagnose node failures and configure MachineHealthChecks to enforce automatic infrastructure remediation safely.
- Compare the provisioning capabilities, network prerequisites, and operational overhead of bare-metal environments against hypervisor-backed virtualized infrastructure.
What You’ll Learn
Section titled “What You’ll Learn”- The overarching architecture of Cluster API and the organizational structure of Kubernetes SIG Cluster Lifecycle.
- The distinctions between Infrastructure Providers, Bootstrap Providers, and Control Plane Providers.
- The mechanics of CAPM3 (Metal3) and its integration with OpenStack Ironic for out-of-band hardware management.
- Techniques for managing a BareMetalHost inventory and understanding its internal state machine.
- Strategies for configuring MachineHealthCheck controllers to perform automatic node remediation without risking cascading failures.
- Designing GitOps-driven cluster lifecycle workflows utilizing Flux or ArgoCD with safe pruning policies.
- Declarative operations for scaling workloads and executing chained Kubernetes version upgrades.
1. Cluster API Architecture & Ecosystem
Section titled “1. Cluster API Architecture & Ecosystem”Cluster API (CAPI) is not a standalone CNCF project; rather, it is a formal subproject of the Kubernetes SIG Cluster Lifecycle. Its fundamental mission is to bring declarative, Kubernetes-style APIs to the creation, configuration, and management of clusters themselves. By extending the Kubernetes controller pattern, CAPI allows platform engineers to manage the lifecycle of entire fleets of clusters with the same tooling used to manage application pods.
The architectural paradigm relies on a strict separation of concerns between a Management Cluster and multiple Workload Clusters. The Management Cluster is a dedicated Kubernetes environment that runs the CAPI core controllers and provider-specific Custom Resource Definitions (CRDs). It is strictly responsible for managing the state of other clusters and should never host end-user application workloads. The Workload Clusters are the downstream environments created and operated by the Management Cluster.
The Provider Model
Section titled “The Provider Model”Cluster API separates its operational logic into distinct provider categories:
- Infrastructure Providers: These controllers interact with underlying cloud or hardware APIs to provision computational resources, networks, and load balancers. Examples include Metal3 for bare metal, AWS (CAPA), and vSphere (CAPV).
- Bootstrap Providers: These controllers are responsible for turning a newly provisioned server into a Kubernetes node. They generate the necessary
cloud-initor Ignition scripts containing the certificates and join commands. - Control Plane Providers: These controllers manage the complex lifecycle of the Kubernetes control plane, handling etcd quorum, certificate rotation, and safely rolling out minor version upgrades across control plane nodes.
The default and built-in implementation for both the bootstrap and control-plane roles is Kubeadm (officially known as CABPK for the bootstrap provider, and KubeadmControlPlane for the control plane provider).
Architectural Diagram
Section titled “Architectural Diagram”flowchart TD subgraph Mgmt[Management Cluster] subgraph Core[CAPI Core Controllers] C[Cluster Controller] M[Machine Controller] MHC[MachineHealthCheck Ctrl] end subgraph Infra[Infrastructure Provider] M3C[Metal3Cluster Controller] M3M[Metal3Machine Controller] I[Ironic Bare-Metal Provisioner\n- IPMI/Redfish\n- PXE boot\n- Inspection] end subgraph Boot[Bootstrap Provider] K[Kubeadm\n- Generates configs\n- Cloud-init] end subgraph CP[Control Plane Provider] KCP[KubeadmControlPlane\n- CP lifecycle] end end subgraph Wkld[Workload Cluster] N1[CP1] N2[CP2] N3[CP3] N4[W1] N5[W2] N6[W3] N7[W4] end Mgmt -- "Creates and manages" --> WkldTooling and API Evolution
Section titled “Tooling and API Evolution”The official command-line utility for initializing management clusters, upgrading providers, and moving resources between clusters is clusterctl. Because Cluster API heavily utilizes mutating and validating webhooks, cert-manager (specifically supporting the cert-manager.io/v1 API) is a strict, required prerequisite. When running clusterctl init, the CLI will automatically detect if cert-manager is absent and install version v1.20.0 by default to ensure webhooks function correctly.
As the project matures, API versions transition through standard Kubernetes deprecation cycles. The CAPI v1beta2 API was introduced and promoted to the official storage version in CAPI v1.11 (released in August 2025). Consequently, the older v1beta1 API will be completely unserved and removed from the API server in CAPI v1.14, targeting August 2026. The latest stable CAPI release as of April 2026 is v1.12.x (with v1.12.0 having shipped on January 27, 2026, followed by patch releases such as v1.12.2). The v1.13 minor release is scheduled for April 2026.
CAPI maintains a strict version support policy: only the two most recent minor releases are actively supported at any given time. Older releases become unsupported immediately upon the release of a new minor version. Within the v1.12 release, the management cluster can run Kubernetes v1.31.x through v1.34.x, while workload clusters can run Kubernetes v1.29.x through v1.34.x.
One of the most significant operational improvements introduced in CAPI v1.12 is support for in-place updates and chained upgrades. Chained upgrades allow platform engineers to declare a target Kubernetes version that spans multiple minor releases; CAPI will automatically compute and execute the intermediate upgrade steps required to safely reach the destination without violating Kubeadm version skew policies. Additionally, ClusterClass—a mechanism for templating managed topologies—remains classified as an experimental feature in CAPI v1.12 and strictly requires the ClusterTopology feature gate to be enabled on the management cluster.
2. Metal3 and Bare Metal Provisioning
Section titled “2. Metal3 and Bare Metal Provisioning”Metal3 (pronounced “metal-cubed”) serves as the Cluster API infrastructure provider for bare-metal servers. The project was initially accepted into the CNCF Sandbox in September 2020. Due to massive ecosystem adoption, Metal3.io subsequently became a CNCF Incubating project on August 27, 2025, supported by 57 active contributing organizations.
The current release series for the Metal3 provider is CAPM3 v1.12.x, which is explicitly versioned to track the upstream CAPI release cycle. Like CAPI, CAPM3 strictly maintains support for only the two most recent minor releases.
The Bare Metal Operator and Ironic
Section titled “The Bare Metal Operator and Ironic”Metal3 does not reinvent hardware provisioning; instead, it leverages OpenStack Ironic as its underlying bare metal provisioning engine. Ironic runs seamlessly inside a container within the Metal3 deployment. The Bare Metal Operator (BMO) acts as the bridge between Kubernetes CRDs and Ironic’s declarative API. The BMO operates on its own semantic versioning cycle, with BMO v0.12.x currently serving as the stable release.
Ironic provides native out-of-band management support via both IPMI and Redfish protocols. Modern environments heavily favor Redfish, which offers advanced capabilities such as virtual media boot, BIOS settings configuration, and RAID management over a RESTful HTTPS API.
Pause and predict: What would happen if you registered a BareMetalHost with incorrect BMC credentials? At what stage of the lifecycle would the failure be detected?
To manage bare-metal servers, you register them in your management cluster as BareMetalHost (BMH) resources using the API group metal3.io/v1alpha1.
# Register a bare-metal serverapiVersion: metal3.io/v1alpha1kind: BareMetalHostmetadata: name: server-rack1-u10 namespace: metal3spec: online: true bootMACAddress: "aa:bb:cc:dd:ee:01" bmc: address: "ipmi://192.168.1.10" # Or redfish:// credentialsName: server-rack1-u10-bmc disableCertificateVerification: true rootDeviceHints: deviceName: "/dev/nvme0n1" # Install OS here hardwareProfile: "unknown" # Let Ironic inspect# BMC credentialsapiVersion: v1kind: Secretmetadata: name: server-rack1-u10-bmc namespace: metal3type: OpaquestringData: username: admin password: "CHANGE_ME_IN_VAULT"The Provisioning State Machine
Section titled “The Provisioning State Machine”The BareMetalHost resource is governed by a rigorous state machine managed by the Bare Metal Operator:
- Registering: The BMH is created. BMO verifies the syntax and attempts to authenticate against the BMC using the provided credentials.
- Inspecting: Ironic boots an ephemeral inspection ramdisk on the server to dynamically inventory CPU, RAM, disks, and network interfaces.
- Available: The server successfully passed inspection and is powered off. This is the stable idle state awaiting a machine claim from CAPI.
- Provisioning: A Cluster API Machine claims the host. Ironic powers on the server, instructs it to PXE boot, writes the operating system image to the root device, and injects the cloud-init configuration.
- Provisioned: The operating system boots from local disk, cloud-init executes the Kubeadm join sequence, and the node successfully registers with the workload cluster.
- Deprovisioning: The Cluster API Machine is deleted. The BMO instructs Ironic to securely wipe the local disks and scrub configuration data.
- Deleting: The resource is being removed from the Kubernetes API.
- Error: If any of the above processing states encounter an unrecoverable failure, the host transitions into an error state requiring human intervention.
Note on Ironic Versions: Some unofficial community guides suggest that the minimum OpenStack Ironic API version required for Metal3 integration is 1.81 (which corresponds to the OpenStack 2023.1 ‘Antelope’ release). However, administrators must exercise caution, as this specific version requirement is not corroborated by the authoritative Metal3 or Ironic documentation as of this writing. Always validate compatibility empirically before executing infrastructure upgrades.
In addition to hardware provisioning, Metal3 ships an IP Address Manager (IPAM) component that manages static IP allocations for bare-metal nodes via CRDs (such as IPPool and IPClaim) under the API group ipam.metal3.io/v1alpha1.
For organizations evaluating alternatives, Tinkerbell offers its own Cluster API Provider (CAPT) as a different bare-metal infrastructure backend. Its latest tagged release is v0.6.4.
3. Declarative Cluster Creation
Section titled “3. Declarative Cluster Creation”Creating a cluster with CAPM3 requires linking the core CAPI resources to their infrastructure-specific counterparts. The Cluster resource references a KubeadmControlPlane and a Metal3Cluster. The MachineDeployment references a Metal3MachineTemplate and a KubeadmConfigTemplate.
# 1. Cluster definitionapiVersion: cluster.x-k8s.io/v1beta1kind: Clustermetadata: name: production namespace: clustersspec: clusterNetwork: pods: cidrBlocks: ["10.244.0.0/16"] services: cidrBlocks: ["10.96.0.0/12"] controlPlaneRef: apiVersion: controlplane.cluster.x-k8s.io/v1beta1 kind: KubeadmControlPlane name: production-cp infrastructureRef: apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 kind: Metal3Cluster name: production# 2. Metal3 cluster configapiVersion: infrastructure.cluster.x-k8s.io/v1beta1kind: Metal3Clustermetadata: name: production namespace: clustersspec: controlPlaneEndpoint: host: 10.0.0.100 port: 6443 noCloudProvider: true# 3. Control plane (3 nodes, auto-managed)apiVersion: controlplane.cluster.x-k8s.io/v1beta1kind: KubeadmControlPlanemetadata: name: production-cp namespace: clustersspec: replicas: 3 version: v1.34.0 machineTemplate: infrastructureRef: apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 kind: Metal3MachineTemplate name: production-cp kubeadmConfigSpec: initConfiguration: nodeRegistration: kubeletExtraArgs: node-labels: "node-role.kubernetes.io/control-plane=" joinConfiguration: nodeRegistration: kubeletExtraArgs: node-labels: "node-role.kubernetes.io/control-plane="# 4. Machine template for control planeapiVersion: infrastructure.cluster.x-k8s.io/v1beta1kind: Metal3MachineTemplatemetadata: name: production-cp namespace: clustersspec: template: spec: image: url: "http://10.0.0.50/ubuntu-22.04-k8s.qcow2" checksum: "sha256:abc123..." checksumType: sha256 format: qcow2 hostSelector: matchLabels: role: control-plane# 5. Worker MachineDeploymentapiVersion: cluster.x-k8s.io/v1beta1kind: MachineDeploymentmetadata: name: production-workers namespace: clustersspec: clusterName: production replicas: 5 selector: matchLabels: cluster.x-k8s.io/cluster-name: production template: metadata: labels: cluster.x-k8s.io/cluster-name: production spec: clusterName: production version: v1.34.0 bootstrap: configRef: apiVersion: bootstrap.cluster.x-k8s.io/v1beta1 kind: KubeadmConfigTemplate name: production-workers infrastructureRef: apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 kind: Metal3MachineTemplate name: production-workersComparing Bare Metal to vSphere
Section titled “Comparing Bare Metal to vSphere”Organizations frequently operate a mix of bare-metal infrastructure and hypervisor-backed virtual machines. Cluster API Provider vSphere (CAPV) is the analog to CAPM3 for VMware environments. CAPV utilizes VSphereCluster and VSphereMachineTemplate CRDs to define virtual machine specifications.
| Dimension | CAPM3 (Bare Metal) | CAPV (vSphere) |
|---|---|---|
| Provision time | 5-15 minutes | 2-5 minutes |
| Prerequisites | Ironic, DHCP, PXE, BMC | vCenter, templates |
| Node pool | BareMetalHost inventory (fixed) | Unlimited (VM clone) |
| Rollback | Wipe + reprovision | VM snapshot/rollback |
| Best for | Maximum perf, no hypervisor tax | Flexibility, fast iteration |
Stop and think: The MachineHealthCheck has a
maxUnhealthyfield set to 40%. Why is this safety valve critical for bare-metal environments? What would happen if it were set to 100% and a network switch failed, making 6 out of 10 nodes appear NotReady?
4. MachineHealthCheck & Automatic Remediation
Section titled “4. MachineHealthCheck & Automatic Remediation”The capability to self-heal underlying infrastructure is one of the most powerful features provided by Cluster API. The MachineHealthCheck (MHC) controller constantly monitors the health conditions of Kubernetes nodes. If a node remains in a NotReady state for a configured duration, the MHC safely assumes the node is irrecoverable and orchestrates its replacement.
apiVersion: cluster.x-k8s.io/v1beta1kind: MachineHealthCheckmetadata: name: production-worker-health namespace: clustersspec: clusterName: production selector: matchLabels: cluster.x-k8s.io/deployment-name: production-workers unhealthyConditions: - type: Ready status: "False" timeout: 300s # Node NotReady for 5 minutes - type: Ready status: "Unknown" timeout: 300s # Node unreachable for 5 minutes maxUnhealthy: "40%" # Safety: do not remediate if > 40% nodes are unhealthy nodeStartupTimeout: 600s # New nodes must be Ready within 10 minutesRemediation Flow
Section titled “Remediation Flow”sequenceDiagram participant Node as Node worker-03 participant MHC as MachineHealthCheck participant MD as MachineDeployment participant BMO as BareMetalHost Inventory
Node-->>MHC: Becomes NotReady Note over MHC: Detects Ready=False for > 300s Note over MHC: Safety check: < 40% unhealthy? (Yes) MHC->>Node: Delete Machine "worker-03" Note over Node: Cordon, drain, deprovision, power off MD-->>MD: Detects replicas=5 but only 4 exist MD->>BMO: Create new Machine "worker-06" BMO-->>MD: Select Available BareMetalHost Note over BMO: Provision OS, join cluster BMO->>Node: New node "worker-06" becomes Ready Note over BMO,Node: Total time: 5-15 minsIt is absolutely vital to maintain spare capacity in your bare-metal inventory. If the MachineHealthCheck triggers a remediation event but all BareMetalHosts are currently allocated, the replacement Machine will remain in a Pending state indefinitely, leaving your cluster permanently degraded until physical hardware is procured.
Pause and predict: If you store Cluster API manifests in Git and use Flux to sync them, what would happen if someone accidentally deleted the
clusters/production/directory from the Git repository withprune: trueenabled in Flux?
5. GitOps-Driven Cluster Management
Section titled “5. GitOps-Driven Cluster Management”By treating infrastructure strictly as code, organizations can manage multi-cluster environments using GitOps methodologies. Changes to hardware allocation, node scaling, and Kubernetes minor version upgrades are proposed via Pull Requests. Once approved by senior engineers, continuous deployment tools like Flux or ArgoCD synchronize the manifests to the Management Cluster, which then reconciles the downstream Workload Clusters.
flowchart TD subgraph GitRepo[Git Repository] direction TB C[clusters/production/cluster.yaml] CP[clusters/production/control-plane.yaml] W[clusters/production/workers.yaml] H[clusters/production/health-checks.yaml] I1[inventory/rack1-hosts.yaml] I2[inventory/rack2-hosts.yaml] end subgraph Flux[Flux / ArgoCD] F[Watches git repo] A[Applies changes] end subgraph Mgmt[Management Cluster] CAPI[CAPI Controllers] end GitRepo --> Flux Flux --> Mgmt Mgmt --> |Provisions| Wkld[Workload Clusters]
Note1[1. Engineer opens PR to scale workers] Note2[2. PR reviewed and merged] Note3[3. Flux applies MachineDeployment] Note4[4. CAPI provisions BareMetalHosts] Note5[5. Audit: git log shows who scaled, when, and why] Note1 -.-> Note2 -.-> Note3 -.-> Note4 -.-> Note5Because pruning in Flux deletes cluster definitions if they are removed from the repository, potentially destroying your infrastructure, it is critical to disable pruning for CAPI Kustomizations. Always append prune: false to ensure that an accidental git rm command does not result in the catastrophic deprovisioning of all bare-metal nodes in production.
Did You Know?
Section titled “Did You Know?”- The latest stable Cluster API release, v1.12.0, was shipped on January 27, 2026, and officially introduced chained upgrades. This enables engineers to perform multi-version jumps by allowing the controller to calculate and execute intermediate upgrade steps securely.
- Metal3.io was officially accepted as a CNCF Incubating project on August 27, 2025. At the time of incubation, the project boasted contributions from 57 active organizations, highlighting the massive industry demand for declarative bare-metal provisioning.
- The Cluster API v1beta1 API version will be completely unserved from the API server starting in CAPI v1.14, targeting August 2026. All manifests must be migrated to the newer v1beta2 storage format before executing this platform upgrade.
- Metal3 uses OpenStack Ironic without requiring a full OpenStack deployment. Ironic runs seamlessly inside a container, giving you bare-metal provisioning capabilities without Nova, Neutron, Keystone, or any other OpenStack service.
- Cluster API applies the Kubernetes controller pattern to infrastructure. Just as a Deployment controller ensures the right number of pods exist, the MachineDeployment controller ensures the right number of nodes exist.
- CAPM3 can manage servers from any vendor—Dell iDRAC, HPE iLO, Supermicro IPMI, Lenovo XClarity—as long as they support IPMI or Redfish. Redfish is replacing IPMI, which sends credentials in plaintext over UDP.
- Deutsche Telekom operates one of the largest known bare-metal Cluster API deployments in existence. Their Das Schiff platform utilizes CAPI to autonomously manage thousands of distributed Kubernetes clusters located at remote edge locations and cell towers worldwide.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| No spare BareMetalHosts | MachineHealthCheck creates replacement Machine but no host is available | Keep 10-15% of inventory as spare (e.g., 2 spares for 15 servers) |
| BMC credentials in plain Secrets | IPMI/Redfish credentials exposed in etcd | Use SealedSecrets or SOPS encryption for BMC credentials |
| Pruning enabled in GitOps | Flux deletes cluster resources if removed from git | Set prune: false for cluster Kustomizations |
maxUnhealthy threshold at 100% | Cascading failure: MHC replaces all nodes simultaneously during network partition | Set maxUnhealthy to 30-40% to prevent mass remediation |
| Unversioned OS images | Cannot reproduce node state, drift between nodes | Version OS images, store in HTTP server, reference by checksum |
| Skipping hardware inspection | CAPI provisions a server with bad RAM or failed disk | Let Ironic inspect all hosts before marking them Available |
| Imperative manual modifications | Drift between desired state (git) and actual state | All changes via git PRs, never kubectl edit on CAPI resources |
| Missing cert-manager prerequisite | CAPI initialization fails or webhooks crash | Ensure cert-manager v1.20.0 is installed before bootstrapping |
Question 1
Section titled “Question 1”You operate a data center with 20 bare-metal servers and need to deploy three independent Kubernetes clusters (dev: 3 nodes, staging: 5 nodes, production: 9 nodes). To ensure high availability and resilient operations, how many BareMetalHosts should you register into your inventory, and how many should be strictly labeled as spare?
Answer
Register all 20 servers. Keep 3 as spares.
Allocation breakdown:
- Dev cluster: 1 CP + 2 workers = 3 nodes
- Staging cluster: 1 CP + 4 workers = 5 nodes (control plane nodes must always be an odd number to maintain etcd quorum; 2 CP nodes are worse than 1 because a single failure loses quorum).
- Production cluster: 3 CP + 6 workers = 9 nodes
- Total active allocation: 17 nodes.
- Spare allocation: 3 nodes (representing 15% of total hardware).
Why maintaining 3 spares matters:
- MachineHealthCheck remediation: When a production worker suffers a hardware failure, the MHC requires an
AvailableBareMetalHost to provision as a replacement. Without spares, the replacement Machine remains stuck inPendingindefinitely. - Concurrent failures: In the event of a localized power supply failure affecting two nodes, you require multiple spares to immediately restore full cluster capacity.
- Dynamic Scaling: If production requires temporary horizontal expansion to handle seasonal traffic bursts, the spares provide immediate capacity without impacting development environments.
# Label spares for easy identificationapiVersion: metal3.io/v1alpha1kind: BareMetalHostmetadata: name: spare-01 labels: role: sparespec: online: false # Keep powered off until neededQuestion 2
Section titled “Question 2”Your production MachineHealthCheck is configured with maxUnhealthy: 40% covering a fleet of 10 worker nodes. A severe network switch failure occurs, isolating exactly 5 nodes simultaneously. How does the MachineHealthCheck controller respond to this event?
Answer
The MachineHealthCheck will NOT remediate any nodes. Because 5 out of 10 nodes represents 50% of the fleet, the condition explicitly exceeds the maxUnhealthy: 40% circuit breaker threshold.
This is the exact intended safeguard behavior. The threshold prevents catastrophic cascading failures caused by infrastructure anomalies. A switch failure is an environmental network issue, not an underlying host degradation. Attempting to deprovision and reinstall the operating systems on half of your production fleet would severely worsen the outage. You must troubleshoot the infrastructure failure (such as an invalid image, network anomaly, or bad drive) and apply a fix. Once the network connectivity is restored, the nodes will organically return to a Ready state.
What you should do:
- Check the management cluster for MachineHealthCheck events:
Terminal window kubectl describe machinehealthcheck production-worker-health -n clusters# Events: "Remediation is not allowed, total unhealthy: 50%, max: 40%" - Investigate the infrastructure issue (switch, power, network)
- Fix the root cause
- Nodes recover automatically
If it were a single node failure (1 of 10 = 10% < 40%), MHC would remediate: delete the Machine, deprovision the BareMetalHost, and create a replacement.
Question 3
Section titled “Question 3”You merge a Git pull request to upgrade a production workload cluster from Kubernetes v1.33.0 to v1.34.0. Describe the sequence of actions initiated by the KubeadmControlPlane controller. Furthermore, what happens to the older v1.33.0 nodes if the first newly provisioned v1.34.0 control plane node fails to start?
Answer
Upgrade process execution:
- CAPI provisions a fresh Machine running v1.34.0, leaving existing nodes untouched.
- The BareMetalHost is acquired, OS is imaged, and Kubeadm is executed to join the node to the existing cluster logic.
- The controller rigorously waits for the new node to broadcast a
Readycondition and verifies that etcd quorum has expanded successfully. - Only after validation succeeds does CAPI safely cordon, drain, and terminate one of the legacy v1.33.0 control plane nodes.
- This sequence loops iteratively for each replica.
- After all control plane nodes are successfully upgraded, the MachineDeployment orchestrates a similar rolling upgrade across all worker nodes.
If the new node fails to start:
The controller detects that the newly provisioned node has not reached a Ready state within the nodeStartupTimeout window. The rolling upgrade sequence immediately halts. No legacy nodes are evicted or deleted, ensuring the cluster remains functional and actively servicing workloads on the v1.33.0 infrastructure. Administrators must investigate the failure, resolve the provisioning blocker, and allow CAPI to resume.
# Monitor upgrade progresskubectl get machines -n clusters -l cluster.x-k8s.io/cluster-name=production# NAME PHASE VERSION# production-cp-1 Running v1.33.0 (old, will be removed last)# production-cp-2 Running v1.33.0# production-cp-3 Running v1.33.0# production-cp-4 Provisioning v1.34.0 (new, being created)Question 4
Section titled “Question 4”A junior platform engineer accidentally executes a git rm -r clusters/production/ command against the GitOps repository. If the management cluster is utilizing Flux for synchronization with prune: true configured on the Customization, what is the consequence?
Answer
The entire production cluster will be autonomously and irrevocably destroyed.
Flux pruning logic dictates that if a manifested resource is discovered within the live cluster but is absent from the authoritative Git tree, the resource must be deleted. While this is acceptable for stateless application Pods, deleting CAPI resources (such as Cluster, KubeadmControlPlane, and MachineDeployment) triggers immediate infrastructure deprovisioning cascades. The Bare Metal Operator will acquire the deletion signal, wipe the attached storage drives, and forcibly power off every physical server attached to that cluster. To prevent this, administrators must ensure prune: false is configured for CAPI Kustomizations.
With prune: false:
- Engineer accidentally deletes files from git
- Flux does NOT delete the cluster resources
- Cluster continues running
- Engineer restores the files in a follow-up commit
- No impact
Additional safety measures:
# Add finalizer protection to critical resourcesmetadata: annotations: kustomize.toolkit.fluxcd.io/prune: "disabled"Hands-On Exercise: Explore Cluster API with Docker Provider
Section titled “Hands-On Exercise: Explore Cluster API with Docker Provider”Before attempting bare-metal provisioning on complex hardware, engineers commonly test CAPI architecture utilizing the Docker provider (CAPD). This provider spins up discrete Docker containers that simulate distinct infrastructure “machines.”
Execution Steps
Section titled “Execution Steps”# Install clusterctl and create management clustercurl -L https://github.com/kubernetes-sigs/cluster-api/releases/latest/download/clusterctl-linux-amd64 -o clusterctlchmod +x clusterctl && sudo mv clusterctl /usr/local/bin/kind create cluster --name capi-mgmt
# Initialize CAPI with Docker providerexport CLUSTER_TOPOLOGY=trueclusterctl init --infrastructure docker
# Generate and create a workload clusterclusterctl generate cluster dev-cluster \ --infrastructure docker \ --kubernetes-version v1.33.0 \ --control-plane-machine-count 1 \ --worker-machine-count 2 > dev-cluster.yamlkubectl apply -f dev-cluster.yaml
# Watch provisioning and get kubeconfigkubectl get cluster,machines -Akubectl wait cluster/dev-cluster --for=condition=Ready --timeout=300sclusterctl get kubeconfig dev-cluster > dev-cluster.kubeconfig
# Scale the workload clusterkubectl patch machinedeployment dev-cluster-md-0 \ --type merge -p '{"spec":{"replicas":3}}'
# Verify the scaling operationkubectl get machines -A
# Cleanupkubectl delete cluster dev-clusterkind delete cluster --name capi-mgmtSuccess Criteria
Section titled “Success Criteria”- Evaluated prerequisites and successfully provisioned the upstream management cluster using
kind. - Triggered workload cluster provisioning and validated the existence of exactly one Control Plane and two Worker nodes.
- Issued a declarative patch to seamlessly scale the
MachineDeploymentfrom two to three workers. - Monitored the internal CAPI event loop and observed Machine lifecycles transition from
PendingtoProvisioningtoRunning. - Safely deprovisioned both the workload infrastructure and the management environment.
Next Module
Section titled “Next Module”You have completed the Multi-Cluster & Platform section. Please proceed to Module 6.1: Physical Security & Air-Gapped Environments to learn how to lock down on-premises Kubernetes environments and establish robust physical network security boundaries.