Module 7.1: Kubernetes Upgrades on Bare Metal
Цей контент ще не доступний вашою мовою.
Complexity:
[COMPLEX]| Time: 60 minutesPrerequisites: Module 1.3: Cluster Topology, Module 2.4: Declarative Bare Metal
Why This Module Matters
Section titled “Why This Module Matters”In October 2023, a financial services company running a 45-node bare metal Kubernetes cluster needed to upgrade from 1.27 to 1.28. They had been delaying upgrades for nine months because the process was manual, undocumented, and everyone who had done the last upgrade had left the company. When the security team mandated the upgrade due to a CVE in the kubelet, the remaining team attempted it on a Friday afternoon. They upgraded the control plane to 1.28, then upgraded all 45 worker nodes simultaneously. Twenty minutes in, they discovered that three nodes had a different kernel version that was incompatible with the new kubelet. Those nodes entered a crash loop. But the drain had already evicted pods from those nodes, and the cluster was running at 60% capacity. The PodDisruptionBudgets they had configured were ignored because they used kubectl drain --force. The result: 4 hours of degraded service, 2 hours of complete outage for stateful workloads, and a rollback that took another 3 hours because nobody had tested it.
The fix was straightforward but required discipline: a staging cluster that mirrors production, a written runbook, one-node-at-a-time rolling upgrades, and rollback procedures tested before the upgrade begins. The CTO’s postmortem note: “We treated a bare metal upgrade like a cloud upgrade. It is not. Every node is a snowflake until you prove otherwise.”
In the cloud, managed Kubernetes upgrades are a button click with automatic rollback. On bare metal, you are the managed service. Every upgrade is a planned operation that must account for heterogeneous hardware, limited spare capacity, and the absence of a safety net.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Plan bare-metal Kubernetes upgrades with staging validation, written runbooks, and tested rollback procedures
- Implement rolling node upgrades that respect PodDisruptionBudgets, drain timeouts, and heterogeneous hardware constraints
- Design a staging cluster that mirrors production hardware and workloads for pre-upgrade validation
- Troubleshoot upgrade failures caused by kernel incompatibilities, deprecated APIs, and node-level configuration drift
What You’ll Learn
Section titled “What You’ll Learn”- kubeadm upgrade workflow for control plane and workers
- Version skew policy and why it matters for rolling upgrades
- Draining nodes with limited spare capacity
- Rolling through heterogeneous hardware (different NICs, kernels, BIOS)
- Rollback strategies when an upgrade goes wrong
- Testing upgrades in staging before touching production
Kubernetes Version Skew Policy
Section titled “Kubernetes Version Skew Policy”Before upgrading anything, you must understand what version combinations are supported. Kubernetes enforces strict version skew limits between components.
+---------------------------------------------------------------+| VERSION SKEW POLICY (Kubernetes) || || kube-apiserver must be at the HIGHEST version || || kube-controller- can be at apiserver version || manager or ONE minor version behind || kube-scheduler (e.g., apiserver=1.29, scheduler=1.28) || || kubelet can be up to THREE minor versions || behind apiserver || (e.g., apiserver=1.29, kubelet=1.26) || || kubectl can be ONE minor version ahead or || behind apiserver || || etcd has its own version scheme || kubeadm bundles a compatible version |+---------------------------------------------------------------+Why Three-Version Kubelet Skew Matters on Bare Metal
Section titled “Why Three-Version Kubelet Skew Matters on Bare Metal”In the cloud, you upgrade all nodes within hours. On bare metal with 200 nodes and maintenance windows, the upgrade might stretch over weeks. The three-version kubelet skew means you can run apiserver at 1.30 while some workers still run kubelet 1.27 — but 1.26 kubelets would stop working.
# Check current versions across all nodeskubectl get nodes -o custom-columns=\ NAME:.metadata.name,\ KUBELET:.status.nodeInfo.kubeletVersion,\ OS:.status.nodeInfo.osImage,\ KERNEL:.status.nodeInfo.kernelVersionkubeadm Upgrade Workflow
Section titled “kubeadm Upgrade Workflow”Pause and predict: Before reading the upgrade steps, think about why the first control plane node uses
kubeadm upgrade applywhile subsequent control plane nodes usekubeadm upgrade node. What is different about the first node?
Step 1: Upgrade the First Control Plane Node
Section titled “Step 1: Upgrade the First Control Plane Node”The first control plane upgrade is special: kubeadm upgrade apply upgrades the cluster-wide components (API server, controller manager, scheduler manifests) and etcd. Subsequent nodes only need to update their local kubelet and static pod manifests.
# Check available versionsapt-cache madison kubeadm | head -5
# Unhold packages to allow upgradeapt-mark unhold kubeadm
# Upgrade kubeadm on the first control plane nodeapt-get update && apt-get install -y kubeadm=1.29.0-1.1
# Hold kubeadm to prevent accidental upgradesapt-mark hold kubeadm
# Verify the upgrade plankubeadm upgrade plan
# Apply the upgrade (first control plane only)kubeadm upgrade apply v1.29.0
# Unhold packages, upgrade kubelet and kubectl, then hold againapt-mark unhold kubelet kubectlapt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1apt-mark hold kubelet kubectlsystemctl daemon-reloadsystemctl restart kubeletStep 2: Upgrade Additional Control Plane Nodes
Section titled “Step 2: Upgrade Additional Control Plane Nodes”# On each additional control plane nodeapt-mark unhold kubeadmapt-get update && apt-get install -y kubeadm=1.29.0-1.1apt-mark hold kubeadm
# Use 'node' instead of 'apply' for additional control planeskubeadm upgrade node
apt-mark unhold kubelet kubectlapt-get install -y kubelet=1.29.0-1.1 kubectl=1.29.0-1.1apt-mark hold kubelet kubectlsystemctl daemon-reloadsystemctl restart kubeletStep 3: Upgrade Worker Nodes (Rolling)
Section titled “Step 3: Upgrade Worker Nodes (Rolling)”+---------------------------------------------------------------+| ROLLING WORKER UPGRADE SEQUENCE || || Cluster: 12 workers, max unavailable = 2 || || Batch 1: [worker-01] [worker-02] || drain -> upgrade kubeadm -> upgrade node -> || upgrade kubelet -> restart -> uncordon || || Batch 2: [worker-03] [worker-04] || (wait for batch 1 pods to reschedule first) || || Batch 3: [worker-05] [worker-06] || ... || || Batch 6: [worker-11] [worker-12] || final batch, verify cluster health after |+---------------------------------------------------------------+Draining Nodes with Limited Spare Capacity
Section titled “Draining Nodes with Limited Spare Capacity”On bare metal, you cannot spin up temporary nodes during an upgrade. If your cluster runs at 80% CPU utilization, draining even one node might push the remaining nodes above their limits.
Capacity Planning Before Drain
Section titled “Capacity Planning Before Drain”# Check current resource usage across all nodeskubectl top nodes
# Check how much headroom you havekubectl get nodes -o json | jq -r ' .items[] | "\(.metadata.name) Allocatable CPU: \(.status.allocatable.cpu) Allocatable Mem: \(.status.allocatable.memory)"'
# Check PodDisruptionBudgets that might block drainskubectl get pdb --all-namespacesStop and think: Your cluster runs at 80% CPU utilization. You need to drain a node for upgrade. Where do those pods go? What happens if the remaining nodes cannot absorb the evicted workloads?
Safe Drain Procedure
Section titled “Safe Drain Procedure”The drain process happens in stages: first cordon the node to prevent new pods from scheduling, then inspect what will be evicted, and finally drain with explicit safety rails. Never use --force unless you have a specific reason and understand the consequences.
# Step 1: Cordon the node (prevent new scheduling)kubectl cordon worker-07
# Step 2: Check what will be evictedkubectl get pods --field-selector spec.nodeName=worker-07 \ --all-namespaces -o wide
# Step 3: Drain with safety railskubectl drain worker-07 \ --ignore-daemonsets \ --delete-emptydir-data \ --timeout=300s \ --pod-selector='app!=critical-singleton'
# NEVER use --force unless you understand the consequences# --force skips PDB checks and deletes standalone podsHandling Pods That Refuse to Drain
Section titled “Handling Pods That Refuse to Drain”# Check which PDB is blockingkubectl get pdb -A -o wide
# Example output:# NAMESPACE NAME MIN AVAILABLE ALLOWED DISRUPTIONS# prod redis-pdb 2 0
# If allowed disruptions = 0, the drain will hang# Options:# 1. Wait for replicas to become healthy# 2. Scale up the deployment temporarilykubectl scale deployment redis --replicas=4 -n prod# Now drain should proceed (3 healthy > 2 min available)Rolling Through Heterogeneous Hardware
Section titled “Rolling Through Heterogeneous Hardware”On bare metal, not all nodes are identical. You might have three generations of servers with different CPUs, NICs, kernel versions, and firmware. An upgrade that works on one generation might fail on another.
Categorize Your Hardware
Section titled “Categorize Your Hardware”# Create a hardware inventorykubectl get nodes -o json | jq -r ' .items[] | [ .metadata.name, .metadata.labels["node.kubernetes.io/instance-type"] // "unknown", .status.nodeInfo.kernelVersion, .status.nodeInfo.containerRuntimeVersion, .status.nodeInfo.architecture ] | @tsv' | sort -k2 | column -tUpgrade Order by Hardware Generation
Section titled “Upgrade Order by Hardware Generation”+---------------------------------------------------------------+| UPGRADE ORDER FOR HETEROGENEOUS HARDWARE || || Phase 1: Canary (1 node per hardware generation) || [dell-r640-01] [dell-r740-01] [hp-dl380-01] || Monitor for 30 min after each || || Phase 2: Remaining Gen 1 (oldest hardware first) || [dell-r640-02..08] rolling, 2 at a time || || Phase 3: Gen 2 || [dell-r740-02..15] rolling, 3 at a time || || Phase 4: Gen 3 (newest hardware last) || [hp-dl380-02..20] rolling, 3 at a time || || Why oldest first? || - Oldest hardware is most likely to surface problems || - If a kernel incompatibility exists, you find it early || - Newest hardware has the most spare capacity as buffer |+---------------------------------------------------------------+Pre-flight Checks per Hardware Generation
Section titled “Pre-flight Checks per Hardware Generation”#!/bin/bash# pre-flight-check.sh — run on each node before upgradingset -euo pipefail
echo "=== Pre-flight Check ==="echo "Hostname: $(hostname) | Kernel: $(uname -r)"echo "CPU: $(lscpu | grep 'Model name')"
# Check cgroup v2, disk space, container runtimegrep -q cgroup2 /proc/filesystems || { echo "FAIL: no cgroup v2"; exit 1; }DISK_FREE=$(df /var/lib/kubelet --output=pcent | tail -1 | tr -d ' %')[ "$DISK_FREE" -gt 85 ] && { echo "FAIL: disk ${DISK_FREE}%"; exit 1; }crictl info > /dev/null 2>&1 || { echo "FAIL: runtime down"; exit 1; }echo "=== All checks passed ==="Rollback Strategies
Section titled “Rollback Strategies”Rolling back a Kubernetes upgrade is harder than the upgrade itself. You must plan for rollback before you begin.
Control Plane Rollback
Section titled “Control Plane Rollback”# kubeadm does NOT have a built-in rollback command# You must manually downgrade packages
# Step 1: Install the previous kubeadm versionapt-mark unhold kubeadmapt-get install -y kubeadm=1.28.5-1.1apt-mark hold kubeadm
# Step 2: Downgrade kubelet and kubectlapt-mark unhold kubelet kubectlapt-get install -y kubelet=1.28.5-1.1 kubectl=1.28.5-1.1apt-mark hold kubelet kubectlsystemctl daemon-reloadsystemctl restart kubelet
# Step 3: Restore etcd from backup (if schema changed)# This is why you ALWAYS back up etcd before upgradingETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-pre-upgrade.db \ --data-dir /var/lib/etcd-restored \ --name $(hostname) \ --initial-cluster $(hostname)=https://$(hostname):2380
# Step 4: Swap the etcd data directorymv /var/lib/etcd /var/lib/etcd-brokenmv /var/lib/etcd-restored /var/lib/etcd
# Step 5: Restore static pod manifests from pre-upgrade backup# Without this, the control plane containers will still run the newer imagescp /backup/manifests-pre-upgrade/* /etc/kubernetes/manifests/Pause and predict: You are about to upgrade your control plane. The etcd snapshot is your safety net. If you skip this step and the upgrade corrupts etcd’s WAL format, what are your options for recovery?
The etcd Backup Rule
Section titled “The etcd Backup Rule”This is the single most important step before any control plane upgrade. etcd’s Write-Ahead Log format can change between versions, making downgrades impossible without a pre-upgrade snapshot.
# ALWAYS back up etcd before ANY control plane upgradeETCDCTL_API=3 etcdctl snapshot save /backup/etcd-pre-upgrade.db \ --endpoints=https://127.0.0.1:2379 \ --cacert=/etc/kubernetes/pki/etcd/ca.crt \ --cert=/etc/kubernetes/pki/etcd/server.crt \ --key=/etc/kubernetes/pki/etcd/server.key
# Verify the backupETCDCTL_API=3 etcdctl snapshot status /backup/etcd-pre-upgrade.db \ --write-out=tableTesting Upgrades in Staging
Section titled “Testing Upgrades in Staging”On bare metal, your staging cluster should mirror production hardware as closely as possible. This means at least one node from each hardware generation.
Staging Cluster Requirements
Section titled “Staging Cluster Requirements”+---------------------------------------------------------------+| STAGING CLUSTER FOR UPGRADE TESTING || || Minimum staging cluster: || - 3 control plane nodes (same hardware as production) || - 1 worker per hardware generation || - Same CNI, CSI, and ingress controller versions || - Representative workloads (not production data) || || Test matrix: || +------------------+----------------------------------+ || | Test | Pass criteria | || +------------------+----------------------------------+ || | kubeadm upgrade | No errors, all nodes Ready | || | Pod scheduling | Pods schedule on all generations | || | CNI networking | Pod-to-pod across nodes works | || | CSI storage | PVCs bind, data persists | || | Ingress | External traffic routes correctly | || | DNS | CoreDNS resolves internal names | || | GPU/SR-IOV | Device plugins register devices | || | Monitoring | Prometheus scrapes all targets | || +------------------+----------------------------------+ |+---------------------------------------------------------------+The Complete Upgrade Runbook
Section titled “The Complete Upgrade Runbook”Here is the sequence for a production bare metal upgrade:
+---------------------------------------------------------------+| PRODUCTION UPGRADE RUNBOOK || || Week -2: Test upgrade on staging cluster || Week -1: Back up etcd, verify backups, update runbook || || Day of upgrade: || 1. Notify stakeholders (email + Slack) || 2. Verify etcd backup is fresh (< 1 hour old) || 3. Record current versions of all components || 4. Upgrade first control plane node || 5. Verify apiserver, scheduler, controller-manager healthy || 6. Upgrade remaining control plane nodes (one at a time) || 7. Verify control plane quorum || 8. Upgrade canary worker (1 per hardware generation) || 9. Monitor for 30 minutes || 10. Roll through remaining workers in batches of 2-3 || 11. Wait 5 min between batches || 12. Run smoke tests after final batch || 13. Update monitoring dashboards for new version || 14. Send completion notification || || Rollback triggers: || - Any control plane node fails to rejoin || - > 5% of pods in CrashLoopBackOff after upgrade || - Networking between nodes fails || - Storage mounts fail on upgraded nodes |+---------------------------------------------------------------+Did You Know?
Section titled “Did You Know?”-
Kubernetes drops support for a minor version approximately 14 months after release. On bare metal, where upgrades take longer to plan and execute, this means you should start planning the next upgrade almost immediately after completing the current one. Falling behind two versions is uncomfortable; falling behind three is an emergency.
-
The kubelet’s three-version skew tolerance was expanded from two in Kubernetes 1.28. This change was specifically motivated by on-premises and air-gapped environments where upgrading all nodes quickly is impractical. The KEP (Kubernetes Enhancement Proposal) cited large bare metal deployments as the primary beneficiary.
-
etcd upgrades are the riskiest part of a control plane upgrade. etcd uses a WAL (Write-Ahead Log) format that can change between versions. If the new etcd version migrates the WAL format, you cannot simply roll back to the old binary. This is why etcd backup before upgrade is non-negotiable.
-
Google’s internal Borg system inspired Kubernetes, but Google upgrades Borg clusters cell-by-cell across thousands of nodes using a dedicated upgrade service. On bare metal Kubernetes, you are building that upgrade service yourself with shell scripts and kubeadm.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| Skipping minor versions | kubeadm only supports +1 minor version upgrades | Upgrade sequentially: 1.27 -> 1.28 -> 1.29 |
| No etcd backup before upgrade | Cannot roll back if etcd schema changes | Always etcdctl snapshot save before upgrading |
| Draining all workers at once | Insufficient capacity for running workloads | Roll in batches matching your spare capacity |
Using --force on drain | Skips PDB checks, deletes standalone pods | Use --timeout and fix blocked drains properly |
| Not testing on staging | Hardware-specific failures discovered in production | Maintain staging with representative hardware |
| Ignoring version skew | Components stop communicating | Check all component versions before and after |
| Upgrading on Friday afternoon | No time to handle unexpected failures | Schedule upgrades Tuesday-Wednesday morning |
| Not recording pre-upgrade state | Cannot compare before/after | Script the version inventory before starting |
Question 1
Section titled “Question 1”You have a 30-node cluster running Kubernetes 1.27. You need to get to 1.30. What is the correct upgrade path, and how long should you expect it to take on bare metal?
Answer
Upgrade path: 1.27 -> 1.28 -> 1.29 -> 1.30 (three sequential minor version upgrades).
kubeadm only supports upgrading one minor version at a time. You cannot skip from 1.27 to 1.30 directly.
Time estimate for bare metal:
- Each minor version upgrade: 1-2 days (staging test) + 1 day (production rollout)
- Three upgrades: 6-9 days of calendar time, spread across 3-4 weeks
- Include buffer for unexpected issues: plan for 4-6 weeks total
Per upgrade cycle:
- Test on staging (1 day)
- Upgrade 3 control plane nodes (2-3 hours)
- Roll 30 workers in batches of 3 (3-4 hours with monitoring gaps)
- Post-upgrade validation (1 hour)
The kubelet three-version skew policy means your 1.27 kubelets will still work with a 1.30 apiserver, giving you flexibility in the worker rollout timeline. But the control plane must go through each version.
Question 2
Section titled “Question 2”During an upgrade, kubectl drain worker-12 hangs indefinitely. What are the most likely causes and how do you resolve each?
Answer
Most likely causes:
-
PodDisruptionBudget blocking eviction: A PDB with
minAvailableequal to the current replica count means zero allowed disruptions.Terminal window kubectl get pdb -A -o wide# Fix: scale up the deployment, then drainkubectl scale deploy my-app --replicas=4 -
Pod with no controller (standalone pod): Drain skips standalone pods by default but shows a warning. With
--force, it deletes them permanently.Terminal window kubectl get pods --field-selector spec.nodeName=worker-12 \-A -o json | jq '.items[] | select(.metadata.ownerReferences == null) | .metadata.name' -
Local storage (emptyDir): Pods using emptyDir block drain unless
--delete-emptydir-datais specified. -
DaemonSet pods: These block drain unless
--ignore-daemonsetsis specified. -
Finalizer stuck on pod: A pod with a finalizer that never completes will prevent eviction.
Terminal window kubectl get pods -n stuck-ns -o json | jq '.items[].metadata.finalizers'
Resolution workflow:
- Add
--timeout=300sto see which pod is blocking - Check PDBs, then scale up if needed
- Handle standalone pods case by case
- Always include
--ignore-daemonsets --delete-emptydir-data
Question 3
Section titled “Question 3”Your cluster has three hardware generations: Dell R640 (2018), Dell R740 (2020), and HPE DL380 Gen10 (2022). After upgrading kubelet to 1.29, the R640 nodes show NotReady. The R740 and DL380 nodes are fine. What do you investigate?
Answer
Investigation steps for hardware-generation-specific failures:
-
Check kubelet logs on a failing R640:
Terminal window journalctl -u kubelet -n 100 --no-pager -
Check kernel version: R640 nodes may run an older kernel that lacks features required by the new kubelet (cgroup v2 support, specific sysctls).
Terminal window ssh r640-01 "uname -r"# Compare with working R740: ssh r740-01 "uname -r" -
Check container runtime compatibility: Older containerd versions might not support new kubelet CRI requirements.
Terminal window ssh r640-01 "containerd --version" -
Check CPU instruction set: Very old CPUs might lack instructions required by newer Go binaries (kubelet compiled with newer Go versions may use AVX2 or similar).
-
Check BIOS/firmware: Older BIOS versions might expose hardware features differently, affecting cgroup mounting or NUMA topology detection.
Most common root cause: Kernel version too old for the new kubelet’s cgroup requirements. Solution: upgrade the OS kernel on R640 nodes before the Kubernetes upgrade.
Rollback: Downgrade kubelet on R640 nodes to 1.28 (within skew policy), fix the kernel, then retry.
Question 4
Section titled “Question 4”You need to roll back the control plane from 1.29 to 1.28 after discovering a critical bug. Your etcd was upgraded and has already accepted writes in the new format. What is the procedure?
Answer
This is a critical scenario that requires etcd restore from backup.
-
Stop the apiserver on all control plane nodes to prevent new writes:
Terminal window # Move static pod manifests to stop themmv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/ -
Restore etcd from pre-upgrade backup (this is why you always back up):
Terminal window # Stop etcdmv /etc/kubernetes/manifests/etcd.yaml /tmp/# Restore the backupETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-pre-upgrade.db \--data-dir /var/lib/etcd-restored# Replace the data directorymv /var/lib/etcd /var/lib/etcd-brokenmv /var/lib/etcd-restored /var/lib/etcd -
Downgrade kubeadm, kubelet, kubectl on all control plane nodes:
Terminal window apt-get install -y kubeadm=1.28.5-1.1 kubelet=1.28.5-1.1 kubectl=1.28.5-1.1 -
Restore static pod manifests (the old versions from backup):
Terminal window # Copy the backed-up manifests from before the upgradecp /backup/manifests-pre-upgrade/* /etc/kubernetes/manifests/ -
Restart kubelet on all control plane nodes:
Terminal window systemctl daemon-reload && systemctl restart kubelet -
Verify: Check all control plane components are running at 1.28 and etcd cluster has quorum.
Data loss warning: Any writes between the upgrade and the restore are lost. This is why you should catch upgrade failures as fast as possible and why you run canary workloads that validate behavior immediately after upgrade.
If you did not back up etcd: You cannot safely downgrade if etcd migrated its WAL format. Your only options are to fix the 1.29 issue or rebuild the cluster. This is why etcd backup is non-negotiable.
Hands-On Exercise: Practice Node Draining and PDB Enforcement
Section titled “Hands-On Exercise: Practice Node Draining and PDB Enforcement”Task: Using a kind cluster, practice node draining with PodDisruptionBudgets. Note: kind nodes are containers with pre-baked binaries, so OS-level package upgrades (apt-get install kubeadm) cannot be performed. This exercise focuses on the drain/uncordon workflow that is critical during real upgrades.
# Create a kind cluster running an older versioncat <<'KINDEOF' > /tmp/kind-upgrade-lab.yamlkind: ClusterapiVersion: kind.x-k8s.io/v1alpha4nodes: - role: control-plane image: kindest/node:v1.28.0 - role: worker image: kindest/node:v1.28.0 - role: worker image: kindest/node:v1.28.0KINDEOF
kind create cluster --config /tmp/kind-upgrade-lab.yaml --name upgrade-lab-
Record current versions:
Terminal window kubectl get nodes -o wide -
Deploy a test workload with PDB:
Terminal window kubectl create deployment nginx --image=nginx --replicas=3kubectl create pdb nginx-pdb --selector=app=nginx --min-available=2 -
Practice draining a worker with PDB enforcement:
Terminal window kubectl drain upgrade-lab-worker --ignore-daemonsets --delete-emptydir-data# Observe how PDB affects the drain -
Uncordon and verify pod redistribution:
Terminal window kubectl uncordon upgrade-lab-workerkubectl get pods -o wide -
Document your observations: Which pods were evicted? How long did the drain take? Did the PDB prevent disruption below the minimum?
Success Criteria
Section titled “Success Criteria”- Recorded all node versions before starting
- Deployed workload with PDB
- Successfully drained a node while PDB was active
- Verified pods rescheduled to remaining nodes
- Uncordoned and verified cluster returned to normal
- Documented the process in a runbook format
Cleanup
Section titled “Cleanup”kind delete cluster --name upgrade-labNext Module
Section titled “Next Module”Continue to Module 7.2: Hardware Lifecycle & Firmware to learn how to manage BIOS updates, disk replacements, and firmware patching without cluster downtime.