Module 7.2: Hardware Lifecycle & Firmware
Complexity: [COMPLEX] | Time: 60 minutes
Prerequisites: Module 7.1: Kubernetes Upgrades on Bare Metal, Module 2.1: Datacenter Fundamentals
Why This Module Matters
In February 2024, a healthcare company running a 70-node bare metal Kubernetes cluster received a critical BIOS vulnerability advisory from Dell. The CVE allowed privilege escalation through the BMC interface, and their compliance team gave them 30 days to patch all servers. The infrastructure team’s first instinct was to schedule a weekend maintenance window, power down all servers, update BIOS via USB drives, and power them back up. That would have meant 4-8 hours of complete cluster downtime, which was unacceptable for a system processing patient data 24/7.
Instead, they built a rolling firmware update pipeline: cordon a node, drain its workloads, use the Redfish API to stage the BIOS update, reboot the server into the firmware update mode, wait for it to come back, verify the new BIOS version via IPMI, uncordon, and move to the next node. The process took 12 days but caused zero downtime. Three servers failed to reboot after the BIOS update due to a known bug with their specific DIMM configuration. Because they were updating one node at a time, these failures were contained. The spare capacity absorbed the missing nodes while the team worked with Dell support to recover them.
In the cloud, you never think about BIOS versions or disk firmware. The cloud provider handles it. On bare metal, hardware lifecycle management is a continuous operational responsibility that directly affects your cluster’s security posture, reliability, and performance.
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement rolling firmware update pipelines using Redfish/IPMI APIs that patch BIOS and BMC without cluster downtime
- Design a hardware lifecycle management process covering procurement, deployment, maintenance, and decommissioning
- Configure automated hardware health monitoring with IPMI sensors, SMART disk checks, and predictive failure alerts
- Plan disk replacement and memory upgrade procedures that integrate with Kubernetes node drain and cordon workflows
What You’ll Learn
- BIOS and firmware update strategies without cluster downtime
- Cordon/drain workflows for hardware maintenance windows
- Disk replacement procedures for running Kubernetes nodes
- Predictive failure detection with SMART monitoring
- Using the Redfish API for out-of-band firmware management
- Building a hardware maintenance calendar
Firmware Update Architecture

    +---------------------------------------------------------------+
    |           FIRMWARE UPDATE PIPELINE (ZERO DOWNTIME)            |
    |                                                               |
    |  +---------+    +--------+    +----------+    +---------+     |
    |  | Cordon  |--->| Drain  |--->| Stage FW |--->| Reboot  |     |
    |  | Node    |    | Pods   |    | via      |    | into FW |     |
    |  |         |    |        |    | Redfish  |    | update  |     |
    |  +---------+    +--------+    +----------+    +---------+     |
    |                                                    |          |
    |  +---------+    +--------+    +----------+    +----v----+     |
    |  | Update  |<---| Verify |<---| Wait for |<---| Server  |     |
    |  | CMDB    |    | FW ver |    | node     |    | reboots |     |
    |  |         |    | via    |    | Ready    |    | with    |     |
    |  |         |    | IPMI   |    |          |    | new FW  |     |
    |  +---------+    +--------+    +----------+    +---------+     |
    |       |                                                       |
    |  +----v----+                                                  |
    |  |Uncordon |                                                  |
    |  |  Node   |                                                  |
    |  +---------+                                                  |
    +---------------------------------------------------------------+

Types of Firmware to Manage

| Component | Update Method | Reboot Required | Risk Level |
|---|---|---|---|
| BIOS/UEFI | Redfish/IPMI, USB, in-band tool | Yes | High |
| BMC/iDRAC/iLO | Redfish API | Usually no | Medium |
| NIC firmware | Vendor tool (ethtool, mlxfwmanager) | Sometimes | Medium |
| Disk firmware | Vendor tool (nvme-cli, perccli) | Sometimes | High |
| GPU firmware | nvidia-smi | Yes | High |
| RAID controller | Vendor CLI (storcli, perccli) | Yes | High |
Cordon and Drain for Maintenance Windows
The Maintenance Workflow

    #!/bin/bash
    # maintenance-drain.sh: safe node drain for hardware maintenance
    set -euo pipefail

    NODE="$1"
    REASON="${2:-hardware-maintenance}"

    echo "=== Starting maintenance drain for ${NODE} ==="
    echo "Reason: ${REASON}"
    echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

    # Step 1: Label the node with maintenance reason
    kubectl label node "$NODE" \
      maintenance.kubedojo.io/reason="$REASON" \
      maintenance.kubedojo.io/started="$(date +%s)" \
      --overwrite

    # Step 2: Cordon (prevent new pods)
    kubectl cordon "$NODE"

    # Step 3: Check what needs to be evicted
    POD_COUNT=$(kubectl get pods --field-selector spec.nodeName="$NODE" \
      --all-namespaces --no-headers | wc -l)
    echo "Pods to evict: ${POD_COUNT}"

    # Step 4: Drain with timeout
    kubectl drain "$NODE" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --timeout=600s \
      --grace-period=60

    echo "=== Node ${NODE} drained and ready for maintenance ==="

Post-Maintenance Return

    #!/bin/bash
    # maintenance-return.sh: return node to service after maintenance
    set -euo pipefail

    NODE="$1"

    echo "=== Returning ${NODE} to service ==="

    # Step 1: Verify the node is Ready
    STATUS=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    if [ "$STATUS" != "True" ]; then
      echo "WARN: Node not Ready (status: ${STATUS}). Waiting 60s..."
      sleep 60
      STATUS=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
      if [ "$STATUS" != "True" ]; then
        echo "FAIL: Node still not Ready. Investigate before uncordoning."
        exit 1
      fi
    fi

    # Step 2: Report the kubelet version so it can be compared with the cluster
    NODE_VERSION=$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.kubeletVersion}')
    echo "Node kubelet version: ${NODE_VERSION}"

    # Step 3: Uncordon
    kubectl uncordon "$NODE"

    # Step 4: Remove maintenance labels
    kubectl label node "$NODE" \
      maintenance.kubedojo.io/reason- \
      maintenance.kubedojo.io/started-

    echo "=== Node ${NODE} returned to service ==="

BIOS and Firmware Updates via Redfish
Redfish is the modern replacement for IPMI for out-of-band server management. It provides a RESTful API for firmware updates, power control, and hardware inventory.
Pause and predict: BIOS updates require a server reboot. On bare metal, a reboot means the Kubernetes node goes offline. How would you update BIOS on 40 servers without any cluster downtime?
Querying Firmware Inventory
Before any firmware update, you need to know what versions are currently running. Redfish provides a REST API to query this information from the BMC without logging into the server’s OS.

    # Get current firmware versions via Redfish
    curl -sk -u admin:password \
      https://bmc-worker-01.internal/redfish/v1/UpdateService/FirmwareInventory \
      | jq '.Members[] | ."@odata.id"'

    # Get the BIOS version. BiosVersion and Model are properties of the
    # ComputerSystem resource, so query /Systems/1 rather than /Systems/1/Bios.
    # The system ID varies by vendor (for example System.Embedded.1 on Dell iDRAC).
    curl -sk -u admin:password \
      https://bmc-worker-01.internal/redfish/v1/Systems/1 \
      | jq '{BiosVersion: .BiosVersion, Model: .Model}'

Staging a BIOS Update
Staging a firmware update via Redfish means uploading the firmware image to the BMC, which stores it locally and applies it on the next reboot. This two-step process (stage now, apply on reboot) lets you stage firmware on all nodes in parallel without any downtime, then reboot them one at a time.

    # Stage firmware via Redfish SimpleUpdate (Dell iDRAC example)
    # The BMC pulls the image from an HTTP server; host the firmware file
    # on an internal web server accessible from the BMC management network
    curl -sk -u admin:password \
      -X POST \
      -H "Content-Type: application/json" \
      https://bmc-worker-01.internal/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate \
      -d '{
        "ImageURI": "http://internal-web-server/firmware/BIOS_P4GKN_LN_2.19.1.bin",
        "TransferProtocol": "HTTP",
        "@Redfish.OperationApplyTime": "OnReset"
      }'

    # Check update status
    curl -sk -u admin:password \
      https://bmc-worker-01.internal/redfish/v1/TaskService/Tasks \
      | jq '.Members'

The rolling firmware update script follows the same pattern: for each node, call maintenance-drain.sh, stage firmware via Redfish SimpleUpdate, trigger a GracefulRestart via Redfish, wait for the node to return to Ready, verify the new firmware version, and call maintenance-return.sh. Add a 5-minute cooldown between nodes to catch issues early.
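The per-node sequence described above could be wired together roughly as follows. This is a minimal sketch rather than a finished pipeline: stage-firmware.sh, redfish-restart.sh, and verify-firmware.sh are hypothetical wrapper scripts around the Redfish and IPMI calls, the node=bmc argument format is an assumption, and DRY_RUN defaults to printing commands instead of running them.

```shell
# rolling-firmware.sh: sketch of the rolling loop (function library, meant to be sourced)

COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-300}"   # 5-minute cooldown between nodes
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Update a single node. Arguments: Kubernetes node name, BMC hostname.
update_node() {
  local node="$1" bmc="$2"
  run ./maintenance-drain.sh "$node" firmware-update
  run ./stage-firmware.sh "$bmc"        # hypothetical: Redfish SimpleUpdate
  run ./redfish-restart.sh "$bmc"       # hypothetical: Redfish GracefulRestart
  # Assumption: the restart wrapper returns only after the host has gone down,
  # otherwise this wait can succeed on a stale Ready condition.
  run kubectl wait --for=condition=Ready "node/${node}" --timeout=900s
  run ./verify-firmware.sh "$bmc"       # hypothetical: compare reported version
  run ./maintenance-return.sh "$node"
}

# Process nodes strictly one at a time, pausing between them.
# Each argument has the form node=bmc-hostname.
rolling_update() {
  local pair first=1
  for pair in "$@"; do
    if [ "$first" != "1" ]; then
      run sleep "$COOLDOWN_SECONDS"
    fi
    first=0
    update_node "${pair%%=*}" "${pair#*=}"
  done
}
```

Calling rolling_update worker-01=bmc-worker-01.internal worker-02=bmc-worker-02.internal with the default DRY_RUN=1 prints the planned commands in order, which is a cheap way to review the sequence before setting DRY_RUN=0.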
Disk Replacement Procedures
Disk failures are the most common hardware event in a bare metal cluster. With Ceph or other distributed storage, disk replacement can be non-disruptive, but the procedure must be followed precisely.
SMART Monitoring for Predictive Failure

    # Check SMART health on all disks
    for DISK in /dev/sd{a..h}; do
      echo "=== ${DISK} ==="
      smartctl -H "$DISK" | grep "SMART overall-health"
      smartctl -A "$DISK" | grep -E "(Reallocated_Sector|Current_Pending|Offline_Uncorrectable)"
    done

Key SMART Attributes to Monitor

    +---------------------------------------------------------------+
    |            SMART ATTRIBUTES FOR PREDICTIVE FAILURE            |
    |                                                               |
    |  Attribute                   Warning Threshold   Action       |
    |  ------------------------------------------------------------ |
    |  Reallocated Sector Count    > 0                 Monitor      |
    |  Reallocated Sector Count    > 100               Replace soon |
    |  Current Pending Sectors     > 0                 Investigate  |
    |  Offline Uncorrectable       > 0                 Replace soon |
    |  UDMA CRC Error Count        Rising              Check cable  |
    |  Wear Leveling (SSD)         < 10%               Plan replace |
    |  Media Wearout (NVMe)        < 10%               Plan replace |
    |                                                               |
    |  For NVMe drives, use:                                        |
    |    nvme smart-log /dev/nvme0n1                                |
    |  Key field: percentage_used (replace at > 90%)                |
    +---------------------------------------------------------------+

Stop and think: The SMART data shows Reallocated_Sector_Count = 47 and Current_Pending_Sector = 3. The disk is part of a Ceph OSD. Should you replace it now or wait for it to fail completely? What is the risk of waiting?
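The thresholds in the attribute table can be expressed as a small decision function. The sketch below is illustrative only: classify_smart is a hypothetical helper, it takes raw attribute values as arguments (assumed to have been extracted from smartctl -A beforehand), and it returns the most severe action the table assigns. It encodes the table mechanically and does not replace operator judgment.

```shell
# classify_smart REALLOCATED PENDING UNCORRECTABLE
# Prints one of: ok, monitor, investigate, replace-soon
classify_smart() {
  local realloc="$1" pending="$2" uncorrectable="$3"
  if [ "$uncorrectable" -gt 0 ] || [ "$realloc" -gt 100 ]; then
    echo "replace-soon"      # Offline Uncorrectable > 0 or Reallocated > 100
  elif [ "$pending" -gt 0 ]; then
    echo "investigate"       # Current Pending Sectors > 0
  elif [ "$realloc" -gt 0 ]; then
    echo "monitor"           # Reallocated Sector Count > 0
  else
    echo "ok"
  fi
}

# Usage: a disk with 12 reallocated sectors and nothing else flagged
classify_smart 12 0 0
```

Checking the most severe condition first keeps the function a faithful reading of the table even when several attributes are non-zero at once.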
Disk Replacement Workflow (Ceph OSD)
This procedure safely removes a failing disk from a Ceph cluster, replaces it, and re-adds the new disk. The critical step is ceph osd set noout: it stops Ceph from automatically marking a down OSD out and starting a full rebalance while you are swapping the disk, which would add unnecessary I/O load during the maintenance window. Note that noout only suppresses the automatic case; explicitly marking the OSD out or purging it still changes data placement.

    # Step 1: Identify the failing disk
    ceph osd tree    # find which OSD is on the failing disk

    # Step 2: Set OSD noout to prevent rebalancing during replacement
    ceph osd set noout

    # Step 3: Stop the OSD daemon, then mark the OSD down and out.
    # A daemon that is still running will mark itself up again. The unit name
    # assumes a package-based deployment; cephadm and Rook manage daemons differently.
    systemctl stop ceph-osd@5
    ceph osd down osd.5
    ceph osd out osd.5

    # Step 4: Remove the OSD from the cluster
    # (purge removes the OSD from CRUSH, deletes its auth keys, and removes it from the map)
    ceph osd purge osd.5 --yes-i-really-mean-it

    # Step 5: Physically replace the disk (or wait for datacenter hands)

    # Step 6: Prepare the new disk
    ceph-volume lvm create --data /dev/sdc

    # Step 7: Unset noout to allow rebalancing
    ceph osd unset noout

    # Step 8: Monitor rebalancing
    ceph -w    # watch rebalancing progress

Building a Hardware Maintenance Calendar

    +---------------------------------------------------------------+
    |             ANNUAL HARDWARE MAINTENANCE CALENDAR              |
    |                                                               |
    |  Monthly:                                                     |
    |    - SMART health check on all disks (automated)              |
    |    - Review BMC event logs for warnings                       |
    |    - Check PSU redundancy status                              |
    |                                                               |
    |  Quarterly:                                                   |
    |    - Apply critical firmware updates (BIOS, BMC)              |
    |    - Review NIC firmware versions against vendor advisories   |
    |    - Test IPMI/Redfish connectivity to all BMCs               |
    |    - Verify backup power (UPS battery tests)                  |
    |                                                               |
    |  Annually:                                                    |
    |    - Full hardware inventory audit                            |
    |    - Warranty status review (identify expiring warranties)    |
    |    - Thermal audit (clean dust filters, check airflow)        |
    |    - Cable audit (reseat suspect connections)                 |
    |    - Evaluate hardware refresh candidates                     |
    |                                                               |
    |  As-needed:                                                   |
    |    - Emergency firmware patches (CVEs)                        |
    |    - Disk replacements (SMART alerts)                         |
    |    - Memory DIMM replacements (ECC error alerts)              |
    |    - PSU replacements (redundancy lost alerts)                |
    +---------------------------------------------------------------+

Pause and predict: The “bathtub curve” predicts high failure rates in the first 90 days and again in years 4-5 of a server’s lifecycle. How should your spare inventory strategy differ between a brand-new hardware deployment and a fleet entering year 4?
Prometheus Alerts for Hardware Health
These alerting rules turn SMART data and IPMI sensor readings into actionable notifications. The severity levels map to response times: critical means act within hours, warning means plan a replacement within days.

    # hardware-alerts.yaml: Prometheus alerting rules
    groups:
      - name: hardware-health
        rules:
          - alert: DiskSmartPrefailure
            expr: smartmon_device_smart_healthy == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Disk SMART prefailure on {{ $labels.instance }}"
              description: "Disk {{ $labels.disk }} reports SMART health check failed."
              runbook: "Follow disk replacement procedure in ops runbook."

          - alert: DiskWearoutHigh
            expr: smartmon_wear_leveling_count_value < 20
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "SSD wear leveling low on {{ $labels.instance }}"
              description: "Disk {{ $labels.disk }} has {{ $value }}% life remaining."

          - alert: MemoryECCErrors
            expr: node_edac_correctable_errors_total > 100
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "ECC memory errors on {{ $labels.instance }}"
              description: "{{ $value }} correctable ECC errors detected. DIMM may be failing."

          # Assumes the prometheus-community ipmi_exporter, which encodes sensor state
          # in the metric value (0=nominal, 1=warning, 2=critical), not in a label.
          - alert: PSURedundancyLost
            expr: ipmi_sensor_state{name=~".*PSU.*"} == 2
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "PSU redundancy lost on {{ $labels.instance }}"
              description: "Server is running on a single PSU. Replace failed PSU immediately."

Did You Know?
- Dell ships approximately 50 BIOS updates per server model over its 5-year lifecycle. Most are cumulative, so you do not need to apply every one, but skipping more than 2-3 versions increases the risk of hitting a bug that was fixed in an intermediate release.
- Redfish was developed by the DMTF (Distributed Management Task Force) starting in 2014 as a replacement for IPMI, which was designed in 1998. While IPMI uses a binary protocol on port 623, Redfish uses HTTPS with JSON payloads. Most servers shipped after 2018 support both, but IPMI is being deprecated by major vendors.
- NVMe drives do not use SMART in the traditional sense. They use the NVMe health log (nvme smart-log), which reports percentage_used, a value that can exceed 100% because it measures total bytes written against the vendor’s endurance rating. A drive at 150% used may still work fine but is operating beyond its warranty.
- The “bathtub curve” describes hardware failure rates: high failure rate in the first 90 days (infant mortality), low and stable for years (useful life), then rising failure rate as components age (wear-out). Plan your spare inventory accordingly: you need more spares in the first quarter after a hardware refresh and in years 4-5 of the lifecycle.
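For the NVMe item above, the percentage_used field can be pulled out of nvme smart-log output with a short filter. This is a sketch under assumptions: nvme_percentage_used and nvme_wear_action are hypothetical helper names, the parser expects the human-readable "percentage_used : N%" line that nvme-cli prints, and the 90% threshold is the one given in the SMART attribute table in this module.

```shell
# Read `nvme smart-log` output on stdin and print percentage_used as a bare integer.
nvme_percentage_used() {
  awk -F: '/^percentage_used/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Map a percentage_used value to an action (threshold from the table: replace at > 90%).
nvme_wear_action() {
  if [ "$1" -gt 90 ]; then
    echo "plan-replace"
  else
    echo "ok"
  fi
}

# Usage (requires nvme-cli and root privileges):
#   used=$(nvme smart-log /dev/nvme0n1 | nvme_percentage_used)
#   nvme_wear_action "$used"
```

Because the value is allowed to exceed 100, the comparison is a plain integer test rather than a bounded percentage check.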
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Updating BIOS on all nodes simultaneously | Cluster-wide outage if update fails | Rolling update, one node at a time |
| No SMART monitoring | Disk failures are surprises | Deploy smartmon_exporter, alert on prefailure |
| Ignoring ECC memory errors | Correctable errors precede uncorrectable ones | Alert on correctable errors, replace DIMMs proactively |
| Skipping BMC firmware updates | BMC vulnerabilities allow remote compromise | Include BMC in firmware update cycle |
| No spare disk inventory | Hours/days waiting for replacement parts | Keep 5-10% spare disks on-site |
| Updating NIC firmware without testing | Network driver compatibility issues | Test NIC firmware in staging first |
| Not documenting which firmware is on which node | Cannot audit compliance, cannot reproduce issues | Maintain CMDB with firmware versions |
| Forgetting to unset Ceph noout after maintenance | Down OSDs are never marked out, so Ceph never re-replicates their data | Script the unset into maintenance-return workflow |
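
The last row of the table lends itself to an automated guard. The following is a minimal sketch: noout_is_set is a hypothetical helper that scans ceph osd dump output for the flags line, and the line format (the word flags followed by a comma-separated list) is an assumption that should be checked against your Ceph release.

```shell
# Read `ceph osd dump` output on stdin; exit 0 if the noout flag is set, 1 otherwise.
noout_is_set() {
  awk '
    $1 == "flags" {
      n = split($2, flag, ",")
      for (i = 1; i <= n; i++) if (flag[i] == "noout") found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Usage at the end of maintenance-return.sh (hypothetical integration):
#   if ceph osd dump | noout_is_set; then
#     echo "WARN: noout still set, unsetting"
#     ceph osd unset noout
#   fi
```

Matching whole comma-separated entries avoids false positives from other flags whose names merely contain the same letters.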
Question 1
You need to update the BIOS on 40 bare metal Kubernetes nodes to patch a critical CVE. Your compliance deadline is 14 days. Each BIOS update requires a reboot that takes approximately 8 minutes. How do you plan this?
Answer
Plan: Rolling BIOS update, 4 nodes per day, completing in 10 working days.
Calculation:
- Per node: drain (5 min) + stage firmware (2 min) + reboot (8 min) + verify (5 min) + uncordon (1 min) = ~21 min
- With a 5-minute buffer between nodes: ~26 min per node
- 4 nodes per day = ~2 hours of active maintenance per day
- 40 nodes / 4 per day = 10 days (within 14-day deadline)
Why 4 per day, not more?
- Maintains spare capacity: only one node is out at a time, and even if all 4 of a day's nodes failed to return, 90% of the cluster would remain available
- Allows monitoring between updates to catch issues early
- If a node fails to return, you have time to investigate before the next batch
Procedure:
- Stage firmware on all 40 BMCs via Redfish (can be done in parallel, no reboot needed)
- Each day: cordon, drain, reboot (triggers staged update), wait for Ready, verify BIOS version, uncordon
- Track progress in a spreadsheet: node, old version, new version, timestamp, status
Risk mitigation:
- Keep 2 buffer days (days 11-12) for failed updates
- Have Dell/HPE support case pre-opened in case of boot failures
- Test on 1 node from each hardware generation on day 1
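
The arithmetic in this answer is easy to re-run for a different fleet size or pace. A minimal sketch, where plan_rollout is a hypothetical helper and the per-step minutes are the estimates from the calculation above:

```shell
# plan_rollout NODES NODES_PER_DAY
# Prints minutes per node (with buffer), active minutes per day, and working days needed.
plan_rollout() {
  local nodes="$1" per_day="$2"
  local per_node=$(( 5 + 2 + 8 + 5 + 1 ))            # drain + stage + reboot + verify + uncordon
  local with_buffer=$(( per_node + 5 ))              # 5-minute buffer between nodes
  local daily=$(( with_buffer * per_day ))
  local days=$(( (nodes + per_day - 1) / per_day ))  # integer division, rounded up
  echo "per_node=${with_buffer}min daily=${daily}min days=${days}"
}

plan_rollout 40 4   # prints: per_node=26min daily=104min days=10
```

The 104 minutes per day is the "~2 hours of active maintenance" in the answer, and the rounding up matters as soon as the fleet size is not a multiple of the daily batch.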
Question 2
A SMART check on /dev/sdb shows Reallocated_Sector_Count = 47 and Current_Pending_Sector = 3. The disk is part of a Ceph OSD. What do you do?
Answer
This disk is showing signs of imminent failure and should be replaced proactively.
Interpretation:
- Reallocated_Sector_Count = 47: The drive has already remapped 47 bad sectors to spare sectors. This is not immediately fatal but indicates the media is degrading.
- Current_Pending_Sector = 3: Three sectors have pending read errors that the drive cannot resolve. These will either be reallocated (if a write succeeds) or become Offline_Uncorrectable (data loss).
Action plan:
- Set Ceph noout to prevent unnecessary rebalancing:

      ceph osd set noout

- Mark the OSD out to begin data migration away from this disk:

      ceph osd out osd.X

- Wait for data to migrate (watch ceph -w until all PGs are active+clean)
- Purge the OSD:

      ceph osd purge osd.X --yes-i-really-mean-it

- Replace the physical disk (schedule datacenter hands if remote)
- Create new OSD on replacement disk:

      ceph-volume lvm create --data /dev/sdb

- Unset noout:

      ceph osd unset noout

Do not wait for the disk to fail completely. With Current_Pending_Sector > 0, the risk of data unavailability increases with each day. Ceph’s replication protects you, but only if you act before multiple disks fail.
Question 3
You are managing firmware updates for a mixed fleet: 20 Dell R640s, 15 Dell R740s, and 10 HPE DL380 Gen10s. Each vendor has different Redfish API endpoints. How do you handle this?
Answer
Build an abstraction layer that maps vendor-specific APIs to a common interface.
Approach:
- Create a hardware inventory database (CMDB) with:
  - Hostname, BMC IP, vendor, model, current firmware versions
  - Redfish API base URL and credentials (from a secrets manager)
- Write vendor-specific adapters:

      # Dell iDRAC Redfish endpoint for BIOS update
      POST /redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate

      # HPE iLO Redfish endpoint for BIOS update
      POST /redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate

      # (Same standard endpoint, but authentication and image staging differ)

- Use a configuration file per vendor:

      vendors:
        dell:
          bios_uri: /redfish/v1/Systems/System.Embedded.1/Bios
          update_uri: /redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate
          auth_type: basic
        hpe:
          bios_uri: /redfish/v1/Systems/1/Bios
          update_uri: /redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate
          auth_type: session

- Alternatively, use existing tools:
  - Dell: racadm CLI or Dell OpenManage
  - HPE: ilorest CLI or HPE OneView
  - Multi-vendor: sushy (OpenStack’s Redfish client library) or Ansible modules (community.general.redfish_*)
- Test each vendor’s update path independently before running a mixed-fleet rolling update.
The Redfish standard (DMTF) aims for vendor interoperability, but implementations diverge on authentication, image staging, and task tracking. Always test with actual hardware.
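
One lightweight form of the adapter idea is a pair of lookup functions that turn a vendor name into the right Redfish path and authentication style. The sketch below is illustrative: redfish_bios_uri and redfish_auth_type are hypothetical helpers, and the paths and auth types are simply the values from the per-vendor configuration in this answer, not a verified list for every BMC generation.

```shell
# Print the BIOS resource path for a vendor; fail for vendors without an adapter.
redfish_bios_uri() {
  case "$1" in
    dell) echo "/redfish/v1/Systems/System.Embedded.1/Bios" ;;
    hpe)  echo "/redfish/v1/Systems/1/Bios" ;;
    *)    echo "unknown vendor: $1" >&2; return 1 ;;
  esac
}

# Print the authentication style the adapter should use for a vendor.
redfish_auth_type() {
  case "$1" in
    dell) echo "basic" ;;
    hpe)  echo "session" ;;
    *)    echo "unknown vendor: $1" >&2; return 1 ;;
  esac
}

# Usage (BMC_ADDR, BMC_USER, BMC_PASS, and the vendor would come from the CMDB):
#   curl -sk -u "$BMC_USER:$BMC_PASS" "https://${BMC_ADDR}$(redfish_bios_uri dell)"
```

Failing loudly on an unknown vendor is deliberate: a new hardware model should force someone to add and test an adapter rather than silently reuse another vendor's paths.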
Question 4
After a BIOS update, a server fails to boot and sits at a blank screen. The BMC (iDRAC/iLO) is still reachable. What are your recovery options?
Answer
Recovery options, in order of preference:
- Roll back BIOS via BMC:

      # Dell iDRAC: restore previous BIOS version
      curl -sk -u admin:password \
        -X POST \
        "https://bmc-addr/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Oem/DellManager.BIOSRollback"

      # HPE iLO: BIOS has a backup ROM, boot from it
      # Use iLO virtual console to select "Boot from Backup ROM"

- Use BMC virtual console + virtual media:
  - Mount a recovery ISO via iDRAC/iLO virtual media
  - Boot from the ISO
  - Flash the previous BIOS version from within the recovery environment
- Reset BIOS to factory defaults:

      # Via Redfish
      curl -sk -u admin:password \
        -X POST \
        "https://bmc-addr/redfish/v1/Systems/1/Bios/Actions/Bios.ResetBios"

  This loses custom BIOS settings (boot order, SR-IOV, virtualization) but should make the server bootable.
- Physical intervention:
  - Clear CMOS via jumper on the motherboard
  - Requires datacenter hands or physical access
Impact on Kubernetes:
- The node was already drained before the BIOS update (if you followed the procedure)
- Workloads are running on other nodes
- No data loss (no pods were running on this node during the update)
- You have time to recover without pressure
Prevention: Always test BIOS updates on one node from each hardware generation before rolling to the fleet.
Hands-On Exercise: Build a Hardware Health Dashboard
Task: Deploy SMART monitoring and create Prometheus alerts for disk health.

    # Deploy smartmon_exporter (example using node_exporter textfile collector)
    cat <<'SMARTEOF' > /tmp/smartmon-collector.sh
    #!/bin/bash
    # Collects SMART metrics for node_exporter textfile collector
    OUTPUT="/var/lib/node_exporter/textfile/smartmon.prom"

    for DISK in /dev/sd{a..z}; do
      [ -b "$DISK" ] || continue
      DEVICE=$(basename "$DISK")

      # grep -c already prints 0 when nothing matches; "|| true" only absorbs its exit status
      HEALTH=$(smartctl -H "$DISK" 2>/dev/null | grep -c "PASSED" || true)
      echo "smartmon_device_smart_healthy{disk=\"${DEVICE}\"} ${HEALTH:-0}"

      REALLOC=$(smartctl -A "$DISK" 2>/dev/null | awk '/Reallocated_Sector/ {print $10; exit}')
      echo "smartmon_reallocated_sector_count{disk=\"${DEVICE}\"} ${REALLOC:-0}"

      PENDING=$(smartctl -A "$DISK" 2>/dev/null | awk '/Current_Pending/ {print $10; exit}')
      echo "smartmon_current_pending_sector{disk=\"${DEVICE}\"} ${PENDING:-0}"

      # SSD wear leveling (Wear_Leveling_Count or Media_Wearout_Indicator)
      # Emit the metric only when the drive reports one of these attributes
      WEAR=$(smartctl -A "$DISK" 2>/dev/null | awk '/Wear_Leveling_Count|Media_Wearout_Indicator/ {print $4; exit}')
      if [ -n "$WEAR" ]; then
        echo "smartmon_wear_leveling_count_value{disk=\"${DEVICE}\"} ${WEAR}"
      fi
    done > "${OUTPUT}.tmp"
    # Rename into place so node_exporter never reads a half-written file
    mv "${OUTPUT}.tmp" "$OUTPUT"
    SMARTEOF
    chmod +x /tmp/smartmon-collector.sh

- Review the collector script and understand which SMART attributes it exports
- Create Prometheus alerting rules for disk prefailure conditions
- Define escalation thresholds:
  - Warning: Reallocated sectors > 0
  - Critical: Current pending sectors > 0 or SMART health check failed
- Plan the disk replacement workflow as a runbook document
Success Criteria
- Understand which SMART attributes indicate imminent failure
- Can explain the difference between reallocated, pending, and uncorrectable sectors
- Know the Ceph OSD replacement procedure (noout, out, purge, replace, create)
- Can use Redfish API to query firmware versions
- Understand the rolling firmware update workflow (drain, update, reboot, verify, uncordon)
Next Module
Continue to Module 7.3: Node Failure & Auto-Remediation to learn how to detect and automatically recover from node failures using Machine Health Checks and node problem detector.