Day-2 Operations

Day-2 operations on bare metal Kubernetes clusters are fundamentally different from managed cloud services. There is no “upgrade cluster” button, no auto-replacing failed nodes, no built-in observability stack. You own the hardware, the OS, the control plane, and every component in between.

These modules cover the operational practices that keep on-premises clusters healthy, current, and scalable over multi-year hardware lifecycles.

Modules

Module	Description	Time
7.1 Kubernetes Upgrades on Bare Metal	kubeadm upgrade path, drain strategies, rollback, version skew	60 min
7.2 Hardware Lifecycle & Firmware	BIOS/firmware updates, disk replacement, SMART monitoring, Redfish API	60 min
7.3 Node Failure & Auto-Remediation	Machine Health Checks, node problem detector, automated reboot/reprovision	60 min
7.4 Observability Without Cloud Services	Self-hosted Prometheus + Thanos, Grafana, Loki, IPMI exporter	60 min
7.5 Capacity Expansion & Hardware Refresh	Adding racks, mixed CPU generations, topology spread, refresh cycles	60 min