Day-2 Operations

Day-2 operations on bare-metal Kubernetes clusters differ fundamentally from those on managed cloud services. There is no “upgrade cluster” button, no automatic replacement of failed nodes, and no built-in observability stack. You own the hardware, the OS, the control plane, and every component in between.

These modules cover the operational practices that keep on-premises clusters healthy, current, and scalable over multi-year hardware lifecycles.

| Module | Description | Time |
|--------|-------------|------|
| 7.1 Kubernetes Upgrades on Bare Metal | kubeadm upgrade path, drain strategies, rollback, version skew | 60 min |
| 7.2 Hardware Lifecycle & Firmware | BIOS/firmware updates, disk replacement, SMART monitoring, Redfish API | 60 min |
| 7.3 Node Failure & Auto-Remediation | Machine Health Checks, node problem detector, automated reboot/reprovision | 60 min |
| 7.4 Observability Without Cloud Services | Self-hosted Prometheus + Thanos, Grafana, Loki, IPMI exporter | 60 min |
| 7.5 Capacity Expansion & Hardware Refresh | Adding racks, mixed CPU generations, topology spread, refresh cycles | 60 min |