Module 7.5: Capacity Expansion & Hardware Refresh
Complexity: [COMPLEX] | Time: 60 minutes
Prerequisites: Module 7.4: Observability Without Cloud Services, Module 1.2: Server Sizing
What You’ll Be Able to Do
After completing this module, you will be able to:
- Plan capacity expansion that accounts for CPU generation differences, topology constraints, and scheduler behavior with heterogeneous hardware
- Implement node labeling, taints, and topology spread constraints to manage mixed-generation server pools effectively
- Design a hardware decommissioning process that respects capacity limits, PodDisruptionBudgets, and storage rebalancing
- Optimize cluster scheduling policies to distribute workloads appropriately across nodes with different performance characteristics
Why This Module Matters
When a team expands a large bare-metal Kubernetes cluster with a newer hardware generation, scheduling behavior can change in ways that are not obvious during initial bring-up.
Workloads can perform differently across CPU generations, while the default Kubernetes scheduler still reasons about requested CPU as a quantity rather than benchmarked per-core performance. If teams respond by manually pinning workloads to newer nodes, overall cluster utilization can become uneven.
Decommissioning older nodes without spread constraints and spare capacity can overload the remaining cluster, trigger evictions or OOM-related instability, and even degrade the monitoring systems you need during the change.
The lesson: adding hardware to a Kubernetes cluster is not just racking and stacking. You need to account for CPU generation differences, topology constraints, scheduling policies, and a decommission plan that respects capacity limits.
What You’ll Learn
- Adding new racks and nodes to existing clusters
- Managing mixed CPU generations (Intel and AMD)
- Topology spread constraints for heterogeneous hardware
- Decommissioning old nodes safely
- 3-year vs 5-year hardware refresh cycles
- Capacity planning with hardware generations
Adding New Racks to Existing Clusters
Physical and Network Prerequisites

```mermaid
flowchart TD
    subgraph Before ["Before Racking Servers"]
        direction TB
        B1["Network: leaf switch installed, cabled to spines"]
        B2["Power: PDUs installed, circuits provisioned"]
        B3["VLANs: management, production, storage trunked on leaf"]
        B4["BGP: leaf peering with spines (new AS number for rack)"]
        B5["PXE: DHCP relay configured for new subnet"]
        B6["DNS: reverse DNS entries for new BMC/management IPs"]
        B7["IPAM: IP ranges allocated for nodes, pods, services"]
        B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7
    end

    subgraph After ["After Racking Servers"]
        direction TB
        A1["BMC configured (IP, credentials, NTP)"]
        A2["PXE boot OS image"]
        A3["Configure networking (bonds, VLANs, routes)"]
        A4["Install kubelet, kubeadm, container runtime (cgroup v2)"]
        A5["Join cluster with kubeadm join"]
        A6["Label nodes (rack, generation, hardware model)"]
        A7["Verify CNI connectivity to existing nodes"]
        A8["Verify CSI storage access"]
        A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7 --> A8
    end

    Before --> After
```

Stop and think: If you provision a new rack with older OS images (which default to cgroup v1) and try to join them to a Kubernetes 1.35+ cluster, what will happen? By default, the kubelet refuses to start on a cgroup v1 host, because cgroup v1 support is deprecated as of v1.35. To register the node, boot it with cgroup v2 and make sure the kubelet and the container runtime use the same cgroup driver; `systemd` is the recommended choice.
Pause and predict: You are adding 40 new AMD EPYC servers to a cluster running Intel Xeon nodes. The Kubernetes scheduler sees “32 cores available” on both, but the AMD cores are 44% faster per-core. How would you prevent latency-sensitive pods from being scheduled on slower Intel nodes without hardcoding node names?
Node Provisioning Script for New Rack
Unlike in cloud environments, the vanilla Kubernetes Cluster Autoscaler cannot rack, power on, or PXE boot a bare-metal server: it scales node groups through a provider API. Unless you run such an integration, bare-metal capacity expansion requires manual or scripted provisioning. Before running any automation, ensure that each bare-metal server has a unique hostname, MAC address, and product_uuid, because kubeadm installation and node registration may fail if these are duplicated.
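The uniqueness requirement above can be checked before any server is booted. The following is a minimal Python sketch, assuming a hypothetical inventory of dictionaries with `hostname`, `mac`, and `product_uuid` keys; the field names and sample values are illustrative and do not come from any real tool.

```python
from collections import Counter


def find_duplicates(inventory, fields=("hostname", "mac", "product_uuid")):
    """Return {field: sorted duplicated values} for a list of server dicts."""
    duplicates = {}
    for field in fields:
        # Compare case-insensitively: MACs and UUIDs are often printed in mixed case.
        counts = Counter(str(server[field]).strip().lower() for server in inventory)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[field] = repeated
    return duplicates


# Hypothetical inventory: worker-43 reuses the MAC address of worker-41.
inventory = [
    {"hostname": "worker-41", "mac": "AA:BB:CC:00:00:01", "product_uuid": "1111-a"},
    {"hostname": "worker-42", "mac": "aa:bb:cc:00:00:02", "product_uuid": "1111-b"},
    {"hostname": "worker-43", "mac": "aa:bb:cc:00:00:01", "product_uuid": "1111-c"},
]

print(find_duplicates(inventory))
```

An empty result means the inventory is safe to hand to the provisioning automation; any non-empty entry should be resolved first.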
The provisioning script below (provision-new-rack.sh) automates the most error-prone part of rack expansion: waiting for each server to PXE boot, joining it to the cluster, and applying the correct topology labels. Labels for rack, hardware generation, and CPU model enable scheduling policies that account for heterogeneous hardware.
```bash
#!/bin/bash
# provision-new-rack.sh: add a rack of servers to an existing cluster
set -euo pipefail

RACK_ID="$1"        # e.g., rack-e
NODES_FILE="$2"     # CSV lines: hostname,bmc-ip,mgmt-ip
JOIN_TOKEN="$3"     # from kubeadm token create (default TTL is 24h)
CA_CERT_HASH="$4"   # from kubeadm
API_SERVER="$5"     # e.g., 10.0.10.10:6443

# NODE_NAME is used instead of HOSTNAME, which bash sets for the local machine
while IFS=, read -r NODE_NAME BMC_IP MGMT_IP; do
  echo "=== Provisioning ${NODE_NAME} in ${RACK_ID} ==="

  # Wait for node to be PXE booted and accessible.
  # ssh -n stops ssh from swallowing the remaining lines of NODES_FILE on stdin.
  echo "Waiting for ${NODE_NAME} to be reachable via SSH..."
  until ssh -n -o ConnectTimeout=5 root@"$MGMT_IP" true 2>/dev/null; do
    sleep 10
  done

  # Join the cluster (the unquoted heredoc expands the variables locally)
  ssh root@"$MGMT_IP" bash <<REMOTE_EOF
kubeadm join ${API_SERVER} \
  --token ${JOIN_TOKEN} \
  --discovery-token-ca-cert-hash sha256:${CA_CERT_HASH}
REMOTE_EOF

  # Wait for the node to register with the API server
  # (kubeadm join returns before the Node object is fully created)
  echo "Waiting for ${NODE_NAME} to register..."
  until kubectl get node "$NODE_NAME" &>/dev/null; do
    sleep 5
  done
  kubectl wait --for=condition=Ready "node/$NODE_NAME" --timeout=120s

  # Label the node from a control plane.
  # Generation, vendor, and model are hardcoded for this rack; adjust per purchase.
  echo "Labeling ${NODE_NAME}..."
  kubectl label node "$NODE_NAME" \
    topology.kubernetes.io/zone="${RACK_ID}" \
    kubedojo.io/rack="${RACK_ID}" \
    kubedojo.io/hardware-gen="gen3" \
    kubedojo.io/cpu-vendor="amd" \
    kubedojo.io/cpu-model="epyc-9354" \
    --overwrite

  echo "=== ${NODE_NAME} joined and labeled ==="
done < "$NODES_FILE"

echo "All nodes in ${RACK_ID} provisioned."
echo "Run: kubectl get nodes -l kubedojo.io/rack=${RACK_ID}"
```

Mixed CPU Generations
The Problem with Heterogeneous Performance
When mixing CPU generations, you must account for varying hardware capabilities. In Kubernetes 1.35, the Topology Manager’s max-allowable-numa-nodes policy option reached General Availability (GA); it lets the Topology Manager run on nodes that expose more than eight NUMA nodes, which matters on some modern multi-socket AMD and Intel systems. However, even with advanced topology management, the primary challenge remains: raw performance differences across generations.
| Model | Year | Cores | Single-Thread Score (Passmark) | Multi-Thread Score (Passmark) |
|---|---|---|---|---|
| Xeon Silver 4214 | 2019 | 12 | 1,800 | 15,200 |
| Xeon Gold 6330 | 2021 | 28 | 2,100 | 35,000 |
| EPYC 9354 | 2023 | 32 | 2,600 | 53,000 |
The EPYC 9354 substantially outperforms the 4214 in both single-threaded and multithreaded benchmark data, but the exact percentage depends on the benchmark source and snapshot date.
Kubernetes natively sees: “32 cores available” on both. Reality dictates: 32 EPYC cores >> 32 Xeon Silver cores.
Labeling Hardware Generations
```bash
# Label all nodes with their hardware generation
# This enables scheduling policies based on performance tier

# Gen 1: 2019 hardware (Cascade Lake)
kubectl label nodes -l kubedojo.io/cpu-model=xeon-4214 \
  kubedojo.io/performance-tier=standard

# Gen 2: 2021 hardware (Ice Lake)
kubectl label nodes -l kubedojo.io/cpu-model=xeon-6330 \
  kubedojo.io/performance-tier=high

# Gen 3: 2023 hardware (Genoa)
kubectl label nodes -l kubedojo.io/cpu-model=epyc-9354 \
  kubedojo.io/performance-tier=premium
```

Scheduling Policies for Mixed Hardware
```yaml
# Option 1: Prefer newer hardware (soft preference)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive-app
spec:
  selector:                # required for apps/v1 Deployments
    matchLabels:
      app: latency-sensitive-app
  template:
    metadata:
      labels:
        app: latency-sensitive-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubedojo.io/performance-tier
                    operator: In
                    values: [premium]
            - weight: 50
              preference:
                matchExpressions:
                  - key: kubedojo.io/performance-tier
                    operator: In
                    values: [high]
      containers:
        - name: app
          image: my-app:latest
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
---
# Option 2: Require specific hardware (hard requirement)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-job
spec:
  selector:                # required for apps/v1 Deployments
    matchLabels:
      app: ml-training-job
  template:
    metadata:
      labels:
        app: ml-training-job
    spec:
      nodeSelector:
        # Pods stay Pending if no node carries both labels
        kubedojo.io/cpu-vendor: amd
        kubedojo.io/performance-tier: premium
      containers:
        - name: training
          image: tensorflow:latest
          resources:
            requests:
              cpu: "16"
              memory: 64Gi
```

Weighted Resource Capacity
Kubernetes sees all CPU cores as equal, but they are not. Use benchmark data to calculate normalized capacity, because removing older nodes can reduce effective compute by much less than raw node counts suggest. Always use weighted capacity calculations when planning decommissions.
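A weighted capacity calculation can be sketched in a few lines of Python. This example assumes the per-core scores from the benchmark table above (1,800 for the Xeon Silver 4214 and 2,600 for the EPYC 9354) and the 60-node plus 40-node fleet used in Question 1 of the quiz; the function and field names are illustrative.

```python
def raw_cores(pools):
    """Total core count, which is what Kubernetes effectively reasons about."""
    return sum(p["nodes"] * p["cores_per_node"] for p in pools)


def weighted_capacity(pools):
    """Total cores multiplied by each pool's per-core benchmark score."""
    return sum(p["nodes"] * p["cores_per_node"] * p["score_per_core"] for p in pools)


def reduction_pct(before, after):
    """Percentage lost when going from `before` to `after`, to one decimal."""
    return round((before - after) / before * 100, 1)


before = [
    {"name": "xeon-4214", "nodes": 60, "cores_per_node": 12, "score_per_core": 1800},
    {"name": "epyc-9354", "nodes": 40, "cores_per_node": 32, "score_per_core": 2600},
]
# Decommission 20 of the Xeon nodes; the EPYC pool is unchanged.
after = [dict(before[0], nodes=40), before[1]]

raw_loss = reduction_pct(raw_cores(before), raw_cores(after))
weighted_loss = reduction_pct(weighted_capacity(before), weighted_capacity(after))
print(f"raw core loss: {raw_loss}%  weighted loss: {weighted_loss}%")
```

The gap between the two percentages is the planning error you would make by counting cores alone. The result is only as good as the benchmark you pick, so prefer scores measured with your own workloads where possible.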
Topology Spread Constraints for Heterogeneous Hardware
When you have multiple hardware generations across multiple racks, topology spread constraints ensure workloads are distributed to survive rack failures and hardware-specific issues.
Stop and think: You have a critical service with 6 replicas spread across 3 racks. You add a 4th rack, but no pods move to it: `maxSkew: 1` with `DoNotSchedule` is evaluated only when a pod is scheduled, so running pods stay where they are. How would you rebalance pods across all 4 racks?
Multi-Dimensional Topology Spread
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service
spec:
  replicas: 6
  selector:                # required for apps/v1 Deployments
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      topologySpreadConstraints:
        # Spread across racks (survive rack failure)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: critical-service
        # Spread across hardware generations (survive generation-specific bug)
        - maxSkew: 2
          topologyKey: kubedojo.io/hardware-gen
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: critical-service
      containers:
        - name: app
          image: critical-service:latest
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

Visualizing Topology Distribution
```mermaid
flowchart TD
    subgraph RackA ["Rack A (Gen 1 + Gen 3)"]
        direction TB
        A1["[worker-01 gen1] pod-1"]
        A2["[worker-02 gen1] (empty)"]
        A3["[worker-21 gen3] pod-2"]
    end
    subgraph RackB ["Rack B (Gen 1 + Gen 2)"]
        direction TB
        B1["[worker-05 gen1] pod-3"]
        B2["[worker-11 gen2] pod-4"]
        B3["[worker-12 gen2] (empty)"]
    end
    subgraph RackC ["Rack C (Gen 2 + Gen 3)"]
        direction TB
        C1["[worker-15 gen2] pod-5"]
        C2["[worker-25 gen3] pod-6"]
        C3["[worker-26 gen3] (empty)"]
    end

    RackA ~~~ RackB ~~~ RackC
```

Result:
- Rack distribution: 2 pods per rack (maxSkew=1 satisfied)
- Gen distribution: gen1=2, gen2=2, gen3=2 (maxSkew=2 OK)
- Rack failure: lose 2/6 pods = service continues
- Gen-specific bug: affects 2/6 pods = service continues
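The distribution in the diagram can be verified with a short Python sketch. The pod placements are copied from the diagram; the `imbalance` helper is illustrative and reports the observed difference between the fullest and emptiest domain, which is a simpler quantity than the per-pod check the scheduler performs.

```python
from collections import Counter


def imbalance(pods, key, domains):
    """Largest minus smallest pod count across the given topology domains."""
    counts = Counter(pod[key] for pod in pods)
    per_domain = [counts.get(domain, 0) for domain in domains]
    return max(per_domain) - min(per_domain)


# Placements taken from the diagram: (pod, rack, hardware generation)
PODS = [
    {"name": "pod-1", "rack": "rack-a", "gen": "gen1"},
    {"name": "pod-2", "rack": "rack-a", "gen": "gen3"},
    {"name": "pod-3", "rack": "rack-b", "gen": "gen1"},
    {"name": "pod-4", "rack": "rack-b", "gen": "gen2"},
    {"name": "pod-5", "rack": "rack-c", "gen": "gen2"},
    {"name": "pod-6", "rack": "rack-c", "gen": "gen3"},
]
RACKS = ["rack-a", "rack-b", "rack-c"]
GENS = ["gen1", "gen2", "gen3"]

rack_imbalance = imbalance(PODS, "rack", RACKS)
gen_imbalance = imbalance(PODS, "gen", GENS)
# Simulate losing rack-a entirely.
survivors = [pod for pod in PODS if pod["rack"] != "rack-a"]

print(rack_imbalance, gen_imbalance, len(survivors))
```

Passing the full list of domains matters: a domain with zero pods does not appear in the counts at all, so it has to be supplied explicitly to be measured.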
Decommissioning Old Nodes
Removing nodes requires careful capacity planning to avoid overloading the remaining cluster.
Pause and predict: Before decommissioning 20 nodes, you need to verify the remaining cluster can handle the load. But Kubernetes reports CPU in cores — and not all cores are equal. A 2023 AMD core delivers 44% more throughput than a 2019 Intel core. How do you calculate the true capacity impact of removing 20 Intel nodes?
Decommission Checklist
This script performs safety checks before removing a node: verifying remaining capacity will stay below 80%, checking for local PersistentVolumes that would be lost, and then draining and deleting the node.
```bash
#!/bin/bash
# decommission-node.sh: safely remove a node from the cluster
set -euo pipefail

NODE="$1"

echo "=== Pre-decommission checks for ${NODE} ==="

# Check 1: Will remaining capacity handle the load?
# Normalize CPU values to millicores: K8s returns either "4" (cores) or "3900m" (millicores)
TOTAL_CPU=$(kubectl get nodes -o json | jq '
  [.items[].status.allocatable.cpu
   | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end
  ] | add')
NODE_CPU=$(kubectl get node "$NODE" -o json | jq '
  .status.allocatable.cpu
  | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end')
REMAINING_CPU=$((TOTAL_CPU - NODE_CPU))
# Count only pods that still hold resources (skip Succeeded/Failed pods)
REQUESTED_CPU=$(kubectl get pods -A -o json | jq '
  [.items[]
   | select(.status.phase == "Running" or .status.phase == "Pending")
   | .spec.containers[].resources.requests.cpu // "0"
   | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end
  ] | add // 0')
UTILIZATION=$((REQUESTED_CPU * 100 / REMAINING_CPU))

echo "Total allocatable CPU: ${TOTAL_CPU}m"
echo "This node CPU: ${NODE_CPU}m"
echo "Remaining CPU after removal: ${REMAINING_CPU}m"
echo "Total requested CPU: ${REQUESTED_CPU}m"
echo "Utilization after removal: ${UTILIZATION}%"

if [ "$UTILIZATION" -gt 80 ]; then
  echo "WARNING: Cluster will be at >80% CPU utilization after removing this node."
  echo "Consider adding capacity before decommissioning."
  read -p "Continue anyway? (y/N) " -n 1 -r
  echo
  [[ $REPLY =~ ^[Yy]$ ]] || exit 1
fi

# Check 2: Any local PVs on this node?
# The []? operators skip PVs that have no nodeAffinity instead of aborting jq
LOCAL_PVS=$(kubectl get pv -o json | jq -r --arg node "$NODE" '
  .items[]
  | select(any(.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[]?.values[]?; . == $node))
  | .metadata.name')

if [ -n "$LOCAL_PVS" ]; then
  echo "WARNING: Node has local PVs that will be lost:"
  echo "$LOCAL_PVS"
  echo "Migrate data before proceeding."
  exit 1
fi

# Step 3: Drain the node
echo "Draining ${NODE}..."
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=600s

# Step 4: Remove from cluster
echo "Removing ${NODE} from cluster..."
kubectl delete node "$NODE"

# Step 5: On the node itself (via SSH or BMC):
#   kubeadm reset
#   Clean up iptables, IPVS rules, CNI config

echo "=== ${NODE} decommissioned ==="
echo "Remember to:"
echo "  1. Power off the server"
echo "  2. Update CMDB/inventory"
echo "  3. Reclaim rack space"
echo "  4. Update monitoring targets"
echo "  5. Update PXE/DHCP reservations"
```

When physically powering down decommissioned nodes, do not simply turn them off. Although the Graceful Node Shutdown feature is enabled by default in Kubernetes, it has no effect unless you explicitly set shutdownGracePeriod to a non-zero value in your kubelet configuration. Always use the kubectl drain process to safely evict workloads.
When decommissioning in batches, remove 5 nodes at a time over 1-2 day phases. Monitor utilization overnight after each batch. Never exceed 80% cluster utilization during the process. After all nodes are removed, verify no orphaned PVs remain and update monitoring targets, alerting thresholds, and spare node counts.
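The batch rule above can be turned into a pre-flight plan. The sketch below is a simplified Python model, assuming identical nodes and constant CPU requests during the decommission; all input figures in the example are hypothetical.

```python
def plan_batches(total_cpu_m, node_cpu_m, requested_cpu_m, nodes_to_remove,
                 batch_size=5, ceiling_pct=80):
    """Return [(batch, nodes_removed_so_far, utilization_pct)].

    Planning stops before the first batch that would push requested CPU
    above ceiling_pct of the remaining allocatable CPU (all in millicores).
    """
    plan = []
    removed = 0
    while removed < nodes_to_remove:
        step = min(batch_size, nodes_to_remove - removed)
        remaining = total_cpu_m - (removed + step) * node_cpu_m
        if remaining <= 0:
            break
        utilization = requested_cpu_m * 100 / remaining
        if utilization > ceiling_pct:
            break
        removed += step
        plan.append((len(plan) + 1, removed, round(utilization, 1)))
    return plan


# Hypothetical cluster: 2,000 cores allocatable, 12-core nodes, 1,300 cores requested.
for batch, removed, pct in plan_batches(2_000_000, 12_000, 1_300_000, 20):
    print(f"batch {batch}: {removed} nodes removed, {pct}% requested")
```

If the returned plan covers fewer nodes than you intended to remove, add capacity or reduce requests before starting rather than discovering the ceiling mid-decommission.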
3-Year vs 5-Year Hardware Refresh Cycles
Cost Comparison
For a 100-node cluster, refresh-cycle cost depends heavily on hardware pricing, support terms, energy costs, and workload requirements:
| Factor | 3-Year Cycle | 5-Year Cycle |
|---|---|---|
| Amortized CapEx/year | Higher annualized spend with faster refresh | Lower annualized spend with slower refresh |
| Support contracts | Typically lower over a shorter lifecycle | Typically higher as hardware ages |
| Power (total) | Typically lower with newer, more efficient nodes | Typically higher when older nodes stay in service longer |
| Failure rate (end of life) | Typically lower | Typically higher |
| Performance vs current gen | Closer to current-generation performance | Further behind current-generation performance |
| Total cost | Depends on your hardware, power, support, and failure assumptions | Depends on your hardware, power, support, and failure assumptions |
A shorter refresh cycle usually increases annualized capital spend, while a longer cycle can increase operational risk, support burden, and power costs.
Choose 3-year cycles for performance-sensitive workloads, rapid growth, or when power efficiency matters. Choose 5-year cycles for budget-constrained environments with stable, predictable loads that are not CPU-bound.
Staggered Refresh Strategy
```mermaid
timeline
    title Staggered Refresh (33 nodes/year rolling)
    Year 1 : Buy 33 new nodes (Gen N+3) : Decommission 33 oldest
    Year 2 : Buy 33 new nodes (Gen N+4) : Decommission 33 oldest
    Year 3 : Buy 34 new nodes (Gen N+5) : Decommission 34 oldest
    Year 4 : Buy 33 new nodes (Gen N+6) : Decommission 33 oldest
```

Benefits:
- Smooth CapEx (steady annual purchases instead of $1M every 3 years)
- Always have recent hardware in the fleet
- Never need to decommission more than 33% at once
- Team practices add/remove procedure regularly
- Each year you learn what works for the new hardware gen
Challenges:
- 3 hardware generations in the cluster simultaneously
- Must handle CPU/memory heterogeneity in scheduling
- Firmware update process covers multiple vendor models
Capacity Planning with Hardware Generations
When forecasting long-term growth across multiple hardware refresh cycles, remember that Kubernetes v1.35 has official large-cluster tested limits: a maximum of 5,000 nodes, 110 pods per node, 150,000 total pods, and 300,000 total containers. Your capacity plans must ensure all four constraints are met simultaneously.
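Checking a forecast against all four limits at once is a small calculation. This Python sketch uses the limits quoted above; the function name and the example forecasts are hypothetical.

```python
# Tested limits quoted in the paragraph above.
LIMITS = {
    "nodes": 5000,
    "pods_per_node": 110,
    "total_pods": 150000,
    "total_containers": 300000,
}


def violated_limits(nodes, pods_per_node, containers_per_pod):
    """Return the names of every limit the forecast exceeds."""
    total_pods = nodes * pods_per_node
    observed = {
        "nodes": nodes,
        "pods_per_node": pods_per_node,
        "total_pods": total_pods,
        "total_containers": total_pods * containers_per_pod,
    }
    return [name for name, limit in LIMITS.items() if observed[name] > limit]


# 2,000 nodes look safe on their own, but dense packing breaks the pod limits.
print(violated_limits(nodes=2000, pods_per_node=100, containers_per_pod=2))
# A larger but sparser cluster fits all four limits.
print(violated_limits(nodes=5000, pods_per_node=30, containers_per_pod=2))
```

Note that the per-node and per-cluster maxima cannot all be reached together: 5,000 nodes at 110 pods each would be 550,000 pods, far above the 150,000 total.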
Monitoring Capacity Trends
Create Prometheus recording rules that track CPU capacity and utilization broken down by hardware generation. The most valuable metric is cluster:capacity_days_remaining, which uses deriv() over a 30-day window to project when current capacity will be exhausted at the current growth rate. Alert when this drops below 60 days to trigger procurement.
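The arithmetic behind such a projection can be illustrated outside Prometheus. The Python sketch below fits a least-squares line through daily usage samples, which is the same idea `deriv()` applies to a range vector, and divides the remaining headroom by the slope. It is not the recording rule itself; the sample data and the capacity figure are hypothetical.

```python
def days_remaining(samples, capacity):
    """Project days until `capacity` is reached.

    samples: list of (day, used_cores) pairs, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Least-squares slope: growth in cores per day.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    variance = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = covariance / variance
    if slope <= 0:
        return None
    latest_usage = samples[-1][1]
    return (capacity - latest_usage) / slope


# Hypothetical 30-day window: usage grows by 4 cores per day from 1,200 cores.
window = [(day, 1200 + 4 * day) for day in range(31)]
projection = days_remaining(window, capacity=1600)
print(projection, "-> alert" if projection is not None and projection < 60 else "-> ok")
```

A linear fit is a deliberate simplification: seasonal or step-change growth needs a longer window or a different model before you trust the alert.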
Did You Know?
- Kubernetes 1.35 is the last release to support the containerd 1.x series. If your hardware refresh involves reinstalling the operating system and container runtime on new servers, plan to use containerd 2.x or another CRI-conformant runtime before upgrading beyond 1.35, because newer kubelets continue tightening runtime compatibility.
- Switching CPU vendors usually means replacing the server platform, not just the processor, because server sockets and platform compatibility differ by vendor and generation. This is why vendor choice in the initial purchase has long-term implications.
- Large HPC operators often plan refreshes years in advance and may run old and new systems in parallel during transitions. Similar overlap planning can help large Kubernetes operators reduce migration risk during hardware refreshes.
- Kubernetes 1.35 graduated In-place Pod Resize to General Availability (GA). This lets you change CPU and memory requests and limits for running containers without recreating the Pod, which can reduce disruption when right-sizing long-running workloads on mixed hardware.
- Kubernetes 1.24 added the `minDomains` field behind the `MinDomainsInPodTopologySpread` feature gate (stable in 1.30), which lets you specify the minimum number of topology domains a workload should span. This is particularly useful during hardware refresh: you can require pods to be spread across at least 2 hardware generations, ensuring a generation-specific bug does not take down all replicas.
- Recent industry reporting suggests that many operators are extending server lifecycles beyond the traditional three-year window. Even so, newer hardware can still offer meaningful efficiency gains, so refresh timing should be based on measured total cost of ownership rather than purchase price alone.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No node labels for hardware generation | Cannot schedule based on performance tier | Label all nodes with generation, CPU model, and tier |
| Assuming all CPU cores are equal | Uneven performance across hardware generations | Use weighted capacity calculations for planning |
| Decommissioning without capacity check | Cluster overloaded after removing nodes | Calculate post-removal utilization before draining |
| No topology spread across generations | Generation-specific bug (BIOS, kernel) affects all replicas | Use topologySpreadConstraints with hardware-gen key |
| Big-bang hardware refresh | All 100 nodes replaced at once = massive risk | Stagger refreshes: 33 nodes/year rolling |
| Ignoring power efficiency in refresh math | Old servers cost more to power | Include power costs in TCO comparison |
| Not updating monitoring after adding rack | New nodes invisible to alerting | Add new BMC addresses to IPMI exporter, update Prometheus targets |
| Mixing Intel and AMD without testing | Application-level differences (AVX, memory model) | Test workloads on new architecture in staging first |
Question 1
You have a 100-node cluster: 60 nodes with Intel Xeon Silver 4214 (12 cores, 2019) and 40 nodes with AMD EPYC 9354 (32 cores, 2023). You need to decommission 20 of the oldest Intel nodes. What is the actual capacity impact, and how do you validate that the cluster can handle it?
Answer
You must normalize the CPU capacity using performance benchmarks because Kubernetes scheduling is naive and treats all CPU millicores as identical. By calculating the weighted capacity, you reveal that removing 20 older nodes only impacts overall performance by 9.3%, rather than the 12% that raw core counts suggest. This prevents you from over-provisioning replacement hardware or accidentally starving workloads during the decommission phase. Validating the cluster can handle it involves checking the actual allocated resources against this newly calculated baseline, ensuring you stay below the 80% safety threshold.
Capacity impact analysis:
Before decommission:
- Intel nodes: 60 x 12 cores = 720 cores
- AMD nodes: 40 x 32 cores = 1,280 cores
- Total: 2,000 cores
After decommission (remove 20 Intel):
- Intel nodes: 40 x 12 cores = 480 cores
- AMD nodes: 40 x 32 cores = 1,280 cores
- Total: 1,760 cores
- Reduction: 240 cores = 12% of total core count
However, performance-adjusted capacity:
- Intel 4214 passmark per core: ~1,800
- AMD 9354 passmark per core: ~2,600
- Before: (720 x 1,800) + (1,280 x 2,600) = 1,296,000 + 3,328,000 = 4,624,000 units
- After: (480 x 1,800) + (1,280 x 2,600) = 864,000 + 3,328,000 = 4,192,000 units
- Actual performance reduction: 9.3% (less than the 12% core count suggests)
Validation steps:
- Check current cluster-wide CPU utilization: `kubectl top nodes --sort-by=cpu`
- Calculate requested vs allocatable: `kubectl describe nodes | grep -A 5 "Allocated resources"`
- Verify no workloads are pinned to the Intel nodes being removed
- Check PDBs and topology constraints will still be satisfiable with 80 nodes
- Run the decommission in batches (5 nodes at a time) with monitoring
Question 2
Your cluster runs on 3 racks with 20 nodes each. You are adding a 4th rack with 20 new nodes (newer hardware generation). Your critical service has a topology spread constraint of maxSkew: 1 on topology.kubernetes.io/zone. After adding the new rack, none of the service's pods are running on the 4th rack. Why?
Answer
Nothing in the scheduler moves running pods. Topology spread constraints are evaluated only when a pod is scheduled, so a stable Deployment keeps its existing placement on the original three racks and the new rack stays empty until pods are recreated. The skew rule itself does not block the new rack: for an incoming pod, skew is the number of matching pods in the candidate domain (counting the incoming pod) minus the global minimum, so with DoNotSchedule the empty rack is the only domain that satisfies a maximum skew of 1. If pods created after the expansion still avoid the rack, check that its nodes carry the topology.kubernetes.io/zone label (nodes missing the topology key are skipped) and that taints, affinity rules, and free resources let the pod fit. To rebalance in modern Kubernetes (1.27+), use matchLabelKeys targeting the pod template hash and trigger a rollout. This scopes the skew calculation to the new ReplicaSet being rolled out, so the rollout ends with pods spread evenly across all four racks.
The math:
- Existing: 3 racks, each with some pods of the critical service
- Say the service has 9 replicas: 3 per rack (skew = 0, within maxSkew=1)
- New rack-d has 0 replicas
When a new pod needs to be scheduled:
- rack-a: 3, rack-b: 3, rack-c: 3, rack-d: 0
- Global minimum: 0 (rack-d)
- Placing on rack-a, rack-b, or rack-c: 4 - 0 = 4, which exceeds maxSkew=1
- Placing on rack-d: 1 - 0 = 1, which satisfies maxSkew=1
- Result: the pod can ONLY schedule on rack-d; if rack-d cannot host it, the pod stays Pending
Fix options:
- Use `matchLabelKeys` (recommended, K8s 1.27+): Add `matchLabelKeys: ["pod-template-hash"]` to the topology spread constraint:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    matchLabelKeys:
      - pod-template-hash
    labelSelector:
      matchLabels:
        app: critical-service
```

  Then run `kubectl rollout restart deployment critical-service` to rebalance across all 4 racks.
- Temporarily relax the constraint with `maxSkew: 2` (allows wider skew during expansion, for example while rack-d is still short on capacity)
- Use `whenUnsatisfiable: ScheduleAnyway` (soft constraint)
- Scale up the deployment so the new pods are placed on rack-d, then scale back down
Note: A plain rollout restart without matchLabelKeys also recreates pods, but the default labelSelector matches pods from both old and new ReplicaSets. The skew calculation then mixes the old distribution with the new one, so the final spread can end up uneven. Scoping by pod-template-hash avoids that.
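The placement rule for an incoming pod can be checked with a few lines of Python. This is a simplified model of the skew definition in the upstream topology spread documentation (matching pods in the candidate domain, counting the incoming pod, minus the global minimum); it ignores `minDomains`, node eligibility, taints, and resource fit, and the function name is illustrative.

```python
def allowed_domains(counts, max_skew):
    """Domains where one more pod keeps (count + 1 - global minimum) <= max_skew."""
    global_min = min(counts.values())
    return sorted(
        domain for domain, count in counts.items()
        if count + 1 - global_min <= max_skew
    )


# Three balanced racks: any rack may take the next pod.
print(allowed_domains({"rack-a": 3, "rack-b": 3, "rack-c": 3}, max_skew=1))

# After adding an empty, correctly labeled rack-d, it becomes the only candidate.
print(allowed_domains({"rack-a": 3, "rack-b": 3, "rack-c": 3, "rack-d": 0}, max_skew=1))
```

The second call shows why a new rack needs enough free capacity: with `DoNotSchedule`, every new pod is funneled to it until the counts catch up.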
Question 3
Your company uses a 5-year refresh cycle. It is now year 4 and disk failure rates have increased from 1% to 6% annually. The CFO asks whether to extend to 7 years to save money. How do you argue against this?
Answer
Extending the hardware lifecycle to seven years introduces compounding hidden costs that negate the deferred capital expenditure. As servers age past year five, component failure rates skyrocket, particularly for mechanical or heavily written storage drives, increasing labor and emergency replacement costs. Additionally, older hardware is significantly less power-efficient than newer generations, leading to inflated electricity bills that can completely offset the price of new servers. Finally, keeping slower legacy processors limits application throughput, forcing you to run more nodes to handle the same workload volume and increasing the operational burden on the infrastructure team.
Argument against extending to 7 years:
1. Disk failure cost escalation:
- Year 4: 6% failure rate across 200 disks = 12 failures/year
- Year 5 (projected): 10% = 20 failures
- Year 6 (projected): 15% = 30 failures
- Year 7 (projected): 22% = 44 failures
- Each disk replacement: $700 including $200 labor, plus risk of data loss
- Years 6-7 disk costs: 74 failures x $700 = $51,800
2. Increasing support contract costs:
- Vendors charge 30-60% more for extended support beyond 5 years
- Some vendors refuse to support hardware past 7 years
- Parts availability decreases (end-of-life components)
3. Power efficiency gap:
- Year 4 hardware uses ~30% more power per compute unit than current generation
- Year 7: ~50% more power per compute unit
- Current servers: $438,000/year in power
- New servers at 350W equivalent performance: $306,600/year
- Power savings: $131,400/year (pays for 13 new servers)
4. Performance opportunity cost:
- Applications running on 7-year-old hardware are 2-3x slower per core
- Need 2-3x more servers to achieve the same throughput
- Hiring developers is more expensive than buying faster hardware
5. Risk:
- Cascading failures become more likely (correlated aging)
- If 5 nodes fail in the same week (common in aging batches), the cluster may not have spare capacity
- Security patches may stop being available for older firmware
Summary for the CFO: “Extending to 7 years defers capital spend, but it adds $150,000+ in disk replacements, $130,000 in excess power costs, and significant operational risk. The net savings is near zero, but the risk is substantially higher.”
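The disk and power figures in the argument can be reproduced with a short Python sketch. The failure rates, disk count, and power costs are the ones listed above; the $700 per replacement is an assumption, taken as the unit cost implied by 74 failures and the $51,800 total.

```python
DISKS = 200
COST_PER_REPLACEMENT = 700  # assumption: implied by 74 failures and $51,800

# Annual failure rate in percent, by year of the hardware's life.
FAILURE_RATE_PCT = {4: 6, 5: 10, 6: 15, 7: 22}

failures = {year: DISKS * pct // 100 for year, pct in FAILURE_RATE_PCT.items()}
extension_failures = failures[6] + failures[7]
extension_disk_cost = extension_failures * COST_PER_REPLACEMENT

# Power comparison from the list above (dollars per year).
old_power_cost = 438_000
new_power_cost = 306_600
power_savings = old_power_cost - new_power_cost

print(failures)
print(f"years 6-7: {extension_failures} failures, ${extension_disk_cost:,}")
print(f"power savings with new servers: ${power_savings:,}/year")
```

Swapping in your own failure history and electricity costs turns this into a quick sanity check before a refresh discussion.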
Question 4
You are planning a staggered refresh, replacing 33 nodes per year in a 100-node cluster. You currently have Intel Xeon Gold 6330 nodes. Next year’s refresh will use AMD EPYC 9554. What testing should you do before deploying the AMD nodes into your production cluster?
Answer
Migrating workloads between different CPU vendors introduces subtle architectural differences that can unexpectedly impact application performance or stability. Because AMD and Intel processors handle NUMA topologies, memory models, and advanced vector extensions (like AVX-512) differently, workloads heavily reliant on specific instruction sets or memory bandwidth may behave unpredictably. Comprehensive testing ensures that the container runtime, CNI plugins, and underlying storage drivers interact correctly with the new hardware architecture before entering production. Gradually rolling out the new nodes as a canary deployment allows you to observe these architectural nuances under real-world traffic patterns without risking widespread outages.
Testing plan for cross-vendor CPU migration:
Phase 1: Hardware validation (1 week)
- Boot the AMD servers, verify BIOS settings (SR-IOV, VT-x/AMD-V, NUMA, power management)
- Run hardware stress tests: `stress-ng`, `memtester`, `fio`
- Verify NIC driver compatibility (especially if using Mellanox/Broadcom with RDMA)
- Confirm container runtime works (containerd, kernel cgroup v2)
- Test storage: Ceph OSD performance, CSI driver compatibility
Phase 2: Kubernetes integration (1 week)
- Join 3 AMD nodes to a staging cluster alongside Intel nodes
- Verify kubelet starts correctly
- Test CNI (Calico/Cilium) BGP peering from AMD nodes
- Verify pod scheduling, inter-node networking (pod-to-pod across architectures)
- Run the standard networking test suite (iperf3, curl, DNS resolution)
Phase 3: Application testing (2 weeks)
- Deploy representative workloads on AMD nodes
- Compare performance metrics: latency, throughput, CPU utilization
- Test language-specific behavior:
- Java: JVM may select different JIT optimizations on AMD vs Intel
- Go: Should work identically (portable assembly)
- Python/NumPy: May use different BLAS/LAPACK optimizations
- TensorFlow: Check AVX-512 compatibility
- Run load tests comparing AMD vs Intel node behavior
Phase 4: Production canary (1 week)
- Add 3 AMD nodes to production
- Do NOT taint or otherwise isolate them from normal scheduling (still apply the standard rack and hardware-generation labels)
- Let the scheduler place normal workloads
- Monitor for 7 days: error rates, latency distributions, resource usage
- If stable, proceed with full 33-node deployment
Key risk areas:
- Memory topology differences (AMD uses a different NUMA layout)
- `numactl` and CPU pinning may need reconfiguration
- BIOS power management settings affect performance under load
- Some monitoring tools report different CPU metrics on AMD vs Intel
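One concrete check for Phase 1 or Phase 3 is comparing CPU feature flags between an existing node and a candidate node. The Python sketch below parses the `flags` line of `/proc/cpuinfo`-style text; the two sample strings are shortened and made up for illustration and do not describe any real processor.

```python
def parse_flags(cpuinfo_text):
    """Return the CPU feature flags from /proc/cpuinfo-style text (first CPU)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_new(old_text, new_text, required):
    """Flags the workload needs that the old node has but the new node lacks."""
    old_flags = parse_flags(old_text)
    new_flags = parse_flags(new_text)
    return sorted(f for f in required if f in old_flags and f not in new_flags)


# Hypothetical, shortened samples; on a real node read the text from /proc/cpuinfo.
OLD_NODE = "model name\t: example-old\nflags\t\t: fpu sse4_2 avx2 avx512f avx512_vnni\n"
NEW_NODE = "model name\t: example-new\nflags\t\t: fpu sse4_2 avx2 avx512f\n"

print(missing_on_new(OLD_NODE, NEW_NODE, required={"avx2", "avx512f", "avx512_vnni"}))
```

An empty list means every required flag present on the old node is also present on the new one. Flag parity does not guarantee equal performance, so the load tests in Phase 3 are still needed.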
Hands-On Exercise: Plan a Hardware Expansion
The Scenario
You manage a 60-node bare metal Kubernetes cluster spread across 3 racks (20 nodes each). The cluster is currently running at 65% CPU utilization. The business is forecasting a 40% growth in traffic next quarter, so you have just racked and powered on 20 new servers (a newer hardware generation) in a 4th rack. The new servers have a Passmark score 44% higher per core than the old servers.
The Objective
Design a safe capacity expansion and decommission plan that successfully integrates the new hardware, spreads workloads across all 4 racks, and safely retires 10 of the oldest nodes without exceeding an 80% cluster-wide utilization ceiling.
The Challenge
Use your understanding of Kubernetes scheduling, topology spread constraints, and normalized CPU capacity to document the necessary node labels, workload constraints, and the mathematical justification for your decommission strategy. Do not rely on naive core counts.
Tiered Hints
Hint 1: The Concept
Because the new servers are 44% faster per core, a simple sum of CPU cores will underestimate your new total capacity. You need to calculate "performance-adjusted units" to accurately predict post-expansion and post-decommission utilization.
Hint 2: The Component
To ensure high availability across the heterogeneous hardware, your workloads need `topologySpreadConstraints`. Since you are adding a 4th rack that starts empty, remember how `maxSkew: 1` behaves when a new topology domain is introduced.
Hint 3: The Command
When decommissioning the 10 oldest nodes, use `kubectl drain`.
Verification
Review your expansion plan against these checks:
- Did you define specific labels for hardware generation and performance tier?
- Did you include `matchLabelKeys: ["pod-template-hash"]` in your topology spread constraints so that a rollout rebalances pods evenly across all four racks?
- Does your decommission math prove that removing 10 old nodes will leave the cluster below 80% utilization?
Reflection
Why is it dangerous to treat all CPU cores as identical when combining servers from 2019 and 2024 in the same cluster? How would naive pod scheduling impact latency-sensitive microservices?
Next Module
This concludes the Day-2 Operations section. Return to the Operations index to review all modules, or continue to the next section in the on-premises track.
Sources
- kubernetes.io, cgroups: The upstream cgroups documentation explicitly documents the v1.35 deprecation and default kubelet behavior.
- kubernetes.io, container runtimes: The container runtime documentation covers the shared-driver requirement and the cgroup v2/systemd recommendation.
- github.com, FAQ.md: Upstream Cluster Autoscaler docs assume node groups and external provisioning/registration tooling, but they do not state this bare-metal limitation in exactly those words.
- kubernetes.io, install kubeadm: The kubeadm install guide explicitly requires these identifiers to be unique and warns that installation may fail otherwise.
- kubernetes.io, topology manager: The Topology Manager task page states that `max-allowable-numa-nodes` is GA in Kubernetes 1.35.
- kubernetes.io, volumes: The volumes documentation explains local PV node affinity and the reduced availability/data-loss risk tied to the underlying node.
- kubernetes.io, node shutdown: The upstream node shutdown docs explicitly describe the default-enabled gate and the zero-value configuration caveat.
- kubernetes.io, cluster large: These exact scalability limits are documented in the upstream large-cluster guidance.
- kubernetes.io, topology spread constraints: The topology spread documentation states the pre-1.30 gate requirement and the stable availability from 1.30 onward.