Module 7.5: Capacity Expansion & Hardware Refresh
Complexity:
[COMPLEX]| Time: 60 minutesPrerequisites: Module 7.4: Observability Without Cloud Services, Module 1.2: Server Sizing
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Plan capacity expansion that accounts for CPU generation differences, topology constraints, and scheduler behavior with heterogeneous hardware
- Implement node labeling, taints, and topology spread constraints to manage mixed-generation server pools effectively
- Design a hardware decommissioning process that respects capacity limits, PodDisruptionBudgets, and storage rebalancing
- Optimize cluster scheduling policies to distribute workloads appropriately across nodes with different performance characteristics
Why This Module Matters
Section titled “Why This Module Matters”In January 2024, a gaming company running a 120-node bare metal Kubernetes cluster needed to add 40 new servers for a game launch. Their existing cluster used Intel Xeon Silver 4214 (Cascade Lake, 2019) processors. The procurement team ordered Dell R760 servers with AMD EPYC 9354 (Genoa, 2023) processors because they offered better price/performance. The infrastructure team racked the servers, PXE-booted them, and joined them to the cluster. Everything appeared fine until the scheduler started placing workloads on the new nodes.
The Java applications ran 15% faster on the AMD nodes due to higher single-thread performance. The data science team’s TensorFlow jobs ran 40% faster because the EPYC CPUs had AVX-512 support. But here was the problem: the Kubernetes scheduler does not understand CPU performance differences. It sees “48 CPU cores available” on both the old Intel and new AMD nodes. The scheduler distributed pods evenly, but the AMD nodes could handle more work per core. Some teams manually pinned their pods to AMD nodes using nodeSelectors, creating an uneven resource distribution. Other teams complained that their pods were “slow” when scheduled on the older Intel nodes.
The real disaster came when the team tried to decommission 20 of the oldest Intel servers. They had not configured topology spread constraints, so draining those nodes pushed 300 pods onto nodes that were already at 75% utilization. The cluster hit 95% CPU utilization, OOM kills spiked, and the monitoring stack (also on the cluster) slowed to a crawl — right when they needed it most.
The lesson: adding hardware to a Kubernetes cluster is not just racking and stacking. You need to account for CPU generation differences, topology constraints, scheduling policies, and a decommission plan that respects capacity limits.
What You’ll Learn
Section titled “What You’ll Learn”- Adding new racks and nodes to existing clusters
- Managing mixed CPU generations (Intel and AMD)
- Topology spread constraints for heterogeneous hardware
- Decommissioning old nodes safely
- 3-year vs 5-year hardware refresh cycles
- Capacity planning with hardware generations
Adding New Racks to Existing Clusters
Section titled “Adding New Racks to Existing Clusters”Physical and Network Prerequisites
Section titled “Physical and Network Prerequisites”+---------------------------------------------------------------+| ADDING A NEW RACK TO AN EXISTING CLUSTER || || Before racking servers: || 1. Network: leaf switch installed, cabled to spines || 2. Power: PDUs installed, circuits provisioned || 3. VLANs: management, production, storage trunked on leaf || 4. BGP: leaf peering with spines (new AS number for rack) || 5. PXE: DHCP relay configured for new subnet || 6. DNS: reverse DNS entries for new BMC/management IPs || 7. IPAM: IP ranges allocated for nodes, pods, services || || After racking servers: || 1. BMC configured (IP, credentials, NTP) || 2. PXE boot OS image || 3. Configure networking (bonds, VLANs, routes) || 4. Install kubelet, kubeadm, container runtime || 5. Join cluster with kubeadm join || 6. Label nodes (rack, generation, hardware model) || 7. Verify CNI connectivity to existing nodes || 8. Verify CSI storage access |+---------------------------------------------------------------+Pause and predict: You are adding 40 new AMD EPYC servers to a cluster running Intel Xeon nodes. The Kubernetes scheduler sees “32 cores available” on both, but the AMD cores are 44% faster per-core. How would you prevent latency-sensitive pods from being scheduled on slower Intel nodes without hardcoding node names?
Node Provisioning Script for New Rack
Section titled “Node Provisioning Script for New Rack”This script automates the most error-prone part of rack expansion: waiting for each server to PXE boot, joining it to the cluster, and applying the correct topology labels. Labels for rack, hardware generation, and CPU model enable scheduling policies that account for heterogeneous hardware.
#!/bin/bash# provision-new-rack.sh — add a rack of servers to existing clusterset -euo pipefail
RACK_ID="$1" # e.g., rack-eNODES_FILE="$2" # hostname,bmc-ip,mgmt-ipJOIN_TOKEN="$3" # from kubeadm token createCA_CERT_HASH="$4" # from kubeadmAPI_SERVER="$5" # e.g., 10.0.10.10:6443
while IFS=, read -r HOSTNAME BMC_IP MGMT_IP; do echo "=== Provisioning ${HOSTNAME} in ${RACK_ID} ==="
# Wait for node to be PXE booted and accessible echo "Waiting for ${HOSTNAME} to be reachable via SSH..." until ssh -o ConnectTimeout=5 root@"$MGMT_IP" true 2>/dev/null; do sleep 10 done
# Configure node labels and join cluster ssh root@"$MGMT_IP" bash <<REMOTE_EOF # Join the cluster kubeadm join ${API_SERVER} \ --token ${JOIN_TOKEN} \ --discovery-token-ca-cert-hash sha256:${CA_CERT_HASH}REMOTE_EOF
# Wait for the node to register with the API server # (kubeadm join returns before the Node object is fully created) echo "Waiting for ${HOSTNAME} to register..." until kubectl get node "$HOSTNAME" &>/dev/null; do sleep 5 done kubectl wait --for=condition=Ready "node/$HOSTNAME" --timeout=120s
# Label the node from a control plane echo "Labeling ${HOSTNAME}..." kubectl label node "$HOSTNAME" \ topology.kubernetes.io/zone="${RACK_ID}" \ kubedojo.io/rack="${RACK_ID}" \ kubedojo.io/hardware-gen="gen4" \ kubedojo.io/cpu-vendor="amd" \ kubedojo.io/cpu-model="epyc-9354" \ --overwrite
echo "=== ${HOSTNAME} joined and labeled ==="done < "$NODES_FILE"
echo "All nodes in ${RACK_ID} provisioned."echo "Run: kubectl get nodes -l kubedojo.io/rack=${RACK_ID}"Mixed CPU Generations
Section titled “Mixed CPU Generations”The Problem with Heterogeneous Performance
Section titled “The Problem with Heterogeneous Performance”+---------------------------------------------------------------+| CPU PERFORMANCE ACROSS GENERATIONS || || Model Year Cores Single-Thread Passmark || ─────────────────────────────────────────────────── || Xeon Silver 4214 2019 12 1,800 15,200 || Xeon Gold 6330 2021 28 2,100 35,000 || EPYC 9354 2023 32 2,600 53,000 || || The EPYC 9354 delivers ~44% more single-thread and || ~3.5x more multi-thread performance than the 4214. || || Kubernetes sees: "32 cores available" on both. || Reality: 32 EPYC cores >> 32 Xeon Silver cores. || |+---------------------------------------------------------------+Labeling Hardware Generations
Section titled “Labeling Hardware Generations”# Label all nodes with their hardware generation# This enables scheduling policies based on performance tier
# Gen 1: 2019 hardware (Cascade Lake)kubectl label nodes -l kubedojo.io/cpu-model=xeon-4214 \ kubedojo.io/performance-tier=standard
# Gen 2: 2021 hardware (Ice Lake)kubectl label nodes -l kubedojo.io/cpu-model=xeon-6330 \ kubedojo.io/performance-tier=high
# Gen 3: 2023 hardware (Genoa)kubectl label nodes -l kubedojo.io/cpu-model=epyc-9354 \ kubedojo.io/performance-tier=premiumScheduling Policies for Mixed Hardware
Section titled “Scheduling Policies for Mixed Hardware”# Option 1: Prefer newer hardware (soft preference)apiVersion: apps/v1kind: Deploymentmetadata: name: latency-sensitive-appspec: template: spec: affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 preference: matchExpressions: - key: kubedojo.io/performance-tier operator: In values: [premium] - weight: 50 preference: matchExpressions: - key: kubedojo.io/performance-tier operator: In values: [high] containers: - name: app image: my-app:latest resources: requests: cpu: "4" memory: 8Gi---# Option 2: Require specific hardware (hard requirement)apiVersion: apps/v1kind: Deploymentmetadata: name: ml-training-jobspec: template: spec: nodeSelector: kubedojo.io/cpu-vendor: amd # Needs AVX-512 kubedojo.io/performance-tier: premium containers: - name: training image: tensorflow:latest resources: requests: cpu: "16" memory: 64GiWeighted Resource Capacity
Section titled “Weighted Resource Capacity”Kubernetes sees all CPU cores as equal, but they are not. Use benchmark scores (e.g., Passmark single-thread) to calculate “normalized” capacity. Example: decommissioning 20 older Xeon 4214 nodes from a 120-node mixed cluster removes only ~8% of effective compute capacity, not the 17% that a simple node count would suggest. Always use weighted capacity calculations when planning decommissions.
Topology Spread Constraints for Heterogeneous Hardware
Section titled “Topology Spread Constraints for Heterogeneous Hardware”When you have multiple hardware generations across multiple racks, topology spread constraints ensure workloads are distributed to survive rack failures and hardware-specific issues.
Stop and think: You have a critical service with 6 replicas spread across 3 racks. You add a 4th rack. New pods will not schedule on the 4th rack because
maxSkew: 1withDoNotSchedulecannot be satisfied. How would you rebalance pods across all 4 racks?
Multi-Dimensional Topology Spread
Section titled “Multi-Dimensional Topology Spread”apiVersion: apps/v1kind: Deploymentmetadata: name: critical-servicespec: replicas: 6 template: metadata: labels: app: critical-service spec: topologySpreadConstraints: # Spread across racks (survive rack failure) - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: app: critical-service # Spread across hardware generations (survive generation-specific bug) - maxSkew: 2 topologyKey: kubedojo.io/hardware-gen whenUnsatisfiable: ScheduleAnyway labelSelector: matchLabels: app: critical-service containers: - name: app image: critical-service:latest resources: requests: cpu: "2" memory: 4GiVisualizing Topology Distribution
Section titled “Visualizing Topology Distribution”+---------------------------------------------------------------+| TOPOLOGY SPREAD: 6 REPLICAS ACROSS 3 RACKS, 2 GENS || || Rack A (Gen 1 + Gen 3): || [worker-01 gen1] pod-1 || [worker-02 gen1] || [worker-21 gen3] pod-2 || || Rack B (Gen 1 + Gen 2): || [worker-05 gen1] pod-3 || [worker-11 gen2] pod-4 || [worker-12 gen2] || || Rack C (Gen 2 + Gen 3): || [worker-15 gen2] pod-5 || [worker-25 gen3] pod-6 || [worker-26 gen3] || || Result: 2 pods per rack (maxSkew=1 satisfied) || Gen distribution: gen1=2, gen2=2, gen3=2 (maxSkew=2 OK) || Rack failure: lose 2/6 pods = service continues || Gen-specific bug: affects 2/6 pods = service continues |+---------------------------------------------------------------+Decommissioning Old Nodes
Section titled “Decommissioning Old Nodes”Removing nodes requires careful capacity planning to avoid overloading the remaining cluster.
Pause and predict: Before decommissioning 20 nodes, you need to verify the remaining cluster can handle the load. But Kubernetes reports CPU in cores — and not all cores are equal. A 2023 AMD core delivers 44% more throughput than a 2019 Intel core. How do you calculate the true capacity impact of removing 20 Intel nodes?
Decommission Checklist
Section titled “Decommission Checklist”This script performs safety checks before removing a node: verifying remaining capacity will stay below 80%, checking for local PersistentVolumes that would be lost, and then draining and deleting the node.
#!/bin/bash# decommission-node.sh — safely remove a node from the clusterset -euo pipefail
NODE="$1"
echo "=== Pre-decommission checks for ${NODE} ==="
# Check 1: Will remaining capacity handle the load?# Normalize CPU values to millicores — K8s returns either "4" (cores) or "3900m" (millicores)TOTAL_CPU=$(kubectl get nodes -o json | jq ' [.items[].status.allocatable.cpu | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end ] | add')NODE_CPU=$(kubectl get node "$NODE" -o json | jq ' .status.allocatable.cpu | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end')REMAINING_CPU=$((TOTAL_CPU - NODE_CPU))REQUESTED_CPU=$(kubectl get pods -A -o json | jq ' [.items[].spec.containers[].resources.requests.cpu // "0" | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end ] | add')
echo "Total allocatable CPU: ${TOTAL_CPU}m"echo "This node CPU: ${NODE_CPU}m"echo "Remaining CPU after removal: ${REMAINING_CPU}m"echo "Total requested CPU: ${REQUESTED_CPU}m"echo "Utilization after removal: $((REQUESTED_CPU * 100 / REMAINING_CPU))%"
if [ $((REQUESTED_CPU * 100 / REMAINING_CPU)) -gt 80 ]; then echo "WARNING: Cluster will be at >80% CPU utilization after removing this node." echo "Consider adding capacity before decommissioning." read -p "Continue anyway? (y/N) " -n 1 -r echo [[ $REPLY =~ ^[Yy]$ ]] || exit 1fi
# Check 2: Any local PVs on this node?LOCAL_PVS=$(kubectl get pv -o json | jq -r --arg node "$NODE" ' .items[] | select( .spec.nodeAffinity.required.nodeSelectorTerms[].matchExpressions[].values[] == $node ) | .metadata.name')
if [ -n "$LOCAL_PVS" ]; then echo "WARNING: Node has local PVs that will be lost:" echo "$LOCAL_PVS" echo "Migrate data before proceeding." exit 1fi
# Check 3: Drain the nodeecho "Draining ${NODE}..."kubectl drain "$NODE" \ --ignore-daemonsets \ --delete-emptydir-data \ --timeout=600s
# Check 4: Remove from clusterecho "Removing ${NODE} from cluster..."kubectl delete node "$NODE"
# Check 5: On the node itself (via SSH or BMC):# kubeadm reset# Clean up iptables, IPVS rules, CNI config
echo "=== ${NODE} decommissioned ==="echo "Remember to:"echo " 1. Power off the server"echo " 2. Update CMDB/inventory"echo " 3. Reclaim rack space"echo " 4. Update monitoring targets"echo " 5. Update PXE/DHCP reservations"When decommissioning in batches, remove 5 nodes at a time over 1-2 day phases. Monitor utilization overnight after each batch. Never exceed 80% cluster utilization during the process. After all nodes are removed, verify no orphaned PVs remain and update monitoring targets, alerting thresholds, and spare node counts.
3-Year vs 5-Year Hardware Refresh Cycles
Section titled “3-Year vs 5-Year Hardware Refresh Cycles”Cost Comparison
Section titled “Cost Comparison”For a 100-node cluster at $10,000/node:
| Factor | 3-Year Cycle | 5-Year Cycle |
|---|---|---|
| Amortized CapEx/year | $333,333 | $200,000 |
| Support contracts | $150,000 total | $310,000 total (higher in years 4-5) |
| Power (total) | $360,000 | $648,000 (efficiency degrades ~5%/year) |
| Failure rate (end of life) | 1-2% | 5-8% |
| Performance vs current gen | 100% at refresh | 60% at end of life |
| Total cost | $1.52M ($506k/yr) | $2.04M ($407k/yr) |
The 3-year cycle is 24% more expensive per year but keeps you on the latest hardware with lower failure rates and power costs.
Choose 3-year cycles for performance-sensitive workloads, rapid growth, or when power efficiency matters. Choose 5-year cycles for budget-constrained environments with stable, predictable loads that are not CPU-bound.
Staggered Refresh Strategy
Section titled “Staggered Refresh Strategy”+---------------------------------------------------------------+| STAGGERED REFRESH (recommended for large clusters) || || Instead of replacing all 100 nodes every 3 years: || Replace ~33 nodes every year || || Year 1: Buy 33 new nodes (Gen N+3), decommission 33 oldest || Year 2: Buy 33 new nodes (Gen N+4), decommission 33 oldest || Year 3: Buy 34 new nodes (Gen N+5), decommission 34 oldest || Year 4: Buy 33 new nodes (Gen N+6), decommission 33 oldest || ... || || Benefits: || - Smooth CapEx ($333k/year instead of $1M every 3 years) || - Always have recent hardware in the fleet || - Never need to decommission more than 33% at once || - Team practices add/remove procedure regularly || - Each year you learn what works for the new hardware gen || || Challenge: || - 3 hardware generations in the cluster simultaneously || - Must handle CPU/memory heterogeneity in scheduling || - Firmware update process covers multiple vendor models |+---------------------------------------------------------------+Capacity Planning with Hardware Generations
Section titled “Capacity Planning with Hardware Generations”Monitoring Capacity Trends
Section titled “Monitoring Capacity Trends”Create Prometheus recording rules that track CPU capacity and utilization broken down by hardware generation. The most valuable metric is cluster:capacity_days_remaining, which uses deriv() over a 30-day window to project when current capacity will be exhausted at the current growth rate. Alert when this drops below 60 days to trigger procurement.
Did You Know?
Section titled “Did You Know?”-
Intel and AMD use different socket standards, so you cannot swap CPUs between vendors in the same server chassis. A hardware refresh that switches from Intel to AMD (or vice versa) requires entirely new servers, not just new CPUs. This is why vendor choice in the initial purchase has long-term implications.
-
The US Department of Energy’s supercomputing centers use a 5-year refresh cycle because their systems cost hundreds of millions of dollars. They plan each refresh 3 years in advance, with a 2-year overlap period where old and new systems run simultaneously. This staggered approach is now being adopted by large Kubernetes operators.
-
Kubernetes 1.24 added the
MinDomainsInPodTopologySpreadfeature (stable in 1.30) that lets you specify the minimum number of topology domains a workload should span. This is particularly useful during hardware refresh: you can require pods to be spread across at least 2 hardware generations, ensuring a generation-specific bug does not take down all replicas. -
A 2023 Uptime Institute report found that the average server lifespan has increased from 3.5 years in 2015 to 5.2 years in 2023, driven by companies extending server lifecycles to reduce CapEx during economic uncertainty. However, power efficiency improvements mean newer servers often pay for themselves in energy savings within 2 years.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| No node labels for hardware generation | Cannot schedule based on performance tier | Label all nodes with generation, CPU model, and tier |
| Assuming all CPU cores are equal | Uneven performance across hardware generations | Use weighted capacity calculations for planning |
| Decommissioning without capacity check | Cluster overloaded after removing nodes | Calculate post-removal utilization before draining |
| No topology spread across generations | Generation-specific bug (BIOS, kernel) affects all replicas | Use topologySpreadConstraints with hardware-gen key |
| Big-bang hardware refresh | All 100 nodes replaced at once = massive risk | Stagger refreshes: 33 nodes/year rolling |
| Ignoring power efficiency in refresh math | Old servers cost more to power | Include power costs in TCO comparison |
| Not updating monitoring after adding rack | New nodes invisible to alerting | Add new BMC addresses to IPMI exporter, update Prometheus targets |
| Mixing Intel and AMD without testing | Application-level differences (AVX, memory model) | Test workloads on new architecture in staging first |
Question 1
Section titled “Question 1”You have a 100-node cluster: 60 nodes with Intel Xeon Silver 4214 (12 cores, 2019) and 40 nodes with AMD EPYC 9354 (32 cores, 2023). You need to decommission 20 of the oldest Intel nodes. What is the actual capacity impact, and how do you validate that the cluster can handle it?
Answer
Capacity impact analysis:
Before decommission:
- Intel nodes: 60 x 12 cores = 720 cores
- AMD nodes: 40 x 32 cores = 1,280 cores
- Total: 2,000 cores
After decommission (remove 20 Intel):
- Intel nodes: 40 x 12 cores = 480 cores
- AMD nodes: 40 x 32 cores = 1,280 cores
- Total: 1,760 cores
- Reduction: 240 cores = 12% of total core count
However, performance-adjusted capacity:
- Intel 4214 passmark per core: ~1,800
- AMD 9354 passmark per core: ~2,600
- Before: (720 x 1,800) + (1,280 x 2,600) = 1,296,000 + 3,328,000 = 4,624,000 units
- After: (480 x 1,800) + (1,280 x 2,600) = 864,000 + 3,328,000 = 4,192,000 units
- Actual performance reduction: 9.3% (less than the 12% core count suggests)
Validation steps:
- Check current cluster-wide CPU utilization:
Terminal window kubectl top nodes --sort-by=cpu - Calculate requested vs allocatable:
Terminal window kubectl describe nodes | grep -A 5 "Allocated resources" - Verify no workloads are pinned to the Intel nodes being removed
- Check PDBs and topology constraints will still be satisfiable with 80 nodes
- Run the decommission in batches (5 nodes at a time) with monitoring
Question 2
Section titled “Question 2”Your cluster runs on 3 racks with 20 nodes each. You are adding a 4th rack with 20 new nodes (newer hardware generation). Your critical service has a topology spread constraint of maxSkew: 1 on topology.kubernetes.io/zone. After adding the new rack, new pods are not scheduling on the 4th rack. Why?
Answer
The topology spread constraint is preventing scheduling on the new rack.
The math:
- Existing: 3 racks, each with some pods of the critical service
- Say the service has 9 replicas: 3 per rack (skew = 0, within maxSkew=1)
- New rack-d has 0 replicas
When a new pod needs to be scheduled:
- rack-a: 3, rack-b: 3, rack-c: 3, rack-d: 0
- Minimum count: 0 (rack-d), maximum count: 3 (any existing rack)
- Skew = 3 - 0 = 3, which exceeds maxSkew=1
- Result: Pod CAN schedule on rack-d (it would reduce the skew to 3-1=2… wait, no)
Actually, the topology spread constraint evaluates where scheduling the new pod would produce the lowest skew:
- Schedule on rack-d: counts become 3,3,3,1 -> max-min = 3-1 = 2 -> skew=2 > maxSkew=1 -> rejected if
DoNotSchedule - Schedule on rack-a: counts become 4,3,3,0 -> skew=4 -> worse
The problem: With whenUnsatisfiable: DoNotSchedule, no placement satisfies maxSkew=1 because the new rack starts at 0.
Fix options:
- Use
matchLabelKeys(recommended, K8s 1.27+): AddmatchLabelKeys: ["pod-template-hash"]to the topology spread constraint. This makes the skew calculation scoped to only the current ReplicaSet, so arollout restartwill redistribute pods evenly because it only counts new pods:Then runtopologySpreadConstraints:- maxSkew: 1topologyKey: topology.kubernetes.io/zonewhenUnsatisfiable: DoNotSchedulematchLabelKeys:- pod-template-hashlabelSelector:matchLabels:app: critical-servicekubectl rollout restart deployment critical-serviceto rebalance across all 4 racks. - Temporarily relax the constraint:
maxSkew: 2 # allow wider skew during expansion
- Use
whenUnsatisfiable: ScheduleAnyway(soft constraint) - Scale up the deployment so pods can be placed on rack-d, then scale back down
Note: A plain rollout restart without matchLabelKeys will NOT fix this. The default labelSelector matches pods from both old and new ReplicaSets (they share the app: critical-service label), so the skew calculation still sees the old pod distribution and new pods cannot schedule on rack-d.
After rebalancing with matchLabelKeys: 9 replicas across 4 racks = 3,2,2,2 or 2,3,2,2 (skew=1, satisfied).
Question 3
Section titled “Question 3”Your company uses a 5-year refresh cycle. It is now year 4 and disk failure rates have increased from 1% to 6% annually. The CFO asks whether to extend to 7 years to save money. How do you argue against this?
Answer
Argument against extending to 7 years:
1. Disk failure cost escalation:
- Year 4: 6% failure rate across 200 disks = 12 failures/year
- Year 5 (projected): 10% = 20 failures
- Year 6 (projected): 15% = 30 failures
- Year 7 (projected): 22% = 44 failures
- Each disk replacement: $500 (disk) + $200 (labor) + risk of data loss
- Years 6-7 disk costs: 74 failures x $700 = $51,800
2. Increasing support contract costs:
- Vendors charge 30-60% more for extended support beyond 5 years
- Some vendors refuse to support hardware past 7 years
- Parts availability decreases (end-of-life components)
3. Power efficiency gap:
- Year 4 hardware uses ~30% more power per compute unit than current generation
- Year 7: ~50% more power per compute unit
- At $0.10/kWh with 100 servers at 500W: $438,000/year
- New servers at 350W equivalent performance: $306,600/year
- Power savings: $131,400/year (pays for 13 new servers)
4. Performance opportunity cost:
- Applications running on 7-year-old hardware are 2-3x slower per core
- Need 2-3x more servers to achieve the same throughput
- Hiring developers is more expensive than buying faster hardware
5. Risk:
- Cascading failures become more likely (correlated aging)
- If 5 nodes fail in the same week (common in aging batches), the cluster may not have spare capacity
- Security patches may stop being available for older firmware
Summary for the CFO: “Extending to 7 years saves $200,000 in deferred CapEx but adds $150,000+ in disk replacements, $130,000 in excess power costs, and significant operational risk. The net savings is near zero, but the risk is substantially higher.”
Question 4
Section titled “Question 4”You are planning a staggered refresh, replacing 33 nodes per year in a 100-node cluster. You currently have Intel Xeon Gold 6330 nodes. Next year’s refresh will use AMD EPYC 9554. What testing should you do before deploying the AMD nodes into your production cluster?
Answer
Testing plan for cross-vendor CPU migration:
Phase 1: Hardware validation (1 week)
- Boot the AMD servers, verify BIOS settings (SR-IOV, VT-x/AMD-V, NUMA, power management)
- Run hardware stress tests:
stress-ng,memtester,fio - Verify NIC driver compatibility (especially if using Mellanox/Broadcom with RDMA)
- Confirm container runtime works (containerd, kernel cgroup v2)
- Test storage: Ceph OSD performance, CSI driver compatibility
Phase 2: Kubernetes integration (1 week)
- Join 3 AMD nodes to a staging cluster alongside Intel nodes
- Verify kubelet starts correctly
- Test CNI (Calico/Cilium) BGP peering from AMD nodes
- Verify pod scheduling, inter-node networking (pod-to-pod across architectures)
- Run the standard networking test suite (iperf3, curl, DNS resolution)
Phase 3: Application testing (2 weeks)
- Deploy representative workloads on AMD nodes
- Compare performance metrics: latency, throughput, CPU utilization
- Test language-specific behavior:
- Java: JVM may select different JIT optimizations on AMD vs Intel
- Go: Should work identically (portable assembly)
- Python/NumPy: May use different BLAS/LAPACK optimizations
- TensorFlow: Check AVX-512 compatibility
- Run load tests comparing AMD vs Intel node behavior
Phase 4: Production canary (1 week)
- Add 3 AMD nodes to production
- Do NOT label them differently from production nodes
- Let the scheduler place normal workloads
- Monitor for 7 days: error rates, latency distributions, resource usage
- If stable, proceed with full 33-node deployment
Key risk areas:
- Memory model differences (AMD uses a different NUMA topology)
numactland CPU pinning may need reconfiguration- BIOS power management settings affect performance under load
- Some monitoring tools report different CPU metrics on AMD vs Intel
Hands-On Exercise: Plan a Hardware Expansion
Section titled “Hands-On Exercise: Plan a Hardware Expansion”Task: Design a capacity expansion plan for adding a new rack of servers to an existing cluster.
Scenario
Section titled “Scenario”You have a 60-node cluster across 3 racks (20 nodes each) running at 65% CPU utilization. You need to add 20 new nodes in a 4th rack to support a 40% growth forecast.
-
Document current state:
Terminal window kubectl get nodes -o custom-columns=\NAME:.metadata.name,\CPU:.status.allocatable.cpu,\MEM:.status.allocatable.memory,\RACK:.metadata.labels.kubedojo\\.io/rack,\GEN:.metadata.labels.kubedojo\\.io/hardware-gen -
Calculate post-expansion capacity:
- Current: 60 nodes x 32 cores = 1,920 cores at 65% = 1,248 cores used
- After: 80 nodes x average cores = ? cores
- Target utilization: < 60% (headroom for spikes)
-
Design the node labeling scheme for the new rack (rack, generation, performance tier)
-
Write topology spread constraints that span all 4 racks
-
Create a decommission plan for 10 oldest nodes (to happen 3 months after new rack is stable)
Success Criteria
Section titled “Success Criteria”- Calculated current and projected utilization accurately
- Defined label taxonomy for mixed hardware
- Topology spread constraints cover rack and generation dimensions
- Decommission plan respects 80% utilization ceiling
- Monitoring plan includes new rack in all dashboards and alerts
Next Module
Section titled “Next Module”This concludes the Day-2 Operations section. Return to the Operations index to review all modules, or continue to the next section in the on-premises track.