
Module 7.5: Capacity Expansion & Hardware Refresh


Complexity: [COMPLEX] | Time: 60 minutes

Prerequisites: Module 7.4: Observability Without Cloud Services, Module 1.2: Server Sizing


After completing this module, you will be able to:

  1. Plan capacity expansion that accounts for CPU generation differences, topology constraints, and scheduler behavior with heterogeneous hardware
  2. Implement node labeling, taints, and topology spread constraints to manage mixed-generation server pools effectively
  3. Design a hardware decommissioning process that respects capacity limits, PodDisruptionBudgets, and storage rebalancing
  4. Optimize cluster scheduling policies to distribute workloads appropriately across nodes with different performance characteristics

When a team expands a large bare-metal Kubernetes cluster with a newer hardware generation, scheduling behavior can change in ways that are not obvious during initial bring-up.

Workloads can perform differently across CPU generations, while the default Kubernetes scheduler still reasons about requested CPU as a quantity rather than benchmarked per-core performance. If teams respond by manually pinning workloads to newer nodes, overall cluster utilization can become uneven.

Decommissioning older nodes without spread constraints and spare capacity can overload the remaining cluster, trigger evictions or OOM-related instability, and even degrade the monitoring systems you need during the change.

The lesson: adding hardware to a Kubernetes cluster is not just racking and stacking. You need to account for CPU generation differences, topology constraints, scheduling policies, and a decommission plan that respects capacity limits.


  • Adding new racks and nodes to existing clusters
  • Managing mixed CPU generations (Intel and AMD)
  • Topology spread constraints for heterogeneous hardware
  • Decommissioning old nodes safely
  • 3-year vs 5-year hardware refresh cycles
  • Capacity planning with hardware generations

flowchart TD
subgraph Before ["Before Racking Servers"]
direction TB
B1["Network: leaf switch installed, cabled to spines"]
B2["Power: PDUs installed, circuits provisioned"]
B3["VLANs: management, production, storage trunked on leaf"]
B4["BGP: leaf peering with spines (new AS number for rack)"]
B5["PXE: DHCP relay configured for new subnet"]
B6["DNS: reverse DNS entries for new BMC/management IPs"]
B7["IPAM: IP ranges allocated for nodes, pods, services"]
B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7
end
subgraph After ["After Racking Servers"]
direction TB
A1["BMC configured (IP, credentials, NTP)"]
A2["PXE boot OS image"]
A3["Configure networking (bonds, VLANs, routes)"]
A4["Install kubelet, kubeadm, container runtime (cgroup v2)"]
A5["Join cluster with kubeadm join"]
A6["Label nodes (rack, generation, hardware model)"]
A7["Verify CNI connectivity to existing nodes"]
A8["Verify CSI storage access"]
A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7 --> A8
end
Before --> After

Stop and think: If you provision a new rack of older OS images (which default to cgroup v1) and try to join them to a Kubernetes 1.35+ cluster, what will happen? By default, the kubelet will refuse to start because cgroup v1 is officially deprecated. Both the kubelet and your container runtime must strictly use cgroup v2 with the systemd cgroup driver to successfully register the node.
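
A quick pre-join check can catch a cgroup v1 image before kubeadm ever runs. The sketch below is a minimal illustration, not part of the module's tooling: the function name `cgroup_version` is hypothetical, and it relies on the fact that a unified cgroup v2 hierarchy exposes a `cgroup.controllers` file at its root.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a unified cgroup v2 hierarchy, else 1."""
    # cgroup v2 exposes a single cgroup.controllers file at the hierarchy root;
    # cgroup v1 mounts one directory per controller and has no such file.
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


# Report what the local machine uses (prints "cgroup v1" on non-v2 hosts).
print(f"cgroup v{cgroup_version()}")
```

Running this from the provisioning pipeline, before `kubeadm join`, turns a confusing kubelet startup failure into an explicit image problem.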

Pause and predict: You are adding 40 new AMD EPYC servers to a cluster running Intel Xeon nodes. The Kubernetes scheduler sees “32 cores available” on both, but the AMD cores are 44% faster per-core. How would you prevent latency-sensitive pods from being scheduled on slower Intel nodes without hardcoding node names?

The upstream Kubernetes Cluster Autoscaler relies on cloud provider node group APIs, so it cannot provision on-premises bare-metal nodes by itself. Bare-metal capacity expansion therefore requires manual or scripted provisioning. Before running any automation, ensure that each bare-metal server has a unique hostname, MAC address, and product_uuid, because kubeadm installation can fail when any of these are duplicated.
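
The uniqueness requirement is easy to check against an inventory before any server boots. The following is a minimal sketch under stated assumptions: the inventory layout and the `find_duplicates` helper are hypothetical, and the `product_uuid` values are assumed to have been collected from `/sys/class/dmi/id/product_uuid` on each server.

```python
from collections import Counter


def find_duplicates(inventory, fields=("hostname", "mac", "product_uuid")):
    """Return {field: [values that appear more than once]} for the inventory rows."""
    duplicates = {}
    for field in fields:
        # Compare case-insensitively: MACs and UUIDs are often reported in mixed case.
        counts = Counter(row[field].strip().lower() for row in inventory)
        repeated = sorted(value for value, seen in counts.items() if seen > 1)
        if repeated:
            duplicates[field] = repeated
    return duplicates


# Hypothetical two-node inventory where a cloned image reused the same product_uuid.
inventory = [
    {"hostname": "worker-41", "mac": "aa:bb:cc:00:00:01", "product_uuid": "UUID-A"},
    {"hostname": "worker-42", "mac": "aa:bb:cc:00:00:02", "product_uuid": "UUID-A"},
]
print(find_duplicates(inventory))
```

An empty result means the rack is safe to hand to the provisioning script; anything else should block the run.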

This script automates the most error-prone part of rack expansion: waiting for each server to PXE boot, joining it to the cluster, and applying the correct topology labels. Labels for rack, hardware generation, and CPU model enable scheduling policies that account for heterogeneous hardware.

#!/bin/bash
# provision-new-rack.sh: add a rack of servers to existing cluster
set -euo pipefail

RACK_ID="$1"        # e.g., rack-e
NODES_FILE="$2"     # CSV: hostname,bmc-ip,mgmt-ip
JOIN_TOKEN="$3"     # from kubeadm token create (default TTL is 24h)
CA_CERT_HASH="$4"   # from kubeadm (hex digest, without the sha256: prefix)
API_SERVER="$5"     # e.g., 10.0.10.10:6443

# NODE_NAME is used instead of HOSTNAME, which bash reserves for the local host.
while IFS=, read -r NODE_NAME BMC_IP MGMT_IP; do
  echo "=== Provisioning ${NODE_NAME} in ${RACK_ID} ==="

  # Wait for node to be PXE booted and accessible.
  # -n stops ssh from swallowing the rest of NODES_FILE from the loop's stdin.
  echo "Waiting for ${NODE_NAME} to be reachable via SSH..."
  until ssh -n -o ConnectTimeout=5 root@"$MGMT_IP" true 2>/dev/null; do
    sleep 10
  done

  # Join the cluster (variables expand locally before the commands are sent)
  ssh root@"$MGMT_IP" bash <<REMOTE_EOF
kubeadm join ${API_SERVER} \
  --token ${JOIN_TOKEN} \
  --discovery-token-ca-cert-hash sha256:${CA_CERT_HASH}
REMOTE_EOF

  # Wait for the node to register with the API server
  # (kubeadm join returns before the Node object is fully created)
  echo "Waiting for ${NODE_NAME} to register..."
  until kubectl get node "$NODE_NAME" &>/dev/null; do
    sleep 5
  done
  kubectl wait --for=condition=Ready "node/$NODE_NAME" --timeout=120s

  # Label the node from a control plane.
  # Generation, vendor, and model are hardcoded for this EPYC 9354 rack; adjust per rack.
  echo "Labeling ${NODE_NAME}..."
  kubectl label node "$NODE_NAME" \
    topology.kubernetes.io/zone="${RACK_ID}" \
    kubedojo.io/rack="${RACK_ID}" \
    kubedojo.io/hardware-gen="gen3" \
    kubedojo.io/cpu-vendor="amd" \
    kubedojo.io/cpu-model="epyc-9354" \
    --overwrite

  echo "=== ${NODE_NAME} joined and labeled ==="
done < "$NODES_FILE"

echo "All nodes in ${RACK_ID} provisioned."
echo "Run: kubectl get nodes -l kubedojo.io/rack=${RACK_ID}"

The Problem with Heterogeneous Performance


When mixing CPU generations, you must account for varying hardware capabilities. In Kubernetes 1.35, advanced features like the Topology Manager’s max-allowable-numa-nodes reached General Availability (GA), giving you granular control over workload placement on modern multi-socket AMD and Intel systems. However, even with advanced topology management, the primary challenge remains: raw performance differences across generations.

| Model | Year | Cores | Single-Thread | Passmark |
| --- | --- | --- | --- | --- |
| Xeon Silver 4214 | 2019 | 12 | 1,800 | 15,200 |
| Xeon Gold 6330 | 2021 | 28 | 2,100 | 35,000 |
| EPYC 9354 | 2023 | 32 | 2,600 | 53,000 |

The EPYC 9354 substantially outperforms the 4214 in both single-threaded and multithreaded benchmark data, but the exact percentage depends on the benchmark source and snapshot date.

Kubernetes treats a requested core identically on every node: cpu: "4" means four cores whether they are EPYC or Xeon Silver. In reality, 4 EPYC 9354 cores deliver far more work than 4 Xeon Silver 4214 cores.

# Label all nodes with their hardware generation
# This enables scheduling policies based on performance tier

# Gen 1: 2019 hardware (Cascade Lake)
kubectl label nodes -l kubedojo.io/cpu-model=xeon-4214 \
  kubedojo.io/performance-tier=standard

# Gen 2: 2021 hardware (Ice Lake)
kubectl label nodes -l kubedojo.io/cpu-model=xeon-6330 \
  kubedojo.io/performance-tier=high

# Gen 3: 2023 hardware (Genoa)
kubectl label nodes -l kubedojo.io/cpu-model=epyc-9354 \
  kubedojo.io/performance-tier=premium

# Option 1: Prefer newer hardware (soft preference)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive-app
spec:
  selector:                 # required by apps/v1; must match the template labels
    matchLabels:
      app: latency-sensitive-app
  template:
    metadata:
      labels:
        app: latency-sensitive-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubedojo.io/performance-tier
                operator: In
                values: [premium]
          - weight: 50
            preference:
              matchExpressions:
              - key: kubedojo.io/performance-tier
                operator: In
                values: [high]
      containers:
      - name: app
        image: my-app:latest
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
---
# Option 2: Require specific hardware (hard requirement)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training-job
spec:
  selector:                 # required by apps/v1; must match the template labels
    matchLabels:
      app: ml-training-job
  template:
    metadata:
      labels:
        app: ml-training-job
    spec:
      nodeSelector:
        kubedojo.io/cpu-vendor: amd   # Pin to the EPYC pool (Genoa supports AVX-512)
        kubedojo.io/performance-tier: premium
      containers:
      - name: training
        image: tensorflow:latest
        resources:
          requests:
            cpu: "16"
            memory: 64Gi

Kubernetes sees all CPU cores as equal, but they are not. Use benchmark data to calculate normalized capacity, because removing older nodes can reduce effective compute by much less than raw node counts suggest. Always use weighted capacity calculations when planning decommissions.
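
As a rough illustration of weighted capacity, the sketch below compares the raw core reduction with the benchmark-weighted reduction when old nodes are removed. The fleet sizes are hypothetical, the per-core scores are the single-thread figures from the table above, and the `NodePool` type and helper names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodePool:
    name: str
    nodes: int
    cores_per_node: int
    score_per_core: float  # benchmark score per core (single-thread column)


def raw_cores(pools):
    """Total core count, which is all the scheduler reasons about."""
    return sum(p.nodes * p.cores_per_node for p in pools)


def weighted_capacity(pools):
    """Core count scaled by per-core benchmark score."""
    return sum(p.nodes * p.cores_per_node * p.score_per_core for p in pools)


def reduction_pct(before, after):
    return 100 * (before - after) / before


# Hypothetical fleet: remove 10 of 30 Xeon Silver 4214 nodes, keep 10 EPYC 9354 nodes.
epyc = NodePool("epyc-9354", nodes=10, cores_per_node=32, score_per_core=2600)
before = [NodePool("xeon-4214", 30, 12, 1800), epyc]
after = [NodePool("xeon-4214", 20, 12, 1800), epyc]

raw = reduction_pct(raw_cores(before), raw_cores(after))
weighted = reduction_pct(weighted_capacity(before), weighted_capacity(after))
print(f"raw core reduction: {raw:.1f}%")
print(f"weighted reduction: {weighted:.1f}%")
```

The weighted figure is the one to compare against your utilization ceiling, because it reflects the work the remaining nodes can actually absorb.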


Topology Spread Constraints for Heterogeneous Hardware


When you have multiple hardware generations across multiple racks, topology spread constraints ensure workloads are distributed to survive rack failures and hardware-specific issues.

Stop and think: You have a critical service with 6 replicas spread across 3 racks, and you add a 4th rack. The scheduler never moves running pods, so the new rack stays empty, and with maxSkew: 1 and DoNotSchedule any newly created replica can only land on the new rack. How would you rebalance the existing pods across all 4 racks?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-service
spec:
  replicas: 6
  selector:                 # required by apps/v1; must match the template labels
    matchLabels:
      app: critical-service
  template:
    metadata:
      labels:
        app: critical-service
    spec:
      topologySpreadConstraints:
      # Spread across racks (survive rack failure)
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: critical-service
      # Spread across hardware generations (survive generation-specific bug)
      - maxSkew: 2
        topologyKey: kubedojo.io/hardware-gen
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: critical-service
      containers:
      - name: app
        image: critical-service:latest
        resources:
          requests:
            cpu: "2"
            memory: 4Gi

flowchart TD
subgraph RackA ["Rack A (Gen 1 + Gen 3)"]
direction TB
A1["[worker-01 gen1] pod-1"]
A2["[worker-02 gen1] (empty)"]
A3["[worker-21 gen3] pod-2"]
end
subgraph RackB ["Rack B (Gen 1 + Gen 2)"]
direction TB
B1["[worker-05 gen1] pod-3"]
B2["[worker-11 gen2] pod-4"]
B3["[worker-12 gen2] (empty)"]
end
subgraph RackC ["Rack C (Gen 2 + Gen 3)"]
direction TB
C1["[worker-15 gen2] pod-5"]
C2["[worker-25 gen3] pod-6"]
C3["[worker-26 gen3] (empty)"]
end
RackA ~~~ RackB ~~~ RackC

Result:

  • 2 pods per rack (maxSkew=1 satisfied)
  • Gen distribution: gen1=2, gen2=2, gen3=2 (maxSkew=2 OK)
  • Rack failure: lose 2/6 pods = service continues
  • Gen-specific bug: affects 2/6 pods = service continues
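
To see how a hard spread constraint decides where one more replica may go, the skew rule can be sketched in a few lines. This is a simplified model of whenUnsatisfiable: DoNotSchedule, not the scheduler's implementation: it ignores taints, node affinity, resource fit, and minDomains, and the `eligible_domains` helper is a hypothetical name.

```python
def eligible_domains(counts, max_skew):
    """Domains where placing one more matching pod keeps skew within max_skew.

    Skew for a candidate domain is (pods already there + the new pod)
    minus the smallest pod count across all domains.
    """
    global_min = min(counts.values())
    return sorted(d for d, n in counts.items() if (n + 1) - global_min <= max_skew)


balanced = {"rack-a": 2, "rack-b": 2, "rack-c": 2}
print(eligible_domains(balanced, max_skew=1))

# A newly labeled, empty rack becomes the global minimum.
with_new_rack = {**balanced, "rack-d": 0}
print(eligible_domains(with_new_rack, max_skew=1))
```

In the balanced case every rack qualifies. Once the empty rack joins, only that rack qualifies, which is why new replicas pile onto it (or stay Pending if it cannot accept them) while running pods stay where they are.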


Removing nodes requires careful capacity planning to avoid overloading the remaining cluster.

Pause and predict: Before decommissioning 20 nodes, you need to verify the remaining cluster can handle the load. But Kubernetes reports CPU in cores — and not all cores are equal. A 2023 AMD core delivers 44% more throughput than a 2019 Intel core. How do you calculate the true capacity impact of removing 20 Intel nodes?

This script performs safety checks before removing a node: verifying remaining capacity will stay below 80%, checking for local PersistentVolumes that would be lost, and then draining and deleting the node.

#!/bin/bash
# decommission-node.sh: safely remove a node from the cluster
set -euo pipefail

NODE="$1"

# jq helper: normalize a CPU quantity to millicores.
# K8s returns either "4" (cores) or "3900m" (millicores).
MILLI='def milli: if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end | round;'

echo "=== Pre-decommission checks for ${NODE} ==="

# Check 1: Will remaining capacity handle the load?
TOTAL_CPU=$(kubectl get nodes -o json |
  jq "$MILLI"' [.items[].status.allocatable.cpu | milli] | add')
NODE_CPU=$(kubectl get node "$NODE" -o json |
  jq "$MILLI"' .status.allocatable.cpu | milli')
REMAINING_CPU=$((TOTAL_CPU - NODE_CPU))
REQUESTED_CPU=$(kubectl get pods -A -o json |
  jq "$MILLI"' [.items[].spec.containers[].resources.requests.cpu // "0" | milli] | add // 0')

if [ "$REMAINING_CPU" -le 0 ]; then
  echo "ERROR: no allocatable CPU would remain after removing ${NODE}."
  exit 1
fi
UTILIZATION=$((REQUESTED_CPU * 100 / REMAINING_CPU))

echo "Total allocatable CPU: ${TOTAL_CPU}m"
echo "This node CPU: ${NODE_CPU}m"
echo "Remaining CPU after removal: ${REMAINING_CPU}m"
echo "Total requested CPU: ${REQUESTED_CPU}m"
echo "Utilization after removal: ${UTILIZATION}%"

if [ "$UTILIZATION" -gt 80 ]; then
  echo "WARNING: Cluster will be at >80% CPU utilization after removing this node."
  echo "Consider adding capacity before decommissioning."
  read -p "Continue anyway? (y/N) " -n 1 -r
  echo
  [[ $REPLY =~ ^[Yy]$ ]] || exit 1
fi

# Check 2: Any local PVs on this node?
# The []? operators skip PVs that have no nodeAffinity instead of making jq fail.
LOCAL_PVS=$(kubectl get pv -o json | jq -r --arg node "$NODE" '
  .items[]
  | select(any(.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[]?.values[]?; . == $node))
  | .metadata.name')
if [ -n "$LOCAL_PVS" ]; then
  echo "WARNING: Node has local PVs that will be lost:"
  echo "$LOCAL_PVS"
  echo "Migrate data before proceeding."
  exit 1
fi

# Step 3: Drain the node
echo "Draining ${NODE}..."
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=600s

# Step 4: Remove from cluster
echo "Removing ${NODE} from cluster..."
kubectl delete node "$NODE"

# Step 5: On the node itself (via SSH or BMC):
#   kubeadm reset
#   Clean up iptables, IPVS rules, CNI config

echo "=== ${NODE} decommissioned ==="
echo "Remember to:"
echo " 1. Power off the server"
echo " 2. Update CMDB/inventory"
echo " 3. Reclaim rack space"
echo " 4. Update monitoring targets"
echo " 5. Update PXE/DHCP reservations"

When physically powering down decommissioned nodes, do not simply turn them off. While Graceful Node Shutdown is enabled by default in Kubernetes, it is not actually activated unless you have explicitly configured shutdownGracePeriod to a non-zero value in your kubelet configuration. Always use the kubectl drain process to safely evict workloads.

When decommissioning in batches, remove 5 nodes at a time over 1-2 day phases. Monitor utilization overnight after each batch. Never exceed 80% cluster utilization during the process. After all nodes are removed, verify no orphaned PVs remain and update monitoring targets, alerting thresholds, and spare node counts.
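
The batch rule above can be sketched as a small planning helper that stops before any batch would push the cluster past the ceiling. This is a minimal sketch with hypothetical inputs: `plan_batches` is an illustrative name, all removed nodes are assumed to be the same size, and the figures are raw millicores (substitute weighted units for mixed hardware).

```python
def plan_batches(requested_mcpu, allocatable_mcpu, node_mcpu, nodes_to_remove,
                 batch_size=5, ceiling_pct=80.0):
    """Return [(nodes_removed_so_far, utilization_pct)] for each batch that
    keeps requested/allocatable at or under the ceiling."""
    plan = []
    removed = 0
    while removed < nodes_to_remove:
        step = min(batch_size, nodes_to_remove - removed)
        remaining = allocatable_mcpu - (removed + step) * node_mcpu
        if remaining <= 0:
            break  # nothing left to run on
        utilization = 100 * requested_mcpu / remaining
        if utilization > ceiling_pct:
            break  # this batch would breach the ceiling: stop and add capacity
        removed += step
        plan.append((removed, round(utilization, 1)))
    return plan


# Hypothetical cluster: 100 nodes x 32,000m allocatable, 2,000,000m requested.
for removed, pct in plan_batches(2_000_000, 3_200_000, 32_000, nodes_to_remove=30):
    print(f"after {removed} nodes removed: {pct}% requested")
```

With these inputs the plan stops after 20 nodes, because removing a fifth batch would exceed 80%. That is the point at which procurement, not draining, is the next step.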


For a 100-node cluster, refresh-cycle cost depends heavily on hardware pricing, support terms, energy costs, and workload requirements:

| Factor | 3-Year Cycle | 5-Year Cycle |
| --- | --- | --- |
| Amortized CapEx/year | Higher annualized spend with faster refresh | Lower annualized spend with slower refresh |
| Support contracts | Typically lower over a shorter lifecycle | Typically higher as hardware ages |
| Power (total) | Typically lower with newer, more efficient nodes | Typically higher when older nodes stay in service longer |
| Failure rate (end of life) | Typically lower | Typically higher |
| Performance vs current gen | Closer to current-generation performance | Further behind current-generation performance |
| Total cost | Depends on your hardware, power, support, and failure assumptions | Depends on your hardware, power, support, and failure assumptions |

A shorter refresh cycle usually increases annualized capital spend, while a longer cycle can increase operational risk, support burden, and power costs.

Choose 3-year cycles for performance-sensitive workloads, rapid growth, or when power efficiency matters. Choose 5-year cycles for budget-constrained environments with stable, predictable loads that are not CPU-bound.

timeline
title Staggered Refresh (33 nodes/year rolling)
Year 1 : Buy 33 new nodes (Gen N+3) : Decommission 33 oldest
Year 2 : Buy 33 new nodes (Gen N+4) : Decommission 33 oldest
Year 3 : Buy 34 new nodes (Gen N+5) : Decommission 34 oldest
Year 4 : Buy 33 new nodes (Gen N+6) : Decommission 33 oldest

Benefits:

  • Smooth CapEx ($333k/year instead of $1M every 3 years)
  • Always have recent hardware in the fleet
  • Never need to decommission more than about a third of the fleet at once
  • Team practices add/remove procedure regularly
  • Each year you learn what works for the new hardware gen

Challenges:

  • 3 hardware generations in the cluster simultaneously
  • Must handle CPU/memory heterogeneity in scheduling
  • Firmware update process covers multiple vendor models

Capacity Planning with Hardware Generations


When forecasting long-term growth across multiple hardware refresh cycles, remember that Kubernetes v1.35 has official large-cluster tested limits: a maximum of 5,000 nodes, 110 pods per node, 150,000 total pods, and 300,000 total containers. Your capacity plans must ensure all four constraints are met simultaneously.
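
Because all four limits must hold at once, a forecast is easiest to validate with a single check. The sketch below is illustrative only: `limit_violations` is a hypothetical helper, and the pods-per-node test uses the cluster-wide average, so a real check still needs per-node placement data.

```python
LIMITS = {"nodes": 5_000, "pods_per_node": 110, "pods": 150_000, "containers": 300_000}


def limit_violations(nodes, pods, containers):
    """Return the names of the tested limits that a forecast would exceed."""
    violations = []
    if nodes > LIMITS["nodes"]:
        violations.append("nodes")
    if pods > LIMITS["pods"]:
        violations.append("pods")
    if containers > LIMITS["containers"]:
        violations.append("containers")
    # Average density only: individual nodes can exceed 110 even when the mean does not.
    if nodes and pods / nodes > LIMITS["pods_per_node"]:
        violations.append("pods_per_node")
    return violations


# Hypothetical forecast: fewer, denser nodes after a refresh.
print(limit_violations(nodes=1_200, pods=140_000, containers=310_000))
```

A refresh that consolidates onto fewer, larger servers tends to hit the per-node and container limits long before the node limit.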

Create Prometheus recording rules that track CPU capacity and utilization broken down by hardware generation. The most valuable metric is cluster:capacity_days_remaining, which uses deriv() over a 30-day window to project when current capacity will be exhausted at the current growth rate. Alert when this drops below 60 days to trigger procurement.
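
The projection behind a days-remaining metric is a linear fit. Prometheus's deriv() estimates the slope with simple linear regression; the sketch below reproduces that arithmetic in Python for evenly spaced daily samples. It is not a recording rule, and the function names and sample data are hypothetical.

```python
def slope_per_day(samples):
    """Least-squares slope of evenly spaced daily samples (units per day)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def capacity_days_remaining(samples, capacity):
    """Days until usage reaches capacity at the fitted growth rate."""
    growth = slope_per_day(samples)
    if growth <= 0:
        return float("inf")  # flat or shrinking usage never exhausts capacity
    return (capacity - samples[-1]) / growth


# Hypothetical 30 days of requested cores, growing by 4 cores per day.
usage = [1000 + 4 * day for day in range(30)]
days = capacity_days_remaining(usage, capacity=1400)
print(f"{days:.0f} days remaining; procurement alert: {days < 60}")
```

Fitting over a full 30-day window smooths out weekly traffic cycles, which is why a short window makes this kind of alert noisy.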


  • Kubernetes 1.35 is the last release to support the containerd 1.x series. If your hardware refresh involves reinstalling the operating system and container runtime on new servers, plan to use containerd 2.x or another CRI-conformant runtime before upgrading beyond 1.35, because newer kubelets continue tightening runtime compatibility.

  • Switching CPU vendors usually means replacing the server platform, not just the processor, because server sockets and platform compatibility differ by vendor and generation. This is why vendor choice in the initial purchase has long-term implications.

  • Large HPC operators often plan refreshes years in advance and may run old and new systems in parallel during transitions. Similar overlap planning can help large Kubernetes operators reduce migration risk during hardware refreshes.

  • Kubernetes 1.35 graduated In-place Pod Resize to General Availability (GA). This lets you change CPU and memory requests and limits for running containers without recreating the Pod, which can reduce disruption when right-sizing long-running workloads after they land on faster or slower hardware.

  • Kubernetes 1.24 added the MinDomainsInPodTopologySpread feature (stable in 1.30), which lets you specify the minimum number of topology domains a workload should span through the minDomains field. This is particularly useful during hardware refresh: you can require pods to be spread across at least 2 hardware generations, ensuring a generation-specific bug does not take down all replicas. Note that minDomains only takes effect when whenUnsatisfiable is DoNotSchedule.

  • Recent industry reporting suggests that many operators are extending server lifecycles beyond the traditional three-year window. Even so, newer hardware can still offer meaningful efficiency gains, so refresh timing should be based on measured total cost of ownership rather than purchase price alone.


| Mistake | Problem | Solution |
| --- | --- | --- |
| No node labels for hardware generation | Cannot schedule based on performance tier | Label all nodes with generation, CPU model, and tier |
| Assuming all CPU cores are equal | Uneven performance across hardware generations | Use weighted capacity calculations for planning |
| Decommissioning without capacity check | Cluster overloaded after removing nodes | Calculate post-removal utilization before draining |
| No topology spread across generations | Generation-specific bug (BIOS, kernel) affects all replicas | Use topologySpreadConstraints with hardware-gen key |
| Big-bang hardware refresh | All 100 nodes replaced at once = massive risk | Stagger refreshes: 33 nodes/year rolling |
| Ignoring power efficiency in refresh math | Old servers cost more to power | Include power costs in TCO comparison |
| Not updating monitoring after adding rack | New nodes invisible to alerting | Add new BMC addresses to IPMI exporter, update Prometheus targets |
| Mixing Intel and AMD without testing | Application-level differences (AVX, memory model) | Test workloads on new architecture in staging first |

You have a 100-node cluster: 60 nodes with Intel Xeon Silver 4214 (12 cores, 2019) and 40 nodes with AMD EPYC 9354 (32 cores, 2023). You need to decommission 20 of the oldest Intel nodes. What is the actual capacity impact, and how do you validate that the cluster can handle it?

Answer

You must normalize the CPU capacity using performance benchmarks because Kubernetes scheduling is naive and treats all CPU millicores as identical. By calculating the weighted capacity, you reveal that removing 20 older nodes only impacts overall performance by 9.3%, rather than the 12% that raw core counts suggest. This prevents you from over-provisioning replacement hardware or accidentally starving workloads during the decommission phase. Validating the cluster can handle it involves checking the actual allocated resources against this newly calculated baseline, ensuring you stay below the 80% safety threshold.

Capacity impact analysis:

Before decommission:

  • Intel nodes: 60 x 12 cores = 720 cores
  • AMD nodes: 40 x 32 cores = 1,280 cores
  • Total: 2,000 cores

After decommission (remove 20 Intel):

  • Intel nodes: 40 x 12 cores = 480 cores
  • AMD nodes: 40 x 32 cores = 1,280 cores
  • Total: 1,760 cores
  • Reduction: 240 cores = 12% of total core count

However, performance-adjusted capacity:

  • Intel 4214 passmark per core: ~1,800
  • AMD 9354 passmark per core: ~2,600
  • Before: (720 x 1,800) + (1,280 x 2,600) = 1,296,000 + 3,328,000 = 4,624,000 units
  • After: (480 x 1,800) + (1,280 x 2,600) = 864,000 + 3,328,000 = 4,192,000 units
  • Actual performance reduction: 9.3% (less than the 12% core count suggests)

Validation steps:

  1. Check current cluster-wide CPU utilization:
    kubectl top nodes --sort-by=cpu
  2. Calculate requested vs allocatable:
    kubectl describe nodes | grep -A 5 "Allocated resources"
  3. Verify no workloads are pinned to the Intel nodes being removed
  4. Check PDBs and topology constraints will still be satisfiable with 80 nodes
  5. Run the decommission in batches (5 nodes at a time) with monitoring

Your cluster runs on 3 racks with 20 nodes each. You are adding a 4th rack with 20 new nodes (newer hardware generation). Your critical service has a topology spread constraint of maxSkew: 1 on topology.kubernetes.io/zone with whenUnsatisfiable: DoNotSchedule. After adding the new rack, the existing replicas stay on the original 3 racks, and newly created replicas go Pending whenever the 4th rack cannot accept them. Why?

Answer

The scheduler evaluates topology spread constraints only when it places a pod, and it never moves pods that are already running. The existing replicas therefore stay where they are, and the new rack joins as a domain with zero matching pods, which makes it the global minimum. Under DoNotSchedule, a pod may only be placed in a domain where the resulting pod count minus the global minimum is at most maxSkew. Every new replica must therefore land on the new rack until it catches up; if the new nodes cannot accept the pod yet (tainted, cordoned, or short on resources), the pod stays Pending. To rebalance in modern Kubernetes (1.27+), use matchLabelKeys targeting the pod template hash. This scopes the skew calculation to the ReplicaSet being rolled out, so a standard rollout restart spreads the new pods evenly across all four racks.

The math:

  • Existing: 3 racks, each with some pods of the critical service
  • Say the service has 9 replicas: 3 per rack (skew = 0, within maxSkew=1)
  • New rack-d has 0 replicas

When a new pod needs to be scheduled:

  • rack-a: 3, rack-b: 3, rack-c: 3, rack-d: 0
  • Global minimum: 0 (rack-d)
  • Placing on rack-a, rack-b, or rack-c: (3 + 1) - 0 = 4, which exceeds maxSkew=1
  • Placing on rack-d: (0 + 1) - 0 = 1, which satisfies maxSkew=1
  • Result: the pod can ONLY schedule on rack-d; if rack-d cannot take it, the pod stays Pending

Fix options:

  1. Use matchLabelKeys (recommended, K8s 1.27+): Add matchLabelKeys: ["pod-template-hash"] to the topology spread constraint:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      matchLabelKeys:
      - pod-template-hash
      labelSelector:
        matchLabels:
          app: critical-service
    Then run kubectl rollout restart deployment critical-service to rebalance across all 4 racks.
  2. Temporarily relax the constraint. This only unblocks Pending pods if maxSkew is large enough for the existing racks to qualify (here at least 4, because (3 + 1) - 0 = 4):
    maxSkew: 4 # temporary; revert once rack-d holds replicas
  3. Use whenUnsatisfiable: ScheduleAnyway (soft constraint) so replicas are not left Pending while rack-d comes online
  4. Scale up the deployment so the extra pods are placed on rack-d, then scale back down. Scale-down does not guarantee which pods are removed, so verify the resulting distribution.

Note: A plain rollout restart without matchLabelKeys is not a reliable fix. The labelSelector matches pods from both the old and new ReplicaSets, so the skew calculation counts old pods that are about to be terminated, and the final distribution can end up uneven.

Your company uses a 5-year refresh cycle. It is now year 4 and disk failure rates have increased from 1% to 6% annually. The CFO asks whether to extend to 7 years to save money. How do you argue against this?

Answer

Extending the hardware lifecycle to seven years introduces compounding hidden costs that negate the deferred capital expenditure. As servers age past year five, component failure rates skyrocket, particularly for mechanical or heavily written storage drives, increasing labor and emergency replacement costs. Additionally, older hardware is significantly less power-efficient than newer generations, leading to inflated electricity bills that can completely offset the price of new servers. Finally, keeping slower legacy processors limits application throughput, forcing you to run more nodes to handle the same workload volume and increasing the operational burden on the infrastructure team.

Argument against extending to 7 years:

1. Disk failure cost escalation:

  • Year 4: 6% failure rate across 200 disks = 12 failures/year
  • Year 5 (projected): 10% = 20 failures
  • Year 6 (projected): 15% = 30 failures
  • Year 7 (projected): 22% = 44 failures
  • Each disk replacement: $500 (disk) + $200 (labor) + risk of data loss
  • Years 6-7 disk costs: 74 failures x $700 = $51,800

2. Increasing support contract costs:

  • Vendors charge 30-60% more for extended support beyond 5 years
  • Some vendors refuse to support hardware past 7 years
  • Parts availability decreases (end-of-life components)

3. Power efficiency gap:

  • Year 4 hardware uses ~30% more power per compute unit than current generation
  • Year 7: ~50% more power per compute unit
  • At $0.10/kWh with 100 servers at 500W: 438,000 kWh/year, or $43,800/year
  • New servers at 350W equivalent performance: 306,600 kWh/year, or $30,660/year
  • Power savings: $13,140/year, before cooling overhead
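
The power arithmetic is easy to get wrong by a factor of ten, so it helps to work in kWh first and convert to cost last. A minimal sketch, assuming servers draw their rated wattage around the clock and ignoring cooling overhead; `annual_power_cost` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365


def annual_power_cost(servers, watts_per_server, usd_per_kwh):
    """Yearly electricity cost for a fleet running at constant draw."""
    kwh_per_year = servers * watts_per_server / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


old_fleet = annual_power_cost(servers=100, watts_per_server=500, usd_per_kwh=0.10)
new_fleet = annual_power_cost(servers=100, watts_per_server=350, usd_per_kwh=0.10)
print(f"old: ${old_fleet:,.0f}/year, new: ${new_fleet:,.0f}/year, "
      f"savings: ${old_fleet - new_fleet:,.0f}/year")
```

Multiplying the result by your facility's PUE gives a closer estimate of what the data center actually pays.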

4. Performance opportunity cost:

  • Applications running on 7-year-old hardware are 2-3x slower per core
  • Need 2-3x more servers to achieve the same throughput
  • Hiring developers is more expensive than buying faster hardware

5. Risk:

  • Cascading failures become more likely (correlated aging)
  • If 5 nodes fail in the same week (common in aging batches), the cluster may not have spare capacity
  • Security patches may stop being available for older firmware

Summary for the CFO: “Extending to 7 years defers $200,000 in CapEx, but the disk replacement, extended support, and excess power costs itemized above offset part of that saving, and the operational risk is substantially higher.”

You are planning a staggered refresh, replacing 33 nodes per year in a 100-node cluster. You currently have Intel Xeon Gold 6330 nodes. Next year’s refresh will use AMD EPYC 9554. What testing should you do before deploying the AMD nodes into your production cluster?

Answer

Migrating workloads between different CPU vendors introduces subtle platform differences that can unexpectedly impact application performance or stability. Because AMD and Intel processors differ in NUMA topology, cache and memory hierarchy, and their implementation of vector extensions (like AVX-512), workloads that rely heavily on specific instruction sets or memory bandwidth may behave unpredictably. Comprehensive testing ensures that the container runtime, CNI plugins, and underlying storage drivers interact correctly with the new hardware before it enters production. Gradually rolling out the new nodes as a canary deployment lets you observe these differences under real-world traffic patterns without risking widespread outages.

Testing plan for cross-vendor CPU migration:

Phase 1: Hardware validation (1 week)

  • Boot the AMD servers, verify BIOS settings (SR-IOV, VT-x/AMD-V, NUMA, power management)
  • Run hardware stress tests: stress-ng, memtester, fio
  • Verify NIC driver compatibility (especially if using Mellanox/Broadcom with RDMA)
  • Confirm container runtime works (containerd, kernel cgroup v2)
  • Test storage: Ceph OSD performance, CSI driver compatibility

Phase 2: Kubernetes integration (1 week)

  • Join 3 AMD nodes to a staging cluster alongside Intel nodes
  • Verify kubelet starts correctly
  • Test CNI (Calico/Cilium) BGP peering from AMD nodes
  • Verify pod scheduling and inter-node networking (pod-to-pod between AMD and Intel nodes)
  • Run the standard networking test suite (iperf3, curl, DNS resolution)

Phase 3: Application testing (2 weeks)

  • Deploy representative workloads on AMD nodes
  • Compare performance metrics: latency, throughput, CPU utilization
  • Test language-specific behavior:
    • Java: JVM may select different JIT optimizations on AMD vs Intel
    • Go: Should work identically (the same amd64 binary runs on both vendors)
    • Python/NumPy: May use different BLAS/LAPACK optimizations
    • TensorFlow: Check AVX-512 compatibility
  • Run load tests comparing AMD vs Intel node behavior

Phase 4: Production canary (1 week)

  • Add 3 AMD nodes to production
  • Do NOT taint them or steer workloads to or away from them (keep the hardware-generation labels so you can compare metrics)
  • Let the scheduler place normal workloads
  • Monitor for 7 days: error rates, latency distributions, resource usage
  • If stable, proceed with full 33-node deployment

Key risk areas:

  • NUMA and memory topology differences (AMD EPYC exposes a different NUMA layout than Intel Xeon)
  • numactl and CPU pinning may need reconfiguration
  • BIOS power management settings affect performance under load
  • Some monitoring tools report different CPU metrics on AMD vs Intel

Hands-On Exercise: Plan a Hardware Expansion


You manage a 60-node bare metal Kubernetes cluster spread across 3 racks (20 nodes each). The cluster is currently running at 65% CPU utilization. The business is forecasting a 40% growth in traffic next quarter, so you have just racked and powered on 20 new servers (a newer hardware generation) in a 4th rack. The new servers have a Passmark score 44% higher per core than the old servers.

Design a safe capacity expansion and decommission plan that successfully integrates the new hardware, spreads workloads across all 4 racks, and safely retires 10 of the oldest nodes without exceeding an 80% cluster-wide utilization ceiling.

Use your understanding of Kubernetes scheduling, topology spread constraints, and normalized CPU capacity to document the necessary node labels, workload constraints, and the mathematical justification for your decommission strategy. Do not rely on naive core counts.

Hint 1: The Concept Because the new servers are 44% faster per core, a simple sum of CPU cores will underestimate your new total capacity. You need to calculate "performance-adjusted units" to accurately predict post-expansion and post-decommission utilization.
Hint 2: The Component To ensure high availability across the heterogeneous hardware, your workloads need `topologySpreadConstraints`. Since you are adding a 4th rack that starts empty, remember how `maxSkew: 1` behaves when a new topology domain is introduced.
Hint 3: The Command When decommissioning the 10 oldest nodes, use `kubectl drain --ignore-daemonsets --delete-emptydir-data --timeout=600s`. Before running this, you must calculate: (Current Requested CPU) / (Total CPU - Removed Node CPU) using weighted capacity to ensure it stays below 80%.

Review your expansion plan against these checks:

  1. Did you define specific labels for hardware generation and performance tier?
  2. Did you include matchLabelKeys: ["pod-template-hash"] in your topology spread constraints so that a rollout rebalances pods across all 4 racks?
  3. Does your decommission math prove that removing 10 old nodes will leave the cluster below 80% utilization?

Why is it dangerous to treat all CPU cores as identical when combining servers from 2019 and 2024 in the same cluster? How would naive pod scheduling impact latency-sensitive microservices?


This concludes the Day-2 Operations section. Return to the Operations index to review all modules, or continue to the next section in the on-premises track.
