Module 2.9: Workload Autoscaling
Complexity: [MEDIUM] - CKA exam topic
Time to Complete: 40-50 minutes
Prerequisites: Module 2.2 (Deployments), Module 2.5 (Resource Management)
What You’ll Be Able to Do
After this module, you will be able to:
- Configure Horizontal Pod Autoscaler (HPA) with CPU and custom metrics
- Explain the HPA decision algorithm (target utilization, scaling velocity, cooldown)
- Debug HPA not scaling by checking metrics-server, current vs target utilization, and events
- Compare HPA, VPA, and cluster autoscaler and explain when to use each
Why This Module Matters
Static replica counts waste money or cause outages. Too many replicas = wasted resources. Too few = users get errors during traffic spikes. Autoscaling dynamically adjusts capacity based on actual demand.
The CKA exam tests your ability to create and configure HorizontalPodAutoscalers. You’ll need to do this quickly under pressure.
The Thermostat Analogy
A Horizontal Pod Autoscaler is like a smart thermostat. You set the desired “temperature” (target CPU utilization), and the system automatically turns on more “heaters” (pods) when it’s cold (high load) and turns them off when it’s warm (low load). You don’t manually adjust the heating — the thermostat does it based on the current reading.
Did You Know?
- HPA checks metrics every 15 seconds by default (configurable via `--horizontal-pod-autoscaler-sync-period`). Scaling decisions are based on the average metric value across all pods.
- HPA has a cooldown period: by default, HPA waits 5 minutes (the scale-down stabilization window, configurable) before it scales down. This prevents "flapping", that is, rapidly scaling up and down.
- metrics-server is required: HPA can't function without metrics-server installed in the cluster. It provides the CPU/memory metrics that HPA needs. This is a common gotcha in practice environments.
- VPA + In-Place Pod Resize (K8s 1.35): The Vertical Pod Autoscaler can now leverage in-place pod resize to adjust CPU/memory without restarting pods, a game changer for stateful workloads.
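The scale-down cooldown mentioned above can also be set per HPA through the `behavior` field of `autoscaling/v2`. The sketch below is a minimal example, not a recommended configuration: it writes a manifest to a file named `web-hpa-behavior.yaml` (a name chosen here for illustration), and it assumes a Deployment called `web`. The policy values are illustrative.

```shell
# Write an HPA manifest with an explicit scale-down window.
# Assumptions: a Deployment named "web" exists; all numbers are illustrative.
cat > web-hpa-behavior.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # 5 minutes; lower it to scale down sooner
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60               # remove at most 1 pod per minute
EOF

# Apply it against a cluster (not run here):
# kubectl apply -f web-hpa-behavior.yaml
```

A shorter window makes scale-down more responsive at the cost of more flapping; a longer one keeps capacity around after a spike.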
Part 1: Horizontal Pod Autoscaler (HPA)
1.1 Prerequisites: metrics-server
HPA needs metrics-server to read CPU/memory usage:
    # Check if metrics-server is installed
    k top nodes
    # If "error: Metrics API not available", install it:

    # Install metrics-server
    kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

    # For local clusters (kind/minikube), you may need to add --kubelet-insecure-tls
    kubectl patch deployment metrics-server -n kube-system --type=json \
      -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

    # Verify it works
    k top nodes
    k top pods

1.2 Creating an HPA
Imperative (exam-fast):
    # Create HPA: scale between 2-10 replicas, target 80% CPU
    k autoscale deployment web --min=2 --max=10 --cpu-percent=80

    # Verify
    k get hpa
    # NAME   REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    # web    Deployment/web   12%/80%   2         10        2          30s

Declarative:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 85

Pause and predict: You create an HPA with `targetCPUUtilization: 50%` and `min: 2, max: 10`. Your 3 pods are currently at 90% CPU utilization. How many replicas will the HPA calculate as needed? (Hint: the formula is `ceil(currentReplicas * (currentMetric / targetMetric))`)
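The formula in the hint, together with the min/max clamp, can be sketched as a small shell function. This is a simplified illustration rather than the controller's actual code: the function name `hpa_desired` is made up for this example, it works on whole percentages only, and it leaves out the tolerance band (10% by default) inside which the real controller does not change the replica count.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC MIN MAX
# Prints ceil(current * metric / target), clamped to [MIN, MAX].
# Hypothetical helper for illustration; integers only.
hpa_desired() {
  current=$1; metric=$2; target=$3; min=$4; max=$5
  # Integer ceiling: (a + b - 1) / b
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

hpa_desired 4 60 80 2 10   # 4 pods at 60% CPU, target 80%: prints 3
hpa_desired 5 95 50 2 8    # ceil(9.5) = 10, clamped to max: prints 8
```

Try it with the numbers from the question above to check your prediction.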
1.3 How HPA Decides
    ┌─────────────────────────────────────────────────────────────┐
    │                HPA Decision Loop (every 15s)                │
    ├─────────────────────────────────────────────────────────────┤
    │                                                             │
    │  1. Read current metric values from metrics-server          │
    │       │                                                     │
    │       ▼                                                     │
    │  2. Calculate: desired = ceil(current * (actual / target))  │
    │     Example: 3 pods at 90% CPU, target 50%                  │
    │     desired = ceil(3 * (90/50)) = ceil(5.4) = 6 pods        │
    │       │                                                     │
    │       ▼                                                     │
    │  3. Clamp to min/max range                                  │
    │     min: 2, max: 10 → result: 6 (within range)              │
    │       │                                                     │
    │       ▼                                                     │
    │  4. Scale deployment to 6 replicas                          │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

1.4 Monitoring HPA
    # Check HPA status
    k get hpa web-hpa
    k describe hpa web-hpa

    # Watch scaling events
    k get hpa -w

    # Check events for scaling decisions
    k get events --field-selector reason=SuccessfulRescale

Part 2: Load Testing Your HPA
    # Deploy a test app with resource requests
    k create deployment web --image=nginx --replicas=1
    k set resources deployment web --requests=cpu=100m,memory=128Mi --limits=cpu=200m,memory=256Mi

    # Expose it so the load generator can reach http://web
    k expose deployment web --port=80

    # Create HPA
    k autoscale deployment web --min=1 --max=5 --cpu-percent=50

    # Generate load (in another terminal)
    k run load-generator --image=busybox --restart=Never -- \
      /bin/sh -c "while true; do wget -q -O- http://web; done"

    # Watch HPA respond
    k get hpa web -w
    # You should see CPU% increase and replicas scale up

    # Stop load
    k delete pod load-generator

    # Watch HPA scale back down (after cooldown)
    k get hpa web -w

Part 3: Vertical Pod Autoscaler (VPA)
VPA automatically adjusts CPU and memory requests/limits based on observed usage. Unlike HPA (more pods), VPA adjusts the size of each pod.
Stop and think: Your team runs a PostgreSQL database as a StatefulSet with a single replica. During peak hours, the database needs more CPU and memory, but you can’t just add more replicas (that’s not how databases work). What autoscaling approach would you use here — HPA or VPA? What mode would you start with if you’re cautious?
3.1 When to Use VPA vs HPA
| Scenario | Use |
|---|---|
| Stateless web apps | HPA (add more pods) |
| Databases, caches | VPA (bigger pods — can’t easily add replicas) |
| Unknown resource needs | VPA in recommend mode first |
| Batch jobs | VPA (right-size the job pods) |
| Combine both | HPA on custom metrics + VPA on resources |
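For the "combine both" row, an HPA that scales on a per-pod custom metric instead of CPU might look like the sketch below. Everything specific here is an assumption for illustration: the metric name `http_requests_per_second` is hypothetical, the target of 100 is arbitrary, and the cluster must run a custom metrics adapter (for example, one backed by Prometheus) that actually serves such a metric.

```shell
# Write an HPA manifest that targets a per-pod custom metric.
# Assumptions: Deployment "web" exists; the metric name and target value
# are hypothetical; a custom metrics adapter is installed in the cluster.
cat > web-hpa-rps.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "100"              # illustrative: 100 req/s per pod
EOF

# Apply it against a cluster (not run here):
# kubectl apply -f web-hpa-rps.yaml
```

Because this HPA never looks at CPU or memory, a VPA on the same Deployment can adjust requests without feeding back into the replica calculation.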
3.2 VPA Modes
    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: web-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web
      updatePolicy:
        updateMode: "Auto"  # Options: Off, Initial, Recreate, Auto

| Mode | Behavior |
|---|---|
| Off | VPA only recommends and doesn't change anything (safe for auditing) |
| Initial | Sets resources only when pods are created (not running ones) |
| Recreate | Evicts and recreates pods with new resources |
| Auto | Uses in-place resize (K8s 1.35+) when possible, falls back to recreate |
K8s 1.35 + VPA: With in-place pod resize GA, VPA in `Auto` mode can now adjust CPU and memory on running pods without restart, a major improvement for stateful workloads.
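A cautious way to start is the `Off` mode from the table, which only produces recommendations. The sketch below writes such a manifest to a file (the file name is chosen here for illustration) and shows, as comments, how the recommendation could then be read. It assumes the VPA components are installed in the cluster, since VPA is an add-on rather than part of core Kubernetes.

```shell
# Write a recommendation-only VPA manifest for the "web" Deployment.
# Assumptions: the VPA CRDs and controllers are installed; names are illustrative.
cat > web-vpa-off.yaml <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"   # recommend only, never change pods
EOF

# Against a cluster (not run here):
# kubectl apply -f web-vpa-off.yaml
# kubectl describe vpa web-vpa
# kubectl get vpa web-vpa -o jsonpath='{.status.recommendation.containerRecommendations}'
```

Recommendations usually need some observed traffic before they become meaningful, so give the VPA time before comparing them with your current requests.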
Pause and predict: You set up HPA on a Deployment but `kubectl get hpa` shows `TARGETS: <unknown>/80%`. The HPA never scales. What is likely missing from your cluster, and what else might be missing from your pod spec?
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No metrics-server | HPA shows <unknown> for targets | Install metrics-server first |
| No resource requests on pods | HPA can’t calculate utilization | Always set requests |
| Min = Max replicas | HPA can’t scale | Set different min and max |
| CPU target too low (e.g., 10%) | Scales too aggressively, wastes resources | Start at 50-80% |
| Using HPA + VPA on same metric | Conflict — both try to adjust | Use HPA for scaling, VPA for right-sizing (different metrics) |
| Forgetting cooldown | Wonder why HPA doesn’t scale down immediately | Default 5m stabilization window |
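The "no resource requests" row follows from how utilization is defined: usage divided by the request, not by the limit. The function below is a simplified single-pod sketch of that arithmetic (the real HPA averages across all pods); the name `cpu_utilization` is made up for this example.

```shell
# cpu_utilization USAGE_MILLICORES [REQUEST_MILLICORES]
# Prints usage as a percentage of the request, or <unknown> when no
# request is set. Hypothetical helper; integer division truncates.
cpu_utilization() {
  usage=$1; request=${2:-0}
  if [ "$request" -le 0 ]; then
    echo "<unknown>"   # nothing to divide by
    return 0
  fi
  echo "$(( usage * 100 / request ))%"
}

cpu_utilization 150 100   # prints 150%
cpu_utilization 150       # prints <unknown>
```

Note that utilization can exceed 100% when a pod uses more than its request, which is possible whenever the limit is higher than the request.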
- You deployed an HPA for your web application, but `kubectl get hpa` shows `TARGETS: <unknown>/80%` and the replica count never changes. The application is clearly under heavy load. Walk through your troubleshooting steps to get the HPA working.
Answer
The `<unknown>` target means the HPA cannot read metrics. First, check if metrics-server is installed: run `kubectl top nodes` -- if it returns an error ("Metrics API not available"), install metrics-server. Second, even with metrics-server running, the HPA needs the Deployment's pods to have `resources.requests.cpu` set. Without CPU requests, HPA cannot calculate utilization percentage (utilization = current usage / request). Fix by running `kubectl set resources deployment/web --requests=cpu=100m`. After both fixes, the HPA should show actual utilization within 15-30 seconds and begin making scaling decisions.
- Your e-commerce API has an HPA with `min: 2, max: 20, targetCPU: 50%`. During Black Friday, traffic spikes and all 20 replicas are running at 95% CPU. The HPA can't scale beyond 20, and users are getting timeouts. What are three approaches to handle this situation, both for the immediate crisis and for next year?
Answer
For the immediate crisis: (1) Increase the HPA's `maxReplicas` with `kubectl patch hpa web --patch '{"spec":{"maxReplicas":40}}'` to allow more pods. (2) If nodes are full, the cluster autoscaler needs to add more nodes -- verify it's enabled and has headroom in the node group's max size. For next year: (3) Pre-scale before the event by manually setting a higher `minReplicas` before traffic hits (e.g., `kubectl patch hpa web --patch '{"spec":{"minReplicas":15}}'`). This avoids the latency of reactive scaling. Also consider using HPA with custom metrics (requests-per-second) instead of CPU, which responds faster to traffic changes than CPU utilization does.
- Your team runs a single-replica Redis cache as a StatefulSet. During peak hours, it needs more CPU and memory but adding replicas isn't an option since the app uses a single Redis instance. A colleague suggests HPA. Why won't HPA work here, what should you use instead, and what mode would you start with?
Answer
HPA won't work because Redis is a single-instance stateful workload -- adding replicas doesn't create a clustered cache, it creates independent caches that the application doesn't know about. Use VPA (Vertical Pod Autoscaler) instead, which adjusts the CPU and memory requests/limits on the existing pod rather than adding replicas. Start with `updateMode: "Off"` (recommendation-only mode) to observe what VPA suggests without making changes. Once you trust the recommendations, switch to `updateMode: "Auto"` which, on Kubernetes 1.35+, uses in-place pod resize to adjust resources without restarting the container -- critical for a cache that would lose data on restart.
- An engineer configured both HPA (targeting CPU at 50%) and VPA on the same Deployment. During a load test, they notice erratic behavior: the pod count oscillates between 3 and 8 replicas while resource requests keep changing. Explain why this happens and how to properly use both autoscalers together.
Answer
HPA and VPA conflict when targeting the same metric (CPU). Here's the oscillation cycle: VPA increases the CPU request on each pod (making pods "bigger"). HPA sees that per-pod CPU utilization dropped (because the request denominator increased) and scales down replicas. With fewer replicas, per-pod CPU usage rises again, HPA scales back up, and VPA sees high utilization and increases requests further. To use both together correctly, configure HPA to scale on custom metrics (like requests-per-second or queue depth) rather than CPU, and let VPA handle CPU/memory right-sizing. This way they operate on orthogonal dimensions: HPA adjusts replica count based on traffic, while VPA adjusts pod size based on resource consumption patterns. Never let both autoscalers compete over the same metric.
Hands-On Exercise
Challenge: Auto-Scale a Web Application
Set up a deployment, configure HPA, generate load, and verify scaling.
    # 1. Create deployment with resource requests
    k create deployment challenge-web --image=nginx --replicas=1
    k set resources deployment challenge-web \
      --requests=cpu=50m,memory=64Mi --limits=cpu=100m,memory=128Mi

    # 2. Expose it
    k expose deployment challenge-web --port=80

    # 3. Create HPA: 2-8 replicas, 50% CPU target
    k autoscale deployment challenge-web --min=2 --max=8 --cpu-percent=50

    # 4. Verify HPA is working
    k get hpa challenge-web
    # Should show TARGETS and current replica count

    # 5. Generate load
    k run load --image=busybox --restart=Never -- \
      /bin/sh -c "while true; do wget -q -O- http://challenge-web; done"

    # 6. Watch scaling happen
    k get hpa challenge-web -w
    # Wait until you see replicas increase

    # 7. Stop load and watch scale-down
    k delete pod load
    k get hpa challenge-web -w
    # Replicas should decrease after cooldown (5 min)

    # 8. Cleanup
    k delete deployment challenge-web
    k delete svc challenge-web
    k delete hpa challenge-web

Success Criteria:
- HPA created with correct min/max/target
- Replicas scale up during load
- Replicas scale down after load stops
- No `<unknown>` in HPA targets
Next Module
Return to Part 2 Overview.