Module 1.3: Workload Rightsizing & Optimization
Discipline Module | Complexity: [MEDIUM] | Time: 2.5h
Prerequisites
Before starting this module:
- Required: Module 1.2: Kubernetes Cost Allocation — Cost visibility and attribution
- Required: Understanding of Kubernetes resource requests and limits
- Required: Familiarity with Deployments, Pods, and container resource management
- Recommended: Experience with `kubectl top` and metrics-server
- Recommended: Access to a local Kubernetes cluster (kind or minikube)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement resource rightsizing recommendations using VPA, Goldilocks, or custom analysis scripts
- Design rightsizing workflows that validate changes in staging before applying to production workloads
- Analyze resource request and limit patterns to identify over-provisioned and under-provisioned workloads
- Build automated rightsizing pipelines that continuously optimize resource allocations based on actual usage
Why This Module Matters
In Module 1.2, you learned that the average Kubernetes cluster runs at 13-18% CPU utilization. That means for every dollar you spend on compute, roughly 82-87 cents buys unused capacity.
Why does this happen? Because engineers are rational.
When a developer sets resource requests, they face an asymmetric risk: request too little and the app crashes at 3 AM. Request too much and… nothing bad happens. The cost is invisible, the outage is a PagerDuty alert. So developers round up. Way up.
```mermaid
graph TD
  subgraph "The Developer's Dilemma"
    A["'My app uses ~200m CPU normally, but once last<br>quarter it spiked to 800m during Black Friday.<br>I'll request 1000m to be safe.'"]
    B["Actual usage (p95): 250m CPU<br>Requested: 1000m CPU<br>Wasted: 750m CPU (75%)"]
    C["Annual waste per replica: ~$270<br>× 6 replicas: ~$1,620/year<br>× 80 similar services: ~$129,600/year"]
    D["That's one senior engineer's salary in waste."]
  end
  A --> B
  B --> C
  C --> D
```

Rightsizing is the practice of aligning resource requests with actual usage. It’s the single highest-ROI FinOps activity for Kubernetes, and this module shows you how to do it.
Did You Know?
- Google’s internal research showed that container resource requests are typically set 5-10x higher than actual usage across most workloads. This isn’t laziness; it’s rational risk aversion. Nobody gets fired for over-provisioning, but under-provisioning causes visible outages.
- The Vertical Pod Autoscaler (VPA) was created specifically to solve rightsizing. Originally developed by Google, it’s now a Kubernetes autoscaler project that observes actual resource consumption over time and recommends (or automatically applies) right-sized resource requests.
- Memory rightsizing is trickier than CPU rightsizing. If you under-provision CPU, the container gets throttled (slow but alive). If you under-provision memory, the container gets OOM-killed (dead). This asymmetry means memory requests should include a larger safety margin, typically 15-25% above peak observed usage.
Identifying Over-Provisioned Workloads
The Request-Usage Gap
The first step in rightsizing is finding where the biggest gaps exist between what’s requested and what’s used.

```bash
# Quick check: actual usage per container (compare against requests below)
kubectl top pods -n payments --containers
```

```
POD                          NAME     CPU(cores)   MEMORY(bytes)
payment-api-7d8f9c-abc12     api      23m          84Mi
payment-api-7d8f9c-def34     api      31m          91Mi
payment-api-7d8f9c-ghi56     api      18m          78Mi
payment-worker-5b6c7-jkl89   worker   8m           42Mi
payment-worker-5b6c7-mno01   worker   5m           38Mi
```

Compare against requests:

```
payment-api:
  Requested: 200m CPU, 256Mi memory (per replica)
  Actual:    ~24m CPU, ~84Mi memory (average)
  Gap:       176m CPU (88%), 172Mi memory (67%)

payment-worker:
  Requested: 100m CPU, 128Mi memory (per replica)
  Actual:    ~7m CPU, ~40Mi memory (average)
  Gap:       93m CPU (93%), 88Mi memory (69%)
```

Using Prometheus Queries
For historical analysis over days or weeks (not just a point-in-time snapshot):

```promql
# Average CPU usage vs requests over 7 days, by container
# The [7d:5m] subquery averages the ratio over the whole window
avg_over_time(
  (
    avg by (namespace, pod, container) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    )
    / on(namespace, pod, container) group_left()
    kube_pod_container_resource_requests{resource="cpu"}
  )[7d:5m]
)
# Returns values like 0.12, meaning 12% of requested CPU is actually used

# Memory usage vs requests over 7 days
avg_over_time(
  (
    avg by (namespace, pod, container) (
      container_memory_working_set_bytes{container!=""}
    )
    / on(namespace, pod, container) group_left()
    kube_pod_container_resource_requests{resource="memory"}
  )[7d:5m]
)
# Returns values like 0.33, meaning 33% of requested memory is used

# Find the worst offenders: pods where CPU usage < 10% of requests
# Sum usage and requests across containers so both sides describe the whole pod
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[1h])
)
/ on(namespace, pod)
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
)
< 0.10
```

The Rightsizing Matrix
Categorize workloads based on their usage patterns:
| Category | CPU Usage vs Request | Memory Usage vs Request | Action |
|---|---|---|---|
| Massively over-provisioned | < 15% | < 30% | Rightsize immediately (easy win) |
| Moderately over-provisioned | 15-40% | 30-60% | Rightsize with monitoring |
| Reasonably sized | 40-70% | 60-80% | Monitor, minor adjustments |
| Tight | 70-85% | 80-90% | Watch carefully, might need increase |
| Under-provisioned | > 85% | > 90% | Increase requests immediately |
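The matrix can be applied mechanically once you have usage-to-request ratios from the Prometheus queries. The following is a minimal Python sketch, not a definitive implementation. It assumes ratios are expressed as fractions (0.12 means 12%), and it makes one choice the table does not specify: when CPU and memory fall into different rows, the workload is placed in the tighter of the two, so a resource with little headroom is never shrunk because the other one is idle. The names `band` and `classify` are hypothetical.

```python
CATEGORIES = [
    "Massively over-provisioned",
    "Moderately over-provisioned",
    "Reasonably sized",
    "Tight",
    "Under-provisioned",
]
# Upper bounds of the first four rows of the matrix, as fractions of the request.
CPU_UPPER_BOUNDS = [0.15, 0.40, 0.70, 0.85]
MEM_UPPER_BOUNDS = [0.30, 0.60, 0.80, 0.90]


def band(ratio, upper_bounds):
    """Return the index of the first row whose upper bound exceeds the ratio."""
    for index, upper in enumerate(upper_bounds):
        if ratio < upper:
            return index
    return len(upper_bounds)  # at or above the last bound: under-provisioned


def classify(cpu_ratio, mem_ratio):
    """Classify by the tighter of the two resources (an assumption of this sketch)."""
    row = max(band(cpu_ratio, CPU_UPPER_BOUNDS), band(mem_ratio, MEM_UPPER_BOUNDS))
    return CATEGORIES[row]


# payment-api from the earlier example: ~24m of 200m CPU, ~84Mi of 256Mi memory
print(classify(24 / 200, 84 / 256))
```

For payment-api, CPU alone (12%) would land in the first row, but memory (about 33%) sits in the second, so the sketch reports "Moderately over-provisioned".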
Pause and predict: If you scale up replicas using HPA based on CPU, and VPA also tries to change CPU requests, what might happen?
The Vertical Pod Autoscaler (VPA)
What VPA Does
VPA watches actual resource consumption over time and adjusts (or recommends) resource requests accordingly.

```mermaid
graph LR
  A["Observe<br>usage<br>metrics<br>(Recommender)"] --> B["Calculate<br>optimal<br>requests<br>(Recommender)"]
  B --> C["Apply<br>new<br>requests<br>(Updater, optional)"]
```

VPA Components
| Component | Role | Required? |
|---|---|---|
| Recommender | Watches usage, calculates recommendations | Yes |
| Updater | Evicts pods to apply new requests | Only for Auto mode |
| Admission Controller | Sets requests on new pods | Only for Auto/Initial modes |
VPA Update Modes
| Mode | Behavior | Use Case |
|---|---|---|
| Off | Only generates recommendations, applies nothing | Start here: review before changing anything |
| Initial | Sets requests on pod creation, doesn’t change running pods | Safe for new deployments |
| Auto | Evicts and recreates pods with updated requests (currently equivalent to Recreate) | Fully automated rightsizing |
| Recreate | Evicts and recreates pods whenever requests differ significantly from the recommendation | Only when pods must restart on every request change |
Best practice: Always start with Off mode to review recommendations before trusting VPA to change anything automatically.
VPA Limitations
Before you go all-in on VPA, know the gotchas:
- VPA and HPA conflict on CPU/memory. Don’t use both to scale the same metric. VPA adjusts requests; HPA adjusts replicas. If both try to respond to CPU, they fight.
- VPA evicts pods to update. In Auto mode, VPA kills running pods to apply new resource values. This means brief disruption. Use PodDisruptionBudgets.
- VPA needs history. Recommendations improve with more data. Give VPA at least 24-48 hours (ideally 7 days) of data before trusting its recommendations.
- VPA doesn’t manage limits independently. Recommendations are computed for requests. By default (controlledValues: RequestsAndLimits) VPA scales each limit proportionally to preserve the original request-to-limit ratio; set controlledValues: RequestsOnly to leave limits untouched. Either way, you need a separate policy for what the limits should be.
- VPA ignores burst patterns. If your app spikes to 2000m CPU for 5 seconds every hour, VPA might not capture that in its recommendation.
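The last limitation follows from how any percentile-based recommendation behaves: a burst that occupies a tiny share of the samples does not move the percentile. The sketch below illustrates this with plain Python. It is an illustration of the percentile effect only, not a model of the VPA recommender's internal histogram; the 5-second sampling interval and the 200m baseline are assumptions chosen to match the example in the list.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct/100 * n) in integer math
    return ordered[rank - 1]


# One hour sampled every 5 seconds: 720 samples, one of them a 2000m burst.
samples_m = [200] * 719 + [2000]

p95 = percentile(samples_m, 95)
p99 = percentile(samples_m, 99)
peak = max(samples_m)
print(f"p95={p95}m p99={p99}m max={peak}m")
```

Both p95 and p99 stay at 200m while the peak is 2000m, which is why bursty workloads need a limit (or a profiling run) that accounts for the spike rather than a request derived from percentiles alone.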
HPA Tuning for Cost
The Horizontal Pod Autoscaler (HPA) scales replicas. Most teams configure it for availability, but it’s also a powerful cost optimization tool.
Aggressive vs Conservative Scaling
```yaml
# Cost-optimized HPA (higher utilization target, bounded scale-up and scale-down)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2    # Don't go below 2 for HA
  maxReplicas: 12   # Cap the spend
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65   # Add replicas above 65% of requested CPU; a higher target means fewer replicas
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # Wait 2 min before scaling up
      policies:
        - type: Pods
          value: 2             # Add max 2 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25            # Remove max 25% of pods at a time
          periodSeconds: 120
```

Cost Impact of HPA Settings
| Setting | Cost Impact | Risk |
|---|---|---|
| Higher target utilization (65-80%) | Lower cost — fewer replicas needed | Higher latency during spikes |
| Lower minReplicas | Lower baseline cost | Slower response to sudden load |
| Faster scaleDown | Less idle capacity | Thrashing if load fluctuates |
| Slower scaleUp | Lower cost: fewer replicas added for short spikes | Temporary under-capacity and brief degradation during ramp |
| Custom metrics (queue depth) | Scale on actual demand, not CPU | Requires metrics pipeline setup |
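The first row of the table follows from the scaling rule documented for the HPA: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule for three utilization targets under the same load. It is a simplification that assumes a single CPU metric and ignores the controller's tolerance band and stabilization windows; the function name and the example numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_util_pct, target_util_pct,
                     min_replicas=2, max_replicas=12):
    """Apply the documented HPA ratio rule, clamped to min/max replicas."""
    raw = math.ceil(current_replicas * current_util_pct / target_util_pct)
    return max(min_replicas, min(max_replicas, raw))


# Same load in every case: 4 replicas running at 90% of requested CPU.
for target in (50, 65, 80):
    print(f"target {target}% -> {desired_replicas(4, 90, target)} replicas")
```

With this load the sketch yields 8, 6, and 5 replicas for targets of 50%, 65%, and 80%: the higher the target, the fewer pods you pay for, at the price of less headroom per pod when traffic spikes.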
Combining HPA + VPA Safely
The trick is to let VPA handle resource requests and HPA handle replica count, but on different metrics.

```yaml
# VPA: Right-size the per-pod resources
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"   # Recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]   # VPA manages memory ONLY
        minAllowed:
          memory: "64Mi"
        maxAllowed:
          memory: "2Gi"
---
# HPA: Scale replicas based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Rule: VPA on memory, HPA on CPU. They don’t conflict because they manage different dimensions.
Stop and think: Does Kubernetes evict Pods based on how much they cost, or based on how their resources are configured?
Quality of Service (QoS) for Cost
Kubernetes assigns QoS classes to pods based on how requests and limits are configured. QoS affects eviction priority, which has cost implications.
The Three QoS Classes
```yaml
# Guaranteed: highest priority, evicted last
# requests == limits for ALL containers
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # Same as request
    memory: "512Mi"   # Same as request
```

```yaml
# Burstable: medium priority
# requests < limits (or limits not set for some resources)
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1000m"      # Higher than request
    memory: "1Gi"     # Higher than request
```

```yaml
# BestEffort: lowest priority, evicted first
# NO requests or limits set at all
resources: {}         # Empty, no guarantees
```

QoS and Cost Strategy
| QoS Class | When to Use | Cost Implication |
|---|---|---|
| Guaranteed | Critical production workloads (databases, payment APIs) | Highest — you pay for the exact resources at all times |
| Burstable | Most production services | Medium — pay for requests, can burst higher when available |
| BestEffort | Batch jobs, dev/test, non-critical tasks | Lowest: reserves no capacity, but gets no resource guarantee and is evicted first under pressure |
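The class definitions above can be checked programmatically when auditing manifests. The following Python sketch derives the QoS class from a list of container `resources` dictionaries. It is a simplified approximation of the rule Kubernetes applies: it compares quantity strings literally (so "500m" and "0.5" would be treated as different) and only looks at CPU and memory. The function name `qos_class` is hypothetical.

```python
def qos_class(containers):
    """containers: list of dicts shaped like a container's `resources` field."""
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "200m", "memory": "256Mi"},
                  "limits": {"cpu": "1000m", "memory": "1Gi"}}]))
```

Running a check like this across a namespace gives the share of pods in each class, which you can compare against the target distribution for your cluster.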
Cost-optimized strategy: Use Guaranteed only for truly critical workloads (< 20% of pods). Make most workloads Burstable. Use BestEffort for development and batch processing.
```mermaid
pie title Cost-Optimized QoS Distribution (Target utilization: 55-70%)
  "Burstable (standard)" : 65
  "BestEffort (dev/batch)" : 20
  "Guaranteed (critical)" : 15
```

Pause and predict: If a node runs out of memory, which Pod gets evicted first: a Burstable pod using 90% of its requested memory, or a BestEffort pod using 10% of its node’s memory?
Profiling vs Utilization-Based Rightsizing
Utilization-Based (Reactive)
Look at historical usage, set requests to match:

```
Approach: Watch metrics → set requests = p95 usage + margin

payment-api over 14 days:
  CPU p50:  85m   → Not useful (too low)
  CPU p95:  210m  → This is the target
  CPU p99:  380m  → Rare spikes
  CPU max:  820m  → One-time outlier

Recommendation: requests.cpu = 250m (p95 + 19% margin)
Previous:       requests.cpu = 1000m
Savings:        750m CPU per replica (75% reduction)
```

Pros: Simple, data-driven, works for all workloads
Cons: Backward-looking, doesn’t account for future growth or rare events
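The "p95 + margin" rule can be expressed in a few lines of Python. This is a minimal sketch under stated assumptions: the sample list is synthetic and constructed so that its percentiles match the payment-api figures above, the percentile uses the nearest-rank method, and the result is rounded up to a multiple of 50m. The function names and the rounding step are hypothetical choices, not part of any tool.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct/100 * n) in integer math
    return ordered[rank - 1]


def recommend_request(samples_m, pct=95, margin_pct=19, step_m=50):
    """Percentile plus margin, rounded up to the next multiple of step_m."""
    raw_times_100 = percentile(samples_m, pct) * (100 + margin_pct)
    return -(-raw_times_100 // (100 * step_m)) * step_m  # ceiling division


# Synthetic CPU samples in millicores: p50=85, p95=210, p99=380, max=820
samples_m = [85] * 94 + [210] + [380] * 4 + [820]

new_request = recommend_request(samples_m)
saved = 1000 - new_request
print(f"requests.cpu = {new_request}m, saves {saved}m per replica")
```

Using integer arithmetic for the rank and the rounding avoids floating-point surprises at band edges; in practice you would feed the function real samples exported from your metrics backend.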
Profiling-Based (Proactive)
Measure actual resource needs through controlled tests:

```bash
# Load test to find true resource ceiling
# Using k6 or similar load testing tool

# Step 1: Deploy with generous resources
kubectl set resources deployment/payment-api -n payments \
  --requests=cpu=2000m,memory=2Gi \
  --limits=cpu=4000m,memory=4Gi

# Step 2: Run load test at expected peak traffic
k6 run --vus 200 --duration 30m load-test.js

# Step 3: Observe actual consumption during peak
kubectl top pods -n payments

# Step 4: Set requests = observed peak + 20% margin
```

Pros: Accounts for peak load, forward-looking, gives confidence
Cons: Requires load testing infrastructure, time-intensive
Which Approach to Use?
| Scenario | Recommended Approach |
|---|---|
| Existing service with 30+ days of data | Utilization-based |
| New service, no production data | Profiling (load test first) |
| Seasonal workload (Black Friday, etc.) | Profiling + seasonal adjustment |
| Batch/cron jobs | Utilization-based on last 10 runs |
| Critical path (payment, auth) | Both — profile then validate with utilization |
Rightsizing Workflow
A structured approach to rightsizing across your cluster:
Phase 1: Discovery (Week 1)

```bash
# Find the biggest gaps between requests and usage
# This script lists the requests and limits of every container outside kube-* namespaces;
# compare its output against observed usage to rank workloads by waste potential

cat > /tmp/rightsizing_discovery.sh << 'SCRIPT'
#!/bin/bash
echo "=== Rightsizing Discovery Report ==="
echo "Date: $(date +%Y-%m-%d)"
echo ""

for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep -v kube); do
  echo "--- Namespace: $ns ---"
  kubectl get pods -n "$ns" -o json 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
for pod in data.get('items', []):
    name = pod['metadata']['name']
    for c in pod['spec']['containers']:
        cname = c['name']
        req = c.get('resources', {}).get('requests', {})
        lim = c.get('resources', {}).get('limits', {})
        cpu_req = req.get('cpu', 'none')
        mem_req = req.get('memory', 'none')
        cpu_lim = lim.get('cpu', 'none')
        mem_lim = lim.get('memory', 'none')
        print(f'  {name}/{cname}: req={cpu_req}/{mem_req} lim={cpu_lim}/{mem_lim}')
" 2>/dev/null
  echo ""
done
SCRIPT

chmod +x /tmp/rightsizing_discovery.sh
bash /tmp/rightsizing_discovery.sh
```

Phase 2: Recommend (Week 2)
Deploy VPA in recommendation mode:

```yaml
# Create one VerticalPodAutoscaler like this for each workload in the target namespace
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"   # Recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "25m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
```

After 24-48 hours, check recommendations:

```bash
kubectl get vpa payment-api-vpa -n payments -o json | \
  python3 -c "
import json, sys
vpa = json.load(sys.stdin)
recs = vpa.get('status', {}).get('recommendation', {}).get('containerRecommendations', [])
for r in recs:
    print(f\"Container: {r['containerName']}\")
    print(f\"  Lower bound: CPU={r['lowerBound']['cpu']}, Mem={r['lowerBound']['memory']}\")
    print(f\"  Target:      CPU={r['target']['cpu']}, Mem={r['target']['memory']}\")
    print(f\"  Upper bound: CPU={r['upperBound']['cpu']}, Mem={r['upperBound']['memory']}\")
    print(f\"  Uncapped:    CPU={r['uncappedTarget']['cpu']}, Mem={r['uncappedTarget']['memory']}\")
"
```

Phase 3: Apply (Week 3-4)
Apply changes progressively:

```bash
# Start with non-critical workloads
# Apply VPA target recommendation + 15% margin for CPU, +20% for memory

# Example: VPA recommends cpu=120m, memory=180Mi
# Apply: cpu=138m (round to 150m), memory=216Mi (round to 256Mi)

kubectl set resources deployment/payment-api -n payments \
  --requests=cpu=150m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```

Phase 4: Validate (Week 4+)

```bash
# Monitor after rightsizing
# Watch for OOMKills, CPU throttling, and latency changes

# Check for OOM-killed containers (recorded in each container's last terminated state)
kubectl get pods -n payments \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled

# Check for CPU throttling (Prometheus)
# rate(container_cpu_cfs_throttled_seconds_total[5m]) should stay low

# Check application latency (compare before/after)
# Use your APM tool or Prometheus histograms
```

Stop and think: Why is it dangerous to set memory requests equal to average usage instead of p95 or p99?
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Rightsizing without monitoring | ”Just reduce requests, what could go wrong?” | Always monitor OOMKills and throttling for 72+ hours after changes |
| Setting requests = average usage | Average hides peaks | Use p95 or p99 + margin, never average |
| Rightsizing memory too aggressively | Memory OOMKill is instant death | Keep 20-25% margin above p99 for memory |
| Ignoring JVM/Go runtime overhead | Language runtimes reserve memory beyond app needs | Account for GC heap, goroutine stacks, etc. |
| Rightsizing once and forgetting | Usage patterns change over time | Review VPA recommendations monthly |
| Applying Auto VPA in production immediately | Pod evictions during traffic | Start with Off mode, then Initial, then Auto with PDBs |
| Not setting VPA bounds | VPA might recommend 1m CPU or 100 CPU | Always set minAllowed and maxAllowed |
| Rightsizing without PDBs | VPA evicts pods, service goes down | Set PodDisruptionBudgets before enabling Auto mode |
Question 1
Scenario: You are auditing a legacy batch processing application. The main Pod requests 2 CPU and 8Gi memory. Over the last 14 days, Prometheus metrics show its CPU p95 usage is 340m and memory p95 is 2.1Gi. The tech lead asks you to provide new resource request recommendations to cut costs without risking stability. What would you recommend, and how did you arrive at those numbers?
Show Answer
CPU: Recommend 400m. Memory: Recommend about 2.6Gi, or 3Gi for extra headroom.
Here is why: For CPU, take the p95 usage of 340m and add a ~15% safety margin (340m × 1.15 = 391m), rounding up to 400m. For memory, because under-provisioning leads to OOM-kills rather than throttling, apply a larger safety margin of at least 20%. Taking the 2.1Gi p95 and adding 20% gives 2.52Gi, which rounds up to about 2.6Gi, or to 3Gi for extra safety. These values cut the CPU request by 80% (2000m to 400m) and the memory request by over 60% (8Gi to 3Gi or less) without risking application stability.
Question 2
Scenario: A junior engineer on your team proposes a new rightsizing policy: “Set all resource requests (both CPU and memory) to exactly the p95 usage observed over the last 30 days.” You need to explain why this policy is dangerous for the application’s reliability. How do you explain the difference between CPU and memory under-provisioning?
Show Answer
If a container exceeds its allocated CPU, the Linux kernel simply throttles it by limiting its CPU time. The application will run slower and latency will increase, but the process remains alive and can eventually recover once the load decreases. However, memory is an incompressible resource; if a container attempts to allocate more memory than its limit, the kernel immediately terminates it via an OOM-kill. This catastrophic termination can corrupt in-flight transactions, cause data loss, and lead to service outages. Therefore, memory requests and limits must always include a significantly larger safety margin than CPU to absorb sudden spikes without killing the application.
Question 3
Scenario: Your team has deployed a critical payment API using the Horizontal Pod Autoscaler (HPA) to scale replicas based on CPU utilization. A colleague now wants to enable the Vertical Pod Autoscaler (VPA) to automatically optimize resource requests for the same Deployment. They ask you if this is a safe configuration. How do you advise them to configure VPA and HPA to work together?
Show Answer
You should advise them that VPA and HPA can only safely coexist if they are configured to manage entirely different resource dimensions. If both autoscalers attempt to respond to CPU metrics simultaneously, they will conflict—VPA will try to increase the per-pod CPU requests while HPA tries to add more replicas, leading to unpredictable scaling behavior and thrashing. The safe pattern is to configure VPA to manage only memory by setting its controlledResources to ["memory"], while allowing HPA to continue scaling the replica count based purely on CPU utilization. This ensures each autoscaler operates independently without interfering with the other’s scaling logic.
Question 4
Scenario: You are tasked with rolling out VPA across a production cluster that hosts dozens of microservices. You want to gain visibility into resource waste, but the engineering teams are terrified that automated changes will cause pod evictions and unexpected downtime. Which VPA update mode should you use to start this initiative, and how does the adoption path look over time?
Show Answer
You should start by deploying VPA in Off mode for all workloads. In this mode, VPA acts purely as an observability tool—it analyzes historical usage and generates recommendations without applying any changes or evicting running pods. This allows engineering teams to review the suggested requests, compare them against their own understanding of the workload, and build trust in the tool’s accuracy. Once the teams are confident in the recommendations, you can transition to Initial mode for new deployments, and eventually to Auto mode for full automation, provided that proper PodDisruptionBudgets are in place to ensure safe evictions.
Question 5
Scenario: You’ve run a cluster-wide analysis and identified that 50 different Deployments are significantly over-provisioned. Your FinOps manager wants to see a quick reduction in the monthly cloud bill, but the SRE team insists on minimizing risk to critical user journeys. How do you prioritize which Deployments to rightsize first?
Show Answer
You should prioritize workloads by calculating their ‘waste potential’, which is the difference between requested and used resources multiplied by the number of replicas and the unit cost. To balance cost savings with risk, you start by targeting non-critical workloads (such as staging environments, batch jobs, or internal tools) that exhibit the largest request-usage gaps and run with high replica counts. Additionally, you should prioritize stateless services over stateful ones, as stateless applications can recover seamlessly from unexpected OOM-kills via simple restarts. By following this strategy, you secure the largest and safest financial wins early on while gradually building the organizational confidence needed to rightsize the more sensitive, mission-critical applications later.
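The "waste potential" ranking described in this answer can be sketched in Python. Everything in the example is illustrative: the workload names and figures are made up, the $0.05 per CPU-hour price and 730 hours per month are taken from the hands-on exercise as placeholder values, and the ordering rule (non-critical first, then largest waste) is one reasonable reading of the answer rather than a fixed formula.

```python
CPU_HOUR_USD = 0.05      # placeholder unit cost
HOURS_PER_MONTH = 730

# Hypothetical inventory: requests and p95 usage in millicores
workloads = [
    {"name": "checkout-api", "request_m": 1000, "p95_m": 250, "replicas": 6, "critical": True},
    {"name": "report-batch", "request_m": 2000, "p95_m": 340, "replicas": 4, "critical": False},
    {"name": "internal-wiki", "request_m": 500, "p95_m": 40, "replicas": 2, "critical": False},
]


def monthly_waste_usd(workload):
    """(requested - used) x replicas x unit cost, expressed per month."""
    wasted_cores = (workload["request_m"] - workload["p95_m"]) * workload["replicas"] / 1000
    return wasted_cores * CPU_HOUR_USD * HOURS_PER_MONTH


# Non-critical workloads first, then largest waste first within each group.
ranked = sorted(workloads, key=lambda w: (w["critical"], -monthly_waste_usd(w)))

for w in ranked:
    print(f"{w['name']}: ${monthly_waste_usd(w):.2f}/month wasted, critical={w['critical']}")
```

Sorting on the `critical` flag before the waste figure keeps the riskiest services at the end of the queue even when they carry the largest absolute waste.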
Hands-On Exercise: VPA Recommendation Mode
Deploy VPA in recommendation mode on an over-provisioned Deployment and analyze the recommendations.
Prerequisites
- `kind` or `minikube` cluster running
- `kubectl` configured
- metrics-server installed
Step 1: Install VPA
```bash
# Clone the VPA repository
git clone https://github.com/kubernetes/autoscaler.git /tmp/autoscaler
cd /tmp/autoscaler/vertical-pod-autoscaler

# Install VPA components
./hack/vpa-up.sh

# Verify VPA is running
kubectl get pods -n kube-system | grep vpa
```

Expected output:

```
vpa-admission-controller-xxx   1/1   Running   0   30s
vpa-recommender-xxx            1/1   Running   0   30s
vpa-updater-xxx                1/1   Running   0   30s
```

Step 2: Deploy an Over-Provisioned Workload
```bash
# Create namespace
kubectl create namespace rightsizing-lab

# Deploy a massively over-provisioned nginx
kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioned-app
  namespace: rightsizing-lab
  labels:
    app: overprovisioned-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioned-app
  template:
    metadata:
      labels:
        app: overprovisioned-app
    spec:
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests:
              cpu: "1000m"     # Way too much for nginx
              memory: "1Gi"    # Way too much for nginx
            limits:
              cpu: "2000m"
              memory: "2Gi"
          ports:
            - containerPort: 80
EOF

# Create a Service so the load-generator can reach the app via DNS
kubectl apply -f - << 'EOF'
apiVersion: v1
kind: Service
metadata:
  name: overprovisioned-app
  namespace: rightsizing-lab
spec:
  selector:
    app: overprovisioned-app
  ports:
    - port: 80
      targetPort: 80
EOF

# Wait for pods to be ready
kubectl rollout status deployment/overprovisioned-app -n rightsizing-lab

# Verify resources
kubectl get pods -n rightsizing-lab -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
CPU_LIM:.spec.containers[0].resources.limits.cpu,\
MEM_LIM:.spec.containers[0].resources.limits.memory
```

Step 3: Create VPA in Off Mode
```bash
kubectl apply -f - << 'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: overprovisioned-app-vpa
  namespace: rightsizing-lab
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: overprovisioned-app
  updatePolicy:
    updateMode: "Off"   # Recommendation only, no changes applied
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: "10m"
          memory: "32Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]
EOF

echo "VPA created in Off mode. Waiting for recommendations..."
```

Step 4: Generate Some Load
```bash
# Simulate light traffic to give VPA usage data
kubectl run load-generator \
  --namespace=rightsizing-lab \
  --image=busybox \
  --restart=Never \
  --command -- sh -c "
    while true; do
      wget -q -O- http://overprovisioned-app.rightsizing-lab.svc.cluster.local/ > /dev/null 2>&1
      sleep 0.5
    done
  "

echo "Load generator running. Wait 5-10 minutes for VPA to collect data..."
```

Step 5: Review VPA Recommendations
```bash
# After 5-10 minutes, check VPA recommendations
kubectl get vpa overprovisioned-app-vpa -n rightsizing-lab -o yaml | \
  grep -A 30 "recommendation:"
```

Expected output (values will vary):

```yaml
recommendation:
  containerRecommendations:
  - containerName: app
    lowerBound:
      cpu: 10m
      memory: 48Mi
    target:
      cpu: 15m
      memory: 62Mi
    uncappedTarget:
      cpu: 15m
      memory: 62Mi
    upperBound:
      cpu: 42m
      memory: 131Mi
```

Step 6: Analyze the Results
```bash
cat > /tmp/analyze_vpa.sh << 'SCRIPT'
#!/bin/bash
echo "============================================"
echo " VPA Rightsizing Analysis"
echo "============================================"
echo ""

# Current requests
echo "CURRENT REQUESTS (per replica):"
echo "  CPU:    1000m"
echo "  Memory: 1Gi (1024Mi)"
echo ""

# Get VPA recommendations
VPA_JSON=$(kubectl get vpa overprovisioned-app-vpa -n rightsizing-lab -o json 2>/dev/null)

if [ -z "$VPA_JSON" ]; then
  echo "ERROR: VPA not found or no recommendations yet."
  echo "Wait a few more minutes and try again."
  exit 1
fi

echo "$VPA_JSON" | python3 -c "
import json, math, sys
data = json.load(sys.stdin)
recs = data.get('status', {}).get('recommendation', {}).get('containerRecommendations', [])
if not recs:
    print('No recommendations available yet. Wait 5-10 minutes.')
    sys.exit(0)
r = recs[0]
print('VPA RECOMMENDATIONS:')
print(f\"  Target:      CPU={r['target']['cpu']}, Memory={r['target']['memory']}\")
print(f\"  Lower bound: CPU={r['lowerBound']['cpu']}, Memory={r['lowerBound']['memory']}\")
print(f\"  Upper bound: CPU={r['upperBound']['cpu']}, Memory={r['upperBound']['memory']}\")
print()

# Parse target values for savings calculation
cpu_target = r['target']['cpu']
if cpu_target.endswith('m'):
    cpu_target_m = int(cpu_target[:-1])
else:
    cpu_target_m = int(float(cpu_target) * 1000)

# Memory may be reported with binary (Ki/Mi/Gi) or decimal (k/M/G) suffixes, or as plain bytes
mem_target = r['target']['memory']
units = {'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'k': 1000, 'M': 1000**2, 'G': 1000**3}
for suffix, factor in units.items():
    if mem_target.endswith(suffix):
        mem_bytes = float(mem_target[:-len(suffix)]) * factor
        break
else:
    mem_bytes = float(mem_target)
mem_target_mi = int(mem_bytes / 1048576)

cpu_savings = ((1000 - cpu_target_m) / 1000) * 100
mem_savings = ((1024 - mem_target_mi) / 1024) * 100

print('SAVINGS ANALYSIS:')
print(f'  CPU:    {1000}m → {cpu_target_m}m = {cpu_savings:.0f}% reduction')
print(f'  Memory: 1024Mi → {mem_target_mi}Mi = {mem_savings:.0f}% reduction')
print()

# With margin (round UP so the margin is never lost to rounding)
cpu_safe = math.ceil(cpu_target_m * 1.15 / 5) * 5      # 15% margin, round up to 5m
mem_safe = math.ceil(mem_target_mi * 1.20 / 16) * 16   # 20% margin, round up to 16Mi
print('RECOMMENDED NEW REQUESTS (with safety margin):')
print(f'  CPU:    {max(cpu_safe, 25)}m (target + 15%)')
print(f'  Memory: {max(mem_safe, 64)}Mi (target + 20%)')
print()
print('ESTIMATED MONTHLY SAVINGS (3 replicas):')
cpu_saved = (1000 - max(cpu_safe, 25)) / 1000 * 3
print(f'  CPU: {cpu_saved:.2f} cores freed across cluster')
print(f'  At \$0.05/CPU-hr: ~\${cpu_saved * 0.05 * 730:.2f}/month')
"
SCRIPT

chmod +x /tmp/analyze_vpa.sh
bash /tmp/analyze_vpa.sh
```

Step 7: Apply Rightsized Resources
```bash
# Apply the VPA-recommended values with margin
# Adjust these based on your actual VPA output
kubectl set resources deployment/overprovisioned-app \
  -n rightsizing-lab \
  --requests=cpu=25m,memory=64Mi \
  --limits=cpu=100m,memory=256Mi

# Watch the rollout
kubectl rollout status deployment/overprovisioned-app -n rightsizing-lab

# Verify new resource allocation
kubectl get pods -n rightsizing-lab -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory
```

Step 8: Cleanup
```bash
# Stop the load generator first, then remove the namespace and everything in it
kubectl delete pod load-generator -n rightsizing-lab --ignore-not-found
kubectl delete namespace rightsizing-lab
```

Success Criteria
You’ve completed this exercise when you:
- Deployed VPA and verified all three components are running
- Created an over-provisioned Deployment (1000m CPU, 1Gi memory for nginx)
- Deployed VPA in Off mode and generated recommendations
- Analyzed VPA recommendations and calculated savings
- Applied rightsized resources with safety margins
- Verified the Deployment runs correctly with reduced resources
Key Takeaways
- The request-usage gap is the largest source of Kubernetes waste: most workloads use 10-20% of what they request
- VPA automates rightsizing recommendations — start with Off mode, graduate to Auto
- Memory needs more margin than CPU — CPU throttling is graceful, OOM-killing is catastrophic
- HPA and VPA can coexist — VPA on memory, HPA on CPU
- Rightsizing is continuous — usage patterns change, review recommendations monthly
Further Reading
Projects:
- Kubernetes VPA — github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
- Goldilocks — github.com/FairwindsOps/goldilocks (VPA dashboard for all workloads)
Articles:
- “Right-Sizing Your Kubernetes Workloads” — learnk8s.io
- “VPA Best Practices” — povilasv.me/vertical-pod-autoscaler-best-practices
- “CPU Limits in Kubernetes Are Harmful” — robusta.dev (why some teams remove CPU limits)
Talks:
- “To Limit or Not to Limit: Kubernetes Resource Management” — KubeCon (YouTube)
- “Goldilocks: Getting Kubernetes Resource Requests Just Right” — Fairwinds (YouTube)
Summary
Rightsizing is the highest-ROI FinOps activity in Kubernetes. By using VPA recommendations, Prometheus metrics, and structured workflows, teams can typically reduce compute costs by 40-70% without impacting application performance. The key is to start with visibility (Off mode VPA), apply changes gradually (non-critical workloads first), and monitor aggressively after changes (OOMKills, throttling, latency). Rightsizing is not a one-time project; it’s a continuous practice that should be reviewed monthly as usage patterns evolve.
Next Module
Continue to Module 1.4: Cluster Scaling & Compute Optimization to learn how Karpenter, Spot instances, and node consolidation reduce infrastructure costs at the cluster level.
“The most expensive resource is the one nobody’s using.” — FinOps proverb