
Module 1.3: Workload Rightsizing & Optimization

Discipline Module | Complexity: [MEDIUM] | Time: 2.5h

Before starting this module:

  • Required: Module 1.2: Kubernetes Cost Allocation — Cost visibility and attribution
  • Required: Understanding of Kubernetes resource requests and limits
  • Required: Familiarity with Deployments, Pods, and container resource management
  • Recommended: Experience with kubectl top and metrics-server
  • Recommended: Access to a local Kubernetes cluster (kind or minikube)

After completing this module, you will be able to:

  • Implement resource rightsizing recommendations using VPA, Goldilocks, or custom analysis scripts
  • Design rightsizing workflows that validate changes in staging before applying to production workloads
  • Analyze resource request and limit patterns to identify over-provisioned and under-provisioned workloads
  • Build automated rightsizing pipelines that continuously optimize resource allocations based on actual usage

In Module 1.2, you learned that the average Kubernetes cluster runs at 13-18% CPU utilization. That means for every dollar you spend on compute, roughly 82-87 cents buys unused capacity.

Why does this happen? Because engineers are rational.

When a developer sets resource requests, they face an asymmetric risk: request too little and the app crashes at 3 AM. Request too much and… nothing bad happens. The cost is invisible; the outage is a PagerDuty alert. So developers round up. Way up.

graph TD
subgraph "The Developer's Dilemma"
A["'My app uses ~200m CPU normally, but once last<br>quarter it spiked to 800m during Black Friday.<br>I'll request 1000m to be safe.'"]
B["Actual usage (p95): 250m CPU<br>Requested: 1000m CPU<br>Wasted: 750m CPU (75%)"]
C["Annual waste per replica: ~$270<br>× 6 replicas: ~$1,620/year<br>× 80 similar services: ~$129,600/year"]
D["That's one senior engineer's salary in waste."]
end
A --> B
B --> C
C --> D

Rightsizing is the practice of aligning resource requests with actual usage. It’s the single highest-ROI FinOps activity for Kubernetes — and this module shows you exactly how to do it.


  • Google’s internal research showed that container resource requests are typically set 5-10x higher than actual usage across most workloads. This isn’t laziness — it’s rational risk aversion. Nobody gets fired for over-provisioning, but under-provisioning causes visible outages.

  • The Vertical Pod Autoscaler (VPA) was created specifically to solve rightsizing. Originally developed by Google, it’s now a Kubernetes autoscaler project that observes actual resource consumption over time and recommends (or automatically applies) right-sized resource requests.

  • Memory rightsizing is trickier than CPU rightsizing. If you under-provision CPU, the container gets throttled (slow but alive). If you under-provision memory, the container gets OOM-killed (dead). This asymmetry means memory requests should include a larger safety margin — typically 15-25% above peak observed usage.


The first step in rightsizing is finding where the biggest gaps exist between what’s requested and what’s used.

# Quick check: resource requests vs actual usage
kubectl top pods -n payments --containers
POD                          NAME     CPU(cores)   MEMORY(bytes)
payment-api-7d8f9c-abc12     api      23m          84Mi
payment-api-7d8f9c-def34     api      31m          91Mi
payment-api-7d8f9c-ghi56     api      18m          78Mi
payment-worker-5b6c7-jkl89   worker   8m           42Mi
payment-worker-5b6c7-mno01   worker   5m           38Mi

Compare against requests:

payment-api:
Requested: 200m CPU, 256Mi memory (per replica)
Actual: ~24m CPU, ~84Mi memory (average)
Gap: 176m CPU (88%), 172Mi memory (67%)
payment-worker:
Requested: 100m CPU, 128Mi memory (per replica)
Actual: ~7m CPU, ~40Mi memory (average)
Gap: 93m CPU (93%), 88Mi memory (69%)
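
The gap figures above are simple arithmetic. A minimal sketch that reproduces them, assuming the request and usage values are already available as plain numbers (the helper name `request_gap` is illustrative, not part of any tool):

```python
def request_gap(requested: float, used: float) -> tuple[float, float]:
    """Return (absolute gap, gap as a percentage of the request)."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    gap = requested - used
    return gap, 100.0 * gap / requested


# Figures from the payment-api comparison (CPU in millicores, memory in Mi).
cpu_gap, cpu_pct = request_gap(200, 24)
mem_gap, mem_pct = request_gap(256, 84)
print(f"CPU gap: {cpu_gap:.0f}m ({cpu_pct:.0f}%)")      # CPU gap: 176m (88%)
print(f"Memory gap: {mem_gap:.0f}Mi ({mem_pct:.0f}%)")  # Memory gap: 172Mi (67%)
```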

For historical analysis over days or weeks (not just a point-in-time snapshot):

# Average CPU usage vs requests over 7 days, by container
# (rate() over a 7d range gives the mean cores used across that window)
avg by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[7d])
) / on(namespace, pod, container) group_left()
kube_pod_container_resource_requests{resource="cpu"}
# Returns values like 0.12, meaning 12% of requested CPU is actually used

# Memory usage vs requests over 7 days
avg by (namespace, pod, container) (
  avg_over_time(container_memory_working_set_bytes{container!=""}[7d])
) / on(namespace, pod, container) group_left()
kube_pod_container_resource_requests{resource="memory"}
# Returns values like 0.33, meaning 33% of requested memory is used

# Find the worst offenders: pods whose CPU usage over the last hour is below 10% of requests
# (usage is summed across containers so it matches the summed requests)
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[1h])
) / on(namespace, pod) group_left()
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="cpu"}
) < 0.10

Categorize workloads based on their usage patterns:

| Category | CPU Usage vs Request | Memory Usage vs Request | Action |
|---|---|---|---|
| Massively over-provisioned | < 15% | < 30% | Rightsize immediately (easy win) |
| Moderately over-provisioned | 15-40% | 30-60% | Rightsize with monitoring |
| Reasonably sized | 40-70% | 60-80% | Monitor, minor adjustments |
| Tight | 70-85% | 80-90% | Watch carefully, might need increase |
| Under-provisioned | > 85% | > 90% | Increase requests immediately |
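
The table can be applied mechanically once you have usage-to-request ratios. The sketch below is one possible encoding, with two assumptions that the table itself does not state: a value exactly on a boundary falls into the higher band, and when CPU and memory land in different categories the tighter one wins.

```python
# Exclusive upper bounds on the usage/request ratio for each category.
CPU_BANDS = [(0.15, "Massively over-provisioned"), (0.40, "Moderately over-provisioned"),
             (0.70, "Reasonably sized"), (0.85, "Tight")]
MEM_BANDS = [(0.30, "Massively over-provisioned"), (0.60, "Moderately over-provisioned"),
             (0.80, "Reasonably sized"), (0.90, "Tight")]
# Categories ordered from loosest to tightest.
ORDER = ["Massively over-provisioned", "Moderately over-provisioned",
         "Reasonably sized", "Tight", "Under-provisioned"]


def band(ratio: float, bands: list[tuple[float, str]]) -> str:
    """Return the first category whose upper bound exceeds the ratio."""
    for upper, label in bands:
        if ratio < upper:
            return label
    return "Under-provisioned"


def classify(cpu_ratio: float, mem_ratio: float) -> str:
    """Assumption: report the tighter of the CPU and memory categories."""
    labels = (band(cpu_ratio, CPU_BANDS), band(mem_ratio, MEM_BANDS))
    return max(labels, key=ORDER.index)


# payment-api from the earlier example: 24m/200m CPU, 84Mi/256Mi memory.
print(classify(24 / 200, 84 / 256))  # Moderately over-provisioned
```

Here the CPU ratio (12%) alone would say "massively over-provisioned", but memory (about 33%) sits in the moderate band, so the workload gets the more cautious label.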

Pause and predict: If you scale up replicas using HPA based on CPU, and VPA also tries to change CPU requests, what might happen?

VPA watches actual resource consumption over time and adjusts (or recommends) resource requests accordingly.

graph LR
A["Observe<br>usage<br>metrics<br>(Recommender)"] --> B["Calculate<br>optimal<br>requests<br>(Recommender)"]
B --> C["Apply<br>new<br>requests<br>(Updater — optional)"]
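
| Component | Role | Required? |
|---|---|---|
| Recommender | Watches usage, calculates recommendations | Yes |
| Updater | Evicts pods to apply new requests | Only for Auto mode |
| Admission Controller | Sets requests on new pods | Only for Auto/Initial modes |

| Mode | Behavior | Use Case |
|---|---|---|
| Off | Only generates recommendations, applies nothing | Start here: review before changing anything |
| Initial | Sets requests on pod creation, doesn’t change running pods | Safe for new deployments |
| Auto | Evicts and recreates pods with updated requests | Fully automated rightsizing |
| Recreate | Evicts and recreates pods with updated requests (Auto currently behaves the same way) | When you want eviction-based updates stated explicitly; check the VPA docs for your version, since mode names have changed across releases |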

Best practice: Always start with Off mode to review recommendations before trusting VPA to change anything automatically.

Before you go all-in on VPA, know the gotchas:

  1. VPA and HPA conflict on CPU/memory — Don’t use both to scale the same metric. VPA adjusts requests; HPA adjusts replicas. If both try to respond to CPU, they fight.

  2. VPA evicts pods to update — In Auto mode, VPA kills running pods to apply new resource values. This means brief disruption. Use PodDisruptionBudgets.

  3. VPA needs history — Recommendations improve with more data. Give VPA at least 24-48 hours (ideally 7 days) of data before trusting its recommendations.

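  4. VPA doesn’t choose limits independently. Its recommendations are for requests; by default (controlledValues: RequestsAndLimits) it scales existing limits proportionally to keep the request-to-limit ratio. Set controlledValues: RequestsOnly to leave limits untouched, and manage limits with a separate policy.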

  5. VPA ignores burst patterns — If your app spikes to 2000m CPU for 5 seconds every hour, VPA might not capture that in its recommendation.
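
To see why short bursts can vanish from a percentile-based recommendation (gotcha 5), here is a small illustration with synthetic data. It uses a plain nearest-rank percentile, which is an assumption made for simplicity and not VPA's actual algorithm; the point is only that a few seconds of spike per hour barely register in high percentiles.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One hour of per-second CPU samples (millicores), synthetic:
# steady 200m with a single 5-second spike to 2000m.
samples = [200] * 3595 + [2000] * 5

print(percentile(samples, 95))  # 200
print(percentile(samples, 99))  # 200
print(max(samples))             # 2000
```

If the spike matters for latency, cover it with limits or a deliberately higher request instead of relying on the percentile alone.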


The Horizontal Pod Autoscaler (HPA) scales replicas. Most teams configure it for availability — but it’s also a powerful cost optimization tool.

# Cost-optimized HPA (higher utilization target, bounded scale-up and scale-down steps)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2   # Don't go below 2 for HA
  maxReplicas: 12  # Cap the spend
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65  # Higher than the 50% seen in many examples, so fewer replicas run
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120  # Wait 2 min before scaling up
      policies:
        - type: Pods
          value: 2            # Add max 2 pods at a time
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25           # Remove max 25% of pods at a time
          periodSeconds: 120

| Setting | Cost Impact | Risk |
|---|---|---|
| Higher target utilization (65-80%) | Lower cost: fewer replicas needed | Higher latency during spikes |
| Lower minReplicas | Lower baseline cost | Slower response to sudden load |
| Faster scaleDown | Less idle capacity | Thrashing if load fluctuates |
| Slower scaleUp | Temporary under-capacity | Brief degradation during ramp |
| Custom metrics (queue depth) | Scale on actual demand, not CPU | Requires metrics pipeline setup |
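
The cost effect of the target utilization follows from the core HPA scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / desiredMetric). The sketch below applies only that formula; it deliberately ignores the tolerance band, min/max replica bounds, and the behavior policies, so treat the numbers as an illustration and not a prediction.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Core HPA formula: ceil(currentReplicas * currentMetric / desiredMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# 4 replicas averaging 90% CPU utilization (relative to requests):
print(desired_replicas(4, 90, 50))  # 8
print(desired_replicas(4, 90, 65))  # 6
print(desired_replicas(4, 90, 80))  # 5
```

The same load needs 8 replicas at a 50% target but only 5 at 80%, which is the trade-off in the first row of the table: fewer replicas, less headroom.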

The trick is: let VPA handle resource requests and HPA handle replica count — but on different metrics.

# VPA: Right-size the per-pod resources
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"  # Recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]  # VPA manages memory ONLY
        minAllowed:
          memory: "64Mi"
        maxAllowed:
          memory: "2Gi"
---
# HPA: Scale replicas based on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Rule: VPA on memory, HPA on CPU. They don’t conflict because they manage different dimensions.


Stop and think: Does Kubernetes evict Pods based on how much they cost, or based on how their resources are configured?

Kubernetes assigns QoS classes to pods based on how requests and limits are configured. QoS affects eviction priority, which has cost implications.

# Guaranteed: highest priority, evicted last
# requests == limits for ALL containers
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"      # Same as request
    memory: "512Mi"  # Same as request

# Burstable: medium priority
# requests < limits (or limits not set for some resources)
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1000m"   # Higher than request
    memory: "1Gi"  # Higher than request

# BestEffort: lowest priority, evicted first
# NO requests or limits set at all
resources: {}  # Empty, no guarantees

| QoS Class | When to Use | Cost Implication |
|---|---|---|
| Guaranteed | Critical production workloads (databases, payment APIs) | Highest: you pay for the exact resources at all times |
| Burstable | Most production services | Medium: pay for requests, can burst higher when capacity is available |
| BestEffort | Batch jobs, dev/test, non-critical tasks | Lowest: no capacity is reserved, but pods are evicted first under pressure |

Cost-optimized strategy: Use Guaranteed only for truly critical workloads (< 20% of pods). Make most workloads Burstable. Use BestEffort for development and batch processing.
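
The three classification rules shown in the YAML comments can be sketched as a small function, which is handy for auditing manifests before they reach the cluster. This is a simplified sketch under stated assumptions: it looks only at cpu and memory, ignores init containers, and compares quantities as strings (so "500m" and "0.5" would count as different). The function name `qos_class` is illustrative.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' `resources` dicts (cpu and memory only)."""
    any_set = False
    guaranteed = True
    for c in containers:
        res = c.get("resources") or {}
        limits = res.get("limits") or {}
        # A missing request defaults to the limit, so mirror that here.
        requests = {**limits, **(res.get("requests") or {})}
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, with request == limit.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "500m", "memory": "512Mi"}
print(qos_class([{"resources": {"requests": same, "limits": same}}]))  # Guaranteed
print(qos_class([{"resources": {"requests": {"cpu": "200m", "memory": "256Mi"},
                                "limits": {"cpu": "1000m", "memory": "1Gi"}}}]))  # Burstable
print(qos_class([{"resources": {}}]))  # BestEffort
```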

pie title Cost-Optimized QoS Distribution (Target utilization: 55-70%)
"Burstable (standard)" : 65
"BestEffort (dev/batch)" : 20
"Guaranteed (critical)" : 15

Pause and predict: If a node runs out of memory, which Pod gets evicted first: a Burstable pod using 90% of its requested memory, or a BestEffort pod using 10% of its node’s memory?

Profiling vs Utilization-Based Rightsizing


Utilization-based rightsizing looks at historical usage and sets requests to match:

Approach: Watch metrics → set requests = p95 usage + margin
payment-api over 14 days:
CPU p50: 85m → Not useful (too low)
CPU p95: 210m → This is the target
CPU p99: 380m → Rare spikes
CPU max: 820m → One-time outlier
Recommendation: requests.cpu = 250m (p95 + 19% margin)
Previous: requests.cpu = 1000m
Savings: 750m CPU per replica (75% reduction)

Pros: Simple, data-driven, works for all workloads

Cons: Backward-looking, doesn’t account for future growth or rare events
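
The "p95 usage + margin" rule can be written down in a few lines. The sketch below assumes usage samples have already been exported (for example from Prometheus) as a list of millicore values; the sample data is synthetic and chosen only so that its percentiles match the payment-api figures above, and the 10m rounding step is an assumption.

```python
import math


def recommend_request(samples_m: list[float], pct: float = 95,
                      margin: float = 0.19, step_m: int = 10) -> int:
    """Nearest-rank percentile plus a margin, rounded up to a multiple of step_m millicores."""
    ordered = sorted(samples_m)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    base = ordered[rank - 1]
    return math.ceil(base * (1 + margin) / step_m) * step_m


# Synthetic 100-sample history: p95 = 210m, with rare values of 380m and 820m.
samples = [85] * 60 + [150] * 34 + [210] * 4 + [380] + [820]

new_request = recommend_request(samples)
print(f"requests.cpu = {new_request}m")                   # requests.cpu = 250m
print(f"saved per replica: {1000 - new_request}m")        # saved per replica: 750m
```

Rounding up instead of to the nearest step keeps the margin from being eroded by the rounding itself.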

Profiling-based rightsizing measures actual resource needs through controlled tests:

# Load test to find true resource ceiling
# Using k6 or similar load testing tool
# Step 1: Deploy with generous resources
kubectl set resources deployment/payment-api -n payments \
  --requests=cpu=2000m,memory=2Gi \
  --limits=cpu=4000m,memory=4Gi
# Step 2: Run load test at expected peak traffic
k6 run --vus 200 --duration 30m load-test.js
# Step 3: Observe actual consumption during peak
kubectl top pods -n payments
# Step 4: Set requests = observed peak + 20% margin

Pros: Accounts for peak load, forward-looking, gives confidence

Cons: Requires load testing infrastructure, time-intensive

| Scenario | Recommended Approach |
|---|---|
| Existing service with 30+ days of data | Utilization-based |
| New service, no production data | Profiling (load test first) |
| Seasonal workload (Black Friday, etc.) | Profiling + seasonal adjustment |
| Batch/cron jobs | Utilization-based on last 10 runs |
| Critical path (payment, auth) | Both: profile, then validate with utilization |

A structured approach to rightsizing across your cluster:

# Inventory the requests and limits of every container, namespace by namespace.
# Compare this output against usage data to rank workloads by waste potential.
cat > /tmp/rightsizing_discovery.sh << 'SCRIPT'
#!/bin/bash
echo "=== Rightsizing Discovery Report ==="
echo "Date: $(date +%Y-%m-%d)"
echo ""
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep -v kube); do
  echo "--- Namespace: $ns ---"
  kubectl get pods -n "$ns" -o json 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
for pod in data.get('items', []):
    name = pod['metadata']['name']
    for c in pod['spec']['containers']:
        cname = c['name']
        req = c.get('resources', {}).get('requests', {})
        lim = c.get('resources', {}).get('limits', {})
        cpu_req = req.get('cpu', 'none')
        mem_req = req.get('memory', 'none')
        cpu_lim = lim.get('cpu', 'none')
        mem_lim = lim.get('memory', 'none')
        print(f'  {name}/{cname}: req={cpu_req}/{mem_req} lim={cpu_lim}/{mem_lim}')
" 2>/dev/null
  echo ""
done
SCRIPT
chmod +x /tmp/rightsizing_discovery.sh
bash /tmp/rightsizing_discovery.sh

Deploy VPA in recommendation mode:

# VPA for one Deployment; create one VPA object per workload in the target namespace
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"  # Recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "25m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"

After 24-48 hours, check recommendations:

kubectl get vpa payment-api-vpa -n payments -o json | \
python3 -c "
import json, sys
vpa = json.load(sys.stdin)
recs = vpa.get('status', {}).get('recommendation', {}).get('containerRecommendations', [])
for r in recs:
    print(f\"Container: {r['containerName']}\")
    print(f\"  Lower bound: CPU={r['lowerBound']['cpu']}, Mem={r['lowerBound']['memory']}\")
    print(f\"  Target: CPU={r['target']['cpu']}, Mem={r['target']['memory']}\")
    print(f\"  Upper bound: CPU={r['upperBound']['cpu']}, Mem={r['upperBound']['memory']}\")
    print(f\"  Uncapped: CPU={r['uncappedTarget']['cpu']}, Mem={r['uncappedTarget']['memory']}\")
"

Apply changes progressively:

# Start with non-critical workloads
# Apply VPA target recommendation + 15% margin for CPU, +20% for memory
# Example: VPA recommends cpu=120m, memory=180Mi
# Apply: cpu=138m (round to 150m), memory=216Mi (round to 256Mi)
kubectl set resources deployment/payment-api -n payments \
--requests=cpu=150m,memory=256Mi \
--limits=cpu=500m,memory=512Mi

# Monitor after rightsizing
# Watch for OOMKills, CPU throttling, and latency changes
# Check for OOMKills: list containers whose last termination reason was OOMKilled
kubectl get pods -n payments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
# Check for CPU throttling (Prometheus)
# container_cpu_cfs_throttled_seconds_total should stay low
# Check application latency (compare before/after)
# Use your APM tool or Prometheus histograms

Stop and think: Why is it dangerous to set memory requests equal to average usage instead of p95 or p99?

| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Rightsizing without monitoring | “Just reduce requests, what could go wrong?” | Always monitor OOMKills and throttling for 72+ hours after changes |
| Setting requests = average usage | Average hides peaks | Use p95 or p99 + margin, never average |
| Rightsizing memory too aggressively | Memory OOMKill is instant death | Keep 20-25% margin above p99 for memory |
| Ignoring JVM/Go runtime overhead | Language runtimes reserve memory beyond app needs | Account for GC heap, goroutine stacks, etc. |
| Rightsizing once and forgetting | Usage patterns change over time | Review VPA recommendations monthly |
| Applying Auto VPA in production immediately | Pod evictions during traffic | Start with Off mode, then Initial, then Auto with PDBs |
| Not setting VPA bounds | VPA might recommend 1m CPU or 100 CPU | Always set minAllowed and maxAllowed |
| Rightsizing without PDBs | VPA evicts pods, service goes down | Set PodDisruptionBudgets before enabling Auto mode |

Scenario: You are auditing a legacy batch processing application. The main Pod requests 2 CPU and 8Gi memory. Over the last 14 days, Prometheus metrics show its CPU p95 usage is 340m and memory p95 is 2.1Gi. The tech lead asks you to provide new resource request recommendations to cut costs without risking stability. What would you recommend, and how did you arrive at those numbers?

Answer:

CPU: Recommend 400m. Memory: Recommend roughly 2.6Gi to 3Gi.

Here is why: For CPU, we take the p95 usage of 340m and add a ~15% safety margin (340m * 1.15 = 391m), rounding up to 400m. For memory, because under-provisioning leads to catastrophic OOM-kills rather than just graceful throttling, we apply a larger safety margin of at least 20%. Taking the 2.1Gi p95 and adding 20% gives 2.52Gi, which we round up to about 2.6Gi, or to 3Gi for extra safety. With these margins, CPU requests drop by 80% (2000m to 400m) and memory requests by more than 60% (8Gi to at most 3Gi) without risking application stability.

Scenario: A junior engineer on your team proposes a new rightsizing policy: “Set all resource requests (both CPU and memory) to exactly the p95 usage observed over the last 30 days.” You need to explain why this policy is dangerous for the application’s reliability. How do you explain the difference between CPU and memory under-provisioning?

Answer:

If a container exceeds its allocated CPU, the Linux kernel simply throttles it by limiting its CPU time. The application will run slower and latency will increase, but the process remains alive and can eventually recover once the load decreases. However, memory is an incompressible resource; if a container attempts to allocate more memory than its limit, the kernel immediately terminates it via an OOM-kill. This catastrophic termination can corrupt in-flight transactions, cause data loss, and lead to service outages. Therefore, memory requests and limits must always include a significantly larger safety margin than CPU to absorb sudden spikes without killing the application.

Scenario: Your team has deployed a critical payment API using the Horizontal Pod Autoscaler (HPA) to scale replicas based on CPU utilization. A colleague now wants to enable the Vertical Pod Autoscaler (VPA) to automatically optimize resource requests for the same Deployment. They ask you if this is a safe configuration. How do you advise them to configure VPA and HPA to work together?

Answer:

You should advise them that VPA and HPA can only safely coexist if they are configured to manage entirely different resource dimensions. If both autoscalers attempt to respond to CPU metrics simultaneously, they will conflict—VPA will try to increase the per-pod CPU requests while HPA tries to add more replicas, leading to unpredictable scaling behavior and thrashing. The safe pattern is to configure VPA to manage only memory by setting its controlledResources to ["memory"], while allowing HPA to continue scaling the replica count based purely on CPU utilization. This ensures each autoscaler operates independently without interfering with the other’s scaling logic.

Scenario: You are tasked with rolling out VPA across a production cluster that hosts dozens of microservices. You want to gain visibility into resource waste, but the engineering teams are terrified that automated changes will cause pod evictions and unexpected downtime. Which VPA update mode should you use to start this initiative, and how does the adoption path look over time?

Answer:

You should start by deploying VPA in Off mode for all workloads. In this mode, VPA acts purely as an observability tool—it analyzes historical usage and generates recommendations without applying any changes or evicting running pods. This allows engineering teams to review the suggested requests, compare them against their own understanding of the workload, and build trust in the tool’s accuracy. Once the teams are confident in the recommendations, you can transition to Initial mode for new deployments, and eventually to Auto mode for full automation, provided that proper PodDisruptionBudgets are in place to ensure safe evictions.

Scenario: You’ve run a cluster-wide analysis and identified that 50 different Deployments are significantly over-provisioned. Your FinOps manager wants to see a quick reduction in the monthly cloud bill, but the SRE team insists on minimizing risk to critical user journeys. How do you prioritize which Deployments to rightsize first?

Answer:

You should prioritize workloads by calculating their ‘waste potential’, which is the difference between requested and used resources multiplied by the number of replicas and the unit cost. To balance cost savings with risk, you start by targeting non-critical workloads (such as staging environments, batch jobs, or internal tools) that exhibit the largest request-usage gaps and run with high replica counts. Additionally, you should prioritize stateless services over stateful ones, as stateless applications can recover seamlessly from unexpected OOM-kills via simple restarts. By following this strategy, you secure the largest and safest financial wins early on while gradually building the organizational confidence needed to rightsize the more sensitive, mission-critical applications later.
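
The "waste potential" ordering described in this answer can be sketched as a sort key. The workload names and figures below are hypothetical, CPU is the only resource considered, and the $0.05 per CPU-hour rate is the same illustrative price used in the hands-on exercise.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    requested_cores: float
    used_cores: float
    replicas: int
    critical: bool


def waste_potential(w: Workload, cost_per_core_hour: float) -> float:
    """Hourly cost of requested-but-unused CPU across all replicas."""
    return max(w.requested_cores - w.used_cores, 0.0) * w.replicas * cost_per_core_hour


def prioritize(workloads: list[Workload], cost_per_core_hour: float = 0.05) -> list[Workload]:
    """Non-critical workloads first, then largest waste first within each group."""
    return sorted(workloads,
                  key=lambda w: (w.critical, -waste_potential(w, cost_per_core_hour)))


# Hypothetical inventory.
fleet = [
    Workload("payment-api", 1.0, 0.25, 6, critical=True),
    Workload("batch-report", 2.0, 0.34, 2, critical=False),
    Workload("staging-api", 1.0, 0.10, 4, critical=False),
]
for w in prioritize(fleet):
    print(f"{w.name}: ${waste_potential(w, 0.05):.3f}/hour wasted")
```

payment-api has the largest absolute waste but sorts last because it is marked critical, which matches the risk-first ordering the answer recommends.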


Hands-On Exercise: VPA Recommendation Mode


Deploy VPA in recommendation mode on an over-provisioned Deployment and analyze the recommendations.

  • kind or minikube cluster running
  • kubectl configured
  • metrics-server installed
# Clone the VPA repository
git clone https://github.com/kubernetes/autoscaler.git /tmp/autoscaler
cd /tmp/autoscaler/vertical-pod-autoscaler
# Install VPA components
./hack/vpa-up.sh
# Verify VPA is running
kubectl get pods -n kube-system | grep vpa

Expected output:

vpa-admission-controller-xxx 1/1 Running 0 30s
vpa-recommender-xxx 1/1 Running 0 30s
vpa-updater-xxx 1/1 Running 0 30s

Step 2: Deploy an Over-Provisioned Workload

# Create namespace
kubectl create namespace rightsizing-lab
# Deploy a massively over-provisioned nginx
kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioned-app
  namespace: rightsizing-lab
  labels:
    app: overprovisioned-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioned-app
  template:
    metadata:
      labels:
        app: overprovisioned-app
    spec:
      containers:
        - name: app
          image: nginx:alpine
          resources:
            requests:
              cpu: "1000m"   # Way too much for nginx
              memory: "1Gi"  # Way too much for nginx
            limits:
              cpu: "2000m"
              memory: "2Gi"
          ports:
            - containerPort: 80
EOF
# Create a Service so the load-generator can reach the app via DNS
kubectl apply -f - << 'EOF'
apiVersion: v1
kind: Service
metadata:
  name: overprovisioned-app
  namespace: rightsizing-lab
spec:
  selector:
    app: overprovisioned-app
  ports:
    - port: 80
      targetPort: 80
EOF
# Wait for pods to be ready
kubectl rollout status deployment/overprovisioned-app -n rightsizing-lab
# Verify resources
kubectl get pods -n rightsizing-lab -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
CPU_LIM:.spec.containers[0].resources.limits.cpu,\
MEM_LIM:.spec.containers[0].resources.limits.memory

# Create a VPA in Off mode for the Deployment
kubectl apply -f - << 'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: overprovisioned-app-vpa
  namespace: rightsizing-lab
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: overprovisioned-app
  updatePolicy:
    updateMode: "Off"  # Recommendation only, no changes applied
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: "10m"
          memory: "32Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]
EOF
echo "VPA created in Off mode. Waiting for recommendations..."

# Simulate light traffic to give VPA usage data
kubectl run load-generator \
  --namespace=rightsizing-lab \
  --image=busybox \
  --restart=Never \
  --command -- sh -c "
    while true; do
      wget -q -O- http://overprovisioned-app.rightsizing-lab.svc.cluster.local/ > /dev/null 2>&1
      sleep 0.5
    done
  "
echo "Load generator running. Wait 5-10 minutes for VPA to collect data..."
# After 5-10 minutes, check VPA recommendations
kubectl get vpa overprovisioned-app-vpa -n rightsizing-lab -o yaml | \
grep -A 30 "recommendation:"

Expected output (values will vary):

  recommendation:
    containerRecommendations:
    - containerName: app
      lowerBound:
        cpu: 10m
        memory: 48Mi
      target:
        cpu: 15m
        memory: 62Mi
      uncappedTarget:
        cpu: 15m
        memory: 62Mi
      upperBound:
        cpu: 42m
        memory: 131Mi

# Analyze the recommendation and estimate savings
cat > /tmp/analyze_vpa.sh << 'SCRIPT'
#!/bin/bash
echo "============================================"
echo " VPA Rightsizing Analysis"
echo "============================================"
echo ""
# Current requests
echo "CURRENT REQUESTS (per replica):"
echo "  CPU: 1000m"
echo "  Memory: 1Gi (1024Mi)"
echo ""
# Get VPA recommendations
VPA_JSON=$(kubectl get vpa overprovisioned-app-vpa -n rightsizing-lab -o json 2>/dev/null)
if [ -z "$VPA_JSON" ]; then
  echo "ERROR: VPA not found or no recommendations yet."
  echo "Wait a few more minutes and try again."
  exit 1
fi
echo "$VPA_JSON" | python3 -c "
import json, math, sys
data = json.load(sys.stdin)
recs = data.get('status', {}).get('recommendation', {}).get('containerRecommendations', [])
if not recs:
    print('No recommendations available yet. Wait 5-10 minutes.')
    sys.exit(0)
r = recs[0]
print('VPA RECOMMENDATIONS:')
print(f\"  Target: CPU={r['target']['cpu']}, Memory={r['target']['memory']}\")
print(f\"  Lower bound: CPU={r['lowerBound']['cpu']}, Memory={r['lowerBound']['memory']}\")
print(f\"  Upper bound: CPU={r['upperBound']['cpu']}, Memory={r['upperBound']['memory']}\")
print()
# Parse target values for savings calculation
cpu_target = r['target']['cpu']
if cpu_target.endswith('m'):
    cpu_target_m = int(cpu_target[:-1])
else:
    cpu_target_m = int(float(cpu_target) * 1000)
mem_target = r['target']['memory']
if mem_target.endswith('Mi'):
    mem_target_mi = int(mem_target[:-2])
elif mem_target.endswith('Gi'):
    mem_target_mi = int(float(mem_target[:-2]) * 1024)
elif mem_target.endswith('Ki'):
    mem_target_mi = int(mem_target[:-2]) // 1024
elif mem_target.endswith('M'):
    mem_target_mi = int(mem_target[:-1])  # treats M as Mi (approximation)
elif mem_target.endswith('k'):
    mem_target_mi = int(mem_target[:-1]) * 1000 // 1048576
else:
    mem_target_mi = int(mem_target) // 1048576  # plain bytes
cpu_savings = ((1000 - cpu_target_m) / 1000) * 100
mem_savings = ((1024 - mem_target_mi) / 1024) * 100
print('SAVINGS ANALYSIS:')
print(f'  CPU: {1000}m → {cpu_target_m}m = {cpu_savings:.0f}% reduction')
print(f'  Memory: 1024Mi → {mem_target_mi}Mi = {mem_savings:.0f}% reduction')
print()
# With margin, rounded UP so rounding never eats the safety margin
cpu_safe = math.ceil(cpu_target_m * 1.15 / 5) * 5     # 15% margin, round up to 5m
mem_safe = math.ceil(mem_target_mi * 1.20 / 16) * 16  # 20% margin, round up to 16Mi
print('RECOMMENDED NEW REQUESTS (with safety margin):')
print(f'  CPU: {max(cpu_safe, 25)}m (target + 15%)')
print(f'  Memory: {max(mem_safe, 64)}Mi (target + 20%)')
print()
print('ESTIMATED MONTHLY SAVINGS (3 replicas):')
cpu_saved = (1000 - max(cpu_safe, 25)) / 1000 * 3
print(f'  CPU: {cpu_saved:.2f} cores freed across cluster')
print(f'  At \$0.05/CPU-hr: ~\${cpu_saved * 0.05 * 730:.2f}/month')
"
SCRIPT
chmod +x /tmp/analyze_vpa.sh
bash /tmp/analyze_vpa.sh
# Apply the VPA-recommended values with margin
# Adjust these based on your actual VPA output
kubectl set resources deployment/overprovisioned-app \
-n rightsizing-lab \
--requests=cpu=25m,memory=64Mi \
--limits=cpu=100m,memory=256Mi
# Watch the rollout
kubectl rollout status deployment/overprovisioned-app -n rightsizing-lab
# Verify new resource allocation
kubectl get pods -n rightsizing-lab -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory
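
# Clean up: remove the load generator first, then the namespace
# (deleting the namespace also removes the Deployment, Service, and VPA object)
kubectl delete pod load-generator -n rightsizing-lab --ignore-not-found
kubectl delete namespace rightsizing-lab
# Optional: uninstall the VPA components installed at the start of the exercise
(cd /tmp/autoscaler/vertical-pod-autoscaler && ./hack/vpa-down.sh)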

You’ve completed this exercise when you:

  • Deployed VPA and verified all three components are running
  • Created an over-provisioned Deployment (1000m CPU, 1Gi memory for nginx)
  • Deployed VPA in Off mode and generated recommendations
  • Analyzed VPA recommendations and calculated savings
  • Applied rightsized resources with safety margins
  • Verified the Deployment runs correctly with reduced resources

  1. The request-usage gap is the largest source of Kubernetes waste — most workloads use 10-20% of what they request
  2. VPA automates rightsizing recommendations — start with Off mode, graduate to Auto
  3. Memory needs more margin than CPU — CPU throttling is graceful, OOM-killing is catastrophic
  4. HPA and VPA can coexist — VPA on memory, HPA on CPU
  5. Rightsizing is continuous — usage patterns change, review recommendations monthly

Projects:

  • Kubernetes VPA — github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
  • Goldilocks — github.com/FairwindsOps/goldilocks (VPA dashboard for all workloads)

Articles:

  • “Right-Sizing Your Kubernetes Workloads” — learnk8s.io
  • “VPA Best Practices” — povilasv.me/vertical-pod-autoscaler-best-practices
  • “CPU Limits in Kubernetes Are Harmful” — robusta.dev (why some teams remove CPU limits)

Talks:

  • “To Limit or Not to Limit: Kubernetes Resource Management” — KubeCon (YouTube)
  • “Goldilocks: Getting Kubernetes Resource Requests Just Right” — Fairwinds (YouTube)

Rightsizing is the highest-ROI FinOps activity in Kubernetes. By using VPA recommendations, Prometheus metrics, and structured workflows, teams can typically reduce compute costs by 40-70% without impacting application performance. The key is to start with visibility (Off mode VPA), apply changes gradually (non-critical workloads first), and monitor aggressively after changes (OOMKills, throttling, latency). Rightsizing is not a one-time project — it’s a continuous practice that should be reviewed monthly as usage patterns evolve.


Continue to Module 1.4: Cluster Scaling & Compute Optimization to learn how Karpenter, Spot instances, and node consolidation reduce infrastructure costs at the cluster level.


“The most expensive resource is the one nobody’s using.” — FinOps proverb