
Module 1.7: Capacity Planning

Discipline Module | Complexity: [COMPLEX] | Time: 50-60 min

Before starting this module, make sure you can already read service-level indicators, interpret latency and error-rate graphs, and explain why an SLO is a business promise rather than just an engineering target. Capacity planning turns those promises into resource decisions, so it builds directly on the earlier SRE modules instead of replacing them.

| Prerequisite | Why It Matters |
| --- | --- |
| Module 1.2: SLOs | Capacity decisions must protect a concrete reliability target, not an abstract sense of “healthy.” |
| Module 1.3: Error Budgets | The acceptable amount of risk changes how much headroom you keep and how aggressively you scale. |
| Module 1.4: Toil and Automation | Manual capacity work becomes toil when the same decisions repeat every week. |
| Observability Theory Track | Forecasts are only useful when current measurements are trustworthy. |
| Basic Kubernetes and cloud infrastructure experience | The examples use replicas, autoscaling, node capacity, and managed-service limits. |

After completing this module, you will be able to:

  • Design a capacity model that combines historical traffic, business events, SLO targets, and infrastructure limits into a defensible forecast.
  • Analyze utilization, queueing, latency, and dependency metrics to identify which resource will constrain throughput first.
  • Evaluate manual, scheduled, reactive, and predictive scaling strategies against reliability, cost, and operational-risk trade-offs.
  • Implement load-testing and validation plans that prove capacity assumptions before production users discover the limit.
  • Optimize capacity plans by connecting cloud cost to units of work such as requests, checkouts, jobs, or active users.

At 09:00, a retail company’s checkout service is quiet. By 10:10, a celebrity has posted a discount link, marketing has doubled its ad spend, and thousands of users are refreshing carts at the same time. The dashboards still look acceptable for a few minutes because average CPU is only catching up to reality, but queue depth is climbing, database connections are nearly exhausted, and every retry makes the system busier. The incident commander does not need a bigger dashboard at that moment; she needs capacity that was planned before the event.

Capacity planning is the SRE practice of deciding how much system you need before demand forces the answer. It is not guessing, and it is not buying the largest instance because outages are embarrassing. A good plan connects expected demand, measured system behavior, lead time, SLO risk, and cost into a decision that can be explained to engineering and business stakeholders.

The beginner mistake is to treat capacity as “more replicas.” The senior-level view is broader: every service has a limiting resource, every scaling action has a delay, every forecast carries uncertainty, and every extra unit of headroom has a price. The goal is not perfect prediction. The goal is to make uncertainty explicit early enough that you can test, provision, and adapt without gambling with users.


Capacity planning is the process of ensuring a service can meet future demand while staying inside reliability and cost boundaries. It answers four practical questions: what demand should we expect, what limit will we hit first, how much headroom do we need, and when must we act so capacity is ready before users are affected.

A capacity plan is strongest when it is anchored in the service’s SLO. If the checkout API promises 99.9% of requests under 300 ms, then “enough capacity” means enough CPU, memory, connections, queues, network, storage, and downstream quota to keep that promise during expected peaks. A service can look inexpensive while quietly burning error budget, or look reliable while wasting money on unused resources. SRE capacity planning forces those trade-offs into the open.

The planning cycle is iterative because demand, code, dependencies, and infrastructure all change. A forecast from January is not a guarantee for July after a new feature doubles database writes per request. Each cycle should refresh the model, compare it with production measurements, find the newest bottleneck, provision or tune resources, and validate the result under controlled load.

flowchart TD
A[FORECAST DEMAND<br/>Expected users, events, seasonality] --> B[MEASURE CURRENT STATE<br/>Traffic, utilization, latency, queues]
B --> C[IDENTIFY BOTTLENECKS<br/>Resource or dependency that saturates first]
C --> D[PLAN CAPACITY<br/>Required throughput plus SLO headroom]
D --> E[PROVISION OR TUNE<br/>Replicas, nodes, limits, quotas, services]
E --> F[VALIDATE WITH TESTS<br/>Load, stress, soak, and failover]
F --> G[REVIEW COST AND RISK<br/>Cost per unit, error budget exposure]
G --> A

The important habit is to separate current load from current capacity. Current load is what users are asking the system to do now. Current capacity is the highest load the system can handle while still meeting its SLO. The difference between the two is headroom, and headroom is what gives autoscalers, humans, and dependent systems time to react.

| Term | Practical Meaning | Example |
| --- | --- | --- |
| Load | The work arriving at the system | 1,200 requests per second during daily peak |
| Capacity | The work the system can handle within SLO | 1,800 requests per second at p95 latency under 300 ms |
| Headroom | Capacity that remains unused during normal peak | 600 requests per second, or 33% of capacity |
| Bottleneck | The first component that prevents more throughput | Database connection pool saturates before API CPU |
| Lead time | Time required to make new capacity usable | Five minutes for pods, two weeks for vendor quota |
| Safety margin | Extra capacity held for forecast uncertainty | Planning for 30% above expected event demand |

Imagine a team owns an internal package-search API. The service currently peaks at 900 requests per second, and load testing shows it can handle 1,350 requests per second before p95 latency breaks the SLO. Product expects organic traffic to grow 8% per month, and a new IDE integration is expected to add a 60% event spike in month four. The team wants 30% headroom after forecast demand, not before it.

Current peak load: 900 RPS
Measured SLO-safe capacity: 1,350 RPS
Current headroom: 450 RPS
Current headroom percentage: 33%
Organic growth after 4 months:
900 * (1.08 ^ 4) = 1,224 RPS
Month 4 launch impact:
1,224 * 1.60 = 1,958 RPS expected peak
Capacity required with 30% headroom:
1,958 * 1.30 = 2,545 RPS required capacity

The worked example shows why a system that looks healthy today can still be under-planned. The team has 33% current headroom, which sounds comfortable, but the month-four forecast requires almost twice the current measured capacity. If the lead time to increase database throughput is one month, waiting until dashboards show sustained high utilization is already too late.
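
The same arithmetic can be kept in a small script so the forecast is easy to rerun when an assumption changes. This is a minimal sketch, not a prescribed tool: the function name planCapacity and its parameter names are hypothetical, and the inputs are the package-search figures from the worked example.

```javascript
// Hypothetical helper: project peak demand, then add headroom on top of the forecast.
function planCapacity({ currentPeakRps, monthlyGrowth, months, eventMultiplier, headroom }) {
  // Compound organic growth over the planning horizon.
  const organicPeak = currentPeakRps * Math.pow(1 + monthlyGrowth, months);
  // Apply the one-off event spike on top of organic growth.
  const forecastPeak = organicPeak * eventMultiplier;
  // Headroom is applied after the forecast, not before it.
  const requiredCapacity = forecastPeak * (1 + headroom);
  return { organicPeak, forecastPeak, requiredCapacity };
}

const plan = planCapacity({
  currentPeakRps: 900,
  monthlyGrowth: 0.08,
  months: 4,
  eventMultiplier: 1.6,
  headroom: 0.3,
});
const measuredCapacityRps = 1350;

console.log('Organic peak (RPS):', Math.round(plan.organicPeak));
console.log('Forecast peak (RPS):', Math.round(plan.forecastPeak));
console.log('Required capacity (RPS):', Math.round(plan.requiredCapacity));
console.log('Required / measured:', (plan.requiredCapacity / measuredCapacityRps).toFixed(2));
```

Because the script does not round intermediate values, it reports a required capacity of about 2,547 RPS, slightly above the 2,545 RPS in the hand calculation. The gap is rounding, not a different model.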

Active learning prompt: Before reading further, decide which number you would take to a planning meeting: current peak load, current capacity, forecast peak, or required capacity with headroom. Explain why the other three numbers are still useful but insufficient by themselves.


Demand forecasting is the bridge between business plans and infrastructure plans. Engineering teams often have detailed graphs of last week’s traffic but weak visibility into the campaign, launch, migration, compliance deadline, or customer onboarding event that will change next month’s load. A useful forecast combines observed history with explicit business input and then labels uncertainty instead of hiding it.

Organic growth is usually the easiest part because it appears in historical traffic. It may still be misleading if the product recently changed pricing, added a new client, or shifted traffic from one API version to another. Event-driven demand is less frequent but more dangerous because it can arrive faster than reactive scaling. Viral growth is the hardest case; it cannot be scheduled, so the plan must rely on elastic infrastructure, fast startup times, rate limits, graceful degradation, and larger baseline headroom.

| Demand Type | Signal to Gather | Planning Response |
| --- | --- | --- |
| Organic growth | Month-over-month traffic, active users, request mix | Forecast with trends and review monthly |
| Event-driven spike | Launch calendar, marketing plan, seasonal business cycle | Pre-scale, load test, and staff the event |
| Viral or external spike | Social mentions, partner integrations, breaking news exposure | Keep elastic buffer and protect dependencies |
| Migration-driven demand | Customers moving from old system to new system | Model overlap and rollback traffic explicitly |
| Batch or scheduled demand | Jobs, reports, backups, billing runs | Stagger schedules and reserve off-peak capacity |

Historical trend analysis starts with production measurements and projects them forward. This method works best when the service is mature, request behavior is stable, and growth is gradual. It works poorly when upcoming business events will change user behavior or when a new feature changes the amount of backend work per request.

Month Peak Traffic (RPS) Growth
Jan 820 -
Feb 875 6.7%
Mar 930 6.3%
Apr 995 7.0%
May 1,060 6.5%
Average monthly growth: about 6.6%
Six-month organic forecast:
1,060 * (1.066 ^ 6) = about 1,555 RPS

The calculation is simple, but the interpretation takes judgment. If the request mix changes from mostly reads to many writes, the same RPS may consume more CPU, database I/O, or lock time. A mature capacity model therefore tracks units of work, not just request count. For an API this might mean separating search, checkout, login, write, and export endpoints because each endpoint stresses different resources.
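
The trend calculation above can be sketched as two small functions. The names averageMonthlyGrowth and compoundForecast are hypothetical, and the sketch assumes the simple arithmetic mean of month-over-month growth used in the table; other averaging methods would give slightly different rates.

```javascript
// Arithmetic mean of month-over-month growth rates, as in the table above.
function averageMonthlyGrowth(monthlyPeaks) {
  let sum = 0;
  for (let i = 1; i < monthlyPeaks.length; i++) {
    sum += monthlyPeaks[i] / monthlyPeaks[i - 1] - 1;
  }
  return sum / (monthlyPeaks.length - 1);
}

// Project the latest peak forward with compound growth.
function compoundForecast(latestPeak, monthlyRate, months) {
  return latestPeak * Math.pow(1 + monthlyRate, months);
}

const monthlyPeaksRps = [820, 875, 930, 995, 1060]; // Jan through May
const growthRate = averageMonthlyGrowth(monthlyPeaksRps);
const sixMonthForecastRps = compoundForecast(monthlyPeaksRps[monthlyPeaksRps.length - 1], growthRate, 6);

console.log('Average monthly growth (%):', (growthRate * 100).toFixed(1));
console.log('Six-month organic forecast (RPS):', Math.round(sixMonthForecastRps));
```

The script carries the unrounded growth rate forward, so its six-month figure lands a few RPS above the hand calculation that used the rounded 6.6%. That sensitivity is a reminder that the last digits of a forecast are not meaningful.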

Business-input forecasting asks stakeholders what will change the demand curve. This is not a ceremonial meeting; it is part of the reliability system. SREs should ask for launch dates, expected user cohorts, marketing size, regional rollouts, partner commitments, and success criteria. When business estimates are uncertain, capture ranges rather than forcing false precision.

Questions to ask before a major launch:
What customer segment is being targeted?
How many users or accounts are expected in the first week?
Will traffic arrive gradually or at a published launch time?
Does the feature change read/write ratio or payload size?
Are any external vendors, payment systems, or identity providers involved?
What business loss occurs if we throttle, queue, or disable the feature?

A senior capacity plan often has multiple scenarios. A conservative plan might model expected, high, and extreme demand, then define which actions are justified for each. That keeps the team from arguing about a single magic forecast number and instead focuses the discussion on thresholds, triggers, and acceptable risk.

| Scenario | Demand Assumption | Capacity Action | Business Discussion |
| --- | --- | --- | --- |
| Expected | Forecast peak plus normal headroom | Standard pre-scale and monitoring | Budgeted operational cost |
| High | Forecast peak plus stronger adoption | Extra replicas, larger database tier, vendor quota increase | Temporary event spend |
| Extreme | Viral or partner-amplified spike | Rate limits, queueing, graceful degradation, incident staffing | Protect core transactions first |

First-principles modeling is useful when a service is new or historical data does not represent the future. Instead of extrapolating traffic, you estimate user behavior and convert it into work. This method exposes assumptions clearly, which makes it easier to revise the model after real production data arrives.

Assumptions:
Each active user opens 3 sessions per day.
Each session performs 12 API calls.
Peak hour receives 18% of daily calls.
The peak minute is 2.5 times the average minute in the peak hour.
For 600,000 active users:
Daily calls = 600,000 * 3 * 12 = 21,600,000 calls
Peak hour calls = 21,600,000 * 0.18 = 3,888,000 calls
Average RPS during peak hour = 3,888,000 / 3,600 = 1,080 RPS
Peak-minute adjusted RPS = 1,080 * 2.5 = 2,700 RPS

This model is not “true” just because the arithmetic is neat. Its value is that every assumption is visible. If analytics later shows five sessions per day instead of three, you can update the model and immediately see the capacity impact. If the peak-minute multiplier is higher in one region, you can plan regional capacity differently instead of averaging away the risk.
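
Keeping the assumptions as named inputs makes that kind of revision cheap. The sketch below is one possible shape for the model; the function name peakRpsFromBehavior and its field names are hypothetical, and the values are the ones listed in the assumptions above.

```javascript
// Convert user-behavior assumptions into a peak request rate.
function peakRpsFromBehavior({ activeUsers, sessionsPerDay, callsPerSession, peakHourShare, peakMinuteMultiplier }) {
  const dailyCalls = activeUsers * sessionsPerDay * callsPerSession;
  const peakHourCalls = dailyCalls * peakHourShare;
  const avgPeakHourRps = peakHourCalls / 3600; // seconds in one hour
  const peakRps = avgPeakHourRps * peakMinuteMultiplier;
  return { dailyCalls, avgPeakHourRps, peakRps };
}

const baseAssumptions = {
  activeUsers: 600000,
  sessionsPerDay: 3,
  callsPerSession: 12,
  peakHourShare: 0.18,
  peakMinuteMultiplier: 2.5,
};

const baseline = peakRpsFromBehavior(baseAssumptions);
// Revised model: analytics shows five sessions per day instead of three.
const revised = peakRpsFromBehavior({ ...baseAssumptions, sessionsPerDay: 5 });

console.log('Baseline peak RPS:', Math.round(baseline.peakRps));
console.log('Revised peak RPS:', Math.round(revised.peakRps));
```

Changing one input from three sessions to five moves the peak from 2,700 RPS to 4,500 RPS, which is the kind of shift that should reopen the capacity plan.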

Seasonality matters because systems fail at peaks, not averages. A service with the same daily request count can be easy or hard to run depending on whether traffic is smooth or concentrated. Capacity planning should look at daily, weekly, monthly, and yearly shapes, then identify which peaks align with business-critical workflows.

xychart-beta
title "Daily Pattern"
x-axis ["12am", "6am", "12pm", "6pm", "12am"]
y-axis "Traffic" 0 --> 100
line [20, 30, 80, 90, 25]

xychart-beta
title "Weekly Pattern"
x-axis ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
y-axis "Traffic" 0 --> 100
line [80, 85, 80, 85, 80, 30, 25]

xychart-beta
title "Yearly Pattern for Retail"
x-axis ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
y-axis "Traffic" 0 --> 100
line [40, 35, 40, 45, 40, 45, 40, 45, 50, 60, 95, 80]

Active learning prompt: A service has low weekend traffic but runs heavy reporting jobs every Sunday night. Would you plan capacity from the weekday user peak, the reporting window, or both? Write down which resource each window is likely to stress before you continue.


Measure Current Capacity and Find the Bottleneck


A forecast tells you expected demand, but only measurement tells you how the current system behaves under load. The most common capacity mistake is to watch one resource, usually CPU, and assume it represents the whole service. Real systems saturate through connection pools, memory pressure, queue depth, disk latency, lock contention, downstream quotas, network bandwidth, garbage collection, or retry amplification.

Capacity measurement should use golden signals and resource signals together. Latency and errors tell you whether users are suffering. Utilization, saturation, and queueing tell you why the system is approaching a limit. Throughput tells you whether extra load is becoming completed work or just more waiting.

| Metric | What It Measures | Warning Sign | Capacity Interpretation |
| --- | --- | --- | --- |
| CPU utilization | Processing capacity | Sustained usage above about 70% | Little room remains for spikes or inefficient code paths |
| Memory utilization | Working set and cache pressure | OOM kills, swap, or pressure eviction | Requests may fail suddenly rather than slow gradually |
| Disk I/O latency | Storage throughput and wait time | Rising read/write latency under load | Databases and queues may become the bottleneck |
| Network I/O | Bandwidth and packet handling | Drops, retransmits, or near-limit bandwidth | More replicas may not help if the network is saturated |
| Request latency | User-visible response time | p95 or p99 trend rises with load | Queueing or dependency saturation is beginning |
| Queue depth | Backlog of unfinished work | Queue grows faster than workers drain it | Arrival rate exceeds service rate |
| Error rate | Failed work | Errors rise as traffic increases | Capacity limit has become user-visible |
| Dependency quota | External or managed-service limit | Throttling, 429s, connection refusal | Your service capacity is capped outside your pods |

Sustained utilization near 100% is usually not efficient when the service has latency SLOs. As utilization rises, each new request is more likely to wait behind existing work. Queueing theory explains the shape: the system may look stable for a while, then latency accelerates rapidly near saturation. This is why many teams keep explicit headroom for latency-sensitive services and use tighter utilization targets only after testing how their workloads behave near saturation.

xychart-beta
title "Utilization vs Latency"
x-axis "Utilization (%)" ["0", "30", "50", "70", "85", "100"]
y-axis "Latency" 0 --> 100
line [10, 12, 15, 25, 60, 100]

The exact threshold depends on workload shape and scaling speed. A stateless API with fast pod startup, good caching, and no slow dependencies might safely run tighter. A payment service with external vendors, strict latency SLOs, and slow warm-up needs more headroom because one dependency wobble can consume the remaining margin.
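
To see why latency accelerates near saturation, it can help to plug numbers into the simplest queueing model. The sketch below assumes an M/M/1 queue (one server, random arrivals, random service times), which real services only approximate; the function name and the 20 ms service time are hypothetical. Under that assumption, mean response time is service time divided by (1 - utilization).

```javascript
// Mean response time for an M/M/1 queue: serviceTime / (1 - utilization).
// This is an idealized model, used here only to show the shape of the curve.
function mm1ResponseTimeMs(serviceTimeMs, utilization) {
  if (utilization >= 1) return Infinity; // the queue grows without bound
  return serviceTimeMs / (1 - utilization);
}

const serviceTimeMs = 20; // hypothetical time to serve one request with no queueing
for (const utilization of [0.3, 0.5, 0.7, 0.85, 0.95]) {
  const latency = mm1ResponseTimeMs(serviceTimeMs, utilization);
  console.log(`utilization ${Math.round(utilization * 100)}% -> about ${Math.round(latency)} ms`);
}
```

In this model, moving from 50% to 70% utilization adds far less latency than moving from 85% to 95%. Measured curves from load tests should replace the model once they exist.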

| Workload Type | Typical Headroom Bias | Reason |
| --- | --- | --- |
| User-facing checkout or login | Larger headroom | Failures are immediately visible and business-critical |
| Internal read-heavy API | Moderate headroom | Can often tolerate short latency increases |
| Batch processing | Smaller headroom | Work can queue if deadlines are still met |
| Streaming ingestion | Larger burst buffer | Backpressure can cascade into producers |
| Machine-learning training jobs | Cost-optimized capacity | Reliability target is often completion time, not request latency |

The bottleneck question is: which resource fails first as load increases? The answer is rarely visible from one dashboard panel. You find it by increasing load gradually, watching every tier, and looking for the first saturation signal that correlates with latency or errors.

flowchart LR
A[Web App<br/>CPU 45%<br/>Queue low] --> B[Database<br/>CPU 78%<br/>Connections 95%<br/>BOTTLENECK]
A --> C[Cache<br/>CPU 30%<br/>Hit rate stable]
B --> D[Storage<br/>Write latency rising]
style B stroke:#ff0000,stroke-width:2px
style D stroke:#ff9900,stroke-width:2px

If the API tier is at 45% CPU and the database connection pool is at 95%, adding API replicas can make the incident worse. More replicas may open more database connections, increase contention, and produce more retries. The correct capacity action is to relieve the constrained tier: tune queries, add indexes, increase connection pool discipline, scale the database, shard workload, cache safely, or reduce write amplification.

Bottleneck investigation sequence:
1. Increase load gradually while holding the software version constant.
2. Watch throughput, latency, errors, saturation, and queue depth at each tier.
3. Mark the first resource that saturates before user-visible failure.
4. Change one capacity variable at a time and retest.
5. Record the new limiting resource, because fixing one bottleneck reveals the next.
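
Step 3 of the sequence above can be sketched as a scan over staged load-test samples. Everything in this example is illustrative: the sample values, the resource names, and the saturation limits are hypothetical, and a real investigation would read them from monitoring rather than hard-code them.

```javascript
// Hypothetical staged samples, ordered by increasing load.
const stagedSamples = [
  { loadRps: 400, utilization: { apiCpu: 0.30, dbConnections: 0.55, dbCpu: 0.40 } },
  { loadRps: 800, utilization: { apiCpu: 0.45, dbConnections: 0.95, dbCpu: 0.78 } },
  { loadRps: 1200, utilization: { apiCpu: 0.70, dbConnections: 1.00, dbCpu: 0.92 } },
];

// Hypothetical saturation limits per resource (fraction of maximum).
const saturationLimits = { apiCpu: 0.80, dbConnections: 0.90, dbCpu: 0.85 };

// Return the first load stage at which any resource crosses its limit.
function firstBottleneck(samples, limits) {
  for (const sample of samples) {
    const saturated = Object.keys(limits).filter(
      (name) => sample.utilization[name] >= limits[name]
    );
    if (saturated.length > 0) {
      return { loadRps: sample.loadRps, resources: saturated };
    }
  }
  return null; // nothing saturated within the tested range
}

const bottleneck = firstBottleneck(stagedSamples, saturationLimits);
console.log(bottleneck);
```

With these sample values the database connection pool crosses its limit at the 800 RPS stage while API CPU is still well below its own, which matches the diagram above. A saturation signal is only a candidate bottleneck until it is correlated with latency or errors.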

Active learning prompt: Your API CPU is 55%, Redis CPU is 35%, PostgreSQL has 92% connection usage, and p99 latency rises only on endpoints that write orders. Predict what happens if you double API replicas without changing database limits. Then identify one measurement that would confirm your prediction.


A capacity model turns observed behavior and forecast demand into decisions. It should be simple enough for another engineer to audit, but detailed enough to show different work types and limiting resources. The model should include current peak, measured capacity, forecast demand, headroom target, scaling lead time, and known external limits.

Start with a service map. The map does not need to be beautiful; it needs to show where work goes and where queues or quotas exist. Static architecture diagrams are useful here because they force the team to name each capacity boundary.

+------------------+       +------------------+       +------------------+
| Users / Clients  | ----> | API Gateway      | ----> | Orders API       |
| Peak requests    |       | Rate limits      |       | CPU + workers    |
+------------------+       +------------------+       +------------------+
                                    |                          |
                                    v                          v
                           +------------------+       +------------------+
                           | Auth Provider    |       | PostgreSQL       |
                           | Vendor quota     |       | Connections/I/O  |
                           +------------------+       +------------------+
                                                               |
                                                               v
                                                      +------------------+
                                                      | Payment Vendor   |
                                                      | QPS contract     |
                                                      +------------------+
A useful model distinguishes capacity units. For example, one request per second is not one equal unit of work if login requests call an identity provider, search requests use cache-heavy reads, and checkout requests write to the database and payment gateway. Senior teams often model the top few traffic classes separately and then combine them into total resource demand.

| Traffic Class | Current Peak | Forecast Multiplier | Limiting Resource | Notes |
| --- | --- | --- | --- | --- |
| Product search | 700 RPS | 1.4x | Cache memory and API CPU | Mostly read-heavy |
| Cart update | 250 RPS | 1.8x | Database writes | Sensitive to lock contention |
| Checkout | 120 RPS | 2.5x | Payment vendor QPS | Business-critical |
| Login | 180 RPS | 1.5x | Auth provider quota | Can block all user flows |

Suppose a product launch is expected to raise total peak traffic by about 60%, with checkout growing faster than browsing because the campaign targets high-intent users. The team measures current SLO-safe capacity for each limiting resource and compares it with forecast demand plus 30% headroom.

| Traffic Class | Forecast Peak | Required with 30% Headroom | Current SLO-Safe Capacity | Gap |
| --- | --- | --- | --- | --- |
| Product search | 980 RPS | 1,274 RPS | 1,600 RPS | No immediate gap |
| Cart update | 450 RPS | 585 RPS | 520 RPS | Add database write capacity |
| Checkout | 300 RPS | 390 RPS | 260 RPS | Increase vendor quota and test |
| Login | 270 RPS | 351 RPS | 400 RPS | Watch auth latency during event |

The table changes the conversation. The service does not need a blanket doubling of every component. Search has enough room, cart updates need database attention, checkout is externally constrained, and login is close enough to require event monitoring. This is how capacity planning avoids both outages and waste.
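
The per-class comparison is mechanical enough to script, which makes it easy to rerun when a multiplier or a measured capacity changes. The sketch below uses the figures from the two tables; the function name capacityGaps and the field names are hypothetical. Values are rounded to whole RPS because floating-point multiplication can otherwise produce results such as 979.9999999999999.

```javascript
const trafficClasses = [
  { name: 'Product search', currentPeakRps: 700, multiplier: 1.4, safeCapacityRps: 1600 },
  { name: 'Cart update', currentPeakRps: 250, multiplier: 1.8, safeCapacityRps: 520 },
  { name: 'Checkout', currentPeakRps: 120, multiplier: 2.5, safeCapacityRps: 260 },
  { name: 'Login', currentPeakRps: 180, multiplier: 1.5, safeCapacityRps: 400 },
];

// Compare forecast demand plus headroom against measured SLO-safe capacity.
function capacityGaps(classes, headroom) {
  return classes.map((c) => {
    const forecastPeakRps = Math.round(c.currentPeakRps * c.multiplier);
    const requiredRps = Math.round(forecastPeakRps * (1 + headroom));
    // A gap of zero means current capacity already covers the requirement.
    const gapRps = Math.max(0, requiredRps - c.safeCapacityRps);
    return { name: c.name, forecastPeakRps, requiredRps, gapRps };
  });
}

const gaps = capacityGaps(trafficClasses, 0.3);
for (const g of gaps) {
  console.log(`${g.name}: forecast ${g.forecastPeakRps}, required ${g.requiredRps}, gap ${g.gapRps}`);
}
```

The output shows a shortfall of 65 RPS for cart updates and 130 RPS for checkout, and no gap for search or login. Login passes by only 49 RPS, which is why the table flags it for monitoring rather than ignoring it.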

Capacity is only useful if it arrives before demand. Kubernetes pods may start in seconds, but nodes, database storage, managed-service limits, procurement approvals, and vendor contracts can take much longer. Your plan should define trigger points that fire earlier than the final danger zone.

| Capacity Change | Typical Lead Time | Trigger Point |
| --- | --- | --- |
| Add pod replicas | Seconds to minutes | Forecast peak approaches autoscaler max |
| Add Kubernetes nodes | Minutes | Cluster allocatable CPU or memory below buffer |
| Increase database tier | Hours to days | Forecast writes exceed tested database capacity |
| Raise vendor quota | Days to weeks | Event forecast exceeds contract limit |
| Redesign hot path | Weeks to months | Load test shows architectural bottleneck |
| Negotiate budget | Weeks | Required event spend exceeds existing allocation |

A trigger point should be observable and action-oriented. “CPU is high” is vague. “Sustained checkout p95 above 240 ms while database connections exceed 80% at 70% of forecast event load” is specific enough to drive a decision. Good triggers also include rollback or mitigation actions, because some capacity changes introduce their own risks.
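
How early a trigger must fire follows from growth rate and lead time. The sketch below is one way to estimate it, assuming steady compound growth and no event spikes; the function names are hypothetical, and the inputs reuse the package-search example (900 RPS peak, 8% monthly growth, 1,350 RPS SLO-safe capacity, 30% headroom over demand) with a one-month lead time.

```javascript
// Months until peak load reaches a threshold under compound growth.
function monthsUntilLoadReaches(currentPeakRps, monthlyGrowth, thresholdRps) {
  if (thresholdRps <= currentPeakRps) return 0; // already at or past the threshold
  return Math.log(thresholdRps / currentPeakRps) / Math.log(1 + monthlyGrowth);
}

// Time left before work must start; a negative value means it is already late.
function slackBeforeActionMonths(monthsUntilBreach, leadTimeMonths) {
  return monthsUntilBreach - leadTimeMonths;
}

const safeCapacityRps = 1350;
const headroomTarget = 0.3;
// Load at which demand plus 30% headroom equals the measured capacity.
const triggerLoadRps = safeCapacityRps / (1 + headroomTarget);

const monthsToBreach = monthsUntilLoadReaches(900, 0.08, triggerLoadRps);
const slackMonths = slackBeforeActionMonths(monthsToBreach, 1);

console.log('Trigger load (RPS):', Math.round(triggerLoadRps));
console.log('Months until headroom target is breached:', monthsToBreach.toFixed(2));
console.log('Months of slack before work must start:', slackMonths.toFixed(2));
```

With these inputs the headroom target is breached in under two months from organic growth alone, and a one-month lead time leaves less than a month to start the change. An event spike inside that window would shorten it further.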


Provisioning strategy is where reliability goals, cost limits, and operational maturity meet. There is no universally best strategy. Manual provisioning can be appropriate for stable systems with long lead times. Scheduled scaling is excellent for known traffic shapes. Reactive autoscaling is the default for many Kubernetes services. Predictive scaling is useful when historical patterns are strong and the business cost of lag is high.

| Strategy | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Manual provisioning | Full review and cost control | Slow reaction and high toil | Stable, slow-growth services |
| Scheduled scaling | Ready before known peaks | Misses unexpected spikes | Business-hours or seasonal demand |
| Reactive autoscaling | Handles variable demand automatically | Lags behind sudden spikes | General stateless services |
| Predictive autoscaling | Scales before demand arrives | Needs good data and tuning | Large-scale predictable traffic |
| Queue-based scaling | Matches workers to backlog | Users may wait if queues grow | Asynchronous jobs and ingestion |
| Graceful degradation | Preserves core flows under stress | Requires product decisions ahead of time | Critical services with optional features |

Manual provisioning is not automatically bad. It can be the right choice when resources are expensive, changes are risky, or growth is slow. The problem appears when manual work is the only path during a fast-moving incident. If every capacity increase requires a ticket, approval, deployment, and hand verification, the lead time may exceed the time between warning and user impact.

Manual capacity path:
1. Utilization or forecast crosses a trigger.
2. Engineer opens a capacity request with model evidence.
3. Owner approves spend or risk.
4. Infrastructure change is applied.
5. Load or canary validation confirms SLO-safe capacity.
6. The capacity model is updated with new measurements.

Manual provisioning should still be tested and documented. A runbook that says “increase the database tier” is incomplete unless it includes expected downtime, backup status, rollback options, connection behavior, and post-change verification.

Scheduled scaling is powerful when traffic follows a clock or calendar. It also protects against autoscaling lag by making capacity ready before users arrive. The risk is that schedules become stale after usage patterns change, so they should be reviewed against actual traffic.

scaling_schedule:
  - name: business-hours
    cron: "0 8 * * MON-FRI"
    replicas: 12
  - name: overnight
    cron: "0 20 * * MON-FRI"
    replicas: 4
  - name: launch-event
    cron: "0 7 14 8 *"
    replicas: 40
  - name: post-event-reduction
    cron: "0 3 15 8 *"
    replicas: 10

Reactive autoscaling responds to measured demand. In Kubernetes 1.35+ (and in any release since 1.23, when the autoscaling/v2 API became stable), the Horizontal Pod Autoscaler can scale workloads based on CPU, memory, or custom metrics. The example below uses CPU because it is common and easy to understand, but production teams often add request rate, queue depth, or business metrics when CPU is not the actual bottleneck.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 4
  maxReplicas: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 8
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
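
To reason about how this configuration behaves, it helps to work the autoscaler's core ratio by hand. The sketch below applies the documented formula, desired replicas = ceil(current replicas * current metric / target metric), and then caps the result with the scale-up policies from the manifest above. It is a simplification: it ignores the controller's tolerance band, stabilization windows, and pods that are not yet ready, and the function names are hypothetical.

```javascript
// Core HPA ratio: scale replicas in proportion to metric overshoot.
function desiredReplicas(currentReplicas, currentUtilization, targetUtilization) {
  return Math.ceil(currentReplicas * (currentUtilization / targetUtilization));
}

// Cap one scale-up step using the Percent and Pods policies with selectPolicy: Max.
function capScaleUp(currentReplicas, desired, { percent, pods, maxReplicas }) {
  const byPercent = Math.ceil(currentReplicas * (1 + percent / 100));
  const byPods = currentReplicas + pods;
  // selectPolicy: Max picks whichever policy allows the larger change.
  const stepCeiling = Math.max(byPercent, byPods);
  return Math.min(desired, stepCeiling, maxReplicas);
}

const scaleUpPolicy = { percent: 100, pods: 8, maxReplicas: 60 };

// 4 replicas averaging 260% of requested CPU against a 65% target.
const wanted = desiredReplicas(4, 260, 65);
const allowed = capScaleUp(4, wanted, scaleUpPolicy);

console.log('Desired replicas:', wanted);
console.log('Allowed after one 60-second period:', allowed);
```

In this example the ratio asks for 16 replicas, but the policies allow at most 12 in the first period, so a large spike takes more than one period to absorb. That gap is part of the reaction delay to account for in the plan.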

Reactive scaling has delay. Metrics must be scraped, the controller must calculate desired replicas, the scheduler must place pods, images may need to be pulled, readiness probes must pass, and application caches may need to warm. If traffic doubles in less time than that chain requires, the service can fail even though autoscaling is configured.

flowchart LR
A[Traffic spike starts] --> B[Metrics scraped]
B --> C[Autoscaler calculates replicas]
C --> D[Scheduler places pods]
D --> E[Containers start]
E --> F[Readiness probes pass]
F --> G[Pods receive traffic]

Predictive Scaling and Graceful Degradation


Predictive scaling uses historical patterns, calendars, and leading indicators to add capacity before demand arrives. It is most useful when the pattern is stable enough to forecast and the cost of being late is high. The failure mode is overconfidence: a prediction system that silently drifts can waste money or miss a new traffic shape.

Graceful degradation is the capacity strategy for the moment when demand exceeds available resources anyway. Instead of letting every feature fail equally, the service preserves critical paths and sheds optional work. A checkout service might disable recommendations, defer analytics writes, reduce image processing, or queue non-urgent email while keeping payment flow available.

| Degradation Choice | Protects | Trade-Off |
| --- | --- | --- |
| Disable recommendations | Checkout latency and database reads | Lower personalization quality |
| Queue analytics writes | Request path latency | Delayed reporting |
| Rate-limit expensive search | Database and cache stability | Some users receive throttling |
| Serve cached responses | API availability | Data may be slightly stale |
| Reject low-priority batch jobs | Interactive user flows | Background work finishes later |
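
Degradation works best when the shedding order is decided ahead of time and encoded somewhere the service can read it. The following is a minimal sketch of that idea, not a production pattern: the feature names, the thresholds, and the function featuresToShed are all hypothetical, and a real system would also need hysteresis so features do not flap on and off.

```javascript
// Hypothetical optional features, each with the utilization at which it is shed.
// Lower thresholds are shed first; the payment flow is deliberately not listed.
const optionalFeatures = [
  { name: 'recommendations', shedAbove: 0.80 },
  { name: 'analytics-writes', shedAbove: 0.85 },
  { name: 'non-urgent-email', shedAbove: 0.90 },
];

// Return the names of features that should be disabled at this utilization.
function featuresToShed(utilization, features) {
  return features
    .filter((feature) => utilization >= feature.shedAbove)
    .map((feature) => feature.name);
}

console.log(featuresToShed(0.70, optionalFeatures));
console.log(featuresToShed(0.87, optionalFeatures));
console.log(featuresToShed(0.95, optionalFeatures));
```

The thresholds are product decisions as much as engineering ones, which is why they belong in the capacity plan rather than being improvised during an incident.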

You cannot trust capacity you have not tested. Dashboards from normal production traffic show how the service behaves under current conditions, not where it breaks. Load testing creates controlled pressure so the team can measure throughput, latency, errors, saturation, queueing, and recovery behavior before a real event forces the experiment.

Load tests should have a question. “Run k6” is not a question. “Can checkout sustain 390 RPS for thirty minutes with p95 under 300 ms while payment vendor latency doubles?” is a question. The sharper the question, the easier it is to choose traffic shape, duration, stopping conditions, and success criteria.

| Test Type | Purpose | Typical Duration | What You Learn |
| --- | --- | --- | --- |
| Smoke test | Confirm the test path works | Minutes | Environment and script are valid |
| Load test | Validate expected production load | 30-60 minutes | SLO-safe capacity at planned demand |
| Stress test | Push beyond expected demand | Until controlled failure | Breaking point and first bottleneck |
| Spike test | Simulate sudden traffic arrival | Minutes | Autoscaling lag and burst behavior |
| Soak test | Hold sustained pressure | Hours to days | Memory leaks, slow queues, degradation |
| Failover test | Remove or impair a dependency | Scenario-dependent | Capacity under partial failure |

A responsible load test protects production and produces evidence. The checklist matters because a poorly designed test can either hurt users or create false confidence. If you test only the API tier while dependencies are mocked, you have measured API compute, not end-to-end service capacity.

load_test_checklist:
  before:
    - "Define the capacity question and SLO thresholds."
    - "Confirm the test environment matches production limits that matter."
    - "Notify stakeholders and on-call engineers."
    - "Verify dashboards, alerts, and logs are working."
    - "Set stopping conditions for user impact or dependency risk."
    - "Prepare rollback and traffic-shaping controls."
  during:
    - "Increase load in planned stages."
    - "Monitor throughput, p95, p99, errors, saturation, and queues."
    - "Watch dependent services and external quotas."
    - "Record the exact time each bottleneck signal appears."
    - "Stop if production users or shared dependencies are harmed."
  after:
    - "Compare results with the forecast and previous tests."
    - "Identify the first limiting resource."
    - "Document the new SLO-safe capacity."
    - "Create follow-up work for gaps."
    - "Update autoscaler limits, runbooks, and capacity models."

The following k6 script is runnable after installing k6 and setting TARGET_URL. It ramps traffic in stages, checks status codes, and fails the run if latency or the error rate exceeds the planned thresholds. The endpoint is intentionally configurable so the same script can test staging, a dedicated performance environment, or a controlled production route.

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  // Ramp to 100 virtual users, hold, ramp to 250, hold, then ramp down.
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 100 },
    { duration: '2m', target: 250 },
    { duration: '10m', target: 250 },
    { duration: '2m', target: 0 },
  ],
  // The run fails if any of these thresholds is breached.
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<750'],
    http_req_failed: ['rate<0.01'],
  },
};

// The target endpoint comes from the TARGET_URL environment variable.
const targetUrl = __ENV.TARGET_URL || 'https://example.com/health';

export default function () {
  const res = http.get(targetUrl);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time under 300ms': (r) => r.timings.duration < 300,
  });
  sleep(1); // one second of think time per iteration
}

Run the script from a terminal:

TARGET_URL="https://staging.example.com/products" k6 run load-test.js

A test result should update the capacity model, not live as an isolated report. If the service passed 250 virtual users but database write latency began rising at the end, the next model should record both the pass and the emerging bottleneck. Capacity planning improves when each validation run reduces uncertainty.
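
One detail to check before interpreting a run like this: virtual users are not requests per second. In a script where each virtual user sends one request and then sleeps, throughput is roughly virtual users divided by (response time + sleep time). The sketch below works through that estimate; the function names and the 250 ms average response time are hypothetical, and real throughput also depends on ramping and errors.

```javascript
// Approximate throughput for a loop of "one request, then sleep".
function estimatedRps(virtualUsers, avgResponseMs, sleepSeconds) {
  const iterationSeconds = avgResponseMs / 1000 + sleepSeconds;
  return virtualUsers / iterationSeconds;
}

// Virtual users needed to reach a target request rate with the same loop.
function virtualUsersForRps(targetRps, avgResponseMs, sleepSeconds) {
  const iterationSeconds = avgResponseMs / 1000 + sleepSeconds;
  return Math.ceil(targetRps * iterationSeconds);
}

// 250 virtual users, a hypothetical 250 ms response time, 1 s sleep.
console.log('Estimated RPS at 250 VUs:', estimatedRps(250, 250, 1));
// Virtual users needed to exercise a 390 RPS requirement.
console.log('VUs needed for 390 RPS:', virtualUsersForRps(390, 250, 1));
```

Under these assumptions, 250 virtual users generate about 200 requests per second, so a capacity question phrased in RPS needs the stage targets sized accordingly.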


Control Cost Without Undermining Reliability


Capacity planning is also cost planning. Over-provisioning may hide reliability risk for a while, but it teaches the organization to solve every uncertainty with spend. Under-provisioning saves money until the service misses its SLO at the worst possible time. The SRE skill is to spend deliberately where reliability requires it and remove waste where it does not.

```mermaid
xychart-beta
    title "Cost vs Reliability"
    x-axis "Reliability (%)" ["90", "95", "99", "99.9", "99.99"]
    y-axis "Relative Cost" 0 --> 100
    line [20, 30, 50, 80, 100]
```

The curve is steep because the last increments of reliability require redundancy, faster recovery, better testing, stronger observability, and more idle headroom. That does not mean high reliability is wrong. It means the service tier, customer impact, revenue exposure, and error budget should justify the capacity spend.

| Cost Metric | Formula | Why It Helps |
| --- | --- | --- |
| Cost per request | Monthly service cost / successful requests | Shows whether efficiency changes as traffic grows |
| Cost per checkout | Monthly service cost / completed checkouts | Connects spend to business value |
| Cost per active user | Monthly service cost / active users | Useful for product-level planning |
| Cost per job | Batch platform cost / completed jobs | Helps tune worker capacity and scheduling |
| Idle capacity cost | Unused provisioned capacity * unit price | Makes headroom trade-offs explicit |
| Error-budget risk cost | Estimated impact of capacity-related failures | Balances reliability investment against outage risk |
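
As a rough illustration of the cost-per-request and idle-capacity rows, the sketch below turns the two formulas into small functions. The function names and every sample figure are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cost_per_unit(monthly_cost: float, units_of_work: int) -> float:
    """Monthly service cost divided by completed units of work."""
    if units_of_work <= 0:
        raise ValueError("units_of_work must be positive")
    return monthly_cost / units_of_work


def idle_capacity_cost(provisioned: float, used: float, unit_price: float) -> float:
    """Unused provisioned capacity multiplied by the unit price."""
    return max(provisioned - used, 0.0) * unit_price


# Hypothetical month: $12,000 of spend and 400 million successful requests.
per_request = cost_per_unit(12_000, 400_000_000)

# Hypothetical fleet: 40 vCPUs provisioned, 25 used on average, $30 per vCPU-month.
idle = idle_capacity_cost(40, 25, 30.0)

print(f"cost per million requests: ${per_request * 1_000_000:.2f}")
print(f"idle capacity cost: ${idle:.2f} per month")
```

Tracking the same two numbers month over month is what makes a headroom decision visible as a cost rather than an invisible default.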

In Kubernetes, capacity planning must include resource requests and limits because the scheduler uses requests to place pods. If requests are too high, nodes appear full while real utilization is low. If requests are too low, the cluster overcommits and pods fight under load. Limits add another layer: CPU limits can throttle latency-sensitive services, while missing memory limits can let one pod pressure a node.

| Signal | Likely Meaning | Action |
| --- | --- | --- |
| CPU usage far below request | Over-requested CPU | Lower requests or consolidate workloads after testing |
| CPU throttling with latency spikes | Limit too tight or bursty workload | Revisit CPU limits and autoscaling target |
| Memory usage near limit | OOM risk | Increase limit, reduce footprint, or split workload |
| Node allocatable exhausted but utilization low | Requests too conservative | Audit requests against measured usage |
| Frequent scale-up and scale-down | Autoscaler instability | Tune thresholds and stabilization windows |

Right-sizing should be done with evidence, not by blindly cutting requests. The safe pattern is to examine percentile usage over representative periods, account for peak and startup behavior, adjust a small group, and watch latency and eviction signals. Cost savings that create retry storms are not savings.
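
One way to make "percentile usage plus a margin" concrete is sketched below. The 95th percentile, the 20% safety margin, and the sample usage values are assumptions for illustration, not a recommended policy; the output is a starting point for a small, monitored rollout rather than a value to apply blindly.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of integer samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, ceil_div(pct * len(ordered), 100))
    return ordered[rank - 1]


def suggest_cpu_request(usage_millicores: list[int], pct: int = 95, margin_pct: int = 20) -> int:
    """Suggested request: percentile usage plus a safety margin (assumed policy)."""
    return ceil_div(percentile(usage_millicores, pct) * (100 + margin_pct), 100)


# Hypothetical per-interval CPU usage in millicores, including one startup spike.
usage = [120, 135, 150, 140, 160, 155, 145, 300, 150, 138]
suggested = suggest_cpu_request(usage)
print(f"median usage: {percentile(usage, 50)}m, suggested request: {suggested}m")
```

In this sample the single startup spike dominates the 95th percentile, which is why the paragraph above asks for representative periods and explicit treatment of peak and startup behavior.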

Different infrastructure options fit different parts of the capacity model. Predictable baseline demand can often use committed or reserved capacity. Spiky stateless workloads may fit autoscaling nodes. Interruptible capacity is attractive for batch and redundant workers but risky for single points of failure.

| Workload | Suitable Capacity Option | Caution |
| --- | --- | --- |
| Steady baseline API | Reserved or committed capacity | Do not reserve for unproven demand |
| Stateless burst workers | Autoscaled on-demand nodes | Startup time must fit the spike shape |
| Batch processing | Spot or preemptible instances | Jobs must tolerate interruption |
| Stateful database | Managed scaling or planned tier changes | Test failover and storage behavior |
| Critical low-latency path | Dedicated headroom | Cost must be tied to SLO and business impact |

Scaling down is part of the plan. Many teams automate scale-up because outages are visible, then leave scale-down manual because waste is quieter. A complete capacity plan defines when temporary event capacity expires, how quickly autoscalers remove replicas once load falls, and which dashboards show cost returning to baseline.

```yaml
event_capacity_policy:
  event_name: product-launch
  pre_scale_start: "2026-08-14T07:00:00Z"
  planned_peak_end: "2026-08-15T03:00:00Z"
  scale_down_review: "2026-08-15T06:00:00Z"
  temporary_max_replicas: 60
  normal_max_replicas: 24
  required_checks:
    - "p95 latency below SLO for 2 hours"
    - "queue depth back to baseline"
    - "error rate below alert threshold"
    - "business owner confirms campaign traffic ended"
```
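
A policy like this only helps if something evaluates it. The sketch below shows one possible gate, assuming a hypothetical helper that receives the review time and the result of each required check; how the check results are collected is left out on purpose.

```python
from datetime import datetime, timezone


def scale_down_allowed(now: datetime, review_time: datetime, check_results: dict[str, bool]) -> bool:
    """Allow scale-down only after the review time and when every check passed."""
    return now >= review_time and all(check_results.values())


# Values copied from the policy; the check outcomes are hypothetical.
review = datetime(2026, 8, 15, 6, 0, tzinfo=timezone.utc)
checks = {
    "p95 latency below SLO for 2 hours": True,
    "queue depth back to baseline": True,
    "error rate below alert threshold": True,
    "business owner confirms campaign traffic ended": False,
}

now = datetime(2026, 8, 15, 7, 0, tzinfo=timezone.utc)
decision = scale_down_allowed(now, review, checks)
max_replicas = 24 if decision else 60  # normal vs temporary maximum
print(f"scale down: {decision}, max replicas stays at {max_replicas}")
```

Here the business confirmation is still missing, so the temporary maximum stays in place even though the technical checks are green.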

Capacity alerts should fire when action is still possible. A page at 99% CPU often arrives too late, while a ticket when forecast demand crosses tested capacity gives the team time to plan. The best alert depends on lead time: fast autoscaling can use near-real-time signals, but database tier changes and vendor quotas need early warning from forecasts.

Alerting on symptoms and causes together prevents confusion. A high-CPU alert with no latency impact may be a warning. High latency with rising queue depth and saturated workers is user-impacting capacity pressure. Forecasted event demand above vendor quota is not an incident yet, but it should create tracked work before the deadline.

| Alert Type | Example Condition | Response |
| --- | --- | --- |
| Early warning | Forecast demand exceeds 75% of tested capacity in thirty days | Plan capacity change |
| Autoscaler pressure | HPA at max replicas for fifteen minutes | Raise max, reduce load, or fix bottleneck |
| Saturation | Queue depth grows for ten minutes while workers are busy | Add workers or shed load |
| SLO impact | p95 latency violates SLO during high utilization | Incident response and mitigation |
| Cost drift | Cost per request rises 25% week over week | Investigate over-provisioning or inefficient code |
| External quota | Vendor throttling begins before service saturation | Raise quota or reduce dependent calls |
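
The early-warning row depends on knowing how much lead time remains. A minimal sketch, assuming simple compound monthly growth and a hypothetical helper name, estimates how many months pass before forecast peak demand crosses a chosen fraction of tested capacity.

```python
from typing import Optional


def months_until_threshold(
    current_peak: float,
    monthly_growth: float,
    tested_capacity: float,
    threshold: float = 0.75,
    horizon_months: int = 36,
) -> Optional[int]:
    """Months until forecast peak exceeds threshold * tested capacity, or None."""
    limit = tested_capacity * threshold
    demand = current_peak
    for month in range(horizon_months + 1):
        if demand > limit:
            return month
        demand *= 1 + monthly_growth
    return None


# Hypothetical service: 800 RPS peak, 10% monthly growth, 1,200 RPS tested capacity.
lead_time = months_until_threshold(800, 0.10, 1200)
print(f"forecast crosses 75% of tested capacity in {lead_time} month(s)")
```

If the lead time is shorter than the time needed to change a database tier or raise a vendor quota, the ticket should already exist.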

When using Kubernetes commands in exercises or operations, this project often uses the alias k for kubectl after explaining it once. For example, an operator might inspect an autoscaler with k describe hpa orders-api-hpa, then compare desired replicas, current metrics, and scaling events. The command is not the lesson; the lesson is to connect the controller’s decision with the capacity model.

```shell
kubectl describe hpa orders-api-hpa
kubectl top pods -l app=orders-api
kubectl get events --sort-by=.lastTimestamp
```

Runbooks should encode decision logic, not just commands. A useful capacity runbook tells the responder how to decide whether to scale replicas, raise node capacity, protect a dependency, enable degradation, or escalate to a vendor. It also states what not to do, such as adding API replicas when the database is already the bottleneck.

```text
Capacity runbook decision flow:

If user latency is healthy and forecast crosses threshold:
    Open planned capacity work and schedule validation.
If latency is rising and API CPU is saturated:
    Increase replicas or reduce CPU-heavy optional work.
If latency is rising and database connections are saturated:
    Do not add API replicas until connection behavior is controlled.
If external quota is throttling:
    Enable graceful degradation and contact vendor escalation path.
If autoscaler is at max replicas:
    Check node capacity, startup failures, and bottleneck tier before raising max.
```
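
The same decision flow can be sketched as a small function, which makes it easier to review and test. The signal names and the `CapacitySignals` type are hypothetical. One design choice is an assumption of this sketch rather than part of the runbook text: database saturation is checked before CPU saturation so that a constrained database blocks replica growth.

```python
from dataclasses import dataclass


@dataclass
class CapacitySignals:
    latency_rising: bool = False
    forecast_over_threshold: bool = False
    api_cpu_saturated: bool = False
    db_connections_saturated: bool = False
    vendor_throttling: bool = False
    autoscaler_at_max: bool = False


def recommend_actions(s: CapacitySignals) -> list[str]:
    """Map observed signals to the runbook's recommended actions."""
    actions = []
    if not s.latency_rising and s.forecast_over_threshold:
        actions.append("Open planned capacity work and schedule validation.")
    if s.latency_rising and s.db_connections_saturated:
        # Checked first so a saturated database prevents adding more callers.
        actions.append("Do not add API replicas until connection behavior is controlled.")
    elif s.latency_rising and s.api_cpu_saturated:
        actions.append("Increase replicas or reduce CPU-heavy optional work.")
    if s.vendor_throttling:
        actions.append("Enable graceful degradation and contact vendor escalation path.")
    if s.autoscaler_at_max:
        actions.append("Check node capacity, startup failures, and bottleneck tier before raising max.")
    return actions


observed = CapacitySignals(
    latency_rising=True,
    api_cpu_saturated=True,
    db_connections_saturated=True,
    autoscaler_at_max=True,
)
for action in recommend_actions(observed):
    print("-", action)
```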

  1. Large streaming services sometimes prepare extra capacity and resilience measures ahead of major releases because demand can be predictable in timing but uncertain in magnitude.

  2. The thundering herd problem can turn a recovery into a second outage when many clients retry at once, so capacity planning often includes retry budgets, jitter, backoff, and load shedding.

  3. In many environments, raising managed-service limits, database throughput, network quotas, or third-party vendor capacity can take longer than scaling application pods.

  4. Queue depth is often a better early warning than CPU for worker systems because it shows that arrivals are exceeding completions before every worker is visibly saturated.
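
The queue-depth point can be seen with simple arithmetic: whenever arrivals exceed completions, the backlog grows linearly even if no single worker looks saturated. The sketch below assumes constant rates, and all figures are hypothetical.

```python
from typing import Optional


def backlog_after(seconds: float, arrival_rate: float, completion_rate: float,
                  initial_depth: float = 0.0) -> float:
    """Queue depth after a period of constant arrival and completion rates."""
    return max(0.0, initial_depth + (arrival_rate - completion_rate) * seconds)


def seconds_until_depth(limit: float, arrival_rate: float, completion_rate: float,
                        initial_depth: float = 0.0) -> Optional[float]:
    """Seconds until the queue reaches limit, or None if it is not growing."""
    if initial_depth >= limit:
        return 0.0
    net = arrival_rate - completion_rate
    if net <= 0:
        return None
    return (limit - initial_depth) / net


# Hypothetical worker pool: 520 jobs/s arriving, 500 jobs/s completing.
depth = backlog_after(600, 520, 500)
warning = seconds_until_depth(6000, 520, 500)
print(f"backlog after 10 minutes: {depth:.0f} jobs")
print(f"time to reach 6,000 queued jobs: {warning:.0f} s")
```

A 4% imbalance between arrivals and completions looks harmless on a CPU graph, yet in this example it produces a 12,000-job backlog in ten minutes.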


| Mistake | Why It Fails | Better Practice |
| --- | --- | --- |
| Planning from average traffic | Peaks, not averages, break SLOs and trigger incidents | Use peak windows, percentiles, and event scenarios |
| Watching only CPU | The first limit may be memory, connections, I/O, queues, or vendor quota | Monitor every tier and correlate saturation with latency |
| Adding replicas to the wrong tier | More callers can overload the constrained dependency faster | Identify the bottleneck before scaling |
| Trusting autoscaling without measuring lag | Pods, nodes, readiness, and warm-up all take time | Test spike behavior and pre-scale known events |
| Forgetting scale-down | Temporary event capacity becomes permanent waste | Define expiry checks and cost alerts |
| Load testing unrealistic paths | Mocked or narrow tests miss real dependency limits | Use representative user journeys and request mixes |
| Treating forecasts as exact | False precision hides uncertainty and risk | Model expected, high, and extreme scenarios |
| Ignoring business priority | All requests fail together when capacity runs out | Design graceful degradation for less critical work |

Your checkout API currently peaks at 800 RPS and load testing shows SLO-safe capacity at 1,200 RPS. Product expects 10% monthly organic growth and a campaign in three months that will multiply checkout traffic by 1.8 during launch week. The team wants 30% headroom during the event. What capacity target should you plan for, and why is current headroom not enough evidence?

Show Answer

The three-month organic forecast is 800 * (1.10 ^ 3), which is about 1,065 RPS. Applying the campaign multiplier gives about 1,917 RPS, and adding 30% headroom gives a required capacity target of about 2,492 RPS. Current headroom is only the difference between today’s peak and today’s measured capacity, so it does not include future growth or event demand. The plan should raise tested checkout capacity before the campaign and validate it with a load test that reflects the campaign request mix.
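
The arithmetic in this answer can be checked with a short sketch. The function name is hypothetical, and the model assumes compound monthly growth with the event multiplier applied after organic growth, as in the answer above.

```python
def required_capacity(current_peak: float, monthly_growth: float, months: int,
                      event_multiplier: float, headroom: float) -> tuple[float, float, float]:
    """Return (organic peak, event peak, capacity target including headroom)."""
    organic = current_peak * (1 + monthly_growth) ** months
    event = organic * event_multiplier
    target = event * (1 + headroom)
    return organic, event, target


organic, event, target = required_capacity(800, 0.10, 3, 1.8, 0.30)
print(f"organic: {organic:.0f} RPS, event: {event:.0f} RPS, target: {target:.0f} RPS")
```

The target is roughly double the 1,200 RPS that load testing has proven so far, which is why today's headroom says little about readiness for the campaign.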

A team sees p95 latency rising during peak traffic. API pods are at 52% CPU, Redis is stable, PostgreSQL connection usage is 96%, and adding API replicas during the last incident made errors worse. What should the team investigate first, and what action should they avoid?

Show Answer

The team should investigate PostgreSQL connection saturation first because it is the clearest bottleneck signal and it correlates with rising latency. They should avoid blindly adding more API replicas because additional replicas may open more connections and increase pressure on the already constrained database. Better next steps include reviewing pool settings, query latency, lock contention, database tier capacity, and whether request paths can reduce or defer database work.

Your HPA scales on CPU at 70%, but a viral spike doubled traffic in one minute. New pods took four minutes to become ready, and users saw errors before the autoscaler caught up. How would you redesign the capacity strategy for the next similar risk?

Show Answer

The failure was caused by autoscaling lag being longer than the spike’s arrival time. The redesign should combine more baseline headroom, faster startup or pre-warmed capacity, a lower CPU threshold or a better scaling signal, and protection mechanisms such as rate limits or graceful degradation. If the spike is predictable, scheduled or predictive pre-scaling should add replicas and nodes before demand arrives. The team should also load test spike behavior specifically, not just steady-state load.
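
To see why lag matters, it helps to separate what the autoscaler wants from what is actually ready. The sketch below uses the core ratio from the Kubernetes HPA documentation, desired replicas = ceil(current replicas * current metric / target metric), and ignores tolerance, stabilization windows, and readiness handling. The replica counts and traffic figures are hypothetical, and CPU is assumed to scale linearly with traffic.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA ratio: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def overload_during_lag(demand_rps: float, ready_capacity_rps: float) -> float:
    """Demand as a multiple of the capacity that is ready right now."""
    return demand_rps / ready_capacity_rps


# Hypothetical: 10 replicas at a 70% CPU target; doubled traffic pushes CPU to about 140%.
wanted = desired_replicas(10, 140, 70)

# Until new pods are ready, existing capacity faces twice its planned demand.
overload = overload_during_lag(2000, 1000)

print(f"autoscaler wants {wanted} replicas; overload during lag is {overload:.1f}x")
```

The controller reaches the right answer quickly, but for the four minutes before new pods are ready the service runs at twice its planned load, so headroom, pre-scaling, or load shedding has to cover that window.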

A batch processing platform runs cheaply on spot instances, but month-end reports must finish by 06:00 for finance users. During the last month-end run, spot interruptions delayed completion even though average compute cost was low. How should the capacity model change?

Show Answer

The model should include deadline reliability, interruption rate, queue backlog, and a baseline of non-interruptible capacity for critical windows. Spot instances can remain useful for flexible work, but the month-end deadline needs enough stable capacity to complete required jobs even when interruptions occur. The team should model cost per completed report and deadline miss risk, then reserve or schedule reliable workers for the month-end period.

A service team load tests only the /health endpoint and proves it can serve 10,000 RPS with excellent latency. The next production event still fails on checkout. What was wrong with the validation approach?

Show Answer

The test validated a cheap endpoint rather than the real user journey that stresses checkout dependencies. A health endpoint may avoid authentication, database writes, payment vendor calls, inventory locks, and message queues, so it cannot prove checkout capacity. The team should build a representative load test with realistic request mix, data setup, dependency behavior, and SLO thresholds for the critical transaction.

Cloud cost for a search API rose 35% while successful requests increased only 8%. Latency is healthy, CPU usage is consistently below requests, and the autoscaler rarely scales down after business hours. What should you evaluate before cutting resources?

Show Answer

The team should evaluate request sizing, autoscaler scale-down behavior, actual percentile usage, and business-hour traffic shape before cutting resources. The signals suggest over-requested CPU or stale scale-down policy, but resource cuts should be tested because search latency may depend on cache warmth or burst behavior. A safe plan would adjust requests or scale-down windows gradually, watch p95 and p99 latency, and track cost per request after the change.

A payment vendor contract allows 350 requests per second, but your Black Friday capacity model shows checkout may need 420 payment calls per second with headroom. Engineering can scale the API to 600 RPS. What should the capacity plan say?

Show Answer

The plan should identify the payment vendor quota as the limiting capacity, regardless of API replica capacity. Engineering should request a quota increase early, reduce payment call amplification if possible, and define graceful degradation or queueing behavior if the vendor limit remains. Scaling the API alone would not solve the bottleneck and could increase retries against the vendor. The load test should include vendor-like limits or a realistic simulator so the team can verify behavior before the event.


In this exercise, you will create a capacity plan for a service with organic growth, event spikes, autoscaling, cost pressure, and validation requirements. Use a spreadsheet, markdown document, or notebook, but keep the assumptions visible so another engineer can review the plan. The purpose is not to produce a perfect forecast; it is to practice making capacity decisions from explicit evidence.

You operate an API service that handles product browsing and checkout for a growing platform. Current peak traffic is 1,000 RPS, and the current measured SLO-safe capacity is 1,500 RPS. Organic growth is expected to be 10% month over month. A major product launch in month three is expected to double traffic during the launch window. A Black Friday event in month five is expected to produce four times the organic peak for that month. The business wants 30% headroom for user-facing traffic.

| Input | Value |
| --- | --- |
| Current peak traffic | 1,000 RPS |
| Current SLO-safe capacity | 1,500 RPS |
| Organic growth | 10% month over month |
| Launch event | Month 3, 2x organic traffic |
| Black Friday event | Month 5, 4x organic traffic |
| Required headroom | 30% above expected traffic |
| Capacity unit cost | $0.10 per 100 RPS per hour |

Calculate organic traffic for each month, then apply the event multiplier where relevant. Treat month one as the current 1,000 RPS baseline unless your instructor gives a different starting point. Show the formula you used so the reviewer can see whether event demand was applied before or after organic growth.

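| Month | Organic Peak Traffic | Event Multiplier | Total Expected Traffic |
| --- | --- | --- | --- |
| 1 | 1,000 RPS | 1x | |
| 2 | | 1x | |
| 3 | | 2x launch | |
| 4 | | 1x | |
| 5 | | 4x Black Friday | |
| 6 | | 1x | |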

For each month, multiply expected traffic by 1.3 to include 30% headroom. Compare required capacity with the current SLO-safe capacity of 1,500 RPS and mark the first month where capacity is insufficient. Explain whether that gap can be solved by replicas alone or whether dependencies must be tested.

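| Month | Expected Traffic | Required Capacity with 30% Headroom | Current Capacity | Capacity Gap |
| --- | --- | --- | --- | --- |
| 1 | | | 1,500 RPS | |
| 2 | | | 1,500 RPS | |
| 3 | | | 1,500 RPS | |
| 4 | | | 1,500 RPS | |
| 5 | | | 1,500 RPS | |
| 6 | | | 1,500 RPS | |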

Create a short service map and choose the most likely first bottleneck. Do not assume the API pods are the bottleneck just because they are visible. Consider database writes, cache hit rate, external payment calls, node capacity, and pod startup behavior.

```text
+------------------+       +------------------+       +------------------+
|    Customers     | ----> |   API Service    | ----> |     Database     |
|   Forecast RPS   |       |   Replicas/CPU   |       |   Writes/locks   |
+------------------+       +------------------+       +------------------+
                                    |
                                    v
                           +------------------+
                           |  Payment Vendor  |
                           |    QPS quota     |
                           +------------------+
```

Write your hypothesis in this form: “At month five load, the first bottleneck will probably be ______ because ______. I will confirm this by measuring ______ during a staged load test.”

Design a scaling plan that includes normal autoscaling and event pre-scaling. Your plan should include minimum replicas, maximum replicas, scale-up threshold, scale-down behavior, and the load level you will test before each event. If you choose a CPU target, explain why CPU is a good enough signal or name the custom metric you would prefer.

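| Setting | Your Decision | Reason |
| --- | --- | --- |
| Minimum replicas | | |
| Maximum replicas | | |
| Scale-up signal | | |
| Scale-up threshold | | |
| Scale-down window | | |
| Node capacity buffer | | |
| Dependency quota action | | |

| Event | Month | Pre-Scale Target | Validation Test |
| --- | --- | --- | --- |
| Product launch | 3 | | |
| Black Friday | 5 | | |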

Estimate monthly cost using the simplified rate of $0.10 per 100 RPS capacity per hour and 720 hours per month. This simplified model intentionally ignores reserved discounts, spot pricing, and regional differences so you can focus on the relationship between capacity and cost. After calculating the table, write one paragraph explaining whether the cost is justified by SLO and business risk.

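| Month | Planned Capacity | Hours | Estimated Cost |
| --- | --- | --- | --- |
| 1 | | 720 | |
| 2 | | 720 | |
| 3 | | 720 | |
| 4 | | 720 | |
| 5 | | 720 | |
| 6 | | 720 | |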
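
If you prefer a notebook to a spreadsheet, the simplified cost rule can be sketched as a single function. The function name is hypothetical, and the example uses only the current 1,500 RPS capacity, so the planned-capacity values for each month are still yours to decide.

```python
def monthly_capacity_cost(capacity_rps: float, rate_per_100_rps_hour: float = 0.10,
                          hours: int = 720) -> float:
    """Cost of holding capacity_rps for a month under the exercise's simplified rate."""
    return capacity_rps / 100 * rate_per_100_rps_hour * hours


# Current SLO-safe capacity from the scenario.
baseline_cost = monthly_capacity_cost(1500)
print(f"holding 1,500 RPS for one month costs about ${baseline_cost:,.2f}")
```

Because the model is linear, doubling planned capacity doubles the monthly cost, which makes the cost of event headroom easy to state in the written justification.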

Write the validation plan for the two major events. Each test should include the target load, duration, stopping conditions, dashboards to watch, and the first action if a bottleneck appears. Include at least one graceful-degradation option for Black Friday in case demand exceeds the extreme scenario.

```markdown
## Validation Plan

### Product Launch Test
- Target load:
- Duration:
- Success criteria:
- Stop conditions:
- Bottleneck signals:
- Follow-up action:

### Black Friday Test
- Target load:
- Duration:
- Success criteria:
- Stop conditions:
- Bottleneck signals:
- Graceful degradation option:
```

  • Forecast includes organic growth for all six months.
  • Event multipliers are applied to the correct months.
  • Required capacity includes 30% headroom after forecast demand.
  • The first capacity gap is identified and explained.
  • Bottleneck hypothesis names a specific resource or dependency.
  • Autoscaling plan includes scale-up and scale-down behavior.
  • Event pre-scaling is scheduled before launch and Black Friday.
  • Load tests validate SLO-safe capacity, not only raw throughput.
  • Cost estimate ties capacity spend to business and reliability risk.
  • Runbook includes at least one graceful-degradation action.

Capacity planning is a reliability practice, a cost practice, and a communication practice. The engineering work is to measure limits, forecast demand, validate assumptions, and automate repeatable decisions. The communication work is to make uncertainty, lead time, and business risk visible before the service is under pressure.

A strong capacity plan starts with the SLO, not with infrastructure inventory. It asks what user promise must hold, what demand is expected, which component limits throughput, how much headroom is needed, and when action must happen. It then proves the answer through load tests and updates the model when reality disagrees.

The senior habit is to plan for behavior, not components. More replicas, bigger nodes, and larger databases are tools. They are not the plan. The plan is the chain of evidence that shows the service can absorb expected demand, fail gracefully under unexpected demand, and return to efficient cost when the event is over.

Continue with Platform Engineering Discipline to learn how SRE practices become reusable internal platforms for many teams.