Module 5.2: CPU & Scheduling

Linux Performance | Complexity: [MEDIUM] | Time: 30-35 min

Prerequisites

Before starting this module:

Required: Module 5.1: USE Method
Required: Module 2.2: cgroups
Helpful: Understanding of processes and threads

Learning Outcomes

After completing this module, you will be able to:

Diagnose CPU contention and bottlenecks using standard Linux utilities like top, mpstat, and vmstat.
Evaluate the impact of Kubernetes CPU requests and limits on application performance and throttling.
Implement strategies to optimize CPU resource allocation for containerized workloads in Kubernetes.
Compare the roles of the Completely Fair Scheduler (CFS) and cgroups in managing CPU resources.
Troubleshoot high load averages and unexpected application latency stemming from CPU scheduling issues.

Why This Module Matters: The Hidden Cost of CPU Throttling

Imagine a major e-commerce platform, let’s call it “MegaMart,” preparing for its biggest sales event of the year. Their site reliably handles millions of requests per day, and their monitoring dashboards show average CPU utilization well within acceptable limits—never exceeding 30% on most application servers. Yet, when the sales event goes live, customers report slow page loads, transactions time out, and carts mysteriously empty. MegaMart’s engineers scramble, checking everything from network latency to database performance, but find no obvious culprits. The CPUs look idle on average. What went wrong?

This scenario, tragically common in cloud-native environments, is often caused by an insidious performance killer: CPU throttling. Despite seemingly low average utilization, poorly configured CPU limits in Kubernetes can introduce micro-pauses for your applications, turning smooth execution into a stop-and-go crawl. These tiny, forced delays accumulate, dramatically increasing latency for user-facing requests and causing cascading failures across microservices. MegaMart’s seemingly “idle” CPUs were actually forcing critical application processes to wait, even when ample capacity was available, leading to a disastrous customer experience and millions in lost revenue. Understanding CPU scheduling isn’t just an academic exercise; it’s a critical skill for preventing such catastrophes and ensuring your applications perform reliably under pressure.

CPU Fundamentals: Understanding How Linux Sees Your Processor

Before diving into how the CPU is scheduled, it’s crucial to understand how Linux perceives and measures its utilization. This section covers the basic tools and concepts for monitoring CPU activity.

CPU Time Categories

The operating system categorizes CPU time into various states, providing a granular view of where your processor’s cycles are being spent. Understanding these categories is the first step in diagnosing CPU-related performance issues.

# View CPU time categories
top -bn1 | grep "Cpu(s)"
# %Cpu(s):  5.2 us,  2.1 sy,  0.0 ni, 92.0 id,  0.5 wa,  0.0 hi,  0.2 si,  0.0 st

# What each means:

Category	Meaning	High Value Indicates
`us`	User - Application code	Application CPU usage
`sy`	System - Kernel code	System calls, drivers
`ni`	Nice - Low priority user	Nice’d processes running
`id`	Idle - Nothing to do	Unused CPU capacity
`wa`	I/O Wait - Waiting for disk	I/O bottleneck
`hi`	Hardware IRQ - Interrupts	High interrupt load
`si`	Software IRQ - Soft interrupts	Network/timer handling
`st`	Steal - VM overhead	Hypervisor stealing time

A high %us means your applications are working hard. A high %sy might indicate heavy system call activity or device driver work. Crucially, a high %wa signals an I/O bottleneck: the CPU is idle while tasks wait for outstanding I/O (typically disk or other block devices), so the CPU itself is not the limiting factor.
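The percentages that top reports are derived from cumulative tick counters in /proc/stat. The following is a minimal sketch of that arithmetic, assuming the standard field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_pct and the sample numbers are made up for illustration.

```shell
#!/bin/sh
# Compute CPU time percentages between two samples of the aggregate "cpu" line
# from /proc/stat. Assumed field order: user nice system idle iowait irq softirq steal.
cpu_pct() {
  # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
    NR == 2 {
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total == 0) { print "no elapsed ticks"; exit 1 }
      printf "us=%.1f sy=%.1f id=%.1f wa=%.1f st=%.1f\n",
        100 * d[2] / total, 100 * d[4] / total, 100 * d[5] / total,
        100 * d[6] / total, 100 * d[9] / total
    }'
}

# Made-up sample values (not from a real host); the deltas add up to 1000 ticks.
before="cpu 1000 0 500 8000 100 0 20 0 0 0"
after="cpu 1520 0 710 8200 150 0 40 0 0 0"
cpu_pct "$before" "$after"
# us=52.0 sy=21.0 id=20.0 wa=5.0 st=0.0
```

On a live system, the two arguments could be captured one second apart, for example with `grep '^cpu ' /proc/stat` run before and after a `sleep 1`.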

Understanding Load Average

The load average is a metric that gives you a sense of how many processes are either currently running or waiting to run (including those waiting for disk I/O) over 1, 5, and 15-minute intervals. It’s often misunderstood, as it doesn’t directly represent CPU utilization.

# Show load average
uptime
# 10:23:45 up 5 days,  3 users,  load average: 2.15, 1.87, 1.42
#                                              1m    5m    15m

cat /proc/loadavg
# 2.15 1.87 1.42 3/245 12345
# load averages   running/total  last PID

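The load average of 2.15, 1.87, 1.42 means that over the last minute an average of roughly 2.15 tasks were running, waiting for a CPU, or blocked in uninterruptible I/O. (The figures are exponentially damped moving averages rather than exact window means.) On a system with N CPU cores, a load average equal to N indicates full utilization without queuing, assuming all of that load is CPU demand. Anything significantly above N indicates that processes are contending for CPU resources or waiting on I/O.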

graph TD
    subgraph "4-core system with load average of 4.0"
        CPU0_P1[P1] --> CPU0(CPU 0)
        CPU1_P2[P2] --> CPU1(CPU 1)
        CPU2_P3[P3] --> CPU2(CPU 2)
        CPU3_P4[P4] --> CPU3(CPU 3)
        subgraph "4 processes running"
            P1_running(P1)
            P2_running(P2)
            P3_running(P3)
            P4_running(P4)
        end
        P1_running --"assigned to"--> CPU0_P1
        P2_running --"assigned to"--> CPU1_P2
        P3_running --"assigned to"--> CPU2_P3
        P4_running --"assigned to"--> CPU3_P4
        PerfUtil(Load = 4.0 = Perfect utilization)
    end
    subgraph "4-core system with load average of 8.0"
        CPU0_P1_8[P1] --> CPU0_8(CPU 0)
        CPU1_P2_8[P2] --> CPU1_8(CPU 1)
        CPU2_P3_8[P3] --> CPU2_8(CPU 2)
        CPU3_P4_8[P4] --> CPU3_8(CPU 3)
        subgraph "4 running"
            P1_running_8(P1)
            P2_running_8(P2)
            P3_running_8(P3)
            P4_running_8(P4)
        end
        P1_running_8 --"assigned to"--> CPU0_P1_8
        P2_running_8 --"assigned to"--> CPU1_P2_8
        P3_running_8 --"assigned to"--> CPU2_P3_8
        P4_running_8 --"assigned to"--> CPU3_P4_8
        Queue_P5(P5)
        Queue_P6(P6)
        Queue_P7(P7)
        Queue_P8(P8)
        subgraph "4 waiting"
            Waiting(Queue: [P5] [P6] [P7] [P8])
        end
        Overload(Load = 8.0 = 100% utilized + 4 waiting)
    end
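To make the comparison against the core count concrete, here is a small sketch that divides the 1-minute load by the number of CPUs and reports whether the three averages are rising or falling. The helper name load_report is hypothetical, and a per-core value above 1.0 only suggests saturation, since the load average also counts tasks blocked on I/O.

```shell
#!/bin/sh
# Relate load averages to the CPU count and report the short-term trend.
# Usage: load_report <1m> <5m> <15m> <ncpus>
load_report() {
  awk -v l1="$1" -v l5="$2" -v l15="$3" -v n="$4" 'BEGIN {
    trend = "steady"
    if (l1 > l5 && l5 > l15) trend = "rising"        # newest average is highest
    else if (l1 < l5 && l5 < l15) trend = "falling"  # newest average is lowest
    printf "per-core load (1m): %.2f, trend: %s\n", l1 / n, trend
  }'
}

load_report 2.15 1.87 1.42 4   # values from the uptime example above
# per-core load (1m): 0.54, trend: rising
```

On a live host the inputs could come from `cut -d' ' -f1-3 /proc/loadavg` and `nproc`.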

Pause and predict: You have a 16-core system with a load average of 24. What does this tell you about your system’s performance and workload?

Show Answer

A load average of 24 on a 16-core system indicates that demand exceeds capacity by roughly 50%. If the load is purely CPU demand, then on average 16 processes are running and another 8 are waiting in the run queue, which means CPU contention, delayed processes, and reduced application responsiveness. Because the Linux load average also counts tasks blocked on disk I/O, confirm the cause with vmstat (the r and b columns) before concluding that the CPU is the bottleneck.

CPU Count and Topology

Understanding the physical and logical CPU count on your system helps interpret metrics like load average and mpstat. Modern CPUs often have multiple cores, and each core can have multiple hardware threads (hyperthreading), which appear as separate logical CPUs to the operating system.

# Number of CPUs
nproc
# 4

# Detailed CPU info
lscpu
# CPU(s):              4
# Thread(s) per core:  2
# Core(s) per socket:  2
# Socket(s):           1

# Per-CPU info
cat /proc/cpuinfo | grep "processor\|model name" | head -8

The Linux Scheduler: Orchestrating Processor Time

The Linux kernel is responsible for deciding which process runs on which CPU at any given moment. This complex task is handled by the scheduler, with the Completely Fair Scheduler (CFS) being the default for general-purpose workloads.

Completely Fair Scheduler (CFS)

The Completely Fair Scheduler (CFS) became the default process scheduler in Linux kernel 2.6.23 (released in 2007), replacing the O(1) scheduler with a simpler approach focused on fairness. CFS aims to give every process a “fair” share of CPU time by tracking a metric called virtual runtime (vruntime). Kernels from 6.6 onward implement the fair scheduling class with the EEVDF algorithm instead, but the vruntime and weight concepts described here still apply.

graph TD
    A[CFS SCHEDULER] --> B{Goal: Every process gets fair share of CPU}
    B --> C{Process with LOWEST vruntime runs next}
    C --> D[Red-Black Tree]

    subgraph Red-Black Tree
        P3(P3: 5000) --- Highest(Highest vruntime)
        P1(P1: 3000) --- P5(P5: 4500)
        P2(P2: 2000) --- P4(P4: 2800)
        P3 --> P1
        P3 --> P5
        P1 --> P2
        P1 --> P4
    end

    C --> E(Next to run: P2 (lowest vruntime = 2000))
    E --> F(vruntime increases as process uses CPU)
    F --> G(Higher priority = vruntime increases slower)

CFS maintains a red-black tree (a self-balancing binary search tree) in which each node represents a runnable task, keyed by its vruntime. The scheduler always picks the leftmost node (the one with the smallest vruntime) to run next. As a process consumes CPU time, its vruntime increases. When a process gives up the CPU (e.g., to wait for I/O) or is preempted, its vruntime stops increasing, allowing other processes to “catch up.” This mechanism naturally prioritizes processes that have received less CPU time, ensuring fairness.
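The "lowest vruntime runs next" rule can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the kernel's implementation: it uses a linear scan instead of a red-black tree, a fixed 1 ms tick, and a hypothetical helper named simulate_cfs. Each task's vruntime advances by 1024 / weight per tick, so heavier tasks accumulate vruntime more slowly and are picked more often.

```shell
#!/bin/sh
# Toy model of vruntime-based picking; NOT the kernel's implementation.
# Each tick, the task with the smallest vruntime runs for 1 ms, and its
# vruntime advances by 1 ms scaled by 1024 / weight.
# Usage: simulate_cfs <ticks> <weight>...
simulate_cfs() {
  ticks=$1; shift
  awk -v ticks="$ticks" -v weights="$*" 'BEGIN {
    n = split(weights, w, " ")
    for (i = 1; i <= n; i++) { vr[i] = 0; ran[i] = 0 }
    for (t = 0; t < ticks; t++) {
      pick = 1
      for (i = 2; i <= n; i++) { if (vr[i] < vr[pick]) pick = i }
      ran[pick]++
      vr[pick] += 1024 / w[pick]
    }
    for (i = 1; i <= n; i++) {
      printf "task%d weight=%d ran=%dms\n", i, w[i], ran[i]
    }
  }'
}

simulate_cfs 300 1024 1024 2048
# task1 weight=1024 ran=75ms
# task2 weight=1024 ran=75ms
# task3 weight=2048 ran=150ms
```

The task with twice the weight ends up with twice the CPU time, even though the picker never looks at weights directly; it only compares vruntime values.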

Nice Values and Process Priority

While CFS strives for fairness, sometimes you need to explicitly tell the scheduler that certain processes are more important than others. This is where “nice values” come in. A nice value (or “niceness”) is a user-space priority hint to the CFS.

# View process nice values
ps -eo pid,ni,comm | head -10
#   PID  NI COMMAND
#     1   0 systemd
#   123 -20 migration/0
#   456  19 backup

# Nice range: -20 (highest priority) to 19 (lowest priority)
# Default: 0

# Start process with nice value
nice -n 10 ./my-script.sh

# Change running process
renice 10 -p 1234

# Only root can set negative nice (higher priority)
sudo nice -n -10 ./critical-process

The ni column in ps output shows the nice value. A lower nice value means a higher priority. A process with a nice value of -20 is the highest priority, while 19 is the lowest. The default nice value is 0. The effect of nice values on vruntime is inverse: processes with lower nice values (higher priority) have their vruntime increased at a slower rate, effectively making them run more often.
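To get a feel for how much a nice value changes CPU allocation, the sketch below approximates the scheduler weight as 1024 / 1.25^nice. This formula is an approximation chosen for illustration; the kernel uses a precomputed integer table whose entries differ slightly. The helper names nice_weight and nice_share are hypothetical.

```shell
#!/bin/sh
# Approximate the scheduler weight for a nice value as 1024 / 1.25^nice.
# The kernel uses a precomputed integer table; this formula only approximates it.
nice_weight() {
  awk -v n="$1" 'BEGIN { printf "%.0f\n", 1024 / (1.25 ^ n) }'
}

# Expected CPU share (percent) of task A when A and B compete for one CPU.
# Usage: nice_share <nice_a> <nice_b>
nice_share() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    weight_a = 1024 / (1.25 ^ a)
    weight_b = 1024 / (1.25 ^ b)
    printf "%.1f\n", 100 * weight_a / (weight_a + weight_b)
  }'
}

nice_weight 0     # 1024
nice_weight 10    # 110
nice_share 0 0    # 50.0
nice_share 0 1    # 55.6
```

Under this model, a one-step difference in niceness shifts an even 50/50 split to roughly 55/45, and the gap compounds with each further step.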

Stop and think: You’re running a critical real-time data processing service and a low-priority batch job on the same machine. How would you use nice and renice commands to prioritize the real-time service? What’s a key limitation to consider?

Show Answer

To prioritize the real-time data processing service, you would use `sudo nice -n -10 ./real-time-service` to start it with a higher priority (lower nice value). For the batch job, you could use `nice -n 10 ./batch-job` to assign it a lower priority (higher nice value). If the batch job is already running, you'd use `renice 10 -p <PID>`. A key limitation is that only the root user can set negative nice values, meaning that without root privileges, you can only *lower* a process's priority, not raise it above the default of 0. Also, nice values only apply to the CFS; real-time schedulers (`SCHED_FIFO`, `SCHED_RR`) have their own priority mechanisms.

Real-Time Scheduling Policies

For highly time-sensitive applications where even small delays are unacceptable (e.g., industrial control systems, audio/video processing), Linux offers real-time scheduling policies that bypass the CFS fairness mechanisms. These policies ensure that a real-time process will run as soon as it’s ready, preempting any non-real-time processes.

# Check scheduling policy
chrt -p 1234
# pid 1234's current scheduling policy: SCHED_OTHER
# pid 1234's current scheduling priority: 0

# Policies:
# SCHED_OTHER - Normal (CFS)
# SCHED_FIFO  - Real-time FIFO
# SCHED_RR    - Real-time Round Robin
# SCHED_BATCH - Batch processing
# SCHED_IDLE  - Very low priority

# Set real-time priority (careful!)
sudo chrt -f -p 50 1234

Using real-time policies requires extreme caution. A misbehaving real-time process can monopolize a CPU and starve normal tasks, including kernel threads that run at lower priority, leading to system unresponsiveness or hangs.

CPU Metrics Deep Dive: Going Beyond Averages

While top and uptime give you a good overview, diagnosing complex CPU issues requires a deeper look into per-CPU statistics, context switches, and the run queue.

Per-CPU Statistics

Aggregate CPU utilization can hide problems. One CPU core might be saturated while others are idle, leading to performance bottlenecks for applications tied to the busy core, even if the overall CPU usage looks low. Tools like mpstat provide per-CPU breakdowns.

# Per-CPU utilization
mpstat -P ALL 1
# 10:30:01  CPU    %usr   %sy  %idle  %iowait
# 10:30:02  all    15.2    3.1   80.5     1.2
# 10:30:02    0    20.0    5.0   74.0     1.0
# 10:30:02    1    10.0    2.0   87.0     1.0
# 10:30:02    2    18.0    3.0   78.0     1.0
# 10:30:02    3    13.0    2.0   83.0     2.0

The output of mpstat -P ALL 1 shows the utilization for each individual CPU core (CPU 0, CPU 1, etc.) as well as the average across all CPUs (all). This is invaluable for identifying “hot” cores that might be causing bottlenecks.
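Spotting a hot core by eye gets tedious on machines with many CPUs. The sketch below filters mpstat-style output for CPUs whose %idle falls below a threshold. It locates the CPU and %idle columns from the header line, so it should tolerate the extra columns that real mpstat prints; the function name hot_cpus, the default threshold of 10, and the sample data are assumptions for illustration.

```shell
#!/bin/sh
# Flag CPUs whose %idle is below a threshold in mpstat-style output.
# Column positions are discovered from the header line.
# Usage: mpstat -P ALL 1 1 | hot_cpus 10
hot_cpus() {
  awk -v min_idle="${1:-10}" '
    /%idle/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "CPU") c = i
        if ($i == "%idle") d = i
      }
      next
    }
    c && d && $c ~ /^[0-9]+$/ && $d + 0 < min_idle {
      printf "CPU %s is hot: %.1f%% idle\n", $c, $d
    }
  '
}

# Made-up sample in the simplified column layout used above.
hot_cpus 10 <<'EOF'
10:30:01  CPU    %usr   %sys  %idle  %iowait
10:30:02  all    30.0    3.0   66.0     1.0
10:30:02    0    97.0    2.0    1.0     0.0
10:30:02    1    10.0    2.0   87.0     1.0
EOF
# CPU 0 is hot: 1.0% idle
```

Here the "all" row shows 66% idle, yet CPU 0 is nearly saturated, which is exactly the situation that aggregate figures hide.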

Context Switches

A context switch occurs when the CPU scheduler stops one process from running and starts another. This involves saving the state of the current process and loading the state of the new one, which incurs a performance overhead. A high rate of context switches can indicate that the system is spending a lot of time managing processes rather than doing useful work.

# System-wide context switches
vmstat 1
#  r  b   swpd   free  ...   in   cs
#  2  0      0 123456  ...  500 2000
#                            │    │
#                            │    └── Context switches/sec
#                            └── Interrupts/sec

# Per-process context switches
cat /proc/1234/status | grep ctxt
# voluntary_ctxt_switches:    1000
# nonvoluntary_ctxt_switches: 500

# Voluntary = Process yielded (I/O, sleep)
# Nonvoluntary = Preempted by scheduler

vmstat provides system-wide context switch rates (cs). For a specific process, /proc/<PID>/status shows voluntary_ctxt_switches (the process willingly gave up the CPU, e.g., for I/O or sleeping) and nonvoluntary_ctxt_switches (the scheduler preempted the process because its time slice expired or a higher-priority process became runnable). A high number of non-voluntary context switches often indicates CPU contention.
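As a rough aid for reading those two counters, the following sketch computes what fraction of a process's context switches were nonvoluntary. It reads the text of /proc/<PID>/status on stdin; the function name ctxt_summary is hypothetical. The counters are cumulative since the process started, so comparing two samples taken some time apart gives a better picture of current behavior.

```shell
#!/bin/sh
# Summarise voluntary vs. nonvoluntary context switches from the text of
# /proc/<PID>/status supplied on stdin.
ctxt_summary() {
  awk '
    /^voluntary_ctxt_switches:/    { v = $2 }
    /^nonvoluntary_ctxt_switches:/ { nv = $2 }
    END {
      total = v + nv
      if (total == 0) { print "no context switches recorded"; exit }
      printf "voluntary=%d nonvoluntary=%d nonvoluntary_share=%.1f%%\n", v, nv, 100 * nv / total
    }'
}

# Sample values taken from the example output above.
printf 'voluntary_ctxt_switches:\t1000\nnonvoluntary_ctxt_switches:\t500\n' | ctxt_summary
# voluntary=1000 nonvoluntary=500 nonvoluntary_share=33.3%
```

For a real process, the input would be redirected from its status file, for example `ctxt_summary < /proc/1234/status`.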

Run Queue Depth

The run queue (or runnable queue) is where processes wait for their turn to be scheduled on a CPU. Its length indicates the immediate demand for CPU resources.

# Processes in run queue
vmstat 1
#  r  b   swpd   free ...
#  4  0      0 123456 ...
#  │
#  └── Runnable processes

# Alternative
cat /proc/loadavg
# 4.00 3.50 3.00 2/150 12345
#                 │
#                 └── 2 currently running / 150 total

In vmstat, the r column shows the number of runnable processes (those waiting for or currently using a CPU). A persistently high r value (greater than the number of CPU cores) signifies CPU saturation and contention.

Kubernetes CPU Management: Requests, Limits, and Throttling

Kubernetes, leveraging Linux cgroups, provides powerful mechanisms to manage CPU resources for pods and containers. However, these mechanisms, particularly CPU limits, come with nuances that are critical to understand for optimal performance.

How CPU Requests and Limits Work

In Kubernetes, you define CPU resources using requests and limits in your pod specifications. These translate directly into Linux cgroup parameters.

resources:
  requests:
    cpu: "100m"     # Guaranteed minimum
  limits:
    cpu: "500m"     # Maximum allowed

CPU Requests (cpu: "100m"): The amount of CPU the Kubernetes scheduler reserves for your container when placing the pod on a node, which acts as a guaranteed minimum under contention. It translates to cpu.shares in cgroup v1 (cpu.weight in cgroup v2). cpu.shares is a relative weight; it determines the proportion of CPU your container gets when there is contention for CPU resources.
CPU Limits (cpu: "500m"): This is a hard maximum amount of CPU your container can consume. It translates to cpu.cfs_quota_us and cpu.cfs_period_us in cgroup v1 (both values live in cpu.max in cgroup v2). This mechanism always enforces a cap, even if the node has abundant idle CPU.

graph LR
    subgraph Kubernetes
        K8s_100m["cpu: '100m' (request)"]
        K8s_500m["cpu: '500m' (limit)"]
        K8s_1CPU["'1 CPU'"]
        K8s_2CPU["'2 CPU'"]
    end

    subgraph Linux cgroups
        CG_shares["cpu.shares = 102 (1024 * 0.1 = ~102) --> Relative weight for scheduling"]
        CG_quota50k["cpu.cfs_quota_us = 50000 / cpu.cfs_period_us = 100000 --> 50ms of every 100ms period"]
        CG_quota100k["cpu.cfs_quota_us = 100000 --> Full period allowed"]
        CG_quota200k["cpu.cfs_quota_us = 200000 --> Can use 2 cores simultaneously"]
    end

    K8s_100m --> CG_shares
    K8s_500m --> CG_quota50k
    K8s_1CPU --> CG_quota100k
    K8s_2CPU --> CG_quota200k
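The translation shown in the diagram is simple integer arithmetic. The sketch below reproduces it for cgroup v1, assuming the default 100000 microsecond period and a floor of 2 for cpu.shares; the helper names millicores_to_shares and millicores_to_quota_us are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch of the millicore-to-cgroup-v1 arithmetic described above.
# Assumes the default 100000 us CFS period; helper names are hypothetical.
millicores_to_shares() {
  shares=$(( $1 * 1024 / 1000 ))
  # Assumed floor: very small requests are clamped to a cpu.shares value of 2.
  if [ "$shares" -lt 2 ]; then shares=2; fi
  echo "$shares"
}

# Usage: millicores_to_quota_us <millicores> [period_us]
millicores_to_quota_us() {
  period_us=${2:-100000}
  echo $(( $1 * period_us / 1000 ))
}

millicores_to_shares 100      # 102
millicores_to_quota_us 500    # 50000
millicores_to_quota_us 2000   # 200000
```

Because the division truncates, results can differ by one from hand-rounded figures for some inputs.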

The Problem with CPU Throttling

CPU limits, while seemingly beneficial for resource isolation, can introduce significant and often subtle performance problems through throttling. When a container reaches its CPU limit, the kernel temporarily pauses its execution until the next scheduling period begins.

# Check container throttling
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_periods 1000
# nr_throttled 150
# throttled_time 30000000000

# Interpretation:
# 1000 periods (100ms each = 100 seconds)
# 150 throttled (15% of periods had throttling)
# 30 billion nanoseconds = 30 seconds throttled

nr_periods: The number of 100ms enforcement periods that have elapsed.
nr_throttled: The number of periods where the container was throttled.
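throttled_time: The total time (in nanoseconds) that the container was throttled. The kernel accumulates this per CPU, so for multi-threaded containers it can exceed the wall-clock length of the throttled periods.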

A high nr_throttled count or throttled_time indicates that your application is frequently being paused, leading to latency spikes and degraded performance, even if average CPU utilization appears low.
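The interpretation above can be automated with a few lines of awk. This sketch reads cpu.stat text on stdin and assumes the cgroup v1 field names shown earlier (nr_periods, nr_throttled, throttled_time in nanoseconds); the function name throttle_summary is hypothetical.

```shell
#!/bin/sh
# Summarise cgroup v1 cpu.stat text supplied on stdin.
# Assumes the v1 field names nr_periods, nr_throttled and throttled_time (ns).
throttle_summary() {
  awk '
    $1 == "nr_periods"     { periods = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_time" { ns = $2 }
    END {
      if (periods == 0) { print "no enforcement periods recorded"; exit }
      printf "throttled in %.1f%% of periods, %.1f s total\n", 100 * throttled / periods, ns / 1000000000
    }'
}

# Values from the example output above.
printf 'nr_periods 1000\nnr_throttled 150\nthrottled_time 30000000000\n' | throttle_summary
# throttled in 15.0% of periods, 30.0 s total
```

Since the counters only ever grow, comparing two summaries taken a minute apart shows whether throttling is happening now or is left over from an earlier burst.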

CPU Shares vs. Quotas: A Critical Distinction

It’s vital to understand the different implications of CPU requests (which lead to cpu.shares) and CPU limits (which lead to cpu.cfs_quota_us).

Mechanism	Effect	When Applied
Shares (requests)	Relative weight	Only under contention
Quotas (limits)	Hard cap	Always enforced

graph TD
    subgraph CPU SHARES
        A[Pod A: 100m request = 102 shares]
        B[Pod B: 200m request = 205 shares]
        A --- B
        A_contention{"When both compete for CPU"}
        A_gets["Pod A gets: 102/(102+205) = 33%"]
        B_gets["Pod B gets: 205/(102+205) = 67%"]
        A_alone{"When only Pod A runs"}
        A_can_use["Pod A can use 100% (Shares only matter during contention)"]
        A --> A_contention
        B --> A_contention
        A_contention --> A_gets
        A_contention --> B_gets
        A --> A_alone
        A_alone --> A_can_use
    end

    subgraph CPU QUOTAS
        C[Pod A: 500m limit = 50ms per 100ms]
        D["Even if CPU is idle, Pod A is capped at 50%"]
        E["This causes throttling (latency spikes)"]
        C --> D
        D --> E
    end

CPU shares provide a proportional guarantee. If two pods on a node have requests of 100m and 200m respectively, and the node is saturated, they will get approximately 1/3 and 2/3 of the available CPU. However, if one pod is idle, the other can burst and use 100% of the CPU if needed. CPU quotas (cpu.cfs_quota_us) impose a strict ceiling. A container with a 500m limit can consume at most 50ms of CPU time in each 100ms period, the equivalent of half a core, regardless of how much idle capacity is available on the node. That budget is shared by all of the container’s threads, so a multi-threaded process running on several cores at once can exhaust it well before the period ends. This hard cap is what causes throttling.
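The proportional split under contention is just each cgroup's shares divided by the sum of all competing shares. A minimal sketch, using the share values from the diagram above and a hypothetical helper named contended_share:

```shell
#!/bin/sh
# Share of CPU (percent) a cgroup receives when all listed cgroups are busy.
# The first argument is the cgroup of interest; it is included in the total.
# Usage: contended_share <own_shares> [other_shares...]
contended_share() {
  own=$1
  total=0
  for s in "$@"; do total=$(( total + s )); done
  awk -v own="$own" -v total="$total" 'BEGIN { printf "%.0f\n", 100 * own / total }'
}

contended_share 102 205   # Pod A while Pod B is also busy: 33
contended_share 205 102   # Pod B while Pod A is also busy: 67
contended_share 102       # Pod A alone: 100
```

The last call shows why shares never cause throttling on their own: with no competitor, the ratio is always 100%.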

Viewing Container CPU Metrics

Monitoring tools are essential for understanding how your containers are utilizing CPU and whether they are being throttled.

# Pod CPU usage
kubectl top pod

# Detailed metrics (if metrics-server installed)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

# cgroup files for a container
# Find container ID first
crictl ps
crictl inspect <container-id> | grep cgroupsPath
cat /sys/fs/cgroup/cpu/<path>/cpu.stat

kubectl top pod provides a quick overview of current CPU and memory usage. For more detailed insights, especially into throttling, directly inspecting the cgroup cpu.stat file for a specific container is the most accurate method.

Troubleshooting CPU Performance Issues

Diagnosing and resolving CPU-related performance bottlenecks requires a systematic approach, combining observation of high-level metrics with deep dives into kernel-level statistics.

Diagnosing High Load Average

When your system’s load average is consistently high, it’s a clear signal of contention. The first step is to determine if the bottleneck is truly CPU or if processes are waiting on other resources, primarily I/O.

# Diagnosis steps:
# 1. Is it CPU or I/O?
vmstat 1
#  r  b    ← r=CPU, b=I/O
#  8  0    ← High r = CPU bound
#  2  6    ← High b = I/O bound

# 2. What's using CPU?
top -bn1 | head -15
ps aux --sort=-%cpu | head -10

# 3. Check for throttling
cat /sys/fs/cgroup/cpu/*/cpu.stat | grep throttled

vmstat: Observe the r (run queue) and b (blocked for I/O) columns. A high r indicates CPU contention, while a high b points to I/O bottlenecks.
top/ps: Identify the specific processes consuming the most CPU. This helps pinpoint the problematic applications.
cgroup cpu.stat: For containerized workloads, always check for throttling, as it can be a primary cause of perceived CPU issues even when overall utilization is low.
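The first triage step can be captured as a small decision helper. This is a rough sketch with illustrative thresholds (r above the CPU count suggests CPU pressure, b above r suggests I/O pressure), not a fixed rule; the function name classify_load is hypothetical.

```shell
#!/bin/sh
# Rough first-pass triage from vmstat's r and b columns.
# Thresholds are illustrative assumptions, not fixed rules.
# Usage: classify_load <r> <b> <ncpus>
classify_load() {
  r=$1; b=$2; ncpus=$3
  cpu=0; io=0
  if [ "$r" -gt "$ncpus" ]; then cpu=1; fi   # more runnable tasks than CPUs
  if [ "$b" -gt "$r" ]; then io=1; fi        # more blocked tasks than runnable ones
  case "$cpu$io" in
    10) echo "cpu-bound" ;;
    01) echo "io-bound" ;;
    11) echo "mixed" ;;
    *)  echo "no obvious saturation" ;;
  esac
}

classify_load 8 0 4   # cpu-bound
classify_load 2 6 4   # io-bound
classify_load 1 0 4   # no obvious saturation
```

A single vmstat sample is noisy, so it is worth applying this to several consecutive rows before drawing conclusions.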

Debugging CPU Throttling in Kubernetes

If your containerized applications are experiencing unexplained latency or degraded performance, especially under moderate load, CPU throttling is a prime suspect.

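# Find throttled pod cgroups (cgroup v1 layout with the cgroupfs driver;
# adjust the glob for systemd-style kubepods.slice paths)
for cg in /sys/fs/cgroup/cpu/kubepods/*/pod*/; do
  # nr_throttled is cumulative since the cgroup was created
  throttled=$(awk '$1 == "nr_throttled" {print $2}' "${cg}cpu.stat" 2>/dev/null)
  if [ "${throttled:-0}" -gt 0 ]; then
    echo "$cg: $throttled throttled periods"
  fi
done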

# Solution: Increase CPU limit or remove limit
# Note: Some orgs remove CPU limits entirely

The script above queries cgroup statistics to identify pods that have been throttled at some point since their cgroups were created. Because nr_throttled is a cumulative counter, sample it twice to see whether throttling is still occurring. If throttling is detected, the immediate solutions are to either increase the CPU limit for the affected pod or, in some cases, remove the CPU limit entirely to allow the pod to burst. Many organizations have moved towards removing CPU limits, relying instead on horizontal pod autoscaling and node autoscaling to manage demand, arguing that throttling causes more problems than it solves.

Visualizing Throttling Latency

The impact of throttling on application latency can be counter-intuitive. Even if a container needs only a small amount of CPU, if its limit is set too low, it will be forced to wait.

sequenceDiagram
    participant App as Container (100m limit)
    App->>App: Request arrives (0ms)
    App->>App: Processing starts (0ms)
    App->>App: CPU quota exhausted (10ms)
    App--xApp: WAIT until 100ms (new period)
    App->>App: Processing continues (100ms)
    App->>App: Response at 110ms
    Note over App: Without throttling: 20ms latency
    Note over App: With throttling: 110ms latency (5.5x slower)
    Note over App: This is why low CPU% can still cause latency issues!

This diagram illustrates how a container, limited to 100m (10ms of CPU per 100ms period), experiences significant latency even if its actual processing time is only 20ms. Once its 10ms quota is exhausted, it’s forcibly paused, waiting for the next 100ms period to begin, dramatically increasing the total response time. This is the “hidden cost” of CPU limits.
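The arithmetic behind this effect can be written down as a small model. The sketch below assumes a single-threaded request that arrives at the very start of a period with a full quota available and no other work consuming that quota; real workloads are messier. The function name throttled_latency_ms is hypothetical.

```shell
#!/bin/sh
# Simplified model: a single-threaded request needing <work_ms> of CPU
# (work_ms > 0) arrives at the start of a period with a full quota.
# Usage: throttled_latency_ms <work_ms> <quota_ms> [period_ms]
throttled_latency_ms() {
  work=$1; quota=$2; period=${3:-100}
  # Number of period boundaries the request must wait across (ceil(work/quota) - 1).
  waits=$(( (work + quota - 1) / quota - 1 ))
  # Full periods spent so far, plus the CPU time used in the final period.
  echo $(( waits * period + work - waits * quota ))
}

throttled_latency_ms 20 10   # 100m limit, 20 ms of work: 110
throttled_latency_ms 20 50   # 500m limit, 20 ms of work: 20
throttled_latency_ms 25 10   # 100m limit, 25 ms of work: 205
```

Under these assumptions, each additional quota's worth of work adds a full 100 ms period to the response time, which is why small increases in per-request CPU cost can produce large jumps in latency.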

Did You Know?

Linux adopted the Completely Fair Scheduler (CFS) in 2007: It replaced the O(1) scheduler and uses a red-black tree ordered by virtual runtime to give every process a fair share of CPU time, improving Linux’s ability to handle diverse workloads efficiently.
CPU “millicores” are a Kubernetes abstraction: Linux doesn’t natively understand 100m. Kubernetes translates this into cgroup cpu.shares (for requests) and cpu.cfs_quota_us / cpu.cfs_period_us (for limits), layering its resource model on top of Linux primitives.
Throttling happens in 100ms periods by default: The cpu.cfs_period_us cgroup parameter defaults to 100,000 microseconds (100ms). This means that a container’s CPU quota is enforced over these 100ms intervals, leading to the discrete “pauses” that define throttling.
Nice values range from -20 (highest priority) to 19 (lowest priority): While the default is 0, only processes running as root can set negative nice values, giving them elevated priority. A difference of one nice unit can correspond to approximately a 10% change in CPU allocation during contention.

Common Mistakes and How to Avoid Them

Mistake	Problem	Solution
Setting CPU limits too low	Throttling causes latency, even with low average CPU utilization.	Test under realistic load, consider removing CPU limits, especially for latency-sensitive services.
Confusing requests and limits	Leads to either overcommitment (no limits, high contention) or wasted resources (high limits, underutilized) or unnecessary throttling.	Use requests for guaranteed baseline and scheduling. Use limits for safety nets, or remove them for burstable workloads.
Ignoring `iowait` (`wa`) in `top`	Blaming CPU for performance issues when the true bottleneck is I/O, typically disk.	Always check `wa` in `top` or `vmstat`. A high value indicates I/O is the problem, not CPU saturation.
Not checking per-CPU statistics	Average CPU utilization can hide a single saturated core, leading to bottlenecks for single-threaded applications.	Use `mpstat -P ALL` to inspect individual CPU core utilization.
Assuming `nice` values matter across containers	`nice` values only rank processes inside the same cgroup; they do not change how CPU is divided between containers.	In Kubernetes, rely on CPU requests (which map to `cpu.shares`) for relative priority among containers.
High context switches without clear cause	Excessive context switching indicates processes are frequently yielding or being preempted, potentially due to too many active threads or CPU contention.	Analyze `voluntary` vs. `nonvoluntary` context switches via `/proc/<PID>/status`. Reduce thread count if necessary or investigate CPU contention.

Quiz

Question 1: Scenario-Based

A Kubernetes pod is configured with cpu: "200m" for requests and cpu: "500m" for limits. On a node with 4 CPU cores, this pod occasionally experiences performance degradation and high latency, even when the overall node CPU utilization is only 40%. Explain why this might be happening and what metric you would check first to confirm your hypothesis.

Show Answer

This scenario strongly suggests **CPU throttling**. Even though the node's overall CPU utilization is low, the pod's `500m` CPU limit means it cannot use more than 50% of a single CPU core. If the application within the pod has bursty CPU requirements, it will quickly hit this hard limit and be paused by the kernel until the next cgroup period. This forced waiting introduces latency, making the application feel slow despite available node resources.

To confirm this, you should check the cpu.stat file within the container’s cgroup for nr_throttled and throttled_time. A high number of throttled periods and significant throttled time would validate the hypothesis.

Question 2: Scenario-Based

You observe a Linux server with an 8-core CPU reporting a 1-minute load average of 10.0, a 5-minute load average of 8.0, and a 15-minute load average of 6.0. Describe the current state of the system and its trend, and what you would look for next using vmstat.

Show Answer

The system is currently **overloaded**, as the 1-minute load average (10.0) is greater than the number of CPU cores (8). If the load is CPU demand, this indicates that, on average, about 2 processes are waiting in the run queue for CPU time. The trend (15-minute 6.0 -> 5-minute 8.0 -> 1-minute 10.0) shows that the load has been increasing: the most recent average is the highest, so the system has moved from slight underutilization through full utilization (8.0 on an 8-core system) into overload.

Next, using vmstat, I would examine the r (run queue) and b (blocked for I/O) columns. A high r value (greater than 8) would confirm CPU contention, while a high b value would shift the diagnosis towards an I/O bottleneck, even with a high load average. I would also look at wa (I/O wait) in top output.

Question 3: Scenario-Based

A legacy Java application is deployed in a Kubernetes cluster. Developers report that setting CPU limits below 2 CPUs for this application consistently leads to severe performance degradation and increased garbage collection pauses, even though kubectl top pod shows its average CPU usage rarely exceeds 1.5 CPUs. Based on your understanding of CPU limits and Java applications, what is a likely explanation for this behavior?

Show Answer

The likely explanation is that the Java application is **sensitive to CPU throttling due to its garbage collection (GC) mechanisms**. Modern JVMs attempt to optimize GC by performing bursty, CPU-intensive work. If the CPU limit is set too low (e.g., 1.5 CPUs), these GC bursts will be throttled. This means the GC process takes longer to complete, leading to increased GC pauses, which directly impact application latency and throughput. The average CPU usage might be low, but the peak demands of GC are being constrained by the Kubernetes CPU limit, causing performance issues.

Question 4: Conceptual

Explain the primary difference in how cpu.shares (influenced by Kubernetes CPU requests) and cpu.cfs_quota_us (influenced by Kubernetes CPU limits) affect a container’s CPU allocation when a node has abundant idle CPU resources.

Show Answer

When a node has abundant idle CPU resources:

cpu.shares (requests): These have no effect on a container’s maximum CPU allocation. Shares only determine the proportional CPU allocation during contention. If no other processes are competing for CPU, a container with even a tiny request can burst and use all available CPU on a core.
cpu.cfs_quota_us (limits): These will always enforce a hard cap on the container’s CPU usage. Even if the entire node is idle, a container with a 500m limit will be throttled once it tries to consume more than 50% of a single CPU core within the cfs_period. This hard limit can introduce throttling latency even when resources are plentiful.

Question 5: Tool Usage

You suspect a specific process (PID 12345) on a Linux server is causing excessive CPU contention. You want to confirm this by looking at how frequently it’s being preempted by the scheduler. Which specific file and entry would you inspect to gather this information, and what would a high value indicate?

Show Answer

You would inspect the `/proc/12345/status` file and look for the `nonvoluntary_ctxt_switches` entry.

A high value for nonvoluntary_ctxt_switches indicates that the process is frequently being preempted by the scheduler because its CPU time slice has expired or a higher-priority task became runnable. This is a strong indicator of CPU contention, meaning there are more processes wanting to run than available CPU resources, forcing the scheduler to frequently interrupt processes.

Question 6: Best Practice

Why is checking per-CPU statistics with mpstat -P ALL often more informative than just looking at the overall %Cpu(s) from top when diagnosing CPU bottlenecks?

Show Answer

Checking per-CPU statistics with `mpstat -P ALL` is more informative because **overall CPU averages can mask uneven load distribution across cores**. A system might report a low average CPU utilization (e.g., 25% on a 4-core system), but `mpstat` could reveal that one core is 100% utilized while the others are idle. For single-threaded applications or workloads that are not efficiently parallelized, saturating a single core will create a bottleneck, even if the system as a whole appears to have ample capacity. This granular view helps pinpoint specific resource contention issues that averages obscure.

Hands-On Exercise: Exploring CPU Scheduling Dynamics

Objective: Gain practical experience observing CPU scheduling, priority, and throttling behavior using common Linux tools and container runtimes.

Environment: A Linux system (VM, cloud instance, or local machine) with stress, sysstat (which provides mpstat), and docker (or podman) installed. Root access is required for the cgroup steps.

Part 1: Initial CPU System Metrics

Begin by establishing a baseline understanding of your system’s CPU characteristics and current load.

# 1. Check CPU info: Identify the number of logical CPUs, cores, and sockets.
#    This helps in interpreting load averages and per-CPU statistics.
nproc
lscpu | grep -E "CPU|Thread|Core|Socket"

# 2. View current load: Get a snapshot of the system's load average.
#    Note the 1, 5, and 15-minute averages.
uptime
cat /proc/loadavg

# 3. CPU time breakdown: See how CPU time is categorized (user, system, idle, iowait, etc.).
#    This gives an initial hint if CPU is busy, waiting for I/O, or idle.
top -bn1 | head -8

# 4. Per-CPU statistics: Examine individual CPU core utilization.
#    Look for uneven distribution or saturated individual cores.
mpstat -P ALL 1 3

Expected Output and Analysis

After running these commands, you should see:

- `nproc` outputs the number of logical CPUs (e.g., `4`).
- `lscpu` provides detailed CPU topology, confirming sockets, cores, and threads.
- `uptime` and `cat /proc/loadavg` show the current load averages. Compare these to your `nproc` output: a load well below the CPU count means spare capacity, a load near it means the CPUs are fully occupied, and a load above it means tasks are queueing.
- `top` shows the `%Cpu(s)` line with its categories. Pay attention to `us`, `sy`, `id`, and `wa`.
- `mpstat` displays utilization for each CPU (rows `0`, `1`, etc. in the `CPU` column) and an `all` average. If one core is consistently much busier than the others, the workload is unevenly distributed.
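The comparison between load average and CPU count can be reduced to a single ratio. A minimal sketch; the helper name `load_per_cpu` is hypothetical.

```shell
# Hypothetical helper: divide a load average by the logical CPU count.
load_per_cpu() {
    load=$1
    cpus=$2
    awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}

# On a live system, feed it the 1-minute value from /proc/loadavg and nproc.
if [ -r /proc/loadavg ]; then
    load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

A result above 1.00 suggests more runnable or I/O-blocked tasks than CPUs. Since Linux load average also counts tasks waiting on I/O, treat the ratio as a prompt to investigate rather than proof of CPU saturation.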

Part 2: Observing Nice Values in Action

This section demonstrates how nice values influence CPU allocation between competing processes.

# 1. Start two CPU-intensive processes with different nice values, pinned to the same core.
#    `sha256sum /dev/zero` is a CPU-bound process that reads from /dev/zero and computes SHA256 hashes indefinitely.
#    `taskset -c 0` forces both onto CPU 0. Without it, a multi-core system can run each
#    process on its own core, and both would show close to 100% CPU.
taskset -c 0 nice -n 19 sha256sum /dev/zero &
PID1=$!

taskset -c 0 nice -n 0 sha256sum /dev/zero &
PID2=$!

# 2. Check their nice values and CPU usage.
#    `ps` reports `%cpu` as an average over the process lifetime, so let them run for a few seconds first.
sleep 3
ps -o pid,ni,%cpu,comm -p $PID1,$PID2
sleep 3
ps -o pid,ni,%cpu,comm -p $PID1,$PID2

# 3. Notice the CPU% difference: The process with `nice 0` should get significantly more CPU time.
#    A process with a lower nice value (higher priority) will be favored by the CFS.

# 4. Clean up: Terminate the background processes.
kill $PID1 $PID2

Expected Output and Analysis

You should observe that `PID2` (nice value 0) consistently gets a far higher percentage of CPU time than `PID1` (nice value 19) when the two compete for the same core. This demonstrates how `nice` values, which CFS translates into scheduling weights, prioritize one CPU-bound workload over another under contention. The split is not linear in the nice value: each step changes a task's weight by roughly 25%, so the gap between 0 and 19 is very large. If both processes show close to 100%, they are running on separate idle cores and are not actually contending; pin them to one core (for example with `taskset -c 0`) and repeat.
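The expected split can be estimated from the CFS load weights. The sketch below uses the weights from the kernel's `sched_prio_to_weight` table (1024 for nice 0, 15 for nice 19) and assumes only these two tasks compete for one core. The helper name `cfs_share` is hypothetical.

```shell
# Hypothetical helper: percentage of one core that a task with weight $1
# receives when competing against a single task with weight $2.
cfs_share() {
    awk -v w="$1" -v other="$2" 'BEGIN { printf "%.1f\n", 100 * w / (w + other) }'
}

cfs_share 1024 15   # share of the nice 0 task
cfs_share 15 1024   # share of the nice 19 task
```

Under those assumptions the nice 0 task is entitled to roughly 98.6% of the core and the nice 19 task to about 1.4%. Measured values will drift from this because other system processes also take CPU time.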

Part 3: System-Wide Scheduling Behavior

Explore how context switches and the run queue behave under increasing system load.

# 1. Take a baseline of system-wide context switches and run queue depth.
#    The 'cs' column (context switches) shows how many context switches occur per second across all CPUs.
#    The 'r' column (run queue) shows the number of runnable processes (running or waiting for a CPU).
vmstat 1 5

# 2. Create significant CPU load using the `stress` tool.
#    `stress --cpu 4` will create 4 processes that spin on CPU, simulating heavy computation.
#    Use a worker count at or above your `nproc` value to saturate the machine.
#    The 120-second timeout gives the 1-minute load average time to respond.
stress --cpu 4 --timeout 120 &

# 3. Run vmstat again while the load is active and compare with the baseline.
#    Watch the 'cs' column for spikes and the 'r' column for increases.
vmstat 1 5

# 4. Watch the load average change in real time (press Ctrl+C to exit).
#    The 1-minute load average rises gradually because it is a damped average;
#    give it about a minute to approach the number of `stress` workers.
watch -n 1 uptime

Expected Output and Analysis

- While `stress` is running, the `r` column in `vmstat` rises to roughly the number of workers, indicating more processes are runnable and competing for CPU. The `cs` column may also increase as the scheduler switches among more runnable tasks.
- In `watch -n 1 uptime`, the 1-minute load average climbs toward the number of `stress` workers. If that number exceeds your CPU count, the load average will too, signaling CPU saturation. This demonstrates that load average counts both running and waiting processes.

Part 4: Investigating cgroup CPU Quotas (Native Linux)

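This part uses the raw cgroup v1 interface to demonstrate CPU quotas directly on a Linux system. Kubernetes relies on the same kernel mechanism for CPU limits. The paths below exist only on hosts that mount cgroup v1; on cgroup v2 hosts (check with `stat -fc %T /sys/fs/cgroup`, which prints `cgroup2fs`), the quota and period live together in a single `cpu.max` file instead.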

# 1. Create a new CPU cgroup.
#    Requires root privileges.
sudo mkdir /sys/fs/cgroup/cpu/test

# 2. Set a CPU quota for this cgroup (e.g., 10% of a CPU core).
#    `cpu.cfs_quota_us`: 10,000 microseconds (10ms)
#    `cpu.cfs_period_us`: 100,000 microseconds (100ms)
#    This limits processes in this cgroup to 10% of one CPU.
echo 10000 | sudo tee /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/test/cpu.cfs_period_us

# 3. Run a CPU-intensive process within this cgroup.
#    The `$$` expands to the current shell's PID, which we'll move into the cgroup.
#    Then, `sha256sum /dev/zero` will run under the imposed limit.
echo $$ | sudo tee /sys/fs/cgroup/cpu/test/cgroup.procs
sha256sum /dev/zero &
PID=$! # Store the PID of the background process

# 4. Check for throttling: Observe `nr_throttled` and `throttled_time`.
#    You should see these values incrementing, indicating active throttling.
sleep 5
cat /sys/fs/cgroup/cpu/test/cpu.stat

# 5. Clean up: Terminate the process and remove the cgroup.
kill $PID
wait $PID 2>/dev/null # Make sure the process has exited so the cgroup is empty
echo $$ | sudo tee /sys/fs/cgroup/cpu/cgroup.procs # Move shell back to root cgroup
sudo rmdir /sys/fs/cgroup/cpu/test

Expected Output and Analysis

After a few seconds of `sha256sum` running, `cat /sys/fs/cgroup/cpu/test/cpu.stat` will show `nr_throttled` and `throttled_time` values greater than zero. This directly demonstrates that the process, despite being CPU-bound, is being actively throttled to adhere to the 10% CPU quota imposed by the cgroup. This is the underlying mechanism for Kubernetes CPU limits.
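Raw counters are easier to judge as a ratio: the fraction of enforcement periods in which the cgroup was throttled. A minimal sketch, assuming a `cpu.stat` file that contains `nr_periods` and `nr_throttled` lines; the helper name `throttled_pct` is hypothetical.

```shell
# Hypothetical helper: percentage of CFS periods in which the cgroup was
# throttled, computed from a cpu.stat-style file.
throttled_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "0.0"
        }
    ' "$1"
}

# Run this while the test cgroup from the exercise still exists.
stat_file=/sys/fs/cgroup/cpu/test/cpu.stat
if [ -r "$stat_file" ]; then
    throttled_pct "$stat_file"
fi
```

For a CPU-bound process held to a 10% quota, the ratio should be high, since the process exhausts its quota in nearly every period.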

Part 5: Container CPU Limits and Throttling (Docker/Podman)

This section extends the cgroup understanding to container runtimes, showing how docker (or podman) applies CPU limits and how to observe their effect.

# 1. Run a container with a CPU limit (e.g., 0.5 CPUs, or 500m).
#    The `--cpus` flag directly translates to cgroup CPU quotas.
docker run -d --name cpu-test --cpus="0.5" nginx sleep 3600

# 2. Inspect the cgroup settings applied to the container.
#    These should reflect the `0.5` CPU limit (i.e., 50,000 quota for a 100,000 period).
#    These are cgroup v1 paths; on a cgroup v2 host, read /sys/fs/cgroup/cpu.max instead.
docker exec cpu-test cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
docker exec cpu-test cat /sys/fs/cgroup/cpu/cpu.cfs_period_us

# 3. Generate CPU load inside the container.
#    This will push the container against its defined CPU limit.
#    `-d` runs the command detached so it keeps running after `docker exec` returns.
docker exec -d cpu-test sha256sum /dev/zero

# 4. Check for throttling after a minute or so.
#    Observe the `nr_throttled` and `throttled_time` entries.
#    (On cgroup v2, read /sys/fs/cgroup/cpu.stat and look for `throttled_usec` instead.)
sleep 60
docker exec cpu-test cat /sys/fs/cgroup/cpu/cpu.stat

# 5. Clean up: Stop and remove the container.
docker rm -f cpu-test

Expected Output and Analysis

- The `cpu.cfs_quota_us` inside the container should be `50000` (for 0.5 CPU) and `cpu.cfs_period_us` should be `100000`.
- After running `sha256sum` inside the container for a minute, `docker exec cpu-test cat /sys/fs/cgroup/cpu/cpu.stat` will show non-zero (and likely growing) values for `nr_throttled` and `throttled_time`. This confirms that Docker, using cgroups, is actively throttling the container's CPU usage according to the `--cpus` limit. This is directly analogous to how Kubernetes CPU limits work.
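On hosts running cgroup v2, the quota and period appear together in one `cpu.max` file, formatted as `<quota> <period>`, or `max <period>` when no limit is set. The sketch below converts such a line back into a CPU count; the helper name `cpu_max_to_cpus` is hypothetical.

```shell
# Hypothetical helper: turn a cgroup v2 cpu.max line into an effective
# CPU count, or "unlimited" when the quota is "max".
cpu_max_to_cpus() {
    echo "$1" | awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }'
}

cpu_max_to_cpus "50000 100000"   # what --cpus="0.5" would be expected to produce
cpu_max_to_cpus "max 100000"     # a container with no CPU limit
```

Against the running test container on a cgroup v2 host, this could be called as `cpu_max_to_cpus "$(docker exec cpu-test cat /sys/fs/cgroup/cpu.max)"`.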

Success Checklist

I can explain the difference between us, sy, id, wa, st CPU time categories.
I can interpret load average values relative to the number of CPU cores.
I have observed how nice values impact CPU allocation among competing processes.
I have monitored context switches and run queue depth under load.
I understand how cpu.cfs_quota_us and cpu.cfs_period_us implement CPU limits via cgroups.
I have seen evidence of CPU throttling both in native cgroups and within a Docker container.

Key Takeaways

Load average ≠ CPU utilization: On Linux, load average counts tasks that are running or runnable plus tasks in uninterruptible sleep (typically waiting for I/O), while CPU utilization only measures active processing. A high load average can indicate either CPU contention or I/O bottlenecks.
CFS ensures fairness: The Completely Fair Scheduler balances CPU time among runnable processes using vruntime, prioritizing those that have received less CPU time.
Requests = shares, Limits = quotas: In Kubernetes, CPU requests translate to cpu.shares (relative weight during contention), while CPU limits translate to cpu.cfs_quota_us (a hard cap on usage).
Throttling causes latency: Kubernetes CPU limits, enforced via cgroup quotas, can introduce significant latency spikes by pausing container execution, even if average CPU usage is low.
Check per-CPU stats: Aggregate CPU metrics can hide performance bottlenecks caused by a single saturated core. Use mpstat -P ALL for a granular view.
I/O wait is critical: Don’t confuse high iowait (wa%) with CPU starvation; it means CPUs sat idle while tasks waited on I/O, typically disk or other block storage, which shifts the troubleshooting focus away from the CPU.

What’s Next?

CPU is only one piece of the performance puzzle. In Module 5.3: Memory Management, you’ll learn how Linux handles memory, the implications of OOM events, and how Kubernetes memory limits differ fundamentally from CPU limits. Prepare to dive into RSS, VSS, swap, and the dreaded OOM Killer!
