
Module 2.8: Scheduler & Pod Lifecycle Theory

Hands-On Lab Available: K8s Cluster (advanced, 30 min). Launches in Killercoda in a new tab.

Complexity: [COMPLEX] - Advanced scheduling internals, high exam yield

Time to Complete: 35-45 minutes

Prerequisites: Module 2.5 (Resource Management), Module 2.6 (Scheduling)


A payments team deployed a new fraud-detection service with requests.cpu: 4 and requests.memory: 8Gi but forgot to set a PriorityClass. During a routine node upgrade, the cluster autoscaler drained two nodes. Every node was now packed tight with lower-priority batch jobs. The fraud service’s pods sat Pending for 47 minutes because no node had enough allocatable resources, and without a PriorityClass, the scheduler had no authority to preempt the batch workloads. During those 47 minutes, the payment pipeline processed transactions without fraud checks. The post-incident review revealed three gaps: no PriorityClass hierarchy, no PodDisruptionBudget on the fraud service, and QoS class BestEffort on the batch jobs that should have been evicted first.

This module teaches you the scheduler’s decision pipeline, priority and preemption mechanics, QoS classes, eviction behavior, and pod lifecycle signals. These are not abstract concepts — they determine whether your critical pods run or sit Pending, whether evictions hit the right targets, and whether your services shut down gracefully.

CKA Exam Relevance: Scheduling troubleshooting, PriorityClasses, QoS classification, PDB behavior, and graceful termination are all tested. Understanding the scheduler pipeline lets you diagnose Pending pods in seconds instead of minutes.


By the end of this module, you’ll be able to:

  • Trace a pod through the scheduler’s Filter, Score, and Bind phases
  • Create PriorityClasses and predict preemption behavior
  • Determine a pod’s QoS class from its resource spec
  • Explain kubelet eviction signals and threshold types
  • Configure graceful termination with PreStop hooks and PDBs
  • Troubleshoot Pending pods and unexpected evictions on the CKA exam

When you create a pod, the kube-scheduler watches for unscheduled pods (those with spec.nodeName unset) and runs a three-phase pipeline to assign each pod to a node.

                          SCHEDULER PIPELINE

  Unscheduled Pod
        │
        ▼
  ┌─────────────────────────────────────────────────────────┐
  │ PHASE 1: FILTER (Predicates)                            │
  │                                                         │
  │ All nodes ──► Apply hard constraints ──► Feasible set   │
  │                                                         │
  │ Checks: resource fit, taints/tolerations, affinity,     │
  │ node selectors, volume topology, pod anti-affinity,     │
  │ PV node affinity, port conflicts                        │
  └────────────────────────────┬────────────────────────────┘
                               │
                  ┌────────────┴────────────┐
                  │   Feasible nodes > 0?   │
                  └──────┬───────────┬──────┘
                     No  │           │  Yes
                         ▼           ▼
      Pod stays Pending      ┌──────────────────────────────────┐
      Event: "0/N nodes      │ PHASE 2: SCORE (Priorities)      │
      are available:..."     │                                  │
                             │ Score each feasible node 0-100   │
                             │ per plugin, sum weighted scores  │
                             │                                  │
                             │ Factors: resource balance,       │
                             │ topology spread, affinity        │
                             │ preferences, image locality      │
                             └────────────────┬─────────────────┘
                                              │
                                              ▼
                             ┌──────────────────────────────────┐
                             │ PHASE 3: BIND                    │
                             │                                  │
                             │ Highest-scored node wins         │
                             │ (ties broken randomly)           │
                             │ Write Binding to API server      │
                             │ Kubelet picks up the pod         │
                             └──────────────────────────────────┘

The filter phase eliminates nodes that cannot run the pod. Each filter plugin is a hard constraint — if any single filter rejects a node, that node is removed from consideration. Key filter plugins:

| Filter Plugin | What It Checks |
| --- | --- |
| NodeResourcesFit | Does the node have enough allocatable CPU, memory, ephemeral storage? |
| NodeAffinity | Does the node match requiredDuringSchedulingIgnoredDuringExecution? |
| TaintToleration | Does the pod tolerate all NoSchedule taints on the node? |
| NodePorts | Are the requested host ports available? |
| VolumeBinding | Can the required PVs be bound to this node's topology? |
| PodTopologySpread | Does placing here violate maxSkew with whenUnsatisfiable: DoNotSchedule? |
| InterPodAffinity | Does placement violate required pod anti-affinity rules? |

What happens when no node passes filtering? The pod remains in Pending state. The scheduler records an event on the pod explaining which constraints failed on each node. You see messages like:

0/5 nodes are available: 2 insufficient cpu, 2 node(s) had taint
{node-role.kubernetes.io/control-plane: }, 1 node(s) didn't match
Pod topology spread constraints.

The scheduler retries on its next scheduling cycle (triggered by cluster state changes such as a new node joining, a pod being deleted, or resources becoming available).

For nodes that pass all filters, the scheduler scores each one. Every scoring plugin assigns a value from 0 to 100, and scores are multiplied by plugin weights and summed. Key scoring plugins:

| Score Plugin | What It Favors |
| --- | --- |
| NodeResourcesBalancedAllocation | Nodes where CPU and memory usage ratios are similar (balanced utilization) |
| NodeResourcesLeastAllocated | Nodes with the most available resources (spread workloads) |
| ImageLocality | Nodes that already have the container image cached |
| InterPodAffinity | Nodes matching preferredDuringSchedulingIgnoredDuringExecution |
| TaintToleration | Nodes with fewer tolerations needed (prefer "cleaner" nodes) |
| PodTopologySpread | Nodes that improve topology balance |

What if two nodes score equally? The scheduler breaks ties randomly. This prevents hot-spotting a single node when the cluster is uniformly loaded. You should not depend on deterministic placement — if you need a pod on a specific node, use nodeSelector or nodeName.
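
If deterministic placement matters, pin the pod explicitly. A minimal sketch (the pod name and the disktype: ssd label are illustrative, not from this module; it assumes you have labeled nodes accordingly):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod              # illustrative name
spec:
  nodeSelector:
    disktype: ssd               # assumes: k label node <node> disktype=ssd
  containers:
  - name: app
    image: nginx:1.27
```

Note the difference: nodeName bypasses the scheduler entirely, while nodeSelector still goes through the filter phase, so taints and resource fit are respected.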

Before binding, the scheduler runs reserve and permit plugins (used by features such as volume binding confirmation and gang scheduling). The bind plugin then writes a Binding object that sets spec.nodeName on the pod. The kubelet on that node detects the assignment, pulls images, mounts volumes, and starts containers.

Exam Tip: When troubleshooting a Pending pod, always start with kubectl describe pod <name> (or k describe pod <name> — we use the k alias for kubectl throughout). The Events section tells you exactly which filter phase failed. If there are no events at all, the scheduler may not be running.


Pause and predict: A high-priority pod (priority 1000000) is Pending because no node has enough resources. The cluster has nodes full of low-priority batch jobs (priority 100). Will Kubernetes automatically make room for the high-priority pod, or does it just wait? What determines which pods get evicted?

A PriorityClass assigns a numeric priority value to pods. Higher values mean higher priority. The scheduler uses this value during preemption decisions.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For services that must not be displaced by batch workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-processing
value: 100
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For batch jobs that can be preempted"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort-batch
value: 10
globalDefault: false
preemptionPolicy: Never
description: "Batch jobs that should never preempt others"

Built-in PriorityClasses (do not modify these):

| Name | Value | Used By |
| --- | --- | --- |
| system-cluster-critical | 2000000000 | Cluster-essential components (CoreDNS, kube-proxy) |
| system-node-critical | 2000001000 | Node-essential components (kubelet static pods) |

Assign a PriorityClass to a pod:

apiVersion: v1
kind: Pod
metadata:
  name: fraud-detector
spec:
  priorityClassName: critical-service
  containers:
  - name: detector
    image: fraud-detector:v3.2
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi

When a high-priority pod cannot be scheduled (all nodes fail the filter phase), the scheduler enters the preemption cycle:

                     PREEMPTION SEQUENCE

High-priority pod P (priority=1000000) is Pending
      │
      ▼
1. Scheduler re-evaluates each node:
   "If I removed lower-priority pods, would P fit?"
      │
      ▼
2. For each candidate node, identify victim pods:
   - Only pods with priority < P's priority
   - Remove minimum set needed to free resources
      │
      ▼
3. Check PDB constraints:
   - Would evicting victims violate any PDB?
   - If yes, try a different victim set or skip node
      │
      ▼
4. Select the node with the least disruption:
   - Prefer nodes where fewest pods must be evicted
   - Prefer nodes where lowest-priority pods are victims
      │
      ▼
5. Set P's nominatedNodeName to the chosen node
   Victims receive graceful termination (SIGTERM + grace)
      │
      ▼
6. After victims terminate, P is scheduled in the next cycle

Consider a 3-node cluster, each with 4 CPU allocatable:

| Node | Running Pods | CPU Used | Available |
| --- | --- | --- | --- |
| node-1 | batch-a (priority 100, 2 CPU), batch-b (priority 100, 1 CPU) | 3 CPU | 1 CPU |
| node-2 | web-api (priority 500, 3 CPU) | 3 CPU | 1 CPU |
| node-3 | monitoring (priority 800, 2 CPU), logger (priority 50, 1.5 CPU) | 3.5 CPU | 0.5 CPU |

A new pod fraud-detector (priority 1000000, needs 2 CPU) is created.

Filter phase: No node has 2 CPU free. All fail. Pod is Pending.

Preemption analysis:

  • node-1: Evict batch-a (priority 100, 2 CPU) — frees 2 CPU. Victim priority 100. One victim.
  • node-2: Evict web-api (priority 500, 3 CPU) — frees 3 CPU. Victim priority 500. One victim, but higher priority.
  • node-3: Evict logger (priority 50, 1.5 CPU) — frees 1.5 CPU. Not enough. Must also evict monitoring (priority 800, 2 CPU) — frees 3.5 CPU. Two victims, one at priority 800.

Decision: node-1 wins. It requires only one victim, and that victim has the lowest priority (100). The scheduler sets nominatedNodeName: node-1 on fraud-detector, terminates batch-a gracefully, and schedules fraud-detector in the next cycle.

PodDisruptionBudgets limit voluntary disruptions. During preemption, the scheduler respects PDBs as a preference, not a hard constraint:

  • The scheduler tries to avoid PDB violations when selecting victims
  • If every candidate node requires a PDB violation, preemption still proceeds — the high-priority pod takes precedence
  • PDBs are a hard constraint for kubectl drain and voluntary eviction API calls, but a soft constraint for scheduler preemption

This distinction is critical: PDBs protect against planned maintenance but do not fully block priority-based preemption.

Exam Tip: If asked “Does a PDB prevent preemption?”, the answer is nuanced: the scheduler avoids PDB violations when possible, but a high-priority pod can still preempt through a PDB if no alternative exists.


Kubernetes automatically assigns a QoS class to every pod based on the resource requests and limits of its containers. You do not set QoS class directly — it is derived.

| QoS Class | Condition | Eviction Priority |
| --- | --- | --- |
| Guaranteed | Every container has requests == limits for both CPU and memory | Last evicted |
| Burstable | At least one container has a request or limit set, but not Guaranteed | Middle |
| BestEffort | No container has any request or limit | First evicted |

Guaranteed — requests equal limits for all resources in all containers:

apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi

Verify:

Terminal window
k get pod qos-guaranteed -o jsonpath='{.status.qosClass}'
# Output: Guaranteed

Burstable — requests set but not equal to limits (or limits missing for one resource):

apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi

Verify:

Terminal window
k get pod qos-burstable -o jsonpath='{.status.qosClass}'
# Output: Burstable

BestEffort — no requests, no limits on any container:

apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
  - name: app
    image: nginx:1.27

Verify:

Terminal window
k get pod qos-besteffort -o jsonpath='{.status.qosClass}'
# Output: BestEffort

QoS edge cases:

  • If you set only limits (no requests), Kubernetes auto-sets requests = limits, making the pod Guaranteed (if done for all containers and all resources).
  • A pod with two containers where one is Guaranteed and the other has no resources is classified as Burstable, not Guaranteed.
  • cpu and memory both matter. If requests equal limits for CPU but not memory, the pod is Burstable.
  • Ephemeral storage requests/limits do not affect QoS classification.
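
To illustrate the first edge case, a minimal sketch of a limits-only pod (the pod name is illustrative): because only limits are set, the API server defaults requests to the same values, so the pod comes out Guaranteed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-limits-only         # illustrative name
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      limits:                   # no requests block: requests default to these limits
        cpu: 500m
        memory: 256Mi
```

Verify with the same jsonpath query shown above; the reported qosClass should be Guaranteed.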

QoS class does not affect scheduling. The scheduler only looks at requests to determine if a pod fits on a node. limits are enforced at runtime by the kubelet and container runtime (CPU throttling, OOM kill for memory). This means:

  • A Guaranteed pod with requests: 2 CPU and a Burstable pod with requests: 2 CPU are scheduled identically
  • QoS class only matters during eviction (next section)

The kubelet monitors resource signals on the node and evicts pods when thresholds are crossed. This is separate from the scheduler — eviction is a kubelet decision on a specific node.

Eviction signals:

| Signal | Description | Typical Soft Threshold | Typical Hard Threshold |
| --- | --- | --- | --- |
| memory.available | Free memory on the node | < 500Mi (grace 90s) | < 100Mi |
| nodefs.available | Free disk on root partition | < 15% (grace 120s) | < 10% |
| imagefs.available | Free disk on image filesystem | < 15% (grace 120s) | < 10% |
| pid.available | Free PIDs | < 1000 (grace 60s) | < 500 |

  • Soft thresholds include a grace period. The kubelet waits for the grace period to expire before evicting. If resource usage drops below the threshold during the grace period, no eviction occurs. Configure with --eviction-soft and --eviction-soft-grace-period.

  • Hard thresholds are immediate. When crossed, the kubelet evicts pods without waiting. Configure with --eviction-hard.
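
On clusters managed through a kubelet configuration file, the same thresholds are typically expressed in a KubeletConfiguration rather than raw flags. A sketch using the illustrative values from the table above (not defaults you should copy blindly):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                   # immediate eviction when crossed
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "10%"
  pid.available: "500"
evictionSoft:                   # eviction only after the grace period expires
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:        # every evictionSoft signal needs a matching entry
  memory.available: "90s"
  nodefs.available: "120s"
```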

                   EVICTION DECISION FLOW

Kubelet detects resource pressure
      │
      ▼
Is this a hard threshold?
      │
      ├── Yes ──► Evict now
      │
      └── No ───► Has the grace period expired?
                      │
                      ├── No ───► Wait (pressure may recover)
                      │
                      └── Yes ──► Proceed to eviction
                                        │
                                        ▼
┌──────────────────────────────────────────────────┐
│ EVICTION ORDER (within pods exceeding requests): │
│                                                  │
│ 1. BestEffort pods -- no guarantees, evict first │
│    (sorted by resource usage, highest first)     │
│                                                  │
│ 2. Burstable pods exceeding their requests       │
│    (sorted by usage relative to requests)        │
│                                                  │
│ 3. Guaranteed / Burstable within their requests  │
│    (only if still under pressure after 1+2)      │
│    Almost never reached in practice              │
└──────────────────────────────────────────────────┘
      │
      ▼
Evicted pod gets status reason "Evicted"
Pod is NOT rescheduled on the same node
Controller (Deployment, Job, etc.) creates replacement
elsewhere. Standalone pods are gone permanently.

When a pod is evicted:

  1. The pod’s status becomes Failed with reason Evicted
  2. The pod remains visible in k get pods until garbage collected
  3. If the pod is owned by a controller (Deployment, ReplicaSet, StatefulSet, Job), the controller creates a replacement pod. The replacement is scheduled by the scheduler and may land on any eligible node.
  4. Standalone pods (no controller) are permanently lost. This is why you should always use controllers.
  5. The node under pressure temporarily carries a condition taint (node.kubernetes.io/memory-pressure, etc.) that prevents new pods from being scheduled there while it recovers.

Node pressure conditions:

| Condition | Triggered By | Effect |
| --- | --- | --- |
| MemoryPressure | memory.available below threshold | Taint applied, no new BestEffort pods |
| DiskPressure | nodefs.available or imagefs.available below threshold | Taint applied, no new pods |
| PIDPressure | pid.available below threshold | Taint applied, no new pods |

Exam Tip: If pods keep getting evicted and rescheduled to the same node, check whether the node’s pressure taints are being cleared prematurely. Use k describe node <name> and look at the Conditions and Taints sections.


When a pod is terminated (whether by deletion, preemption, eviction, or scale-down), Kubernetes follows a specific sequence:

                  POD TERMINATION SEQUENCE

1. Pod marked for deletion (deletionTimestamp set)
   Endpoints controller removes pod from Service endpoints
   ── Traffic stops being routed to this pod ──
      │
      ▼
2. PreStop hook executes (if defined)
   Runs in parallel with endpoint removal
   Examples: drain connections, deregister from service mesh
      │
      ▼
3. SIGTERM sent to PID 1 in each container
   Application should begin graceful shutdown
      │
      ▼
4. Grace period countdown (terminationGracePeriodSeconds)
   Default: 30 seconds
   Includes time spent in PreStop hook
      │
      ▼
5. SIGKILL sent if containers still running
   Forced termination -- no cleanup possible
      │
      ▼
6. Pod removed from API server
   Volumes detached and unmounted

Stop and think: You set terminationGracePeriodSeconds: 30 and a PreStop hook that runs sleep 20. After the PreStop completes, your app receives SIGTERM. How many seconds does it have before SIGKILL? What if your PreStop hook takes 35 seconds — longer than the grace period?

apiVersion: v1
kind: Pod
metadata:
  name: graceful-app
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: myapp:v2
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5 && /app/drain-connections.sh"]
    ports:
    - containerPort: 8080

Key points:

  • The grace period timer starts when the pod is marked for deletion, not when SIGTERM is sent
  • PreStop hook time counts against the grace period. If your PreStop takes 20s and grace period is 30s, the app has only 10s after SIGTERM before SIGKILL
  • Set grace period long enough for: PreStop execution + application drain time + safety margin
  • For databases or stateful services, 60-120s is common. For simple web servers, 15-30s usually suffices

| Workload Type | Recommended Grace Period | Why |
| --- | --- | --- |
| Stateless web server | 15-30s | Quick drain, few in-flight requests |
| API gateway / load balancer | 30-60s | Long-lived connections, must drain gracefully |
| Database | 60-120s | Must flush WAL, checkpoint, close connections |
| Batch processor | 60-300s | May need to checkpoint partial work |
| Message queue consumer | 30-60s | Must finish processing current message |
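
The interaction between the PreStop hook and the grace period can be annotated directly on a manifest. A sketch using the 30s/20s numbers from the bullets above (pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timed-shutdown          # illustrative name
spec:
  terminationGracePeriodSeconds: 30   # t=0: pod marked for deletion, timer starts
  containers:
  - name: app
    image: myapp:v2
    lifecycle:
      preStop:
        exec:
          # runs from t=0 to t=20, consuming 20s of the 30s budget
          command: ["/bin/sh", "-c", "sleep 20"]
    # SIGTERM arrives at t=20; SIGKILL at t=30, so the app has ~10s to drain
```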

Pause and predict: A 3-replica Deployment has a PDB with minAvailable: 3. A cluster administrator runs kubectl drain on a node hosting one of those replicas. What happens — does the drain succeed, block, or partially proceed? Now consider: what if the node crashes instead of being drained?

PDBs protect applications from voluntary disruptions — planned operations like kubectl drain, cluster upgrades, or autoscaler scale-down.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api

Alternative — specify maximum unavailable:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web-api

PDB rules:

  • minAvailable and maxUnavailable are mutually exclusive — use one or the other
  • Can be an integer (2) or percentage (“25%”)
  • PDBs only limit voluntary disruptions (drain, preemption, eviction API). They do not prevent involuntary disruptions (node crash, OOM kill, kubelet eviction under hard pressure)
  • A drain operation will block indefinitely if a PDB cannot be satisfied. Always set --timeout with kubectl drain
Terminal window
# Drain with timeout to avoid hanging forever
k drain node-2 --ignore-daemonsets --delete-emptydir-data --timeout=300s

| Type | Examples | Honors PDB? |
| --- | --- | --- |
| Voluntary | kubectl drain, cluster upgrade, autoscaler scale-down, preemption | Yes |
| Involuntary | Node crash, OOM kill, kubelet hard eviction, hardware failure | No |

Understanding this distinction is essential. A PDB with minAvailable: 3 on a 3-replica Deployment means kubectl drain will refuse to evict any of those pods (it would drop below 3). But if the node crashes, those pods are gone regardless of the PDB.


  1. Scheduling throughput: In large clusters, the kube-scheduler can sustain hundreds of scheduling decisions per second. It keeps latency manageable by scoring only a sample of feasible nodes (controlled by percentageOfNodesToScore) rather than all nodes in clusters with hundreds of nodes.

  2. nominatedNodeName: When a pod triggers preemption, the scheduler sets nominatedNodeName on the pending pod. However, this is not a guarantee — another higher-priority pod might claim that node first. The pod must still pass filter and score phases in the next cycle.

  3. Eviction vs OOM Kill: Kubelet eviction and Linux OOM Kill are different mechanisms. Kubelet eviction is proactive (happens before memory is fully exhausted) and respects QoS ordering. OOM Kill is reactive (kernel kills a process when memory is truly exhausted) and uses oom_score_adj — which Kubernetes sets based on QoS class: BestEffort gets 1000 (most likely killed), Guaranteed gets -997 (least likely).

  4. PDB with zero budget: Setting maxUnavailable: 0 creates a PDB that blocks all voluntary disruptions. This is sometimes used for singleton services during critical business periods, but it will also block node drains and cluster upgrades. Use with caution.
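
A zero-budget PDB from fact 4 looks like this (the name and selector are illustrative); remember that it blocks every drain that touches a matching pod:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: freeze-singleton        # illustrative name
spec:
  maxUnavailable: 0             # no voluntary disruption allowed
  selector:
    matchLabels:
      app: singleton-db         # illustrative label
```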


| Mistake | Why It Fails | What to Do Instead |
| --- | --- | --- |
| Not setting any PriorityClass | Critical services compete equally with batch jobs; no preemption possible | Define at least 3 priority tiers: critical, default, batch |
| Setting requests without limits | Pod gets Burstable QoS and can consume beyond its reservation on overcommitted nodes | Always set requests; set limits equal to requests for Guaranteed QoS on critical pods |
| Forgetting PDB during cluster upgrades | Drain evicts all replicas simultaneously, causing downtime | Create PDBs for every production Deployment before upgrading |
| Setting terminationGracePeriodSeconds: 0 | No graceful shutdown; in-flight requests dropped, data corruption risk | Use at least 15s; longer for stateful workloads |
| Assuming PDBs protect against all disruptions | Node crash, OOM, and kubelet hard eviction ignore PDBs | Design for involuntary disruption: replicas across zones, persistent storage, idempotent operations |
| Setting maxUnavailable: 0 on PDB | Blocks all voluntary disruptions including node drains and upgrades | Use maxUnavailable: 1 or minAvailable: N-1 to allow rolling operations |
| Using preemptionPolicy: Never on critical pods | Pod will sit Pending forever if no node has capacity; it cannot preempt | Only use preemptionPolicy: Never for batch/background work that should wait |
| Ignoring QoS class on batch jobs | Batch jobs with Guaranteed QoS are evicted last, blocking eviction of pods you care more about | Set batch jobs to BestEffort or Burstable with low requests |

Test your understanding of scheduler internals, priority, QoS, and lifecycle.

1. A pod is stuck Pending. k describe pod shows: "0/3 nodes are available: 2 insufficient cpu, 1 node(s) had taint {gpu=true: NoSchedule}." What do you check?

Two nodes lack sufficient CPU for this pod’s requests, and one node has a gpu=true:NoSchedule taint that the pod does not tolerate. You should first check the pod’s CPU requests with k get pod -o yaml and compare against node allocatable resources with k describe node. If requests are correct, either scale down other workloads, add nodes with more CPU, or add a toleration for the GPU taint if the pod should run on GPU nodes. The key insight is that all three nodes failed the filter phase for different reasons.

2. You have a pod with requests.cpu: 500m, limits.cpu: 1000m, requests.memory: 256Mi, limits.memory: 256Mi. What is its QoS class and why?

The QoS class is Burstable. For a pod to be Guaranteed, requests must equal limits for both CPU and memory across all containers. Here, memory requests equal limits (256Mi), but CPU requests (500m) do not equal CPU limits (1000m). Since at least one resource has requests set but not equal to limits, the pod is classified as Burstable. To make it Guaranteed, set requests.cpu equal to limits.cpu.

3. Pod X has priority 1000 and needs 2 CPU. Node A has Pod Y (priority 100, 1.5 CPU) and Pod Z (priority 500, 1 CPU) running, with 0.5 CPU free. Which pod(s) will the scheduler preempt?

The scheduler will preempt Pod Y (priority 100, 1.5 CPU) because it is the lowest-priority pod and freeing it provides 2 CPU total (1.5 CPU from Y + 0.5 CPU already free), which is exactly enough for Pod X. The scheduler always preempts the minimum set of lowest-priority pods needed to satisfy the incoming pod’s resource requests. Pod Z (priority 500) is spared because evicting Pod Y alone frees sufficient resources.

4. During a kubectl drain, the operation hangs indefinitely. The node has 3 pods from a Deployment with a PDB of minAvailable: 3 and the Deployment has 3 replicas. What is wrong?

The PDB requires at least 3 pods to be available at all times, but the Deployment only has 3 replicas. Draining would require evicting at least one pod from this node, which would drop the available count below 3, violating the PDB. The drain operation blocks because it respects PDBs. Fix by either scaling the Deployment to 4+ replicas (so one can be evicted while 3 remain), changing the PDB to minAvailable: 2, or using --timeout on the drain command and addressing the PDB separately.

5. A node enters MemoryPressure condition. There are 3 pods: a Guaranteed pod using 1Gi, a Burstable pod using 1.5x its memory request, and a BestEffort pod using 500Mi. In what order does the kubelet evict them?

The kubelet evicts in QoS order: BestEffort first, then Burstable pods exceeding their requests, then Guaranteed pods. So the BestEffort pod (500Mi, no requests) is evicted first. If pressure persists, the Burstable pod is evicted next because it is using 1.5x its memory request (exceeding its reservation). The Guaranteed pod is evicted last, and only if the node is still under pressure after evicting the first two — which is rare in practice because Guaranteed pods use exactly what they requested.

6. You set terminationGracePeriodSeconds: 30 and a PreStop hook that runs sleep 25. How much time does your application have to handle SIGTERM before SIGKILL?

Your application has approximately 5 seconds. The grace period countdown begins when the pod is marked for deletion, and the PreStop hook runs first. The PreStop hook consumes 25 seconds of the 30-second grace period. After the PreStop hook completes, SIGTERM is sent, and only 5 seconds remain before SIGKILL. If your application needs more shutdown time, increase terminationGracePeriodSeconds to account for both the PreStop hook duration and the application’s drain time.

7. A pod has preemptionPolicy: Never and priority 1000000. Node capacity is full with priority-100 pods. What happens?

The pod remains Pending indefinitely. Despite having a high priority value (1000000), the preemptionPolicy: Never setting means the scheduler will never evict lower-priority pods to make room for it. The pod must wait until resources become available through other means: pods completing, nodes scaling up, or manual intervention. The preemptionPolicy: Never is designed for workloads that are important enough to run before other pending pods in the queue but should not displace running workloads.

8. You create a PDB with maxUnavailable: 1 for a 3-replica Deployment. A node crashes, taking one pod with it. Can kubectl drain a second node that hosts another replica?

It depends on timing. When the node crashes, one pod becomes unavailable (involuntary disruption — PDB does not prevent this). The PDB allows maxUnavailable: 1, and one pod is already unavailable. If the Deployment controller has not yet created a replacement pod on a healthy node, kubectl drain will block because draining would make 2 pods unavailable, violating the PDB. Once the replacement pod is running and healthy, the PDB is satisfied again (1 unavailable is within budget), and the drain can proceed. This is why it is important to ensure your cluster has enough capacity for replacement pods.


This exercise walks you through QoS classification, preemption, simulated eviction behavior, and PDB-protected drains. Run these on a kind or minikube cluster.

Step 1: Create Pods with Different QoS Classes

Terminal window
# Create a namespace for this exercise
k create namespace scheduler-lab
# Guaranteed QoS pod
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
  namespace: scheduler-lab
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: 200m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 128Mi
EOF
# Burstable QoS pod
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
  namespace: scheduler-lab
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 256Mi
EOF
# BestEffort QoS pod
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
  namespace: scheduler-lab
spec:
  containers:
  - name: app
    image: nginx:1.27
EOF

Verify QoS classification:

Terminal window
k get pods -n scheduler-lab -o custom-columns=\
NAME:.metadata.name,\
QOS:.status.qosClass,\
STATUS:.status.phase

Expected output:

NAME QOS STATUS
qos-besteffort BestEffort Running
qos-burstable Burstable Running
qos-guaranteed Guaranteed Running

Step 2: Set Up PriorityClasses and Observe Preemption

Terminal window
# Create PriorityClasses
cat <<'EOF' | k apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 10000
globalDefault: false
description: "High priority for critical workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority for batch workloads"
EOF
# Fill the node with low-priority pods
# Adjust CPU requests based on your cluster's allocatable CPU
k create deployment low-batch \
--image=nginx:1.27 \
--replicas=10 \
-n scheduler-lab
# Patch to add priority and resource requests
k patch deployment low-batch -n scheduler-lab --type=json -p='[
{"op": "add", "path": "/spec/template/spec/priorityClassName", "value": "low-priority"},
{"op": "add", "path": "/spec/template/spec/containers/0/resources", "value": {"requests": {"cpu": "100m", "memory": "64Mi"}}}
]'
# Wait for pods to be running
k rollout status deployment/low-batch -n scheduler-lab --timeout=60s
# Now create a high-priority pod that requests significant resources
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: critical-service
  namespace: scheduler-lab
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
EOF
# Check events -- look for preemption messages
k get events -n scheduler-lab --sort-by='.lastTimestamp' | tail -20
# Verify the critical pod is running
k get pod critical-service -n scheduler-lab
# Check if any low-priority pods were preempted
k get pods -n scheduler-lab -o wide

Step 3: Observe Eviction Ordering with Memory Stress


This step demonstrates the concept of eviction ordering. In a real cluster under memory pressure, the kubelet evicts BestEffort pods first.

Terminal window
# Check current node conditions
k describe nodes | grep -A 5 "Conditions:"
# View the oom_score_adj values set by kubelet for each QoS class
# (requires exec access to the node -- works in kind)
k get pods -n scheduler-lab -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
# To see oom_score_adj for a specific pod's container:
k exec qos-besteffort -n scheduler-lab -- cat /proc/1/oom_score_adj
# Expected: 1000 (most likely to be OOM killed)
k exec qos-guaranteed -n scheduler-lab -- cat /proc/1/oom_score_adj
# Expected: -997 (least likely to be OOM killed)
k exec qos-burstable -n scheduler-lab -- cat /proc/1/oom_score_adj
# Expected: value between -997 and 1000 (calculated based on requests ratio)
Step 4: Create a PDB and Test a Protected Drain

Terminal window
# Create a Deployment with multiple replicas
k create deployment web-app \
--image=nginx:1.27 \
--replicas=3 \
-n scheduler-lab
k rollout status deployment/web-app -n scheduler-lab --timeout=60s
# Create a PDB
cat <<'EOF' | k apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: scheduler-lab
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
EOF
# Verify PDB status
k get pdb -n scheduler-lab
# ALLOWED DISRUPTIONS should be 1 (3 replicas - 2 minAvailable)
# Find which node has web-app pods
k get pods -n scheduler-lab -l app=web-app -o wide
# Try draining a node that has a web-app pod (use --dry-run first)
NODE=$(k get pods -n scheduler-lab -l app=web-app -o jsonpath='{.items[0].spec.nodeName}')
k drain $NODE --ignore-daemonsets --delete-emptydir-data --dry-run=client
# Perform actual drain with timeout
k drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Observe: PDB allows draining one pod at a time
k get pods -n scheduler-lab -l app=web-app -o wide
k get pdb -n scheduler-lab
# Uncordon the node when done
k uncordon $NODE
Cleanup:

Terminal window
k delete namespace scheduler-lab
k delete priorityclass high-priority low-priority

Timed drills for CKA exam preparation. Practice until you can complete each within the target time.

| # | Drill | Target Time |
| --- | --- | --- |
| 1 | Create three pods (Guaranteed, Burstable, BestEffort) and verify their QoS class using jsonpath | 3 min |
| 2 | Create two PriorityClasses (high=10000, low=100) and a pod using each. Verify with `k get pod -o yaml \| grep priority` | - |
| 3 | Create a 3-replica Deployment with a PDB (maxUnavailable: 1). Drain a node and verify only one pod is evicted at a time | 5 min |
| 4 | A pod is Pending. Use k describe pod and k describe node to identify whether the issue is insufficient resources, taints, or affinity | 3 min |
| 5 | Create a pod with a PreStop hook that writes to a log file, delete it with --grace-period=60, and verify the hook ran by checking the log | 4 min |
| 6 | Given a cluster with resource fragmentation (no single node has 2 CPU free, but total cluster has 6 CPU free), explain why a pod requesting 2 CPU is Pending and propose two fixes | 2 min |

Continue to Module 2.9: Autoscaling (HPA, VPA, Cluster) to learn how Kubernetes automatically adjusts resources based on demand.