Module 5.4: Worker Node Failures
Complexity: [MEDIUM] - Critical for cluster operations
Time to Complete: 45-55 minutes
Prerequisites: Module 5.1 (Methodology), Module 1.1 (Cluster Architecture)
What You’ll Be Able to Do
- Diagnose NotReady and Unknown worker node states by correlating Kubernetes node conditions, kubelet heartbeats, systemd service health, and node-local logs.
- Evaluate MemoryPressure, DiskPressure, and PIDPressure conditions and choose immediate containment steps that reduce cascading workload failures.
- Debug kubelet and container runtime integration failures with journalctl, systemctl, CRI socket checks, and crictl inspection.
- Implement safe node recovery procedures with cordon, drain, restart, reset, rejoin, and deletion workflows while respecting disruption constraints.
- Design a worker-node eviction and maintenance response that accounts for taint-based eviction, node-pressure thresholds, graceful shutdown, and Kubernetes 1.35 behavior.
Why This Module Matters
Hypothetical scenario: a host-wide monitoring agent starts leaking memory after a routine rollout. At first only one worker node reports higher memory usage, but the DaemonSet runs everywhere, so the same failure pattern begins appearing across the node pool. The kubelet starts asserting MemoryPressure, new pods stop landing on affected nodes, and evicted workloads move onto the remaining healthy nodes, increasing their pressure as well. The outage is not caused by a single broken application; it is caused by the node layer losing the capacity and control loops that every application depends on.
Worker nodes are the factory floor of a Kubernetes cluster. The control plane can schedule, observe, and reconcile, but the actual containers run on machines with finite memory, disks, network paths, process IDs, certificates, and local services. The kubelet acts like the floor supervisor, the container runtime is the heavy machinery, and node resources are the raw materials. If the supervisor is unable to report status, the machinery is unable to start containers, or the raw materials run out, the scheduler’s best intentions do not matter until the node is recovered or isolated.
This module teaches a practical sequence for diagnosing worker node failures without guessing. You will start from the API server’s view, move to the node’s local operating system evidence, inspect kubelet and container runtime integration, evaluate resource pressure, distinguish network partitions from local crashes, and then choose a recovery path. The exam value is obvious because CKA troubleshooting tasks often present a node that is unhealthy for one concrete reason. The production value is larger: calm node diagnosis prevents a local failure from becoming a fleet-wide incident.
The habit you are building is evidence ordering. Kubernetes exposes many symptoms, and several of them can be true at the same time during a node incident. A pod may be Pending because the node is under pressure, because the scheduler is honoring a cordon, because the runtime is unavailable, or because the control plane has stopped trusting the node heartbeat. If you collect those signals in a stable order, the failure usually narrows quickly; if you jump straight to repair commands, you may change the system before you understand the fault.
Reading Node Status Like an Operator
Kubernetes does not continuously log into worker nodes to see if they are alive. Instead, the kubelet on each node publishes status updates and Lease heartbeats back to the API server, and the node controller interprets missed or unhealthy updates. That distinction matters because the control plane’s view is always a report, not the node itself. A node can be running containers while the API server sees it as Unknown, and a node can be reachable by SSH while the kubelet is failing to authenticate or post status.
The first diagnostic question is therefore not “what is broken?” but “which observer says it is broken?” kubectl get nodes tells you what the API server currently believes. kubectl describe node shows conditions, recent events, addresses, capacity, allocatable resources, and taints. SSH, systemctl, journalctl, df, free, and crictl tell you what is happening locally. A good troubleshooter deliberately switches between those viewpoints instead of trusting one command as the whole truth.
Think of the node object as a dashboard fed by field reports. It is authoritative for scheduling decisions, but it is still a summary of messages that had to travel from the node to the API server. When the reporting path is damaged, the dashboard can be stale, incomplete, or conservative. That is why Unknown is not the same as “everything on the host has died,” and why a reachable host is not automatically a healthy Kubernetes node.
    graph TD
        A[Node Conditions] --> B{Ready?}
        B -->|True| C[Healthy, can run pods]
        B -->|False/Unknown| D[NotReady, scheduling problems]
        A --> E[Resource Pressures]
        E --> F[MemoryPressure]
        E --> G[DiskPressure]
        E --> H[PIDPressure]
        E --> I[NetworkUnavailable]
        F -.->|True| D
        G -.->|True| D
        H -.->|True| D
        I -.->|True| D

The Ready condition summarizes whether the node can accept and run ordinary workloads, but it should never be read alone. MemoryPressure=True tells you the kubelet is protecting the host from low memory. DiskPressure=True indicates local storage or inode exhaustion may block image pulls, logs, or container creation. PIDPressure=True means the node is running out of Linux process identifiers. NetworkUnavailable=True usually points toward CNI or routing readiness rather than kubelet liveness alone.
Events add time and texture to those conditions. A condition tells you the current or recently observed state, while events often reveal the transition path: image garbage collection failed, eviction thresholds were met, kubelet stopped posting status, or the scheduler avoided the node because of taints. In a timed exam, scan events for the newest repeated warning. In production, preserve those events in incident notes because they explain why a later repair worked and whether the same failure is likely to repeat.
| Status | Meaning | Common Causes |
|---|---|---|
| Ready | Healthy and accepting pods | Normal operation |
| NotReady | Unhealthy | kubelet down, network issues |
| Unknown | No heartbeat received | Node unreachable, kubelet crashed |
| SchedulingDisabled | Cordoned | Manual cordon or maintenance |
Use the control plane view to classify the failure before touching the node. A NotReady node with recent kubelet events is different from an Unknown node that has stopped posting heartbeats. A Ready,SchedulingDisabled node may be healthy but deliberately cordoned. A node that is Ready but has repeated image pull or runtime events may have a local container runtime, registry, DNS, or disk problem that has not yet crossed the threshold into node-level failure.
    # Quick status
    kubectl get nodes

    # Detailed conditions
    kubectl describe node <node-name> | grep -A 10 Conditions

    # All nodes with Ready condition reason
    kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,REASON:.status.conditions[?(@.type=="Ready")].reason'

    # Check for resource pressure
    kubectl describe node <node-name> | grep -E "MemoryPressure|DiskPressure|PIDPressure"

Under default controller-manager behavior, node health is monitored frequently, and missed heartbeats eventually turn into Ready=Unknown. Kubernetes also uses taints such as node.kubernetes.io/not-ready and node.kubernetes.io/unreachable to influence scheduling and eviction. The important operational lesson is that Kubernetes intentionally delays some reactions because short network blips are common. Immediate eviction on every missed heartbeat would create more disruption than it solves.
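The classification step described above can be captured in a small helper. This is a minimal sketch rather than anything kubectl provides: the function name classify_ready and its advice strings are hypothetical, and it assumes you pass in the Ready condition status (True, False, or Unknown) plus the node's spec.unschedulable value.

```shell
# classify_ready: map a node's Ready condition status and cordon flag
# to a suggested first diagnostic step. Name and messages are illustrative.
classify_ready() {
  ready="$1"
  unschedulable="${2:-false}"
  case "$ready" in
    True)
      if [ "$unschedulable" = "true" ]; then
        echo "cordoned: node reports healthy; find out who cordoned it and why"
      else
        echo "ready: scan events for runtime, image pull, or disk warnings"
      fi
      ;;
    False)
      echo "not-ready: kubelet is still reporting; read conditions and events"
      ;;
    Unknown)
      echo "unknown: heartbeats missing; check reachability and kubelet on the node"
      ;;
    *)
      echo "unrecognized: inspect the node object directly"
      ;;
  esac
}
```

One way to feed it live data is classify_ready "$(kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')". The point of the sketch is the ordering: decide which observer is complaining before choosing a repair.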
Pause and predict: if a node becomes Unknown, do the containers that were already running on that machine immediately stop? Think about which component starts containers, which component reports status, and which component can still be alive when the API server loses contact with the node.
The answer is usually no. Existing containers may continue running if the node and runtime are alive, even while the control plane lacks confirmed status. The scheduler will avoid placing new work on the unhealthy node, and eviction logic may eventually replace pods elsewhere, but that is a control-plane decision. This is why node troubleshooting always separates application liveness, container runtime state, kubelet reporting, and API visibility.
This distinction also explains why application owners may report mixed symptoms. A user request routed to a still-running pod can succeed while kubectl get pods shows stale information. A replacement pod can start elsewhere only after controller logic decides the old pod should no longer count. A log command can fail through the API while the container’s local stdout file still exists on the node. Worker-node troubleshooting is the practice of reconciling those perspectives without assuming they should all change at the same instant.
Taints, Evictions, and Node Failure Timing
When a node is not ready or unreachable, Kubernetes does not simply flip a status label and hope people notice. The control plane adds taints that shape two behaviors: new pods should not be scheduled there, and existing pods may eventually be evicted if their tolerations expire. The default toleration for ordinary pods on the not-ready and unreachable taints is tolerationSeconds: 300, which is why pods can appear to linger on an unhealthy node during the first minutes of a failure.
That delay is a feature, not negligence. Distributed systems experience transient packet loss, routing convergence, maintenance windows, cloud host pauses, and overloaded API paths. If Kubernetes immediately rescheduled every workload after a brief node heartbeat interruption, it would amplify noise into churn. The default behavior gives the node a chance to recover, then moves work only after the failure appears sustained enough to justify disruption.
From Kubernetes v1.29 onward, taint-based eviction is handled by the taint-eviction-controller, and Kubernetes 1.35 clusters continue to rely on that control-plane behavior unless operators explicitly change the controller set. For day-to-day troubleshooting, you do not usually tune this controller during an incident. You identify whether the observed pod delay is expected toleration behavior, a PodDisruptionBudget constraint, a zone-wide eviction throttle, or a sign that a controller is not running.
The eviction system also has throttles to avoid overwhelming the rest of the cluster during broad failures. Defaults such as node-eviction-rate, secondary-node-eviction-rate, unhealthy-zone thresholds, and large-cluster thresholds exist because mass node failure is different from single-node failure. If a whole zone goes dark, evicting everything at full speed can stampede the surviving nodes, trigger image pulls, overload storage, and turn recovery into a second outage.
The practical implication is that “why are pods still there?” is not a single question. They may still be there because the node toleration has not expired, because a custom toleration permits longer residence, because an eviction throttle slowed replacement, because the controller manager is unhealthy, or because the pod is managed by a controller that must create a replacement before traffic recovers. Before changing flags or deleting pods, inspect taints, tolerations, controller ownership, and cluster capacity. These small checks prevent you from mistaking deliberate safety behavior for a stuck control plane.
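To make the toleration window concrete, the arithmetic can be sketched as a tiny function. This is an illustration under stated assumptions, not how the taint-eviction-controller is implemented: the name eviction_seconds_left is hypothetical, all three arguments are epoch seconds or plain second counts that you supply, and it only models a pod that has a finite tolerationSeconds for the taint.

```shell
# eviction_seconds_left: seconds remaining before a pod's toleration for a
# NoExecute taint runs out. Illustrative helper, not a Kubernetes API.
# Args: taint timeAdded (epoch seconds), tolerationSeconds, current time (epoch seconds).
eviction_seconds_left() {
  added="$1"
  toleration="$2"
  now="$3"
  left=$(( added + toleration - now ))
  # A negative value means the window has already expired.
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}
```

With the default of 300 seconds, a taint added two minutes ago leaves 180 seconds. A pod with no matching toleration does not wait at all, and a toleration without tolerationSeconds never expires, so check the pod spec before trusting the number.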
| Failure Signal | Kubernetes Reaction | Why It Matters During Diagnosis |
|---|---|---|
| Ready=False | Node is known unhealthy | The kubelet is still reporting a problem, so inspect recent conditions and events. |
| Ready=Unknown | Node heartbeat is missing | The control plane lacks trustworthy pod status, so inspect network reachability and node-local services. |
| not-ready taint | New scheduling is blocked and existing pods may tolerate briefly | Delayed replacement can be normal, not a scheduler bug. |
| unreachable taint | Existing pods may be evicted after toleration expiry | Workloads with custom tolerations can stay longer than expected. |
| Node-pressure condition | Kubelet may evict locally | These evictions are emergency host protection, not voluntary disruption. |
Before running kubectl describe node against a node that is reachable but under memory pressure, what output do you expect? Expect that Ready may still be True, or may be degraded depending on severity, while MemoryPressure is the decisive condition to inspect. If you only look at the STATUS column of kubectl get nodes, you may miss the pressure signal that explains pending pods and local evictions.
The CKA exam tends to reward this timing awareness. A candidate who deletes pods immediately after seeing Unknown may create unnecessary noise, while a candidate who checks node conditions, taints, kubelet state, and pod tolerations can explain why workloads have or have not moved. In production, the same discipline prevents false conclusions such as “Kubernetes failed to reschedule” when Kubernetes is deliberately waiting for a toleration window or throttling evictions across an unhealthy zone.
Eviction timing also affects communication. If a service is degraded because one node is unreachable, telling the team “pods should move in five minutes” may be accurate for ordinary pods, but it is incomplete for StatefulSets, local storage, strict PodDisruptionBudgets, custom tolerations, and capacity-constrained clusters. A better incident update names the mechanism: the node is tainted unreachable, ordinary pods have default tolerations, the controller is expected to create replacements after the window, and we are verifying spare capacity before forcing anything.
Debugging kubelet and Runtime Integration
The kubelet is the most important Kubernetes process on a worker node because it turns desired pod state into local container actions and reports reality back to the API server. It registers the node, watches for assigned pods, asks the runtime to create or remove containers, mounts volumes, runs probes, reports pod status, and manages static pods. If kubelet is down, misconfigured, or unable to authenticate, the node becomes operationally detached even when the underlying operating system is still running.
    flowchart TD
        K[kubelet] --> R[Registers node with API server]
        K --> W[Watches for pod assignments]
        K --> M[Manages container lifecycle via runtime]
        K --> S[Reports node/pod status]
        K --> H[Handles probes: liveness, readiness]
        K --> V[Mounts volumes]
        K --> P[Runs static pods]
        K -.->|Fails| N[Node goes NotReady]
        N -.->|Result| X[Pods stop working or face eviction]

The kubelet does not run containers directly. It talks to a Container Runtime Interface implementation such as containerd or CRI-O over a local socket, and that runtime uses an OCI runtime such as runc or crun to create Linux namespaces, cgroups, and processes. This layered design is useful because each layer has a narrow job, but it also means a worker-node failure can appear as a kubelet problem while the actual fault is a missing socket, a stopped runtime, corrupted runtime storage, or kernel-level resource exhaustion.
Use the layer model to read error messages. If kubelet reports authentication failure, the path between kubelet and the API server is suspect. If kubelet reports CRI connection failure, the runtime layer is suspect. If the runtime reports cgroup or mount errors, the operating system and kernel configuration are suspect. If the container starts but readiness probes fail, the workload or pod network may be the better focus. This prevents the common habit of restarting whichever component printed the most recent error.
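The layer model above can be sketched as a rough triage filter for a single log line. Treat this as a teaching aid only: the function name suspect_layer is hypothetical, and the match patterns are illustrative guesses rather than an authoritative list of kubelet or runtime message formats.

```shell
# suspect_layer: read one log line on stdin and print which layer to inspect first.
# The patterns are illustrative heuristics, not real message catalogs.
suspect_layer() {
  line=$(cat)
  lower=$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *x509*|*certificate*|*unauthorized*|*forbidden*)
      echo "api-path"      # kubelet to API server authentication or authorization
      ;;
    *containerd.sock*|*crio.sock*)
      echo "runtime"       # kubelet cannot reach the CRI socket
      ;;
    *cgroup*|*mount*)
      echo "os-kernel"     # runtime is failing against the host kernel or filesystem
      ;;
    *)
      echo "unclassified"  # read more context before acting
      ;;
  esac
}
```

A line can be piped in with something like sudo journalctl -u kubelet -n 1 --no-pager | suspect_layer. The order of the patterns matters, because the first match wins, which mirrors the advice to find the first error in the chain rather than the loudest one.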
    flowchart TD
        K[kubelet] -->|CRI - gRPC via unix socket| C[containerd / cri-o]
        C -->|OCI - JSON spec| R[runc / crun - low-level runtime]
        R -->|System Calls| L[Linux kernel: cgroups, namespaces]

Start kubelet debugging from the node, not from another pod. SSH to the affected host, inspect the systemd unit, then read recent logs before restarting anything. A restart can temporarily clear symptoms and erase useful timing, especially when the real issue is configuration, certificate expiry, API reachability, or runtime socket failure. Your goal is to identify the first error in the chain, not just the loudest error repeated during a crash loop.
    # SSH to the node first
    ssh <node-name>

    # Check kubelet service status
    sudo systemctl status kubelet

    # Check if kubelet is running
    ps aux | grep kubelet

    # Check kubelet logs
    sudo journalctl -u kubelet -f

    # Check recent kubelet errors
    sudo journalctl -u kubelet --since "10 minutes ago" | grep -i error

| Issue | Symptom | Diagnosis | Fix |
|---|---|---|---|
| kubelet stopped | Node NotReady | systemctl status kubelet | systemctl start kubelet |
| kubelet crash loop | Node flapping | journalctl -u kubelet | Fix config, check logs |
| Wrong config | Fails to start | Error in logs | Fix /var/lib/kubelet/config.yaml |
| API unreachable | NotReady | Network timeout in logs | Check network, firewall |
| Certificate issues | TLS errors | Cert errors in logs | Renew certs |
| Container runtime down | Fails to create pods | Runtime errors | Fix containerd or CRI-O |
If kubelet is simply stopped, starting it is reasonable, but verify that the service is enabled for the next boot and that the node returns to a healthy state. If it immediately fails again, stop treating the restart as the fix and move back to logs and configuration. Many kubelet failures are deterministic: a bad flag, missing file, wrong CRI endpoint, invalid certificate, or unreachable API server will reproduce every time.
    # Start kubelet
    sudo systemctl start kubelet

    # Enable on boot
    sudo systemctl enable kubelet

    # Check status
    sudo systemctl status kubelet

Kubelet configuration is often split between /var/lib/kubelet/config.yaml and systemd drop-ins created by kubeadm or the node image. A damaged YAML file can prevent startup. A changed systemd drop-in will not be used until systemctl daemon-reload runs. A stale --container-runtime-endpoint can point kubelet at the wrong socket after a runtime migration. These are boring details, but they are exactly where many node repairs succeed.
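The stale-endpoint case can be checked mechanically. The following is a minimal sketch, assuming the endpoint is set through the containerRuntimeEndpoint field in the kubelet config file on your node; on some nodes it is passed as a flag instead, in which case this helper prints nothing. The function name runtime_socket_from_config is hypothetical.

```shell
# runtime_socket_from_config: print the socket path named by the
# containerRuntimeEndpoint field of a kubelet config file, if the field exists.
# Arg: path to the kubelet config file.
runtime_socket_from_config() {
  sed -n 's/^containerRuntimeEndpoint:[[:space:]]*//p' "$1" \
    | tr -d "\"'" \
    | sed 's|^unix://||'
}
```

On a node you might run sock=$(runtime_socket_from_config /var/lib/kubelet/config.yaml) and then test it with [ -S "$sock" ] to confirm a socket actually exists at that path. An empty result means the endpoint is configured somewhere else, so check the systemd drop-ins next.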
    # Check kubelet config file
    cat /var/lib/kubelet/config.yaml

    # Check kubelet flags
    cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

    # After fixing config, reload and restart
    sudo systemctl daemon-reload
    sudo systemctl restart kubelet
    sudo systemctl status kubelet  # Verify service started

Certificate problems deserve special attention in kubeadm-style clusters. Kubelet authenticates to the API server with client certificates, and expired or damaged material can look like intermittent node failure, TLS errors, or repeated authentication failures. If the cluster was created about a year ago, or if certificate rotation was disabled or interrupted, check the kubelet PKI directory and the logs before deciding the node itself is broken.
Do not confuse certificate expiry with generic network loss. A network failure usually produces connection timeouts, refused connections, or routing symptoms from many tools. A certificate failure often shows TLS handshake, authorization, or client certificate messages while basic connectivity to the API endpoint may still work. That difference matters because opening firewall ports will not fix an expired certificate, and rejoining a node is excessive if the only fault is a temporary route change.
    # Check certificate paths
    cat /var/lib/kubelet/config.yaml | grep -i cert

    # Verify certificates exist
    ls -la /var/lib/kubelet/pki/

    # For expired certs, may need to rejoin node
    # On control plane: kubeadm token create --print-join-command
    # On worker: kubeadm reset && kubeadm join ...

The runtime side of the investigation begins with containerd or CRI-O service health, then moves to the socket and finally to direct CRI inspection. crictl is valuable because it talks to the local runtime rather than the Kubernetes API. When the API server is unreachable, kubectl logs may fail globally, but sudo crictl logs <container-id> can still show what is happening on that host.
    # Check containerd (most common)
    sudo systemctl status containerd
    sudo crictl info

    # Check container runtime socket
    ls -la /run/containerd/containerd.sock

    # List containers with crictl
    sudo crictl ps

    # List images
    sudo crictl images

| Issue | Symptom | Diagnosis | Fix |
|---|---|---|---|
| containerd stopped | Pods ContainerCreating | systemctl status containerd | systemctl start containerd |
| Socket missing | kubelet errors | Check socket path | Restart containerd |
| Disk full | Container create fails | df -h | Clean up disk |
| Image pull fails | ImagePullBackOff | Check registry access | Fix registry auth |
| Resource exhausted | Random container failures | Check cgroups | Increase resources |
If containerd has crashed, restart it, then check both runtime and kubelet logs. A runtime restart does not automatically fix disk exhaustion, corrupt images, registry DNS failures, or cgroup problems. Treat the restart as a test that tells you whether the runtime can come back cleanly. If it remains unhealthy, the error messages after the restart are often more useful than the stale errors before it.
    # Start containerd
    sudo systemctl start containerd

    # Check status
    sudo systemctl status containerd

    # Check logs for issues
    sudo journalctl -u containerd --since "10 minutes ago"

Configure crictl explicitly if the host does not already point it at the correct CRI socket. This avoids confusing failures where crictl is healthy but looking at the wrong endpoint. After configuration, inspect containers, logs, and metadata directly. In a node outage, this can confirm whether the application container is still running, whether it exited locally, or whether the kubelet only lost the ability to report its state.
    # Configure crictl for containerd
    cat <<EOF | sudo tee /etc/crictl.yaml
    runtime-endpoint: unix:///run/containerd/containerd.sock
    image-endpoint: unix:///run/containerd/containerd.sock
    timeout: 10
    debug: false
    EOF

    # List all containers (including stopped)
    sudo crictl ps -a

    # Get container logs
    sudo crictl logs <container-id>

    # Inspect container
    sudo crictl inspect <container-id>

Worked example: suppose kubectl describe node worker-a shows Ready=False, and SSH still works. systemctl status kubelet reports active, but journalctl -u kubelet repeatedly shows connection refused for unix:///run/containerd/containerd.sock. At that point, restarting kubelet first is a weak move because kubelet is only reporting that its dependency is missing. Check systemctl status containerd, verify the socket path, inspect containerd logs, and then restart or repair the runtime before returning to kubelet.
Resource Pressure, Local Evictions, and Host Survival
Worker nodes are finite Linux machines. They can run out of memory, disk space, free inodes, process identifiers, or practical I/O capacity long before the cluster as a whole looks full. The kubelet watches several resource signals and asserts node-pressure conditions when thresholds are crossed. That behavior protects the host from total lockup, but it also means pods can be killed locally even when no human issued a drain and no PodDisruptionBudget allowed voluntary disruption.
Resource pressure is often the point where scheduling theory becomes hardware reality. A deployment may request modest resources, but the node also runs the kubelet, runtime, logging agents, CNI components, storage plugins, kernel work, and every DaemonSet placed on that host. Overcommitment can be reasonable when workloads are bursty, yet it becomes dangerous when many containers peak together or a host-level agent consumes resources outside normal pod expectations. During an incident, compare desired allocation with actual consumption so you know whether the fix belongs in workload sizing, node capacity, daemon behavior, or emergency cleanup.
    mindmap
      root((Resource Pressure))
        Memory
          Available memory below threshold
          Triggers pod eviction
          Check: free -m
        Disk
          Usage above threshold
          Triggers image GC
          Check: df -h
        PID
          Process IDs exhausted
          Unable to fork processes
          Check: pid_max

Node-pressure eviction is different from control-plane eviction after node failure. Taint-based eviction handles pods on nodes the control plane considers unhealthy or unreachable. Node-pressure eviction is performed by the kubelet on the node to reclaim resources before the operating system collapses. Because this is emergency host protection, it can bypass PodDisruptionBudgets and may shorten graceful termination behavior under severe pressure. That is surprising only if you treat all pod movement as the same kind of eviction.
This difference changes how you explain impact to a team. A planned drain is a voluntary disruption and gives controllers, disruption budgets, and graceful termination a chance to shape the move. A pressure eviction is a local survival decision made under stress, and its priority is keeping the host alive enough to continue managing critical processes. If a database pod was evicted by memory pressure, the right question is not only “why did Kubernetes move it?” but also “why was this node allowed to reach an emergency threshold with that workload mix?”
    # Check node conditions
    kubectl describe node <node> | grep -A 10 Conditions

    # On the node - check memory
    free -m
    cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable"

    # Check disk
    df -h
    du -sh /var/lib/containerd/*  # Container storage
    du -sh /var/log/*             # Log storage

    # Check PIDs
    cat /proc/sys/kernel/pid_max
    ps aux | wc -l

Default hard eviction thresholds cover low available memory, low node filesystem capacity, low image filesystem capacity, and inode exhaustion. The exact values are kubelet configuration, not magic constants embedded in your applications. You should inspect the local kubelet config when the behavior does not match your expectations. Customizing thresholds can be valid for specialized nodes, but tuning them during an outage is risky unless you understand whether the host is truly near failure.
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "15%"

When a threshold is crossed, the kubelet sets the relevant node condition, the scheduler avoids assigning new pods to the node, and the kubelet ranks pods for eviction by whether their usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. QoS class is not a direct input to that ranking, but it predicts the outcome well: BestEffort pods are usually most exposed because they have no requests. Overcommitted Burstable pods can also be evicted before Guaranteed pods. This is why resource requests are not just scheduling hints; they become evidence during node survival decisions.
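To get a rough sense of how close a host is to the memory.available threshold shown above, you can compare MemAvailable with the configured value. This is only an approximation under stated assumptions: the kubelet derives memory.available from cgroup statistics rather than from /proc/meminfo, so the numbers will not match exactly, and the function name mem_headroom_mib is hypothetical.

```shell
# mem_headroom_mib: MemAvailable minus a threshold, both in MiB.
# Args: path to a meminfo-format file, threshold in MiB.
# A result at or below zero suggests the host is near the eviction threshold.
mem_headroom_mib() {
  awk -v threshold="$2" '/^MemAvailable:/ { print int($2 / 1024) - threshold }' "$1"
}
```

On a node, mem_headroom_mib /proc/meminfo 100 uses the default 100Mi hard threshold. Read the real threshold from the node's kubelet config first if it has been customized.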
Pause and predict: if a pod using an emptyDir volume is evicted because the node is under memory or disk pressure, what happens to data stored in that volume? The important clue is in the name. emptyDir is local ephemeral storage tied to the pod’s life on that node, so eviction can destroy local contents even if the replacement pod starts cleanly elsewhere.
Memory pressure troubleshooting starts by proving whether the pressure is container-driven, host-driven, or an accounting problem. Compare kubectl top with OS-level process lists, then inspect the workload that changed recently. If a single pod is consuming far beyond its request, eviction or deletion may be a containment step. If the pressure is caused by host daemons, logging agents, kernel memory, or a DaemonSet, rescheduling the application pods will not fix the node pool because the culprit follows every node.
QoS class is the bridge between manifest design and node behavior. Guaranteed pods have CPU and memory requests equal to their limits for every container, so they represent a stronger scheduling promise. Burstable pods have at least one request or limit set, but they may be using more than requested when pressure arrives. BestEffort pods have no requests or limits, so they are easy for the kubelet to sacrifice first. This does not make BestEffort wrong for every workload, but it makes it a poor choice for anything you expect to survive node stress.
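The QoS rules can be sketched for the simplest case of a pod with one container. This is a simplified illustration, not the kubelet's implementation: the function name qos_class is hypothetical, it ignores multi-container pods, and it compares quantities as plain strings, so equivalent values written differently (for example 500m and 0.5) would not match.

```shell
# qos_class: estimate the QoS class of a single-container pod.
# Args: cpu request, cpu limit, memory request, memory limit (pass "" if unset).
qos_class() {
  cpu_req="$1"
  cpu_lim="$2"
  mem_req="$3"
  mem_lim="$4"
  # No requests and no limits at all means BestEffort.
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo "BestEffort"
    return
  fi
  # An unset request defaults to the limit when a limit is present.
  [ -z "$cpu_req" ] && cpu_req="$cpu_lim"
  [ -z "$mem_req" ] && mem_req="$mem_lim"
  # Guaranteed needs both limits set and each request equal to its limit.
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
     && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}
```

For a running pod the authoritative answer is in its status, for example kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'. The sketch is useful mainly for reasoning about a manifest before it is applied.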
    # Find memory-hungry processes
    ps aux --sort=-%mem | head -20

    # Find pods using most memory
    kubectl top pods -A --sort-by=memory

    # Options:
    # 1. Kill unnecessary processes
    # 2. Evict low-priority pods
    # 3. Add more memory to node

Disk pressure often requires faster action because a full root filesystem can break logs, image pulls, container creation, kubelet state writes, and even interactive repair commands. Start with filesystem utilization, then identify whether image storage, container writable layers, journald, application logs, or unrelated host files are responsible. Avoid deleting directories blindly under /var/lib/containerd; use runtime-aware cleanup first when possible, and preserve evidence when the root cause is unclear.
Disk diagnosis should include both bytes and inodes. A filesystem can have free gigabytes but no available inodes, which means new small files still fail. Container image layers, unpacked files, log fragments, and application scratch data all contribute differently depending on the node image and runtime configuration. If your cluster separates nodefs and imagefs, pressure on one filesystem may trigger different reclaim behavior than pressure on the other. Kubernetes 1.35 documentation also describes containerfs signal handling in supported layouts, so read the node’s actual runtime layout before assuming every disk warning points to the same directory.
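Checking bytes and inodes with the same filter keeps the two views comparable. The sketch below assumes the POSIX output format of df, where the fifth column is the use percentage and the sixth is the mount point; the function name over_threshold and the 90 percent limit in the example are hypothetical choices, not kubelet defaults.

```shell
# over_threshold: read `df -P` (bytes) or `df -Pi` (inodes) output on stdin and
# print the mount points whose use percentage is at or above the given limit.
# Arg: limit as a whole-number percentage.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      pct = $5
      sub(/%/, "", pct)          # strip the percent sign before comparing
      if (pct + 0 >= limit) print $6
    }'
}
```

Run it twice on the node, once as df -P | over_threshold 90 and once as df -Pi | over_threshold 90. A mount point that appears only in the second list has free bytes but is running out of inodes.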
    # Find large files
    sudo find / -type f -size +100M -exec ls -lh {} \;

    # Clean up container images
    sudo crictl rmi --prune

    # Clean up old logs
    sudo journalctl --vacuum-time=3d

    # Clean up unused containers
    sudo crictl rm $(sudo crictl ps -a -q --state exited)

PID pressure is less visible than memory or disk pressure, but it can be just as severe. Linux needs a free process ID to start a shell, run a probe, fork a helper, or create a new application process. A fork-heavy bug can make a node look haunted because even simple commands fail intermittently. Check the actual pid_max, count processes, and identify the user or container family generating most of them before raising limits. Raising the limit buys time; it does not correct runaway process creation.
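Comparing the number of tasks in use with pid_max turns that check into a single number. This is a minimal sketch with a hypothetical function name, pid_usage_percent, and it only does the arithmetic; you supply both counts.

```shell
# pid_usage_percent: integer percentage of the PID space currently in use.
# Args: number of tasks in use, value of kernel.pid_max.
pid_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}
```

On a Linux node with procps, one way to call it is pid_usage_percent "$(ps -eLf --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)". Counting threads with -L matters here because every thread consumes an ID, so a plain process count can understate the pressure.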
Treat emergency relief and permanent prevention as separate work items. Killing a runaway process, deleting a low-priority pod, pruning images, or raising a temporary PID limit may restore enough room for the node to respond. The permanent fix may be a workload limit, a log rotation policy, a DaemonSet rollback, a larger node shape, or fewer pods per node. If you stop at relief, the same pressure condition will return when the workload pattern repeats.
    # Check current PID limit
    cat /proc/sys/kernel/pid_max

    # Raise the limit temporarily (only if the current value is lower than 65536;
    # writing a smaller number than the current value reduces the limit)
    echo 65536 | sudo tee /proc/sys/kernel/pid_max

    # Count processes per user
    ps aux | awk '{print $1}' | sort | uniq -c | sort -rn | head

Which approach would you choose here and why: delete the largest pod, drain the node, or cordon the node and collect evidence first? The best answer depends on blast radius. If the node is minutes from lockup, containment comes first. If the cluster has enough spare capacity and the cause is not obvious, cordon plus evidence collection can prevent new workload placement while preserving data for diagnosis. If a known low-priority workload is the culprit, targeted eviction may restore the node without moving unrelated pods.
Network, Shutdown, and Recovery Paths
A worker node can have healthy services and plenty of resources while still failing cluster duties because the network path is broken. The node must reach the API server for heartbeats and pod updates, DNS for name resolution, registries for image pulls, other nodes for pod networking, and sometimes cloud or storage endpoints for volumes. Node network failures are especially confusing because application traffic, SSH, and API reachability can fail independently.
Separate the network paths by purpose. API server reachability keeps kubelet status and pod assignment flowing. Registry reachability determines whether new images can be pulled. Cluster DNS affects workloads that need service discovery. Pod overlay or routing paths determine whether pods can talk across nodes. SSH only proves a management path exists. A node can pass one of these tests and fail another, so a single successful ping should never end a node network investigation.
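One way to keep the paths separate is to label every probe with its purpose, so a pass on one path is never mistaken for a pass on another. The sketch below is an assumption-laden illustration: `probe` is a hypothetical helper, it relies on `bash` and coreutils `timeout` being present on the node, and it only tests TCP connection setup, not DNS answers or TLS health.

```shell
#!/bin/sh
# Attempt a TCP connection and report the result with a purpose label.
# $1 = label, $2 = host, $3 = port, $4 = timeout in seconds (default 3)
probe() {
  if timeout "${4:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$2" "$3" 2>/dev/null; then
    echo "$1 $2:$3 OK"
  else
    echo "$1 $2:$3 FAIL"
  fi
}

# Usage on the affected node, then again on a healthy node for comparison
# (registry.example.com is a placeholder, not a real endpoint):
#   probe api      <api-server-ip>      6443
#   probe registry registry.example.com 443
```

Comparing the labeled output from the affected node with the same run on a healthy node shows which path differs.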
flowchart LR
    Node -->|port 6443| API[API Server]
    Node -->|varies based on CNI| Nodes[Other Nodes]
    Node -->|port 53| DNS[DNS Servers]
    Node -->|port 443| Reg[Container Registry]
    style API stroke:#f66,stroke-width:2px

Begin network diagnosis from the affected node, then compare with a healthy node. If only one node is unable to reach the API server, suspect host firewall rules, routes, interface addressing, node security groups, or local DNS configuration. If many nodes fail at once, look for shared control-plane reachability, network policy mistakes, CNI failure, or infrastructure routing. A single command rarely proves the cause; you need the pattern across nodes and destinations.
# Check basic connectivity (-c 3 stops after three packets)
ping -c 3 <api-server-ip>
# Check API server reachability
curl -k https://<api-server>:6443/healthz
# Check DNS
nslookup kubernetes.default.svc.cluster.local
cat /etc/resolv.conf
# Check firewall
sudo iptables -L -n
sudo firewall-cmd --list-all  # If using firewalld
# Check network interfaces
ip addr
ip route

| Issue | Symptom | Diagnosis | Fix |
|---|---|---|---|
| Firewall blocking | API unreachable | telnet api-server 6443 | Open firewall ports |
| DNS failure | Name resolution fails | nslookup | Fix /etc/resolv.conf |
| IP address change | Node NotReady | Compare the host IP with the node's InternalIP (kubectl get nodes -o wide) | Reconfigure or rejoin |
| CNI plugin issues | Pod networking fails | Check CNI pods | Restart CNI, fix config |
| MTU mismatch | Intermittent failures | Check MTU settings | Align MTU values |

| Port | Protocol | Component | Purpose |
|---|---|---|---|
| 6443 | TCP | API Server | Kubernetes API |
| 10250 | TCP | kubelet | kubelet API |
| 10259 | TCP | kube-scheduler | Scheduler metrics |
| 10257 | TCP | kube-controller-manager | Controller metrics |
| 2379-2380 | TCP | etcd | Client and peer |
| 30000-32767 | TCP | NodePort | Service NodePorts |
Recovery begins once you know whether the node is reachable, whether kubelet can run, and whether the workload should be moved. If the node is healthy enough to participate, cordon first to stop new assignments, drain when you need to clear existing workloads, perform maintenance, and then uncordon after validation. If the node is not reachable, you may need infrastructure console access, forced power recovery, out-of-service taints for storage detachment behavior, or eventual node deletion.
The safest recovery action is the one that matches the node’s current ability to cooperate. A responsive node with a running kubelet can drain ordinary pods and report progress. A partially responsive node may need cordon plus targeted service repair before a drain will complete. A powered-off node is unable to evict anything locally, so the control plane and storage system must handle replacement and detachment according to their rules. Matching the action to node cooperation prevents commands from hanging and prevents accidental data-loss decisions.
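The idea of matching the action to node cooperation can be written down as a tiny lookup. This is a simplified sketch of the paragraph above, not an operational policy: the function name `recovery_action` and its yes/no inputs are illustrative assumptions, and real incidents need the storage and workload review described elsewhere in this module.

```shell
#!/bin/sh
# Suggest a recovery path from two observations.
# $1 = host reachable (yes/no), $2 = kubelet active (yes/no)
recovery_action() {
  case "$1:$2" in
    yes:yes) echo "cordon, then drain" ;;
    yes:no)  echo "cordon, repair kubelet or runtime, then drain" ;;
    *)       echo "use console access; consider out-of-service handling or node deletion" ;;
  esac
}

# Usage:
#   recovery_action yes no
```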
flowchart TD
    A{Node NotReady?} -->|Yes| B{Can SSH to node?}
    B -->|Yes| C{kubelet running?}
    B -->|No| D[Check physical/VM, cloud console]
    C -->|Yes| E[Check logs, certs, API connectivity]
    C -->|No| F[Start kubelet]
    E --> G{Still NotReady after fixes?}
    F --> G
    G -->|Yes| H[Drain and rejoin node]

Graceful node shutdown is a Kubernetes feature path that lets kubelet react when the operating system is shutting down, mark the node appropriately, and terminate pods in an orderly way when configured. Linux support has existed for multiple releases, Windows support is documented for newer releases, and Kubernetes 1.35 operators should still inspect actual kubelet configuration because default shutdown grace values can be zero. Do not assume a reboot is graceful just because Kubernetes supports the feature.
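To inspect the actual shutdown settings rather than assume them, a filter over the kubelet configuration file is enough. The sketch below assumes the configuration lives at `/var/lib/kubelet/config.yaml` and uses the `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` fields of `KubeletConfiguration`; the helper name `shutdown_grace` is hypothetical, and a field reported as unset falls back to the kubelet default.

```shell
#!/bin/sh
# Print the graceful-shutdown fields from a KubeletConfiguration read on stdin.
# A field that does not appear in the file is reported as "unset".
shutdown_grace() {
  awk '
    $1 == "shutdownGracePeriod:"             { total = $2 }
    $1 == "shutdownGracePeriodCriticalPods:" { critical = $2 }
    END {
      print "shutdownGracePeriod=" (total == "" ? "unset" : total)
      print "shutdownGracePeriodCriticalPods=" (critical == "" ? "unset" : critical)
    }'
}

# Usage on a node:
#   sudo cat /var/lib/kubelet/config.yaml | shutdown_grace
```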
Non-graceful shutdown is a different story. If a VM disappears, the kubelet has no chance to update pod status or detach volumes cleanly. Kubernetes has documented mechanisms such as out-of-service taints to help operators handle stuck workloads and storage detachment, but those mechanisms are not casual cleanup tools. Use them when you have confirmed the node is truly gone or unsafe to wait for, and record why the normal graceful path was unavailable.
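For reference, the out-of-service taint is applied with an ordinary `kubectl taint` command. The sketch below only prints the command so it can be reviewed before anyone runs it; the wrapper name `out_of_service_cmd` is hypothetical, and the taint value `nodeshutdown` follows the form shown in the Kubernetes node shutdown documentation.

```shell
#!/bin/sh
# Print (do not run) the command that marks a confirmed-dead node as out of service.
# $1 = node name
out_of_service_cmd() {
  echo "kubectl taint nodes $1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute"
}

# Usage:
#   out_of_service_cmd worker-2
```

After the workloads have recovered elsewhere, the taint has to be removed manually, which `kubectl taint` does when the same taint is given with a trailing hyphen.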
Drain and cordon solve different problems. cordon prevents new pods from landing on the node, but it does not move existing pods. drain cordons the node and evicts eligible pods, while respecting PodDisruptionBudgets for voluntary disruptions and requiring explicit handling for DaemonSets and emptyDir data. In an exam, using the wrong one wastes time. In production, using the wrong one can either fail to clear the node or disrupt more workloads than intended.
# Drain node (evicts pods safely)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Cordon only (prevent new pods)
kubectl cordon <node-name>
# Uncordon (allow scheduling again)
kubectl uncordon <node-name>

If a node’s Kubernetes state is corrupted beyond quick repair, rejoining may be faster and safer than hand-editing every damaged file. This is common after certificate problems, bad kubelet bootstrapping, or broken local configuration. Treat kubeadm reset as destructive for the node’s Kubernetes membership, not for the entire cluster. Generate a fresh join command from the control plane, reset the worker, rejoin, then verify node readiness, labels, taints, and workload placement.
# On the worker node
sudo kubeadm reset -f
# On control plane - generate new join token
kubeadm token create --print-join-command
# On worker - rejoin
sudo kubeadm join <api-server>:6443 --token <token> --discovery-token-ca-cert-hash <hash>

If the hardware or VM will never return, remove the node object so the cluster stops carrying stale state. Drain first when possible, because deletion alone does not magically move running containers from a dead machine; it only removes the API object. If the node is already gone and a drain is impossible, document the storage and application consequences before deleting it. Stateful workloads and local persistent volumes require extra care because the cluster may not be able to safely detach or replace data without operator action.
After recovery, validate more than Ready=True. Check that expected labels, taints, runtime versions, CNI files, kubelet configuration, and node allocatable resources match the rest of the pool. Confirm DaemonSets have returned, storage plugins are healthy, and a small test pod can schedule and reach cluster DNS. Many node repairs fail at this last step because the machine rejoins but lacks a label or daemon required by production workloads.
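The label check in particular is easy to script. The sketch below compares label keys between a healthy peer and the recovered node and prints the keys the recovered node lacks. The helper name `missing_label_keys` is hypothetical, and it assumes the comma-separated `key=value` form that the LABELS column of `kubectl get node --show-labels` prints.

```shell
#!/bin/sh
# Print label keys present on a healthy node but absent on a recovered node.
# $1 = labels of the healthy node, $2 = labels of the recovered node,
# both as comma-separated key=value pairs. Values are ignored because
# per-node labels such as the hostname differ by design.
missing_label_keys() {
  printf '%s\n%s\n' "$1" "$2" | awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) { split($i, kv, "="); want[kv[1]] = 1 } }
    NR == 2 { for (i = 1; i <= NF; i++) { split($i, kv, "="); have[kv[1]] = 1 } }
    END { for (k in want) if (!(k in have)) print k }' | sort
}

# Usage:
#   healthy="$(kubectl get node <healthy-node> --show-labels --no-headers | awk '{print $NF}')"
#   fixed="$(kubectl get node <node-name> --show-labels --no-headers | awk '{print $NF}')"
#   missing_label_keys "$healthy" "$fixed"
```

An empty result means the key sets match; it does not prove the values are right, so pool or zone labels still deserve a manual look.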
# Drain first
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Delete node from cluster
kubectl delete node <node-name>
kubectl get nodes  # Verify node is removed
# On the node itself
sudo kubeadm reset -f

Patterns & Anti-Patterns
Worker node repair works best when the team has repeatable habits rather than heroic improvisation. The patterns below are useful because they preserve evidence, reduce blast radius, and align with how Kubernetes actually moves from observation to scheduling and eviction. They also scale from a single CKA lab node to a production pool with autoscaling, maintenance windows, and multiple workload priorities.
| Pattern | When to Use | Why It Works | Scaling Considerations |
|---|---|---|---|
| Control-plane view first | Any node alert or CKA troubleshooting task | It classifies Ready, Unknown, pressure, taints, events, and scheduling state before local changes | Automate snapshots of kubectl get nodes, conditions, and events during incidents. |
| Node-local evidence second | The API reports unhealthy status but the host may still be reachable | systemctl, journalctl, crictl, and OS metrics reveal causes hidden from the API | Standardize SSH access, log retention, and host diagnostics across node images. |
| Cordon before uncertain repair | You need time to inspect a node without receiving new pods | It reduces new workload placement while preserving existing evidence | Pair with alerts for long-cordoned nodes so maintenance state does not linger. |
| Drain before planned maintenance | You need to reboot, patch, reset, or remove a node | It uses Kubernetes eviction logic rather than killing workloads blindly | Check PodDisruptionBudgets and cluster spare capacity before draining many nodes. |
| Runtime-aware cleanup | Disk pressure is tied to images, stopped containers, or logs | crictl and journald cleanup avoid random deletion under runtime directories | Use image garbage collection settings and log rotation to prevent repeated pressure. |
The anti-patterns are tempting because they feel fast. Restarting every service may temporarily hide a symptom. Deleting a node object may make a red status disappear. Raising PID or disk thresholds may delay an alert. Those actions are not inherently forbidden, but they become dangerous when they happen before classification, containment, and evidence collection.
Patterns also need boundaries. A drain is excellent for planned maintenance, but it may hang or cause excess disruption if the cluster lacks spare capacity or has strict disruption budgets. Runtime cleanup is helpful for disk pressure, but it is not a substitute for fixing log growth or image churn. Cordon is useful while investigating, but a forgotten cordon silently reduces cluster capacity. The best operators pair every pattern with a verification step that proves the node and the cluster returned to the intended state.
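A forgotten cordon is easy to detect from the node list. The following sketch filters `kubectl get nodes` output for the `SchedulingDisabled` status; the helper name `cordoned_nodes` is an assumption for illustration, and deciding how long a cordon may reasonably last is left to the team.

```shell
#!/bin/sh
# Print the names of cordoned nodes.
# Reads `kubectl get nodes --no-headers` output on stdin.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | cordoned_nodes
```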
| Anti-Pattern | What Goes Wrong | Better Alternative |
|---|---|---|
| Restarting kubelet before reading logs | You lose timing clues and may chase a dependency failure as a kubelet failure | Capture systemctl status and recent journalctl output first. |
| Treating cordon as workload evacuation | Existing pods keep running and maintenance remains blocked | Use drain when the goal is to clear workloads from the node. |
| Deleting a NotReady node immediately | You remove API state without understanding storage, workload, or recovery impact | Drain when possible, investigate reachability, then delete only when replacement is intended. |
| Ignoring resource pressure conditions | Pending pods and evictions look mysterious even though kubelet is protecting the host | Inspect MemoryPressure, DiskPressure, PIDPressure, and OS metrics together. |
| Cleaning runtime storage with rm -rf | You can corrupt runtime state or remove evidence needed for root cause | Prefer crictl rmi --prune, journald vacuuming, and targeted log cleanup. |
| Assuming all pod movement respects PodDisruptionBudgets | Node-pressure evictions are local emergency actions and can bypass voluntary disruption protections | Distinguish node-pressure eviction from kubectl drain and controller replacement. |
Decision Framework
The fastest safe response comes from asking four questions in order. First, can the API server still see recent node status? Second, can you reach the host by SSH or infrastructure console? Third, are kubelet and the container runtime healthy locally? Fourth, is the node safe to keep in service, or should it be isolated and repaired? This sequence avoids jumping from symptom to destructive action.
Use the framework as a decision tree, not a checklist to finish mechanically. If kubectl describe node already shows DiskPressure=True and image garbage collection failures, you do not need to spend ten minutes proving that kubelet exists before checking disk. If SSH is dead and the cloud console shows the instance powered off, local journalctl commands are impossible until the host returns. The value of the framework is that it keeps your next command tied to the strongest current signal.
flowchart TD
    A[Node alert or failed workload] --> B[Check kubectl get nodes and describe node]
    B --> C{Ready, NotReady, or Unknown?}
    C -->|Ready with pressure| D[Inspect resources and workload placement]
    C -->|NotReady| E[SSH and inspect kubelet, runtime, logs]
    C -->|Unknown| F[Test host reachability and API network path]
    D --> G{Immediate host risk?}
    E --> H{Service or config repair clear?}
    F --> I{Node reachable outside API?}
    G -->|Yes| J[Cordon, contain culprit, drain if needed]
    G -->|No| K[Collect evidence and tune workload requests]
    H -->|Yes| L[Repair service, verify Ready, uncordon]
    H -->|No| M[Drain, reset, rejoin, or replace]
    I -->|Yes| E
    I -->|No| N[Use console, out-of-service handling, or delete after impact review]

| Situation | First Move | Next Check | Avoid |
|---|---|---|---|
| Ready=Unknown and SSH fails | Check infrastructure console or VM health | Network path to API server and node power state | Restarting workloads from the API without knowing where they run. |
| NotReady but SSH works | Inspect kubelet and container runtime with systemd and logs | Certificate, CRI socket, API reachability, kubelet config | Blind node deletion. |
| MemoryPressure=True | Identify top memory users and workload QoS | Requests, limits, DaemonSets, host daemons | Increasing eviction thresholds during active pressure. |
| DiskPressure=True | Check filesystem, inodes, image storage, logs | Runtime cleanup, log rotation, container writable layers | Random deletion under /var/lib/containerd. |
| Planned reboot | Cordon, drain, reboot, validate, uncordon | PodDisruptionBudgets and DaemonSet behavior | Using cordon alone and assuming pods moved. |
| Irrecoverable host | Drain if possible, delete node, replace capacity | Stateful storage and local volume impact | Leaving stale nodes indefinitely. |
For CKA practice, keep the framework compact in your head: API view, node access, kubelet, runtime, resources, network, safe isolation, recovery. In real operations, add communication and blast-radius controls around the same sequence. Tell application owners when a drain might be delayed by PodDisruptionBudgets. Watch cluster capacity before moving pods. Confirm that DaemonSets and node labels return after replacement, because a recovered node that lacks the right labels or taints can be just as disruptive as a failed one.
Finally, decide what evidence proves completion. For a kubelet repair, completion is not merely systemctl restart kubelet; it is the node returning to Ready=True, pressure conditions staying false, and recent kubelet logs showing stable status updates. For a drain, completion is not the command returning; it is workload replacement, no unintended pods left behind, and the node clearly marked for maintenance. For a replacement, completion includes new node capacity, correct labels, healthy DaemonSets, and no stale node objects confusing future responders.
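The completion test for a kubelet repair can be expressed as a check over node conditions. The sketch below returns success only when `Ready` is `True` and every condition whose type ends in `Pressure` is `False`. The helper name `node_healthy` and the `Type=Status` line format are assumptions for the example, and it checks a single point in time, so it does not prove that conditions stay stable.

```shell
#!/bin/sh
# Exit 0 only when Ready=True and all *Pressure conditions are False.
# Reads one "Type=Status" pair per line on stdin.
node_healthy() {
  awk -F= '
    $1 == "Ready" { ready = ($2 == "True") }
    $1 ~ /Pressure$/ && $2 != "False" { bad = 1 }
    END { exit ((ready && !bad) ? 0 : 1) }'
}

# Usage:
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | node_healthy \
#     && echo "conditions look healthy"
```

Repeating the check a few minutes apart, alongside recent kubelet logs, gets closer to the "staying false" standard described above.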
Did You Know?
- Default pod toleration window: ordinary pods receive a default `tolerationSeconds: 300` for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` `NoExecute` taints, so failover after a node partition is intentionally delayed.
- Lease heartbeats reduce API load: modern Kubernetes node liveness uses lightweight Lease objects as part of heartbeat reporting, so the node controller can monitor liveness without rewriting the full Node object for every signal.
- Node-pressure evictions are not voluntary disruptions: kubelet eviction under memory, disk, or PID pressure can bypass PodDisruptionBudgets because the node is protecting itself from host-level failure.
- Graceful shutdown needs real configuration: Kubernetes documents graceful node shutdown behavior, but kubelet shutdown grace periods can be zero by default, so operators must verify the actual node configuration before relying on orderly termination.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Not checking kubelet first | The API view is easier to reach, so engineers keep running kubectl while the node agent is down. | SSH to the node and run sudo systemctl status kubelet plus recent journalctl before restarting. |
| Ignoring node conditions | kubectl get nodes compresses a lot of state into one status column, hiding pressure details. | Inspect MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, taints, and recent events. |
| Deleting a node before drain | Removing the API object feels like cleanup, but it does not safely evict reachable workloads. | Use kubectl drain when the node can participate, then delete only when replacement is intended. |
| Forgetting DaemonSet pods during drain | DaemonSet pods are managed differently and are not evicted like ordinary replicated pods. | Use --ignore-daemonsets and verify that node-level agents tolerate the maintenance workflow. |
| Blaming kubelet for runtime failures | Kubelet logs report CRI errors, so the dependency failure looks like a kubelet failure. | Check systemctl status containerd, the CRI socket, and sudo crictl info before restarting kubelet repeatedly. |
| Ignoring disk and inode usage | Memory alerts are obvious, while full filesystems and inodes surface as unrelated image or log failures. | Run df -h, inode checks, image cleanup, and journald vacuuming as part of node pressure diagnosis. |
| Restarting without daemon-reload | Edited systemd drop-ins are not loaded automatically, so the old kubelet flags remain active. | Run sudo systemctl daemon-reload before restarting kubelet after changing unit files or drop-ins. |
| Skipping CNI and route checks | A node that answers SSH can still fail pod networking or API reachability. | Compare routes, firewall rules, DNS, CNI pods, MTU, and API server connectivity against a healthy node. |
Question 1: A worker node suddenly shows `Ready=Unknown`, but the application team says users are still reaching some pods that were already on that node. What should you conclude first?
- All containers on the node have definitely stopped.
- The control plane has lost reliable heartbeat visibility, but local containers may still be running.
- The scheduler is broken because it has not instantly replaced every pod.
- The node must be deleted before any other check.
Answer: Option 2 is the correct first conclusion. Unknown means the node controller is no longer receiving trustworthy status updates, not that every process on the host has stopped. Option 1 confuses API visibility with runtime state. Option 3 ignores default tolerations and eviction timing. Option 4 is unsafe because deletion removes API state before you know whether the host is reachable, recoverable, or holding sensitive workload and storage state.
Question 2: During a CKA task, `kubectl describe node worker-2` shows `MemoryPressure=True`, and a newly created pod remains pending. What should you investigate and why?
- Investigate memory usage, pod requests, workload QoS, and kubelet eviction events.
- Delete the kube-system namespace because the scheduler is stuck.
- Assume the pod image is invalid because pending pods always mean image pull failure.
- Restart the API server because node pressure is stored in etcd.
Answer: Option 1 is correct because MemoryPressure=True tells the scheduler to avoid the node and tells you the kubelet may be evicting pods locally to protect the host. Pod requests and QoS influence eviction risk, so they matter during root cause analysis. Option 2 is destructive and unrelated. Option 3 confuses Pending scheduling failure with image pull states. Option 4 treats the control plane as the cause even though the condition is reported by the node.
Question 3: Kubelet logs repeatedly show connection refused for `unix:///run/containerd/containerd.sock`. Which action gives the strongest next signal?
- Restart kubelet immediately and ignore the runtime logs.
- Check `systemctl status containerd`, verify the socket path, and inspect containerd logs.
- Delete all pods scheduled to the node from the API server.
- Increase `pid_max` because socket errors always mean PID pressure.
Answer: Option 2 is correct because kubelet is reporting that its CRI dependency is unreachable. Verifying containerd service state and socket existence tests the dependency directly. Option 1 may reproduce the same error without fixing anything. Option 3 disrupts workloads without explaining why the node is unable to create or inspect containers. Option 4 is speculation unless process exhaustion is also visible in OS metrics.
Question 4: You need to patch a healthy worker node and want existing workloads to move away before rebooting. Which command sequence is appropriate?
- `kubectl cordon`, reboot immediately, then hope controllers replace pods.
- `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`, patch, reboot, validate, then `kubectl uncordon`.
- `kubectl delete node`, patch, and expect the same node object to return automatically.
- Restart containerd because maintenance is a runtime problem.
Answer: Option 2 is correct because draining safely evicts eligible pods and cordons the node as part of the maintenance flow. --ignore-daemonsets acknowledges that DaemonSet pods are not evicted like ordinary pods, and --delete-emptydir-data explicitly accepts ephemeral local data loss. Option 1 only prevents new scheduling and leaves old pods behind. Option 3 removes cluster state rather than preparing for planned maintenance. Option 4 does not address workload evacuation.
Question 5: The API server is reachable from your laptop, but `kubectl logs` times out for a pod on a damaged worker. SSH to the worker works. How can you inspect local container logs?
- Use `sudo crictl ps` to find the container and `sudo crictl logs <container-id>` on the node.
- Run `kubectl logs` repeatedly until the timeout clears.
- Delete the pod and inspect the replacement instead.
- Query etcd directly for the container stdout file.
Answer: Option 1 is correct because crictl talks directly to the node-local CRI endpoint and can work when Kubernetes API-mediated log retrieval is failing. Option 2 wastes time if the failure path is kubelet, runtime, or node networking. Option 3 destroys useful local evidence and may not reproduce the same failure. Option 4 misunderstands where container logs live; etcd stores cluster state, not normal container stdout files.
Question 6: A node has `DiskPressure=True`, image pulls are failing, and `/var/log` is very large. Which remediation is safest as an initial step?
- Remove random directories under `/var/lib/containerd` with `rm -rf`.
- Vacuum old journald logs, prune unused images with `crictl`, and verify free space and inodes.
- Raise every eviction threshold so Kubernetes stops complaining.
- Delete the node object before checking the filesystem.
Answer: Option 2 is correct because it uses runtime-aware and log-aware cleanup before touching fragile runtime internals. It also confirms whether capacity and inode pressure actually improve. Option 1 risks corrupting runtime metadata or removing evidence. Option 3 hides the symptom while the host remains close to failure. Option 4 is an API cleanup action, not a disk repair.
Question 7: A worker is unable to reach `https://:6443/healthz`, but kubelet and containerd are active locally. What should you compare next?
- Routes, firewall rules, DNS, interface addresses, and API reachability against a healthy node.
- Only application pod logs, because the node services are healthy.
- The scheduler logs first, because scheduling always controls node heartbeats.
- The NodePort range, because API server health uses NodePort.
Answer: Option 1 is correct because a healthy local kubelet still needs a network path to the API server to report status and receive pod updates. Comparing against a healthy node exposes host-specific routing, firewall, DNS, or address differences. Option 2 ignores the node-level control path. Option 3 starts too high in the stack. Option 4 confuses Kubernetes service NodePorts with the API server’s secure port.
Hands-On Exercise: Node Troubleshooting Simulation
Scenario
Exercise scenario: you are the on-call engineer for a Kubernetes 1.35 cluster. Monitoring reports that one worker node is intermittently unstable, some pods are pending, and the team is unsure whether this is a kubelet failure, runtime failure, resource pressure event, or maintenance issue. Your task is to collect evidence, classify the failure, and practice the safe isolation commands without making irreversible changes.
Prerequisites
- Access to a Kubernetes cluster
- SSH access to at least one worker node
- Permission to run `kubectl`, `systemctl`, `journalctl`, and `crictl` in the lab environment
Task 1: Node Health Assessment
Begin from the control plane view. Identify the node you want to investigate, record its status, and inspect all node conditions so you know whether this is readiness, pressure, network, or scheduling state.
Solution
# Check all nodes
kubectl get nodes -o wide
# Get detailed node information
kubectl describe node <node-name>
# Check node conditions specifically
kubectl get node <node-name> -o jsonpath='{.status.conditions[*].type}' | tr ' ' '\n'

Task 2: kubelet Investigation
Assume the node is showing signs of distress. SSH directly into the node and interrogate the primary agent before restarting it, because the first log messages often tell you whether the issue is configuration, certificates, runtime connectivity, or API reachability.
Solution
# SSH to a worker node
ssh <node>
# Check kubelet status
sudo systemctl status kubelet
# View recent kubelet logs
sudo journalctl -u kubelet --since "5 minutes ago" | tail -50
# Check kubelet configuration
cat /var/lib/kubelet/config.yaml | head -30

Task 3: Container Runtime Check
The kubelet relies on the container runtime, so verify that containerd is healthy and that CRI inspection works locally. This gives you evidence even if API-mediated commands are slow or unavailable.
Solution
# Check containerd status
sudo systemctl status containerd
# List running containers
sudo crictl ps
# Check container runtime info
sudo crictl info
# List images on node
sudo crictl images

Task 4: Resource Assessment
The node may be healthy at the service level but starving for resources. Compare Kubernetes metrics and OS metrics so you can tell whether pressure is caused by pods, host daemons, image storage, logs, or process exhaustion.
Solution
# Check memory
free -m
# Check disk
df -h
# Check what's using resources
kubectl top node <node-name>
# See allocated resources
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

Task 5: Cordon and Uncordon Safely
You have decided the node needs a reboot to clear a suspected memory leak. Cordon the node, verify that the scheduler avoids it, then uncordon it so the lab does not leave the cluster in maintenance mode.
Solution
# Cordon a node (prevents new scheduling)
kubectl cordon <node-name>
# Verify it's unschedulable
kubectl get node <node-name>
# Try to schedule a pod
kubectl run test-pod --image=nginx
kubectl get pods test-pod -o wide  # Should NOT be on cordoned node
# Uncordon
kubectl uncordon <node-name>
# Verify node is schedulable again
kubectl get node <node-name>
# Cleanup
kubectl delete pod test-pod

Success Criteria
- Checked node conditions for all nodes using jsonpath.
- Verified kubelet is running and inspected the systemd logs.
- Verified containerd is running and used crictl to list images.
- Assessed node resource usage at both the OS and cluster levels.
- Successfully cordoned a node, tested scheduler avoidance, and uncordoned it.
Practice Drills
These short drills build command recall after you understand the reasoning. Run them only in a lab or approved environment, and say what signal each command is supposed to prove before you execute it.
Drill 1: Node Status Check
# Task: List all nodes with their status
kubectl get nodes

Drill 2: Node Conditions
# Task: Check all conditions for a specific node
kubectl describe node <node> | grep -A 10 Conditions

Drill 3: kubelet Status
# Task: Check if kubelet is running (on node)
sudo systemctl status kubelet

Drill 4: kubelet Logs
# Task: View last 20 lines of kubelet logs
sudo journalctl -u kubelet -n 20

Drill 5: Container Runtime Status
# Task: Check containerd and list containers
sudo systemctl status containerd
sudo crictl ps

Drill 6: Resource Usage
# Task: Check node resource usage
kubectl top nodes
kubectl describe node <node> | grep -A 5 "Allocated resources"

Drill 7: Drain Node
# Task: Safely drain a node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

Drill 8: Disk Usage
# Task: Check disk usage on node (du needs root to read the runtime directory)
df -h
sudo du -sh /var/lib/containerd/

Cleanup
Ensure the node is uncordoned, the test pod is deleted, and any temporary notes clearly distinguish observation from action. In a shared lab, verify that no node remains in SchedulingDisabled state unless the exercise environment explicitly expects it.
Sources
- kubernetes.io: taint and toleration
- Node Status
- kubernetes.io: nodes
- kubernetes.io: kubernetes 1 29 taint eviction controller
- Certificate Management with kubeadm
- Node-pressure Eviction
- kubernetes.io: node shutdown
- kubernetes.io: kubectl drain
- Node Status Reference
- Debugging Kubernetes nodes with crictl
Next Module
Continue to Module 5.5: Network Troubleshooting to learn how to diagnose and fix pod-to-pod, pod-to-service, and external connectivity issues that plague distributed systems.