Skip to content

Module 5.4: Worker Node Failures

Hands-On Lab Available
K8s Cluster advanced 40 min
Launch Lab ↗

Opens in Killercoda in a new tab

Complexity: [MEDIUM] - Critical for cluster operations

Time to Complete: 45-55 minutes

Prerequisites: Module 5.1 (Methodology), Module 1.1 (Cluster Architecture)

In 2018, a major online retailer experienced a catastrophic global outage during their peak holiday sales event. The root cause was not a complex network intrusion or a database corruption, but a simple memory leak in a third-party logging daemon deployed across their worker nodes. As the daemon consumed RAM, individual worker nodes sequentially exhausted their memory capacities. Each node hit MemoryPressure, stopped accepting new pods, and began aggressively evicting existing workloads.

Because the underlying issue was not immediately diagnosed, the Kubernetes scheduler desperately scrambled to place the newly evicted pods onto the remaining healthy nodes. This cascading failure created a massive “thundering herd” effect. The surviving worker nodes were instantaneously overwhelmed by the flood of rescheduled pods, causing them to run out of memory as well. Within minutes, the entire e-commerce platform collapsed, resulting in six hours of downtime and an estimated $15 million in lost revenue. This incident underscores a brutal truth in distributed systems: a localized node failure, if left unchecked, can quickly metastasize into a global cluster outage.

Worker nodes are the fundamental workhorses of your Kubernetes cluster. They are where your applications actually execute. When a node fails, the applications running on it suffer immediately. Understanding how to definitively diagnose and fix worker node issues—whether it is a crashed kubelet agent, an unresponsive container runtime, or critical resource exhaustion—is essential for maintaining cluster health. This module prepares you to jump into a failing node, interpret the low-level system signals, and confidently restore service before the cascading effects take hold of your infrastructure.

The Factory Floor Analogy

If the control plane is management, worker nodes are the factory floor. The kubelet is the floor supervisor - if they’re out, nothing gets done. The container runtime is the machinery - if it breaks, production stops. Node resources (CPU, memory, disk) are the raw materials - run out, and the factory grinds to a halt.

  • Diagnose the root cause of NotReady and Unknown node states using systematic debugging techniques and system logs.
  • Evaluate node resource pressure conditions and implement immediate remediation strategies to prevent cascading failures across the cluster.
  • Debug kubelet and container runtime integration failures by analyzing systemd service states, journalctl logs, and CRI socket configurations.
  • Implement safe node recovery and maintenance procedures, including cordoning, draining, and component restarts while respecting workload disruption budgets.
  • Diagnose worker node NotReady status by checking kubelet, container runtime, and network
  • Fix kubelet failures caused by configuration errors, certificate expiry, and resource pressure
  • Recover a node from disk pressure, memory pressure, and PID pressure conditions
  • Drain and cordon nodes safely during maintenance while respecting PodDisruptionBudgets
  • 10-second heartbeats: The kubelet reports its node status to the API server every 10 seconds. If 40 seconds pass without a heartbeat, the node is marked Unknown.
  • 5-minute eviction threshold: By default, pods running on a NotReady node are tolerated for exactly 300 seconds (5 minutes) before the control plane initiates eviction.
  • 15 percent disk threshold: The kubelet automatically triggers DiskPressure and begins garbage collecting unused container images when the node’s root filesystem drops below 15% available space.
  • 65536 PID limit: In many default Linux distributions configured for Kubernetes, the pid_max limit is historically set to 32768, which can easily be exhausted by rogue microservices, causing PIDPressure.

Before diving into the command line, it is critical to understand how Kubernetes thinks about node health. The Kubernetes control plane does not actively poll the worker nodes; instead, it relies on a push-based mechanism. The kubelet agent running on each worker node is responsible for periodically evaluating the node’s health and pushing a status update (a heartbeat) back to the API server. The Node Controller, running inside the kube-controller-manager on the control plane, monitors these heartbeats. If the heartbeats stop, or if the kubelet explicitly reports a problem, the Node Controller changes the node’s status to reflect the failure.

The node’s status is expressed through a set of Node Conditions. These conditions are boolean flags that describe specific aspects of the node’s health.

┌──────────────────────────────────────────────────────────────┐
│ NODE CONDITIONS │
│ │
│ Condition Healthy Meaning │
│ ───────────────────────────────────────────────────────── │
│ Ready True Node is healthy, can run pods │
│ MemoryPressure False Memory is sufficient │
│ DiskPressure False Disk space is sufficient │
│ PIDPressure False Process IDs are available │
│ NetworkUnavailable False Network is configured │
│ │
│ Any unhealthy condition → scheduling problems │
│ Ready=False or Unknown → node is NotReady │
│ │
└──────────────────────────────────────────────────────────────┘

Architecturally, we can visualize the relationship between these conditions and the overall node readiness as follows:

graph TD
A[Node Conditions] --> B{Ready?}
B -->|True| C[Healthy, can run pods]
B -->|False/Unknown| D[NotReady, scheduling problems]
A --> E[Resource Pressures]
E --> F[MemoryPressure]
E --> G[DiskPressure]
E --> H[PIDPressure]
E --> I[NetworkUnavailable]
F -.->|True| D
G -.->|True| D
H -.->|True| D
I -.->|True| D

To diagnose a node, you must first interrogate the API server to see what it believes the node’s state is. You can start with a broad overview and then drill down into the specific conditions of a problematic node.

Terminal window
# Quick status
k get nodes
# Detailed conditions
k describe node <node-name> | grep -A 10 Conditions
# All nodes with conditions
k get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,REASON:.status.conditions[?(@.type=="Ready")].reason'
# Check for resource pressure
k describe node <node-name> | grep -E "MemoryPressure|DiskPressure|PIDPressure"

When you query the node status, you will typically see one of the following states. Understanding the distinction between NotReady and Unknown is particularly important for troubleshooting.

StatusMeaningCommon Causes
ReadyHealthy and accepting podsNormal operation
NotReadyUnhealthykubelet down, network issues
UnknownNo heartbeat receivedNode unreachable, kubelet crashed
SchedulingDisabledCordonedManual cordon or maintenance

Stop and think: If a node transitions to the Unknown state, does that mean the applications running on it have crashed? Think about the separation of concerns between the control plane and the runtime before moving on.


The kubelet is the most critical Kubernetes component running on a worker node. It is the primary node agent, the direct representative of the control plane on the factory floor. If the kubelet is not functioning, the node is effectively severed from the cluster, regardless of whether the physical server is perfectly healthy.

┌──────────────────────────────────────────────────────────────┐
│ KUBELET RESPONSIBILITIES │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ │ │
│ │ • Registers node with API server │ │
│ │ • Watches for pod assignments │ │
│ │ • Manages container lifecycle (via runtime) │ │
│ │ • Reports node/pod status │ │
│ │ • Handles probes (liveness, readiness) │ │
│ │ • Mounts volumes │ │
│ │ • Runs static pods │ │
│ │ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ If kubelet fails → Node goes NotReady → Pods stop working │
│ │
└──────────────────────────────────────────────────────────────┘

Here is a structural view of the kubelet’s responsibilities and the consequences of its failure:

flowchart TD
K[kubelet] --> R[Registers node with API server]
K --> W[Watches for pod assignments]
K --> M[Manages container lifecycle via runtime]
K --> S[Reports node/pod status]
K --> H[Handles probes: liveness, readiness]
K --> V[Mounts volumes]
K --> P[Runs static pods]
K -.->|Fails| N[Node goes NotReady]
N -.->|Result| X[Pods stop working or face eviction]

When a node is NotReady, your very first step should be to bypass Kubernetes entirely, SSH directly into the affected node, and check the health of the kubelet service using the Linux system manager, systemd.

Terminal window
# SSH to the node first
ssh <node-name>
# Check kubelet service status
sudo systemctl status kubelet
# Check if kubelet is running
ps aux | grep kubelet
# Check kubelet logs
sudo journalctl -u kubelet -f
# Check recent kubelet errors
sudo journalctl -u kubelet --since "10 minutes ago" | grep -i error

Based on the output of the commands above, you can categorize the failure into one of several common buckets.

IssueSymptomDiagnosisFix
kubelet stoppedNode NotReadysystemctl status kubeletsystemctl start kubelet
kubelet crash loopNode flappingjournalctl -u kubeletFix config, check logs
Wrong configCan’t startError in logsFix /var/lib/kubelet/config.yaml
Can’t reach APINotReadyNetwork timeout in logsCheck network, firewall
Certificate issuesTLS errorsCert errors in logsRenew certs
Container runtime downCan’t create podsRuntime errorsFix containerd/docker

If the kubelet is simply stopped (perhaps due to an accidental administrative command or an abrupt system restart where the service wasn’t enabled), starting it is straightforward:

Terminal window
# Start kubelet
sudo systemctl start kubelet
# Enable on boot
sudo systemctl enable kubelet
# Check status
sudo systemctl status kubelet

More often, the kubelet is in a crash loop due to a configuration error. The kubelet’s configuration is typically split between a YAML file and a set of systemd drop-in arguments. A typo in either will prevent the daemon from starting.

Terminal window
# Check kubelet config file
cat /var/lib/kubelet/config.yaml
# Check kubelet flags
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# After fixing config, reload and restart
sudo systemctl daemon-reload
sudo systemctl restart kubelet

Another pernicious issue is certificate expiration. The kubelet authenticates to the API server using client certificates. If these expire (usually after one year in a kubeadm-provisioned cluster), the API server will reject the kubelet’s heartbeats, and the node will drop offline silently.

Terminal window
# Check certificate paths
cat /var/lib/kubelet/config.yaml | grep -i cert
# Verify certificates exist
ls -la /var/lib/kubelet/pki/
# For expired certs, may need to rejoin node
# On control plane: kubeadm token create --print-join-command
# On worker: kubeadm reset && kubeadm join ...

The kubelet does not actually run containers itself; it delegates that responsibility to a Container Runtime via the Container Runtime Interface (CRI). If the container runtime crashes, hangs, or corrupts its local storage, the kubelet will be unable to spin up new pods or retrieve the status of existing ones.

┌──────────────────────────────────────────────────────────────┐
│ CONTAINER RUNTIME STACK │
│ │
│ kubelet │
│ │ │
│ │ CRI (Container Runtime Interface) │
│ ▼ │
│ containerd (or docker, cri-o) │
│ │ │
│ │ OCI (Open Container Initiative) │
│ ▼ │
│ runc (low-level runtime) │
│ │ │
│ ▼ │
│ Linux kernel (cgroups, namespaces) │
│ │
└──────────────────────────────────────────────────────────────┘

The flow of instructions down the stack looks like this:

flowchart TD
K[kubelet] -->|CRI - gRPC via unix socket| C[containerd / cri-o]
C -->|OCI - JSON spec| R[runc / crun - low-level runtime]
R -->|System Calls| L[Linux kernel: cgroups, namespaces]

To troubleshoot the runtime, we use crictl, a CLI tool specifically designed for CRI-compatible runtimes. It is invaluable because it allows you to inspect the state of containers directly on the node without needing the Kubernetes API server to be reachable.

Terminal window
# Check containerd (most common)
sudo systemctl status containerd
sudo crictl info
# Check container runtime socket
ls -la /run/containerd/containerd.sock
# List containers with crictl
sudo crictl ps
# List images
sudo crictl images

Runtime issues often manifest as pods stuck in the ContainerCreating state, or as cryptic CRI integration errors inside the kubelet logs.

IssueSymptomDiagnosisFix
containerd stoppedPods ContainerCreatingsystemctl status containerdsystemctl start containerd
Socket missingkubelet errorsCheck socket pathRestart containerd
Disk fullContainer create failsdf -hClean up disk
Image pull failsImagePullBackOffCheck registry accessFix registry auth
Resource exhaustedRandom container failuresCheck cgroupsIncrease resources

If containerd has crashed, restarting it via systemd is the immediate remediation:

Terminal window
# Start containerd
sudo systemctl start containerd
# Check status
sudo systemctl status containerd
# Check logs for issues
sudo journalctl -u containerd --since "10 minutes ago"

If you need to dig deeper into why a specific container is failing to start, configuring and using crictl is your best path forward. Ensure crictl knows where your CRI socket is located by writing a quick config file.

Terminal window
# Configure crictl for containerd
cat <<EOF | sudo tee /etc/crictl.yaml
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
debug: false
EOF
# List all containers (including stopped)
sudo crictl ps -a
# Get container logs
sudo crictl logs <container-id>
# Inspect container
sudo crictl inspect <container-id>

Worker nodes possess finite physical resources. When a node begins to run out of memory, disk space, or process IDs, the kubelet detects this via cAdvisor (Container Advisor, which is embedded in the kubelet) and asserts a resource pressure condition.

┌──────────────────────────────────────────────────────────────┐
│ RESOURCE PRESSURE TYPES │
│ │
│ MEMORY PRESSURE │
│ • Available memory below threshold │
│ • Triggers pod eviction │
│ • Check: free -m, cat /proc/meminfo │
│ │
│ DISK PRESSURE │
│ • Disk usage above threshold │
│ • Triggers image garbage collection │
│ • Check: df -h │
│ │
│ PID PRESSURE │
│ • Process IDs exhausted │
│ • Can't fork new processes │
│ • Check: cat /proc/sys/kernel/pid_max │
│ │
│ When any pressure is True, node may not accept new pods │
│ │
└──────────────────────────────────────────────────────────────┘
mindmap
root((Resource Pressure))
Memory
Available memory below threshold
Triggers pod eviction
Check: free -m
Disk
Usage above threshold
Triggers image GC
Check: df -h
PID
Process IDs exhausted
Cannot fork processes
Check: pid_max

When diagnosing resource exhaustion, you must check both the Kubernetes API’s view of the node and the raw operating system metrics.

Terminal window
# Check node conditions
k describe node <node> | grep -A 10 Conditions
# On the node - check memory
free -m
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable"
# Check disk
df -h
du -sh /var/lib/containerd/* # Container storage
du -sh /var/log/* # Log storage
# Check PIDs
cat /proc/sys/kernel/pid_max
ps aux | wc -l

The kubelet determines when a node is under pressure based on configured eviction thresholds. These are defined in the kubelet’s configuration YAML.

evictionHard:
memory.available: "100Mi"
nodefs.available: "10%"
nodefs.inodesFree: "5%"
imagefs.available: "15%

When these thresholds are crossed, the kubelet acts defensively:

  1. Node condition set to True (e.g., MemoryPressure=True).
  2. The scheduler stops assigning new pods to the node.
  3. The kubelet begins evicting existing pods to reclaim resources, starting with BestEffort pods.

Pause and predict: If a runaway pod is consuming all the memory on a node, and the kubelet decides to evict it, what happens to the pod’s data if it was using an emptyDir volume? Think about the ephemeral nature of local storage before proceeding.

To fix memory pressure, you need to identify the culprit and intervene:

Terminal window
# Find memory-hungry processes
ps aux --sort=-%mem | head -20
# Find pods using most memory
k top pods -A --sort-by=memory
# Options:
# 1. Kill unnecessary processes
# 2. Evict low-priority pods
# 3. Add more memory to node

For disk pressure, the solution is aggressive cleanup. A node with a completely full disk will often completely freeze up, requiring a hard reboot.

Terminal window
# Find large files
sudo find / -type f -size +100M -exec ls -lh {} \;
# Clean up container images
sudo crictl rmi --prune
# Clean up old logs
sudo journalctl --vacuum-time=3d
# Clean up unused containers
sudo crictl rm $(sudo crictl ps -a -q --state exited)

PID pressure is an insidious problem. The Linux kernel limits the maximum number of Process IDs that can exist simultaneously. If a container forks processes rapidly without cleaning them up (a fork bomb), the node will hit its PID limit, preventing any new processes (including basic shell commands) from running.

Terminal window
# Check current PID limit
cat /proc/sys/kernel/pid_max
# Increase limit temporarily
echo 65536 | sudo tee /proc/sys/kernel/pid_max
# Find processes by count
ps aux | awk '{print $1}' | sort | uniq -c | sort -rn | head

Even if the kubelet is healthy and resources are abundant, a node must have robust network connectivity to function within the cluster. It must be able to reach the API server to send heartbeats, reach other nodes for overlay networking, and reach container registries to pull images.

┌──────────────────────────────────────────────────────────────┐
│ NODE NETWORK REQUIREMENTS │
│ │
│ Node needs connectivity to: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ API Server (port 6443) - Required │ │
│ │ Other nodes (varies) - For pod networking │ │
│ │ DNS servers (port 53) - For name resolution │ │
│ │ Registry (port 443) - For pulling images │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Network failures → Node NotReady or Unknown │
│ │
└──────────────────────────────────────────────────────────────┘
flowchart LR
Node -->|port 6443| API[API Server]
Node -->|varies based on CNI| Nodes[Other Nodes]
Node -->|port 53| DNS[DNS Servers]
Node -->|port 443| Reg[Container Registry]
style API stroke:#f66,stroke-width:2px

When diagnosing a network partition, use standard Linux networking tools directly from the affected worker node to trace the connection failure.

Terminal window
# Check basic connectivity
ping <api-server-ip>
# Check API server reachability
curl -k https://<api-server>:6443/healthz
# Check DNS
nslookup kubernetes.default.svc.cluster.local
cat /etc/resolv.conf
# Check firewall
sudo iptables -L -n
sudo firewall-cmd --list-all # If using firewalld
# Check network interfaces
ip addr
ip route

Common network issues range from aggressive firewall rules dropping packets to asymmetric routing configurations causing silent timeouts.

IssueSymptomDiagnosisFix
Firewall blockingCan’t reach APItelnet api-server 6443Open firewall ports
DNS failureName resolution failsnslookupFix /etc/resolv.conf
IP address changeNode NotReadyCheck IP in node specReconfigure or rejoin
CNI plugin issuesPod networking failsCheck CNI podsRestart CNI, fix config
MTU mismatchIntermittent failuresCheck MTU settingsAlign MTU values

Familiarize yourself with the default ports required for Kubernetes components to communicate securely.

PortProtocolComponentPurpose
6443TCPAPI ServerKubernetes API
10250TCPkubeletkubelet API
10259TCPkube-schedulerScheduler metrics
10257TCPkube-controller-managerController metrics
2379-2380TCPetcdClient and peer
30000-32767TCPNodePortService NodePorts

When you have exhausted your troubleshooting options and need to perform deep maintenance on a node, you must follow safe recovery procedures. A chaotic recovery approach causes more downtime than the initial failure.

┌──────────────────────────────────────────────────────────────┐
│ NODE RECOVERY DECISION TREE │
│ │
│ Node NotReady? │
│ │ │
│ ├── Can SSH to node? │
│ │ │ │
│ │ ├── YES → Check kubelet, runtime, network │
│ │ │ │
│ │ └── NO → Check physical/VM, cloud console │
│ │ │
│ ├── kubelet running? │
│ │ │ │
│ │ ├── YES → Check logs, certs, API connectivity │
│ │ │ │
│ │ └── NO → Start kubelet │
│ │ │
│ └── Still NotReady after fixes? │
│ │ │
│ └── Drain and rejoin node │
│ │
└──────────────────────────────────────────────────────────────┘
flowchart TD
A{Node NotReady?} -->|Yes| B{Can SSH to node?}
B -->|YES| C{kubelet running?}
B -->|NO| D[Check physical/VM, cloud console]
C -->|YES| E[Check logs, certs, API connectivity]
C -->|NO| F[Start kubelet]
E --> G{Still NotReady after fixes?}
F --> G
G -->|Yes| H[Drain and rejoin node]

Before rebooting a node or ripping out its configuration, you must safely remove the workloads it is hosting. The cordon command marks the node as unschedulable, and the drain command safely evicts all running pods (respecting PodDisruptionBudgets and graceful termination periods).

Terminal window
# Drain node (evicts pods safely)
k drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Cordon only (prevent new pods)
k cordon <node-name>
# Uncordon (allow scheduling again)
k uncordon <node-name>

If the node’s local configuration is entirely corrupted (e.g., messed up certificates or network configurations), the fastest path to recovery is often to wipe the node’s Kubernetes state and rejoin it to the cluster from scratch.

Terminal window
# On the worker node
sudo kubeadm reset
# On control plane - generate new join token
kubeadm token create --print-join-command
# On worker - rejoin
sudo kubeadm join <api-server>:6443 --token <token> --discovery-token-ca-cert-hash <hash>

If a node has suffered catastrophic hardware failure and will never return, you must clean it out of the API server to prevent the cluster from waiting for it forever.

Terminal window
# Drain first
k drain <node> --ignore-daemonsets --delete-emptydir-data
# Delete node from cluster
k delete node <node-name>
# On the node itself
sudo kubeadm reset

When troubleshooting worker nodes, panic often leads to rushed commands that exacerbate the problem. Avoid these common pitfalls:

MistakeWhyFix
Not checking kubelet firstMiss obvious issueAlways start with systemctl status kubelet
Ignoring node conditionsMiss resource pressureCheck all conditions, not just Ready
Deleting node before drainPod disruptionAlways drain before delete
Forgetting DaemonSet podsDrain failsUse --ignore-daemonsets
Not checking runtimeBlame kubeletCheck containerd status too
Ignoring disk usageNode degradationMonitor disk, clean regularly
Restarting without reloadChanges to systemd drop-ins do not take effectAlways run systemctl daemon-reload before restart
Skipping CNI checksAssume node is broken when only pod networking is downVerify CNI binary paths and configurations

You receive an alert that a production worker node has transitioned to the Unknown state. The node was perfectly healthy a minute ago. Based on the Kubernetes node controller architecture, how many consecutive seconds of missed heartbeats does this state represent, and what happens next?

Answer

It represents 40 seconds of missed heartbeats. The kubelet sends heartbeats to the API server every 10 seconds. After 4 missed heartbeats (40s), the node-controller marks the node as Unknown. After 5 minutes (by default) in this state, pods on that node are forcefully scheduled for eviction to ensure application availability.

You discover a critical control plane component is failing on a master node, but when you run crictl ps, you see the container still running. You then check systemctl status kubelet and see it has crashed. What is the key difference between how the kubelet and control plane components run that explains this?

Answer

The kubelet runs as a systemd service directly on the host OS, while control plane components (API server, scheduler, etc.) run as static pods managed by the kubelet. If the kubelet crashes, the container runtime (containerd) keeps the existing static pods running independently. However, the kubelet is no longer around to report their status or apply updates, meaning Kubernetes loses management visibility over them.

A node in your cluster is marked with MemoryPressure=True. A developer complains that their newly deployed pod is stuck in the Pending state. How does the node’s condition directly affect the scheduler’s behavior regarding new workloads?

Answer

New pods will not be scheduled to this node. The kube-scheduler actively filters out nodes that have resource pressure conditions set to True during its predicate evaluation phase. Additionally, the kubelet on that node will begin actively evicting existing pods to free up memory, starting with BestEffort pods, to prevent the entire node from freezing.

During a severe API server outage, you need to inspect the logs of a failing ingress controller pod on a worker node. kubectl commands are timing out globally. What is the most effective approach to retrieve these logs?

Answer

You should use crictl directly on the worker node. Use crictl ps to find the container ID, and crictl logs <container-id> to view the output. Because crictl communicates directly with the container runtime (containerd) over the local Unix socket, it bypasses the Kubernetes API entirely, allowing you to debug even when the control plane is unreachable.

You need to perform emergency kernel patching on a worker node. You execute kubectl cordon <node-name>. Five minutes later, you notice that all the original pods are still running on the node, blocking your maintenance window. What operational misunderstanding caused this delay?

Answer

The cordon command only marks the node as unschedulable (preventing new pods from arriving); it does not stop or move existing workloads. To properly clear a node for maintenance, you must use the drain command. Draining will cordon the node AND safely evict all existing pods (except DaemonSets, which you ignore with a flag) while respecting PodDisruptionBudgets.

A worker node’s kubelet fails to start, throwing errors in journalctl about missing files in /run/containerd/. You suspect the container runtime interface socket is unavailable. Where exactly should you look to verify the socket’s existence?

Answer

You should check for the socket at /run/containerd/containerd.sock. This Unix socket is created and listened on by the containerd systemd service. If it is missing, it typically means containerd has crashed or failed to start, which subsequently prevents the kubelet from initializing because it cannot connect via the CRI.

You notice a node flipping between Ready and NotReady states every few minutes. Upon inspecting the kubelet logs, you see repeated TLS handshake timeouts. The cluster was provisioned exactly one year ago. What is the most highly probable root cause of this flapping behavior?

Answer

The most highly probable root cause is an expired kubelet client certificate. By default, kubeadm-provisioned clusters issue kubelet client certificates with a one-year validity period. When the certificate expires, the kubelet can no longer authenticate with the API server to send heartbeats, causing total communication failure until the certificate is rotated or the node is rejoined.


Hands-On Exercise: Node Troubleshooting Simulation

Section titled “Hands-On Exercise: Node Troubleshooting Simulation”

You are the on-call engineer. Monitoring has alerted you that a critical worker node is experiencing intermittent instability and resource spikes. You need to log into the environment, systematically diagnose the health of the node, inspect its core services, evaluate its resources, and safely prepare it for maintenance.

  • Access to a Kubernetes cluster
  • SSH access to at least one worker node

Begin by evaluating the cluster-wide state from the perspective of the control plane. Identify the node you want to investigate.

Solution
Terminal window
# Check all nodes
k get nodes -o wide
# Get detailed node information
k describe node <node-name>
# Check node conditions specifically
k get node <node-name> -o jsonpath='{.status.conditions[*].type}' | tr ' ' '\n'

Assume the node is showing signs of distress. SSH directly into the node and interrogate the primary agent.

Solution
Terminal window
# SSH to a worker node
ssh <node>
# Check kubelet status
sudo systemctl status kubelet
# View recent kubelet logs
sudo journalctl -u kubelet --since "5 minutes ago" | tail -50
# Check kubelet configuration
cat /var/lib/kubelet/config.yaml | head -30

The kubelet relies entirely on the container runtime. Verify that containerd is healthy and properly managing containers.

Solution
Terminal window
# Check containerd status
sudo systemctl status containerd
# List running containers
sudo crictl ps
# Check container runtime info
sudo crictl info
# List images on node
sudo crictl images

The node is healthy at the service level, but it might be starving for resources. Check the physical resource consumption.

Solution
Terminal window
# Check memory
free -m
# Check disk
df -h
# Check what's using resources
k top node <node-name>
# See allocated resources
k describe node <node-name> | grep -A 10 "Allocated resources"

You have decided the node needs a reboot to clear a suspected memory leak. Safely cordon the node and verify that the scheduler respects your command.

Solution
Terminal window
# Cordon a node (prevents new scheduling)
k cordon <node-name>
# Verify it's unschedulable
k get node <node-name>
# Try to schedule a pod
k run test-pod --image=nginx
k get pods test-pod -o wide # Should NOT be on cordoned node
# Uncordon
k uncordon <node-name>
# Cleanup
k delete pod test-pod
  • Checked node conditions for all nodes using jsonpath.
  • Verified kubelet is running and inspected the systemd logs.
  • Verified containerd is running and used crictl to list images.
  • Assessed node resource usage at both the OS and cluster levels.
  • Successfully cordoned a node, tested scheduler avoidance, and uncordoned it.

Ensure the node is uncordoned and the test pod is deleted after completing the exercise.


Develop muscle memory for node troubleshooting by executing these rapid-fire drills.

Terminal window
# Task: List all nodes with their status
k get nodes
Terminal window
# Task: Check all conditions for a specific node
k describe node <node> | grep -A 10 Conditions
Terminal window
# Task: Check if kubelet is running (on node)
sudo systemctl status kubelet
Terminal window
# Task: View last 20 lines of kubelet logs
sudo journalctl -u kubelet -n 20

Drill 5: Container Runtime Status (30 sec)

Section titled “Drill 5: Container Runtime Status (30 sec)”
Terminal window
# Task: Check containerd and list containers
sudo systemctl status containerd
sudo crictl ps
Terminal window
# Task: Check node resource usage
k top nodes
k describe node <node> | grep -A 5 "Allocated resources"
Terminal window
# Task: Safely drain a node
k drain <node> --ignore-daemonsets --delete-emptydir-data
Terminal window
# Task: Check disk usage on node
df -h
du -sh /var/lib/containerd/

Continue to Module 5.5: Network Troubleshooting to learn how to diagnose and fix pod-to-pod, pod-to-service, and external connectivity issues that plague distributed systems.