
Module 5.1: Troubleshooting Methodology

Hands-On Lab Available: Kubernetes cluster (intermediate, ~30 min), opens in Killercoda.

Complexity: [MEDIUM] - Foundation for all troubleshooting

Time to Complete: 40-50 minutes

Prerequisites: Parts 1-4 completed (cluster architecture, workloads, networking, storage)


After this module, you will be able to:

  • Apply a systematic troubleshooting methodology (symptoms → hypotheses → verify → fix → validate)
  • Triage CKA troubleshooting questions by identifying the failure layer (application, service, node, control plane)
  • Use kubectl commands (describe, logs, events, get -o yaml) in the correct diagnostic order
  • Avoid the #1 troubleshooting mistake: making changes before understanding the problem

Troubleshooting is 30% of the CKA exam - the largest single domain. More importantly, troubleshooting is what separates Kubernetes operators from Kubernetes experts. When a production cluster is down at 3 AM, systematic debugging is the difference between a 5-minute fix and a 5-hour nightmare.

The Doctor Analogy

A good doctor doesn’t just guess treatments - they follow a diagnostic process. Symptoms → examination → tests → diagnosis → treatment. Kubernetes troubleshooting works the same way. Random “fixes” might work occasionally, but systematic investigation works every time.



  • 80% of issues are in 5 places: Pod spec errors, image pull problems, resource constraints, network policies, and misconfigured services
  • Events expire: Kubernetes events are only kept for 1 hour by default - if you don’t check soon, evidence disappears
  • describe > logs: Most beginners jump straight to logs. Experienced troubleshooters check describe first - the Events section often reveals the problem immediately
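The describe-first ordering can be drilled with a small helper. This is a sketch only: the function below just prints the diagnostic commands in the recommended order so you can review them before pasting - the pod and namespace arguments are placeholders, and nothing touches the cluster.

```shell
#!/bin/sh
# Dry-run triage helper: prints the diagnostic commands for a pod in the
# recommended order (describe first, logs second). Practice sketch only.
triage_plan() {
  pod="$1"
  ns="${2:-default}"
  echo "kubectl -n $ns describe pod $pod        # 1. Events section first"
  echo "kubectl -n $ns logs $pod                # 2. Current container logs"
  echo "kubectl -n $ns logs $pod --previous     # 3. Logs from the last crash"
  echo "kubectl -n $ns get pod $pod -o yaml     # 4. Full spec if still stuck"
}

triage_plan my-pod my-namespace
```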

Stop and think: What do you do when something breaks? Do you immediately start changing things? Most people do. But random changes make things worse - you lose the ability to tell what fixed it (or what broke it further). The framework below forces you to understand BEFORE you act. It feels slower at first, but it’s faster in the long run because you fix the right thing the first time.

Every troubleshooting session should follow this pattern:

┌──────────────────────────────────────────────────────────────┐
│ TROUBLESHOOTING FRAMEWORK │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ 1. IDENTIFY │────▶│ 2. ISOLATE │────▶│ 3. DIAGNOSE │ │
│ │ What's │ │ Where's │ │ Why's │ │
│ │ wrong? │ │ it wrong? │ │ it wrong? │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 4. FIX │ │
│ │ Apply │ │
│ │ solution │ │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘

Start with the symptom. Be specific:

| Vague | Specific |
| --- | --- |
| "App is broken" | "Pod is in CrashLoopBackOff" |
| "Network doesn't work" | "Pod can't reach external DNS" |
| "Cluster is slow" | "API server response time > 5s" |

Initial triage commands:

```shell
# Cluster-wide health check
k get nodes
k get pods -A | grep -v Running
k get events -A --sort-by='.lastTimestamp' | tail -20

# Component health
k get componentstatuses   # Deprecated but still useful
k -n kube-system get pods
```

Narrow down the scope systematically:

┌──────────────────────────────────────────────────────────────┐
│ ISOLATION LAYERS │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CLUSTER │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ NODE │ │ │
│ │ │ ┌─────────────────────────────────────┐ │ │ │
│ │ │ │ POD │ │ │ │
│ │ │ │ ┌─────────────────────────────┐ │ │ │ │
│ │ │ │ │ CONTAINER │ │ │ │ │
│ │ │ │ │ ┌─────────────────────┐ │ │ │ │ │
│ │ │ │ │ │ APPLICATION │ │ │ │ │ │
│ │ │ │ │ └─────────────────────┘ │ │ │ │ │
│ │ │ │ └─────────────────────────────┘ │ │ │ │
│ │ │ └─────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Start wide, drill down until you find the problem layer │
└──────────────────────────────────────────────────────────────┘

Isolation questions:

  • Is it all pods or specific pods?
  • Is it all nodes or specific nodes?
  • Is it all namespaces or specific namespaces?
  • Did it ever work? What changed?

Once you’ve isolated the layer, gather detailed information:

```shell
# Pod-level diagnosis
k describe pod <pod-name>      # Events section is gold
k logs <pod-name>              # Current container logs
k logs <pod-name> --previous   # Previous container (if crashed)

# Node-level diagnosis
k describe node <node-name>
ssh <node> journalctl -u kubelet

# Cluster-level diagnosis
k -n kube-system logs <component-pod>
```

Only after diagnosis do you fix:

```shell
# Apply the fix
k edit <resource>         # Direct edit
k apply -f <fixed-yaml>   # Apply corrected spec
k delete pod <pod>        # Force restart

# Verify the fix
k get pods -w   # Watch for status change
k logs <pod>    # Check new logs
```

Understanding what each component does helps you know where to look:

┌──────────────────────────────────────────────────────────────┐
│ COMPONENT FAILURE MAP │
│ │
│ SYMPTOM CHECK THESE COMPONENTS │
│ ─────────────────────────────────────────────────────────────│
│ │
│ Pods not scheduling → kube-scheduler │
│ Pods stuck Pending → scheduler, node resources │
│ Pods stuck ContainerCreating → kubelet, image pull, volumes │
│ Pods CrashLoopBackOff → container, app config │
│ Pods can't communicate → CNI, network policies │
│ Services not working → kube-proxy, endpoints │
│ kubectl times out → API server, etcd │
│ Node NotReady → kubelet, container runtime │
│ Persistent volume issues → CSI driver, storage class │
│ │
└──────────────────────────────────────────────────────────────┘

Control plane components:

| Component | What It Does | Failure Symptoms |
| --- | --- | --- |
| kube-apiserver | All API operations | kubectl fails, nothing works |
| etcd | State storage | Data loss, inconsistent state |
| kube-scheduler | Pod placement | Pods stuck Pending |
| kube-controller-manager | Reconciliation loops | Resources not updating |

Node components:

| Component | What It Does | Failure Symptoms |
| --- | --- | --- |
| kubelet | Pod lifecycle | Pods not starting, node NotReady |
| kube-proxy | Service networking | Services not reachable |
| Container runtime | Container execution | ContainerCreating stuck |
| CNI plugin | Pod networking | Pods can’t communicate |

Part 3: Essential Troubleshooting Commands


Memorize these - you’ll use them constantly:

```shell
# Status overview
k get pods                                # Pod status
k get pods -o wide                        # Plus node and IP
k get events --sort-by='.lastTimestamp'   # Recent events

# Deep inspection
k describe pod <pod>          # Full details + events
k logs <pod>                  # Container stdout/stderr
k logs <pod> -c <container>   # Specific container
k logs <pod> --previous       # Previous container instance

# Interactive debugging
k exec -it <pod> -- sh                  # Shell into container
k exec <pod> -- cat /etc/resolv.conf    # Run single command

# Resource status
k get <resource> -o yaml     # Full resource spec
k explain <resource.field>   # API documentation
```

```shell
# Find problem pods
k get pods -A | grep -v Running
k get pods -A --field-selector=status.phase!=Running

# Find pods on specific node
k get pods -A --field-selector spec.nodeName=worker-1

# Find pods by label
k get pods -l app=nginx

# Search events for errors
k get events -A | grep -i error
k get events -A | grep -i fail
```

```shell
# Node resources
k top nodes
k describe node <node> | grep -A 5 "Allocated resources"

# Pod resources
k top pods
k top pods --containers

# Check resource requests/limits
k get pods -o custom-columns=\
'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```

```shell
# DNS resolution
k exec <pod> -- nslookup kubernetes
k exec <pod> -- cat /etc/resolv.conf

# Connectivity
k exec <pod> -- wget -qO- http://service-name
k exec <pod> -- nc -zv service-name 80

# Service endpoints
k get endpoints <service>
k get endpointslices -l kubernetes.io/service-name=<service>
```
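These service checks form a fixed chain (Service → Endpoints → backing Pods). Below is a dry-run sketch that prints the chain for a given service name - `web` is just an example name, and the selector in the last step has to come from the first command's real output; nothing here touches the cluster.

```shell
# Dry-run sketch: print the service-debug chain in order. Empty endpoints
# almost always mean the Service selector matches no Ready pod's labels.
svc_debug_plan() {
  svc="$1"
  echo "kubectl get svc $svc                        # 1. Exists? Right port?"
  echo "kubectl get endpoints $svc                  # 2. Empty = selector mismatch"
  echo "kubectl get endpointslices -l kubernetes.io/service-name=$svc"
  echo "kubectl get pods -l <selector-from-step-1>  # 3. Matching pods Ready?"
}

svc_debug_plan web
```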

Pause and predict: A pod shows Running but your application isn’t working. Is the pod healthy? Not necessarily — Running means at least one container started, but it doesn’t mean the application is serving traffic. The pod could be in a crash loop (restarting), failing readiness probes (excluded from service), or running but returning errors. Status ≠ health. This distinction trips up even experienced engineers.

┌──────────────────────────────────────────────────────────────┐
│ POD PHASES │
│ │
│ Pending ──────▶ Running ──────▶ Succeeded │
│ │ │ │
│ │ ▼ │
│ │ Failed │
│ │ │ │
│ ▼ ▼ │
│ [Problem] [Problem] │
│ │
│ Pending: Waiting for scheduling or image pull │
│ Running: At least one container running │
│ Succeeded: All containers exited 0 (completed) │
│ Failed: At least one container exited non-zero │
│ Unknown: Node communication lost │
└──────────────────────────────────────────────────────────────┘

| Status | Meaning | First Check |
| --- | --- | --- |
| Pending | Not scheduled yet | k describe pod - Events section |
| ContainerCreating | Image pull or volume mount | k describe pod - Events section |
| Running | Container(s) running | k logs for app issues |
| CrashLoopBackOff | Container crashing repeatedly | k logs --previous |
| ImagePullBackOff | Can’t pull image | Image name, registry auth |
| ErrImagePull | Image pull failed | Same as above |
| CreateContainerConfigError | Config issue | ConfigMap/Secret missing |
| OOMKilled | Out of memory | Memory limits vs actual usage |
| Evicted | Node pressure | Node resources, pod priority |
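This status-to-first-check mapping is worth drilling until it's reflex. The sketch below encodes it as a shell lookup for self-quizzing - the answers simply mirror the table above, not any kubectl output.

```shell
# Lookup: pod status -> first diagnostic step, mirroring the table above.
first_check() {
  case "$1" in
    Pending|ContainerCreating)     echo "k describe pod - Events section" ;;
    CrashLoopBackOff)              echo "k logs --previous" ;;
    ImagePullBackOff|ErrImagePull) echo "check image name and registry auth" ;;
    CreateContainerConfigError)    echo "check referenced ConfigMap/Secret" ;;
    OOMKilled)                     echo "check memory limits vs usage" ;;
    Evicted)                       echo "check node resources and pod priority" ;;
    *)                             echo "k describe pod" ;;
  esac
}

first_check CrashLoopBackOff   # prints: k logs --previous
```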

The most common troubleshooting scenario:

```shell
# Step 1: Check events
k describe pod <pod> | grep -A 20 Events

# Step 2: Check previous logs
k logs <pod> --previous

# Step 3: Check container exit code
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Common exit codes:
#   0   - Success (can still CrashLoop: with restartPolicy Always, a
#         container that keeps completing is restarted and backed off)
#   1   - Application error
#   137 - SIGKILL (OOMKilled or killed by system)
#   139 - SIGSEGV (segmentation fault)
#   143 - SIGTERM (graceful termination)
```
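The signal-related codes follow the POSIX 128 + signal-number convention, and you can reproduce them in any local shell, no cluster required:

```shell
# A shell killed by signal N exits with status 128+N.
sh -c 'exit 1'        || echo "app error: $?"   # app error: 1
sh -c 'kill -KILL $$' || echo "SIGKILL:   $?"   # SIGKILL:   137 (128 + 9)
sh -c 'kill -TERM $$' || echo "SIGTERM:   $?"   # SIGTERM:   143 (128 + 15)
```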

```shell
k describe pod <pod-name>
```
┌──────────────────────────────────────────────────────────────┐
│ DESCRIBE OUTPUT SECTIONS │
│ │
│ Section What to Look For │
│ ─────────────────────────────────────────────────────────────│
│ │
│ Status: Current phase (Pending/Running/etc) │
│ │
│ Containers: State, Ready, Restart Count │
│ Last State (for crash info) │
│ Image (verify it's correct) │
│ │
│ Conditions: Ready, ContainersReady, PodScheduled │
│ False = problem │
│ │
│ Volumes: ConfigMaps, Secrets, PVCs │
│ Missing = pod won't start │
│ │
│ Events: ⭐ THE MOST IMPORTANT SECTION │
│ Shows what's happening/happened │
│ Errors appear here first │
│ │
└──────────────────────────────────────────────────────────────┘

```shell
k describe node <node-name>
```

| Section | What to Look For |
| --- | --- |
| Conditions | Ready=True, MemoryPressure=False, DiskPressure=False |
| Capacity | Total CPU, memory, pods |
| Allocatable | Available for pods (after system reservation) |
| Allocated resources | Current usage and requests |
| Events | Evictions, pressure conditions |

┌──────────────────────────────────────────────────────────────┐
│ THREE-PASS TROUBLESHOOTING STRATEGY │
│ │
│ PASS 1: Quick Fixes (1-3 min) │
│ • Obvious typos in YAML │
│ • Wrong image name/tag │
│ • Missing namespace in command │
│ • Label selector mismatch │
│ │
│ PASS 2: Standard Issues (4-6 min) │
│ • Missing ConfigMap/Secret │
│ • Resource constraints │
│ • Service selector mismatch │
│ • Network policy blocking traffic │
│ │
│ PASS 3: Complex Issues (7+ min) │
│ • Control plane component failures │
│ • Node-level issues │
│ • CNI/networking problems │
│ • Storage/CSI issues │
│ │
└──────────────────────────────────────────────────────────────┘

For a 2-hour exam with troubleshooting worth 30%:

  • ~36 minutes for troubleshooting questions
  • Probably 3-4 troubleshooting scenarios
  • ~9-12 minutes per scenario maximum

Golden rule: If you can’t identify the problem in 3 minutes of investigation, flag it and move on.
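The numbers above are simple arithmetic on the figures quoted earlier (a 120-minute exam, 30% weight); this sketch just makes the calculation explicit:

```shell
# Back-of-envelope time budget for the troubleshooting domain.
exam_minutes=120
weight_pct=30
budget=$((exam_minutes * weight_pct / 100))        # 36 minutes
echo "domain budget:       ${budget} min"
echo "per scenario (of 4): $((budget / 4)) min"    # 9
echo "per scenario (of 3): $((budget / 3)) min"    # 12
```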

| Scenario | Likely Issue | Quick Check |
| --- | --- | --- |
| Pod not starting | Image, ConfigMap/Secret | k describe pod |
| Service not accessible | Selector, endpoints | k get endpoints |
| Node NotReady | kubelet, runtime | ssh node; systemctl status kubelet |
| DNS not working | CoreDNS pods | k -n kube-system get pods -l k8s-app=kube-dns |
| Persistent volume pending | StorageClass, PV | k describe pvc |

| Mistake | Problem | Solution |
| --- | --- | --- |
| Jumping to logs first | Miss scheduling/config issues | Always describe before logs |
| Not checking events | Miss critical error messages | Check events immediately |
| Fixing without diagnosis | Might not fix real issue | Always identify root cause |
| Forgetting --previous | Can’t see why container crashed | Use for CrashLoopBackOff |
| Ignoring exit codes | Miss OOM vs app error | Check exit code for cause |
| Not checking all containers | Multi-container pods | Use -c <container> flag |

You’ve just deployed a new release of your backend API, but the deployment is failing to progress. When you check the pods, you see they are in a CrashLoopBackOff state. What is the very first kubectl command you should run to begin your investigation, and why?

Answer

You should run kubectl describe pod <pod-name>. While it’s tempting to immediately jump to checking the logs, the describe command provides the critical Events section at the bottom of its output. These events will often tell you immediately if the issue is a failure to pull an image, a missing ConfigMap, or a failing liveness probe. Only after checking the events and confirming it’s an application-level crash should you move on to checking kubectl logs <pod-name> --previous to view the actual crash logs.

A developer reports that their batch job failed mysteriously over the weekend. When you check the cluster on Monday morning, the pod is gone and you can’t find any obvious errors. Why might you struggle to find the root cause using standard Kubernetes event logs?

Answer

You will struggle because Kubernetes events are only retained for 1 hour by default in etcd. By the time you check on Monday, the cluster will have already garbage-collected the events related to the weekend failure. Furthermore, the Events section in the describe output truncates, meaning a flood of newer events can quickly push out the older, relevant ones even within that one-hour window. This is why it is critical to capture cluster state and events immediately when an issue occurs, or rely on external logging and monitoring systems for historical data.

Your data processing pod suddenly stops working. When you inspect the pod status, you see that the container terminated with an exit code of 137. What does this specific exit code tell you about how the container died, and where should you look next?

Answer

An exit code of 137 indicates that the container was terminated forcefully with a SIGKILL signal (128 + 9). In a Kubernetes environment, this almost always means the container was OOMKilled (Out Of Memory) because it tried to consume more memory than its configured limits allowed. Alternatively, it could mean the node itself was under memory pressure and the kubelet killed the pod to protect system stability. You should immediately run kubectl describe pod <pod-name> to check the Last State section for the exact reason (like OOMKilled), and then review the pod’s resource limits compared to its actual usage.

You’ve applied a new Deployment, but the pods never seem to reach the Running state. Some pods show a Pending status, while others are stuck in ContainerCreating. How do these two states differ in terms of where the failure is occurring in the pod lifecycle?

Answer

The difference lies in whether the Kubernetes scheduler has successfully assigned the pod to a node. A Pending status means the pod has not yet been scheduled; this typically points to cluster-level issues like a lack of available resources (CPU/Memory), untolerated taints, or a misconfigured node selector. Conversely, ContainerCreating means the scheduler has assigned the pod to a node, but the node’s kubelet is struggling to start the container. This usually points to node-level or dependency issues, such as failing to pull the container image, inability to mount a required PersistentVolume, or Secret/ConfigMap resolution failures. Check describe Events to see which step is stuck.

You have a pod running a web application alongside a logging sidecar container. The web application is returning 500 errors, but when you run kubectl logs <pod-name>, you only see the sidecar’s output, which looks completely healthy. How do you retrieve the logs for the failing web application container?

Answer

By default, when you run kubectl logs against a multi-container pod without specifying a container, Kubernetes will either output the logs of the first container defined in the pod spec, or return an error prompting you to choose one. To view the logs of the specific web application container, you must use the -c flag followed by the container’s name. If you aren’t sure of the container’s exact name, you can use a jsonpath query to list all containers within that pod before checking the logs.

```shell
k logs <pod-name> -c <container-name>

# List containers in pod
k get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'
```

During a routine cluster check, you notice that one of your worker nodes is marked as NotReady. All the pods on that node are beginning to transition to an Unknown state. Walk through the systematic steps you would take to diagnose why this node has dropped out of the cluster.

Answer

When a node becomes NotReady, it means the control plane has lost communication with the node’s kubelet. Your first step should be to run kubectl describe node <node> from the control plane to check the Conditions section for issues like memory or disk pressure that might have preceded the disconnect. If the cluster-level info isn’t conclusive, you must shift to node-level debugging.

  1. k describe node <node> - Check Conditions section
  2. SSH to node if accessible
  3. systemctl status kubelet - Is kubelet running?
  4. journalctl -u kubelet -f - Check kubelet logs
  5. systemctl status containerd (or docker) - Is runtime running?
  6. Check network connectivity to control plane

Hands-On Exercise: Systematic Troubleshooting Practice


You’ll create several broken resources and practice systematic troubleshooting.

```shell
# Create namespace
k create ns troubleshoot-lab

# Create a "broken" deployment - see if you can spot all issues
cat <<'EOF' | k apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
  namespace: troubleshoot-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: app
        image: nginx:latestt
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "64Mi"
            cpu: "500m"
        volumeMounts:
        - name: config
          mountPath: /etc/nginx/conf.d
      volumes:
      - name: config
        configMap:
          name: nginx-config
EOF
```

Apply the troubleshooting methodology:

1. Identify - What’s wrong?

```shell
k get pods -n troubleshoot-lab
# What status do you see?
```

2. Isolate - Where’s it wrong?

```shell
k describe pod -n troubleshoot-lab -l app=broken-app
# Look at the Events section
```

3. Diagnose - Why’s it wrong? Find all issues (there are at least 2):

  • Issue 1: _______________
  • Issue 2: _______________

4. Fix - Apply solutions

```shell
# Fix issue 1: Image typo
k set image deployment/broken-app -n troubleshoot-lab app=nginx:latest

# Fix issue 2: Missing ConfigMap
k create configmap nginx-config -n troubleshoot-lab --from-literal=placeholder=true

# Verify
k get pods -n troubleshoot-lab -w
```

Create more broken scenarios:

```shell
# Scenario 2: CrashLoopBackOff
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: crash-pod
  namespace: troubleshoot-lab
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'exit 1']
EOF

# Scenario 3: Pending pod
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pending-pod
  namespace: troubleshoot-lab
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "100Gi"
        cpu: "100"
EOF
```

Troubleshoot each one systematically.

  • Identified image typo in deployment
  • Identified missing ConfigMap
  • Fixed deployment to Running state
  • Explained why crash-pod is in CrashLoopBackOff
  • Explained why pending-pod stays Pending

```shell
k delete ns troubleshoot-lab
```

Practice these scenarios until they’re automatic:

```shell
# Task: Find all non-running pods across all namespaces
k get pods -A | grep -v Running
# Or: k get pods -A --field-selector=status.phase!=Running
```

```shell
# Task: Show last 10 events sorted by time
k get events -A --sort-by='.lastTimestamp' | tail -10
```

```shell
# Task: Full investigation of CrashLoopBackOff pod
k describe pod <pod>      # Step 1: Events
k logs <pod> --previous   # Step 2: Crash logs
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'   # Step 3: Exit code
```

```shell
# Task: Check node health and resources
k get nodes
k describe node <node> | grep -A 5 Conditions
k top nodes
```

```shell
# Task: Verify service has endpoints
k get svc <service>
k get endpoints <service>
k get pods -l <service-selector>
```

```shell
# Task: Verify DNS working in cluster
k run dnstest --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes
```

```shell
# Task: Get shell in running container
k exec -it <pod> -- /bin/sh
# If sh not available: k exec -it <pod> -- /bin/bash
```

```shell
# Task: View logs from specific container and follow
k logs <pod> -c <container> -f
# List all containers: k get pod <pod> -o jsonpath='{.spec.containers[*].name}'
```

Continue to Module 5.2: Application Failures to learn how to troubleshoot pods, deployments, and application-level issues.