Module 5.1: Troubleshooting Methodology
Complexity: [MEDIUM] - Foundation for all troubleshooting
Time to Complete: 40-50 minutes
Prerequisites: Parts 1-4 completed (cluster architecture, workloads, networking, storage)
What You’ll Be Able to Do
After this module, you will be able to:
- Apply a systematic troubleshooting methodology (symptoms → hypotheses → verify → fix → validate)
- Triage CKA troubleshooting questions by identifying the failure layer (application, service, node, control plane)
- Use kubectl commands (describe, logs, events, get -o yaml) in the correct diagnostic order
- Avoid the #1 troubleshooting mistake: making changes before understanding the problem
Why This Module Matters
Troubleshooting is 30% of the CKA exam - the largest single domain. More importantly, troubleshooting is what separates Kubernetes operators from Kubernetes experts. When a production cluster is down at 3 AM, systematic debugging is the difference between a 5-minute fix and a 5-hour nightmare.
The Doctor Analogy
A good doctor doesn’t just guess treatments - they follow a diagnostic process. Symptoms → examination → tests → diagnosis → treatment. Kubernetes troubleshooting works the same way. Random “fixes” might work occasionally, but systematic investigation works every time.
What You’ll Learn
By the end of this module, you'll be able to:
- Apply a systematic troubleshooting methodology
- Quickly identify which component is failing
- Use kubectl commands for rapid diagnosis
- Understand where to look for different problem types
- Triage problems using the three-pass strategy
Did You Know?
- 80% of issues are in 5 places: Pod spec errors, image pull problems, resource constraints, network policies, and misconfigured services
- Events expire: Kubernetes events are only kept for 1 hour by default - if you don’t check soon, evidence disappears
- describe > logs: Most beginners jump straight to logs. Experienced troubleshooters check describe first - the Events section often reveals the problem immediately
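The one-hour window is worth internalizing with numbers. Below is a minimal sketch, assuming GNU `date` and using made-up timestamps rather than real cluster output, that computes an event's age from its `lastTimestamp` and compares it against the default 1-hour TTL:

```shell
# Hypothetical lastTimestamp from an event, and a frozen "now" for
# illustration (in practice: now_ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)).
event_ts="2024-01-01T10:00:00Z"
now_ts="2024-01-01T11:30:00Z"

age=$(( $(date -u -d "$now_ts" +%s) - $(date -u -d "$event_ts" +%s) ))
echo "event age: ${age}s"

# The API server's default --event-ttl is 1h (3600s).
if [ "$age" -ge 3600 ]; then
  echo "likely expired - rely on captured output or external logging"
else
  echo "still inside the default retention window"
fi
```

The practical takeaway: when something breaks, capture `k get events` output to a file immediately, because you may not get a second chance an hour later.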
Part 1: The Troubleshooting Framework
Stop and think: What do you do when something breaks? Do you immediately start changing things? Most people do. But random changes make things worse — you lose the ability to tell what fixed it (or what broke it more). The framework below forces you to understand BEFORE you act. It feels slower at first, but it's faster in the long run because you fix the right thing the first time.
1.1 The Four-Step Process
Every troubleshooting session should follow this pattern:
```
TROUBLESHOOTING FRAMEWORK

1. IDENTIFY ──▶ 2. ISOLATE ──▶ 3. DIAGNOSE ──▶ 4. FIX
   What's         Where's         Why's           Apply
   wrong?         it wrong?       it wrong?       solution
```

1.2 Step 1: Identify - What's Wrong?
Start with the symptom. Be specific:
| Vague | Specific |
|---|---|
| "App is broken" | "Pod is in CrashLoopBackOff" |
| "Network doesn't work" | "Pod can't reach external DNS" |
| "Cluster is slow" | "API server response time > 5s" |
Initial triage commands:
```
# Cluster-wide health check
k get nodes
k get pods -A | grep -v Running
k get events -A --sort-by='.lastTimestamp' | tail -20

# Component health
k get componentstatuses    # Deprecated but still useful
k -n kube-system get pods
```

1.3 Step 2: Isolate - Where's It Wrong?
Narrow down the scope systematically:
```
ISOLATION LAYERS

CLUSTER
 └─ NODE
     └─ POD
         └─ CONTAINER
             └─ APPLICATION

Start wide, drill down until you find the problem layer
```

Isolation questions:
- Is it all pods or specific pods?
- Is it all nodes or specific nodes?
- Is it all namespaces or specific namespaces?
- Did it ever work? What changed?
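These questions can often be answered from a single capture of `k get pods -A -o wide`. The sketch below runs on a made-up sample of that output (the pod and node names are illustrative): it counts non-Running pods per node, so failures clustered on one node point at the node layer rather than the workload.

```shell
# Hypothetical capture of: k get pods -A -o wide
cat <<'EOF' > /tmp/pods.txt
NAMESPACE  NAME     READY  STATUS             RESTARTS  AGE  IP         NODE
web        front-1  1/1    Running            0         5d   10.0.1.4   worker-1
web        front-2  0/1    CrashLoopBackOff   12        5d   10.0.2.7   worker-2
db         pg-0     0/1    ContainerCreating  0         1h   <none>     worker-2
batch      job-abc  1/1    Running            0         2h   10.0.1.9   worker-1
EOF

# Count non-Running pods per node ($4 = STATUS, $8 = NODE).
awk 'NR > 1 && $4 != "Running" { bad[$8]++ }
     END { for (n in bad) print n, bad[n] }' /tmp/pods.txt
```

Here both failing pods land on worker-2, which shifts suspicion from the pod specs to that node.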
1.4 Step 3: Diagnose - Why’s It Wrong?
Once you've isolated the layer, gather detailed information:
```
# Pod-level diagnosis
k describe pod <pod-name>        # Events section is gold
k logs <pod-name>                # Current container logs
k logs <pod-name> --previous     # Previous container (if crashed)

# Node-level diagnosis
k describe node <node-name>
ssh <node>
journalctl -u kubelet

# Cluster-level diagnosis
k -n kube-system logs <component-pod>
```

1.5 Step 4: Fix - Apply Solution
Only after diagnosis do you fix:
```
# Apply the fix
k edit <resource>          # Direct edit
k apply -f <fixed-yaml>    # Apply corrected spec
k delete pod <pod>         # Force restart

# Verify the fix
k get pods -w              # Watch for status change
k logs <pod>               # Check new logs
```

Part 2: The Kubernetes Component Map
2.1 Know Your Components
Understanding what each component does helps you know where to look:
```
COMPONENT FAILURE MAP

SYMPTOM                          CHECK THESE COMPONENTS
───────────────────────────────────────────────────────────
Pods not scheduling            → kube-scheduler
Pods stuck Pending             → scheduler, node resources
Pods stuck ContainerCreating   → kubelet, image pull, volumes
Pods CrashLoopBackOff          → container, app config
Pods can't communicate         → CNI, network policies
Services not working           → kube-proxy, endpoints
kubectl times out              → API server, etcd
Node NotReady                  → kubelet, container runtime
Persistent volume issues       → CSI driver, storage class
```

2.2 Control Plane Components
| Component | What It Does | Failure Symptoms |
|---|---|---|
| kube-apiserver | All API operations | kubectl fails, nothing works |
| etcd | State storage | Data loss, inconsistent state |
| kube-scheduler | Pod placement | Pods stuck Pending |
| kube-controller-manager | Reconciliation loops | Resources not updating |
2.3 Node Components
| Component | What It Does | Failure Symptoms |
|---|---|---|
| kubelet | Pod lifecycle | Pods not starting, node NotReady |
| kube-proxy | Service networking | Services not reachable |
| Container runtime | Container execution | ContainerCreating stuck |
| CNI plugin | Pod networking | Pods can’t communicate |
Part 3: Essential Troubleshooting Commands
3.1 The Core Commands
Memorize these - you'll use them constantly:
```
# Status overview
k get pods                               # Pod status
k get pods -o wide                       # Plus node and IP
k get events --sort-by='.lastTimestamp'  # Recent events

# Deep inspection
k describe pod <pod>                     # Full details + events
k logs <pod>                             # Container stdout/stderr
k logs <pod> -c <container>              # Specific container
k logs <pod> --previous                  # Previous container instance

# Interactive debugging
k exec -it <pod> -- sh                   # Shell into container
k exec <pod> -- cat /etc/resolv.conf     # Run single command

# Resource status
k get <resource> -o yaml                 # Full resource spec
k explain <resource.field>               # API documentation
```

3.2 Filtering and Searching
```
# Find problem pods
k get pods -A | grep -v Running
k get pods -A --field-selector=status.phase!=Running

# Find pods on specific node
k get pods -A --field-selector spec.nodeName=worker-1

# Find pods by label
k get pods -l app=nginx

# Search events for errors
k get events -A | grep -i error
k get events -A | grep -i fail
```

3.3 Resource Consumption
```
# Node resources
k top nodes
k describe node <node> | grep -A 5 "Allocated resources"

# Pod resources
k top pods
k top pods --containers

# Check resource requests/limits
k get pods -o custom-columns=\
'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```

3.4 Network Debugging
```
# DNS resolution
k exec <pod> -- nslookup kubernetes
k exec <pod> -- cat /etc/resolv.conf

# Connectivity
k exec <pod> -- wget -qO- http://service-name
k exec <pod> -- nc -zv service-name 80

# Service endpoints
k get endpoints <service>
k get endpointslices -l kubernetes.io/service-name=<service>
```

Part 4: Reading Pod Status
Pause and predict: A pod shows Running but your application isn't working. Is the pod healthy? Not necessarily — Running means at least one container started, but it doesn't mean the application is serving traffic. The pod could be in a crash loop (restarting), failing readiness probes (excluded from service), or running but returning errors. Status ≠ health. This distinction trips up even experienced engineers.
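The Running-but-unhealthy case is visible right in the READY column of `k get pods`: a pod can be Running with 0/1 containers ready. A sketch on made-up sample output that flags exactly those pods:

```shell
# Hypothetical capture of: k get pods
cat <<'EOF' > /tmp/status.txt
NAME      READY  STATUS    RESTARTS  AGE
web-1     1/1    Running   0         2d
web-2     0/1    Running   5         2d
worker-1  1/2    Running   0         1d
EOF

# READY is "ready/total"; Running with ready < total means the phase
# looks fine but containers are crashing or failing readiness probes.
awk 'NR > 1 && $3 == "Running" {
       split($2, r, "/")
       if (r[1] + 0 < r[2] + 0) print $1, "is Running but not Ready"
     }' /tmp/status.txt
```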
4.1 Pod Phase Meanings
```
POD PHASES

Pending ──▶ Running ──▶ Succeeded
   │           │
   ▼           ▼
[Problem]   Failed ──▶ [Problem]

Pending:   Waiting for scheduling or image pull
Running:   At least one container running
Succeeded: All containers exited 0 (completed)
Failed:    At least one container exited non-zero
Unknown:   Node communication lost
```

4.2 Common Pod Conditions
| Status | Meaning | First Check |
|---|---|---|
| Pending | Not scheduled yet | k describe pod - Events section |
| ContainerCreating | Image pull or volume mount | k describe pod - Events section |
| Running | Container(s) running | k logs for app issues |
| CrashLoopBackOff | Container crashing repeatedly | k logs --previous |
| ImagePullBackOff | Can’t pull image | Image name, registry auth |
| ErrImagePull | Image pull failed | Same as above |
| CreateContainerConfigError | Config issue | ConfigMap/Secret missing |
| OOMKilled | Out of memory | Increase memory limit |
| Evicted | Node pressure | Node resources, pod priority |
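The table above folds naturally into a small triage helper. A sketch (the `first_check` function name is invented for illustration) that maps a status string to the first thing worth checking:

```shell
# Map a pod status to the first diagnostic step, mirroring the table.
first_check() {
  case "$1" in
    Pending|ContainerCreating)      echo "k describe pod - Events section" ;;
    CrashLoopBackOff)               echo "k logs --previous" ;;
    ImagePullBackOff|ErrImagePull)  echo "check image name and registry auth" ;;
    CreateContainerConfigError)     echo "check referenced ConfigMap/Secret" ;;
    OOMKilled)                      echo "raise the memory limit" ;;
    Evicted)                        echo "check node resources and pod priority" ;;
    *)                              echo "k describe pod, then k logs" ;;
  esac
}

first_check CrashLoopBackOff   # prints: k logs --previous
first_check Pending            # prints: k describe pod - Events section
```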
4.3 Decoding CrashLoopBackOff
The most common troubleshooting scenario:
```
# Step 1: Check events
k describe pod <pod> | grep -A 20 Events

# Step 2: Check previous logs
k logs <pod> --previous

# Step 3: Check container exit code
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Common exit codes:
# 0   - Success (shouldn't cause CrashLoop)
# 1   - Application error
# 137 - SIGKILL (OOMKilled or killed by system)
# 139 - SIGSEGV (segmentation fault)
# 143 - SIGTERM (graceful termination)
```

Part 5: The Describe Output Deep Dive
5.1 Key Sections in describe pod
```
k describe pod <pod-name>
```

```
DESCRIBE OUTPUT SECTIONS

Section        What to Look For
────────────────────────────────────────────────────
Status:        Current phase (Pending/Running/etc)

Containers:    State, Ready, Restart Count
               Last State (for crash info)
               Image (verify it's correct)

Conditions:    Ready, ContainersReady, PodScheduled
               False = problem

Volumes:       ConfigMaps, Secrets, PVCs
               Missing = pod won't start

Events:        ⭐ THE MOST IMPORTANT SECTION
               Shows what's happening/happened
               Errors appear here first
```

5.2 Key Sections in describe node
```
k describe node <node-name>
```

| Section | What to Look For |
|---|---|
| Conditions | Ready=True, MemoryPressure=False, DiskPressure=False |
| Capacity | Total CPU, memory, pods |
| Allocatable | Available for pods (after system reservation) |
| Allocated resources | Current usage and requests |
| Events | Evictions, pressure conditions |
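The same checks can be scripted against a saved `k describe node` output. The sketch below uses a shortened, made-up excerpt of the Conditions section: any pressure condition that is True, or Ready that is False, deserves attention.

```shell
# Hypothetical excerpt of: k describe node <node>
cat <<'EOF' > /tmp/node.txt
Conditions:
  Type             Status
  MemoryPressure   False
  DiskPressure     True
  PIDPressure      False
  Ready            False
EOF

# Flag active pressure conditions and a non-Ready node.
awk '$1 ~ /Pressure/ && $2 == "True" { print $1, "active" }
     $1 == "Ready" && $2 == "False"  { print "node not Ready" }' /tmp/node.txt
```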
Part 6: Exam Troubleshooting Strategy
6.1 Three-Pass Applied to Troubleshooting
```
THREE-PASS TROUBLESHOOTING STRATEGY

PASS 1: Quick Fixes (1-3 min)
  • Obvious typos in YAML
  • Wrong image name/tag
  • Missing namespace in command
  • Label selector mismatch

PASS 2: Standard Issues (4-6 min)
  • Missing ConfigMap/Secret
  • Resource constraints
  • Service selector mismatch
  • Network policy blocking traffic

PASS 3: Complex Issues (7+ min)
  • Control plane component failures
  • Node-level issues
  • CNI/networking problems
  • Storage/CSI issues
```

6.2 Time Management
For a 2-hour exam with troubleshooting worth 30%:
- ~36 minutes for troubleshooting questions
- Probably 3-4 troubleshooting scenarios
- ~9-12 minutes per scenario maximum
Golden rule: If you can’t identify the problem in 3 minutes of investigation, flag it and move on.
6.3 Common Exam Patterns
| Scenario | Likely Issue | Quick Check |
|---|---|---|
| Pod not starting | Image, ConfigMap/Secret | k describe pod |
| Service not accessible | Selector, endpoints | k get endpoints |
| Node NotReady | kubelet, runtime | ssh node; systemctl status kubelet |
| DNS not working | CoreDNS pods | k -n kube-system get pods -l k8s-app=kube-dns |
| Persistent volume pending | StorageClass, PV | k describe pvc |
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Jumping to logs first | Miss scheduling/config issues | Always describe before logs |
| Not checking events | Miss critical error messages | Check events immediately |
| Fixing without diagnosis | Might not fix real issue | Always identify root cause |
| Forgetting --previous | Can't see why container crashed | Use for CrashLoopBackOff |
| Ignoring exit codes | Miss OOM vs app error | Check exit code for cause |
| Not checking all containers | Multi-container pods | Use -c <container> flag |
Q1: The Restarting Application
You've just deployed a new release of your backend API, but the deployment is failing to progress. When you check the pods, you see they are in a CrashLoopBackOff state. What is the very first kubectl command you should run to begin your investigation, and why?
Answer
You should run kubectl describe pod <pod-name>. While it’s tempting to immediately jump to checking the logs, the describe command provides the critical Events section at the bottom of its output. These events will often tell you immediately if the issue is a failure to pull an image, a missing ConfigMap, or a failing liveness probe. Only after checking the events and confirming it’s an application-level crash should you move on to checking kubectl logs <pod-name> --previous to view the actual crash logs.
Q2: The Disappearing Evidence
A developer reports that their batch job failed mysteriously over the weekend. When you check the cluster on Monday morning, the pod is gone and you can't find any obvious errors. Why might you struggle to find the root cause using standard Kubernetes event logs?
Answer
You will struggle because Kubernetes events are only retained for 1 hour by default in etcd. By the time you check on Monday, the cluster will have already garbage-collected the events related to the weekend failure. Furthermore, the Events section in the describe output truncates, meaning a flood of newer events can quickly push out the older, relevant ones even within that one-hour window. This is why it is critical to capture cluster state and events immediately when an issue occurs, or rely on external logging and monitoring systems for historical data.
Q3: The Silent Killer
Your data processing pod suddenly stops working. When you inspect the pod status, you see that the container terminated with an exit code of 137. What does this specific exit code tell you about how the container died, and where should you look next?
Answer
An exit code of 137 indicates that the container was terminated forcefully with a SIGKILL signal (128 + 9). In a Kubernetes environment, this almost always means the container was OOMKilled (Out Of Memory) because it tried to consume more memory than its configured limits allowed. Alternatively, it could mean the node itself was under memory pressure and the kubelet killed the pod to protect system stability. You should immediately run kubectl describe pod <pod-name> to check the Last State section for the exact reason (like OOMKilled), and then review the pod’s resource limits compared to its actual usage.
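The 128-plus-signal rule generalizes beyond 137. A minimal sketch (the `decode_exit` helper name is invented) that decodes any exit code the same way:

```shell
# Exit codes above 128 mean "killed by signal (code - 128)".
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"   # 137 -> 9 (SIGKILL), 143 -> 15 (SIGTERM)
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit 137   # prints: killed by signal 9
decode_exit 143   # prints: killed by signal 15
decode_exit 1     # prints: application error (exit 1)
```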
Q4: The Stuck Deployment
You've applied a new Deployment, but the pods never seem to reach the Running state. Some pods show a Pending status, while others are stuck in ContainerCreating. How do these two states differ in terms of where the failure is occurring in the pod lifecycle?
Answer
The difference lies in whether the Kubernetes scheduler has successfully assigned the pod to a node. A Pending status means the pod has not yet been scheduled; this typically points to cluster-level issues like a lack of available resources (CPU/Memory), untolerated taints, or a misconfigured node selector. Conversely, ContainerCreating means the scheduler has assigned the pod to a node, but the node’s kubelet is struggling to start the container. This usually points to node-level or dependency issues, such as failing to pull the container image, inability to mount a required PersistentVolume, or Secret/ConfigMap resolution failures. Check describe Events to see which step is stuck.
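One quick way to see the distinction in practice: a Pending pod has no node assigned, so the NODE column of `k get pods -o wide` shows `<none>`. A sketch on made-up sample output that sorts stuck pods into the two buckets:

```shell
# Hypothetical capture of: k get pods -o wide
cat <<'EOF' > /tmp/stuck.txt
NAME   READY  STATUS             RESTARTS  AGE  IP      NODE
pod-a  0/1    Pending            0         10m  <none>  <none>
pod-b  0/1    ContainerCreating  0         10m  <none>  worker-1
EOF

# NODE == <none>: never scheduled (resources, taints, nodeSelector).
# NODE set but ContainerCreating: kubelet-side (image pull, volumes).
awk 'NR > 1 {
       if ($7 == "<none>")
         print $1, "-> not scheduled: check resources/taints/selectors"
       else if ($3 == "ContainerCreating")
         print $1, "-> on " $7 ": check image pull and volume mounts"
     }' /tmp/stuck.txt
```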
Q5: The Sidecar Mystery
You have a pod running a web application alongside a logging sidecar container. The web application is returning 500 errors, but when you run kubectl logs <pod-name>, you only see the sidecar's output, which looks completely healthy. How do you retrieve the logs for the failing web application container?
Answer
By default, when you run kubectl logs against a multi-container pod without specifying a container, Kubernetes will either output the logs of the first container defined in the pod spec, or return an error prompting you to choose one. To view the logs of the specific web application container, you must use the -c flag followed by the container’s name. If you aren’t sure of the container’s exact name, you can use a jsonpath query to list all containers within that pod before checking the logs.
```
k logs <pod-name> -c <container-name>

# List containers in pod
k get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'
```

Q6: The Unresponsive Node
During a routine cluster check, you notice that one of your worker nodes is marked as NotReady. All the pods on that node are beginning to transition to an Unknown state. Walk through the systematic steps you would take to diagnose why this node has dropped out of the cluster.
Answer
When a node becomes NotReady, it means the control plane has lost communication with the node’s kubelet. Your first step should be to run kubectl describe node <node> from the control plane to check the Conditions section for issues like memory or disk pressure that might have preceded the disconnect. If the cluster-level info isn’t conclusive, you must shift to node-level debugging.
```
k describe node <node>       # Check Conditions section

# SSH to node if accessible
systemctl status kubelet     # Is kubelet running?
journalctl -u kubelet -f     # Check kubelet logs
systemctl status containerd  # (or docker) - Is runtime running?

# Check network connectivity to control plane
```
Hands-On Exercise: Systematic Troubleshooting Practice
Scenario
You'll create several broken resources and practice systematic troubleshooting.
```
# Create namespace
k create ns troubleshoot-lab
```
```
# Create a "broken" deployment - see if you can spot all issues
cat <<'EOF' | k apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
  namespace: troubleshoot-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: app
        image: nginx:latestt
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "64Mi"
            cpu: "500m"
        volumeMounts:
        - name: config
          mountPath: /etc/nginx/conf.d
      volumes:
      - name: config
        configMap:
          name: nginx-config
EOF
```

Apply the troubleshooting methodology:
1. Identify - What’s wrong?
```
k get pods -n troubleshoot-lab
# What status do you see?
```

2. Isolate - Where's it wrong?
```
k describe pod -n troubleshoot-lab -l app=broken-app
# Look at the Events section
```

3. Diagnose - Why's it wrong? Find all issues (there are at least 2):
- Issue 1: _______________
- Issue 2: _______________
4. Fix - Apply solutions
```
# Fix issue 1: Image typo
k set image deployment/broken-app -n troubleshoot-lab app=nginx:latest

# Fix issue 2: Missing ConfigMap
k create configmap nginx-config -n troubleshoot-lab --from-literal=placeholder=true

# Verify
k get pods -n troubleshoot-lab -w
```

Extended Challenge
Create more broken scenarios:
```
# Scenario 2: CrashLoopBackOff
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: crash-pod
  namespace: troubleshoot-lab
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'exit 1']
EOF
```
```
# Scenario 3: Pending pod
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pending-pod
  namespace: troubleshoot-lab
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "100Gi"
        cpu: "100"
EOF
```

Troubleshoot each one systematically.
Success Criteria
- Identified image typo in deployment
- Identified missing ConfigMap
- Fixed deployment to Running state
- Explained why crash-pod is in CrashLoopBackOff
- Explained why pending-pod stays Pending
Cleanup
```
k delete ns troubleshoot-lab
```

Practice Drills
Practice these scenarios until they're automatic:
Drill 1: Quick Status Check (30 sec)
```
# Task: Find all non-running pods across all namespaces
k get pods -A | grep -v Running
# Or: k get pods -A --field-selector=status.phase!=Running
```

Drill 2: Recent Events (30 sec)
```
# Task: Show last 10 events sorted by time
k get events -A --sort-by='.lastTimestamp' | tail -10
```

Drill 3: Pod Crash Investigation (2 min)
```
# Task: Full investigation of CrashLoopBackOff pod
k describe pod <pod>     # Step 1: Events
k logs <pod> --previous  # Step 2: Crash logs
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'  # Step 3: Exit code
```

Drill 4: Node Health Check (1 min)
```
# Task: Check node health and resources
k get nodes
k describe node <node> | grep -A 5 Conditions
k top nodes
```

Drill 5: Service Endpoint Check (1 min)
```
# Task: Verify service has endpoints
k get svc <service>
k get endpoints <service>
k get pods -l <service-selector>
```

Drill 6: DNS Verification (1 min)
```
# Task: Verify DNS working in cluster
k run dnstest --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes
```

Drill 7: Container Shell Access (30 sec)
```
# Task: Get shell in running container
k exec -it <pod> -- /bin/sh
# If sh not available: k exec -it <pod> -- /bin/bash
```

Drill 8: Multi-Container Logs (1 min)
```
# Task: View logs from specific container and follow
k logs <pod> -c <container> -f
# List all containers: k get pod <pod> -o jsonpath='{.spec.containers[*].name}'
```

Next Module
Continue to Module 5.2: Application Failures to learn how to troubleshoot pods, deployments, and application-level issues.