
Module 5.1: Troubleshooting Methodology

Hands-On Lab Available: Kubernetes cluster (intermediate, ~30 min), opens in Killercoda.

Complexity: [MEDIUM] - Foundation for all troubleshooting

Time to Complete: 40-50 minutes

Prerequisites: Parts 1-4 completed (cluster architecture, workloads, networking, storage)


After this module, you will be able to:

  • Apply a systematic troubleshooting methodology (symptoms → hypotheses → verify → fix → validate)
  • Triage CKA troubleshooting questions by identifying the failure layer (application, service, node, control plane)
  • Use kubectl commands (describe, logs, events, get -o yaml) in the correct diagnostic order
  • Avoid the #1 troubleshooting mistake: making changes before understanding the problem

Troubleshooting is 30% of the CKA exam - the largest single domain. More importantly, troubleshooting is what separates Kubernetes operators from Kubernetes experts. When a production cluster is down at 3 AM, systematic debugging is the difference between a 5-minute fix and a 5-hour nightmare.

The Doctor Analogy

A good doctor doesn’t just guess treatments - they follow a diagnostic process. Symptoms → examination → tests → diagnosis → treatment. Kubernetes troubleshooting works the same way. Random “fixes” might work occasionally, but systematic investigation works every time.



  • 80% of issues are in 5 places: Pod spec errors, image pull problems, resource constraints, network policies, and misconfigured services
  • Events expire: Kubernetes events are only kept for 1 hour by default - if you don’t check soon, evidence disappears
  • describe > logs: Most beginners jump straight to logs. Experienced troubleshooters check describe first - the Events section often reveals the problem immediately
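The describe-first ordering can be drilled with a small helper. This is a sketch only: the function below just prints the diagnostic commands in the recommended order so you can review them before pasting - the pod and namespace arguments are placeholders, and nothing touches the cluster.

```shell
#!/bin/sh
# Dry-run triage helper: prints the diagnostic commands for a pod in the
# recommended order (describe first, logs second). Practice sketch only.
triage_plan() {
  pod="$1"
  ns="${2:-default}"
  echo "kubectl -n $ns describe pod $pod        # 1. Events section first"
  echo "kubectl -n $ns logs $pod                # 2. Current container logs"
  echo "kubectl -n $ns logs $pod --previous     # 3. Logs from the last crash"
  echo "kubectl -n $ns get pod $pod -o yaml     # 4. Full spec if still stuck"
}

triage_plan my-pod my-namespace
```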

Stop and think: What do you do when something breaks? Do you immediately start changing things? Most people do. But random changes make things worse - you lose the ability to tell what fixed it (or what broke it further). The framework below forces you to understand BEFORE you act. It feels slower at first, but it’s faster in the long run because you fix the right thing the first time.

Every troubleshooting session should follow this pattern:

┌──────────────────────────────────────────────────────────────┐
│ TROUBLESHOOTING FRAMEWORK │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ 1. IDENTIFY │────▶│ 2. ISOLATE │────▶│ 3. DIAGNOSE │ │
│ │ What's │ │ Where's │ │ Why's │ │
│ │ wrong? │ │ it wrong? │ │ it wrong? │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 4. FIX │ │
│ │ Apply │ │
│ │ solution │ │
│ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘

Start with the symptom. Be specific:

| Vague | Specific |
| --- | --- |
| "App is broken" | "Pod is in CrashLoopBackOff" |
| "Network doesn't work" | "Pod can't reach external DNS" |
| "Cluster is slow" | "API server response time > 5s" |

Initial triage commands:

```shell
# Cluster-wide health check
k get nodes
k get pods -A | grep -v Running
k get events -A --sort-by='.lastTimestamp' | tail -20

# Component health
k get componentstatuses   # Deprecated but still useful
k -n kube-system get pods
```

Narrow down the scope systematically:

┌──────────────────────────────────────────────────────────────┐
│ ISOLATION LAYERS │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ CLUSTER │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ NODE │ │ │
│ │ │ ┌─────────────────────────────────────┐ │ │ │
│ │ │ │ POD │ │ │ │
│ │ │ │ ┌─────────────────────────────┐ │ │ │ │
│ │ │ │ │ CONTAINER │ │ │ │ │
│ │ │ │ │ ┌─────────────────────┐ │ │ │ │ │
│ │ │ │ │ │ APPLICATION │ │ │ │ │ │
│ │ │ │ │ └─────────────────────┘ │ │ │ │ │
│ │ │ │ └─────────────────────────────┘ │ │ │ │
│ │ │ └─────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Start wide, drill down until you find the problem layer │
└──────────────────────────────────────────────────────────────┘

Isolation questions:

  • Is it all pods or specific pods?
  • Is it all nodes or specific nodes?
  • Is it all namespaces or specific namespaces?
  • Did it ever work? What changed?

Once you’ve isolated the layer, gather detailed information:

```shell
# Pod-level diagnosis
k describe pod <pod-name>      # Events section is gold
k logs <pod-name>              # Current container logs
k logs <pod-name> --previous   # Previous container (if crashed)

# Node-level diagnosis
k describe node <node-name>
ssh <node> journalctl -u kubelet

# Cluster-level diagnosis
k -n kube-system logs <component-pod>
```

Only after diagnosis do you fix:

```shell
# Apply the fix
k edit <resource>         # Direct edit
k apply -f <fixed-yaml>   # Apply corrected spec
k delete pod <pod>        # Force restart

# Verify the fix
k get pods -w   # Watch for status change
k logs <pod>    # Check new logs
```

Understanding what each component does helps you know where to look:

┌──────────────────────────────────────────────────────────────┐
│ COMPONENT FAILURE MAP │
│ │
│ SYMPTOM CHECK THESE COMPONENTS │
│ ─────────────────────────────────────────────────────────────│
│ │
│ Pods not scheduling → kube-scheduler │
│ Pods stuck Pending → scheduler, node resources │
│ Pods stuck ContainerCreating → kubelet, image pull, volumes │
│ Pods CrashLoopBackOff → container, app config │
│ Pods can't communicate → CNI, network policies │
│ Services not working → kube-proxy, endpoints │
│ kubectl times out → API server, etcd │
│ Node NotReady → kubelet, container runtime │
│ Persistent volume issues → CSI driver, storage class │
│ │
└──────────────────────────────────────────────────────────────┘

Control plane components:

| Component | What It Does | Failure Symptoms |
| --- | --- | --- |
| kube-apiserver | All API operations | kubectl fails, nothing works |
| etcd | State storage | Data loss, inconsistent state |
| kube-scheduler | Pod placement | Pods stuck Pending |
| kube-controller-manager | Reconciliation loops | Resources not updating |

Node components:

| Component | What It Does | Failure Symptoms |
| --- | --- | --- |
| kubelet | Pod lifecycle | Pods not starting, node NotReady |
| kube-proxy | Service networking | Services not reachable |
| Container runtime | Container execution | ContainerCreating stuck |
| CNI plugin | Pod networking | Pods can’t communicate |

Part 3: Essential Troubleshooting Commands


Memorize these - you’ll use them constantly:

```shell
# Status overview
k get pods                                # Pod status
k get pods -o wide                        # Plus node and IP
k get events --sort-by='.lastTimestamp'   # Recent events

# Deep inspection
k describe pod <pod>          # Full details + events
k logs <pod>                  # Container stdout/stderr
k logs <pod> -c <container>   # Specific container
k logs <pod> --previous       # Previous container instance

# Interactive debugging
k exec -it <pod> -- sh                  # Shell into container
k exec <pod> -- cat /etc/resolv.conf    # Run single command

# Resource status
k get <resource> -o yaml     # Full resource spec
k explain <resource.field>   # API documentation
```

```shell
# Find problem pods
k get pods -A | grep -v Running
k get pods -A --field-selector=status.phase!=Running

# Find pods on specific node
k get pods -A --field-selector spec.nodeName=worker-1

# Find pods by label
k get pods -l app=nginx

# Search events for errors
k get events -A | grep -i error
k get events -A | grep -i fail
```

```shell
# Node resources
k top nodes
k describe node <node> | grep -A 5 "Allocated resources"

# Pod resources
k top pods
k top pods --containers

# Check resource requests/limits
k get pods -o custom-columns=\
'NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```

```shell
# DNS resolution
k exec <pod> -- nslookup kubernetes
k exec <pod> -- cat /etc/resolv.conf

# Connectivity
k exec <pod> -- wget -qO- http://service-name
k exec <pod> -- nc -zv service-name 80

# Service endpoints
k get endpoints <service>
k get endpointslices -l kubernetes.io/service-name=<service>
```
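These service checks form a fixed chain (Service → Endpoints → backing Pods). Below is a dry-run sketch that prints the chain for a given service name - `web` is just an example name, and the selector in the last step has to come from the first command's real output; nothing here touches the cluster.

```shell
# Dry-run sketch: print the service-debug chain in order. Empty endpoints
# almost always mean the Service selector matches no Ready pod's labels.
svc_debug_plan() {
  svc="$1"
  echo "kubectl get svc $svc                        # 1. Exists? Right port?"
  echo "kubectl get endpoints $svc                  # 2. Empty = selector mismatch"
  echo "kubectl get endpointslices -l kubernetes.io/service-name=$svc"
  echo "kubectl get pods -l <selector-from-step-1>  # 3. Matching pods Ready?"
}

svc_debug_plan web
```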

Pause and predict: A pod shows Running but your application isn’t working. Is the pod healthy? Not necessarily — Running means at least one container started, but it doesn’t mean the application is serving traffic. The pod could be in a crash loop (restarting), failing readiness probes (excluded from service), or running but returning errors. Status ≠ health. This distinction trips up even experienced engineers.

┌──────────────────────────────────────────────────────────────┐
│ POD PHASES │
│ │
│ Pending ──────▶ Running ──────▶ Succeeded │
│ │ │ │
│ │ ▼ │
│ │ Failed │
│ │ │ │
│ ▼ ▼ │
│ [Problem] [Problem] │
│ │
│ Pending: Waiting for scheduling or image pull │
│ Running: At least one container running │
│ Succeeded: All containers exited 0 (completed) │
│ Failed: At least one container exited non-zero │
│ Unknown: Node communication lost │
└──────────────────────────────────────────────────────────────┘

| Status | Meaning | First Check |
| --- | --- | --- |
| Pending | Not scheduled yet | k describe pod - Events section |
| ContainerCreating | Image pull or volume mount | k describe pod - Events section |
| Running | Container(s) running | k logs for app issues |
| CrashLoopBackOff | Container crashing repeatedly | k logs --previous |
| ImagePullBackOff | Can’t pull image | Image name, registry auth |
| ErrImagePull | Image pull failed | Same as above |
| CreateContainerConfigError | Config issue | ConfigMap/Secret missing |
| OOMKilled | Out of memory | Memory limits vs actual usage |
| Evicted | Node pressure | Node resources, pod priority |
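This status-to-first-check mapping is worth drilling until it's reflex. The sketch below encodes it as a shell lookup for self-quizzing - the answers simply mirror the table above, not any kubectl output.

```shell
# Lookup: pod status -> first diagnostic step, mirroring the table above.
first_check() {
  case "$1" in
    Pending|ContainerCreating)     echo "k describe pod - Events section" ;;
    CrashLoopBackOff)              echo "k logs --previous" ;;
    ImagePullBackOff|ErrImagePull) echo "check image name and registry auth" ;;
    CreateContainerConfigError)    echo "check referenced ConfigMap/Secret" ;;
    OOMKilled)                     echo "check memory limits vs usage" ;;
    Evicted)                       echo "check node resources and pod priority" ;;
    *)                             echo "k describe pod" ;;
  esac
}

first_check CrashLoopBackOff   # prints: k logs --previous
```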

The most common troubleshooting scenario:

```shell
# Step 1: Check events
k describe pod <pod> | grep -A 20 Events

# Step 2: Check previous logs
k logs <pod> --previous

# Step 3: Check container exit code
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Common exit codes:
#   0   - Success (can still CrashLoop: with restartPolicy Always, a
#         container that keeps completing is restarted and backed off)
#   1   - Application error
#   137 - SIGKILL (OOMKilled or killed by system)
#   139 - SIGSEGV (segmentation fault)
#   143 - SIGTERM (graceful termination)
```
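The signal-related codes follow the POSIX 128 + signal-number convention, and you can reproduce them in any local shell, no cluster required:

```shell
# A shell killed by signal N exits with status 128+N.
sh -c 'exit 1'        || echo "app error: $?"   # app error: 1
sh -c 'kill -KILL $$' || echo "SIGKILL:   $?"   # SIGKILL:   137 (128 + 9)
sh -c 'kill -TERM $$' || echo "SIGTERM:   $?"   # SIGTERM:   143 (128 + 15)
```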

```shell
k describe pod <pod-name>
```
┌──────────────────────────────────────────────────────────────┐
│ DESCRIBE OUTPUT SECTIONS │
│ │
│ Section What to Look For │
│ ─────────────────────────────────────────────────────────────│
│ │
│ Status: Current phase (Pending/Running/etc) │
│ │
│ Containers: State, Ready, Restart Count │
│ Last State (for crash info) │
│ Image (verify it's correct) │
│ │
│ Conditions: Ready, ContainersReady, PodScheduled │
│ False = problem │
│ │
│ Volumes: ConfigMaps, Secrets, PVCs │
│ Missing = pod won't start │
│ │
│ Events: ⭐ THE MOST IMPORTANT SECTION │
│ Shows what's happening/happened │
│ Errors appear here first │
│ │
└──────────────────────────────────────────────────────────────┘

```shell
k describe node <node-name>
```

| Section | What to Look For |
| --- | --- |
| Conditions | Ready=True, MemoryPressure=False, DiskPressure=False |
| Capacity | Total CPU, memory, pods |
| Allocatable | Available for pods (after system reservation) |
| Allocated resources | Current usage and requests |
| Events | Evictions, pressure conditions |

┌──────────────────────────────────────────────────────────────┐
│ THREE-PASS TROUBLESHOOTING STRATEGY │
│ │
│ PASS 1: Quick Fixes (1-3 min) │
│ • Obvious typos in YAML │
│ • Wrong image name/tag │
│ • Missing namespace in command │
│ • Label selector mismatch │
│ │
│ PASS 2: Standard Issues (4-6 min) │
│ • Missing ConfigMap/Secret │
│ • Resource constraints │
│ • Service selector mismatch │
│ • Network policy blocking traffic │
│ │
│ PASS 3: Complex Issues (7+ min) │
│ • Control plane component failures │
│ • Node-level issues │
│ • CNI/networking problems │
│ • Storage/CSI issues │
│ │
└──────────────────────────────────────────────────────────────┘

For a 2-hour exam with troubleshooting worth 30%:

  • ~36 minutes for troubleshooting questions
  • Probably 3-4 troubleshooting scenarios
  • ~9-12 minutes per scenario maximum

Golden rule: If you can’t identify the problem in 3 minutes of investigation, flag it and move on.
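The numbers above are simple arithmetic on the figures quoted earlier (a 120-minute exam, 30% weight); this sketch just makes the calculation explicit:

```shell
# Back-of-envelope time budget for the troubleshooting domain.
exam_minutes=120
weight_pct=30
budget=$((exam_minutes * weight_pct / 100))        # 36 minutes
echo "domain budget:       ${budget} min"
echo "per scenario (of 4): $((budget / 4)) min"    # 9
echo "per scenario (of 3): $((budget / 3)) min"    # 12
```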

| Scenario | Likely Issue | Quick Check |
| --- | --- | --- |
| Pod not starting | Image, ConfigMap/Secret | k describe pod |
| Service not accessible | Selector, endpoints | k get endpoints |
| Node NotReady | kubelet, runtime | ssh node; systemctl status kubelet |
| DNS not working | CoreDNS pods | k -n kube-system get pods -l k8s-app=kube-dns |
| Persistent volume pending | StorageClass, PV | k describe pvc |

| Mistake | Problem | Solution |
| --- | --- | --- |
| Jumping to logs first | Miss scheduling/config issues | Always describe before logs |
| Not checking events | Miss critical error messages | Check events immediately |
| Fixing without diagnosis | Might not fix real issue | Always identify root cause |
| Forgetting --previous | Can’t see why container crashed | Use for CrashLoopBackOff |
| Ignoring exit codes | Miss OOM vs app error | Check exit code for cause |
| Not checking all containers | Multi-container pods | Use -c <container> flag |

You’ve just deployed a new release of your backend API, but the deployment is failing to progress. When you check the pods, you see they are in a CrashLoopBackOff state. What is the very first kubectl command you should run to begin your investigation, and why?

Answer

You should run kubectl describe pod <pod-name>. While it’s tempting to immediately jump to checking the logs, the describe command provides the critical Events section at the bottom of its output. These events will often tell you immediately if the issue is a failure to pull an image, a missing ConfigMap, or a failing liveness probe. Only after checking the events and confirming it’s an application-level crash should you move on to checking kubectl logs <pod-name> --previous to view the actual crash logs.

A developer reports that their batch job failed mysteriously over the weekend. When you check the cluster on Monday morning, the pod is gone and you can’t find any obvious errors. Why might you struggle to find the root cause using standard Kubernetes event logs?

Answer

You will struggle because Kubernetes events are only retained for 1 hour by default in etcd. By the time you check on Monday, the cluster will have already garbage-collected the events related to the weekend failure. Furthermore, the Events section in the describe output truncates, meaning a flood of newer events can quickly push out the older, relevant ones even within that one-hour window. This is why it is critical to capture cluster state and events immediately when an issue occurs, or rely on external logging and monitoring systems for historical data.

Your data processing pod suddenly stops working. When you inspect the pod status, you see that the container terminated with an exit code of 137. What does this specific exit code tell you about how the container died, and where should you look next?

Answer

An exit code of 137 indicates that the container was terminated forcefully with a SIGKILL signal (128 + 9). In a Kubernetes environment, this almost always means the container was OOMKilled (Out Of Memory) because it tried to consume more memory than its configured limits allowed. Alternatively, it could mean the node itself was under memory pressure and the kubelet killed the pod to protect system stability. You should immediately run kubectl describe pod <pod-name> to check the Last State section for the exact reason (like OOMKilled), and then review the pod’s resource limits compared to its actual usage.

You’ve applied a new Deployment, but the pods never seem to reach the Running state. Some pods show a Pending status, while others are stuck in ContainerCreating. How do these two states differ in terms of where the failure is occurring in the pod lifecycle?

Answer

The difference lies in whether the Kubernetes scheduler has successfully assigned the pod to a node. A Pending status means the pod has not yet been scheduled; this typically points to cluster-level issues like a lack of available resources (CPU/Memory), untolerated taints, or a misconfigured node selector. Conversely, ContainerCreating means the scheduler has assigned the pod to a node, but the node’s kubelet is struggling to start the container. This usually points to node-level or dependency issues, such as failing to pull the container image, inability to mount a required PersistentVolume, or Secret/ConfigMap resolution failures. Check describe Events to see which step is stuck.

You have a pod running a web application alongside a logging sidecar container. The web application is returning 500 errors, but when you run kubectl logs <pod-name>, you only see the sidecar’s output, which looks completely healthy. How do you retrieve the logs for the failing web application container?

Answer

By default, when you run kubectl logs against a multi-container pod without specifying a container, Kubernetes will either output the logs of the first container defined in the pod spec, or return an error prompting you to choose one. To view the logs of the specific web application container, you must use the -c flag followed by the container’s name. If you aren’t sure of the container’s exact name, you can use a jsonpath query to list all containers within that pod before checking the logs.

```shell
k logs <pod-name> -c <container-name>

# List containers in pod
k get pod <pod-name> -o jsonpath='{.spec.containers[*].name}'
```

During a routine cluster check, you notice that one of your worker nodes is marked as NotReady. All the pods on that node are beginning to transition to an Unknown state. Walk through the systematic steps you would take to diagnose why this node has dropped out of the cluster.

Answer

When a node becomes NotReady, it means the control plane has lost communication with the node’s kubelet. Your first step should be to run kubectl describe node <node> from the control plane to check the Conditions section for issues like memory or disk pressure that might have preceded the disconnect. If the cluster-level info isn’t conclusive, you must shift to node-level debugging.

  1. k describe node <node> - Check Conditions section
  2. SSH to node if accessible
  3. systemctl status kubelet - Is kubelet running?
  4. journalctl -u kubelet -f - Check kubelet logs
  5. systemctl status containerd (or docker) - Is runtime running?
  6. Check network connectivity to control plane

Hands-On Exercise: Systematic Troubleshooting Practice


You’ll create several broken resources and practice systematic troubleshooting.

```shell
# Create namespace
k create ns troubleshoot-lab

# Create a "broken" deployment - see if you can spot all issues
cat <<'EOF' | k apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
  namespace: troubleshoot-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: app
        image: nginx:latestt
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "64Mi"
            cpu: "500m"
        volumeMounts:
        - name: config
          mountPath: /etc/nginx/conf.d
      volumes:
      - name: config
        configMap:
          name: nginx-config
EOF
```

Apply the troubleshooting methodology:

1. Identify - What’s wrong?

```shell
k get pods -n troubleshoot-lab
# What status do you see?
```

2. Isolate - Where’s it wrong?

```shell
k describe pod -n troubleshoot-lab -l app=broken-app
# Look at the Events section
```

3. Diagnose - Why’s it wrong? Find all issues (there are at least 2):

  • Issue 1: _______________
  • Issue 2: _______________

4. Fix - Apply solutions

```shell
# Fix issue 1: Image typo
k set image deployment/broken-app -n troubleshoot-lab app=nginx:latest

# Fix issue 2: Missing ConfigMap
k create configmap nginx-config -n troubleshoot-lab --from-literal=placeholder=true

# Verify
k get pods -n troubleshoot-lab -w
```

Create more broken scenarios:

```shell
# Scenario 2: CrashLoopBackOff
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: crash-pod
  namespace: troubleshoot-lab
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'exit 1']
EOF

# Scenario 3: Pending pod
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pending-pod
  namespace: troubleshoot-lab
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "100Gi"
        cpu: "100"
EOF
```

Troubleshoot each one systematically.

  • Identified image typo in deployment
  • Identified missing ConfigMap
  • Fixed deployment to Running state
  • Explained why crash-pod is in CrashLoopBackOff
  • Explained why pending-pod stays Pending

```shell
k delete ns troubleshoot-lab
```

Practice these scenarios until they’re automatic:

```shell
# Task: Find all non-running pods across all namespaces
k get pods -A | grep -v Running
# Or: k get pods -A --field-selector=status.phase!=Running
```

```shell
# Task: Show last 10 events sorted by time
k get events -A --sort-by='.lastTimestamp' | tail -10
```

```shell
# Task: Full investigation of CrashLoopBackOff pod
k describe pod <pod>      # Step 1: Events
k logs <pod> --previous   # Step 2: Crash logs
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'   # Step 3: Exit code
```

```shell
# Task: Check node health and resources
k get nodes
k describe node <node> | grep -A 5 Conditions
k top nodes
```

```shell
# Task: Verify service has endpoints
k get svc <service>
k get endpoints <service>
k get pods -l <service-selector>
```

```shell
# Task: Verify DNS working in cluster
k run dnstest --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes
```

```shell
# Task: Get shell in running container
k exec -it <pod> -- /bin/sh
# If sh not available: k exec -it <pod> -- /bin/bash
```

```shell
# Task: View logs from specific container and follow
k logs <pod> -c <container> -f
# List all containers: k get pod <pod> -o jsonpath='{.spec.containers[*].name}'
```

Continue to Module 5.2: Application Failures to learn how to troubleshoot pods, deployments, and application-level issues.