Module 3.3: Debugging in Kubernetes
Complexity: MEDIUM (a critical exam skill requiring a systematic approach)
Time to Complete: 45-55 minutes
Prerequisites: Module 3.1 (Probes), Module 3.2 (Logging)
Learning Outcomes
After completing this module, you will be able to:
- Diagnose pod failures systematically using events, logs, describe, and exec
- Debug CrashLoopBackOff, ImagePullBackOff, and Pending pod states under time pressure
- Troubleshoot networking issues between pods using temporary debug containers
- Implement a repeatable debugging workflow: status, events, logs, exec, network
Why This Module Matters
Debugging is where exam performance is won or lost. When something doesn’t work, you need a systematic approach to find the problem quickly. The CKAD exam deliberately includes broken configurations—you must diagnose and fix them under time pressure.
This module covers:
- Debugging pods that won’t start
- Debugging running pods with issues
- Using ephemeral containers for debugging
- The systematic debugging workflow
The Detective Analogy
Debugging is detective work. You arrive at a crime scene (a broken pod) and must find clues. You check the victim’s history (describe), examine the evidence (logs), interview witnesses (events), and sometimes need to go undercover (exec) to catch the culprit. Systematic investigation beats random guessing.
War Story: The Two-Hour Typo
During a major production rollout, a critical microservice failed to start. The on-call engineer panicked and spent two hours randomly restarting the deployment, rolling back healthy database changes, and rewriting ingress rules. Had they followed a systematic workflow, starting with `kubectl describe pod` and reading the events, they would have immediately seen an `ErrImagePull` event caused by a typo in the image tag. A systematic workflow turns a two-hour panic into a two-minute fix. The same panic causes candidates to fail the CKAD exam when faced with a broken environment.
The Debugging Workflow
```
┌─────────────────────────────────────────────────────────┐
│           Systematic Debugging Workflow                 │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. GET STATUS                                          │
│     k get pod POD -o wide                               │
│     └── What state? Ready? Restarts? Node?              │
│                                                         │
│  2. DESCRIBE                                            │
│     k describe pod POD                                  │
│     └── Events? Conditions? Container status?           │
│                                                         │
│  3. LOGS                                                │
│     k logs POD [--previous]                             │
│     └── What did the app say? Errors?                   │
│                                                         │
│  4. EXEC                                                │
│     k exec -it POD -- sh                                │
│     └── What's inside? Files? Processes? Network?       │
│                                                         │
│  5. EVENTS                                              │
│     k get events --sort-by='.lastTimestamp'             │
│     └── What happened cluster-wide?                     │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

Debug Commands
Quick Status Check
Section titled “Quick Status Check”# Pod status overviewk get pod POD -o wide
# All pods in namespacek get pods
# Watch for changesk get pods -wDescribe for Details
```bash
# Full pod details
k describe pod POD

# Key sections to check:
# - Status/Conditions
# - Containers (State, Ready, Restart Count)
# - Events (bottom of output)
```

View Logs
```bash
# Current logs
k logs POD

# Previous instance (after crash)
k logs POD --previous

# Specific container
k logs POD -c CONTAINER

# Stream logs
k logs -f POD
```

Execute Commands
```bash
# Interactive shell
k exec -it POD -- sh
k exec -it POD -- /bin/bash

# Run a single command
k exec POD -- ls /app
k exec POD -- cat /etc/config

# Specific container
k exec -it POD -c CONTAINER -- sh
```

View Events
```bash
# Namespace events (sorted by time)
k get events --sort-by='.lastTimestamp'

# Filter by type
k get events --field-selector type=Warning

# Events for a specific pod
k get events --field-selector involvedObject.name=POD
```

Common Pod States and Fixes
Pending
Section titled “Pending”Pod stuck in Pending state.
```bash
# Check why
k describe pod POD

# Common causes:
# 1. Insufficient resources
#    → Check node resources: k describe node
#    → Reduce pod resource requests

# 2. No matching node (nodeSelector/affinity)
#    → Check node labels: k get nodes --show-labels
#    → Fix the selector or label the nodes

# 3. PVC not bound
#    → Check PVC: k get pvc
#    → Create a matching PV
```

Pause and predict: A pod is stuck in the Pending state. You check `kubectl describe pod` and see the event “0/3 nodes are available: 3 Insufficient cpu.” Is this a pod problem or a cluster problem? What are your options?
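For the insufficient-resources cause, the usual exam-time fix is on the pod side: shrink the requests so the scheduler can place the pod on an existing node. A minimal sketch; the pod name and the numbers are illustrative, not from this module:

```yaml
# Hypothetical pod whose CPU request was larger than any node had free.
# requests = what the scheduler reserves; limits = the runtime cap.
# Lowering requests lets the pod schedule; limits can stay higher.
apiVersion: v1
kind: Pod
metadata:
  name: web                # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "100m"        # was e.g. "4", which no node could satisfy
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
```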
ImagePullBackOff / ErrImagePull
Can’t pull the container image.
```bash
# Check events
k describe pod POD | grep -A5 Events

# Common causes:
# 1. Wrong image name/tag
#    → Fix the image in the pod spec

# 2. Private registry without credentials
#    → Create an imagePullSecret

# 3. Image doesn't exist
#    → Verify the image exists in the registry
```

CrashLoopBackOff
Section titled “CrashLoopBackOff”Container crashes repeatedly.
```bash
# Check logs from the crashed instance
k logs POD --previous

# Check exit code
k describe pod POD | grep "Last State"

# Common causes:
# 1. Application error (check logs)
# 2. Missing config/secrets
# 3. Wrong command/args
# 4. Liveness probe killing a healthy app
```

Stop and think: A pod shows CrashLoopBackOff with exit code 137. Before looking at the answer section, what does exit code 137 tell you about the cause? What would you check first?
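Cause 4 deserves emphasis: a liveness probe that fires before the app finishes starting will kill a perfectly healthy container in a loop. A hedged sketch of the probe fields worth checking; the path, port, and timings are illustrative:

```yaml
# Illustrative probe tuning: if the app needs ~30s to start, a probe
# with no initialDelaySeconds will restart it before it ever gets ready.
livenessProbe:
  httpGet:
    path: /healthz          # must be a path the app actually serves
    port: 8080
  initialDelaySeconds: 30   # give the app time to start
  periodSeconds: 10
  failureThreshold: 3       # tolerate transient blips before restarting
```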
Running but Not Ready
The container is running but its readiness probe fails.
```bash
# Check the readiness probe
k describe pod POD | grep -A5 Readiness

# Check endpoints
k get endpoints SERVICE

# Common causes:
# 1. Wrong readiness probe path/port
# 2. App not fully started
# 3. Dependency not available
```

Debugging Inside Containers
Basic Commands
```bash
# Get a shell
k exec -it POD -- sh

# Check processes
k exec POD -- ps aux

# Check network
k exec POD -- netstat -tlnp
k exec POD -- ss -tlnp

# Check DNS
k exec POD -- nslookup kubernetes
k exec POD -- cat /etc/resolv.conf

# Check connectivity
k exec POD -- wget -qO- http://service:port
k exec POD -- curl -s http://service:port

# Check files
k exec POD -- ls -la /app
k exec POD -- cat /etc/config/file
```

Pause and predict: You need to test network connectivity from inside a distroless container that has no shell, no curl, no wget. How would you approach this?
When Shell Isn’t Available
Section titled “When Shell Isn’t Available”Some containers (distroless, scratch) don’t have a shell:
```bash
# Check if a shell exists
k exec POD -- /bin/sh -c 'echo works'

# If no shell, use a debug container (Kubernetes 1.25+)
k debug POD -it --image=busybox --target=container-name
```

Ephemeral Debug Containers
Kubernetes 1.25+ supports ephemeral containers for debugging:
```bash
# Add a debug container to a running pod
k debug POD -it --image=busybox --target=container-name

# Debug with a specific image
k debug POD -it --image=nicolaka/netshoot

# Copy the pod for debugging (doesn't affect the original)
k debug POD -it --copy-to=debug-pod --container=debug --image=busybox
```

Debug Container Use Cases
```bash
# Network debugging (no curl in the original container)
k debug POD -it --image=nicolaka/netshoot --target=app
# Then: curl, dig, nslookup, tcpdump

# File system inspection
k debug POD -it --image=busybox --target=app
# Then: ls, cat, find

# Process debugging (--share-processes applies to pod copies)
k debug POD -it --copy-to=debug-pod --image=busybox --share-processes
# Then: ps aux
```

Service Debugging
Check Service-to-Pod Connection
```bash
# Verify the service exists
k get svc SERVICE

# Check endpoints (should list pod IPs)
k get endpoints SERVICE

# If there are no endpoints:
# - Check that pod labels match the service selector
# - Check pod readiness
k get pods --show-labels
k describe svc SERVICE | grep Selector
```

Test Service DNS
```bash
# From inside a pod
k exec POD -- nslookup SERVICE
k exec POD -- nslookup SERVICE.NAMESPACE.svc.cluster.local

# Create a throwaway test pod for debugging
k run test --image=busybox --rm -it -- nslookup SERVICE
```

Test Service Connectivity
```bash
# From inside a pod
k exec POD -- wget -qO- http://SERVICE:PORT
k exec POD -- curl -s http://SERVICE:PORT
```

Debug Scenarios
Scenario 1: Pod Won’t Start
```bash
# Step 1: Check status
k get pod broken-pod

# Step 2: Describe for events
k describe pod broken-pod

# Step 3: Check whether the image exists
# If ErrImagePull: fix the image name
# If Pending: check resources/node selector

# Step 4: Check logs if the container started
k logs broken-pod
```

Scenario 2: Pod Keeps Crashing
```bash
# Step 1: Get the restart count
k get pod crashing-pod

# Step 2: Check previous logs
k logs crashing-pod --previous

# Step 3: Check the exit code
k describe pod crashing-pod | grep -A3 "Last State"

# Step 4: Check the liveness probe
k describe pod crashing-pod | grep -A5 Liveness
```

Scenario 3: Service Not Reachable
```bash
# Step 1: Check the service exists
k get svc myservice

# Step 2: Check endpoints
k get endpoints myservice

# Step 3: If there are no endpoints, check pod labels
k get pods --show-labels
k describe svc myservice | grep Selector

# Step 4: Test from inside the cluster
k run test --image=busybox --rm -it -- wget -qO- http://myservice
```

Did You Know?
- `kubectl debug` creates ephemeral containers that share the pod’s network namespace (and, with `--target`, the target container’s process namespace). This means you can see the pod’s network connections and processes from a debug container.
- Exit code 137 means the container was killed with SIGKILL (128 + 9), most commonly OOMKilled: the container exceeded its memory limit.
- Exit code 1 is a generic failure, usually from the application itself. Check the logs for details.
- Exit code 0 means success, but if a container exits 0 and wasn’t supposed to, it’s still a problem (wrong command).
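The exit-code rules above follow a POSIX shell convention you can verify locally, no cluster needed: a process killed by signal N reports exit code 128 + N.

```shell
# SIGKILL is signal 9, so a SIGKILL'd process reports 128 + 9 = 137,
# the same code kubectl shows under "Last State" for OOMKilled containers.
sh -c 'kill -9 $$'   # the child shell kills itself with SIGKILL
echo $?              # prints 137
```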
Common Mistakes
| Mistake | Why It Hurts | Solution |
|---|---|---|
| Random guessing | Wastes exam time | Follow the systematic workflow |
| Skipping `describe` | Miss obvious events | Always check events first |
| Not checking `--previous` | Miss crash logs | Check the previous instance on CrashLoopBackOff |
| Ignoring exit codes | Miss OOM/signal issues | Check Last State in describe |
| Forgetting readiness | Pod works but gets no traffic | Check endpoints and probes |
- A pod named `api-server` is in CrashLoopBackOff. You run `kubectl logs api-server` but see only a fresh startup message, no errors. How do you find out what caused the crash, and what is your systematic next step?

  Answer: The current logs are from the freshly restarted instance, which hasn't crashed yet. Run `kubectl logs api-server --previous` to see the logs from the instance that actually crashed. If that doesn't reveal the issue, run `kubectl describe pod api-server` and check the "Last State" section for the exit code and reason. Exit code 1 means the application threw an error (check the logs more carefully), exit code 137 means OOMKilled (increase memory limits), and exit code 0 means the process exited successfully but shouldn't have (likely a wrong command).
- A developer deployed a new version of their app. The pods show ImagePullBackOff. They swear the image exists because they pushed it 10 minutes ago. What are the three most common causes, and how do you investigate each one?

  Answer: Run `kubectl describe pod` and check the Events section for the exact error message. The three most common causes: (1) A typo in the image name or tag: verify by comparing the pod spec image with what's in the registry. (2) A private registry without imagePullSecrets: the image exists but the node can't authenticate. Check whether the pod spec has `imagePullSecrets` and whether the secret contains valid credentials. (3) The image was pushed to a different registry or repository than the one in the manifest. The describe output usually includes the registry's error message, which tells you exactly which of these is the problem.
- Users report they can’t reach your application through a Service, but all pods show Running and Ready (1/1). You run `kubectl get endpoints myservice` and see `<none>`. What is the most likely root cause, and how do you fix it?

  Answer: Empty endpoints with Running pods means the Service's label selector doesn't match the pods' labels. Debug by comparing the two: run `kubectl describe svc myservice | grep Selector` to see what the Service expects, then `kubectl get pods --show-labels` to see the actual pod labels. The fix is to either patch the Service selector to match the pod labels (`kubectl patch svc myservice -p '{"spec":{"selector":{"app":"correct-label"}}}'`) or update the pod/deployment labels to match the Service selector. This is one of the most common debugging scenarios on the CKAD exam.
- You need to debug a network issue from inside a running pod, but the container image is distroless (no shell, no curl, no network tools). The pod is serving production traffic and you cannot restart it. What do you do?

  Answer: Use an ephemeral debug container: `kubectl debug pod-name -it --image=nicolaka/netshoot --target=container-name`. This attaches a new container to the running pod that shares the pod's network namespace, so you can use tools like `curl`, `dig`, `nslookup`, and `tcpdump` to diagnose the issue. The `--target` flag shares the process namespace with the specified container. The original container continues running unaffected. Alternatively, `kubectl debug pod-name -it --copy-to=debug-pod --image=busybox` creates a copy of the pod for investigation without touching the original.
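For the private-registry cause discussed above, the fix is a docker-registry secret referenced from the pod spec. A sketch with placeholder registry host, credentials, and names, not values from this module:

```yaml
# Secret created with (placeholder values):
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=USER --docker-password=PASS
apiVersion: v1
kind: Pod
metadata:
  name: private-app                  # illustrative name
spec:
  imagePullSecrets:
  - name: regcred                    # must match the secret's name
  containers:
  - name: app
    image: registry.example.com/team/app:v1
```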
Hands-On Exercise
Task: Debug and fix broken pods.
Setup:
```bash
# Create a broken pod (wrong image)
cat << 'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: broken1
spec:
  containers:
  - name: app
    image: nginx:nonexistent-tag
EOF

# Create a crashing pod
cat << 'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: broken2
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'echo "Config not found"; exit 1']
EOF

# Create a pod with a resource issue
cat << 'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: broken3
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "999Gi"
EOF
```

Debug Each:
```bash
# Debug broken1
k get pod broken1
k describe pod broken1 | tail -10
# Fix: change the image to nginx:latest

# Debug broken2
k get pod broken2
k logs broken2 --previous
# Fix: provide the correct config

# Debug broken3
k get pod broken3
k describe pod broken3 | grep -A5 Events
# Fix: reduce the memory request
```

Cleanup:

```bash
k delete pod broken1 broken2 broken3
```

Practice Drills
Drill 1: Describe and Events (Target: 2 minutes)
```bash
# Create pod
k run drill1 --image=nginx

# Describe it
k describe pod drill1

# Check events
k get events --field-selector involvedObject.name=drill1

# Cleanup
k delete pod drill1
```

Drill 2: Exec Into Pod (Target: 2 minutes)
```bash
# Create pod
k run drill2 --image=nginx

# Exec into it
k exec -it drill2 -- bash

# Inside: check nginx is running
ps aux | grep nginx
exit

# Cleanup
k delete pod drill2
```

Drill 3: Debug Crashing Pod (Target: 3 minutes)
```bash
# Create a crashing pod
cat << 'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: drill3
spec:
  containers:
  - name: app
    image: busybox
    command: ['sh', '-c', 'echo error; exit 1']
EOF

# Wait for the crash
k get pod drill3 -w

# Get logs from the previous instance
k logs drill3 --previous

# Cleanup
k delete pod drill3
```

Drill 4: Debug ImagePullBackOff (Target: 3 minutes)
```bash
# Create a pod with a bad image
k run drill4 --image=invalid-registry.io/no-such-image:v1

# Check status
k get pod drill4

# Describe for details
k describe pod drill4 | grep -A5 Events

# Cleanup
k delete pod drill4
```

Drill 5: Service Debug (Target: 4 minutes)
```bash
# Create a pod and service with mismatched labels
cat << 'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: drill5
  labels:
    app: myapp
spec:
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Service
metadata:
  name: drill5-svc
spec:
  selector:
    app: wronglabel
  ports:
  - port: 80
EOF

# Check endpoints (should be empty)
k get endpoints drill5-svc

# Find the problem
k get pod drill5 --show-labels
k describe svc drill5-svc | grep Selector

# Fix by patching the service
k patch svc drill5-svc -p '{"spec":{"selector":{"app":"myapp"}}}'

# Verify endpoints now exist
k get endpoints drill5-svc

# Cleanup (mixed resource types need the TYPE/NAME form)
k delete pod/drill5 svc/drill5-svc
```

Drill 6: Complete Debug Scenario (Target: 5 minutes)
Scenario: Application deployed but not accessible.
```bash
# Create a "broken" deployment
cat << 'EOF' | k apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drill6
spec:
  replicas: 2
  selector:
    matchLabels:
      app: drill6
  template:
    metadata:
      labels:
        app: drill6
    spec:
      containers:
      - name: nginx
        image: nginx
        readinessProbe:
          httpGet:
            path: /nonexistent
            port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: drill6-svc
spec:
  selector:
    app: drill6
  ports:
  - port: 80
EOF

# Check pods (running but not ready)
k get pods -l app=drill6

# Check endpoints (empty)
k get endpoints drill6-svc

# Describe a pod for the probe failure
k describe pod -l app=drill6 | grep -A5 Readiness

# Fix the readiness probe
k patch deploy drill6 --type='json' -p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/httpGet/path","value":"/"}]'

# Wait for the rollout
k rollout status deploy drill6

# Verify endpoints
k get endpoints drill6-svc

# Cleanup (mixed resource types need the TYPE/NAME form)
k delete deploy/drill6 svc/drill6-svc
```

Next Module
Module 3.4: Monitoring Applications - Monitor application health and resource usage.