Module 5.2: Application Failures
Complexity: Medium - Most common troubleshooting scenarios
Time to Complete: 45-55 minutes
Prerequisites: Module 5.1 (Troubleshooting Methodology), Module 2.1-2.7 (Workloads)
What You’ll Be Able to Do
After this module, you will be able to:
- Diagnose CrashLoopBackOff, ImagePullBackOff, and CreateContainerConfigError systematically
- Fix application failures caused by wrong images, missing ConfigMaps, incorrect probes, and resource limits
- Debug multi-container pod failures by identifying which container is failing and why
- Trace an application failure from symptom (pod not running) to root cause (specific configuration error)
Why This Module Matters
Application failures are the most common issues you’ll encounter - both in the exam and in production. A pod that won’t start, a container that keeps crashing, or a deployment that won’t roll out are daily occurrences. Mastering application troubleshooting is essential for any Kubernetes administrator.
The Restaurant Kitchen Analogy
Think of pods as dishes being prepared in a kitchen. Sometimes the dish fails because of a bad recipe (wrong image), sometimes the ingredients are missing (ConfigMap/Secret), sometimes the chef runs out of space (resources), and sometimes the dish just doesn’t come out right (application bug). Each failure has different symptoms and different fixes.
What You’ll Learn
By the end of this module, you’ll be able to:
- Troubleshoot pods that won’t start
- Diagnose CrashLoopBackOff containers
- Fix image pull failures
- Resolve configuration issues
- Handle resource constraints and OOM kills
- Debug deployment rollout problems
Did You Know?
- CrashLoopBackOff has exponential backoff: It starts at 10s, then 20s, 40s, up to 5 minutes between restart attempts.
- Init containers run first: If init containers fail, main containers never start - many people forget to check them.
- ImagePullBackOff vs ErrImagePull: ErrImagePull is the first failure, ImagePullBackOff is after multiple retries.
- OOMKilled doesn’t always mean a memory leak: It can simply mean your `limits` are set lower than the application’s baseline startup requirement.
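The backoff schedule from the first bullet can be sketched numerically. This is a simplified model of the kubelet's behavior, not its actual implementation: the delay doubles per crash and is capped at 300s.

```shell
# Print the (approximate) CrashLoopBackOff delay before each restart attempt.
delay=10
for crash in 1 2 3 4 5 6 7; do
  echo "crash $crash: wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi   # cap at 5 minutes
done
```

Note the practical consequence: once a pod has been crashing for a while, even a correct fix may take up to 5 minutes to show, unless you delete the pod to force an immediate restart.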
Part 1: Pod Startup Failures
1.1 The Pod Startup Sequence
Understanding what happens when a pod starts:
```
POD STARTUP SEQUENCE

1. Scheduling   (Pending)            - node selection, resource check, taints/affinity
2. Preparation  (ContainerCreating)  - pull images, mount volumes, set up network
3. Startup      (init containers)    - run sequentially, in order; each must exit 0
4. Running      (main containers)    - all containers start; probes begin running
5. Ready                             - readiness probes pass; pod added to Service endpoints
```
1.2 Pending - Scheduling Issues
When a pod is stuck in Pending:
```bash
# Check why pod is pending
k describe pod <pod> | grep -A 10 Events
```
Common causes:
| Message | Cause | Solution |
|---|---|---|
| `0/3 nodes available` | No suitable nodes | Check node taints, affinity rules |
| `Insufficient cpu` | Not enough CPU | Reduce requests or add capacity |
| `Insufficient memory` | Not enough memory | Reduce requests or add capacity |
| `node(s) had taint that pod didn't tolerate` | Taints blocking | Add tolerations or remove taints |
| `node(s) didn't match node selector` | nodeSelector mismatch | Fix labels or selector |
| `persistentvolumeclaim not found` | PVC missing | Create the PVC |
| `persistentvolumeclaim not bound` | No matching PV | Check StorageClass, create a PV |
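Two of the fixes in the table live in the pod spec itself. A minimal sketch, with hypothetical label and taint values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: schedulable-app        # hypothetical name
spec:
  nodeSelector:
    disktype: ssd              # must match a label actually present on a node
  tolerations:
  - key: "dedicated"           # must match the node's taint key
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.25
```

Compare the `nodeSelector` against `k get nodes --show-labels` and the toleration against the node's actual taints before applying.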
Investigation commands:
```bash
# Check node resources
k describe nodes | grep -A 5 "Allocated resources"
k top nodes

# Check node taints
k get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Check node labels (for nodeSelector)
k get nodes --show-labels
```
1.3 ContainerCreating - Preparation Issues
Stop and think: If a pod is stuck in `ContainerCreating` for 5 minutes, what is the most likely external dependency causing the hang?
When a pod is stuck in ContainerCreating:
```bash
# Always check Events first
k describe pod <pod> | grep -A 15 Events
```
Common causes:
| Message | Cause | Solution |
|---|---|---|
| `pulling image` (stuck) | Slow/large image | Wait, or use a smaller image |
| `ImagePullBackOff` | Wrong image name | Fix the image reference |
| `ErrImagePull` | Registry auth failed | Check imagePullSecrets |
| `MountVolume.SetUp failed` | Volume mount issue | Check that the PVC, ConfigMap, or Secret exists |
| `configmap not found` | Missing ConfigMap | Create the ConfigMap |
| `secret not found` | Missing Secret | Create the Secret |
| `network not ready` | CNI issues | Check the CNI pods |
Investigation commands:
```bash
# Check image pull issues
k get events --field-selector involvedObject.name=<pod>

# Check if ConfigMap/Secret exists
k get configmap <name>
k get secret <name>

# Check PVC status
k get pvc
k describe pvc <name>
```
Part 2: Container Crash Troubleshooting
2.1 Understanding CrashLoopBackOff
Section titled “2.1 Understanding CrashLoopBackOff”┌──────────────────────────────────────────────────────────────┐│ CRASHLOOPBACKOFF CYCLE ││ ││ Container Start ──▶ Container Crash ──▶ Wait ──┐ ││ ▲ │ ││ └─────────────────────────────────────────┘ ││ ││ Backoff Times: ││ 1st crash: wait 10s ││ 2nd crash: wait 20s ││ 3rd crash: wait 40s ││ 4th crash: wait 80s ││ 5th crash: wait 160s ││ 6th+ crash: wait 300s (5 min max) ││ ││ After 10 minutes of running successfully, timer resets │└──────────────────────────────────────────────────────────────┘2.2 CrashLoopBackOff Investigation
Pause and predict: If a pod has restarted 50 times but currently shows as `Running`, how can you find out why it crashed previously?
Step-by-step approach:
```bash
# Step 1: Check pod status and restart count
k get pod <pod>
# Look at the RESTARTS column

# Step 2: Check events
k describe pod <pod> | grep -A 10 Events

# Step 3: Check current container state
k describe pod <pod> | grep -A 10 "State:"

# Step 4: Check PREVIOUS container logs (crucial!)
k logs <pod> --previous

# Step 5: Check exit code
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```
2.3 Exit Codes Decoded
| Exit Code | Signal | Meaning | Common Cause |
|---|---|---|---|
| 0 | - | Success | Clean exit; with `restartPolicy: Always` even a clean exit is restarted, so a short-lived command can still back off |
| 1 | - | Application error | App logic error, missing config |
| 2 | - | Misuse of shell builtin | Script error |
| 126 | - | Command not executable | Permission issue |
| 127 | - | Command not found | Wrong entrypoint/command |
| 128+N | Signal N | Killed by signal | Fatal error raised by OS |
| 137 | SIGKILL (9) | Force killed | OOMKilled, or kill -9 |
| 139 | SIGSEGV (11) | Segmentation fault | Application bug |
| 143 | SIGTERM (15) | Graceful termination | Normal shutdown |
| 255 | - | Unknown/Custom error | Application specific fatal error |
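The 128+N rule from the table is plain arithmetic, so any unfamiliar exit code can be decoded quickly. A sketch:

```shell
# Exit codes above 128 mean the container was killed by signal (code - 128):
# 137 -> 128 + 9 (SIGKILL), 143 -> 128 + 15 (SIGTERM), 139 -> 128 + 11 (SIGSEGV).
code=137
if [ "$code" -gt 128 ]; then
  echo "exit $code: killed by signal $((code - 128))"
else
  echo "exit $code: application exit status"
fi
```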
2.4 OOMKilled Investigation
When exit code is 137 or status shows OOMKilled:
```bash
# Check for OOMKilled status
k describe pod <pod> | grep -i oom

# Check memory limits
k get pod <pod> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

# Check actual memory usage (if pod is running)
k top pod <pod>
```
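If the limit turns out to be the culprit, it helps to see where it lives in the manifest. A minimal sketch with hypothetical values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-tuned-app       # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        memory: "256Mi"        # what the scheduler reserves on a node
        cpu: "100m"
      limits:
        memory: "512Mi"        # the container is OOMKilled above this
        cpu: "500m"            # CPU is throttled (not killed) above this
```

Set the memory limit comfortably above the application's observed startup peak, not just its steady-state usage.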
```bash
# Fix: Increase memory limit
k patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"512Mi"}}}]}}}}'
```
2.5 Common CrashLoopBackOff Causes
| Symptom | Diagnosis | Fix |
|---|---|---|
| Exit code 1 | App error | Check logs, fix application |
| Exit code 127 | Command not found | Fix command or args in spec |
| Exit code 137 + OOMKilled | Memory exceeded | Increase memory limit |
| Exit code 137 no OOM | Killed externally | Check liveness probe |
| Container exits immediately | No foreground process | Run a foreground process (e.g. `sleep infinity`) or fix the command |
| Logs show “file not found” | Missing ConfigMap/Secret | Verify mounts exist |
| Logs show “permission denied” | Security context | Fix runAsUser or fsGroup |
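The "no foreground process" row deserves a concrete sketch: a container whose command returns immediately will crash-loop, while a long-running foreground process keeps it alive. Names here are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-shell            # hypothetical name
spec:
  containers:
  - name: shell
    image: busybox:1.36
    # Without this, busybox's default shell exits at once and the pod
    # cycles through CrashLoopBackOff despite "nothing being wrong".
    command: ["sh", "-c", "sleep infinity"]
```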
Part 3: Image Pull Failures
3.1 Image Pull Error Types
Section titled “3.1 Image Pull Error Types”┌──────────────────────────────────────────────────────────────┐│ IMAGE PULL ERROR FLOW ││ ││ Attempt Pull ──▶ ErrImagePull ──▶ ImagePullBackOff ││ │ │ │ ││ │ │ │ ││ (Success) (First failure) (Repeated failures) ││ ││ ErrImagePull causes: ││ • Image doesn't exist ││ • Registry unreachable ││ • Authentication failed ││ • Rate limited (Docker Hub) ││ │└──────────────────────────────────────────────────────────────┘3.2 Diagnosing Image Pull Issues
```bash
# Check events for the specific error
k describe pod <pod> | grep -A 5 "Failed to pull"

# Common error messages:
# "manifest unknown"  - Image tag doesn't exist
# "unauthorized"      - Registry auth failed
# "timeout"           - Registry unreachable
# "toomanyrequests"   - Rate limited
```
3.3 Fixing Image Pull Issues
Wrong image name/tag:
```bash
# Check current image
k get pod <pod> -o jsonpath='{.spec.containers[0].image}'

# Fix with set image
k set image deployment/<name> <container>=<correct-image>

# Or edit directly
k edit deployment <name>
```
Registry authentication:
```bash
# Create registry secret
k create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass \
  --docker-email=user@example.com

# Add to the default ServiceAccount
k patch serviceaccount default -p '{"imagePullSecrets":[{"name":"regcred"}]}'

# Or add to a specific deployment
k patch deployment <name> -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
```
Docker Hub rate limiting:
```bash
# Option 1: Use authenticated pulls
k create secret docker-registry dockerhub \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<username> \
  --docker-password=<token>

# Option 2: Use an alternative registry (gcr.io, quay.io)
# nginx:latest -> gcr.io/google-containers/nginx:latest
```
Part 4: Configuration Issues
Section titled “Part 4: Configuration Issues”4.1 Missing ConfigMap/Secret
Pause and predict: What exact pod status would you expect if a referenced Secret does not exist in the namespace?
Symptoms:
- Pod stuck in ContainerCreating
- Events show “configmap not found” or “secret not found”
Diagnosis:
```bash
# Check what ConfigMaps/Secrets the pod needs
k get pod <pod> -o yaml | grep -A 5 "configMap\|secret"

# Verify they exist
k get configmap
k get secret

# Check a specific one
k describe configmap <name>
```
Fix:
```bash
# Create missing ConfigMap
k create configmap <name> --from-literal=key=value

# Create missing Secret
k create secret generic <name> --from-literal=password=secret

# If you have the data file
k create configmap <name> --from-file=config.yaml
k create secret generic <name> --from-file=credentials.json
```
4.2 Incorrect ConfigMap/Secret Keys
Symptoms:
- Container starts but app fails
- Logs show “file not found” or “key not found”
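The mismatch usually lives in a `configMapKeyRef`: the `key` field must exactly match a key in the ConfigMap's `data`. A sketch, with hypothetical names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-app                  # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    env:
    - name: DB_HOST              # what the application reads
      valueFrom:
        configMapKeyRef:
          name: app-config       # ConfigMap must exist in this namespace
          key: db_host           # must exist in app-config's data
          # optional: true       # uncomment to let the pod start even if missing
```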
Diagnosis:
```bash
# Check what keys exist in the ConfigMap
k get configmap <name> -o yaml

# Check the pod's expected keys
k get pod <pod> -o yaml | grep -A 10 configMapKeyRef

# Compare expected vs actual
```
Fix:
```bash
# Add missing key to ConfigMap
k patch configmap <name> -p '{"data":{"missing-key":"value"}}'

# Or recreate
k create configmap <name> --from-literal=key1=val1 --from-literal=key2=val2 --dry-run=client -o yaml | k apply -f -
```
4.3 Environment Variable Issues
```bash
# Check environment variables in the running container
k exec <pod> -- env

# Check what's defined in the spec
k get pod <pod> -o jsonpath='{.spec.containers[0].env[*]}'

# Common issue: ConfigMap key name doesn't match the reference in the env var
# Check with:
k get pod <pod> -o yaml | grep -A 5 valueFrom
```
Part 5: Deployment Rollout Failures
Section titled “Part 5: Deployment Rollout Failures”5.1 Stuck Deployments
Stop and think: If `kubectl rollout status` hangs indefinitely, what object should you describe next to find the actual pod creation errors?
Symptoms:
- `k rollout status deployment/<name>` hangs
- Old and new ReplicaSets both exist
- Pods not reaching Ready state
```bash
# Check deployment status
k get deployment <name>
k describe deployment <name>

# Check ReplicaSets
k get rs -l app=<name>

# Check pods from the new ReplicaSet
k get pods -l app=<name>
```
5.2 Common Rollout Issues
```
DEPLOYMENT ROLLOUT STATES

Progressing                        Stuck
  New RS scaling up                  New RS pods failing
  Old RS scaling down                Old RS still running

The rollout waits for new pods to become Ready.
If the new pods never become Ready, the rollout stalls.
```
Investigation:
```bash
# Check deployment conditions
k describe deployment <name> | grep -A 10 Conditions

# Check the new ReplicaSet's pods
NEW_RS=$(k get rs -l app=<name> --sort-by='.metadata.creationTimestamp' -o name | tail -1)
k describe $NEW_RS

# Check why pods aren't ready
k get pods -l app=<name> | grep -v Running
k describe pod <failing-pod>
```
5.3 Rollback
When the new version is broken:
```bash
# Check rollout history
k rollout history deployment/<name>

# Rollback to the previous version
k rollout undo deployment/<name>

# Rollback to a specific revision
k rollout undo deployment/<name> --to-revision=2

# Verify the rollback
k rollout status deployment/<name>
```
5.4 Fixing Stuck Rollouts
```bash
# Option 1: Fix the issue and let the rollout continue
k set image deployment/<name> <container>=<fixed-image>

# Option 2: Rollback
k rollout undo deployment/<name>

# Option 3: Force restart (deletes and recreates pods)
k rollout restart deployment/<name>

# Option 4: Scale down then up (nuclear option)
k scale deployment/<name> --replicas=0
k scale deployment/<name> --replicas=3
```
Part 6: Readiness and Liveness Probe Failures
Section titled “Part 6: Readiness and Liveness Probe Failures”6.1 Probe Types Review
```
PROBE TYPES

LIVENESS                           READINESS
Is the container alive?            Is the container ready for traffic?

Failure action:                    Failure action:
  RESTART the container              REMOVE pod from Service endpoints

Use for:                           Use for:
  - Deadlock detection               - Startup dependencies
  - Hung processes                   - Warming caches

Wrong liveness config              Wrong readiness config
  = crash loops                      = no traffic
```
6.2 Diagnosing Probe Failures
```bash
# Check probe configuration
k get pod <pod> -o yaml | grep -A 10 "livenessProbe\|readinessProbe"

# Check for probe failures in events
k describe pod <pod> | grep -i "unhealthy\|probe"

# Test the probe manually
k exec <pod> -- wget -qO- http://localhost:8080/health
k exec <pod> -- cat /tmp/healthy
```
6.3 Common Probe Issues
| Issue | Symptom | Fix |
|---|---|---|
| Wrong port | Probe fails, container works | Fix port in probe spec |
| Wrong path | 404 errors in events | Fix httpGet path |
| Too aggressive | Containers keep restarting | Increase timeoutSeconds, periodSeconds |
| Missing initialDelaySeconds | Fails during startup | Add initialDelaySeconds |
| App slow to start | CrashLoop at startup | Use startupProbe |
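For the slow-start case in the last row, a `startupProbe` holds off the liveness probe until the app is up. A sketch; the endpoint, port, and timings are hypothetical:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5          # probe every 5s during startup
  failureThreshold: 30      # allow up to 5 x 30 = 150s to start
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10         # takes over only after the startup probe succeeds
```

While the startup probe is running, liveness and readiness probes are suspended, so a slow initialization no longer triggers restarts.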
Fix probe timing:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # Wait 30s before first probe
  periodSeconds: 10         # Probe every 10s
  timeoutSeconds: 5         # Timeout after 5s
  failureThreshold: 3       # Restart after 3 failures
```
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Not checking `--previous` | Can’t see crash reason | Always check previous logs for CrashLoop |
| Ignoring init containers | Main container never starts | Check init container logs too |
| Fixing symptoms not cause | Problem recurs | Investigate root cause before fixing |
| Wrong resource units | Unexpected OOM or throttling | Use correct units: Mi, Gi, m |
| Liveness probe too aggressive | Healthy containers killed | Increase timeouts and failure threshold |
| Forgetting imagePullSecrets | Private images fail | Add secrets at ServiceAccount or pod level |
| Using `restartPolicy: Never` for Deployments | Pods won’t restart on failure | Deployments require `Always`; use Jobs for run-once tasks |
| Overlooking namespace | "Pod not found" errors | Always add `-n <namespace>` or set the default namespace context |
Q1: Exit Code Analysis
Stop and think: You deploy a new version of your application. The pod immediately goes into CrashLoopBackOff. When you check the status, the container shows exit code 1, and the logs end with “Error: REDIS_HOST not set”. What exactly happened and how do you fix it?
Answer
The application is missing a required environment variable, which causes the application process to terminate with a generic error (exit code 1). Kubernetes sees the main process exit and restarts the container, leading to CrashLoopBackOff. You can fix this by adding the variable directly using `k set env` or by verifying that the referenced ConfigMap/Secret exists and contains the correct key.
Q2: Image Pull Sequence
Pause and predict: A developer typoed an image name as `nginx:1.255` in a new Deployment. You decide to watch the pod status. What sequence of statuses will you observe over the next 5 minutes?
Answer
The pod will first show `ErrImagePull`, and then transition to `ImagePullBackOff`. Why? `ErrImagePull` is the immediate status after the first failed attempt to pull the non-existent image tag. After Kubernetes retries and fails again, it enters `ImagePullBackOff`, which uses exponential backoff to space out retry attempts, preventing registry hammering.
Q3: Pending Diagnosis
Stop and think: You apply a pod manifest and it remains stuck in the Pending state. `kubectl describe pod` reveals the message: “0/3 nodes are available: 3 Insufficient memory”. You know all 3 nodes have 8GB RAM physically available. What is actually preventing the pod from scheduling?
Answer
The Kubernetes scheduler cannot find a node with enough unallocated memory to satisfy the pod's memory `requests`. Even if nodes have physical memory available, the scheduler only looks at what has been formally requested by other running pods. To fix this, you must either reduce the memory requests of the pending pod, delete other pods to free up allocated capacity, or add a new node to the cluster.
Q4: CrashLoopBackOff Max
Pause and predict: A pod has been in CrashLoopBackOff for 2 hours due to a bad configuration. You finally locate the issue and fix the ConfigMap it depends on. How long might you have to wait for Kubernetes to automatically restart the container, and why?
Answer
You might have to wait up to **5 minutes (300 seconds)**. Why? Kubernetes uses an exponential backoff delay for crashing containers (10s, 20s, 40s, 80s, 160s, up to a maximum of 300s). Since the pod has been crashing for 2 hours, it has hit the 5-minute maximum cap and will only retry once every 5 minutes. You can manually delete the pod to force an immediate restart instead of waiting.
Q5: Init Container Failure
Stop and think: A new Deployment scales up, but the pod’s main container never starts. You run `kubectl logs <pod>` and the output is completely empty. What hidden component should you investigate next?
Answer
You need to check the logs of the pod's **init containers** using `kubectl logs <pod> -c <init-container-name>`. By default, `kubectl logs` shows the first main container, which produced no output because it never started; if an init container is failing, its logs usually reveal the blocking error.
Q6: Rollback Decision
Pause and predict: You trigger an image update on a critical Deployment. The rollout gets stuck midway. You see that the old ReplicaSet still has 2 running pods, but the new ReplicaSet has 1 pod stuck in CrashLoopBackOff. What is the fastest way to restore full capacity safely?
Answer
Executing `kubectl rollout undo deployment/<name>` is the fastest safe option. The rollback reuses the old ReplicaSet, whose image is already proven to work, so the Deployment scales it back to full replica count while the failing new ReplicaSet is scaled down. You can then diagnose the broken image without user impact.
Q7: Readiness vs Liveness
Stop and think: Your Java application takes 30 seconds to load data into memory before it can serve requests. You configure a liveness probe that checks the `/health` endpoint immediately on startup. What happens to the pod?
Answer
The container will likely enter a CrashLoopBackOff state before it ever finishes loading. Why? Because the liveness probe starts checking immediately (without an `initialDelaySeconds`) and fails because the app isn't ready. A failed liveness probe instructs Kubernetes to restart the container, actively interrupting the 30-second initialization. To fix this, you should use a `startupProbe` to cover the loading period, or add an `initialDelaySeconds` to the liveness probe.
Hands-On Exercise: Application Failure Scenarios
Scenario
Practice diagnosing and fixing various application failures.
```bash
# Create namespace
k create ns app-debug-lab
```
Scenario 1: CrashLoopBackOff
```bash
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: crash-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ['sh', '-c', 'echo "Starting..."; exit 1']
EOF
```
Task: Find why it’s crashing and what exit code it has.
Solution
```bash
k logs crash-app -n app-debug-lab --previous
k get pod crash-app -n app-debug-lab -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit code 1 - the command explicitly exits with an error
```
Scenario 2: Missing ConfigMap
```bash
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: config-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: config
      mountPath: /etc/app
  volumes:
  - name: config
    configMap:
      name: app-settings
EOF
```
Task: Find why it’s stuck in ContainerCreating and fix it.
Solution
```bash
# Diagnose
k describe pod config-app -n app-debug-lab | grep -A 5 Events
# "configmap "app-settings" not found"

# Fix
k create configmap app-settings -n app-debug-lab --from-literal=key=value

# Verify
k get pod config-app -n app-debug-lab
```
Scenario 3: Wrong Image Tag
```bash
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: image-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: nginx:v99.99.99
EOF
```
Task: Diagnose and fix the image pull failure.
Solution
```bash
# Diagnose
k describe pod image-app -n app-debug-lab | grep -A 5 "Failed\|Error"
# "manifest for nginx:v99.99.99 not found"

# Fix - delete and recreate with the correct image
k delete pod image-app -n app-debug-lab
k run image-app -n app-debug-lab --image=nginx:1.25
```
Scenario 4: Resource Constraint (OOM)
```bash
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: progrium/stress
    args: ['--vm', '1', '--vm-bytes', '500M']
    resources:
      limits:
        memory: "100Mi"
EOF
```
Task: Diagnose why the container keeps getting killed.
Solution
```bash
# Diagnose
k describe pod oom-app -n app-debug-lab | grep -i oom
k get pod oom-app -n app-debug-lab -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled"

# The container tries to use 500MB but only has a 100Mi limit
# Fix: increase the memory limit or reduce app memory usage
```
Success Criteria
- Identified crash-app exit code as 1
- Created missing ConfigMap for config-app
- Fixed wrong image tag for image-app
- Identified OOMKilled status for oom-app
Cleanup
```bash
k delete ns app-debug-lab
```
Practice Drills
Drill 1: Quick Pod Status (30 sec)
```bash
# Task: Show all pods with restart count > 0
k get pods -A -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' | awk '$2 > 0'
```
Drill 2: Previous Logs (30 sec)
```bash
# Task: Get last 50 lines from the previous container instance
k logs <pod> --previous --tail=50
```
Drill 3: Exit Code Check (1 min)
```bash
# Task: Get exit code from a crashed container
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Or from describe:
k describe pod <pod> | grep "Exit Code"
```
Drill 4: Image Fix (1 min)
```bash
# Task: Update image in a deployment
k set image deployment/<name> <container>=<new-image>
```
Drill 5: Create Missing ConfigMap (1 min)
```bash
# Task: Create ConfigMap from literal
k create configmap <name> --from-literal=key=value
# From file
k create configmap <name> --from-file=<filename>
```
Drill 6: Environment Variable Debug (1 min)
```bash
# Task: Check all env vars in a running container
k exec <pod> -- env | sort
```
Drill 7: Rollback Deployment (1 min)
```bash
# Task: Rollback to the previous version
k rollout undo deployment/<name>
k rollout status deployment/<name>
```
Drill 8: Check Probe Config (1 min)
```bash
# Task: View probe configuration
k get pod <pod> -o yaml | grep -A 15 livenessProbe
k get pod <pod> -o yaml | grep -A 15 readinessProbe
```
Next Module
Continue to Module 5.3: Control Plane Failures to learn how to troubleshoot API server, scheduler, controller manager, and etcd issues.