
Module 5.2: Application Failures

Hands-On Lab Available: K8s Cluster, intermediate, 45 min (opens in Killercoda)

Complexity: [MEDIUM] - Most common troubleshooting scenarios

Time to Complete: 45-55 minutes

Prerequisites: Module 5.1 (Troubleshooting Methodology), Module 2.1-2.7 (Workloads)


After this module, you will be able to:

  • Diagnose CrashLoopBackOff, ImagePullBackOff, and CreateContainerConfigError systematically
  • Fix application failures caused by wrong images, missing ConfigMaps, incorrect probes, and resource limits
  • Debug multi-container pod failures by identifying which container is failing and why
  • Trace an application failure from symptom (pod not running) to root cause (specific configuration error)

Application failures are the most common issues you’ll encounter - both in the exam and in production. A pod that won’t start, a container that keeps crashing, or a deployment that won’t roll out are daily occurrences. Mastering application troubleshooting is essential for any Kubernetes administrator.

The Restaurant Kitchen Analogy

Think of pods as dishes being prepared in a kitchen. Sometimes the dish fails because of a bad recipe (wrong image), sometimes the ingredients are missing (ConfigMap/Secret), sometimes the chef runs out of space (resources), and sometimes the dish just doesn’t come out right (application bug). Each failure has different symptoms and different fixes.


By the end of this module, you’ll be able to:

  • Troubleshoot pods that won’t start
  • Diagnose CrashLoopBackOff containers
  • Fix image pull failures
  • Resolve configuration issues
  • Handle resource constraints and OOM kills
  • Debug deployment rollout problems

  • CrashLoopBackOff has exponential backoff: It starts at 10s, then 20s, 40s, up to 5 minutes between restart attempts.
  • Init containers run first: If init containers fail, main containers never start - many people forget to check them.
  • ImagePullBackOff vs ErrImagePull: ErrImagePull is the first failure, ImagePullBackOff is after multiple retries.
  • OOMKilled doesn’t always mean a memory leak: It can simply mean your limits are set lower than the application’s baseline startup requirement.
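The backoff schedule can be sketched with a few lines of shell (this is a sketch of the doubling rule, not kubelet's actual implementation):

```shell
# Sketch of CrashLoopBackOff: the restart delay doubles per crash, capped at 300s
delay=10
for crash in 1 2 3 4 5 6 7; do
  echo "crash ${crash}: wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
# crash 6 and onward wait the maximum 300s
```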

Understanding what happens when a pod starts:

┌──────────────────────────────────────────────────────────────┐
│ POD STARTUP SEQUENCE │
│ │
│ 1. Scheduling 2. Preparation 3. Startup │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Pending │────▶│ Container │───▶│ Init │ │
│ │ │ │ Creating │ │ Containers │ │
│ └──────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ • Node selection • Pull images • Run in order │
│ • Resource check • Mount volumes • Each must exit 0 │
│ • Taints/affinity • Setup network • Sequential only │
│ │
│ 4. Running 5. Ready │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Main │─▶│ Readiness │ │
│ │ Containers │ │ Probes Pass │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ • Start all • Pod marked Ready │
│ • Run probes • Added to Service │
│ │
└──────────────────────────────────────────────────────────────┘

When a pod is stuck in Pending:

Terminal window
# Check why pod is pending
k describe pod <pod> | grep -A 10 Events

Common causes:

| Message | Cause | Solution |
| --- | --- | --- |
| `0/3 nodes available` | No suitable nodes | Check node taints, affinity rules |
| `Insufficient cpu` | Not enough CPU | Reduce requests or add capacity |
| `Insufficient memory` | Not enough memory | Reduce requests or add capacity |
| `node(s) had taint that pod didn't tolerate` | Taints blocking | Add tolerations or remove taints |
| `node(s) didn't match node selector` | nodeSelector mismatch | Fix labels or selector |
| `persistentvolumeclaim not found` | PVC missing | Create PVC |
| `persistentvolumeclaim not bound` | No matching PV | Check StorageClass, create PV |

Investigation commands:

Terminal window
# Check node resources
k describe nodes | grep -A 5 "Allocated resources"
k top nodes
# Check node taints
k get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
# Check node labels (for nodeSelector)
k get nodes --show-labels
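If Events point at a taint, the pod-side fix is a matching toleration. A minimal sketch, assuming a hypothetical taint `dedicated=gpu:NoSchedule` on the node:

```yaml
# Hypothetical example: let the pod schedule onto nodes tainted dedicated=gpu:NoSchedule
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
```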

1.3 ContainerCreating - Preparation Issues


Stop and think: If a pod is stuck in ContainerCreating for 5 minutes, what is the most likely external dependency causing the hang?

When a pod is stuck in ContainerCreating:

Terminal window
# Always check Events first
k describe pod <pod> | grep -A 15 Events

Common causes:

| Message | Cause | Solution |
| --- | --- | --- |
| `pulling image` (stuck) | Slow/large image | Wait, or use a smaller image |
| `ImagePullBackOff` | Wrong image name | Fix image reference |
| `ErrImagePull` | Registry auth failed | Check imagePullSecrets |
| `MountVolume.SetUp failed` | Volume mount issue | Check that PVC, ConfigMap, Secret exists |
| `configmap not found` | Missing ConfigMap | Create ConfigMap |
| `secret not found` | Missing Secret | Create Secret |
| `network not ready` | CNI issues | Check CNI pods |

Investigation commands:

Terminal window
# Check image pull issues
k get events --field-selector involvedObject.name=<pod>
# Check if ConfigMap/Secret exists
k get configmap <name>
k get secret <name>
# Check PVC status
k get pvc
k describe pvc <name>

┌──────────────────────────────────────────────────────────────┐
│ CRASHLOOPBACKOFF CYCLE │
│ │
│ Container Start ──▶ Container Crash ──▶ Wait ──┐ │
│ ▲ │ │
│ └─────────────────────────────────────────┘ │
│ │
│ Backoff Times: │
│ 1st crash: wait 10s │
│ 2nd crash: wait 20s │
│ 3rd crash: wait 40s │
│ 4th crash: wait 80s │
│ 5th crash: wait 160s │
│ 6th+ crash: wait 300s (5 min max) │
│ │
│ After 10 minutes of running successfully, timer resets │
└──────────────────────────────────────────────────────────────┘

Pause and predict: If a pod has restarted 50 times but currently shows as Running, how can you find out why it crashed previously?

Step-by-step approach:

Terminal window
# Step 1: Check pod status and restart count
k get pod <pod>
# Look at RESTARTS column
# Step 2: Check events
k describe pod <pod> | grep -A 10 Events
# Step 3: Check current container state
k describe pod <pod> | grep -A 10 "State:"
# Step 4: Check PREVIOUS container logs (crucial!)
k logs <pod> --previous
# Step 5: Check exit code
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

| Exit Code | Signal | Meaning | Common Cause |
| --- | --- | --- | --- |
| 0 | - | Success | Normal exit (shouldn't cause CrashLoop) |
| 1 | - | Application error | App logic error, missing config |
| 2 | - | Misuse of shell builtin | Script error |
| 126 | - | Command not executable | Permission issue |
| 127 | - | Command not found | Wrong entrypoint/command |
| 128+N | Signal N | Killed by signal | Fatal error raised by OS |
| 137 | SIGKILL (9) | Force killed | OOMKilled, or kill -9 |
| 139 | SIGSEGV (11) | Segmentation fault | Application bug |
| 143 | SIGTERM (15) | Graceful termination | Normal shutdown |
| 255 | - | Unknown/custom error | Application-specific fatal error |
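The 128+N convention can be decoded mechanically; a small shell sketch:

```shell
# Decode a container exit code: values above 128 mean the process was
# killed by signal (code - 128); e.g. 137 = 128 + 9 (SIGKILL)
exit_code=137
if [ "$exit_code" -gt 128 ]; then
  signal=$((exit_code - 128))
  echo "killed by signal ${signal}"
else
  echo "application exited with code ${exit_code}"
fi
# prints: killed by signal 9
```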

When exit code is 137 or status shows OOMKilled:

Terminal window
# Check for OOMKilled status
k describe pod <pod> | grep -i oom
# Check memory limits
k get pod <pod> -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Check actual memory usage (if pod is running)
k top pod <pod>
# Fix: Increase memory limit
k patch deployment <name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"512Mi"}}}]}}}}'
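The same fix in declarative form, with illustrative values; set the limit above the application's observed peak usage rather than guessing:

```yaml
# Illustrative values: request the baseline, limit above observed peak
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"
```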

| Symptom | Diagnosis | Fix |
| --- | --- | --- |
| Exit code 1 | App error | Check logs, fix application |
| Exit code 127 | Command not found | Fix command or args in spec |
| Exit code 137 + OOMKilled | Memory exceeded | Increase memory limit |
| Exit code 137, no OOMKilled | Killed externally | Check liveness probe |
| Container exits immediately | No foreground process | Add sleep infinity or fix command |
| Logs show "file not found" | Missing ConfigMap/Secret | Verify mounts exist |
| Logs show "permission denied" | Security context | Fix runAsUser or fsGroup |

┌──────────────────────────────────────────────────────────────┐
│ IMAGE PULL ERROR FLOW │
│ │
│ Attempt Pull ──▶ ErrImagePull ──▶ ImagePullBackOff │
│ │ │ │ │
│ │ │ │ │
│ (Success) (First failure) (Repeated failures) │
│ │
│ ErrImagePull causes: │
│ • Image doesn't exist │
│ • Registry unreachable │
│ • Authentication failed │
│ • Rate limited (Docker Hub) │
│ │
└──────────────────────────────────────────────────────────────┘
Terminal window
# Check events for specific error
k describe pod <pod> | grep -A 5 "Failed to pull"
# Common error messages:
# "manifest unknown" - Image tag doesn't exist
# "unauthorized" - Registry auth failed
# "timeout" - Registry unreachable
# "toomanyrequests" - Rate limited

Wrong image name/tag:

Terminal window
# Check current image
k get pod <pod> -o jsonpath='{.spec.containers[0].image}'
# Fix with set image
k set image deployment/<name> <container>=<correct-image>
# Or edit directly
k edit deployment <name>

Registry authentication:

Terminal window
# Create registry secret
k create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass \
  --docker-email=user@example.com
# Add to pod spec
k patch serviceaccount default -p '{"imagePullSecrets":[{"name":"regcred"}]}'
# Or add to specific deployment
k patch deployment <name> -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
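For reference, the same imagePullSecrets wiring as it appears in a pod template (using the `regcred` secret name from the command above; the image is illustrative):

```yaml
# Pod template fragment: reference the registry secret created earlier
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: registry.example.com/app:1.0
```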

Docker Hub rate limiting:

Terminal window
# Option 1: Use authenticated pulls
k create secret docker-registry dockerhub \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<username> \
  --docker-password=<token>
# Option 2: Use alternative registry (gcr.io, quay.io)
# nginx:latest -> gcr.io/google-containers/nginx:latest

Pause and predict: What exact pod status would you expect if a referenced Secret does not exist in the namespace?

Symptoms:

  • Pod stuck in ContainerCreating
  • Events show “configmap not found” or “secret not found”

Diagnosis:

Terminal window
# Check what ConfigMaps/Secrets the pod needs
k get pod <pod> -o yaml | grep -A 5 "configMap\|secret"
# Verify they exist
k get configmap
k get secret
# Check specific one
k describe configmap <name>

Fix:

Terminal window
# Create missing ConfigMap
k create configmap <name> --from-literal=key=value
# Create missing Secret
k create secret generic <name> --from-literal=password=secret
# If you have the data file
k create configmap <name> --from-file=config.yaml
k create secret generic <name> --from-file=credentials.json
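If the configuration is genuinely optional for the app, you can instead mark the reference `optional: true` so a missing ConfigMap no longer blocks pod startup (a sketch; the names are illustrative):

```yaml
# Volume fragment: the pod starts even if the ConfigMap is absent
volumes:
- name: config
  configMap:
    name: app-settings
    optional: true
```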

Symptoms:

  • Container starts but app fails
  • Logs show “file not found” or “key not found”

Diagnosis:

Terminal window
# Check what keys exist in ConfigMap
k get configmap <name> -o yaml
# Check pod's expected keys
k get pod <pod> -o yaml | grep -A 10 configMapKeyRef
# Compare expected vs actual

Fix:

Terminal window
# Add missing key to ConfigMap
k patch configmap <name> -p '{"data":{"missing-key":"value"}}'
# Or recreate
k create configmap <name> --from-literal=key1=val1 --from-literal=key2=val2 --dry-run=client -o yaml | k apply -f -
Terminal window
# Check environment variables in running container
k exec <pod> -- env
# Check what's defined in spec
k get pod <pod> -o jsonpath='{.spec.containers[0].env[*]}'
# Common issue: ConfigMap key name doesn't match env var name
# Check with:
k get pod <pod> -o yaml | grep -A 5 valueFrom
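A correct `valueFrom` wiring looks like this; note that the env var name and the ConfigMap key are independent names, and each must be spelled exactly (both are illustrative here):

```yaml
# Env fragment: REDIS_HOST is the variable name inside the container;
# redis-host is the key inside the ConfigMap app-settings
env:
- name: REDIS_HOST
  valueFrom:
    configMapKeyRef:
      name: app-settings
      key: redis-host
```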

Stop and think: If kubectl rollout status hangs indefinitely, what object should you describe next to find the actual pod creation errors?

Symptoms:

  • k rollout status deployment/<name> hangs
  • Old and new ReplicaSets both exist
  • Pods not reaching Ready state
Terminal window
# Check deployment status
k get deployment <name>
k describe deployment <name>
# Check ReplicaSets
k get rs -l app=<name>
# Check pods from new ReplicaSet
k get pods -l app=<name>
┌──────────────────────────────────────────────────────────────┐
│ DEPLOYMENT ROLLOUT STATES │
│ │
│ Progressing Stuck │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ New RS │ │ New RS │ │
│ │ scaling up │ │ pods failing │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Old RS │ │ Old RS │ │
│ │ scaling down │ │ still running│ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Rollout waits for new pods to become Ready │
│ If pods never Ready, rollout stalls │
│ │
└──────────────────────────────────────────────────────────────┘

Investigation:

Terminal window
# Check deployment conditions
k describe deployment <name> | grep -A 10 Conditions
# Check new ReplicaSet's pods
NEW_RS=$(k get rs -l app=<name> --sort-by='.metadata.creationTimestamp' -o name | tail -1)
k describe $NEW_RS
# Check why pods aren't ready
k get pods -l app=<name> | grep -v Running
k describe pod <failing-pod>
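A related knob: a Deployment reports ProgressDeadlineExceeded in its Conditions when it cannot make progress within progressDeadlineSeconds (default 600). Lowering it makes stuck rollouts surface faster; the value below is illustrative:

```yaml
# Deployment spec fragment: fail the rollout status check after 120s of no progress
spec:
  progressDeadlineSeconds: 120
```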

When new version is broken:

Terminal window
# Check rollout history
k rollout history deployment/<name>
# Rollback to previous version
k rollout undo deployment/<name>
# Rollback to specific revision
k rollout undo deployment/<name> --to-revision=2
# Verify rollback
k rollout status deployment/<name>
Terminal window
# Option 1: Fix the issue and let rollout continue
k set image deployment/<name> <container>=<fixed-image>
# Option 2: Rollback
k rollout undo deployment/<name>
# Option 3: Force restart (deletes and recreates pods)
k rollout restart deployment/<name>
# Option 4: Scale down then up (nuclear option)
k scale deployment/<name> --replicas=0
k scale deployment/<name> --replicas=3

Part 6: Readiness and Liveness Probe Failures

┌──────────────────────────────────────────────────────────────┐
│ PROBE TYPES │
│ │
│ LIVENESS READINESS │
│ Is container alive? Is container ready? │
│ │
│ Failure action: Failure action: │
│ RESTART container REMOVE from service │
│ │
│ Use for: Use for: │
│ • Deadlock detection • Startup dependencies │
│ • Hung processes • Warming caches │
│ │
│ ⚠️ Wrong liveness config ⚠️ Wrong readiness config │
│ = crash loops = no traffic │
│ │
└──────────────────────────────────────────────────────────────┘
Terminal window
# Check probe configuration
k get pod <pod> -o yaml | grep -A 10 "livenessProbe\|readinessProbe"
# Check for probe failures in events
k describe pod <pod> | grep -i "unhealthy\|probe"
# Test probe manually
k exec <pod> -- wget -qO- http://localhost:8080/health
k exec <pod> -- cat /tmp/healthy

| Issue | Symptom | Fix |
| --- | --- | --- |
| Wrong port | Probe fails, container works | Fix port in probe spec |
| Wrong path | 404 errors in events | Fix httpGet path |
| Too aggressive | Containers keep restarting | Increase timeoutSeconds, periodSeconds |
| Missing initialDelaySeconds | Fails during startup | Add initialDelaySeconds |
| App slow to start | CrashLoop at startup | Use startupProbe |

Fix probe timing:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # Wait 30s before first probe
  periodSeconds: 10        # Probe every 10s
  timeoutSeconds: 5        # Timeout after 5s
  failureThreshold: 3      # Restart after 3 failures
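If the app is slow to start, a startupProbe is usually the better fix: liveness and readiness checks are suspended until the startup probe first succeeds. A sketch with illustrative numbers:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10        # Check every 10s during startup
  failureThreshold: 30     # Allow up to 300s to start before restarting
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10        # Liveness checks begin only after startup succeeds
```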

| Mistake | Problem | Solution |
| --- | --- | --- |
| Not checking `--previous` | Can't see crash reason | Always check previous logs for CrashLoop |
| Ignoring init containers | Main container never starts | Check init container logs too |
| Fixing symptoms, not cause | Problem recurs | Investigate root cause before fixing |
| Wrong resource units | Unexpected OOM or throttling | Use correct units: Mi, Gi, m |
| Liveness probe too aggressive | Healthy containers killed | Increase timeouts and failure threshold |
| Forgetting imagePullSecrets | Private images fail | Add secrets at ServiceAccount or pod level |
| Using `restartPolicy: Never` for Deployments | Pods won't restart on failure | Deployments require Always; use Jobs for run-once tasks |
| Overlooking namespace | "Pod not found" errors | Always add `-n <namespace>` or set the default namespace context |

Stop and think: You deploy a new version of your application. The pod immediately goes into CrashLoopBackOff. When you check the status, the container shows exit code 1, and the logs end with “Error: REDIS_HOST not set”. What exactly happened and how do you fix it?

Answer: The application is missing a required environment variable, which causes the process to exit with a generic error (exit code 1). Kubernetes sees the main process exit and restarts the container, producing CrashLoopBackOff. Fix it by adding the variable directly with `k set env`, or by verifying that the referenced ConfigMap/Secret exists and contains the correct key.

Pause and predict: A developer typoed an image name as nginx:1.255 in a new Deployment. You decide to watch the pod status. What sequence of statuses will you observe over the next 5 minutes?

Answer: The pod first shows `ErrImagePull`, then transitions to `ImagePullBackOff`. `ErrImagePull` is the immediate status after the first failed attempt to pull the non-existent tag; after repeated failures, Kubernetes enters `ImagePullBackOff`, which spaces out retries with exponential backoff to avoid hammering the registry.

Stop and think: You apply a pod manifest and it remains stuck in Pending state. kubectl describe pod reveals the message: “0/3 nodes are available: 3 Insufficient memory”. You know all 3 nodes have 8GB RAM physically available. What is actually preventing the pod from scheduling?

Answer: The scheduler cannot find a node with enough *unallocated* memory to satisfy the pod's memory `requests`. Even if nodes have physical memory free, the scheduler only counts what other running pods have formally requested. To fix this, reduce the pending pod's memory requests, delete other pods to free allocated capacity, or add a node to the cluster.

Pause and predict: A pod has been in CrashLoopBackOff for 2 hours due to a bad configuration. You finally locate the issue and fix the ConfigMap it depends on. How long might you have to wait for Kubernetes to automatically restart the container, and why?

Answer: Up to **5 minutes (300 seconds)**. Kubernetes uses exponential backoff for crashing containers (10s, 20s, 40s, 80s, 160s, capped at 300s). After 2 hours of crashing, the pod has hit the cap and retries only once every 5 minutes. Delete the pod manually to force an immediate restart instead of waiting.

Stop and think: A new Deployment scales up, but the pod’s main container never starts. You run kubectl logs <pod> and the output is completely empty. What hidden component should you investigate next?

Answer: Check the logs of the pod's **init containers** with `kubectl logs <pod> -c <init-container-name>`. Kubernetes runs init containers sequentially before any app containers start; if one exits with an error or hangs, the pod stays in the initialization phase and the main container's process is never launched, so its logs are empty.

Pause and predict: You trigger an image update on a critical Deployment. The rollout gets stuck midway. You see that the old ReplicaSet still has 2 running pods, but the new ReplicaSet has 1 pod stuck in CrashLoopBackOff. What is the fastest way to restore full capacity safely?

Answer: `kubectl rollout undo deployment/<name>` is the quickest safe fix. It immediately reverts the Deployment to its previous stable revision, scaling the old ReplicaSet back to full capacity and terminating the failing pods in the new one. Once the system is stable, investigate the root cause of the CrashLoopBackOff in the failed revision without impacting production traffic.

Stop and think: Your Java application takes 30 seconds to load data into memory before it can serve requests. You configure a liveness probe that checks the /health endpoint immediately on startup. What happens to the pod?

Answer: The container will likely enter CrashLoopBackOff before it ever finishes loading. The liveness probe starts checking immediately (no `initialDelaySeconds`) and fails because the app isn't ready; a failed liveness probe tells Kubernetes to restart the container, interrupting the 30-second initialization. Use a `startupProbe` to cover the loading period, or add `initialDelaySeconds` to the liveness probe.

Hands-On Exercise: Application Failure Scenarios


Practice diagnosing and fixing various application failures.

Terminal window
# Create namespace
k create ns app-debug-lab
Terminal window
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: crash-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ['sh', '-c', 'echo "Starting..."; exit 1']
EOF

Task: Find why it’s crashing and what exit code it has.

Solution
Terminal window
k logs crash-app -n app-debug-lab --previous
k get pod crash-app -n app-debug-lab -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit code 1 - the command explicitly exits with error
Terminal window
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: config-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: config
      mountPath: /etc/app
  volumes:
  - name: config
    configMap:
      name: app-settings
EOF

Task: Find why it’s stuck in ContainerCreating and fix it.

Solution
Terminal window
# Diagnose
k describe pod config-app -n app-debug-lab | grep -A 5 Events
# "configmap "app-settings" not found"
# Fix
k create configmap app-settings -n app-debug-lab --from-literal=key=value
# Verify
k get pod config-app -n app-debug-lab
Terminal window
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: image-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: nginx:v99.99.99
EOF

Task: Diagnose and fix the image pull failure.

Solution
Terminal window
# Diagnose
k describe pod image-app -n app-debug-lab | grep -A 5 "Failed\|Error"
# "manifest for nginx:v99.99.99 not found"
# Fix - delete and recreate with correct image
k delete pod image-app -n app-debug-lab
k run image-app -n app-debug-lab --image=nginx:1.25
Terminal window
cat <<'EOF' | k apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-app
  namespace: app-debug-lab
spec:
  containers:
  - name: app
    image: progrium/stress
    args: ['--vm', '1', '--vm-bytes', '500M']
    resources:
      limits:
        memory: "100Mi"
EOF

Task: Diagnose why the container keeps getting killed.

Solution
Terminal window
# Diagnose
k describe pod oom-app -n app-debug-lab | grep -i oom
k get pod oom-app -n app-debug-lab -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# "OOMKilled"
# The container tries to use 500MB but only has 100Mi limit
# Fix: increase memory limit or reduce app memory usage
  • Identified crash-app exit code as 1
  • Created missing ConfigMap for config-app
  • Fixed wrong image tag for image-app
  • Identified OOMKilled status for oom-app
Terminal window
k delete ns app-debug-lab

Terminal window
# Task: Show all pods with restart count > 0
k get pods -A -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' | awk 'NR>1 && $2 > 0'
Terminal window
# Task: Get last 50 lines from previous container instance
k logs <pod> --previous --tail=50
Terminal window
# Task: Get exit code from crashed container
k get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Or from describe:
k describe pod <pod> | grep "Exit Code"
Terminal window
# Task: Update image in deployment
k set image deployment/<name> <container>=<new-image>
Terminal window
# Task: Create ConfigMap from literal
k create configmap <name> --from-literal=key=value
# From file
k create configmap <name> --from-file=<filename>

Drill 6: Environment Variable Debug (1 min)

Terminal window
# Task: Check all env vars in running container
k exec <pod> -- env | sort
Terminal window
# Task: Rollback to previous version
k rollout undo deployment/<name>
k rollout status deployment/<name>
Terminal window
# Task: View probe configuration
k get pod <pod> -o yaml | grep -A 15 livenessProbe
k get pod <pod> -o yaml | grep -A 15 readinessProbe

Continue to Module 5.3: Control Plane Failures to learn how to troubleshoot API server, scheduler, controller manager, and etcd issues.