Module 6.1: Systematic Troubleshooting
Linux Troubleshooting | Complexity: MEDIUM | Time: 25-30 min
Prerequisites
Before starting this module:
- Required: Module 5.1: USE Method
- Helpful: Experience with production issues
- Helpful: Basic Linux command line familiarity
What You’ll Be Able to Do
After this module, you will be able to:
- Apply the scientific method to system debugging (observe → hypothesize → test → conclude)
- Reproduce issues systematically and document findings for post-incident reviews
- Triage problems by severity and identify the fastest path to resolution
- Avoid common debugging anti-patterns (random changes, ignoring evidence, skipping reproduction)
Why This Module Matters
When systems break, panic leads to random fixes. Methodical troubleshooting finds root causes faster and prevents recurrence. The difference between a 5-minute fix and hours of confusion is often just approach.
Systematic troubleshooting helps you:
- Reduce MTTR — Mean Time To Recovery
- Find root causes — Not just symptoms
- Avoid making things worse — Random changes create chaos
- Document for next time — Same problem won’t take as long
The best troubleshooters aren’t luckier—they’re more methodical.
Did You Know?
- Most issues have simple causes — Disk full, service not running, wrong config. Exotic causes are rare. Check the obvious first.
- “It worked yesterday” is a clue — Something changed. Find the change, find the cause. Check deployments, config changes, system updates.
- Rubber duck debugging works — Explaining the problem to someone (or something) forces you to think through assumptions. Many bugs are found mid-explanation.
- Cognitive bias is real — You’ll focus on recent changes you made. The actual cause might be something else entirely. Stay objective.
Troubleshooting Methodologies
The Scientific Method
```
SCIENTIFIC METHOD

1. OBSERVE
   │  What are the symptoms? What's the actual behavior?
   ▼
2. HYPOTHESIZE
   │  What could cause this? List possibilities.
   ▼
3. PREDICT
   │  If hypothesis X is true, what else should we see?
   ▼
4. TEST
   │  Check the prediction. Does it match?
   ▼
5. CONCLUDE
   │  Hypothesis confirmed? Fix it.
   │  Hypothesis rejected? Next hypothesis.
   ▼
6. ITERATE
   Back to step 2 with new information
```

Stop and think: You get an alert that users cannot check out their shopping carts. You observe that the checkout service is returning 500 errors. Before randomly restarting the service, what are two distinct hypotheses you could form based on the architecture of a typical web application?
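One lightweight way to practice this loop is to keep a running log of each observation, hypothesis, and result as you go. Here is a minimal sketch in shell; the checkout-service scenario and the notes file path are illustrative only:

```shell
#!/bin/sh
# Minimal sketch: record each step of the observe → hypothesize → test loop.
# The checkout-service scenario and notes path are invented for the demo.
NOTES=/tmp/hypothesis-log.txt
: > "$NOTES"   # start a fresh log

observe()     { echo "OBSERVE: $*"    >> "$NOTES"; }
hypothesize() { echo "HYPOTHESIS: $*" >> "$NOTES"; }
result()      { echo "RESULT: $*"     >> "$NOTES"; }

observe "checkout service returning 500 errors"
hypothesize "database connection pool exhausted"
result "rejected - connection count well under the limit"
hypothesize "bad deployment 20 minutes ago"
result "confirmed - errors began at rollout time"

cat "$NOTES"
```

The log doubles as the skeleton of your post-incident timeline, so the documentation step costs nothing extra.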
Divide and Conquer
Binary search for problems:
```bash
# Example: Web request failing
# Full path: Client → DNS → Network → LB → Pod → App → DB

# Step 1: Test middle of path
curl -I http://internal-lb/health
# If this works: problem is before LB (client, DNS, network)
# If this fails: problem is after LB (pod, app, db)

# Step 2: Test half of remaining path
# Continue until isolated
```

Pause and predict: If you run `curl -I http://internal-lb/health` from a client machine and it times out, but running the same command directly from the load balancer node succeeds, which segment of the network path should you investigate first?
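The probing step can be wrapped in a small helper so each split of the path prints a clear OK/FAIL verdict. A sketch, where every URL is a hypothetical placeholder for your real endpoints:

```shell
#!/bin/sh
# Sketch of binary-search probing along the request path.
# All URLs below are placeholders; substitute your real endpoints.
probe() {
  # -s silent, -I headers only, -m 5 gives up after 5 seconds
  if curl -sI -m 5 "$1" >/dev/null 2>&1; then
    echo "OK   $1"
  else
    echo "FAIL $1"
  fi
}

probe "http://internal-lb/health"        # middle of the path first
probe "http://10.0.0.12:8080/healthz"    # then split the failing half: a pod
probe "https://public-site.example/"     # or the client-facing half
```

Each FAIL halves the search space; two or three probes usually pin the broken segment.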
Timeline Analysis
What changed when the problem started?
```bash
# Recent system changes
rpm -qa --last | head -20                      # Package installs
ls -lt /etc/*.conf | head -10                  # Config changes
journalctl --since "1 hour ago" | tail -100    # Recent logs
last -10                                       # Recent logins

# Git history for config management
git log --oneline --since="2 hours ago"

# Kubernetes changes
kubectl get events --sort-by='.lastTimestamp' | tail -20
```

Initial Triage
The First 60 Seconds
Quick system health check:
```bash
# 1. What's happening now?
uptime                     # Load, uptime
dmesg | tail -20           # Kernel messages
journalctl -p err -n 20    # Recent errors

# 2. Resource state
free -h                    # Memory
df -h                      # Disk
top -bn1 | head -15        # CPU and processes

# 3. Network state
ss -tuln                   # Listening ports
ip addr                    # Network interfaces

# 4. What's running?
systemctl --failed         # Failed services
docker ps -a | head -10    # Containers (if applicable)
```

Problem Categories
| Symptom | Likely Area | First Check |
|---|---|---|
| “Can’t connect” | Network | ping, ss, iptables |
| “Slow” | Performance | top, iostat, vmstat |
| “Permission denied” | Security | Permissions, SELinux, AppArmor |
| “Service won’t start” | Service | systemctl status, logs |
| “Out of space” | Storage | df -h, du -sh /* |
| “Process crashed” | Application | Core dumps, logs, dmesg |
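The table can be turned into a tiny lookup helper for new on-call engineers. A sketch, where the symptom keywords are invented for illustration rather than any standard taxonomy:

```shell
#!/bin/sh
# Sketch: map a symptom keyword to the first commands to run, mirroring the
# table above. The keywords and placeholder arguments are illustrative.
first_check() {
  case "$1" in
    connect)    echo "ping <host>; ss -tuln; iptables -L -n" ;;
    slow)       echo "top; iostat -x 1 5; vmstat 1 5" ;;
    permission) echo "ls -l <path>; getenforce" ;;
    service)    echo "systemctl status <unit>; journalctl -u <unit>" ;;
    space)      echo "df -h; du -sh /*" ;;
    crash)      echo "dmesg | tail; coredumpctl list" ;;
    *)          echo "unknown symptom: run the 60-second triage first" ;;
  esac
}

first_check space    # prints: df -h; du -sh /*
```

The point is not the script itself but the habit: every symptom class has a known, boring first check that should happen before any creative theorizing.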
Gathering Information
Questions to Ask
```
WHAT is the problem?
├── Exact error message?
├── What's the expected behavior?
└── What's the actual behavior?

WHEN did it start?
├── Sudden or gradual?
├── Correlates with any event?
└── Time of first occurrence?

WHERE does it happen?
├── All servers or one?
├── All users or some?
└── All requests or specific paths?

WHO is affected?
├── Internal or external users?
├── Specific services?
└── Specific clients?

WHAT changed?
├── Recent deployments?
├── System updates?
├── Configuration changes?
└── Traffic patterns?
```

Reproduce vs Observe
```
REPRODUCTION

Can you reproduce?
│
├── YES → Great! You can test hypotheses directly
│
└── NO  → Work with what you have:
          - Logs from incident time
          - Metrics history
          - User reports
          - Correlation with other events

Intermittent issues:
- Increase logging
- Add monitoring
- Wait for next occurrence with better visibility
```

Common Patterns
“It Was Working Yesterday”
```bash
# Find what changed

# 1. Package changes
rpm -qa --last | head -20                             # RHEL/CentOS
grep -E " (install|upgrade) " /var/log/dpkg.log | tail -20   # Debian/Ubuntu

# 2. Config file changes
find /etc -mtime -1 -type f 2>/dev/null

# 3. Recent deployments
kubectl get deployments -A -o json | \
  jq -r '.items[] | select(.metadata.creationTimestamp > "2024-01-01") | .metadata.name'

# 4. Cron jobs that ran
grep CRON /var/log/syslog | tail -20

# 5. System updates
cat /var/log/apt/history.log | tail -50   # Debian
cat /var/log/dnf.log | tail -50           # RHEL
```

“It Works On My Machine”
```bash
# Environment differences
env                                 # Environment variables
cat /etc/os-release                 # OS version
uname -r                            # Kernel version
which python && python --version    # Language versions

# Network differences
ip route                 # Routing
cat /etc/resolv.conf     # DNS
iptables -L -n           # Firewall

# Configuration differences
diff /etc/app/config.yaml /path/to/other/config.yaml
```

“It’s Slow”
Apply the USE Method:
```bash
# CPU saturation?
uptime
vmstat 1 5

# Memory pressure?
free -h
vmstat 1 5 | awk '{print $7, $8}'   # si/so

# Disk bottleneck?
iostat -x 1 5

# Network?
ss -s
sar -n DEV 1 5

# If all OK, it's application-level
```

Hypothesis Testing
Forming Hypotheses
Symptom: "API returns 500 errors"

Hypotheses (by likelihood):
1. Database connection failed
2. Service crashed/restarting
3. Disk full (can't write logs/temp)
4. Memory exhausted (OOM)
5. Network partition
6. Bad deployment

Test each:

```bash
curl localhost:5432       # 1. Can reach DB?
systemctl status api      # 2. Service running?
df -h                     # 3. Disk space?
dmesg | grep oom          # 4. OOM events?
ping db-server            # 5. Network?
kubectl rollout status    # 6. Deployment?
```

Testing Safely
```bash
# Read-only commands first
cat /var/log/app.log       # Read logs
systemctl status service   # Check status
curl -I endpoint           # Test connectivity

# Non-destructive tests
ping host          # Network reachability
dig domain         # DNS resolution
telnet host port   # Port connectivity

# Careful with write operations
# - Don't restart unless sure
# - Don't change config without backup
# - Don't delete files without understanding

# If you must change something:
cp config.yaml config.yaml.bak   # Backup first
```

Documentation During Troubleshooting
Keep a Log
```bash
# Start a script session
script troubleshooting-$(date +%Y%m%d-%H%M).log

# Or use shell history
history | tail -50

# Note your findings
echo "# $(date): Hypothesis: disk full" >> notes.md
echo "# Result: df shows 95% on /var" >> notes.md
```

Incident Timeline
```
TIME   EVENT
09:00  First alert: API errors
09:02  Checked dashboard: error rate 50%
09:05  SSH to api-server-1
09:06  Checked logs: "Connection refused to db"
09:08  SSH to db-server
09:09  MySQL not running
09:10  dmesg shows OOM kill
09:12  Increased memory limit
09:14  Started MySQL
09:15  API recovering
09:20  Error rate back to normal
```

Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Random restarts | Lose diagnostic info | Check logs/state FIRST |
| Changing multiple things | Don’t know what fixed it | One change at a time |
| Not documenting | Same problem takes same time | Keep notes |
| Assuming the obvious | Miss actual cause | Verify assumptions |
| Tunnel vision | Focus on one thing | Step back, consider all |
| Not asking for help | Waste hours alone | Fresh eyes help |
Question 1
Scenario: You receive a PagerDuty alert at 3 AM stating that the payment processing service has crashed. The dashboard shows a complete drop in successful transactions. Question: What is the most critical first step you must take before attempting to restore the service?
Answer:
Gather state and log information before taking any destructive action.
If you immediately restart the service to restore functionality, you risk destroying the ephemeral evidence (like memory state, temporary files, or specific error logs) needed to determine the root cause. Without this evidence, the service is likely to crash again for the exact same reason. By first capturing the current state (e.g., systemctl status, dmesg, journalctl), you ensure that you have the data necessary to formulate an accurate hypothesis and prevent recurrence.
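That first step can be scripted so it costs only seconds during an incident. A sketch, where the `payments` unit name is a placeholder for whatever service actually crashed:

```shell
#!/bin/sh
# Sketch: snapshot volatile evidence into a timestamped directory BEFORE any
# restart. The "payments" unit name is a placeholder for the crashed service.
DIR=/tmp/evidence-$(date +%Y%m%d-%H%M%S)
mkdir -p "$DIR"

uptime  > "$DIR/uptime.txt"
free -h > "$DIR/memory.txt" 2>&1
df -h   > "$DIR/disk.txt"
dmesg 2>/dev/null | tail -100 > "$DIR/dmesg.txt"
systemctl status payments --no-pager > "$DIR/service-status.txt" 2>&1 || true
journalctl -u payments -n 200 --no-pager > "$DIR/service-logs.txt" 2>&1 || true

echo "Evidence captured in $DIR - now it is safe to restart"
```

With the evidence safely on disk, restoring service and analyzing root cause stop being competing priorities.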
Question 2
Scenario: Customer support reports that “the website is down” for all users. The architecture consists of a CDN, an external load balancer, an internal API gateway, web app pods, and a backend database cluster. Question: How would you apply the divide-and-conquer strategy to isolate the failure in this specific architecture?
Answer:
Test the middle of the request path to eliminate half of the components immediately.
For example, testing the internal API gateway directly bypasses the CDN and external load balancer. If the API gateway returns a healthy response, you instantly know the problem lies upstream (CDN, external LB, or internet routing) and can stop investigating the application code or database. If it fails, you know the issue is at the gateway, the web pods, or the database, allowing you to split the remaining path again.
Question 3
Scenario: A critical background job processing nightly reports has been running successfully for six months. Tonight, it suddenly failed with a “connection timeout” error to the analytics database, despite no scheduled deployments or infrastructure updates occurring today. Question: Why is asking “What changed?” still the most crucial investigative path in this scenario, even when no planned changes occurred?
Answer:
Working systems governed by deterministic code do not spontaneously break without an underlying state change.
Even if no explicit deployments occurred, environmental factors constantly shift: SSL certificates expire, databases grow until disks fill, log files rotate, cloud provider network topologies update, or external API rate limits are reached. By aggressively seeking out what changed in the environment—using timeline analysis for package updates, system events, or metric anomalies—you move away from guessing and directly target the catalyst of the failure.
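Two of those silent changes, an expiring certificate and a quietly filling disk, are quick to rule out. A sketch using `openssl` and `df`; the hostname is a placeholder, and `-checkend` takes seconds (2592000 is 30 days):

```shell
#!/bin/sh
# Sketch: probe for "nothing changed" culprits. The hostname below is a
# placeholder; point it at the real endpoint that started timing out.

# 1. Is the server certificate still valid 30 days out?
check_cert() {
  echo | openssl s_client -connect "$1:$2" -servername "$1" 2>/dev/null |
    openssl x509 -noout -checkend 2592000 >/dev/null 2>&1
}

if check_cert example.com 443; then
  echo "certificate valid for 30+ days"
else
  echo "certificate expiring soon or endpoint unreachable"
fi

# 2. Has any filesystem quietly filled past 80%?
df -P | awk 'NR > 1 && $5 + 0 > 80 {print "disk pressure on", $6, "(" $5 " used)"}'
```

Neither check proves the root cause on its own, but both eliminate common "nothing changed" suspects in under a minute.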
Question 4
Scenario: The primary database node for a high-traffic application has unexpectedly stopped running. Your team lead suggests immediately issuing a systemctl restart mysql command to minimize downtime.
Question: Under what specific conditions should you push back against immediately restarting the database service?
Answer:
You should delay a restart if the root cause of the crash is unknown and the current crashed state holds critical diagnostic data that a restart would wipe out.
Restarting a database clears process memory, resets active connections, and often rotates or overwrites the very logs needed to understand the failure. If the database crashed due to an Out-Of-Memory (OOM) killer or a corrupted disk sector, restarting it without capturing dmesg logs or verifying disk space will likely just cause it to crash again immediately, prolonging the outage while destroying the evidence needed to fix it permanently.
Question 5
Scenario: An application server is exhibiting high latency. A junior engineer logs in and immediately modifies the application’s configuration file to double the connection pool size, restarts the application, and then flushes the Redis cache, hoping one of these actions will speed things up. Question: What core principles of systematic troubleshooting did the engineer violate, and what is the danger of their approach?
Answer:
The engineer violated the principles of “read-only first” and “testing one hypothesis at a time.”
By making multiple, unverified changes simultaneously (config change, restart, and cache flush), it becomes impossible to know which action, if any, resolved the issue. Furthermore, flushing a cache under high latency could cause a massive cache stampede, severely degrading database performance and escalating a minor slowdown into a total system outage. Systematic troubleshooting requires forming a hypothesis, making a single, isolated change, and measuring the result before proceeding.
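A disciplined version of that single config change looks like this in shell. This is a sketch against a throwaway file: the path, the `pool_size` key, and the grep-based "validation" are all invented for the demo (a real service would use its own checker, such as `nginx -t`):

```shell
#!/bin/sh
# Sketch: one isolated, reversible change. File, key, and validation are
# stand-ins; substitute your service's real config and checker.
CONF=/tmp/demo-app.conf
echo "pool_size=10" > "$CONF"      # pretend this is the live config

cp "$CONF" "$CONF.bak"             # 1. backup before touching anything
sed -i 's/^pool_size=10$/pool_size=20/' "$CONF"   # 2. ONE change only

if grep -q '^pool_size=20$' "$CONF"; then         # 3. validate the change
  echo "change applied - measure latency before touching anything else"
else
  cp "$CONF.bak" "$CONF"           # 4. roll back immediately on failure
  echo "validation failed - rolled back"
fi
```

Backup, single change, validate, measure: if latency does not improve, the rollback path is one `cp` away and the experiment's result is unambiguous.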
Hands-On Exercise
Practicing Systematic Troubleshooting
Objective: Apply troubleshooting methodology to a simulated problem.
Environment: Any Linux system
Part 1: Initial Triage Script
```bash
# Create a triage script
cat > /tmp/triage.sh << 'EOF'
#!/bin/bash
echo "=== System Triage $(date) ==="
echo ""
echo "--- Uptime & Load ---"
uptime
echo ""
echo "--- Memory ---"
free -h
echo ""
echo "--- Disk ---"
df -h | grep -v tmpfs
echo ""
echo "--- Recent Errors ---"
journalctl -p err -n 10 --no-pager 2>/dev/null || dmesg | tail -10
echo ""
echo "--- Failed Services ---"
systemctl --failed 2>/dev/null || echo "N/A"
echo ""
echo "--- Top Processes ---"
ps aux --sort=-%cpu | head -5
echo ""
echo "--- Network Listeners ---"
ss -tuln | head -10
EOF
chmod +x /tmp/triage.sh

# Run it
/tmp/triage.sh
```

Part 2: Simulate and Diagnose
```bash
# Simulate: Fill up /tmp (safely)
dd if=/dev/zero of=/tmp/bigfile bs=1M count=100 2>/dev/null

# Now imagine you get alert: "Application failing"
# Apply methodology:

# 1. OBSERVE: What's the symptom?
echo "Symptom: Application reports 'cannot write file'"

# 2. HYPOTHESIZE: What could cause this?
echo "Hypotheses: 1) Disk full, 2) Permissions, 3) Process limit"

# 3. TEST Hypothesis 1
df -h /tmp   # Shows /tmp usage

# 4. CONCLUDE
echo "Root cause: /tmp filled by bigfile"

# 5. FIX
rm /tmp/bigfile
df -h /tmp
```

Part 3: Timeline Analysis
```bash
# Find recent changes on your system

# 1. Recent package changes
if command -v rpm &>/dev/null; then
  rpm -qa --last | head -10
elif command -v dpkg &>/dev/null; then
  ls -lt /var/lib/dpkg/info/*.list | head -10
fi

# 2. Recent config changes
find /etc -type f -mtime -7 2>/dev/null | head -10

# 3. Recent logins
last -10

# 4. Recent cron runs
grep CRON /var/log/syslog 2>/dev/null | tail -10 || \
  journalctl -u cron -n 10 --no-pager 2>/dev/null
```

Part 4: Hypothesis Testing Practice
```bash
# Scenario: "Can't SSH to server"
# Let's test hypotheses systematically

# Hypothesis 1: Network unreachable
ping -c 1 localhost > /dev/null && echo "H1: Network OK" || echo "H1: Network problem"

# Hypothesis 2: SSH not running
systemctl is-active sshd 2>/dev/null || \
  systemctl is-active ssh 2>/dev/null || \
  echo "SSH service check (verify manually)"

# Hypothesis 3: Port not listening
ss -tuln | grep :22 > /dev/null && echo "H3: Port 22 listening" || echo "H3: Port 22 not listening"

# Hypothesis 4: Firewall blocking
iptables -L INPUT -n 2>/dev/null | grep -q "dpt:22" && echo "H4: SSH in firewall rules" || echo "H4: Check firewall"
```

Part 5: Document Your Process
```bash
# Start logging
LOGFILE=/tmp/troubleshooting-$(date +%Y%m%d-%H%M).log

# Function to log with timestamp
log() {
  echo "[$(date +%H:%M:%S)] $*" | tee -a "$LOGFILE"
}

# Example session
log "Starting investigation: high load average"
log "Current load: $(uptime)"
log "Checking top processes..."
ps aux --sort=-%cpu | head -5 >> "$LOGFILE"
log "Hypothesis: runaway process"
log "Action: None yet, gathering more info"

# Review log
cat "$LOGFILE"
```

Success Criteria
- Created and ran triage script
- Applied scientific method to simulated problem
- Performed timeline analysis
- Tested multiple hypotheses systematically
- Documented troubleshooting steps
Key Takeaways
- Methodology beats guesswork — A systematic approach finds root causes faster
- Gather info before acting — Don’t lose diagnostic data by restarting
- What changed? — Something always changed; find it
- One change at a time — Know what actually fixed it
- Document everything — For yourself and others
What’s Next?
In Module 6.2: Log Analysis, you’ll learn how to effectively use system logs to diagnose issues—the most common source of troubleshooting information.