Module 6.2: Log Analysis
Linux Troubleshooting | Complexity: Medium | Time: 25-30 min
Prerequisites
Before starting this module:
- Required: Module 6.1: Systematic Troubleshooting
- Required: Module 1.2: Processes & Systemd
- Helpful: Basic regex knowledge
What You’ll Be Able to Do
After this module, you will be able to:
- Query logs efficiently with journalctl, grep, awk, and timestamp-based filtering
- Correlate events across multiple log sources to build an incident timeline
- Identify common error patterns and their root causes from log entries
- Design a log aggregation strategy for a multi-service environment
Why This Module Matters
Logs are the first source of truth for debugging. Every application, service, and the kernel itself writes logs. Knowing how to find, read, and analyze logs is fundamental to troubleshooting.
Understanding log analysis helps you:
- Find error messages — The exact cause of failures
- Correlate events — What happened before the problem?
- Debug across services — Trace requests through systems
- Build monitoring — Know what to alert on
If you can’t read logs effectively, you’re debugging blind.
Did You Know?
- journald stores logs in binary format — This allows indexing, filtering, and compression. Plain-text logs lose these capabilities.
- Log levels have standards — RFC 5424 defines the syslog severity levels: Emergency, Alert, Critical, Error, Warning, Notice, Info, Debug. Applications use them inconsistently.
- Logs can fill disks — A misconfigured debug log can fill a disk in minutes. Log rotation exists for a reason.
- Kubernetes loses pod logs on restart — Container stdout goes to journald or log files on the node. When a pod is deleted, its logs go too unless they were forwarded elsewhere.
Log Sources
System Logs
```
                          LOG SOURCES

  Traditional (syslog)              Modern (journald)
  /var/log/syslog                   journalctl
  /var/log/messages                 journalctl -u service
  /var/log/auth.log                 journalctl _COMM=sshd
  /var/log/kern.log                 journalctl -k

  Application-specific
  /var/log/nginx/access.log         Custom locations
  /var/log/mysql/error.log          Check app documentation
  /var/log/apache2/error.log

  Container logs
  docker logs <container>           journalctl CONTAINER_NAME=...
  kubectl logs <pod>                Node: /var/log/pods/...
```
Key Log Files
| Log File | Purpose |
|---|---|
| /var/log/syslog or /var/log/messages | General system logs |
| /var/log/auth.log or /var/log/secure | Authentication events |
| /var/log/kern.log | Kernel messages |
| /var/log/dmesg | Boot messages |
| /var/log/apt/ or /var/log/dnf.log | Package manager logs |
journalctl
Basic Usage
```bash
# All logs
journalctl

# Follow mode (like tail -f)
journalctl -f

# Last 100 lines
journalctl -n 100

# Since boot
journalctl -b

# Previous boot
journalctl -b -1

# No pager (for piping)
journalctl --no-pager
```
Filtering by Time
Pause and predict: If you need to correlate a database error with a web server error, what is the most reliable piece of information to use across both log sources?
```bash
# Last hour
journalctl --since "1 hour ago"

# Today
journalctl --since today

# Specific time range
journalctl --since "2024-01-15 10:00" --until "2024-01-15 12:00"

# Relative time
journalctl --since "10 minutes ago"
```
Filtering by Service/Unit
```bash
# Specific service
journalctl -u nginx
journalctl -u sshd

# Multiple services
journalctl -u nginx -u php-fpm

# Kernel messages only
journalctl -k
journalctl --dmesg
```
Filtering by Priority
```bash
# Errors and above
journalctl -p err

# Warnings and above
journalctl -p warning

# Priority levels:
# 0: emerg, 1: alert, 2: crit, 3: err
# 4: warning, 5: notice, 6: info, 7: debug

# Range
journalctl -p warning..err
```
Advanced Filtering
```bash
# By process ID
journalctl _PID=1234

# By executable
journalctl _COMM=nginx

# By user
journalctl _UID=1000

# Combine filters
journalctl -u sshd _UID=0 --since "1 hour ago"

# JSON output
journalctl -o json-pretty -n 5
```
Text Log Analysis
Common Tools
```bash
# View file
less /var/log/syslog
cat /var/log/syslog

# Tail (follow)
tail -f /var/log/syslog
tail -n 100 /var/log/syslog

# Search with grep
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog     # Case insensitive
grep -v "DEBUG" /var/log/app.log    # Exclude

# Multiple patterns
grep -E "error|warning|failed" /var/log/syslog

# Context around matches
grep -B 5 -A 5 "error" /var/log/syslog   # 5 lines before/after
grep -C 3 "error" /var/log/syslog        # 3 lines of context

# Count occurrences
grep -c "error" /var/log/syslog
```
Pattern Extraction
```bash
# Extract IPs
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn

# Extract timestamps
grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' app.log
```
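To see the IP-extraction pipeline in action, here is a self-contained run against a tiny fabricated access log (the file path and log lines are invented for the demo; the `^` anchor keeps the regex from matching numbers later in each line):

```shell
# Create a small sample access log (entries are fabricated)
cat > /tmp/sample_access.log <<'EOF'
10.0.0.5 - - [15/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.5 - - [15/Jan/2024:10:00:02 +0000] "GET /a HTTP/1.1" 404 128
192.168.1.9 - - [15/Jan/2024:10:00:03 +0000] "GET /b HTTP/1.1" 200 256
EOF

# Count requests per client IP, busiest first
grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

The busiest client (10.0.0.5 with 2 requests) sorts to the top; on a real access log the same pipeline quickly surfaces scanners and abusive clients.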
```bash
# Extract error codes
grep -oE 'HTTP [0-9]{3}' access.log | sort | uniq -c
```
AWK for Log Processing
```bash
# Print specific columns
awk '{print $1, $4}' access.log

# Sum values
awk '{sum+=$10} END {print sum}' access.log

# Filter and count
awk '$9 == 500 {count++} END {print count}' access.log
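As a quick sanity check, the sum and filter-and-count one-liners can be run against a fabricated three-line log whose field positions mimic the common access-log layout, where $9 is the status code and $10 the response size:

```shell
# Fabricated access-log-style lines: $9 = status, $10 = bytes
cat > /tmp/sample_awk.log <<'EOF'
10.0.0.5 - - [15/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.5 - - [15/Jan/2024:10:00:02 +0000] "GET /a HTTP/1.1" 500 128
192.168.1.9 - - [15/Jan/2024:10:00:03 +0000] "GET /b HTTP/1.1" 200 256
EOF

# Total bytes served across all requests
awk '{sum+=$10} END {print sum}' /tmp/sample_awk.log    # -> 896

# Number of 500 responses
awk '$9 == 500 {count++} END {print count}' /tmp/sample_awk.log    # -> 1
```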
# Group by field
awk '{count[$1]++} END {for (ip in count) print ip, count[ip]}' access.log
```
Log Patterns
Common Error Patterns
```bash
# Connection errors
grep -iE "connection refused|connection reset|timeout" /var/log/syslog

# Permission errors
grep -iE "permission denied|access denied|forbidden" /var/log/syslog

# Resource errors
grep -iE "out of memory|no space|too many open files" /var/log/syslog

# Service failures
grep -iE "failed|error|fatal|critical" /var/log/syslog

# Authentication failures
grep -iE "authentication failure|invalid user|failed password" /var/log/auth.log
```
Time-Based Analysis
```bash
# Errors per minute
grep "error" app.log | \
  awk '{print $1, $2}' | \
  cut -d: -f1-2 | \
  sort | uniq -c

# First and last occurrence
grep "error" app.log | head -1   # First
grep "error" app.log | tail -1   # Last
```
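Here is the errors-per-minute pipeline on a fabricated app log with "date time level message" lines (file path and entries are invented for the demo): keeping only the date and HH:MM portion of the timestamp lets uniq -c count errors per minute.

```shell
# Fabricated app log: "date time level message"
cat > /tmp/sample_app.log <<'EOF'
2024-01-15 10:23:12 error db timeout
2024-01-15 10:23:45 error db timeout
2024-01-15 10:24:02 info retry ok
2024-01-15 10:24:30 error db timeout
EOF

# Keep date + HH:MM, then count per minute
grep "error" /tmp/sample_app.log | awk '{print $1, $2}' | cut -d: -f1-2 | sort | uniq -c
# counts: 2 for 10:23, 1 for 10:24
```

A sudden jump in the per-minute count is often the clearest marker of when an incident actually started.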
```bash
# Error rate over time
grep "error" app.log | \
  awk '{print $1}' | \
  sort | uniq -c | \
  awk '{print $2, $1}'
```
Correlation
```bash
# Find what happened before an error
# (show the 10 lines preceding the match)
grep -B 10 "FATAL" app.log

# Find related events by timestamp
# 1. Find the error timestamp
grep "ERROR" app.log | head -1
# Jan 15 10:23:45 ...

# 2. Search all logs for that time
journalctl --since "10:23:40" --until "10:23:50"
```
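When two services log to separate files, a merged timeline often makes cause and effect obvious. A minimal sketch, assuming both files are already sorted by their leading HH:MM:SS timestamp (the file names and log entries are invented):

```shell
# Two fabricated, time-sorted service logs
cat > /tmp/web.log <<'EOF'
10:23:41 web: request /checkout received
10:23:45 web: 500 returned to client
EOF
cat > /tmp/db.log <<'EOF'
10:23:43 db: connection pool exhausted
EOF

# Merge the pre-sorted files into one interleaved timeline
sort -m -k1,1 /tmp/web.log /tmp/db.log
```

The merged output shows the database exhausting its pool between the request arriving and the 500 being returned, which is exactly the before-and-after context a single log cannot provide.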
```bash
# 3. Check multiple services
journalctl -u nginx -u app -u database --since "10:23:00" --until "10:24:00"
```
Kubernetes Logs
Pod Logs
```bash
# Current pod logs
kubectl logs pod-name

# Previous container (after restart)
kubectl logs pod-name --previous

# Specific container
kubectl logs pod-name -c container-name

# Follow
kubectl logs -f pod-name

# Last 100 lines
kubectl logs --tail=100 pod-name

# Since time
kubectl logs --since=1h pod-name
kubectl logs --since-time="2024-01-15T10:00:00Z" pod-name
```
Multi-Pod Logs
```bash
# All pods with a label
kubectl logs -l app=nginx

# Multiple containers
kubectl logs pod-name --all-containers

# All pods in a deployment
kubectl logs deployment/my-deployment
```
Node-Level Logs
```bash
# Kubelet logs
journalctl -u kubelet

# Container runtime
journalctl -u containerd
journalctl -u docker

# Logs on disk (varies by setup)
ls /var/log/pods/
ls /var/log/containers/
```
Log Management
Stop and think: What happens to the system if /var/log fills up completely because logs weren’t rotated? How would this affect running services?
Log Rotation
```bash
# Check logrotate config
cat /etc/logrotate.conf
ls /etc/logrotate.d/

# Example config
cat /etc/logrotate.d/nginx
# /var/log/nginx/*.log {
#     daily
#     missingok
#     rotate 14
#     compress
#     notifempty
#     create 0640 nginx nginx
#     sharedscripts
#     postrotate
#         systemctl reload nginx
#     endscript
# }

# Force rotation
sudo logrotate -f /etc/logrotate.d/nginx

# Debug rotation (dry run)
sudo logrotate -d /etc/logrotate.conf
```
journald Configuration
```bash
# Config file
cat /etc/systemd/journald.conf

# Key settings:
# Storage=persistent      # Keep logs across reboots
# Compress=yes
# SystemMaxUse=500M       # Max disk usage
# MaxRetentionSec=1month
```
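Rather than editing journald.conf directly, the same settings can live in a drop-in file, which survives package upgrades. A sketch with example values (the file name size.conf is arbitrary, and the limits should be tuned to your disk budget):

```shell
# Create a journald drop-in (requires root); values are examples only
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/size.conf <<'EOF'
[Journal]
Storage=persistent
Compress=yes
SystemMaxUse=500M
MaxRetentionSec=1month
EOF

# Apply the new settings
sudo systemctl restart systemd-journald
```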
```bash
# Current disk usage
journalctl --disk-usage

# Clean old logs
sudo journalctl --vacuum-time=7d
sudo journalctl --vacuum-size=500M
```
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Not checking timestamps | Looking at wrong time period | Always verify log time |
| Case-sensitive search | Missing errors | Use grep -i |
| Ignoring previous boot | Problem happened before reboot | journalctl -b -1 |
| No log forwarding | Logs lost when pod dies | Set up log aggregation |
| Searching too broadly | Too much noise | Filter by service, priority |
| Not checking all logs | Missing correlation | Check multiple sources |
Question 1
A user reports that the web application started throwing 500 errors about 45 minutes ago. You need to quickly isolate the system-level error messages from that specific timeframe to identify the root cause without being overwhelmed by info-level noise. Which command should you run?
Show Answer
```bash
journalctl -p err --since "1 hour ago"
```
Filtering by priority is essential when a system is generating a massive volume of informational logs. The -p err flag tells journalctl to display only messages with severity error (level 3) or higher, immediately cutting through the noise. The --since "1 hour ago" parameter scopes the search to the relevant incident window, so you don’t waste time investigating old, unrelated issues.
For warnings and errors combined, you can widen the priority slightly:
```bash
journalctl -p warning --since "1 hour ago"
```
Question 2
Your Kubernetes node experienced a sudden kernel panic and automatically rebooted. You SSH into the node after it comes back online, but the current logs only show the successful startup sequence. How can you retrieve the logs from right before the crash?
Show Answer
```bash
journalctl -b -1
```
By default, running journalctl without arguments shows logs from the current boot, which isn’t helpful if you are investigating a crash that caused a restart. The -b flag targets a specific boot session, and appending -1 explicitly requests the logs from the immediately preceding boot. This allows you to inspect the system’s exact state and read the kernel messages that were recorded right before the panic occurred.
To list all available boot sessions and their IDs, you can run:
```bash
journalctl --list-boots
```
This is particularly useful when a system has crashed and restarted multiple times, as you may need to go back further than just the previous boot (e.g., -b -2).
Question 3
You suspect a newly deployed microservice is occasionally failing to connect to the database. You want to quantify the impact by counting the exact number of times the “database connection timeout” message appears in the application log file. What approaches can you use?
Show Answer
```bash
# Count occurrences
grep -c "specific error message" /var/log/app.log

# With journalctl
journalctl -u service --no-pager | grep -c "error message"

# Group by time
grep "error" app.log | awk '{print $1}' | sort | uniq -c
```
Counting the raw number of errors helps establish the severity and frequency of an issue. Using the -c flag with grep is the most efficient way to get a total count because it avoids printing the matching lines to standard output, simply returning the integer tally. When you need to understand whether the errors are a continuous stream or isolated spikes, piping the output through awk, sort, and uniq -c groups the occurrences by timestamp, revealing the pattern of failures over time.
Question 4
You found a critical “Out of Memory” error in the /var/log/app.log file, but the error message itself doesn’t specify which transaction caused it. You need to see the log lines immediately preceding and following this error to reconstruct the sequence of events. How can you retrieve this context?
Show Answer
```bash
# 5 lines before and after
grep -C 5 "error message" /var/log/app.log

# Or separately:
grep -B 5 "error" /var/log/app.log   # 5 lines before
grep -A 5 "error" /var/log/app.log   # 5 lines after

# With journalctl, use a time range around the event
journalctl --since "10:23:40" --until "10:23:50"
```
An isolated error message rarely tells the full story of why a failure occurred. The context flags in grep (-B for before, -A for after, and -C for context in both directions) allow you to see the application’s state leading up to the crash, such as the specific user request being processed. Alternatively, if you are using journalctl, extracting the exact timestamp of the error and querying a narrow time window around it lets you correlate events across multiple system services simultaneously.
Question 5
A developer asks for your help because their newly deployed application is failing, but when they run kubectl logs pod-name, the output is completely empty. The pod status shows it has been running for 10 minutes. What are the most likely architectural or configurational reasons for this missing log output?
Show Answer
When kubectl logs returns nothing, it generally means the container engine isn’t capturing the application’s standard output. The most common reason is that the application is hardcoded to write its logs directly to a file inside the container’s filesystem (e.g., /var/log/app.log) instead of streaming to stdout and stderr. Furthermore, if the pod contains multiple containers, you might be querying a sidecar container that hasn’t logged anything yet instead of the main application container.
Several specific possibilities to investigate include:
- Application writes to files, not stdout: Container logs only capture stdout/stderr. Check if the app logs to a specific file inside the container.
- Container restarted: A new container starts with fresh logs. Use the --previous flag to view logs from the crashed instance.
- Logging to the wrong container: In a multi-container pod, you must specify the target using -c container-name.
- Application hasn’t logged anything: The application framework might be buffering logs in memory, or the log level might be set too high (e.g., only logging critical errors).
- Log rotation: If the application generates massive logs, old logs may have already been rotated out by system policies.
Hands-On Exercise
Log Analysis Practice
Objective: Use journalctl and traditional log tools to analyze system logs.
Environment: Any Linux system with systemd
Part 1: journalctl Basics
```bash
# 1. View recent logs
journalctl -n 20

# 2. Check disk usage
journalctl --disk-usage

# 3. List boots
journalctl --list-boots

# 4. Current boot only
journalctl -b -n 50
```
Part 2: Filtering
```bash
# 1. Filter by service
journalctl -u sshd -n 20
# Try other services: systemd, NetworkManager, etc.

# 2. Filter by priority
journalctl -p err -n 20
journalctl -p warning..err -n 20

# 3. Filter by time
journalctl --since "30 minutes ago" -n 50
journalctl --since "09:00" --until "10:00"

# 4. Combine filters
journalctl -u sshd -p warning --since today
```
Part 3: Text Log Analysis
```bash
# 1. Find a log file to analyze
ls -la /var/log/
LOG_FILE="/var/log/syslog"   # or /var/log/messages

# 2. Basic viewing
tail -20 $LOG_FILE
head -20 $LOG_FILE

# 3. Search for errors
grep -i error $LOG_FILE | tail -10
grep -c -i error $LOG_FILE

# 4. Search with context
grep -C 3 -i error $LOG_FILE | tail -30
```
Part 4: Pattern Analysis
```bash
# 1. Find unique error types
grep -i error /var/log/syslog 2>/dev/null | \
  awk '{$1=$2=$3=$4=$5=""; print}' | \
  sort | uniq -c | sort -rn | head -10

# 2. Errors by hour
journalctl -p err --since today --no-pager | \
  awk '{print $3}' | \
  cut -d: -f1 | \
  sort | uniq -c

# 3. Find authentication failures
grep -i "authentication failure\|failed password" /var/log/auth.log 2>/dev/null | tail -10
# Or
journalctl _COMM=sshd | grep -i "failed\|invalid" | tail -10
```
Part 5: Correlation Practice
```bash
# 1. Generate an event
logger "TEST: Exercise event at $(date)"

# 2. Find it
journalctl --since "1 minute ago" | grep TEST

# 3. Find related events (same time window)
journalctl --since "1 minute ago"

# 4. Export for analysis
journalctl -u sshd --since "1 hour ago" -o json > /tmp/sshd_logs.json
head -5 /tmp/sshd_logs.json
```
Part 6: Log Maintenance
```bash
# 1. Check journal size
journalctl --disk-usage

# 2. View rotation config (if it exists)
cat /etc/logrotate.d/* 2>/dev/null | head -30

# 3. See what would be cleaned
# (do not actually clean without understanding the impact)
sudo journalctl --vacuum-time=7d --dry-run 2>/dev/null || \
  echo "dry-run not supported, skip cleanup"
```
Success Criteria
- Viewed logs with journalctl using various filters
- Filtered by service, priority, and time
- Used grep to search text logs
- Found patterns and counted occurrences
- Correlated events across time
- Checked log maintenance settings
Key Takeaways
- journalctl is powerful — use its filters: -u, -p, --since, and field matches
- grep with context — -B, -A, and -C show the surrounding lines
- Time matters — always verify you’re looking at the right time period
- Correlate across services — problems often span multiple components
- Set up log forwarding — ephemeral containers lose their logs
What’s Next?
In Module 6.3: Process Debugging, you’ll learn how to trace process behavior with strace, examine /proc, and debug hung or misbehaving processes.