
Module 6.2: Log Analysis

Hands-On Lab Available
Ubuntu · intermediate · 30 min (runs in Killercoda)

Linux Troubleshooting | Complexity: [MEDIUM] | Time: 25-30 min

After this module, you will be able to:

  • Query logs efficiently with journalctl, grep, awk, and timestamp-based filtering
  • Correlate events across multiple log sources to build an incident timeline
  • Identify common error patterns and their root causes from log entries
  • Design a log aggregation strategy for a multi-service environment

Logs are the first source of truth for debugging. Every application, service, and the kernel itself writes logs. Knowing how to find, read, and analyze logs is fundamental to troubleshooting.

Understanding log analysis helps you:

  • Find error messages — The exact cause of failures
  • Correlate events — What happened before the problem?
  • Debug across services — Trace requests through systems
  • Build monitoring — Know what to alert on

If you can’t read logs effectively, you’re debugging blind.


  • journald stores logs in binary format — This allows indexing, filtering, and compression. Text logs lose this capability.

  • Log levels have standards — RFC 5424 defines syslog severity levels: Emergency, Alert, Critical, Error, Warning, Notice, Info, Debug. Applications use these inconsistently.

  • Logs can fill disks — A misconfigured debug log can fill a disk in minutes. Log rotation exists for a reason.

  • Kubernetes loses pod logs on restart — Container stdout goes to journald or log files on the node. When pods are deleted, logs go too unless forwarded elsewhere.
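The RFC 5424 severity names map directly to the numeric levels that journalctl's -p flag accepts. As a quick reference, here is a minimal shell sketch of that mapping (priority_num is a hypothetical helper, not a standard command):

```shell
# Map an RFC 5424 severity name to its numeric level,
# as used by journalctl -p (0 is most severe, 7 is least)
priority_num() {
  case "$1" in
    emerg)   echo 0 ;;
    alert)   echo 1 ;;
    crit)    echo 2 ;;
    err)     echo 3 ;;
    warning) echo 4 ;;
    notice)  echo 5 ;;
    info)    echo 6 ;;
    debug)   echo 7 ;;
    *)       echo "unknown level: $1" >&2; return 1 ;;
  esac
}

priority_num err      # → 3
priority_num warning  # → 4
```

Remember that "lower number" means "more severe": `journalctl -p 3` shows levels 0 through 3, i.e., err and worse.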


┌────────────────────────────────────────────────────────────┐
│ LOG SOURCES                                                │
│                                                            │
│ Traditional (syslog)        Modern (journald)              │
│ /var/log/syslog             journalctl                     │
│ /var/log/messages           journalctl -u service          │
│ /var/log/auth.log           journalctl _COMM=sshd          │
│ /var/log/kern.log           journalctl -k                  │
│                                                            │
│ Application-specific                                       │
│ /var/log/nginx/access.log   Custom locations               │
│ /var/log/mysql/error.log    Check app documentation        │
│ /var/log/apache2/error.log                                 │
│                                                            │
│ Container logs                                             │
│ docker logs <container>     journalctl CONTAINER_NAME=...  │
│ kubectl logs <pod>          Node: /var/log/pods/...        │
└────────────────────────────────────────────────────────────┘
| Log File | Purpose |
| --- | --- |
| /var/log/syslog or /var/log/messages | General system logs |
| /var/log/auth.log or /var/log/secure | Authentication events |
| /var/log/kern.log | Kernel messages |
| /var/log/dmesg | Boot messages |
| /var/log/apt/ or /var/log/dnf.log | Package manager logs |

Terminal window
# All logs
journalctl
# Follow mode (like tail -f)
journalctl -f
# Last 100 lines
journalctl -n 100
# Since boot
journalctl -b
# Previous boot
journalctl -b -1
# No pager (for piping)
journalctl --no-pager

Pause and predict: If you need to correlate a database error with a web server error, what is the most reliable piece of information to use across both log sources?

Terminal window
# Last hour
journalctl --since "1 hour ago"
# Today
journalctl --since today
# Specific time range
journalctl --since "2024-01-15 10:00" --until "2024-01-15 12:00"
# Relative time
journalctl --since "10 minutes ago"
Terminal window
# Specific service
journalctl -u nginx
journalctl -u sshd
# Multiple services
journalctl -u nginx -u php-fpm
# Kernel messages only
journalctl -k
journalctl --dmesg
Terminal window
# Errors and above
journalctl -p err
# Warnings and above
journalctl -p warning
# Priority levels:
# 0: emerg, 1: alert, 2: crit, 3: err
# 4: warning, 5: notice, 6: info, 7: debug
# Range
journalctl -p err..warning
Terminal window
# By process ID
journalctl _PID=1234
# By executable
journalctl _COMM=nginx
# By user
journalctl _UID=1000
# Combine filters
journalctl -u sshd _UID=0 --since "1 hour ago"
# JSON output
journalctl -o json-pretty -n 5

Terminal window
# View file
less /var/log/syslog
cat /var/log/syslog
# Tail (follow)
tail -f /var/log/syslog
tail -n 100 /var/log/syslog
# Search with grep
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog # Case insensitive
grep -v "DEBUG" /var/log/app.log # Exclude
# Multiple patterns
grep -E "error|warning|failed" /var/log/syslog
# Context around matches
grep -B 5 -A 5 "error" /var/log/syslog # 5 lines before/after
grep -C 3 "error" /var/log/syslog # 3 lines context
# Count occurrences
grep -c "error" /var/log/syslog
Terminal window
# Extract IPs
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn
# Extract timestamps
grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' app.log
# Extract error codes
grep -oE 'HTTP [0-9]{3}' access.log | sort | uniq -c
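You can try the extraction pipeline on an inline sample; the file path and log lines below are made up for illustration:

```shell
# Hypothetical access-log lines for demonstration
cat > /tmp/sample_access.log <<'EOF'
10.0.0.1 - - [15/Jan/2024:10:23:01] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [15/Jan/2024:10:23:02] "GET /app HTTP/1.1" 500 128
10.0.0.1 - - [15/Jan/2024:10:23:03] "GET /api HTTP/1.1" 500 256
EOF

# Count requests per client IP, busiest first
ip_counts=$(grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' /tmp/sample_access.log \
  | sort | uniq -c | sort -rn)
echo "$ip_counts"
```

The same pattern scales to real access logs: -o keeps only the matched substring, and the sort/uniq/sort pipeline turns raw matches into a ranked frequency table.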
Terminal window
# Print specific columns
awk '{print $1, $4}' access.log
# Sum values
awk '{sum+=$10} END {print sum}' access.log
# Filter and count
awk '$9 == 500 {count++} END {print count}' access.log
# Group by field
awk '{count[$1]++} END {for (ip in count) print ip, count[ip]}' access.log
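To see the awk one-liners in action, here is a self-contained sketch using a hypothetical log in the combined-log field layout, where $9 is the status code and $10 is the response size:

```shell
# Hypothetical access log; fields follow the combined-log layout
# ($9 = status code, $10 = response bytes)
cat > /tmp/awk_demo.log <<'EOF'
10.0.0.1 - - [15/Jan/2024:10:23:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [15/Jan/2024:10:23:02 +0000] "GET /app HTTP/1.1" 500 128
10.0.0.1 - - [15/Jan/2024:10:23:03 +0000] "GET /api HTTP/1.1" 500 256
EOF

# Total bytes served
total=$(awk '{sum+=$10} END {print sum}' /tmp/awk_demo.log)
# Number of 500 responses
errors=$(awk '$9 == 500 {count++} END {print count}' /tmp/awk_demo.log)
echo "bytes=$total errors=$errors"   # bytes=896 errors=2
```

Before relying on any field number, check the actual layout of your log with `awk '{print NF, $0}' file | head` — formats vary between servers and configurations.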

Terminal window
# Connection errors
grep -iE "connection refused|connection reset|timeout" /var/log/syslog
# Permission errors
grep -iE "permission denied|access denied|forbidden" /var/log/syslog
# Resource errors
grep -iE "out of memory|no space|too many open files" /var/log/syslog
# Service failures
grep -iE "failed|error|fatal|critical" /var/log/syslog
# Authentication failures
grep -iE "authentication failure|invalid user|failed password" /var/log/auth.log
Terminal window
# Errors per minute
grep "error" app.log | \
awk '{print $1, $2}' | \
cut -d: -f1-2 | \
sort | uniq -c
# First and last occurrence
grep "error" app.log | head -1 # First
grep "error" app.log | tail -1 # Last
# Error rate over time
grep "error" app.log | \
awk '{print $1}' | \
sort | uniq -c | \
awk '{print $2, $1}'
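The errors-per-minute pipeline above assumes a "date time ..." timestamp at the start of each line. A runnable sketch on hypothetical sample data:

```shell
# Hypothetical app log with "date time level message" lines
cat > /tmp/freq_demo.log <<'EOF'
2024-01-15 10:23:01 INFO  startup complete
2024-01-15 10:23:45 ERROR db connection timeout
2024-01-15 10:23:59 ERROR db connection timeout
2024-01-15 10:24:10 ERROR retry failed
EOF

# Errors per minute: keep date + HH:MM, then count duplicates
per_minute=$(grep -i "error" /tmp/freq_demo.log \
  | awk '{print $1, $2}' | cut -d: -f1-2 \
  | sort | uniq -c)
echo "$per_minute"
```

Two errors land in minute 10:23 and one in 10:24, so the output makes a burst-versus-steady-stream pattern visible at a glance.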
Terminal window
# Find what happened before an error
# (search for 10 lines before the error)
grep -B 10 "FATAL" app.log
# Find related events by timestamp
# 1. Find error timestamp
grep "ERROR" app.log | head -1
# Jan 15 10:23:45 ...
# 2. Search all logs for that time
journalctl --since "10:23:40" --until "10:23:50"
# 3. Check multiple services
journalctl -u nginx -u app -u database --since "10:23:00" --until "10:24:00"
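When two services share a consistent timestamp format, building a single incident timeline can be as simple as sorting their logs together, since ISO-style timestamps sort lexically in chronological order. A sketch with two hypothetical service logs:

```shell
# Two hypothetical service logs sharing a timestamp format
cat > /tmp/web.log <<'EOF'
2024-01-15 10:23:44 web   upstream timeout talking to db
EOF
cat > /tmp/db.log <<'EOF'
2024-01-15 10:23:42 db    too many connections
2024-01-15 10:23:45 db    connection pool exhausted
EOF

# Merge into one chronological timeline: ISO timestamps sort
# lexically, so plain sort interleaves events in time order
timeline=$(sort /tmp/web.log /tmp/db.log)
echo "$timeline"
```

The merged view shows the database hitting its connection limit two seconds before the web tier reported a timeout, which is exactly the before/after ordering you need for root-cause analysis.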

Terminal window
# Current pod logs
kubectl logs pod-name
# Previous container (after restart)
kubectl logs pod-name --previous
# Specific container
kubectl logs pod-name -c container-name
# Follow
kubectl logs -f pod-name
# Last 100 lines
kubectl logs --tail=100 pod-name
# Since time
kubectl logs --since=1h pod-name
kubectl logs --since-time="2024-01-15T10:00:00Z" pod-name
Terminal window
# All pods with label
kubectl logs -l app=nginx
# Multiple containers
kubectl logs pod-name --all-containers
# All pods in deployment
kubectl logs deployment/my-deployment
Terminal window
# Kubelet logs
journalctl -u kubelet
# Container runtime
journalctl -u containerd
journalctl -u docker
# Logs on disk (varies by setup)
ls /var/log/pods/
ls /var/log/containers/

Stop and think: What happens to the system if /var/log fills up completely because logs weren’t rotated? How would this affect running services?

Terminal window
# Check logrotate config
cat /etc/logrotate.conf
ls /etc/logrotate.d/
# Example config
cat /etc/logrotate.d/nginx
# /var/log/nginx/*.log {
#     daily
#     missingok
#     rotate 14
#     compress
#     notifempty
#     create 0640 nginx nginx
#     sharedscripts
#     postrotate
#         systemctl reload nginx
#     endscript
# }
# Force rotation
sudo logrotate -f /etc/logrotate.d/nginx
# Debug rotation
sudo logrotate -d /etc/logrotate.conf
Terminal window
# Config file
cat /etc/systemd/journald.conf
# Key settings:
# Storage=persistent # Keep logs across reboots
# Compress=yes
# SystemMaxUse=500M # Max disk usage
# MaxRetentionSec=1month
# Current disk usage
journalctl --disk-usage
# Clean old logs
sudo journalctl --vacuum-time=7d
sudo journalctl --vacuum-size=500M

| Mistake | Problem | Solution |
| --- | --- | --- |
| Not checking timestamps | Looking at the wrong time period | Always verify log time |
| Case-sensitive search | Missing errors | Use grep -i |
| Ignoring previous boot | Problem happened before reboot | journalctl -b -1 |
| No log forwarding | Logs lost when pod dies | Set up log aggregation |
| Searching too broadly | Too much noise | Filter by service, priority |
| Not checking all logs | Missing correlation | Check multiple sources |

A user reports that the web application started throwing 500 errors about 45 minutes ago. You need to quickly isolate the system-level error messages from that specific timeframe to identify the root cause without being overwhelmed by info-level noise. Which command should you run?

Answer:
Terminal window
journalctl -p err --since "1 hour ago"

Filtering by priority is essential when a system is generating a massive volume of informational logs. By using the -p err flag, you instruct journald to only display messages with a severity of error (level 3) or higher, immediately cutting through the noise. The --since "1 hour ago" parameter scopes the search down to the relevant incident window, ensuring you don’t waste time investigating old, unrelated issues.

For warnings and errors combined, you can widen the priority slightly:

Terminal window
journalctl -p warning --since "1 hour ago"

Your Kubernetes node experienced a sudden kernel panic and automatically rebooted. You SSH into the node after it comes back online, but the current logs only show the successful startup sequence. How can you retrieve the logs from right before the crash?

Answer:
Terminal window
journalctl -b -1

By default, running journalctl without arguments shows logs from the current boot, which isn’t helpful if you are investigating a crash that caused a restart. The -b flag targets a specific boot session, and appending -1 explicitly requests the logs from the immediately preceding boot. This allows you to inspect the system’s exact state and read the kernel messages that were recorded right before the panic occurred.

To list all available boot sessions and their IDs, you can run:

Terminal window
journalctl --list-boots

This is particularly useful when a system has crashed and restarted multiple times, as you may need to go back further than just the previous boot (e.g., -b -2).

You suspect a newly deployed microservice is occasionally failing to connect to the database. You want to quantify the impact by counting the exact number of times the “database connection timeout” message appears in the application log file. What approaches can you use?

Answer:
Terminal window
# Count occurrences
grep -c "specific error message" /var/log/app.log
# With journalctl
journalctl -u service --no-pager | grep -c "error message"
# Group by time
grep "error" app.log | awk '{print $1}' | sort | uniq -c

Counting the raw number of errors helps establish the severity and frequency of an issue. Using the -c flag with grep is the most efficient way to get a total count because it avoids printing the matching lines to standard output, simply returning the integer tally. When you need to understand if the errors are a continuous stream or isolated spikes, piping the output to awk, sort, and uniq -c allows you to group the occurrences by timestamp, revealing the pattern of the failures over time.
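A quick self-contained demonstration of the -c behavior, using a hypothetical log file:

```shell
# Hypothetical application log
cat > /tmp/count_demo.log <<'EOF'
2024-01-15 10:23:45 ERROR database connection timeout
2024-01-15 10:23:50 INFO  request served
2024-01-15 10:24:02 ERROR database connection timeout
EOF

# -c prints only the number of matching lines, not the lines themselves
hits=$(grep -c "database connection timeout" /tmp/count_demo.log)
echo "$hits"   # 2
```

Note that -c counts matching lines, not total matches: a line containing the pattern twice still counts once.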

You found a critical “Out of Memory” error in the /var/log/app.log file, but the error message itself doesn’t specify which transaction caused it. You need to see the log lines immediately preceding and following this error to reconstruct the sequence of events. How can you retrieve this context?

Answer:
Terminal window
# 5 lines before and after
grep -C 5 "error message" /var/log/app.log
# Or separately:
grep -B 5 "error" /var/log/app.log # 5 lines before
grep -A 5 "error" /var/log/app.log # 5 lines after
# With journalctl, use time range around the event
journalctl --since "10:23:40" --until "10:23:50"

An isolated error message rarely tells the full story of why a failure occurred. The context flags in grep (-B for before, -A for after, and -C for context in both directions) allow you to see the application’s state leading up to the crash, such as the specific user request being processed. Alternatively, if you are using journalctl, extracting the exact timestamp of the error and querying a narrow time window around it lets you correlate events across multiple system services simultaneously.
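The context flags are easiest to see on a small example; the log lines below are hypothetical:

```shell
# Hypothetical log where the surrounding lines explain the failure
cat > /tmp/ctx_demo.log <<'EOF'
2024-01-15 10:23:40 INFO  handling request id=42
2024-01-15 10:23:41 INFO  loading 2GB report into memory
2024-01-15 10:23:42 FATAL Out of Memory
2024-01-15 10:23:43 INFO  worker restarted
EOF

# One line of context on each side of the fatal error
context=$(grep -C 1 "Out of Memory" /tmp/ctx_demo.log)
echo "$context"
```

The line before the FATAL entry ("loading 2GB report into memory") immediately suggests the transaction that triggered the crash, which the error message alone never reveals.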

A developer asks for your help because their newly deployed application is failing, but when they run kubectl logs pod-name, the output is completely empty. The pod status shows it has been running for 10 minutes. What are the most likely architectural or configuration-related reasons for this missing log output?

Answer:

When kubectl logs returns nothing, it generally means the container engine isn’t capturing the application’s standard output. The most common reason is that the application is hardcoded to write its logs directly to a file inside the container’s filesystem (e.g., /var/log/app.log) instead of streaming to stdout and stderr. Furthermore, if the pod contains multiple containers, you might be querying a sidecar container that hasn’t logged anything yet instead of the main application container.

Several specific possibilities to investigate include:

  1. Application writes to files, not stdout: Container logs only capture stdout/stderr. Check if the app logs to a specific file inside the container.
  2. Container restarted: A new container starts with fresh logs. Use the --previous flag to view logs from the crashed instance.
  3. Logging to wrong container: In a multi-container pod, you must specify the target using -c container-name.
  4. Application hasn’t logged anything: The application framework might be buffering logs in memory, or the log level might be set too high (e.g., only logging critical errors).
  5. Log rotation: If the application generates massive logs, old logs may have already been rotated out by system policies.

Objective: Use journalctl and traditional log tools to analyze system logs.

Environment: Any Linux system with systemd

Terminal window
# 1. View recent logs
journalctl -n 20
# 2. Check disk usage
journalctl --disk-usage
# 3. List boots
journalctl --list-boots
# 4. Current boot only
journalctl -b -n 50
Terminal window
# 1. Filter by service
journalctl -u sshd -n 20
# Try other services: systemd, NetworkManager, etc.
# 2. Filter by priority
journalctl -p err -n 20
journalctl -p err..warning -n 20
# 3. Filter by time
journalctl --since "30 minutes ago" -n 50
journalctl --since "09:00" --until "10:00"
# 4. Combine filters
journalctl -u sshd -p warning --since today
Terminal window
# 1. Find a log file to analyze
ls -la /var/log/
LOG_FILE="/var/log/syslog" # or /var/log/messages
# 2. Basic viewing
tail -20 $LOG_FILE
head -20 $LOG_FILE
# 3. Search for errors
grep -i error $LOG_FILE | tail -10
grep -c -i error $LOG_FILE
# 4. Search with context
grep -C 3 -i error $LOG_FILE | tail -30
Terminal window
# 1. Find unique error types
grep -i error /var/log/syslog 2>/dev/null | \
awk '{$1=$2=$3=$4=$5=""; print}' | \
sort | uniq -c | sort -rn | head -10
# 2. Errors by hour
journalctl -p err --since today --no-pager | \
awk '{print $3}' | \
cut -d: -f1 | \
sort | uniq -c
# 3. Find authentication failures
grep -i "authentication failure\|failed password" /var/log/auth.log 2>/dev/null | tail -10
# Or
journalctl _COMM=sshd | grep -i "failed\|invalid" | tail -10
Terminal window
# 1. Generate an event
logger "TEST: Exercise event at $(date)"
# 2. Find it
journalctl --since "1 minute ago" | grep TEST
# 3. Find related events (same timestamp)
journalctl --since "1 minute ago"
# 4. Export for analysis
journalctl -u sshd --since "1 hour ago" -o json > /tmp/sshd_logs.json
head -5 /tmp/sshd_logs.json
Terminal window
# 1. Check journal size
journalctl --disk-usage
# 2. View rotation config (if exists)
cat /etc/logrotate.d/* 2>/dev/null | head -30
# 3. Plan what would be cleaned
# (journalctl has no dry-run mode for vacuuming, so check current
# usage first — do not actually clean without understanding the
# retention impact)
journalctl --disk-usage
  • Viewed logs with journalctl using various filters
  • Filtered by service, priority, and time
  • Used grep to search text logs
  • Found patterns and counted occurrences
  • Correlated events across time
  • Checked log maintenance settings

  1. journalctl is powerful — Use filters: -u, -p, --since, field matches

  2. grep with context — -B, -A, -C show surrounding lines

  3. Time matters — Always verify you’re looking at the right time period

  4. Correlate across services — Problems often span multiple components

  5. Set up log forwarding — Ephemeral containers lose logs


In Module 6.3: Process Debugging, you’ll learn how to trace process behavior with strace, examine /proc, and debug hung or misbehaving processes.