Module 5.7: Logging & Monitoring
Complexity:
[MEDIUM]- Essential observability skillsTime to Complete: 40-50 minutes
Prerequisites: Module 5.1 (Methodology), Modules 5.2-5.6 (troubleshooting specifics)
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After this module, you will be able to:
- Query container logs using kubectl logs with container selection, previous logs, and follow mode
- Monitor cluster resource usage with kubectl top (nodes and pods) and explain metrics-server
- Implement sidecar-based logging for applications that write to files instead of stdout
- Debug metrics-server issues and explain how resource metrics flow from kubelet to API server
Why This Module Matters
Section titled “Why This Module Matters”Logs and metrics are your eyes into what’s happening in a cluster. Without them, troubleshooting is guesswork. Understanding how to access container logs, interpret events, and use metrics to identify resource issues is fundamental to effective Kubernetes operations.
The Security Camera Analogy
Logs are like security camera footage - they record what happened and when. Events are like a security guard’s incident reports - notable occurrences written down. Metrics are like building sensors - temperature, occupancy, power usage. Together, they tell the complete story of what’s happening in your cluster.
What You’ll Learn
Section titled “What You’ll Learn”By the end of this module, you’ll be able to:
- Access and filter container logs effectively
- Understand Kubernetes events and their significance
- Use kubectl top for resource metrics
- Navigate log locations on nodes
- Apply logging strategies for troubleshooting
Did You Know?
Section titled “Did You Know?”- Logs go to stdout/stderr: Kubernetes captures whatever containers write to stdout and stderr - that’s the only logging “magic”
- Events are stored in etcd: Events are regular Kubernetes objects with a 1-hour default TTL
- Metrics Server is not installed by default: kubectl top requires Metrics Server to be running
- Log rotation is kubelet’s job: kubelet rotates container logs based on size and count settings
Part 1: Container Logs
Section titled “Part 1: Container Logs”1.1 How Container Logging Works
Section titled “1.1 How Container Logging Works”┌──────────────────────────────────────────────────────────────┐│ CONTAINER LOGGING FLOW ││ ││ Container ││ ┌────────────────────────────────────────────────────┐ ││ │ Application │ ││ │ │ │ ││ │ ├── stdout ──────┐ │ ││ │ │ │ │ ││ │ └── stderr ──────┼────▶ Container runtime │ ││ │ │ captures output │ ││ └────────────────────────────────────────────────────┘ ││ │ ││ ▼ ││ Node filesystem ││ /var/log/containers/<pod>_<ns>_<container>-<id>.log ││ │ ││ ▼ ││ kubectl logs (reads these files via kubelet API) ││ │└──────────────────────────────────────────────────────────────┘1.2 Basic Log Commands
Section titled “1.2 Basic Log Commands”# View logs from a pod (single container)k logs <pod>
# View logs from specific container (multi-container pod)k logs <pod> -c <container>
# Follow logs in real-timek logs <pod> -f
# Show last N linesk logs <pod> --tail=50
# Show logs since timek logs <pod> --since=1hk logs <pod> --since=30m
# Show logs with timestampsk logs <pod> --timestamps
# Combine optionsk logs <pod> --tail=100 --timestamps -fStop and think: If a pod has multiple containers and you run
kubectl logs <pod>, what happens? How does Kubernetes know which container’s logs to show?
1.3 Multi-Container Pod Logs
Section titled “1.3 Multi-Container Pod Logs”# List containers in a podk get pod <pod> -o jsonpath='{.spec.containers[*].name}'
# Get logs from specific containerk logs <pod> -c <container>
# Get logs from all containersk logs <pod> --all-containers=true
# Get logs from init containersk logs <pod> -c <init-container>1.4 Previous Container Logs
Section titled “1.4 Previous Container Logs”Crucial for CrashLoopBackOff troubleshooting:
# Get logs from previous container instance (after crash)k logs <pod> --previousk logs <pod> -c <container> --previous
# This shows what was logged before the container died# Essential for understanding why it crashed1.5 Logs from Labels/Selectors
Section titled “1.5 Logs from Labels/Selectors”# Logs from all pods with a labelk logs -l app=nginx
# Logs from all pods in a deploymentk logs deployment/my-deployment
# Follow logs from all matching podsk logs -l app=nginx -f
# With container name for multi-container podsk logs -l app=nginx -c <container>Part 2: Kubernetes Events
Section titled “Part 2: Kubernetes Events”2.1 Understanding Events
Section titled “2.1 Understanding Events”┌──────────────────────────────────────────────────────────────┐│ KUBERNETES EVENTS ││ ││ Events are generated by: ││ • Scheduler (scheduling decisions) ││ • kubelet (container lifecycle) ││ • Controllers (resource management) ││ • API server (API operations) ││ ││ Event Types: ││ • Normal: Informational, things working as expected ││ • Warning: Something unexpected, might need attention ││ ││ Important: Events expire after ~1 hour by default! ││ │└──────────────────────────────────────────────────────────────┘Pause and predict: If a pod was created 3 days ago and has been CrashLoopBackOff ever since, will
kubectl get eventsshow the original scheduling and creation events? Why or why not?
2.2 Viewing Events
Section titled “2.2 Viewing Events”# All events in current namespacek get events
# All events cluster-widek get events -A
# Sort by time (most recent last)k get events --sort-by='.lastTimestamp'
# Sort by time (most recent first)k get events --sort-by='.lastTimestamp' | tac
# Filter by typek get events --field-selector type=Warning
# Events for specific objectk get events --field-selector involvedObject.name=<pod-name>
# Watch events in real-timek get events -w2.3 Common Event Reasons
Section titled “2.3 Common Event Reasons”| Reason | Type | What It Means |
|---|---|---|
| Scheduled | Normal | Pod assigned to node |
| Pulled | Normal | Image successfully pulled |
| Created | Normal | Container created |
| Started | Normal | Container started |
| Killing | Normal | Container being terminated |
| FailedScheduling | Warning | Couldn’t find suitable node |
| FailedMount | Warning | Volume mount failed |
| Unhealthy | Warning | Probe failed |
| BackOff | Warning | Container crashing, backing off |
| FailedCreate | Warning | Controller couldn’t create pod |
| Evicted | Warning | Pod evicted from node |
| OOMKilling | Warning | Container killed for OOM |
2.4 Events in Describe Output
Section titled “2.4 Events in Describe Output”# Events appear in describe outputk describe pod <pod># Look for the Events section at the bottom
k describe node <node># Shows node-level events
k describe pvc <pvc># Shows volume binding eventsPart 3: Resource Metrics
Section titled “Part 3: Resource Metrics”3.1 Metrics Server
Section titled “3.1 Metrics Server”┌──────────────────────────────────────────────────────────────┐│ METRICS SERVER ││ ││ Nodes Metrics Server ││ ┌──────────┐ ┌──────────────┐ ││ │ kubelet │────────────│ Collects │ ││ │ /metrics │ │ aggregates │ ││ └──────────┘ │ exposes │ ││ └──────┬───────┘ ││ │ ││ metrics.k8s.io API ││ │ ││ ▼ ││ kubectl top ││ ││ Without Metrics Server → kubectl top fails ││ │└──────────────────────────────────────────────────────────────┘3.2 Checking Metrics Server
Section titled “3.2 Checking Metrics Server”# Check if Metrics Server is installedk -n kube-system get pods | grep metrics-server
# Check metrics APIk get apiservices | grep metrics
# If not installed, top commands will failk top nodes # Error: Metrics API not available3.3 Using kubectl top
Section titled “3.3 Using kubectl top”# Node resource usagek top nodes
# Pod resource usage (current namespace)k top pods
# Pod resource usage (all namespaces)k top pods -A
# Sort by CPUk top pods --sort-by=cpu
# Sort by memoryk top pods --sort-by=memory
# Per-container usagek top pods --containers
# Specific podk top pod <pod-name>Stop and think: If
kubectl top podsshows a pod using 200m CPU, but its limit is 100m, what is likely happening to the application running inside that pod?
3.4 Interpreting Metrics
Section titled “3.4 Interpreting Metrics”┌──────────────────────────────────────────────────────────────┐│ METRIC INTERPRETATION ││ ││ NAME CPU(cores) MEMORY(bytes) ││ my-pod 100m 256Mi ││ ││ CPU: 100m = 100 millicores = 0.1 CPU core ││ 1000m = 1 core ││ ││ Memory: Mi = mebibytes (1024 * 1024 bytes) ││ Gi = gibibytes ││ ││ Compare against requests/limits: ││ If usage >> requests: might get OOMKilled ││ If usage >> limit: will get OOMKilled or CPU throttled ││ │└──────────────────────────────────────────────────────────────┘3.5 Resource Comparison
Section titled “3.5 Resource Comparison”# Compare actual usage vs requests# Step 1: Get requestsk get pod <pod> -o jsonpath='{.spec.containers[0].resources.requests}'
# Step 2: Get actual usagek top pod <pod>
# If actual >> requests, pod is under-requested# If actual << requests, pod is over-requestedPart 4: Node-Level Logs
Section titled “Part 4: Node-Level Logs”4.1 Log Locations on Nodes
Section titled “4.1 Log Locations on Nodes”┌──────────────────────────────────────────────────────────────┐│ NODE LOG LOCATIONS ││ ││ Container logs: ││ /var/log/containers/<pod>_<ns>_<container>-<id>.log ││ ││ Pod logs (symlinks): ││ /var/log/pods/<ns>_<pod>_<uid>/ ││ ││ kubelet logs: ││ journalctl -u kubelet ││ ││ Container runtime logs: ││ journalctl -u containerd ││ journalctl -u docker (if using docker) ││ ││ System logs: ││ /var/log/syslog or /var/log/messages ││ journalctl ││ │└──────────────────────────────────────────────────────────────┘4.2 Accessing Node Logs
Section titled “4.2 Accessing Node Logs”# SSH to node firstssh <node>
# Container logs directlyls /var/log/containers/tail -f /var/log/containers/<pod>*.log
# kubelet logsjournalctl -u kubelet -fjournalctl -u kubelet --since "10 minutes ago"journalctl -u kubelet | grep -i error
# Container runtime logsjournalctl -u containerd -f
# System messagesdmesg | tail -50journalctl -xe4.3 Control Plane Component Logs
Section titled “4.3 Control Plane Component Logs”On control plane nodes:
# If using static pods (kubeadm)# Logs available via kubectlk -n kube-system logs kube-apiserver-<node>k -n kube-system logs kube-scheduler-<node>k -n kube-system logs kube-controller-manager-<node>k -n kube-system logs etcd-<node>
# Or directly on node via journalctl (if systemd services)journalctl -u kube-apiserverjournalctl -u kube-schedulerjournalctl -u kube-controller-managerjournalctl -u etcdPart 5: Logging Strategies for Troubleshooting
Section titled “Part 5: Logging Strategies for Troubleshooting”5.1 The Log Analysis Workflow
Section titled “5.1 The Log Analysis Workflow”┌──────────────────────────────────────────────────────────────┐│ LOG ANALYSIS WORKFLOW ││ ││ 1. Start with events ││ k describe <resource> | grep -A 20 Events ││ ││ 2. Check recent events cluster-wide ││ k get events --sort-by='.lastTimestamp' | tail ││ ││ 3. Get container logs ││ k logs <pod> ││ k logs <pod> --previous (if crashed) ││ ││ 4. Filter for errors ││ k logs <pod> | grep -i error ││ k logs <pod> | grep -i exception ││ ││ 5. Check timing ││ k logs <pod> --timestamps --since=10m ││ ││ 6. Check related components ││ If pod issues: check node ││ If network issues: check CNI, kube-proxy ││ If DNS issues: check CoreDNS ││ │└──────────────────────────────────────────────────────────────┘5.2 Filtering Log Output
Section titled “5.2 Filtering Log Output”# Search for errorsk logs <pod> | grep -i errork logs <pod> | grep -i exceptionk logs <pod> | grep -i fatal
# Exclude noisek logs <pod> | grep -v "INFO"k logs <pod> | grep -v "health check"
# Complex filtersk logs <pod> | grep -E "error|warning|failed"
# With timestamps and filteringk logs <pod> --timestamps | grep "2024-01-15T10:3"
# Count error occurrencesk logs <pod> | grep -c error5.3 Multi-Pod Log Analysis
Section titled “5.3 Multi-Pod Log Analysis”# Logs from all pods in deploymentk logs deployment/<name> --all-containers
# Aggregate logs from multiple pods with labelsk logs -l app=frontend --all-containers
# Using stern (not built-in, but useful)# stern <pod-name-pattern>
# Workaround: loop through podsfor pod in $(k get pods -l app=nginx -o name); do echo "=== $pod ===" k logs $pod --tail=5done5.4 Correlating Events and Logs
Section titled “5.4 Correlating Events and Logs”# Get event timek get events --field-selector involvedObject.name=my-pod
# Note the timestamp, then check logs around that timek logs my-pod --since-time="2024-01-15T10:30:00Z"
# Or use relative timek logs my-pod --since=5mPart 6: Monitoring Patterns
Section titled “Part 6: Monitoring Patterns”6.1 Proactive Monitoring Commands
Section titled “6.1 Proactive Monitoring Commands”# Quick cluster health checkk get nodesk get pods -A | grep -v Runningk top nodesk get events -A --field-selector type=Warning
# Create a simple monitoring scriptwatch -n 5 'kubectl get pods -A | grep -v Running | grep -v Completed'6.2 Resource Pressure Detection
Section titled “6.2 Resource Pressure Detection”# Check for node pressurek describe nodes | grep -E "MemoryPressure|DiskPressure|PIDPressure"
# Check for pods using excessive resourcesk top pods -A --sort-by=memory | head -10k top pods -A --sort-by=cpu | head -10
# Check for pending pods (might indicate resource shortage)k get pods -A --field-selector=status.phase=Pending6.3 Debugging with Temporary Pods
Section titled “6.3 Debugging with Temporary Pods”# Create debug pod with networking toolsk run debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash
# Simple debug podk run debug --image=busybox:1.36 --rm -it --restart=Never -- sh
# Debug with specific service accountk run debug --image=busybox:1.36 --rm -it --restart=Never --serviceaccount=<sa> -- sh
# Debug in specific namespacek run debug -n <namespace> --image=busybox:1.36 --rm -it --restart=Never -- shCommon Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
Forgetting --previous | Can’t see crash logs | Use --previous for CrashLoopBackOff |
| Not filtering logs | Too much noise | Use grep, --since, --tail |
| Ignoring events | Miss obvious clues | Always check events first |
| Missing multi-container | Logs from wrong container | Use -c <container> or --all-containers |
| Events expired | No historical data | Check logs immediately after incident |
| No Metrics Server | kubectl top fails | Install Metrics Server |
Q1: Investigating a Crash
Section titled “Q1: Investigating a Crash”You deploy a new version of your application. The pod enters a CrashLoopBackOff state. When you run kubectl logs my-app, you only see a single line saying “Starting application…”, but the application keeps crashing. How do you find the actual error that caused the crash, and why does the standard log command hide it?
Answer
You must use the command kubectl logs my-app --previous to see the actual error. When a pod is in a CrashLoopBackOff state, it means the container is repeatedly dying and being restarted by the kubelet. The standard kubectl logs command only shows the output of the currently running container instance. Since the current instance just started and hasn’t crashed yet, you only see the initial startup message. By appending the --previous flag, you instruct Kubernetes to retrieve the logs from the container instance that most recently terminated, which will contain the stack trace or error message explaining why it died.
Q2: Missing Historical Context
Section titled “Q2: Missing Historical Context”A developer reports that their batch job failed sometime over the weekend, about 48 hours ago. They ask you to check the Kubernetes events to see if there were any node scheduling issues or image pull errors at that time. When you run kubectl get events --field-selector involvedObject.name=batch-job, it returns “No resources found”. What happened to the events, and how could you have preserved them?
Answer
Kubernetes events have a default time-to-live (TTL) of exactly one hour. They are not designed for long-term auditing or historical logging; rather, they are transient objects stored in etcd to provide immediate feedback about cluster operations. After the one-hour window expires, the Kubernetes garbage collector automatically deletes them to prevent etcd from running out of storage space. To preserve events for long-term analysis, you must export them to an external logging system or observability platform using a dedicated event-shipping tool (like eventrouter or a fluentd plugin) before they expire.
Q3: Missing Resource Metrics
Section titled “Q3: Missing Resource Metrics”You are trying to determine if a specific deployment needs its memory limits increased. You run the command kubectl top pods -n production, but the API server returns an error stating “metrics not available”. You verify that your kubeconfig is correct and you have the necessary RBAC permissions. What cluster component is likely missing or failing, and why does this command depend on it?
Answer
The cluster is likely missing the Metrics Server, or the Metrics Server deployment is currently unhealthy. The kubectl top command does not directly query the nodes or the kubelet for resource utilization data. Instead, it queries the metrics.k8s.io API, which must be served by an aggregation layer component. The Metrics Server acts as this component; it continuously polls the kubelet API on each node, aggregates the CPU and memory usage data, and exposes it through the metrics API. Without this server running and correctly registered with the main API server, the metrics endpoint will not exist, causing the command to fail.
Q4: Aggregating Sidecar Logs
Section titled “Q4: Aggregating Sidecar Logs”You have a pod named web-app that runs your main application container and a sidecar container running a log forwarding agent. You suspect the log forwarder is failing to authenticate, but you also want to see if the main application is logging any connection errors at the same time. How can you view the logs from both containers simultaneously in a single command, and why is this useful?
Answer
You can view the combined output by using the command kubectl logs web-app --all-containers=true. By default, if you run kubectl logs against a multi-container pod without specifying a container, Kubernetes will either prompt you to specify one or just pick the first one listed in the pod spec. Using the --all-containers flag instructs the API server to fetch the logs from every container within the specified pod and interleave them in the output. This is highly useful for correlating events across tightly coupled containers, as it allows you to see the exact sequence of events occurring across both the main application and its sidecar.
Q5: Bypassing the API Server
Section titled “Q5: Bypassing the API Server”The Kubernetes API server is currently unresponsive due to a certificate expiration issue. You have SSH access to the worker node where a critical database pod is running, and you need to check its logs immediately to ensure it hasn’t corrupted its data volume. How do you view the logs for this specific container directly on the node, and what component manages these files?
Answer
You can view the logs directly by navigating to the /var/log/containers/ directory on the worker node. Inside this directory, you will find log files named using the pattern <pod-name>_<namespace>_<container-name>-<container-id>.log. These files are actually symbolic links that point to the actual log files generated by the container runtime (such as containerd or CRI-O). The kubelet is responsible for managing these symlinks and also handles the log rotation based on its configured maximum size and file count limits, ensuring the node’s disk does not fill up.
Q6: Diagnosis Strategy
Section titled “Q6: Diagnosis Strategy”A pod is stuck in the Pending state. A junior administrator suggests running kubectl logs my-pending-pod to see what the application is complaining about. Why is this the wrong approach, and what command should they run instead to understand why the pod is stuck?
Answer
Running kubectl logs is the wrong approach because a pod in the Pending state has not yet been scheduled to a node, or its containers have not yet started executing. Since no container runtime is running the application process, there is absolutely no standard output or standard error to capture, meaning the logs command will return an error or nothing at all. Instead, the administrator should check the Kubernetes events by running kubectl describe pod my-pending-pod or kubectl get events. Events will reveal cluster-level scheduling issues, such as a lack of node resources, failed persistent volume claims, or unmet node affinity rules that are preventing the pod from starting.
Hands-On Exercise: Log and Metric Analysis
Section titled “Hands-On Exercise: Log and Metric Analysis”Scenario
Section titled “Scenario”Practice using logs and metrics for troubleshooting.
# Create test namespacek create ns logging-lab
# Create a pod that generates logscat <<'EOF' | k apply -f -apiVersion: v1kind: Podmetadata: name: log-generator namespace: logging-labspec: containers: - name: logger image: busybox:1.36 command: - sh - -c - | i=0 while true; do echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Log message $i" if [ $((i % 5)) -eq 0 ]; then echo "$(date '+%Y-%m-%d %H:%M:%S') ERROR: Something went wrong at iteration $i" >&2 fi i=$((i+1)) sleep 2 doneEOF
# Create a crashy pod for --previous democat <<'EOF' | k apply -f -apiVersion: v1kind: Podmetadata: name: crashy-pod namespace: logging-labspec: containers: - name: crasher image: busybox:1.36 command: - sh - -c - | echo "Starting up..." sleep 5 echo "About to crash!" exit 1EOFTask 1: Basic Log Operations
Section titled “Task 1: Basic Log Operations”# View logsk logs -n logging-lab log-generator
# Follow logsk logs -n logging-lab log-generator -f# (Ctrl+C to stop)
# Last 10 linesk logs -n logging-lab log-generator --tail=10
# With timestampsk logs -n logging-lab log-generator --tail=10 --timestampsTask 2: Filtering Logs
Section titled “Task 2: Filtering Logs”# Find errors onlyk logs -n logging-lab log-generator | grep ERROR
# Count errorsk logs -n logging-lab log-generator | grep -c ERROR
# Exclude INFO messagesk logs -n logging-lab log-generator | grep -v INFOTask 3: Previous Container Logs
Section titled “Task 3: Previous Container Logs”# Wait for crashy-pod to crash and restartk get pod -n logging-lab crashy-pod -w
# When it shows CrashLoopBackOff or restarts, check previous logsk logs -n logging-lab crashy-pod --previousTask 4: Events Analysis
Section titled “Task 4: Events Analysis”# All events in namespacek get events -n logging-lab --sort-by='.lastTimestamp'
# Describe pod for eventsk describe pod -n logging-lab crashy-pod | grep -A 10 Events
# Watch for new eventsk get events -n logging-lab -wTask 5: Metrics (if Metrics Server installed)
Section titled “Task 5: Metrics (if Metrics Server installed)”# Node metricsk top nodes
# Pod metricsk top pods -n logging-lab
# All pods by memoryk top pods -A --sort-by=memory | headSuccess Criteria
Section titled “Success Criteria”- Viewed live logs with follow
- Filtered logs for errors
- Retrieved previous container logs
- Analyzed events for crash information
- Used kubectl top (if Metrics Server available)
Cleanup
Section titled “Cleanup”k delete ns logging-labPractice Drills
Section titled “Practice Drills”Drill 1: View Last N Logs (30 sec)
Section titled “Drill 1: View Last N Logs (30 sec)”# Task: Show last 20 log linesk logs <pod> --tail=20Drill 2: Logs with Timestamps (30 sec)
Section titled “Drill 2: Logs with Timestamps (30 sec)”# Task: Show logs with timestampsk logs <pod> --timestampsDrill 3: Previous Container Logs (30 sec)
Section titled “Drill 3: Previous Container Logs (30 sec)”# Task: Get logs from crashed containerk logs <pod> --previousDrill 4: Multi-Container Logs (1 min)
Section titled “Drill 4: Multi-Container Logs (1 min)”# Task: Get logs from specific containerk get pod <pod> -o jsonpath='{.spec.containers[*].name}'k logs <pod> -c <container-name>Drill 5: Recent Events (30 sec)
Section titled “Drill 5: Recent Events (30 sec)”# Task: Show events sorted by timek get events --sort-by='.lastTimestamp'Drill 6: Warning Events (30 sec)
Section titled “Drill 6: Warning Events (30 sec)”# Task: Show only warning eventsk get events --field-selector type=WarningDrill 7: Node Metrics (30 sec)
Section titled “Drill 7: Node Metrics (30 sec)”# Task: Show node resource usagek top nodesDrill 8: Pod Metrics Sorted (30 sec)
Section titled “Drill 8: Pod Metrics Sorted (30 sec)”# Task: Show top memory-consuming podsk top pods -A --sort-by=memory | headPart 5 Summary
Section titled “Part 5 Summary”Congratulations on completing Part 5: Troubleshooting! You’ve learned:
- Methodology: Systematic approach to diagnosis
- Application Failures: Pod, container, and deployment issues
- Control Plane: API server, scheduler, controller manager, etcd
- Worker Nodes: kubelet, runtime, and node resources
- Networking: Pod connectivity, DNS, and CNI
- Services: ClusterIP, NodePort, LoadBalancer, Ingress
- Logging & Monitoring: Logs, events, and metrics
With 30% of the CKA exam weight, troubleshooting is critical. Practice the drills until they’re second nature.
Next Steps
Section titled “Next Steps”Continue to Part 6: Mock Exams to test your knowledge with realistic exam scenarios.