
Module 0.3: Process & Resource Survival Guide

Hands-On Lab Available
Ubuntu | intermediate | 35 min | hosted on Killercoda

Everyday Use | Complexity: [QUICK] | Time: 40 min

After this module, you will be able to:

  • Monitor system resources using top, htop, free, df, and du
  • Identify resource-hungry processes and decide whether to optimize or kill them
  • Explain load average, CPU steal, and memory pressure in the context of K8s node health
  • Diagnose a slow system by systematically checking CPU, memory, disk, and network

Your Linux machine is running dozens — sometimes hundreds — of programs right now, all at the same time. Your web browser, your terminal, your SSH session, that forgotten download script from two days ago. These running programs are called processes, and knowing how to find them, watch them, and stop them is one of the most important skills in DevOps.

Here is why:

  • Debugging starts with processes — When something is slow or broken, the first question is always “what is running and how much is it eating?”
  • Servers do not fix themselves — A runaway process eating 100% CPU at 3am will not politely stop. You need to know how to kill it
  • Disk space vanishes — Containers, logs, and temp files will fill your disks. Knowing how to find what is eating space saves your production systems
  • Kubernetes runs processes — Every container is a process. kubectl exec, pod termination, resource limits — they all map back to what you will learn here

Think of it this way: Module 0.1 taught you to navigate the house. This module teaches you to check who is home, what they are doing, and politely (or forcefully) ask them to leave.


  • Your system has a family tree of processes — Every process has a parent. The very first process (PID 1) is the ancestor of everything else. When you open a terminal and run ls, your shell is the parent and ls is the child. Kill PID 1 and the entire system goes down. In Kubernetes, your app becomes PID 1 inside its container — which is why signal handling matters.

  • kill does not always kill — Despite the name, the kill command actually sends signals. kill without options sends SIGTERM, which is more like a polite “please shut down.” The process can ignore it entirely! Only kill -9 (SIGKILL) is truly fatal, because the kernel handles it and the process never even sees it coming.

  • Linux invented “zombie” processes — A zombie process is one that has finished running but is still hanging around in the process table because its parent has not picked up its exit status. Zombies use no CPU or memory — they are literally just a name in a list. But if thousands accumulate, you run out of process IDs and nothing new can start.

  • The /proc directory is a window into every process — Each running process gets a folder at /proc/<PID>/ full of live information. Want to know exactly what command started process 1234? Read /proc/1234/cmdline. Want its environment variables? /proc/1234/environ. This is not a log file — it is the kernel telling you what is happening right now.
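A quick way to see this for yourself is to inspect your current shell through its own /proc entry ($$ expands to the shell's PID). This is a sketch that works on any Linux with /proc mounted:

```shell
# cmdline is NUL-separated, so translate NULs to spaces before printing
tr '\0' ' ' < "/proc/$$/cmdline"; echo

# The first lines of status include the process name and state
head -n 3 "/proc/$$/status"

# Every file descriptor the shell currently has open
ls "/proc/$$/fd"
```

None of this is a log or a cache: each read goes straight to the kernel's live process table.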


A process is simply a program that is currently running. When you type ls in your terminal, the ls program loads into memory, does its work, and exits. While it is running, it is a process.

Every process gets:

  • A PID (Process ID) — a unique number that identifies it
  • A parent (PPID) — the process that started it
  • Resources — memory, CPU time, open files

Here is a simple analogy: a program on disk is like a recipe in a cookbook. A process is what happens when you actually start cooking — you have ingredients on the counter, pots on the stove, and timers running.
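To see the recipe-versus-cooking distinction in action, start the same program twice: one file on disk, two independent processes.

```shell
# One program binary, two separate processes with distinct PIDs
sleep 30 & p1=$!
sleep 30 & p2=$!
ps -o pid,ppid,cmd -p "$p1,$p2"   # same CMD and same parent, different PIDs
kill "$p1" "$p2"                  # clean up
```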


The ps command takes a snapshot of running processes — think of it as a photograph, not a video.

Terminal window
# Show processes in your current terminal session
ps

Output looks like:

PID TTY TIME CMD
1234 pts/0 00:00:00 bash
5678 pts/0 00:00:00 ps

That is just your shell and the ps command itself. Not very exciting. To see everything:

Terminal window
# Show ALL processes from ALL users
ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 171584 13324 ? Ss Mar20 0:15 /sbin/init
root 2 0.0 0.0 0 0 ? S Mar20 0:00 [kthreadd]
www-data 1500 0.5 2.0 500000 40000 ? S Mar20 1:30 nginx: worker
you 3456 0.0 0.1 23456 5678 pts/0 Ss 10:00 0:00 -bash

Here is what each column means:

| Column | What It Tells You |
| --- | --- |
| USER | Who owns the process |
| PID | The process ID number |
| %CPU | How much CPU it is using right now |
| %MEM | How much memory it is using |
| VSZ | Virtual memory size (includes shared libraries — often misleadingly large) |
| RSS | Resident Set Size — actual physical memory used (the number you care about) |
| TTY | Which terminal it is attached to (? means it is a background service) |
| STAT | Process state (more on this below) |
| TIME | Total CPU time consumed |
| COMMAND | The command that started the process |

Pause and predict: Run ps aux now. Find the process using the most memory. What is its COMMAND? Is the RSS what you expected?

Do not scroll through hundreds of lines. Filter:

Terminal window
# Find all processes with "nginx" in the name
ps aux | grep nginx
# Better: use pgrep (no grep noise in results)
pgrep -a nginx

You will see letters in the STAT column:

| Code | Meaning |
| --- | --- |
| S | Sleeping — waiting for something (most processes) |
| R | Running — actively using the CPU right now |
| T | Stopped — paused (you will learn how below) |
| Z | Zombie — finished but parent has not cleaned up |
| D | Disk sleep — waiting for I/O, cannot be interrupted |

A lowercase s after the state (like Ss) means it is a session leader, and + means it is in the foreground.
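You can watch a process move between these states yourself. This sketch pauses a background sleep with SIGSTOP and reads its STAT column:

```shell
sleep 60 &             # a new process, state S (sleeping)
pid=$!
kill -STOP "$pid"      # pause it
ps -o stat= -p "$pid"  # prints T (stopped)
kill -CONT "$pid"      # resume it (back to S)
kill "$pid"            # clean up
```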


While ps is a snapshot, top is a live dashboard that updates every few seconds.

Terminal window
# Launch top
top

You will see something like:

top - 10:30:00 up 3 days, 2:15, 2 users, load average: 0.52, 0.58, 0.59
Tasks: 143 total, 2 running, 140 sleeping, 0 stopped, 1 zombie
%Cpu(s): 5.0 us, 2.0 sy, 0.0 ni, 92.0 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7953.5 total, 2345.2 free, 3210.1 used, 2398.2 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 4320.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1500 www-data 20 0 500000 40000 8192 S 5.0 2.0 1:30.00 nginx
2100 mysql 20 0 1200000 300000 12000 S 3.0 15.0 12:45.00 mysqld

The top section gives you the system overview:

  • load average: Three numbers (1-min, 5-min, 15-min). If these are higher than your CPU count, the system is overloaded. A 4-core machine with load average 4.0 is at capacity
  • Tasks: How many processes exist and in which states
  • %Cpu: us is user programs, sy is kernel, id is idle. Two critical metrics for cloud/K8s are wa (waiting for disk I/O) and st (CPU steal). CPU Steal happens in virtualized environments (like AWS/GCP or VMs) when the hypervisor takes CPU cycles away from your VM to give to another tenant. If st is consistently high, your node is being starved by the cloud provider (“noisy neighbor” problem).
  • MiB Mem / Swap: This shows system memory. Memory pressure occurs when your available memory drops near zero and the system starts using Swap (writing memory pages to the slow hard drive). In Kubernetes, memory pressure is critical: if a node runs out of memory, the kernel’s OOM (Out Of Memory) Killer will forcefully terminate pods (SIGKILL) to save the node.
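The load-versus-cores check above is easy to script. This sketch reads the standard Linux interfaces directly:

```shell
# Compare the 1-minute load average against the CPU core count
cores=$(nproc)
load1=$(cut -d ' ' -f 1 /proc/loadavg)
echo "1-min load: $load1 on $cores cores"

# The shell cannot compare floats, so let awk decide
awk -v l="$load1" -v c="$cores" 'BEGIN { if (l > c) print "overloaded"; else print "ok" }'
```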

Useful keys while top is running:

| Key | Action |
| --- | --- |
| q | Quit |
| M | Sort by memory usage |
| P | Sort by CPU usage |
| k | Kill a process (it will ask for the PID) |
| 1 | Toggle showing individual CPU cores |

Pause and predict: Launch top right now. What is your current load average? How many CPUs do you have? Is your system overloaded? (Hint: compare load average to CPU count.)

htop is an improved version of top with colors, mouse support, and a much friendlier interface. It may not be installed by default:

Terminal window
# Install htop
# Debian/Ubuntu:
sudo apt install htop -y
# RHEL/Fedora:
sudo dnf install htop -y
# Run it
htop

Why htop is better for beginners:

  • Color-coded CPU and memory bars at the top
  • You can scroll through the process list with arrow keys
  • Press F5 to see a tree view (which process started which)
  • Press F9 to send a signal to a process
  • Press / to search for a process by name
  • Press F6 to choose how to sort

For now, know that htop exists and try it. In your day-to-day work, most people reach for htop over top.

While top shows memory usage per process, the free command gives you the system-wide memory picture instantly.

Terminal window
# Show memory in human-readable format (megabytes/gigabytes)
free -h
total used free shared buff/cache available
Mem: 7.8Gi 3.1Gi 2.3Gi 45Mi 2.3Gi 4.2Gi
Swap: 2.0Gi 0B 2.0Gi

The most important column is available, not free. Linux intentionally uses “free” memory to cache disk files (buff/cache) to speed up the system. If an application needs memory, Linux instantly drops the cache and hands over the RAM. The available column tells you how much memory can actually be given to new processes before the system starts swapping to disk.
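free gets this figure from the kernel's MemAvailable estimate in /proc/meminfo (present since kernel 3.14), so a script can read it directly:

```shell
# MemAvailable (in kB) is the same number free -h shows under "available"
awk '/^MemAvailable:/ { printf "available: %.1f GiB\n", $2 / 1024 / 1024 }' /proc/meminfo
```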


Killing Processes: Signals, kill, and killall


Sometimes a process needs to stop. Maybe it is stuck, eating too much memory, or you just do not need it anymore. This is where signals come in.

Signals are messages the operating system can deliver to a process. Think of them as tapping someone on the shoulder (SIGTERM) vs. physically dragging them out of the room (SIGKILL).

The two most important signals:

| Signal | Number | What Happens |
| --- | --- | --- |
| SIGTERM | 15 | “Please shut down gracefully.” The process receives this and can clean up — save files, close connections, finish writes. This is the polite way. |
| SIGKILL | 9 | “You are done. Now.” The kernel terminates the process instantly. No cleanup, no saving, no last words. Use this only when SIGTERM fails. |
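You can watch the difference in a few lines of bash. The child below traps SIGTERM and shuts down cleanly; had we sent SIGKILL instead, the trap would never run (a sketch, not production code):

```shell
# The "sleep & wait" idiom lets bash act on the signal immediately
bash -c 'trap "echo caught SIGTERM, cleaning up; exit 0" TERM; sleep 5 & wait' &
pid=$!
sleep 1
kill "$pid"   # SIGTERM: the trap fires and the child exits gracefully
wait "$pid"   # exit status 0, because the child chose its own exit code
```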

Other useful signals:

| Signal | Number | What Happens |
| --- | --- | --- |
| SIGHUP | 1 | “Hangup” — many services reload their config when they receive this |
| SIGINT | 2 | Interrupt — this is what Ctrl+C sends |
| SIGSTOP | 19 | Pause the process (cannot be caught or ignored) |
| SIGCONT | 18 | Resume a paused process |

Stop and think: Before reading on — if you run kill on a sleep process, will it terminate or ignore the signal? Why? Try it and check.

Despite the name, kill just sends a signal. By default, it sends SIGTERM:

Terminal window
# Send SIGTERM (the default — always try this first)
kill 1234
# Explicitly send SIGTERM
kill -15 1234
kill -TERM 1234
# Send SIGKILL (the nuclear option — use only if SIGTERM fails)
kill -9 1234
kill -KILL 1234
# Reload a service's config
kill -HUP 1234

If you know the process name but not the PID:

Terminal window
# Kill all processes named "nginx"
killall nginx
# Kill by partial name match
pkill sleep
# Kill all processes by a specific user
pkill -u username

The golden rule: SIGTERM first, SIGKILL second


Always follow this order:

Step 1: kill <PID> # Sends SIGTERM — give it a few seconds
Step 2: (wait 5 seconds)
Step 3: kill -9 <PID> # SIGKILL only if still running

Why? Because SIGTERM lets the process clean up. A database receiving SIGTERM can finish writing transactions and flush to disk. Hit it with SIGKILL and you might corrupt data.

This exact pattern happens in Kubernetes: when a pod is deleted, Kubernetes sends SIGTERM, waits 30 seconds (the terminationGracePeriodSeconds), and then sends SIGKILL if the process is still running.
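The same escalation is easy to capture in a shell function. graceful_kill below is a hypothetical helper (not a standard command) sketching the TERM-wait-KILL pattern:

```shell
# Send SIGTERM, wait up to $2 seconds (default 5), then SIGKILL as a last resort
graceful_kill() {
  local pid=$1 grace=${2:-5}
  kill "$pid" 2>/dev/null || return 0         # no such process (or no permission)
  for _ in $(seq "$grace"); do
    kill -0 "$pid" 2>/dev/null || return 0    # it exited during the grace period
    sleep 1
  done
  echo "PID $pid ignored SIGTERM, sending SIGKILL"
  kill -9 "$pid" 2>/dev/null
}
```

A well-behaved process dies on the first SIGTERM, so the function returns quickly; one that ignores TERM rides out the grace period and is then SIGKILLed, which is exactly what Kubernetes does to your pods.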


So far, every command you have run takes over your terminal until it finishes. But what if you want to run something that takes a long time and keep using your terminal?

Running a command in the background with &


Add & at the end of any command:

Terminal window
# Run sleep in the background
sleep 300 &
# Output: [1] 12345
# [1] is the job number, 12345 is the PID

Your terminal is immediately available again. The process runs silently in the background.

Terminal window
# See all background jobs in this terminal
jobs
# Output:
# [1]+ Running sleep 300 &
Terminal window
# Start a long-running command
sleep 300
# Oh no, it is blocking my terminal! Press Ctrl+Z to PAUSE it
# Output: [1]+ Stopped sleep 300
# Now resume it in the BACKGROUND
bg
# Output: [1]+ sleep 300 &
# Or bring a background job back to the FOREGROUND
fg
# (Now sleep 300 is running in the foreground again)
# If you have multiple jobs, specify the job number
fg %1
bg %2

Here is a trap that catches everyone at least once: you SSH into a server, start a long process, close your laptop, and the process dies. Why? When the SSH connection drops, the kernel sends SIGHUP to your login shell, and the shell forwards it to its child processes — killing your long-running command along with everything else.

nohup (short for “no hangup”) prevents this:

Terminal window
# This will survive even if you disconnect from SSH
nohup ./long-running-script.sh &
# Output goes to nohup.out by default
# Redirect it if you prefer
nohup ./long-running-script.sh > /tmp/my-output.log 2>&1 &

A quick reference:

| Action | Command |
| --- | --- |
| Run in background | command & |
| List background jobs | jobs |
| Pause foreground process | Ctrl+Z |
| Resume in background | bg or bg %N |
| Bring to foreground | fg or fg %N |
| Survive disconnect | nohup command & |

Stop and think: You accidentally ran a database migration in the foreground and it will take 20 minutes. You need your terminal back but cannot restart the migration. What two keystrokes solve this? Try it with sleep 1200 to verify.


Running out of disk space is one of the most common emergencies in DevOps. Logs grow, container images pile up, and temp files multiply. You need two tools: df to check overall disk space and du to find what is eating it.

Terminal window
# Show disk usage in human-readable format
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 35G 13G 73% /
/dev/sda2 200G 180G 10G 95% /var
tmpfs 3.9G 0 3.9G 0% /dev/shm

The important columns:

| Column | Meaning |
| --- | --- |
| Size | Total size of the filesystem |
| Used | How much is consumed |
| Avail | How much is free |
| Use% | Percentage used — start worrying above 80%, panic above 95% |
| Mounted on | Where this filesystem is accessible in the directory tree |

In the example above, /var is at 95% — that is a problem waiting to happen. Logs are usually stored in /var/log, so this is extremely common on servers.

Pause and predict: Run df -h on your system. Which filesystem has the highest Use%? Predict which directory is the biggest consumer before running du.

df told you which disk is full. du helps you find which directories are the culprits:

Terminal window
# Show the size of the current directory
du -sh .
# Show sizes of immediate subdirectories, sorted by size
du -sh /* 2>/dev/null | sort -rh | head -10
15G /var
8G /usr
5G /home
3G /opt
1G /tmp

Now drill down into the biggest offender:

Terminal window
# Dig into /var
du -sh /var/* 2>/dev/null | sort -rh | head -10
12G /var/log
2G /var/lib
500M /var/cache

Found it! /var/log has 12GB. One more level:

Terminal window
du -sh /var/log/* 2>/dev/null | sort -rh | head -5
10G /var/log/syslog.1
1.5G /var/log/auth.log
500M /var/log/kern.log

A 10GB rotated syslog. That is your culprit.

The pattern: Start wide with df -h, then drill down with du -sh <dir>/* | sort -rh | head.

To see what physical disks and partitions exist on the system:

Terminal window
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 50G 0 disk
├─sda1 8:1 0 49G 0 part /
└─sda2 8:2 0 1G 0 part [SWAP]
sdb 8:16 0 200G 0 disk
└─sdb1 8:17 0 200G 0 part /var

This shows you the physical layout: sda is a 50GB disk with two partitions, and sdb is a 200GB disk mounted at /var. This is useful when you need to understand why a particular mount point has limited space — it might be on a separate, smaller disk.


In a traditional server environment, hundreds of background processes and services run together. In Kubernetes, the environment is radically simplified: a container is essentially just an isolated process (or a small group of them) running on the host node.

  • PID 1 in Containers: The ENTRYPOINT or CMD of your Dockerfile becomes PID 1 inside the container. If that primary process dies, the container terminates. If your PID 1 doesn’t know how to pass signals (like SIGTERM) to its children, graceful pod shutdown fails, resulting in forced SIGKILLs and potential data corruption.
  • Resource Limits: When you set resources.limits.memory on a K8s Pod, you are telling the Linux kernel’s cgroups feature to watch that specific process tree. If the processes exceed the allocated limit, the kernel invokes the OOMKiller to terminate the container immediately.
  • Execing into Pods: When you run kubectl exec -it my-pod -- bash, you are not SSHing into a virtual machine. You are simply asking the container runtime (like containerd) to start a new bash process on the node and attach it to the same isolated namespaces as the pod’s PID 1.

| Mistake | Why It Is Bad | What To Do Instead |
| --- | --- | --- |
| Using kill -9 as the first option | Skips graceful shutdown — databases can corrupt, files can be half-written | Always try kill <PID> (SIGTERM) first, wait a few seconds, then kill -9 only if needed |
| Pressing Ctrl+C on a long task you forgot to start with & | Kills the process you actually wanted running | Use Ctrl+Z then bg to move it to the background without stopping it |
| Running du -sh / without 2>/dev/null | Permission-denied errors flood your screen | Always silence stderr: du -sh /* 2>/dev/null \| sort -rh |
| Ignoring df until the disk is 100% full | Services crash, logs stop writing, databases corrupt | Set up monitoring or check df -h regularly on servers |
| Using nohup but forgetting & | The command still runs in the foreground and blocks your terminal | Always combine them: nohup command & |
| Killing a process by PID without checking first | You might kill the wrong process if the PID was reused | Always run ps -p <PID> -o cmd first to confirm what you are about to kill |

You are writing a bash script to monitor a specific application and restart it if it crashes. Should your script use ps or top to check if the process is running?

Show Answer

Your script should use ps (or pgrep). ps takes a single, instantaneous snapshot of the process table and exits, making it perfect for scripts to parse. top is an interactive, continuously updating dashboard designed for human eyes, which will block your script from continuing and flood the output with terminal escape characters. Using ps ensures your script gets exactly the data it needs in a single pass without hanging.
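A minimal sketch of that check; the process name and restart command here are placeholders, not real services:

```shell
APP_NAME="my-app-worker"   # hypothetical process name
if pgrep -x "$APP_NAME" > /dev/null; then
  echo "$APP_NAME is running"
else
  echo "$APP_NAME is down, restarting"
  # ./start-my-app.sh &    # placeholder for your real restart command
fi
```

Run it from cron or a systemd timer; pgrep -x matches the exact process name only, so a short substring will not accidentally match an unrelated process.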

You notice a runaway Python script consuming 100% CPU. You run kill 5678, but when you check 10 seconds later, the process is still running. What happened, and what is your exact next step?

Show Answer

Running kill without any flags sends a SIGTERM (signal 15) to the process, which is merely a polite request to shut down. The process is allowed to catch this signal to perform cleanup tasks, or it can ignore it entirely if it is stuck in an infinite loop or poorly written. Because the polite request was ignored, your next step is to run kill -9 5678 to send a SIGKILL. This signal is handled directly by the Linux kernel, which forcefully terminates the process immediately without giving it a chance to ignore the command.

You log into a production database server via SSH and start a critical database migration script using ./migrate-db.sh &. You then close your laptop to commute home. When you reconnect later, the migration has failed halfway through. Why did this happen despite running it in the background?

Show Answer

Adding the & at the end of a command only runs it in the background of your current terminal session, allowing you to type other commands, but it still belongs to your session’s process tree. When you closed your laptop, the SSH connection dropped, causing the server to send a SIGHUP (hangup) signal to your terminal session and all of its child processes, terminating the migration instantly. To prevent this, you must run the command with nohup ./migrate-db.sh & (or use a multiplexer like tmux), which instructs the process to ignore the SIGHUP signal and keep running after you disconnect.

An alerting system pages you at 2 AM because a Kubernetes worker node is entirely unresponsive. You manage to SSH in, run df -h, and notice the /var partition is at 99% capacity. How do you systematically identify the exact file causing the issue?

Show Answer

You need to drill down into the file system hierarchically to locate the space consumer because df only shows top-level partitions, not individual directories. Start broadly by running du -sh /var/* 2>/dev/null | sort -rh | head -10 to find the largest directories within /var. Once you identify the largest directory (often /var/log), repeat the command on that specific sub-directory to narrow it down further. You continue this pattern of investigating the largest path until you isolate the specific massive file. Once found, you can safely truncate the runaway log file (e.g., > /var/log/huge-file.log) to instantly restore node health without breaking file handles.

You are viewing a massive, continuously scrolling application log using tail -f, but you urgently need to run a curl command to test the application API without losing your place in the logs. How do you temporarily get your prompt back and then return to the logs?

Show Answer

You should press Ctrl+Z to send a SIGTSTP (suspend) signal to the active tail process. Unlike Ctrl+C (SIGINT), which terminates the process entirely and loses your state, Ctrl+Z simply freezes the process in memory and returns control of the terminal to you. You can then comfortably run your curl command to test the API. Once finished, type the fg command to bring the frozen tail process back to the foreground, resuming the log output exactly where you left off.

You run top on a Kubernetes node that is performing extremely poorly. The CPU usage shows %Cpu(s): 0.5 us, 1.0 sy, 0.0 id, 0.0 wa, 98.5 st. Memory is mostly available. What is the root cause of the slowness, and can you fix it by killing processes?

Show Answer

The root cause of the slowness is severe CPU Steal (st), which is sitting at 98.5%. This means the underlying cloud provider’s hypervisor is taking almost all of the physical CPU cycles away from your virtual machine to serve other tenants on the same shared hardware (the “noisy neighbor” problem). You cannot fix this issue by killing processes on your Linux node because the constraint is external to your operating system. Your only resolutions are to wait for the hypervisor to allocate resources back, cordon and drain the node to move your pods elsewhere, or upgrade your instance type to one with dedicated CPU cores.


Objective: Practice finding, monitoring, and killing processes, plus investigating disk usage.

Environment: Any Linux system (VM, WSL, or native installation)

Terminal window
# Start a sleep process in the background
sleep 600 &
# The shell tells you the job number and PID:
# [1] 12345
# Save the PID for later (your number will be different)
MY_PID=$!
echo "My background process PID is: $MY_PID"
Terminal window
# Find your sleep process using ps
ps aux | grep "sleep 600"
# Cleaner: use ps to show just that PID
ps -p $MY_PID -o pid,user,stat,cmd
# See it in the process tree
ps auxf | grep sleep

Verify that you can see the PID, that the STAT shows S (sleeping), and that the COMMAND is sleep 600.

Terminal window
# Launch top
top
# While top is running:
# 1. Press 'P' to sort by CPU
# 2. Press 'M' to sort by memory
# 3. Look for your sleep process (it will have near-zero CPU/MEM)
# 4. Press 'q' to quit

If you have htop installed, try that too — press / and type sleep to search for it.

Terminal window
# First, confirm the process is still running
ps -p $MY_PID -o pid,cmd
# Send SIGTERM (the polite way)
kill $MY_PID
# Verify it is gone
ps -p $MY_PID -o pid,cmd 2>/dev/null || echo "Process $MY_PID has been terminated."

If it were a stubborn process that did not respond to SIGTERM, you would follow up with kill -9 $MY_PID.

Terminal window
# Get the big picture
df -h
# Find the largest directories at the root level
du -sh /* 2>/dev/null | sort -rh | head -10
# Drill into the largest directory (adjust path based on your output)
du -sh /var/* 2>/dev/null | sort -rh | head -5
# Check what physical disks and partitions exist
lsblk
Terminal window
# Start three background sleeps
sleep 100 &
sleep 200 &
sleep 300 &
# List all jobs
jobs
# Bring the second one to the foreground
fg %2
# Pause it with Ctrl+Z
# Then resume it in the background
bg %2
# Kill all of them
killall sleep
# Verify they are all gone
jobs
  • Started a background sleep process and noted its PID
  • Found the process using ps aux | grep and confirmed its STAT code
  • Opened top (or htop) and located the process in the list
  • Killed the process with kill and verified it was gone with ps
  • Ran df -h and identified which filesystem has the most usage
  • Used du -sh to find the largest directory on the system
  • (Bonus) Used jobs, fg, bg, and killall to manage multiple background jobs

You can find and kill processes, and you know where your disk space went. But who is managing all the services that start at boot — your web servers, databases, and schedulers? Time to learn about the system that orchestrates everything.

Next: Module 0.4: Services & Logs Demystified