Module 7.2: Text Processing
Shell Scripting | Complexity: MEDIUM | Time: 30-35 min
Prerequisites
Before starting this module:
- Required: Module 7.1: Bash Fundamentals
- Required: Basic regex understanding
- Helpful: Module 6.2: Log Analysis
What You’ll Be Able to Do
After this module, you will be able to:
- Transform text using cut, sort, uniq, tr, and paste for log analysis
- Write awk one-liners for column extraction, filtering, and calculations
- Use sed for search-and-replace, line deletion, and in-place file editing
- Build text processing pipelines that combine multiple tools for complex transformations
Why This Module Matters
Linux is a text-based operating system. Configurations, logs, and data are mostly text. Mastering text processing tools lets you extract, transform, and analyze data without writing programs.
Understanding text processing helps you:
- Parse logs — Extract errors, patterns, metrics
- Transform data — Convert between formats
- Process command output — Parse kubectl, docker, git
- Automate analysis — Build reporting scripts
grep, sed, awk, and jq are the Swiss Army knives of DevOps.
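To make that concrete, here is a tiny end-to-end pipeline; the log lines are invented for illustration:

```shell
# Count occurrences of each log level in a stream of (made-up) log lines
printf '%s\n' \
  '2024-01-15 INFO started' \
  '2024-01-15 ERROR db timeout' \
  '2024-01-15 ERROR db timeout' \
  '2024-01-15 WARN slow query' |
grep -oE 'INFO|WARN|ERROR' |   # keep only the level word
sort | uniq -c | sort -rn      # count and rank by frequency
```

The most frequent level (here ERROR, with a count of 2) lands on top, with no scripting language required.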
Did You Know?
- grep was created in 1973 — Ken Thompson wrote it at Bell Labs. The name comes from the ed command g/re/p (global regex print).
- awk is a programming language — Named after its creators (Aho, Weinberger, Kernighan), awk has variables, functions, and control flow. Most users touch only a fraction of its features.
- sed is non-interactive ed — Created for batch editing, sed processes text line by line. The cryptic syntax (s/old/new/g) comes directly from ed.
- jq is “like sed for JSON” — Created in 2012, jq fills the gap for structured data that grep/sed/awk can’t handle elegantly.
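The second point is easy to underappreciate. Here is a small sketch of awk as a real language, with a user-defined function, a loop, and formatted output (no input file needed; everything runs in the BEGIN block):

```shell
# awk is a full language: user-defined functions, loops, printf.
# Note: functions are defined at the top level, outside BEGIN/END blocks.
awk '
function square(x) { return x * x }
BEGIN {
  for (i = 1; i <= 3; i++)
    printf "square(%d) = %d\n", i, square(i)
}'
# square(1) = 1
# square(2) = 4
# square(3) = 9
```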
grep: Pattern Matching
Basic Usage
Section titled “Basic Usage”# Search for patterngrep "error" file.txt
# Case insensitivegrep -i "error" file.txt
# Show line numbersgrep -n "error" file.txt
# Count matchesgrep -c "error" file.txt
# Files with matchesgrep -l "error" *.log
# Files without matchesgrep -L "error" *.logRegex Patterns
```bash
# Basic patterns
grep "^Start"   # Lines starting with "Start"
grep "end$"     # Lines ending with "end"
grep "^$"       # Empty lines
grep "."        # Any character

# Extended regex (-E)
grep -E "error|warning" file.txt   # OR
grep -E "[0-9]{3}"                 # Three digits
grep -E "https?"                   # http or https

# Perl regex (-P)
grep -P "\d{4}-\d{2}-\d{2}"   # Date pattern
grep -P "(?<=error: ).+"      # Lookbehind
```
Context and Inversion
```bash
# Lines before/after match
grep -B 3 "error"   # 3 lines before
grep -A 3 "error"   # 3 lines after
grep -C 3 "error"   # 3 lines on both sides

# Invert match
grep -v "debug" file.txt   # Lines WITHOUT "debug"

# Only matching part
grep -o "[0-9]*" file.txt   # Extract numbers only
```
Stop and think: `grep -B 3` shows the lines immediately preceding a match in the file. If multiple processes are writing to the same aggregate log stream asynchronously, does the physical line preceding the error guarantee a chronological or causal relation to the error itself?
Recursive Search
```bash
# Search in directory
grep -r "TODO" /path/to/code/

# With file pattern
grep -r --include="*.py" "import" .

# Exclude patterns
grep -r --exclude="*.log" "error" .
grep -r --exclude-dir=".git" "pattern" .
```
sed: Stream Editor
Substitution
```bash
# Basic substitution
sed 's/old/new/' file.txt    # First occurrence on each line
sed 's/old/new/g' file.txt   # All occurrences

# In-place editing
sed -i 's/old/new/g' file.txt

# Backup before editing
sed -i.bak 's/old/new/g' file.txt

# Case insensitive (GNU extension)
sed 's/old/new/gi' file.txt
```
Pause and predict: When using `sed -i` to edit a file in place, sed actually creates a temporary file, writes the modified content to it, and then renames it over the original file. How might this behavior affect file ownership, permissions, or symlinks compared to using a shell redirect (`>`)?
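One way to observe the rename behavior yourself, sketched with a throwaway temp file (GNU sed assumed; on macOS/BSD the in-place flag is spelled `-i ''`):

```shell
# GNU sed -i writes a temporary file and renames it over the original,
# so the file's inode number changes after the edit.
f=$(mktemp)
echo "old value" > "$f"
ls -i "$f"                 # note the inode number
sed -i 's/old/new/' "$f"
ls -i "$f"                 # a different inode: the file was replaced
cat "$f"                   # new value
rm -f "$f"
```

Because the path now points at a brand-new file, attributes set on the original (ownership, ACLs, hard links) may not carry over the way a `>` redirect would preserve them.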
Addressing
```bash
# Line numbers
sed '5s/old/new/' file.txt       # Only line 5
sed '1,10s/old/new/g' file.txt   # Lines 1-10

# Patterns
sed '/error/s/old/new/' file.txt   # Lines matching "error"
sed '/^#/d' file.txt               # Delete comment lines

# Ranges
sed '/start/,/end/d' file.txt   # Delete from "start" to "end"
```
Common Operations
```bash
# Delete lines
sed '/pattern/d' file.txt   # Lines matching pattern
sed '1d' file.txt           # First line
sed '$d' file.txt           # Last line
sed '1,5d' file.txt         # Lines 1-5

# Print specific lines
sed -n '5p' file.txt           # Only line 5
sed -n '1,10p' file.txt        # Lines 1-10
sed -n '/pattern/p' file.txt   # Lines matching pattern

# Insert/Append
sed '1i\Header Line' file.txt    # Insert before line 1
sed '1a\After Line 1' file.txt   # Append after line 1

# Multiple commands
sed -e 's/a/A/g' -e 's/b/B/g' file.txt
sed 's/a/A/g; s/b/B/g' file.txt
```
Capture Groups
```bash
# Capture and reuse
sed 's/\(.*\):\(.*\)/\2:\1/' file.txt   # Swap around colon

# Extended regex (-E)
sed -E 's/([0-9]+)-([0-9]+)/\2-\1/' file.txt   # Swap numbers

# Reusing groups in the replacement (-E)
sed -E 's/([a-z]+)@([a-z]+)/User: \1, Domain: \2/' emails.txt
```
awk: Pattern Processing
Basic Syntax
```bash
# Print entire line
awk '{print}' file.txt

# Print specific fields
awk '{print $1}' file.txt       # First field
awk '{print $1, $3}' file.txt   # First and third
awk '{print $NF}' file.txt      # Last field

# Field separator
awk -F: '{print $1}' /etc/passwd
awk -F',' '{print $2}' data.csv
```
Built-in Variables
```bash
# Variables
$0      # Entire line
$1-$n   # Fields
NF      # Number of fields
NR      # Record (line) number
FS      # Field separator
OFS     # Output field separator
RS      # Record separator

# Examples
awk '{print NR, $0}' file.txt          # Line numbers
awk -F: '{print NF, $0}' /etc/passwd   # Field count
awk 'END {print NR}' file.txt          # Total lines
```
Pattern Matching
```bash
# Pattern action
awk '/error/ {print}' file.txt
awk '/error/ {print $1}' file.txt

# Conditions
awk '$3 > 100 {print}' file.txt
awk -F: '$1 == "root" {print}' /etc/passwd
awk 'NR > 10 {print}' file.txt   # Skip first 10 lines

# BEGIN and END
awk 'BEGIN {print "Header"} {print} END {print "Footer"}' file.txt
```
Calculations
```bash
# Sum column
awk '{sum += $1} END {print sum}' file.txt

# Average
awk '{sum += $1; count++} END {print sum/count}' file.txt

# Max (for min, flip the comparison)
awk 'NR==1 || $1>max {max=$1} END {print max}' file.txt

# Formatted output
awk '{printf "%-10s %5d\n", $1, $2}' file.txt
```
Grouping and Counting
```bash
# Count by field
awk '{count[$1]++} END {for (k in count) print k, count[k]}' file.txt

# Sum by group
awk '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' file.txt

# Unique values (first occurrence wins)
awk '!seen[$1]++' file.txt
```
Stop and think: The awk command `awk '!seen[$1]++'` elegantly filters out duplicate lines based on the first column. Since it relies on the associative array `seen` to track every unique value encountered, what happens to the system’s memory if you run this against a 50GB access log with highly randomized, unique data points in that column?
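When the key set might not fit in RAM, `sort -u` is a stream-friendly alternative: sort spills sorted runs to temporary files on disk, so memory stays bounded, at the cost of losing the original input order. A minimal sketch:

```shell
# Deduplicate on the first column without holding every key in memory.
# sort's external merge keeps RAM usage bounded even for huge inputs.
printf '%s\n' 'alice 100' 'bob 200' 'alice 300' |
sort -u -k1,1        # one line per distinct first field
```

Which of the duplicate lines survives is up to the sort; if only the key matters, `awk '{print $1}' | sort -u` avoids the ambiguity.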
jq: JSON Processing
Basic Navigation
```bash
# Pretty print
echo '{"name":"John"}' | jq .

# Get field
echo '{"name":"John"}' | jq '.name'
# "John"

# Raw output (no quotes)
echo '{"name":"John"}' | jq -r '.name'
# John

# Nested
echo '{"user":{"name":"John"}}' | jq '.user.name'
```
Pause and predict: While tools like grep and awk process text sequentially line by line using constant memory, jq typically parses the entire JSON structure into an internal tree before filtering it. If you pipe a 5GB monolithic JSON file into a standard jq filter, what is the likely outcome on a container with a 512MB memory limit?
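jq does offer an escape hatch for this case: `--stream` parses the input incrementally as [path, value] events, so a large top-level array can be processed element by element. A sketch of the standard incantation (it assumes the document is a top-level array):

```shell
# Reassemble each top-level array element individually instead of
# loading the whole document: truncate_stream(1) strips the outer
# array index from each event, and fromstream rebuilds the elements.
echo '[{"id":1},{"id":2}]' |
jq -cn --stream 'fromstream(1 | truncate_stream(inputs))'
# {"id":1}
# {"id":2}
```

Each element can then be piped into a normal jq filter one record at a time.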
Arrays
```bash
# Array element
echo '[1,2,3]' | jq '.[0]'
# 1

# All elements
echo '[1,2,3]' | jq '.[]'
# 1
# 2
# 3

# Length
echo '[1,2,3]' | jq 'length'
# 3

# Array of objects
echo '[{"name":"a"},{"name":"b"}]' | jq '.[].name'
# "a"
# "b"
```
Filtering
```bash
# Select
echo '[{"name":"a","val":1},{"name":"b","val":2}]' | jq '.[] | select(.val > 1)'
# {"name":"b","val":2}

# Map
echo '[1,2,3]' | jq 'map(. * 2)'
# [2,4,6]

# Sort
echo '[3,1,2]' | jq 'sort'
# [1,2,3]

# Unique
echo '[1,1,2,2,3]' | jq 'unique'
# [1,2,3]
```
Construction
```bash
# Create object
echo '{"a":1,"b":2}' | jq '{x: .a, y: .b}'
# {"x":1,"y":2}

# Create array
echo '{"a":1,"b":2}' | jq '[.a, .b]'
# [1,2]

# Keys and values
echo '{"a":1,"b":2}' | jq 'keys'
# ["a","b"]
echo '{"a":1,"b":2}' | jq 'to_entries'
# [{"key":"a","value":1},{"key":"b","value":2}]
```
kubectl with jq
```bash
# Get pod names
kubectl get pods -o json | jq -r '.items[].metadata.name'

# Get image for each pod
kubectl get pods -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.containers[0].image)"'

# Filter by status
kubectl get pods -o json | jq '.items[] | select(.status.phase == "Running")'

# Count pods per node
kubectl get pods -o json | jq -r '.items[].spec.nodeName' | sort | uniq -c
```
Combining Tools
Pipelines
```bash
# Common patterns
grep -c "error" file.txt                                # Count error lines
ps aux | awk '{print $1}' | sort | uniq -c | sort -rn   # Process count by user
kubectl get pods | grep -v Running | awk '{print $1}'   # Non-running pods

# Complex pipeline: top 10 client IPs
grep -E "^[0-9]" access.log | \
  awk '{print $1}' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -10

# Execute a command for each input word
echo "file1 file2" | xargs rm

# One argument at a time
cat files.txt | xargs -I {} cp {} /backup/

# Parallel execution
cat urls.txt | xargs -P 4 -I {} curl -s {}

# With find
find . -name "*.tmp" | xargs rm
find . -name "*.log" -print0 | xargs -0 rm   # Handle spaces in names
```
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| `cat file \| grep pattern` | Useless use of cat | `grep pattern file` |
| Unquoted variables in awk | Word splitting | Use `"$var"` |
| `sed -i` without backup | Data loss | Use `-i.bak` |
| Not escaping regex metacharacters | Pattern doesn’t match | Escape special characters |
| jq without `-r` | Extra quotes in output | Use `-r` for raw output |
| grep on binary files | Garbled output | Use `--text`, or `-I` to skip them |
Question 1
Scenario: You are auditing system accounts on a legacy Linux server. The security team needs a plain list of all usernames (the first field in /etc/passwd) to cross-reference with their Active Directory. The file uses a colon (:) to separate fields. Which command efficiently extracts just the usernames?
Show Answer
```bash
awk -F: '{print $1}' /etc/passwd
# Or
cut -d: -f1 /etc/passwd
```
Why this works:
Both awk and cut are designed for column-based text extraction. By default, awk splits fields on whitespace and cut splits on tabs, but /etc/passwd uses colons. By passing -F: to awk or -d: to cut, you explicitly redefine the field delimiter. The $1 or -f1 then targets the first logical column, which corresponds to the username. This avoids the need for complex regular expressions and cleanly extracts exactly what the security team requested without modifying the underlying system file.
Question 2
Scenario: Your team is migrating an application to a new database cluster. You need to update the configuration file db.conf, changing every instance of db-old.local to db-new.local. You want to do this across the entire file, but you must ensure you have a fallback in case the substitution disturbs other settings. How do you accomplish this safely?
Show Answer
```bash
sed -i.bak 's/db-old\.local/db-new\.local/g' db.conf
```
Why this works:
The sed command is perfect for automated search and replace operations across text streams. The s/old/new/g syntax performs a global substitution, meaning it will replace every occurrence on every line, not just the first one it encounters. Critically, the -i.bak flag tells sed to edit the file “in-place” while simultaneously creating a backup of the original file named db.conf.bak. If the regular expression accidentally matched and altered unintended lines, you can instantly restore the system state from the backup, adhering to safe and defensive operational practices.
Question 3
Scenario: Your web server is experiencing a sudden spike in traffic, potentially a DDoS attack. You have an access log where the first column contains the IP addresses of the clients. You need to quickly identify which IPs are making the most requests by generating a sorted count of unique IP addresses from this log. How do you construct this pipeline?
Show Answer
```bash
awk '{print $1}' access.log | sort | uniq -c | sort -rn
```
Why this works:
This pipeline chains together four specialized tools to transform the raw log into a prioritized list. First, awk '{print $1}' isolates the IP addresses, discarding the rest of the log line to reduce the payload for subsequent commands. Second, sort groups identical IPs together, which is a strict requirement because the uniq command only deduplicates adjacent identical lines. Third, uniq -c collapses the adjacent duplicates while prepending a count of how many times they appeared. Finally, sort -rn sorts this new list numerically (-n) and in reverse order (-r), placing the IP addresses with the highest request counts at the very top of your terminal for immediate investigation.
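The adjacency requirement is worth proving to yourself with a throwaway input:

```shell
# uniq only collapses ADJACENT duplicates, so unsorted input leaves repeats
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 | uniq -c | wc -l          # 3 lines: nothing merged
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 | sort | uniq -c | wc -l   # 2 lines: duplicates merged
```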
Question 4
Scenario: You are writing an automation script that needs to gracefully restart specific pods in a Kubernetes cluster. To do this, you first need to query the API for all pods and extract a clean, raw list of just the pod names from the JSON output, without any JSON quotes or brackets, so the script can iterate over them. How do you use jq to parse the kubectl output?
Show Answer
```bash
kubectl get pods -o json | jq -r '.items[].metadata.name'
```
Why this works:
When Kubernetes outputs JSON, it returns a List object where the actual pod data is nested inside an array called items. The syntax .items[] tells jq to iterate over every object within that array individually. For each object, .metadata.name navigates down the JSON tree to extract the specific string value containing the pod’s name. The crucial part for scripting is the -r (raw) flag; without it, jq would output valid JSON strings enclosed in double quotes. The -r flag strips these quotes, providing clean text that a bash for loop or xargs command can consume directly without syntax errors.
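The difference the `-r` flag makes is easy to see on a tiny document (the pod name here is made up):

```shell
# Without -r, jq prints a JSON string (quotes included);
# with -r, it prints raw text that shell loops can consume directly.
echo '{"name":"web-1"}' | jq '.name'      # "web-1"
echo '{"name":"web-1"}' | jq -r '.name'   # web-1
```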
Question 5
Scenario: You are troubleshooting a failing application and looking at a massive, noisy application log. You need to find all lines indicating a failure by searching for the word “Exception”. However, the log is flooded with “TimeoutException” warnings that you already know about and want to ignore. How do you filter the log to show exceptions while filtering out the timeouts?
Show Answer
```bash
grep "Exception" app.log | grep -v "TimeoutException"
```
Why this works:
Text processing in Linux is heavily reliant on the philosophy of chaining small, single-purpose utilities together. The first grep acts as an inclusive filter, reducing the massive log file down to only the lines that contain the specific word “Exception”. This smaller, filtered stream of text is then piped directly into the second grep command. The -v flag inverts the matching behavior of the second grep, causing it to act as an exclusive filter that drops any line containing “TimeoutException”. This two-stage pipeline is often much faster and easier to read than attempting to construct a single, complex regular expression with negative lookarounds.
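For comparison, the negative lookaround alluded to above is possible with GNU grep's PCRE mode, though the two-stage pipeline usually stays more readable (the sample lines are invented):

```shell
# Single-command alternative: match "Exception" only when it is NOT
# immediately preceded by "Timeout" (PCRE negative lookbehind, GNU grep -P)
printf '%s\n' 'NullPointerException in foo' 'TimeoutException in bar' |
grep -P '(?<!Timeout)Exception'
# NullPointerException in foo
```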
Hands-On Exercise
Text Processing Practice
Objective: Use grep, sed, awk, and jq to process text and JSON data.
Environment: Any Linux system
Part 1: grep Practice
```bash
# Create sample data
cat > /tmp/logs.txt << 'EOF'
2024-01-15 10:00:00 INFO Starting application
2024-01-15 10:00:01 DEBUG Loading config
2024-01-15 10:00:02 INFO Connected to database
2024-01-15 10:00:03 WARNING Slow query detected
2024-01-15 10:00:04 ERROR Connection timeout
2024-01-15 10:00:05 INFO Retry successful
2024-01-15 10:00:06 DEBUG Cache hit
2024-01-15 10:00:07 ERROR Failed to authenticate
2024-01-15 10:00:08 INFO Shutdown complete
EOF

# 1. Find all errors
grep "ERROR" /tmp/logs.txt

# 2. Find errors and warnings
grep -E "ERROR|WARNING" /tmp/logs.txt

# 3. Count each log level
grep -oE "(INFO|DEBUG|WARNING|ERROR)" /tmp/logs.txt | sort | uniq -c

# 4. Show context around errors
grep -C 1 "ERROR" /tmp/logs.txt

# 5. Extract just the message
grep "ERROR" /tmp/logs.txt | grep -oE "[A-Z]+ .*$"
```
Part 2: sed Practice
```bash
# Create config file
cat > /tmp/config.txt << 'EOF'
# Database config
host=localhost
port=5432
database=myapp
user=admin
password=secret123
EOF

# 1. Remove comments
sed '/^#/d' /tmp/config.txt

# 2. Change port
sed 's/port=5432/port=5433/' /tmp/config.txt

# 3. Extract just values
sed -n 's/.*=//p' /tmp/config.txt

# 4. Convert to export statements
sed 's/^/export /' /tmp/config.txt | sed '/^export #/d'

# 5. Multiple substitutions
sed -e 's/localhost/db.example.com/' -e 's/5432/5433/' /tmp/config.txt
```
Part 3: awk Practice
```bash
# Create data file
cat > /tmp/data.txt << 'EOF'
Alice 100 Engineering
Bob 150 Sales
Carol 120 Engineering
David 90 Marketing
Eve 200 Sales
Frank 110 Engineering
EOF

# 1. Print names and salaries
awk '{print $1, $2}' /tmp/data.txt

# 2. Total salary
awk '{sum += $2} END {print "Total:", sum}' /tmp/data.txt

# 3. Average salary
awk '{sum += $2; n++} END {print "Average:", sum/n}' /tmp/data.txt

# 4. Salary by department
awk '{dept[$3] += $2; count[$3]++} END {for (d in dept) print d, dept[d], count[d]}' /tmp/data.txt

# 5. Filter high earners
awk '$2 > 100 {print $1, $2}' /tmp/data.txt
```
Part 4: jq Practice
```bash
# Create JSON file
cat > /tmp/data.json << 'EOF'
{
  "users": [
    {"name": "Alice", "age": 30, "role": "admin"},
    {"name": "Bob", "age": 25, "role": "user"},
    {"name": "Carol", "age": 35, "role": "admin"}
  ],
  "version": "1.0"
}
EOF

# 1. Pretty print
cat /tmp/data.json | jq .

# 2. Get version
cat /tmp/data.json | jq -r '.version'

# 3. List all names
cat /tmp/data.json | jq -r '.users[].name'

# 4. Filter admins
cat /tmp/data.json | jq '.users[] | select(.role == "admin")'

# 5. Create new structure
cat /tmp/data.json | jq '.users | map({username: .name, isAdmin: (.role == "admin")})'
```
Part 5: Combining Tools
```bash
# Create access log
cat > /tmp/access.log << 'EOF'
192.168.1.1 GET /api/users 200 150ms
192.168.1.2 POST /api/login 200 50ms
192.168.1.1 GET /api/data 500 300ms
192.168.1.3 GET /api/users 200 100ms
192.168.1.2 GET /api/data 200 80ms
192.168.1.1 GET /api/health 200 10ms
192.168.1.4 POST /api/login 401 20ms
192.168.1.1 GET /api/data 500 250ms
EOF

# 1. Count requests per IP
awk '{print $1}' /tmp/access.log | sort | uniq -c | sort -rn

# 2. Find all errors (5xx)
awk '$4 ~ /^5/ {print}' /tmp/access.log

# 3. Average response time
awk '{gsub(/ms/, "", $5); sum += $5; n++} END {print sum/n, "ms"}' /tmp/access.log

# 4. Requests per endpoint
awk '{count[$3]++} END {for (e in count) print e, count[e]}' /tmp/access.log | sort -k2 -rn

# 5. Slow requests (>100ms)
awk '{gsub(/ms/, "", $5); if ($5 > 100) print}' /tmp/access.log
```
Success Criteria
- Used grep with patterns and context
- Used sed for substitution and deletion
- Used awk for field extraction and aggregation
- Used jq for JSON parsing and filtering
- Combined tools in pipelines
Key Takeaways
- grep for finding — Pattern matching in text
- sed for transforming — Search and replace, line operations
- awk for processing — Column extraction, calculations, grouping
- jq for JSON — Like sed/awk, but for structured data
- Pipelines combine power — Chain tools for complex processing
What’s Next?
In Module 7.3: Practical Scripts, you’ll learn how to write production-quality scripts with proper error handling, logging, and common patterns.