Complete Guide to the Bash awk Command

Introduction to awk

awk is a powerful text-processing language and command-line tool designed for pattern scanning and processing. Named after its creators (Aho, Weinberger, and Kernighan), it excels at manipulating structured data, generating reports, and performing complex text transformations. Understanding awk is essential for anyone working with text files, log analysis, data extraction, and system administration.

Key Concepts

  • Pattern-Action Paradigm: awk scripts consist of patterns and associated actions
  • Record-Oriented: Processes text line by line (records) by default
  • Field-Oriented: Splits each record into fields
  • Built-in Variables: Provides numerous variables for text processing
  • Programmable: Full programming language with variables, loops, and functions

1. Basic Syntax and Structure

Command Structure

#!/bin/bash
# Basic syntax
awk 'pattern { action }' file.txt
# Common patterns
awk '{ print }' file.txt                    # Print all lines
awk '/pattern/ { print }' file.txt          # Print lines matching pattern
awk '$1 == "value" { print }' file.txt      # Print lines where field 1 equals "value"
# Multiple patterns and actions
awk '
/error/ { print "ERROR:", $0 }
/warning/ { print "WARNING:", $0 }
/info/ { print "INFO:", $0 }
' logfile.txt
# BEGIN and END blocks
awk '
BEGIN { print "Start Processing" }
{ print $0 }
END { print "End Processing" }
' file.txt

Basic Examples

#!/bin/bash
# Create sample data file
cat > data.txt << 'EOF'
John 25 75000
Alice 30 85000
Bob 28 65000
Carol 35 95000
EOF
# Print entire file
awk '{ print }' data.txt
# Print specific fields
awk '{ print $1, $2 }' data.txt           # Name and age
awk '{ print $1, $3 }' data.txt           # Name and salary
# Print with formatting
awk '{ printf "Name: %-10s Age: %2d Salary: $%d\n", $1, $2, $3 }' data.txt
# Field separator customization
cat > data.csv << 'EOF'
John,25,75000
Alice,30,85000
Bob,28,65000
EOF
awk -F, '{ print $1, $2 }' data.csv       # CSV processing
# Multiple field separators
awk -F'[,:]' '{ print $1, $2 }' file.txt
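To tie the pieces above together, here is a complete, runnable sketch (the names and figures are made-up sample data) that selects and totals a CSV column with `-F`:

```shell
# Sum the third (salary) column of comma-separated input
printf 'John,25,75000\nAlice,30,85000\nBob,28,65000\n' |
awk -F, '{ total += $3 } END { print "Total:", total }'
```

This prints `Total: 225000`.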

2. Built-in Variables

Common Built-in Variables

#!/bin/bash
# NR - Number of current record (line number)
awk '{ print NR, $0 }' data.txt            # Number each line
# NF - Number of fields in current record
awk '{ print "Fields:", NF, "Content:", $0 }' data.txt
# $0 - Entire current record
awk '{ print "Line:", $0 }' data.txt
# $1, $2, $3... - Individual fields
awk '{ print "First field:", $1, "Last field:", $NF }' data.txt
# FS - Field separator (input)
awk 'BEGIN { FS="," } { print $1 }' data.csv
# OFS - Output field separator
awk 'BEGIN { OFS=" | " } { print $1, $2, $3 }' data.txt
# RS - Record separator (input)
awk 'BEGIN { RS=""; FS="\n" } { print $1 }' paragraphs.txt
# ORS - Output record separator
awk 'BEGIN { ORS="\n---\n" } { print }' data.txt
# FILENAME - Current input file name
awk '{ print FILENAME, NR, $0 }' data.txt
# FNR - Record number in current file
awk '{ print FILENAME, FNR, $0 }' file1.txt file2.txt
# Examples
cat > employees.txt << 'EOF'
John Doe|Sales|75000
Jane Smith|Engineering|85000
Bob Wilson|Marketing|65000
EOF
awk -F'|' '
BEGIN { 
print "Employee Report"
print "==============="
}
{
printf "Name: %-15s Dept: %-12s Salary: $%d\n", $1, $2, $3
total += $3
}
END {
print "==============="
printf "Total Employees: %d\n", NR
printf "Average Salary: $%.2f\n", total/NR
}
' employees.txt
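A quick way to see NR and NF in action: each line of this throwaway input reports its own record number and field count.

```shell
# NR counts records, NF counts fields in the current record
printf 'a b c\nd e\nf\n' | awk '{ print "record", NR, "has", NF, "fields" }'
```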

Field and Record Separators

#!/bin/bash
# Multiple field separators
awk -F'[ \t,:|]' '{ print $1, $2, $3 }' complex.txt
# Setting FS in the main block only takes effect from the NEXT record;
# the first line is still split with the old separator, so prefer -F or BEGIN
awk 'BEGIN { FS="," } { print $1 }' file.csv
# Changing OFS
echo "a:b:c:d" | awk 'BEGIN { FS=":"; OFS="-" } { $1=$1; print }'
# Multi-character RS
awk 'BEGIN { RS="\n\n"; FS="\n" } { print "Record:", $1 }' file.txt
# RT - Record terminator (gawk)
gawk 'BEGIN { RS="[0-9]+" } { print "Record:", $0, "Terminator:", RT }' file.txt
# FIELDWIDTHS - Fixed-width fields (gawk)
gawk 'BEGIN { FIELDWIDTHS="5 3 8" } { print $1, $2, $3 }' fixed_width.txt
# Example: Parse fixed-width file
cat > fixed.txt << 'EOF'
John 2550000
Alice3085000
Bob  2865000
EOF
gawk 'BEGIN { FIELDWIDTHS="5 2 5" } { print "Name:", $1, "Age:", $2, "Salary:", $3 }' fixed.txt
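The paragraph-mode trick (`RS=""` with `FS="\n"`) shown above works in plain awk too; here it is end to end on inline sample text:

```shell
# Blank-line-separated records: $1 is the first line of each paragraph
printf 'alpha\nbeta\n\ngamma\ndelta\n' |
awk 'BEGIN { RS=""; FS="\n" } { print NR ": " $1 }'
```

Only the first line of each paragraph is printed: `1: alpha` and `2: gamma`.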

3. Patterns and Actions

Pattern Types

#!/bin/bash
# Regular expression patterns
awk '/^J/' data.txt                 # Lines starting with J
awk '/[0-9]+/' data.txt              # Lines containing numbers
awk '/^[A-Z]/' data.txt              # Lines starting with uppercase
# Relational patterns
awk '$2 > 30' data.txt               # Age > 30
awk '$3 > 70000 && $3 < 90000' data.txt
awk '$1 == "John"' data.txt
# Range patterns
awk '/start/,/end/' file.txt         # Lines between start and end
# BEGIN and END patterns
awk 'BEGIN { print "Header" } { print } END { print "Footer" }' data.txt
# Empty pattern (matches all lines)
awk '{ print }' data.txt
# Pattern negation
awk '!/^#/' config.txt               # Skip comments
awk '!($2 > 30)' data.txt             # Not age > 30
# Combining patterns
awk '$2 > 30 && $1 ~ /^J/' data.txt
awk '$2 > 30 || $3 > 80000' data.txt
# Pattern ranges with line numbers
awk 'NR >= 5 && NR <= 10' data.txt    # Lines 5-10
# Examples
cat > logfile.txt << 'EOF'
2024-01-01 10:15:30 INFO  Application started
2024-01-01 10:15:35 ERROR Database connection failed
2024-01-01 10:15:40 WARN  Retry attempt 1
2024-01-01 10:15:45 ERROR Connection timeout
2024-01-01 10:15:50 INFO  Reconnected successfully
EOF
# Extract errors
awk '/ERROR/ { print NR, $0 }' logfile.txt
# Extract time range
awk '$2 >= "10:15:30" && $2 <= "10:15:45"' logfile.txt
# Summary report
awk '
$3 == "ERROR" { errors++ }
$3 == "WARN" { warnings++ }
$3 == "INFO" { infos++ }
END {
print "Log Summary"
print "INFO:", infos
print "WARN:", warnings
print "ERROR:", errors
}
' logfile.txt
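Range patterns are easiest to trust after seeing one run; note that both delimiter lines are included in the output:

```shell
# /START/,/END/ prints from the first START line through the next END line
printf 'one\nSTART\ntwo\nEND\nthree\n' | awk '/START/,/END/'
```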

Action Blocks

#!/bin/bash
# Single action
awk '{ print $1 }' data.txt
# Multiple actions
awk '{
print "Record:", NR
print "Name:", $1
print "Age:", $2
print "---"
}' data.txt
# Conditional actions
awk '{
if ($2 > 30) {
print $1, "is senior"
} else {
print $1, "is junior"
}
}' data.txt
# Loop actions
awk '{
for (i = 1; i <= NF; i++) {
print "Field", i, ":", $i
}
}' data.txt
# Compound actions
awk '{
total = 0
for (i = 2; i <= NF; i++) {
total += $i
}
avg = total / (NF - 1)
print $1, "Average:", avg
}' data.txt

4. Built-in Functions

String Functions

#!/bin/bash
# length() - String length
awk '{ print $1, length($1) }' data.txt
# index() - Position of substring
awk '{ print index($0, "John") }' data.txt
# substr() - Extract substring
awk '{ print substr($1, 1, 2) }' data.txt    # First 2 chars
# split() - Split string into array
awk '{
split($0, arr, ":")
print arr[1], arr[2]
}' data.txt
# gsub() - Global substitute
awk '{ gsub(/John/, "Jonathan"); print }' data.txt
# sub() - First substitute only
awk '{ sub(/[0-9]+/, "XXX"); print }' data.txt
# match() - Match regular expression
awk '{
if (match($0, /[0-9]+/)) {
print "Found number at position", RSTART, "length", RLENGTH
}
}' data.txt
# toupper() / tolower() - Case conversion
awk '{ print toupper($1), tolower($2) }' data.txt
# Examples
cat > strings.txt << 'EOF'
John:30:75000
Alice:25:85000
Bob:35:65000
EOF
# Extract username from email (hypothetical address, scrubbed in the original)
echo "john.doe@example.com" | awk '{
split($0, parts, "@")
split(parts[1], user, ".")
print "First:", user[1], "Last:", user[2]
}'
# String manipulation
awk '{
name = $1
gsub(/[aeiou]/, "", name)  # Remove vowels
printf "%-10s -> %s\n", $1, name
}' data.txt
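A small deterministic run of the string functions above, using a made-up name as input:

```shell
# Capitalize the first name with toupper()/substr(); mask the o in the surname with gsub()
echo "john doe" | awk '{
first = toupper(substr($1, 1, 1)) substr($1, 2)
gsub(/o/, "0", $2)
print first, $2
}'
```

This prints `John d0e`.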

Numeric Functions

#!/bin/bash
# int() - Integer part
awk '{ print int($3) }' data.txt
# sqrt() - Square root
awk '{ print sqrt($2) }' data.txt
# log() - Natural logarithm
awk '{ print log($3) }' data.txt
# exp() - Exponential
awk '{ print exp($2) }' data.txt
# sin() / cos() - Trigonometric
awk '{ print sin($2 * 3.14159/180) }' data.txt
# rand() - Random number
awk 'BEGIN { print rand(), rand() }'
# srand() - Seed random number
awk 'BEGIN { srand(); print rand() }'
# Examples
echo "10 20 30 40 50" | awk '{
sum = 0
for (i = 1; i <= NF; i++) {
sum += $i
}
avg = sum / NF
print "Average:", avg
variance = 0
for (i = 1; i <= NF; i++) {
variance += ($i - avg) ^ 2
}
variance /= NF
print "Variance:", variance
print "Std Dev:", sqrt(variance)
}'
# Generate random data
awk 'BEGIN {
srand()
for (i = 1; i <= 10; i++) {
printf "%d %f\n", int(rand() * 100), rand()
}
}'
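Unlike rand(), int() and sqrt() are deterministic, so their behavior is easy to spot-check:

```shell
# int() truncates toward zero; sqrt() of a perfect square prints exactly
echo "7.9 144" | awk '{ print int($1), sqrt($2) }'
```

This prints `7 12`.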

Time Functions (gawk)

#!/bin/bash
# strftime() - Format timestamp
gawk 'BEGIN { print strftime("%Y-%m-%d %H:%M:%S") }'
# systime() - Current timestamp
gawk 'BEGIN { print systime() }'
# mktime() - Create timestamp
gawk 'BEGIN { print mktime("2024 01 15 10 30 00") }'
# Examples
gawk '
BEGIN {
now = systime()
print "Current time:", strftime("%Y-%m-%d %H:%M:%S", now)
# Calculate time difference
tomorrow = now + 86400
print "Tomorrow:", strftime("%Y-%m-%d", tomorrow)
# Parse date
date = "2024-12-31"
split(date, d, "-")
timestamp = mktime(d[1] " " d[2] " " d[3] " 00 00 00")
print "Timestamp:", timestamp
}'
# Process timestamped logs
cat > timelog.txt << 'EOF'
1705310700 INFO Started
1705310760 ERROR Failed
1705310820 WARN Retry
EOF
gawk '{
print strftime("%Y-%m-%d %H:%M:%S", $1), $2, $3
}' timelog.txt

5. Arrays in awk

Associative Arrays

#!/bin/bash
# Basic array usage
awk '
BEGIN {
fruits["apple"] = 10
fruits["banana"] = 5
fruits["orange"] = 8
print "Apple count:", fruits["apple"]
# Iterate over array
for (fruit in fruits) {
print fruit, fruits[fruit]
}
}'
# Count occurrences
cat > words.txt << 'EOF'
apple
banana
apple
orange
banana
apple
EOF
awk '
{
count[$1]++
}
END {
print "Word counts:"
for (word in count) {
print word, count[word]
}
}' words.txt
# Delete array elements
awk '
{
data[$1] = $2
}
END {
delete data["John"]
for (name in data) {
print name, data[name]
}
}' data.txt
# Check if key exists
awk '
BEGIN {
data["John"] = 30
if ("John" in data) {
print "John exists"
}
if ("Alice" in data) {
print "Alice exists"
} else {
print "Alice not found"
}
}'
# Multi-dimensional arrays (gawk)
gawk '
BEGIN {
matrix[1][1] = 10
matrix[1][2] = 20
matrix[2][1] = 30
matrix[2][2] = 40
for (i = 1; i <= 2; i++) {
for (j = 1; j <= 2; j++) {
printf "matrix[%d][%d] = %d\n", i, j, matrix[i][j]
}
}
}'
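A compact check of counting plus the `in` membership test, on inline input:

```shell
# Count occurrences, then test membership before indexing
printf 'apple\nbanana\napple\n' | awk '
{ count[$0]++ }
END {
msg = ("apple" in count) ? "apple x" count["apple"] : "no apples"
print msg
}'
```

This prints `apple x2`.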

Array Sorting

#!/bin/bash
# asort() - Sort values (gawk)
gawk '
BEGIN {
data[1] = "banana"
data[2] = "apple"
data[3] = "cherry"
data[4] = "date"
n = asort(data, sorted)
for (i = 1; i <= n; i++) {
print i, sorted[i]
}
}'
# asorti() - Sort indices (gawk)
gawk '
BEGIN {
fruits["banana"] = 3
fruits["apple"] = 5
fruits["cherry"] = 2
n = asorti(fruits, indices)
for (i = 1; i <= n; i++) {
print indices[i], fruits[indices[i]]
}
}'
# Manual sort (using array traversal)
awk '
{
data[NR] = $1
}
END {
# Simple bubble sort
for (i = 1; i <= NR; i++) {
for (j = i + 1; j <= NR; j++) {
if (data[i] > data[j]) {
temp = data[i]
data[i] = data[j]
data[j] = temp
}
}
}
for (i = 1; i <= NR; i++) {
print data[i]
}
}' data.txt

6. Control Flow

Conditional Statements

#!/bin/bash
# if-else statement
awk '{
if ($2 > 30) {
print $1, "is senior"
} else if ($2 > 25) {
print $1, "is mid-level"
} else {
print $1, "is junior"
}
}' data.txt
# Ternary operator
awk '{ print $1, ($2 > 30 ? "senior" : "junior") }' data.txt
# switch statement (gawk)
gawk '{
switch ($2) {
case 25:
print $1, "entry level"
break
case 30:
print $1, "experienced"
break
case 35:
print $1, "senior"
break
default:
print $1, "other level"
}
}' data.txt

Loops

#!/bin/bash
# for loop
awk '{
sum = 0
for (i = 1; i <= NF; i++) {
sum += $i
}
print "Sum:", sum
}' data.txt
# while loop
awk '{
i = 1
while (i <= NF) {
printf "%d ", $i
i++
}
print ""
}' data.txt
# do-while loop
awk '{
i = 1
do {
printf "%d ", $i
i++
} while (i <= NF)
print ""
}' data.txt
# break and continue
awk '{
for (i = 1; i <= NF; i++) {
if ($i == 0) continue
if ($i > 100) break
print $i
}
}' data.txt
# next - skip to next record
awk '{
if ($1 ~ /^#/) next
print $0
}' config.txt
# nextfile - skip to next file (gawk)
gawk '{
if (FNR == 1 && /^ERROR/) nextfile
print FILENAME, $0
}' file1.txt file2.txt
# exit - exit awk program
awk '{
if ($0 ~ /FATAL/) {
print "Fatal error found"
exit 1
}
print $0
}' logfile.txt
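The `next` statement from above, verified on a tiny comment-laden input:

```shell
# next abandons the current record, so comment lines never reach print
printf '# header\ndata1\n# note\ndata2\n' | awk '/^#/ { next } { print }'
```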

7. Advanced Examples

Data Processing

#!/bin/bash
# Create sample sales data
cat > sales.csv << 'EOF'
date,product,region,sales
2024-01-01,Laptop,North,1200
2024-01-01,Phone,North,800
2024-01-01,Laptop,South,1500
2024-01-02,Phone,East,900
2024-01-02,Laptop,West,1100
2024-01-02,Tablet,North,500
EOF
# Sales report by product
awk -F, '
NR > 1 {
product[$2] += $4
region[$3] += $4
total += $4
}
END {
print "Sales by Product"
for (p in product) {
printf "  %-10s $%d\n", p, product[p]
}
print "\nSales by Region"
for (r in region) {
printf "  %-10s $%d\n", r, region[r]
}
print "\nTotal Sales: $" total
}' sales.csv
# Moving average
awk '{
sum += $2
if (NR > 5) sum -= a[NR-5]
a[NR] = $2
if (NR >= 5) {
print $1, sum/5
}
}' data.txt
# Group by date
awk -F, '
NR > 1 {
sales[$1] += $4
}
END {
for (date in sales) {
printf "%s: $%d\n", date, sales[date]
}
}' sales.csv
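The group-and-sum pattern above reduces to a few lines; piping through sort makes the output order deterministic, since awk's `for (k in ...)` traversal order is unspecified:

```shell
# Sum column 2 grouped by the key in column 1
printf 'north 10\nsouth 5\nnorth 7\n' |
awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' |
sort
```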

Log Analysis

#!/bin/bash
# Apache log analyzer
cat > access.log << 'EOF'
192.168.1.1 - - [15/Jan/2024:10:15:30 +0000] "GET /index.html HTTP/1.1" 200 1024
192.168.1.2 - - [15/Jan/2024:10:15:35 +0000] "POST /api/data HTTP/1.1" 500 512
192.168.1.1 - - [15/Jan/2024:10:15:40 +0000] "GET /about.html HTTP/1.1" 200 2048
192.168.1.3 - - [15/Jan/2024:10:15:45 +0000] "GET /index.html HTTP/1.1" 404 256
EOF
# Parse Apache logs
awk '
{
# Extract fields
ip = $1
date = $4
gsub(/\[/, "", date)
method = substr($6, 2)
url = $7
status = $9
size = $10
# Count by IP
ip_count[ip]++
# Count by status
status_count[status]++
# Total bytes
if (size ~ /^[0-9]+$/) {
bytes += size
}
# URL access count
url_count[url]++
}
END {
print "Access Log Summary"
print "=================="
print "Unique IPs:", length(ip_count)
print "Total requests:", NR
print "Total bytes:", bytes
print "\nTop IPs:"
for (ip in ip_count) {
printf "  %-15s %d\n", ip, ip_count[ip]
}
print "\nStatus Codes:"
for (status in status_count) {
printf "  %s: %d\n", status, status_count[status]
}
}' access.log
# Extract hourly statistics
awk '
{
match($4, /[0-9]{2}:[0-9]{2}:[0-9]{2}/)
time = substr($4, RSTART, 2)  # hour
hour_count[time]++
}
END {
for (h = 0; h < 24; h++) {
hour = sprintf("%02d", h)  # hours were stored as two-digit strings like "05"
printf "%s:00 - %s:59: %d requests\n", hour, hour, hour_count[hour]
}
}' access.log

CSV Processing

#!/bin/bash
# Parse CSV with quoted fields
cat > complex.csv << 'EOF'
id,name,email,notes
1,"John Doe","[email protected]","Employee of the month"
2,"Jane Smith","[email protected]","On leave until March"
3,"Bob Wilson","[email protected]","Needs new laptop"
EOF
# Handle quoted fields (simple version)
awk -F, '
{
gsub(/"/, "")
printf "ID: %s\n", $1
printf "Name: %s\n", $2
printf "Email: %s\n", $3
printf "Notes: %s\n\n", $4
}' complex.csv
# Proper CSV parsing with FPAT (gawk)
gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" } {
for (i = 1; i <= NF; i++) {
gsub(/"/, "", $i)
printf "Field %d: %s\n", i, $i
}
print "---"
}' complex.csv
# CSV to JSON converter
awk -F, '
NR == 1 {
for (i = 1; i <= NF; i++) {
headers[i] = $i
}
next
}
{
printf "{"
for (i = 1; i <= NF; i++) {
printf "\"%s\":\"%s\"", headers[i], $i
if (i < NF) printf ","
}
printf "}\n"
}' sales.csv

System Administration

#!/bin/bash
# Process monitoring
ps aux | awk '
NR > 1 {
cpu[$11] += $3
mem[$11] += $4
count[$11]++
}
END {
print "Process Summary by Command"
print "=========================="
for (cmd in cpu) {
printf "%-20s %3d procs CPU: %5.1f%% MEM: %5.1f%%\n", 
substr(cmd, 1, 20), count[cmd], cpu[cmd], mem[cmd]
}
}'
# Disk usage analysis
df -h | awk '
NR > 1 {
usage = $5
gsub(/%/, "", usage)
if (usage > 80) {
printf "WARNING: %s is %d%% full\n", $6, usage
}
}'
# Network connection summary
netstat -an | awk '
$6 == "ESTABLISHED" { est++ }
$6 == "LISTEN" { listen++ }
$6 == "TIME_WAIT" { timewait++ }
END {
print "Connection States:"
print "  ESTABLISHED:", est
print "  LISTEN:", listen
print "  TIME_WAIT:", timewait
}'
# User login report
last | awk '
$1 != "reboot" && $1 != "wtmp" {
users[$1]++
}
END {
print "User Login Counts:"
for (user in users) {
printf "  %-10s %d\n", user, users[user]
}
}'

8. One-liners and Common Patterns

Useful One-liners

#!/bin/bash
# Print specific columns
awk '{ print $1, $3 }' file.txt
# Sum a column
awk '{ sum += $1 } END { print sum }' file.txt
# Average of column
awk '{ sum += $1 } END { print sum/NR }' file.txt
# Count lines
awk 'END { print NR }' file.txt
# Print lines matching pattern
awk '/pattern/' file.txt
# Print lines not matching pattern
awk '!/pattern/' file.txt
# Print unique lines
awk '!seen[$0]++' file.txt
# Print duplicates
awk 'seen[$0]++' file.txt
# Remove empty lines
awk 'NF' file.txt
# Add line numbers
awk '{ print NR, $0 }' file.txt
# Print last field
awk '{ print $NF }' file.txt
# Print fields in reverse order
awk '{ for (i=NF; i>0; i--) printf "%s ", $i; print "" }' file.txt
# Convert to uppercase
awk '{ print toupper($0) }' file.txt
# Extract email addresses
awk '{ for (i=1; i<=NF; i++) if ($i ~ /@/) print $i }' file.txt
# Print lines between patterns
awk '/start/,/end/' file.txt
# Print column with header
awk 'NR==1; NR>1 {print $1}' file.txt
# Calculate percentage of total (values must be saved for the END pass)
awk '{ name[NR] = $1; val[NR] = $2; total += $2 } END { for (i = 1; i <= NR; i++) print name[i], val[i] / total * 100 }' file.txt
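The `!seen[$0]++` dedupe idiom is dense enough to merit a demonstration; it keeps the first occurrence of each line, preserving order:

```shell
# Duplicate lines are dropped after their first appearance
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
```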

Common Patterns

#!/bin/bash
# Field validation
awk '
{
if (NF != 3) {
print "Invalid line:", $0 > "/dev/stderr"
next
}
if ($2 !~ /^[0-9]+$/) {
print "Invalid age:", $0 > "/dev/stderr"
next
}
print $0
}' data.txt
# Data transformation
awk '{
gsub(/[[:space:]]+/, " ")
gsub(/^[[:space:]]+|[[:space:]]+$/, "")
print
}' messy.txt
# Generate report
awk '
BEGIN {
printf "%-20s %10s %10s %10s\n", "Name", "Age", "Salary", "Bonus"
printf "%-20s %10s %10s %10s\n", "----", "---", "------", "-----"
}
{
bonus = $3 * 0.1
printf "%-20s %10d %10d %10.2f\n", $1, $2, $3, bonus
total_salary += $3
total_bonus += bonus
}
END {
printf "%-20s %10s %10d %10.2f\n", "TOTAL", "", total_salary, total_bonus
}' data.txt
# Pivot table (arrays of arrays are a gawk extension)
gawk '
{
data[$1][$2] = $3
}
END {
# Print header
printf "%-10s", "Name"
for (col in data) {
for (subcol in data[col]) {
cols[subcol] = 1
}
}
for (col in cols) {
printf " %10s", col
}
print ""
# Print data
for (row in data) {
printf "%-10s", row
for (col in cols) {
printf " %10d", data[row][col]
}
print ""
}
}' sales_pivot.txt

9. Integration with Other Commands

Piping with awk

#!/bin/bash
# Get data from other commands
ps aux | awk '$3 > 50 { print $2, $11 }'
df -h | awk 'NR>1 {print $1, $5}' | sort -k2 -rn
netstat -an | awk '/ESTABLISHED/ {print $5}' | cut -d: -f1 | sort -u
# Process with grep and awk
grep "ERROR" logfile.txt | awk '{print $1, $2, $NF}'
# Use xargs with awk
find . -name "*.log" | xargs awk '/ERROR/ {print FILENAME, $0}'
# Awk in shell pipelines
cat data.txt | awk '{print $1}' | sort | uniq -c | sort -rn
# Process substitution
awk '{print $1}' <(ps aux) <(who)
# Multiple files
awk '{print FILENAME, $0}' file1.txt file2.txt file3.txt
# Using getline
awk '
BEGIN {
while ("ps aux" | getline) {
if ($3 > 50) {
print $2, $11
}
}
close("ps aux")
}'
# Write to multiple files
awk '
{
print $0 > ($1 ".txt")   # parenthesize the filename expression
}' data.txt
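The getline-from-a-command pattern above can be checked with a harmless command (printf stands in here for something like ps aux):

```shell
# Read a command's output line by line inside BEGIN
awk 'BEGIN {
cmd = "printf \"x\\ny\\n\""
while ((cmd | getline line) > 0) n++
close(cmd)
print "lines read:", n
}'
```

This prints `lines read: 2`.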

System Calls

#!/bin/bash
# system() function (appends, so earlier matches are not overwritten)
awk '{
if ($2 > 100) {
system("echo " $1 " >> high_values.txt")
}
}' data.txt
# Get command output
awk '
BEGIN {
cmd = "date +%Y%m%d"
cmd | getline today
close(cmd)
print "Today:", today
}'
# Execute shell command for each line
awk '{
cmd = "wc -l " $1
cmd | getline count
close(cmd)
print $1, count
}' filelist.txt
# Interactive prompts
awk '
BEGIN {
printf "Enter filename: "
getline file < "/dev/stdin"
close("/dev/stdin")
while ((getline < file) > 0) {
print
}
close(file)
}'
# Send email
awk '
BEGIN {
report = "/tmp/report.txt"
print "Report" > report
# Append data
while ("ps aux" | getline) {
print >> report
}
close("ps aux")
# Send email
system("mail -s 'System Report' admin@example.com < " report)
system("rm " report)
}'

10. Advanced Features (gawk)

GAWK Extensions

#!/bin/bash
# Two-dimensional arrays
gawk '
BEGIN {
matrix[1][1] = 10
matrix[1][2] = 20
matrix[2][1] = 30
matrix[2][2] = 40
for (i in matrix) {
for (j in matrix[i]) {
printf "matrix[%d][%d] = %d\n", i, j, matrix[i][j]
}
}
}'
# Arrays of arrays
gawk '
BEGIN {
data["group1"][1] = "apple"
data["group1"][2] = "banana"
data["group2"][1] = "carrot"
data["group2"][2] = "date"
for (group in data) {
print "Group:", group
for (i in data[group]) {
print "  ", data[group][i]
}
}
}'
# Bit manipulation
gawk '
BEGIN {
a = 13   # 1101 binary
b = 11   # 1011 binary
print "a =", a
print "b =", b
print "AND:", and(a, b)
print "OR:", or(a, b)
print "XOR:", xor(a, b)
print "NOT a:", compl(a)
print "a << 2:", lshift(a, 2)
print "a >> 2:", rshift(a, 2)
}'
# Time functions
gawk '
BEGIN {
now = systime()
print "Current:", strftime("%Y-%m-%d %H:%M:%S", now)
# Parse date
date = "2024-01-15"
split(date, d, "-")
timestamp = mktime(d[1] " " d[2] " " d[3] " 12 00 00")
print "Timestamp:", timestamp
# Date arithmetic
tomorrow = now + 86400
print "Tomorrow:", strftime("%Y-%m-%d", tomorrow)
}'
# Match with RSTART/RLENGTH
gawk '{
if (match($0, /[0-9]+/)) {
print "Match at position", RSTART
print "Length:", RLENGTH
print "Matched:", substr($0, RSTART, RLENGTH)
}
}'
# IGNORECASE flag
gawk 'BEGIN { IGNORECASE = 1 } /error/' logfile.txt
# FIELDWIDTHS for fixed-width files
gawk 'BEGIN { FIELDWIDTHS = "5 3 8 4" } {
print "Field1:", $1
print "Field2:", $2
print "Field3:", $3
print "Field4:", $4
}' fixed_width.txt
# FPAT for CSV parsing
gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" } {
for (i = 1; i <= NF; i++) {
gsub(/"/, "", $i)
printf "Field %d: %s\n", i, $i
}
}' complex.csv

Profiling and Debugging

#!/bin/bash
# Debug mode
gawk --debug '
BEGIN {
x = 10
y = 20
z = x + y
print z
}'
# Profile awk program
gawk --profile '
{
count[$1]++
}
END {
for (i in count) {
print i, count[i]
}
}' data.txt
# After running, check awkprof.out
# Lint checking
gawk --lint '
BEGIN {
x = 10
print x
print y  # Lint warning: y uninitialized
}' /dev/null
# Pretty print
gawk -o- 'BEGIN { print "Hello" }'
# Dump variables
gawk '
BEGIN {
PROCINFO["sorted_in"] = "@ind_str_asc"
data["z"] = 26
data["a"] = 1
data["m"] = 13
for (i in data) {
print i, data[i]
}
}'

11. Best Practices and Tips

Code Organization

#!/bin/bash
# Use BEGIN for initialization
awk '
BEGIN {
FS = ","
OFS = " | "
total = 0
count = 0
}
{
total += $3
count++
print $1, $2, $3
}
END {
print "Average:", total/count
}' data.csv
# Use functions for reuse
awk '
function max(a, b) {
return (a > b) ? a : b
}
function min(a, b) {
return (a < b) ? a : b
}
{
mx = max(mx, $2)
mn = min(mn, $2)
sum += $2
}
END {
print "Max:", mx
print "Min:", mn
print "Avg:", sum/NR
}' data.txt
# Comment complex logic (PROCINFO["sorted_in"] below requires gawk)
gawk '
# Count occurrences of each value in column 2
{
count[$2]++  # Increment counter for this value
}
END {
print "Value Frequency Report"
print "======================"
# Sort by count descending
PROCINFO["sorted_in"] = "@val_num_desc"
for (val in count) {
printf "%10s: %d\n", val, count[val]
}
}' data.txt

Performance Optimization

#!/bin/bash
# Avoid unnecessary operations
# Slow
awk '{ for (i=1; i<=NF; i++) if ($i ~ /pattern/) count++ }'
# Fast
awk '{ if ($0 ~ /pattern/) count++ }'
# Use arrays for lookups
awk '
BEGIN {
# Predefine lookup table
valid[1] = 1
valid[3] = 1
valid[5] = 1
}
{
if ($2 in valid) {
print $0
}
}' data.txt
# Process large files efficiently
awk '
# Use pattern matching before field splitting
/ERROR|WARN/ {
# Only split fields when needed
split($0, fields)
print fields[1], fields[2]
}' large.log
# Avoid regex in loops
# Slow
awk '{ for (i=1; i<=NF; i++) if ($i ~ /^[0-9]+$/) sum += $i }'
# Fast
awk '{ for (i=1; i<=NF; i++) if ($i+0 == $i) sum += $i }'
# Use next to skip processing
awk '
/^#/ { next }           # Skip comments
/^$/ { next }           # Skip empty lines
{
# Process data lines
print $0
}' config.txt

Common Pitfalls and Solutions

#!/bin/bash
# Pitfall 1: Field separator confusion
# Wrong - default whitespace FS leaves colon-delimited lines as one field
awk '{ print $1 }' /etc/passwd           # Prints the whole line
# Right - set the field separator explicitly
awk -F: '{ print $1, $7 }' /etc/passwd   # Username and shell
# Pitfall 2: String vs numeric comparison
awk '{
# String comparison (lexical)
if ($2 > "100") print "String compare"
# Numeric comparison
if ($2 + 0 > 100) print "Numeric compare"
}'
# Pitfall 3: Missing BEGIN for header
# Wrong
awk '{ print "Name", "Age"; print $1, $2 }' data.txt  # Prints header every line
# Right
awk 'BEGIN { print "Name", "Age" } { print $1, $2 }' data.txt
# Pitfall 4: Assuming field order
# Wrong
awk '{ print $1, $3 }' data.txt  # Might not be what you expect
# Right - use named fields with header
awk 'NR==1 { for (i=1; i<=NF; i++) col[$i]=i }
NR>1 { print $(col["Name"]), $(col["Salary"]) }' data.txt
# Pitfall 5: Not handling missing fields
awk '{
# Safe access with default
value = ($3 ? $3 : 0)
print value
}' data.txt
# Pitfall 6: Variable scope - awk has no "local" keyword;
# extra, unpassed function parameters act as locals
awk '
function increment(   local_var) {
local_var++    # local: exists only inside the function
global_var++   # global: visible everywhere
}
{
increment()
print global_var, local_var   # local_var is empty here
}'
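Pitfall 2 is worth demonstrating concretely: forcing string context with concatenation and numeric context with +0 gives opposite answers for the same fields.

```shell
# "9" sorts after "100" as a string, but 9 < 100 as a number
echo "9 100" | awk '{
if ($1 "" > $2 "") print "as strings: 9 sorts after 100"
if ($1 + 0 < $2 + 0) print "as numbers: 9 is less than 100"
}'
```

Both lines print, showing the two comparison modes disagree.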

12. Real-World Examples

System Monitoring Script

#!/bin/bash
# Comprehensive system monitoring with awk
MONITOR_LOG="/var/log/system_monitor.log"
REPORT_EMAIL="admin@example.com"
# Collect system data
collect_data() {
# CPU info
top -bn1 | awk '
/%Cpu/ {
user = $2
system = $4
idle = $8
printf "CPU: User=%.1f%% System=%.1f%% Idle=%.1f%%\n", user, system, idle
}' > /tmp/cpu.tmp
# Memory info
free -m | awk '
NR==2 {
printf "Memory: Total=%dMB Used=%dMB Free=%dMB\n", $2, $3, $4
}' > /tmp/mem.tmp
# Disk info
df -h | awk '
NR>1 {
if ($5+0 > 80) warn = "WARNING"
else warn = "OK"
printf "Disk: %s %s Used=%s %s\n", $1, $6, $5, warn
}' > /tmp/disk.tmp
# Process info
ps aux | awk '
NR>1 {
cpu[$11] += $3
mem[$11] += $4
}
END {
print "Top Processes by CPU:"
for (p in cpu) {
printf "  %s: CPU=%.1f%% MEM=%.1f%%\n", substr(p,1,20), cpu[p], mem[p]
}
}' > /tmp/proc.tmp
}
# Generate report
generate_report() {
# strftime() below is a gawk extension
gawk '
BEGIN {
print "System Monitoring Report"
print "========================"
print "Time:", strftime("%Y-%m-%d %H:%M:%S")
print ""
}
FILENAME == "/tmp/cpu.tmp" { cpu = $0 }
FILENAME == "/tmp/mem.tmp" { mem = $0 }
FILENAME == "/tmp/disk.tmp" { disk[++d] = $0 }
FILENAME == "/tmp/proc.tmp" { proc = proc ? proc "\n" $0 : $0 }
END {
print "CPU Status:"
print "  " cpu
print ""
print "Memory Status:"
print "  " mem
print ""
print "Disk Status:"
for (i in disk) print "  " disk[i]
print ""
print "Process Summary:"
print proc
}' /tmp/cpu.tmp /tmp/mem.tmp /tmp/disk.tmp /tmp/proc.tmp
}
# Check thresholds
check_thresholds() {
awk '
/^CPU/ {
idle = $NF                    # e.g. "Idle=93.5%"
gsub(/[^0-9.]/, "", idle)     # keep only the number
if (idle + 0 < 20) print "WARNING: CPU idle below 20%"
}
/^Memory/ {
line = $0
gsub(/[^0-9 ]/, "", line)     # strip labels, keep the numbers
split(line, n, " ")           # n[1]=total, n[2]=used
if (n[1] > 0 && n[2] * 100 / n[1] > 90) print "WARNING: Memory usage above 90%"
}
/^Disk/ && /WARNING/ {
print "WARNING: " $0
}' /tmp/cpu.tmp /tmp/mem.tmp /tmp/disk.tmp
}
# Main
collect_data
generate_report | tee "$MONITOR_LOG"
if [ -n "$REPORT_EMAIL" ]; then
check_thresholds | while read warning; do
echo "$warning" | mail -s "System Alert" "$REPORT_EMAIL"
done
fi
# Cleanup
rm -f /tmp/cpu.tmp /tmp/mem.tmp /tmp/disk.tmp /tmp/proc.tmp

Log Analysis Framework

#!/bin/bash
# Generic log analysis framework
LOG_FILE="$1"
REPORT_TYPE="${2:-summary}"
if [ ! -f "$LOG_FILE" ]; then
echo "Usage: $0 logfile [summary|time|perf|error|full]" >&2
exit 1
fi
# Configuration
declare -A PATTERNS=(
["ERROR"]="ERROR|Error|error|Failed|failed"
["WARN"]="WARN|Warn|warn|Warning|warning"
["INFO"]="INFO|Info|info|Notice|notice"
["DEBUG"]="DEBUG|Debug|debug"
)
# Basic statistics
basic_stats() {
awk '
{
total_lines++
bytes += length($0) + 1
}
END {
print "Basic Statistics"
print "================"
print "Lines:", total_lines
print "Bytes:", bytes
print "Average line length:", bytes/total_lines
}' "$LOG_FILE"
}
# Pattern matching
pattern_analysis() {
awk -v err="${PATTERNS[ERROR]}" -v warn="${PATTERNS[WARN]}" \
-v info="${PATTERNS[INFO]}" -v debug="${PATTERNS[DEBUG]}" '
BEGIN {
pat["ERROR"] = err; pat["WARN"] = warn
pat["INFO"] = info; pat["DEBUG"] = debug
for (level in pat) count[level] = 0
}
{
for (level in pat) {
if ($0 ~ pat[level]) {
count[level]++
if (level == "ERROR" && nerr < 5) samples[++nerr] = $0
}
}
}
END {
print "\nPattern Analysis"
print "================"
for (level in count) {
printf "%-10s: %d\n", level, count[level]
}
print "\nSample Errors:"
for (i = 1; i <= nerr; i++) {
print "  " samples[i]
}
}' "$LOG_FILE"
}
# Time-based analysis
time_analysis() {
awk '
{
# Extract timestamp (assuming first field is timestamp)
timestamp = $1
hour = substr(timestamp, 12, 2)
hourly[hour]++
# Extract date
date = substr(timestamp, 1, 10)
daily[date]++
}
END {
print "\nTime-based Analysis"
print "==================="
print "Hourly Distribution:"
for (h = 0; h < 24; h++) {
printf "  %02d:00 - %02d:00: %d\n", h, (h+1)%24, hourly[sprintf("%02d", h)]
}
print "\nDaily Distribution:"
for (d in daily) {
printf "  %s: %d\n", d, daily[d]
}
}' "$LOG_FILE"
}
# Performance metrics
performance_metrics() {
# asort() below is a gawk extension
gawk '
/response time|duration|took/ {
match($0, /[0-9]+\.[0-9]+/)
if (RSTART) {
time = substr($0, RSTART, RLENGTH)
times[++count] = time + 0
total += time
}
}
END {
if (count > 0) {
print "\nPerformance Metrics"
print "===================="
printf "Average response: %.2f ms\n", total/count
# Sort times
n = asort(times)
printf "Min response: %.2f ms\n", times[1]
printf "Max response: %.2f ms\n", times[n]
printf "Median response: %.2f ms\n", times[int(n/2)]
printf "95th percentile: %.2f ms\n", times[int(n*0.95)]
}
}' "$LOG_FILE"
}
# Error correlation
error_correlation() {
awk '
/ERROR|error|Failed/ {
errors[NR] = $0
}
/WARN|warn/ {
warnings[NR] = $0
}
END {
print "\nError Correlation"
print "=================="
# Find errors near warnings
for (e in errors) {
for (w in warnings) {
if (w > e-5 && w < e+5) {
printf "Near line %d:\n", e
print "  Error:", errors[e]
print "  Warn: ", warnings[w]
}
}
}
}' "$LOG_FILE"
}
# Main execution
main() {
echo "Log Analysis Report: $LOG_FILE"
echo "Generated: $(date)"
echo "================================"
case "$REPORT_TYPE" in
summary)
basic_stats
pattern_analysis
;;
time)
time_analysis
;;
perf)
performance_metrics
;;
error)
error_correlation
;;
full)
basic_stats
pattern_analysis
time_analysis
performance_metrics
error_correlation
;;
*)
echo "Unknown report type: $REPORT_TYPE"
exit 1
;;
esac
}
# Run main function
main

Conclusion

awk is an incredibly powerful tool for text processing and data manipulation. Mastering it involves understanding:

Key Takeaways

  1. Pattern-Action Paradigm: Core concept of awk programming
  2. Field Processing: Automatic splitting of records into fields
  3. Built-in Variables: NR, NF, FS, OFS, and many others
  4. Control Structures: Conditionals, loops, and functions
  5. Arrays: Associative arrays for complex data structures
  6. String Functions: Powerful text manipulation capabilities
  7. Numeric Functions: Mathematical operations and statistics
  8. Integration: Works seamlessly with shell pipelines
  9. Performance: Efficient for large file processing
  10. Extensibility: GAWK provides advanced features

Best Practices Summary

  1. Use BEGIN/END blocks for initialization and finalization
  2. Leverage built-in variables for flexibility
  3. Write modular code with functions
  4. Comment complex logic for maintainability
  5. Handle edge cases (missing fields, empty lines)
  6. Optimize for performance with large files
  7. Use appropriate patterns to filter early
  8. Validate input data before processing
  9. Document field layouts for clarity
  10. Test with sample data before production

awk is not just a command but a complete programming language designed for text processing. Its combination of simplicity for common tasks and power for complex operations makes it an indispensable tool in any system administrator's or developer's toolkit.
