<span>awk</span> is a text processing tool and a stream-oriented programming language available on Linux/Unix systems. It is particularly suitable for handling structured data (such as logs and CSV files) and supports programming features like pattern matching, variables, and functions.
1. Basic Syntax
awk [option] 'pattern { action }' input_file
- • pattern: Matching pattern (optional), used to select lines. If omitted, the action runs on every line.
- • action: Operation to perform on matching lines, enclosed in <span>{}</span>. If a <span>pattern</span> is given without <span>{ action }</span>, the default is to print the entire line. At least one of the two must be present.
- • input_file: Input file (optional; defaults to reading from standard input).
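The three forms can be compared side by side. The sketch below feeds the same input to a pattern-only program, an action-only program, and a program with both; the sample lines and shell variable names are made up for illustration.

```shell
# Hypothetical sample input: three lines on standard input.
input='alpha 1
beta 2
gamma 3'

# Pattern only: the default action prints each matching line.
pattern_only=$(printf '%s\n' "$input" | awk '/^b/')

# Action only: the action runs on every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $2 }')

# Pattern and action: the action runs only on matching lines.
both=$(printf '%s\n' "$input" | awk '$2 >= 2 { print $1 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```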
2. Basic Usage Examples
1. Print the entire line
awk '{ print }' file.txt # Equivalent to cat file.txt
awk '{ print $0 }' file.txt # $0 represents the entire line
2. Print specified fields
awk '{ print $1, $3 }' file.txt # Print the 1st and 3rd columns
- • By default, fields are separated by runs of spaces or tabs.
- • <span>$1</span>, <span>$2</span>, etc. represent the 1st, 2nd, etc. columns, and <span>$NF</span> represents the last column.
3. Specify delimiters
awk -F',' '{ print $1 }' file.csv # Input delimiter is a comma
awk -F: '{ print $1 }' /etc/passwd # Delimited by colon
awk -F'[ ,]' '{ print $1 }' file.txt # Multiple delimiters (space or comma)
awk -F'[,:;]' '{ print $1, $3 }' file.txt # Comma/colon/semicolon as delimiters
4. Output delimiters
awk 'BEGIN { OFS="\t" } { print $1, $3 }' file.txt # Output separated by tabs
awk '{ print $1 "|" $3 }' file.txt # Custom output format
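One detail that is easy to miss: OFS is inserted only where print arguments are separated by commas, or when the record is rebuilt after a field assignment. A minimal sketch, with a made-up input line:

```shell
line='a,b,c'

# The comma in print inserts OFS between the two fields.
with_comma=$(printf '%s\n' "$line" | awk -F',' 'BEGIN { OFS="\t" } { print $1, $3 }')

# Juxtaposition concatenates the values; OFS is not used here.
concatenated=$(printf '%s\n' "$line" | awk -F',' 'BEGIN { OFS="\t" } { print $1 $3 }')

# Assigning to any field rebuilds $0 with OFS between all fields.
rebuilt=$(printf '%s\n' "$line" | awk -F',' 'BEGIN { OFS=";" } { $1 = $1; print }')

printf '%s\n' "$with_comma" "$concatenated" "$rebuilt"
```

The <span>$1 = $1</span> assignment changes nothing in the field itself; it is only there to force awk to reassemble the line with the new separator.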
3. Pattern Matching
1. Regular matching
awk '/error/' log.txt # Match lines containing "error"
awk '!/error/ { print }' log.txt # Lines not matching "error"
awk '/^[0-9]/ { print }' file.txt # Match lines starting with a number
2. Field matching
Fields can be compared using <span>==</span>, <span>!=</span>, <span>></span>, <span><</span>, <span>>=</span>, <span><=</span>.
awk -F: '$1 == "root" { print }' /etc/passwd # 1st column is "root"
awk -F: '$3 > 1000 { print }' /etc/passwd # 3rd column (UID) greater than 1000
awk '$1 ~ /^a/ { print }' file.txt # 1st column starts with "a"
3. Line range matching
awk 'NR==1, NR==5 { print }' file.txt # Print lines 1 to 5
awk 'NR>=10 && NR<=20' file.txt # Print lines 10 to 20
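Range patterns also accept regular expressions as their start and end conditions, which helps when the line numbers are not known in advance. The sketch below contrasts a numeric window with a marker-delimited range; the START and END markers are invented for the example.

```shell
text='one
START
two
END
three'

# Numeric window: lines 2 through 3.
by_number=$(printf '%s\n' "$text" | awk 'NR >= 2 && NR <= 3')

# Range pattern: from the first line matching START through the next line matching END.
by_marker=$(printf '%s\n' "$text" | awk '/START/, /END/')

printf '%s\n' "$by_number" "---" "$by_marker"
```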
4. Combined patterns
Use <span>&&</span> (and), <span>||</span> (or), <span>!</span> (not) to combine conditions.
Assuming we have a file <span>employees.txt</span> with the following content:
Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering
Print employees in the Engineering department with a salary greater than 62000:
awk '$4 == "Engineering" && $3 > 62000 { print $0 }' employees.txt
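For reference, here is how each of the three operators behaves on the employees.txt data shown above. The data is piped in from a shell variable so that the sketch runs without creating a file; the variable names are illustrative.

```shell
employees='Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering'

# AND: both conditions must hold.
and_result=$(printf '%s\n' "$employees" | awk '$4 == "Engineering" && $3 > 62000')

# OR: either condition is enough.
or_result=$(printf '%s\n' "$employees" | awk '$4 == "HR" || $3 >= 60000 { print $1 }')

# NOT: invert a condition.
not_result=$(printf '%s\n' "$employees" | awk '!($4 == "HR") { print $1 }')

printf '%s\n' "$and_result" "$or_result" "$not_result"
```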
4. Built-in Variables
<span>awk</span> provides many useful built-in variables:
| Variable | Description | Default Value |
|---|---|---|
| <span>FS</span> | Input field separator | Space |
| <span>OFS</span> | Output field separator | Space |
| <span>RS</span> | Input record (line) separator | Newline <span>\n</span> |
| <span>ORS</span> | Output record (line) separator | Newline <span>\n</span> |
| <span>NF</span> | Number of fields in the current line | – |
| <span>NR</span> | Current line number being processed (starting from 1, cumulative across all files) | – |
| <span>FNR</span> | Line number in the current file (independent for each file) | – |
| <span>FILENAME</span> | Filename of the current input file | – |
Example:
awk '{ print NR, NF, $0 }' file.txt # Print line number, number of fields, entire line
awk -F',' '{ print $1, $NF }' file.csv # Print the 1st and last columns
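Several of these variables can be combined in one program. In the sketch below, FS and OFS are set in a BEGIN block (setting FS there has the same effect as the -F option), and NR, NF, and $NF are printed for each line. The colon-separated sample is invented and only loosely resembles /etc/passwd.

```shell
sample='root:x:0:0
alice:x:1000:1000:Alice'

report=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = ":"; OFS = " | " } { print NR, NF, $1, $NF }')

printf '%s\n' "$report"
```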
5. BEGIN and END Blocks
- • <span>BEGIN</span>: Executes once before processing input.
- • <span>END</span>: Executes once after all input has been processed.
awk 'BEGIN { sum=0 } { sum+=$1 } END { print sum }' numbers.txt
Calculate the sum of the 1st column in the file.
# Calculate average salary
awk 'BEGIN { sum=0; count=0 } { sum += $3; count++ } END { print "Average Salary:", sum/count }' employees.txt
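The average above divides by count, which is zero when the input is empty. A slightly more defensive sketch checks the count in the END block. It also drops the BEGIN initialization, since unset awk variables already behave as 0 in arithmetic. The "No records" message is an arbitrary choice.

```shell
employees='Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering'

prog='{ sum += $3; count++ }
END {
    if (count > 0) print "Average Salary:", sum / count
    else print "No records"
}'

average=$(printf '%s\n' "$employees" | awk "$prog")
empty=$(awk "$prog" < /dev/null)

printf '%s\n' "$average" "$empty"
```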
6. Conditions and Loops
1. Conditional Statements
awk '{ if ($1 > 100) print "Large:", $0; else print "Small:", $0 }' file.txt
Give employees a raise and categorize them:
awk '{
if ($3 > 60000) {
status = "Senior";
new_salary = $3 * 1.05;
} else {
status = "Junior";
new_salary = $3 * 1.10;
}
print $1, $3, "->", new_salary, "(" status ")";
}' employees.txt
Output:
Alice 54000 -> 59400 (Junior)
Bob 60000 -> 66000 (Junior)
Carol 58000 -> 63800 (Junior)
David 52000 -> 57200 (Junior)
Eve 65000 -> 68250 (Senior)
2. Loop Statements
awk '{ for(i=1; i<=NF; i++) print $i }' file.txt # Print each field line by line
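Loops over fields are usually combined with an accumulator. As a rough illustration, the first program below sums all fields on each line with a for loop, and the second reverses the field order with a while loop. The input rows are made up.

```shell
rows='1 2 3
10 20'

# for loop: add up every field of the line.
row_sums=$(printf '%s\n' "$rows" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) total += $i
    print NR ": " total
}')

# while loop: walk the fields from last to first.
reversed=$(printf '%s\n' "$rows" | awk '{
    i = NF; out = ""; sep = ""
    while (i >= 1) { out = out sep $i; sep = " "; i-- }
    print out
}')

printf '%s\n' "$row_sums" "$reversed"
```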
7. Arrays
1. Count word frequency
awk '{ for(i=1; i<=NF; i++) count[$i]++ } END { for(word in count) print word, count[word] }' file.txt
2. Group statistics by column
awk -F',' '{ sum[$1] += $2 } END { for(key in sum) print key, sum[key] }' data.csv
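Because the order in which <span>for (key in array)</span> visits keys is unspecified, grouped output is often piped through sort to make it stable. A small sketch with invented CSV data that tracks both a sum and a row count per key:

```shell
sales='east,10
west,5
east,7
west,1
north,4'

totals=$(printf '%s\n' "$sales" |
    awk -F',' '{ sum[$1] += $2; n[$1]++ }
               END { for (k in sum) print k, sum[k], n[k] }' |
    LC_ALL=C sort)

printf '%s\n' "$totals"
```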
8. Functions
1. Built-in Functions
- • String Functions:
length(str) # Length of the string
substr(s, m, n) # Extract substring (n characters starting from position m; positions start at 1)
index(s, t) # Returns the position of t in s, returns 0 if not found
split(s, arr, sep) # Split string s into array arr by sep; returns the number of pieces
awk '{ print length($1), substr($1, 1, 3), tolower($1) }' employees.txt
Output:
5 Ali alice
3 Bob bob
5 Car carol
5 Dav david
3 Eve eve
- • Mathematical Functions: <span>sqrt()</span>, <span>log()</span>, <span>rand()</span>, etc.
awk '{ print sqrt($3), log($3), int($3), rand() }' employees.txt
Output (the <span>rand()</span> column depends on the implementation and seed):
232.379 10.8967 54000 0.924046
244.949 11.0021 60000 0.593909
240.832 10.9682 58000 0.306394
228.035 10.859 52000 0.578941
254.951 11.0821 65000 0.740133
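The functions <span>split()</span>, <span>index()</span>, and <span>substr()</span> are often used together to parse key/value text. The following sketch assumes a made-up input of the form key=value;key=value and is meant only to show how the return values fit together.

```shell
parsed=$(printf '%s\n' 'user=alice;role=admin' | awk '{
    n = split($0, pairs, ";")            # n is the number of pieces
    for (i = 1; i <= n; i++) {
        eq  = index(pairs[i], "=")       # position of the first "="
        key = substr(pairs[i], 1, eq - 1)
        val = substr(pairs[i], eq + 1)   # with two arguments, substr runs to the end
        print key, toupper(val), length(val)
    }
}')

printf '%s\n' "$parsed"
```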
2. Custom Functions
awk 'function double(x) { return x*2 } { print double($1) }' numbers.txt
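Variables in awk are global unless they appear in a function's parameter list, so a common convention is to declare locals as extra parameters that callers never pass. The sketch below uses a hypothetical <span>max_field()</span> helper to return the largest numeric field on a line.

```shell
maxes=$(printf '%s\n' '3 9 2' '7 1' | awk '
# Parameters listed after the extra spaces act as local variables.
function max_field(    i, best) {
    best = $1 + 0                      # "+ 0" forces a numeric value
    for (i = 2; i <= NF; i++)
        if ($i + 0 > best) best = $i + 0
    return best
}
{ print max_field() }')

printf '%s\n' "$maxes"
```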
9. Common Options
| Option | Function |
|---|---|
| <span>-F</span> | Specify field separator |
| <span>-v var=value</span> | Pass variable |
| <span>-f script.awk</span> | Read AWK script from file |
Example of passing variables:
awk -v name="Alice" 'BEGIN {print "Hello, " name}'
# Output: Hello, Alice
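The -v option is the usual way to hand a shell variable to an awk program without splicing it into the program text. A minimal sketch, assuming the employees.txt data from earlier and a hypothetical cutoff stored in the shell variable <span>threshold</span>:

```shell
employees='Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering'

threshold=58000   # hypothetical cutoff held in a shell variable

# "min + 0" makes the numeric comparison explicit.
names=$(printf '%s\n' "$employees" |
    awk -v min="$threshold" '$3 >= min + 0 { print $1 }')

printf '%s\n' "$names"
```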
When the logic is complex, it can be written as a script file:
BEGIN {
FS = ":"
print "System Users Report"
print "--------------------"
}
$3 >= 1000 && $3 < 65534 {
print "User:", $1, "Home:", $6
}
END {
print "Report generated on", strftime("%Y-%m-%d") # strftime() is an extension (available in GNU awk), not part of POSIX awk
}
Run:
awk -f script.awk /etc/passwd
10. Practical Cases
1. Remove duplicates
awk '!seen[$0]++' file.txt # Remove duplicate lines (keep the first occurrence)
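The one-liner works because <span>seen[$0]++</span> returns the count before incrementing: 0 (false) the first time a line appears, so <span>!</span> makes the pattern true and the line is printed; on later occurrences the count is nonzero and the line is skipped. Indexing by a field instead of <span>$0</span> deduplicates by key. A sketch with invented input:

```shell
lines='a 1
b 2
a 3
a 1'

# Whole-line duplicates removed; the first occurrence is kept.
unique_lines=$(printf '%s\n' "$lines" | awk '!seen[$0]++')

# Only the first line for each value of column 1 is kept.
unique_keys=$(printf '%s\n' "$lines" | awk '!seen[$1]++')

printf '%s\n' "$unique_lines" "---" "$unique_keys"
```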
2. Extract specific lines:
awk 'NR==10' file.txt # Extract the 10th line
awk 'FNR==1' *.txt # Extract the 1st line from each file
3. Analyze Nginx access logs
Assuming the log format is:<span>127.0.0.1 - - [10/Oct/2023:14:32:01 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0..."</span>
# 1. Count the number of accesses for each IP
awk '{ ip_count[$1]++ } END { for(ip in ip_count) print ip, ip_count[ip] }' access.log
# 2. Find the top 5 IPs with the most accesses
awk '{ ip_count[$1]++ } END { for(ip in ip_count) print ip_count[ip], ip | "sort -nr | head -n 5"}' access.log
# 3. List the requested paths that returned a 404 status code
awk '$9 == 404 { print $7 }' access.log
# 4. Calculate total traffic (the 10th field is the byte count)
awk '{ sum += $10 } END { print "Total Bytes Sent:", sum }' access.log
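The same field positions support other summaries. The sketch below counts requests per status code and sums bytes per client IP. The three log lines are fabricated to match the format above, and the output is sorted only to make the order predictable.

```shell
log='127.0.0.1 - - [10/Oct/2023:14:32:01 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
127.0.0.1 - - [10/Oct/2023:14:32:02 +0800] "GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"
10.0.0.2 - - [10/Oct/2023:14:32:03 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"'

# Requests per status code ($9).
status_counts=$(printf '%s\n' "$log" |
    awk '{ c[$9]++ } END { for (s in c) print s, c[s] }' | LC_ALL=C sort)

# Bytes sent per client IP ($1 is the IP, $10 the byte count).
bytes_per_ip=$(printf '%s\n' "$log" |
    awk '{ b[$1] += $10 } END { for (ip in b) print ip, b[ip] }' | LC_ALL=C sort)

printf '%s\n' "$status_counts" "$bytes_per_ip"
```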
11. Advanced Techniques
<span>awk</span> can read several input files in a single run, which makes it possible to join, merge, and compare files.
When processing multiple files, these variables are particularly useful:
- • FILENAME: The name of the currently processed file
- • NR: The current line number being processed (cumulative across all files)
- • FNR: The line number in the current file
- • ARGV: Command line argument array
- • ARGC: Number of command line arguments
- • ARGIND: The index in ARGV of the file currently being processed (GNU awk only; the first file argument is normally at index 1)
1. Simple Example
# Print filename and each line content
awk '{ print FILENAME ": " $0 }' file1.txt file2.txt file3.txt
Output:
file1.txt: First line file1
file1.txt: Second line file1
file2.txt: First line file2
file2.txt: Second line file2
file3.txt: First line file3
file3.txt: Second line file3
When FNR == 1, it indicates the start of processing a new file:
awk 'FNR == 1 { print "--- Start processing: " FILENAME " ---" } { print FNR ": " $0 }' file1.txt file2.txt
Output:
--- Start processing: file1.txt ---
1: First line file1
2: Second line file1
--- Start processing: file2.txt ---
1: First line file2
2: Second line file2
Using BEGINFILE and ENDFILE:
# Requires GNU awk (gawk)
awk 'BEGINFILE { print "Start processing: " FILENAME } { print FNR ": " $0 } ENDFILE { print "Finished processing: " FILENAME "\n" }' file1.txt file2.txt
Output:
Start processing: file1.txt
1: First line file1
2: Second line file1
Finished processing: file1.txt
Start processing: file2.txt
1: First line file2
2: Second line file2
Finished processing: file2.txt
2. Classic Application Patterns
Basic Syntax
# Using FILENAME (all awk)
awk 'FILENAME == "file1.txt" { action1 } FILENAME == "file2.txt" { action2 }' file1.txt file2.txt
awk 'FILENAME == "file1.txt" { action1; next } { action2 }' file1.txt file2.txt
# Using FNR and NR (all awk)
awk 'NR==FNR { action1 } NR>FNR { action2 }' file1.txt file2.txt
awk 'NR==FNR { action1; next } { action2 }' file1.txt file2.txt
# Using ARGIND (GNU awk)
awk 'ARGIND==1 { action1 } ARGIND==2 { action2 }' file1.txt file2.txt
awk 'ARGIND==1 { action1; next } { action2 }' file1.txt file2.txt
3. Practical Cases
(1) Assume there are two files:
staff.txt (Employee Information):
101 Alice Johnson
102 Bob Smith
103 Carol Davis
salary.txt (Salary Information):
101 54000
103 58000
102 60000
# Load staff.txt into an array, then match with salary.txt
awk 'ARGIND==1 { emp[$1] = $2 " " $3; next } $1 in emp { print emp[$1] " - Salary: $" $2 }' staff.txt salary.txt
Output:
Alice Johnson - Salary: $54000
Carol Davis - Salary: $58000
Bob Smith - Salary: $60000
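Since ARGIND is specific to GNU awk, the same join can be sketched with the portable <span>NR == FNR</span> test from the patterns listed earlier. One caveat of this idiom: if the first file is empty, <span>NR == FNR</span> is also true while the second file is read. The temporary directory below exists only to make the example self-contained.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '101 Alice Johnson' '102 Bob Smith' '103 Carol Davis' > "$tmpdir/staff.txt"
printf '%s\n' '101 54000' '103 58000' '102 60000' > "$tmpdir/salary.txt"

# First file: remember the name for each ID. Second file: look the ID up.
joined=$(awk 'NR == FNR { emp[$1] = $2 " " $3; next }
              $1 in emp { print emp[$1] " - Salary: $" $2 }' \
         "$tmpdir/staff.txt" "$tmpdir/salary.txt")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

The output follows the order of salary.txt, because that is the file being streamed when the lines are printed.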
(2) Merge two files
file1.txt:
Apple 10
Banana 20
Orange 15
file2.txt:
Apple 5
Banana 3
Pear 8
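# Sum the values for each key across both files (the same action applies to every file, so no per-file test is needed)
awk '{ count[$1] += $2 } END { for(item in count) print item, count[item] }' file1.txt file2.txt
Output (the iteration order of for-in is unspecified, so the line order may differ):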
Apple 15
Pear 8
Banana 23
Orange 15
(3) Get the actual filename being processed
awk 'FNR==1 { print "Current file index:", ARGIND, "Current filename:", ARGV[ARGIND], "Complete command line arguments:", ARGC, "arguments" }' file1.txt file2.txt
Output:
Current file index: 1 Current filename: file1.txt Complete command line arguments: 3 arguments
Current file index: 2 Current filename: file2.txt Complete command line arguments: 3 arguments
(4) Process multiple reference files
# Process three files: the first and second as references, the third as main data
awk '
ARGIND==1 {ref1[$1]=$2; next}
ARGIND==2 {ref2[$1]=$3; next}
{
# Process the third file, using data from the first two files
if (($1 in ref1) && ($1 in ref2)) {
print $0, "Reference1:" ref1[$1], "Reference2:" ref2[$1]
}
}' ref1.txt ref2.txt main_data.txt
(5) Merge multiple log files
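# Merge multiple log files by timestamp (assumes each log line starts with a sortable date and time, which become fields 2-3 once the filename is prepended)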
awk '{print FILENAME, $0}' *.log | sort -k2,3
(6) Data association query
users.txt:
1 Alice 25
2 Bob 30
3 Carol 28
emails.txt:
1 [email protected]
2 [email protected]
3 [email protected]
main_data.txt:
1 Project A Completed
2 Project B In Progress
3 Project C Completed
awk '
ARGIND==1 { user_name[$1] = $2; user_age[$1] = $3; next }
ARGIND==2 { user_email[$1] = $2; next }
{
user_id = $1
# The project name spans fields 2-3; the status is everything from field 4 on
status = $4
for (i = 5; i <= NF; i++) status = status " " $i
if ((user_id in user_name) && (user_id in user_email)) {
print "User:", user_name[user_id], "(" user_age[user_id] " years old)"
print "Email:", user_email[user_id]
print "Project:", $2, $3, "- Status:", status
print "---"
} else {
print "Error: User ID", user_id, "data is incomplete"
}
}' users.txt emails.txt main_data.txt
Output:
User: Alice (25 years old)
Email: [email protected]
Project: Project A - Status: Completed
---
User: Bob (30 years old)
Email: [email protected]
Project: Project B - Status: In Progress
---
User: Carol (28 years old)
Email: [email protected]
Project: Project C - Status: Completed
---
(7) Configuration file overriding mechanism
base_config.txt:
theme=dark
language=zh-CN
timeout=30
debug=false
user_config.txt:
theme=light
timeout=60
awk '
ARGIND==1 {
# Load base configuration
split($0, parts, "=")
config[parts[1]] = parts[2]
next
}
ARGIND==2 {
# User configuration overrides base configuration
split($0, parts, "=")
config[parts[1]] = parts[2]
next
}
END {
print "Final Configuration:"
for (key in config) {
print " " key "=" config[key]
}
}' base_config.txt user_config.txt
Output (the order of <span>for ... in</span> iteration is unspecified, so the key order may differ):
Final Configuration:
debug=false
language=zh-CN
timeout=60
theme=light
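Because both per-file blocks perform the same assignment, the override can also be sketched without ARGIND: with <span>=</span> as the field separator, a later file simply overwrites keys set by an earlier one. This variant assumes values contain no <span>=</span> characters and sorts the result so that the output order is predictable.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'theme=dark' 'language=zh-CN' 'timeout=30' 'debug=false' > "$tmpdir/base_config.txt"
printf '%s\n' 'theme=light' 'timeout=60' > "$tmpdir/user_config.txt"

merged=$(awk -F'=' '
    { config[$1] = $2 }   # a later file overwrites keys set by an earlier one
    END { for (key in config) print key "=" config[key] }
' "$tmpdir/base_config.txt" "$tmpdir/user_config.txt" | LC_ALL=C sort)

printf '%s\n' "$merged"
rm -rf "$tmpdir"
```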
By mastering these usages, you can efficiently handle most text analysis tasks with <span>awk</span>. For more complex functionalities, refer to <span>man awk</span> or the GNU awk documentation.