Usage of the Linux Text Processing Command awk

awk is a text processing tool and a pattern-driven programming language available on Linux/Unix systems. It is particularly suitable for handling structured data (such as logs and CSV files) and supports programming features like pattern matching, variables, and functions.

1. Basic Syntax

awk [option] 'pattern { action }' input_file
  • pattern: Matching pattern (optional), used to filter lines.
  • action: Operation to perform on matching lines, enclosed in {}. If a pattern is given without { action }, awk prints the entire matching line by default; at least one of pattern and action must be present.
  • input_file: Input file (optional, defaults to reading from standard input).
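
The three forms described above (pattern only, action only, and both together) can be tried on a small sample. This is a minimal sketch; the input lines and shell variable names are made up for illustration.

```shell
# Hypothetical sample input held in a shell variable.
input='alpha 1
beta 2
gamma 3'

# Pattern only: the default action prints each matching line.
pattern_only=$(printf '%s\n' "$input" | awk '/beta/')

# Action only: with no pattern, the action runs for every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $2 }')

# Pattern and action: print field 1 of lines whose field 2 exceeds 1.
both=$(printf '%s\n' "$input" | awk '$2 > 1 { print $1 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```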

2. Basic Usage Examples

1. Print the entire line

awk '{ print }' file.txt      # Equivalent to cat file.txt
awk '{ print $0 }' file.txt   # $0 represents the entire line

2. Print specified fields

awk '{ print $1, $3 }' file.txt  # Print the 1st and 3rd columns
  • By default, fields are separated by runs of spaces or tabs.
  • $1, $2, etc. represent the 1st, 2nd, etc. columns, and $NF represents the last column.
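
To see how default splitting treats irregular whitespace, the sketch below feeds awk one made-up line containing leading spaces, a tab, and repeated spaces. It also shows $(NF-1) for the second-to-last field.

```shell
# Hypothetical line: three leading spaces, one tab, then four spaces.
line=$(printf '   one\ttwo    three')

# Default splitting ignores leading blanks and treats each run of
# spaces/tabs as a single separator, so this line has exactly 3 fields.
summary=$(printf '%s\n' "$line" | awk '{ print NF, $1, $(NF-1), $NF }')

echo "$summary"
```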

3. Specify delimiters

awk -F',' '{ print $1 }' file.csv     # Input delimiter is a comma
awk -F: '{ print $1 }' /etc/passwd    # Delimited by colon
awk -F'[ ,]' '{ print $1 }' file.txt  # Multiple delimiters (space or comma)
awk -F'[,:;]' '{ print $1, $3 }' file.txt  # Comma/colon/semicolon as delimiters

4. Output delimiters

awk 'BEGIN { OFS="\t" } { print $1, $3 }' file.txt  # Output separated by tabs
awk '{ print $1 "|" $3 }' file.txt # Custom output format
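
One detail worth trying out: OFS is used when print receives comma-separated arguments or when awk rebuilds $0 after a field assignment, but not when $0 is printed untouched. A minimal sketch with a made-up input line:

```shell
line='a b c'

# print $0 emits the record unchanged, so OFS has no effect here.
unchanged=$(echo "$line" | awk 'BEGIN { OFS="-" } { print $0 }')

# Assigning to any field (even to itself) makes awk rebuild $0 using OFS.
rebuilt=$(echo "$line" | awk 'BEGIN { OFS="-" } { $1 = $1; print $0 }')

# Comma-separated print arguments are always joined with OFS.
joined=$(echo "$line" | awk 'BEGIN { OFS="-" } { print $1, $3 }')

printf '%s\n' "$unchanged" "$rebuilt" "$joined"
```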

3. Pattern Matching

1. Regular matching

awk '/error/' log.txt              # Match lines containing "error"
awk '!/error/ { print }' log.txt   # Lines not matching "error"
awk '/^[0-9]/ { print }' file.txt  # Match lines starting with a number

2. Field matching

Fields can be compared using ==, !=, >, <, >=, <=.

awk -F: '$1 == "root" { print }' /etc/passwd    # 1st column is "root"
awk -F: '$3 > 1000 { print }' /etc/passwd       # 3rd column (UID) greater than 1000
awk '$1 ~ /^a/ { print }' file.txt              # 1st column starts with "a"

3. Line range matching

awk 'NR==1, NR==5 { print }' file.txt   # Print lines 1 to 5
awk 'NR>=10 && NR<=20' file.txt         # Print lines 10 to 20
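
The comma form used with NR==1, NR==5 is a range pattern, and its two ends can also be regular expressions. The sketch below assumes a made-up input in which the markers START and END delimit the block of interest.

```shell
# Hypothetical input with a marked block in the middle.
log='header
START
line one
line two
END
footer'

# A range pattern "p1, p2" selects from a line matching p1
# through the next line matching p2, both inclusive.
block=$(printf '%s\n' "$log" | awk '/^START$/, /^END$/')

echo "$block"
```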

4. Combined patterns

Use && (and), || (or), ! (not) to combine conditions.

Assuming we have a file employees.txt with the following content:

Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering

Print employees in the Engineering department with a salary greater than 62000:

awk '$4 == "Engineering" && $3 > 62000 { print $0 }' employees.txt

4. Built-in Variables

awk provides many useful built-in variables:

Variable    Description                                                                          Default Value
FS          Input field separator                                                                Space
OFS         Output field separator                                                               Space
RS          Input record (line) separator                                                        Newline \n
ORS         Output record (line) separator                                                       Newline \n
NF          Number of fields in the current line
NR          Current line number being processed (starting from 1, cumulative across all files)
FNR         Line number in the current file (independent for each file)
FILENAME    Filename of the current input file
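
As an illustration of how RS and FS interact, the sketch below uses paragraph mode (RS set to the empty string), in which blank lines separate records. The sample records are made up for this example.

```shell
# Hypothetical input: two records separated by a blank line.
records='name: Alice
dept: HR

name: Bob
dept: Engineering'

# RS="" selects paragraph mode: records are separated by blank lines.
# FS="\n" then makes each line of a paragraph one field.
result=$(printf '%s\n' "$records" |
    awk 'BEGIN { RS=""; FS="\n"; OFS=" | " } { print NR, NF, $1, $2 }')

echo "$result"
```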

Example:

awk '{ print NR, NF, $0 }' file.txt        # Print line number, number of fields, entire line
awk -F',' '{ print $1, $NF }' file.csv     # Print the 1st and last columns

5. BEGIN and END Blocks

  • BEGIN: Executes once before processing input.
  • END: Executes once after all input has been processed.
awk 'BEGIN { sum=0 } { sum+=$1 } END { print sum }' numbers.txt

Calculate the sum of the 1st column in the file.

# Calculate average salary
awk 'BEGIN { sum=0; count=0 } { sum += $3; count++ } END { print "Average Salary:", sum/count }' employees.txt
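
The average calculation divides by count, which is zero when the input is empty. A possible variant, sketched here with the employees.txt sample written to a temporary file, guards the division and formats the result with printf. Uninitialized awk variables already behave as 0, so the BEGIN block is left out.

```shell
# Recreate the employees.txt sample in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering
EOF

# Guard against empty input before dividing.
program='{ sum += $3; count++ }
END {
    if (count > 0) {
        printf "Average Salary: %.2f\n", sum / count
    } else {
        print "No records"
    }
}'

avg=$(awk "$program" "$tmp")
empty=$(awk "$program" /dev/null)
rm -f "$tmp"

printf '%s\n' "$avg" "$empty"
```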

6. Conditions and Loops

1. Conditional Statements

awk '{ if ($1 > 100) print "Large:", $0; else print "Small:", $0 }' file.txt

Give employees a raise and categorize them:

awk '{
    if ($3 > 60000) {
        status = "Senior";
        new_salary = $3 * 1.05;
    } else {
        status = "Junior";
        new_salary = $3 * 1.10;
    }
    print $1, $3, "->", new_salary, "(" status ")";
}' employees.txt

Output:

Alice 54000 -> 59400 (Junior)
Bob 60000 -> 66000 (Junior)
Carol 58000 -> 63800 (Junior)
David 52000 -> 57200 (Junior)
Eve 65000 -> 68250 (Senior)

2. Loop Statements

awk '{ for(i=1; i<=NF; i++) print $i }' file.txt  # Print each field line by line
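
Loops over fields are also handy for per-line calculations. The sketch below, using made-up numeric rows, totals each line with a for loop and reverses the field order with a while loop.

```shell
# Hypothetical numeric rows with differing field counts.
rows='1 2 3
10 20 30 40'

# for loop: accumulate a total over all fields of the line.
totals=$(printf '%s\n' "$rows" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) total += $i
    print NR ": " total
}')

# while loop: walk the fields from last to first.
reversed=$(printf '%s\n' "$rows" | awk '{
    out = ""
    i = NF
    while (i >= 1) {
        out = (out == "" ? $i : out " " $i)
        i--
    }
    print out
}')

printf '%s\n' "$totals" "$reversed"
```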

7. Arrays

1. Count word frequency

awk '{ for(i=1; i<=NF; i++) count[$i]++ } END { for(word in count) print word, count[word] }' file.txt

2. Group statistics by column

awk -F',' '{ sum[$1] += $2 } END { for(key in sum) print key, sum[key] }' data.csv
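
The order in which for (key in array) visits keys is unspecified, so grouped output can come out in any order. One common way to get stable output is to pipe the result through sort, as in this sketch with made-up CSV rows:

```shell
# Hypothetical CSV data: category,amount
data='fruit,3
veg,5
fruit,4
veg,1
grain,2'

# Sum column 2 per value of column 1, then sort for a stable order.
grouped=$(printf '%s\n' "$data" |
    awk -F',' '{ sum[$1] += $2 } END { for (key in sum) print key, sum[key] }' |
    sort)

echo "$grouped"
```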

8. Functions

1. Built-in Functions

  • String Functions:
    length(str)          # Length of the string
    substr(s, m, n)      # Extract substring (n characters starting from position m; positions start at 1)
    index(s, t)          # Returns the position of t in s, or 0 if not found
    split(s, arr, sep)   # Split string s into array arr by sep
    tolower(s)           # Convert s to lowercase

    awk '{ print length($1), substr($1, 1, 3), tolower($1) }' employees.txt

    Output:

    5 Ali alice
    3 Bob bob
    5 Car carol
    5 Dav david
    3 Eve eve
  • Mathematical Functions:

    sqrt(), log(), rand(), etc.

    awk '{ print sqrt($3), log($3), int($3), rand() }' employees.txt

    Output (the rand() column depends on the awk implementation and seed):

    232.379 10.8967 54000 0.924046
    244.949 11.0021 60000 0.593909
    240.832 10.9682 58000 0.306394
    228.035 10.859 52000 0.578941
    254.951 11.0821 65000 0.740133

2. Custom Functions

awk 'function double(x) { return x*2 } { print double($1) }' numbers.txt
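
Custom functions can also use local variables. awk has no local declaration, so the usual convention is to list locals as extra parameters that callers never pass. A minimal sketch, where the function name row_max and the input are hypothetical:

```shell
# Hypothetical rows of numbers.
nums='3 9 4
7 2'

maxima=$(printf '%s\n' "$nums" | awk '
# row_max returns the largest numeric field of the current record.
# "i" and "best" are locals: extra parameters act as local variables.
function row_max(   i, best) {
    best = $1 + 0
    for (i = 2; i <= NF; i++)
        if ($i + 0 > best) best = $i + 0
    return best
}
{ print row_max() }')

echo "$maxima"
```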

9. Common Options

Option           Function
-F               Specify field separator
-v var=value     Pass variable
-f script.awk    Read AWK script from file

Example of passing variables:

awk -v name="Alice" 'BEGIN {print "Hello, " name}'
# Output: Hello, Alice
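
The -v option is also the usual way to hand a shell variable to awk without splicing it into the program text. In this sketch the threshold value and the variable names threshold and min are made up; the rows come from the employees.txt sample.

```shell
threshold=55000   # hypothetical cutoff chosen for illustration

# Pass the shell value in as the awk variable "min".
names=$(printf '%s\n' \
    'Alice Johnson 54000 HR' \
    'Bob Smith 60000 Engineering' \
    'Carol Davis 58000 Marketing' |
    awk -v min="$threshold" '$3 > min { print $1 }')

echo "$names"
```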

When the logic is complex, it can be written as a script file:

BEGIN {
    FS = ":"
    print "System Users Report"
    print "--------------------"
}
$3 >= 1000 && $3 < 65534 {
    print "User:", $1, "Home:", $6
}
END {
    # strftime() is a GNU awk extension and is not part of POSIX awk
    print "Report generated on", strftime("%Y-%m-%d")
}

Run:

awk -f script.awk /etc/passwd

10. Practical Cases

1. Remove duplicates

awk '!seen[$0]++' file.txt  # Remove duplicate lines (keep the first occurrence)
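
The idiom works because seen[$0]++ evaluates to 0 (false) the first time a line appears and to a positive number afterwards, so the negation is true only for first occurrences. Swapping $0 for a field deduplicates by that column instead. A small sketch with made-up input:

```shell
# Hypothetical input with a repeated line and a repeated key.
dupes='a 1
b 2
a 1
a 3'

# Deduplicate whole lines.
unique_lines=$(printf '%s\n' "$dupes" | awk '!seen[$0]++')

# Deduplicate by the first column only (keeps the first line per key).
unique_keys=$(printf '%s\n' "$dupes" | awk '!seen[$1]++')

printf '%s\n' "$unique_lines" "--" "$unique_keys"
```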

2. Extract specific lines:

awk 'NR==10' file.txt  # Extract the 10th line
awk 'FNR==1' *.txt     # Extract the 1st line from each file

3. Analyze Nginx access logs

Assuming the log format is: 127.0.0.1 - - [10/Oct/2023:14:32:01 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0..."

# 1. Count the number of accesses for each IP
awk '{ ip_count[$1]++ } END { for(ip in ip_count) print ip, ip_count[ip] }' access.log

# 2. Find the top 5 IPs with the most accesses
awk '{ ip_count[$1]++ } END { for(ip in ip_count) print ip_count[ip], ip | "sort -nr | head -n 5"}' access.log

# 3. List the request paths that returned a 404 status code
awk '$9 == 404 { print $7 }' access.log

# 4. Calculate total traffic consumption (field 10 is the byte count)
awk '{ sum += $10 } END { print "Total Bytes Sent:", sum }' access.log
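
With the same field layout ($9 is the status code), a per-status breakdown and an actual count of 404 responses might look like the sketch below. The three log lines are invented samples that follow the format shown above.

```shell
# Hypothetical access log entries in the format shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
127.0.0.1 - - [10/Oct/2023:14:32:01 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
127.0.0.1 - - [10/Oct/2023:14:32:02 +0800] "GET /missing HTTP/1.1" 404 162 "-" "Mozilla/5.0"
10.0.0.2 - - [10/Oct/2023:14:32:03 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "curl/8.0"
EOF

# Requests per status code, sorted for a stable order.
status_counts=$(awk '{ n[$9]++ } END { for (s in n) print s, n[s] }' "$log" | sort)

# Number of 404 responses; "c + 0" prints 0 when there are none.
not_found=$(awk '$9 == 404 { c++ } END { print c + 0 }' "$log")
rm -f "$log"

printf '%s\n' "$status_counts" "404 total: $not_found"
```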

11. Advanced Techniques

awk’s multi-file processing is a very powerful and practical feature.

When processing multiple files, these variables are particularly useful:

  • FILENAME: The name of the currently processed file
  • NR: The current line number being processed (cumulative across all files)
  • FNR: The line number in the current file
  • ARGV: Command line argument array
  • ARGC: Number of command line arguments
  • ARGIND: The index in ARGV of the file currently being processed (GNU awk only; the first file is normally at index 1)

1. Simple Example

# Print filename and each line content
awk '{ print FILENAME ": " $0 }' file1.txt file2.txt file3.txt

Output:

file1.txt: First line file1
file1.txt: Second line file1
file2.txt: First line file2
file2.txt: Second line file2
file3.txt: First line file3
file3.txt: Second line file3

When FNR == 1, it indicates the start of processing a new file:

awk 'FNR == 1 { print "--- Start processing: " FILENAME " ---" } { print FNR ": " $0 }' file1.txt file2.txt

Output:

--- Start processing: file1.txt ---
1: First line file1
2: Second line file1
--- Start processing: file2.txt ---
1: First line file2
2: Second line file2

Using BEGINFILE and ENDFILE:

# Requires GNU awk (gawk)
awk 'BEGINFILE { print "Start processing: " FILENAME } { print FNR ": " $0 } ENDFILE { print "Finished processing: " FILENAME "\n" }' file1.txt file2.txt

Output:

Start processing: file1.txt
1: First line file1
2: Second line file1
Finished processing: file1.txt

Start processing: file2.txt
1: First line file2
2: Second line file2
Finished processing: file2.txt

2. Classic Application Patterns

Basic Syntax

# Using FILENAME (all awk)
awk 'FILENAME == "file1.txt" { action1 } FILENAME == "file2.txt" { action2 }' file1.txt file2.txt
awk 'FILENAME == "file1.txt" { action1; next } { action2 }' file1.txt file2.txt

# Using FNR and NR (all awk)
awk 'NR==FNR { action1 } NR>FNR { action2 }' file1.txt file2.txt
awk 'NR==FNR { action1; next } { action2 }' file1.txt file2.txt

# Using ARGIND (GNU awk)
awk 'ARGIND==1 { action1 } ARGIND==2 { action2 }' file1.txt file2.txt
awk 'ARGIND==1 { action1; next } { action2 }' file1.txt file2.txt

3. Practical Cases

(1) Assume there are two files:

staff.txt (Employee Information):

101 Alice Johnson
102 Bob Smith  
103 Carol Davis

salary.txt (Salary Information):

101 54000
103 58000
102 60000

# Load staff.txt into an array, then match with salary.txt (ARGIND requires GNU awk)
awk 'ARGIND==1 { emp[$1] = $2 " " $3; next } $1 in emp { print emp[$1] " - Salary: $" $2 }' staff.txt salary.txt

Output:

Alice Johnson - Salary: $54000
Carol Davis - Salary: $58000
Bob Smith - Salary: $60000
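
The same join can be written without ARGIND by using the NR==FNR pattern from the previous section, which should run on any POSIX awk. A sketch that recreates both sample files in a temporary directory (note that NR==FNR misfires if the first file is empty):

```shell
# Recreate the two sample files in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' '101 Alice Johnson' '102 Bob Smith' '103 Carol Davis' > "$dir/staff.txt"
printf '%s\n' '101 54000' '103 58000' '102 60000' > "$dir/salary.txt"

# NR == FNR holds only while the first file is being read.
joined=$(awk 'NR == FNR { emp[$1] = $2 " " $3; next }
              $1 in emp { print emp[$1] " - Salary: $" $2 }' \
    "$dir/staff.txt" "$dir/salary.txt")
rm -rf "$dir"

echo "$joined"
```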

(2) Merge two files

file1.txt:

Apple 10
Banana 20
Orange 15

file2.txt:

Apple 5
Banana 3
Pear 8

# Merge values from two files (ARGIND requires GNU awk)
awk 'ARGIND==1 { count[$1] += $2; next } { count[$1] += $2 } END { for(item in count) print item, count[item] }' file1.txt file2.txt

Output (for-in iteration order is unspecified, so the lines may appear in a different order):

Apple 15
Pear 8
Banana 23
Orange 15

(3) Get the actual filename being processed

awk 'FNR==1 { print "Current file index:", ARGIND, "Current filename:", ARGV[ARGIND], "Complete command line arguments:", ARGC, "arguments" }' file1.txt file2.txt

Output:

Current file index: 1 Current filename: file1.txt Complete command line arguments: 3 arguments
Current file index: 2 Current filename: file2.txt Complete command line arguments: 3 arguments

(4) Process multiple reference files

# Process three files: the first and second as references, the third as main data
awk '
ARGIND==1 {ref1[$1]=$2; next}
ARGIND==2 {ref2[$1]=$3; next}
{ 
    # Process the third file, using data from the first two files
    if ($1 in ref1 && $1 in ref2) {
        print $0, "Reference1:" ref1[$1], "Reference2:" ref2[$1]
    }
}' ref1.txt ref2.txt main_data.txt

(5) Merge multiple log files

# Merge multiple log files by timestamp
awk '{print FILENAME, $0}' *.log | sort -k2,3

(6) Data association query

users.txt:

1 Alice 25
2 Bob 30
3 Carol 28

emails.txt:

1 [email protected]
2 [email protected]
3 [email protected]

main_data.txt:

1 Project A Completed
2 Project B In Progress
3 Project C Completed

awk '
ARGIND==1 { user_name[$1] = $2; user_age[$1] = $3; next }
ARGIND==2 { user_email[$1] = $2; next }
{
    user_id = $1
    if (user_id in user_name && user_id in user_email) {
        # The project name spans fields 2-3; the status is everything from field 4 on
        status = $4
        for (i = 5; i <= NF; i++) status = status " " $i
        print "User:", user_name[user_id], "(" user_age[user_id] " years old)"
        print "Email:", user_email[user_id]
        print "Project:", $2 " " $3, "- Status:", status
        print "---"
    } else {
        print "Error: User ID", user_id, "data is incomplete"
    }
}' users.txt emails.txt main_data.txt

Output:

User: Alice (25 years old)
Email: [email protected]
Project: Project A - Status: Completed
---
User: Bob (30 years old)
Email: [email protected]
Project: Project B - Status: In Progress
---
User: Carol (28 years old)
Email: [email protected]
Project: Project C - Status: Completed
---

(7) Configuration file overriding mechanism

base_config.txt:

theme=dark
language=zh-CN
timeout=30
debug=false

user_config.txt:

theme=light
timeout=60

awk '
ARGIND==1 {
    # Load base configuration
    split($0, parts, "=")
    config[parts[1]] = parts[2]
    next
}
ARGIND==2 {
    # User configuration overrides base configuration
    split($0, parts, "=")
    config[parts[1]] = parts[2]
    next
}
END {
    print "Final Configuration:"
    for (key in config) {
        print "  " key "=" config[key]
    }
}' base_config.txt user_config.txt

Output:

Final Configuration:
  debug=false
  language=zh-CN
  timeout=60
  theme=light
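
Because later files simply overwrite earlier keys, the override can also be sketched without ARGIND by splitting on = with -F and piping through sort for a stable order. This variant assumes values contain no = character; the files are recreated in a temporary directory.

```shell
# Recreate the two sample configuration files.
dir=$(mktemp -d)
printf '%s\n' 'theme=dark' 'language=zh-CN' 'timeout=30' 'debug=false' > "$dir/base_config.txt"
printf '%s\n' 'theme=light' 'timeout=60' > "$dir/user_config.txt"

# Later files win: each key=value line overwrites the stored value.
# NF >= 2 skips blank or malformed lines.
merged=$(awk -F'=' 'NF >= 2 { config[$1] = $2 }
    END { for (key in config) print key "=" config[key] }' \
    "$dir/base_config.txt" "$dir/user_config.txt" | sort)
rm -rf "$dir"

echo "$merged"
```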

By mastering these usages, you can efficiently handle most text analysis tasks with awk. For more complex functionalities, refer to man awk or the GNU awk documentation.
