Usage of the Linux Text Processing Command awk

awk is a powerful text processing tool and a pattern-driven, stream-oriented programming language on Linux/Unix systems. It is particularly well suited to structured data (logs, CSV, and so on) and supports programming features such as pattern matching, variables, and functions.

1. Basic Syntax

awk [option] 'pattern { action }' input_file

• pattern: Matching pattern (optional), used to filter lines.
• action: Operation to perform on matching lines, enclosed in {}. If a pattern is given without { action }, awk prints the entire matching line. At least one of pattern and action must be present.
• input_file: Input file (optional, defaults to reading from standard input).
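
The three rule forms described above (pattern only, action only, and both) can be compared side by side. This is a minimal sketch; the sample lines and variable names are made up for illustration.

```shell
# Hypothetical sample input: three lines, one of which contains "error".
input='ok start
error disk full
ok done'

# Pattern only: the default action prints the whole matching line.
pattern_only=$(printf '%s\n' "$input" | awk '/error/')

# Action only: the action runs for every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern and action: the action runs only for matching lines.
both=$(printf '%s\n' "$input" | awk '/error/ { print $2 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```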

2. Basic Usage Examples

1. Print the entire line

awk '{ print }' file.txt      # Equivalent to cat file.txt
awk '{ print $0 }' file.txt   # $0 represents the entire line

2. Print specified fields

awk '{ print $1, $3 }' file.txt  # Print the 1st and 3rd columns

• By default, fields are separated by runs of spaces and/or tabs; leading and trailing blanks are ignored.
• $1, $2, etc. represent the 1st, 2nd, etc. columns, and $NF represents the last column.
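
A short sketch of how default splitting and $NF behave, assuming a made-up line with irregular spacing:

```shell
# Made-up line with mixed spaces and a tab between fields.
line=$(printf 'alpha   beta\tgamma  delta')

# Runs of blanks count as a single separator, so this line has 4 fields.
field_count=$(printf '%s\n' "$line" | awk '{ print NF }')

# $NF is the last field; $(NF-1) is the one before it.
last_two=$(printf '%s\n' "$line" | awk '{ print $(NF-1), $NF }')

echo "$field_count"   # prints: 4
echo "$last_two"      # prints: gamma delta
```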

3. Specify delimiters

awk -F',' '{ print $1 }' file.csv     # Input delimiter is a comma
awk -F: '{ print $1 }' /etc/passwd    # Delimited by colon
awk -F'[ ,]' '{ print $1 }' file.txt  # Multiple delimiters (space or comma)
awk -F'[,:;]' '{ print $1, $3 }' file.txt  # Comma/colon/semicolon as delimiters

4. Output delimiters

awk 'BEGIN { OFS="\t" } { print $1, $3 }' file.txt  # Output separated by tabs
awk '{ print $1 "|" $3 }' file.txt # Custom output format
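
One detail worth knowing: OFS is used between comma-separated print arguments and when awk rebuilds $0 after a field assignment, but printing an untouched $0 keeps the original separators. The sketch below uses a made-up CSV line to show the common `$1 = $1` idiom that forces the rebuild.

```shell
# Made-up CSV line.
line='a,b,c'

# $0 is printed untouched, so OFS has no visible effect.
unchanged=$(printf '%s\n' "$line" | awk -F',' 'BEGIN { OFS = "|" } { print $0 }')

# Assigning to a field (here $1 = $1) makes awk rebuild $0 using OFS.
rebuilt=$(printf '%s\n' "$line" | awk -F',' 'BEGIN { OFS = "|" } { $1 = $1; print $0 }')

echo "$unchanged"   # prints: a,b,c
echo "$rebuilt"     # prints: a|b|c
```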

3. Pattern Matching

1. Regular matching

awk '/error/' log.txt              # Match lines containing "error"
awk '!/error/ { print }' log.txt   # Lines not matching "error"
awk '/^[0-9]/ { print }' file.txt  # Match lines starting with a number

2. Field matching

Fields can be compared using ==, !=, >, <, >=, <=, and matched against a regular expression with ~ and !~.

awk -F: '$1 == "root" { print }' /etc/passwd    # 1st column is "root" (-F: is needed because /etc/passwd is colon-delimited)
awk -F: '$3 > 1000 { print }' /etc/passwd       # 3rd column (UID) greater than 1000
awk '$1 ~ /^a/ { print }' file.txt          # 1st column starts with "a"

3. Line range matching

awk 'NR==1, NR==5 { print }' file.txt   # Print lines 1 to 5
awk 'NR>=10 && NR<=20' file.txt         # Print lines 10 to 20
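
Range patterns also accept regular expressions as their start and end conditions. A minimal sketch, assuming made-up BEGIN_BLOCK and END_BLOCK marker lines:

```shell
# Made-up input with a marked block in the middle.
input='header
BEGIN_BLOCK
line one
line two
END_BLOCK
footer'

# The range starts at a line matching the first pattern and ends at the
# next line matching the second pattern (both lines are included).
block=$(printf '%s\n' "$input" | awk '/BEGIN_BLOCK/, /END_BLOCK/')

echo "$block"
```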

4. Combined patterns

Use && (and), || (or), ! (not) to combine conditions.

Assuming we have a file employees.txt with the following content:

Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering

Print employees in the Engineering department with a salary greater than 62000:

awk '$4 == "Engineering" && $3 > 62000 { print $0 }' employees.txt
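
The || and ! operators work the same way. The sketch below reuses the employees.txt rows from above, fed through a shell variable so it runs without creating the file; the variable names are illustrative.

```shell
employees='Alice Johnson 54000 HR
Bob Smith 60000 Engineering
Carol Davis 58000 Marketing
David Lee 52000 HR
Eve Wilson 65000 Engineering'

# OR: HR staff, or anyone earning at least 65000.
hr_or_high=$(printf '%s\n' "$employees" | awk '$4 == "HR" || $3 >= 65000 { print $1 }')

# NOT with grouping: everyone outside HR who earns less than 60000.
non_hr_low=$(printf '%s\n' "$employees" | awk '!($4 == "HR") && $3 < 60000 { print $1 }')

echo "$hr_or_high"
echo "$non_hr_low"
```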

4. Built-in Variables

awk provides many useful built-in variables:

Variable	Description	Default Value
FS	Input field separator	Space
OFS	Output field separator	Space
RS	Input record (line) separator	Newline (\n)
ORS	Output record (line) separator	Newline (\n)
NF	Number of fields in the current line	–
NR	Current line number being processed (starting from 1, cumulative across all files)	–
FNR	Line number in the current file (independent for each file)	–
FILENAME	Filename of the current input file	–

Example:

awk '{ print NR, NF, $0 }' file.txt        # Print line number, number of fields, entire line
awk -F',' '{ print $1, $NF }' file.csv     # Print the 1st and last columns
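
RS and FS can be changed together to read multi-line records. The following sketch assumes a made-up input where records are separated by blank lines; setting RS to the empty string selects this "paragraph mode".

```shell
# Made-up input: two records separated by a blank line.
input='Alice
HR

Bob
Engineering'

# RS="" reads blank-line-separated records; FS="\n" makes each line a field.
joined=$(printf '%s\n' "$input" |
    awk 'BEGIN { RS = ""; FS = "\n"; OFS = " / " } { print NR, $1, $2 }')

echo "$joined"
```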

5. BEGIN and END Blocks

• BEGIN: Executes once before processing input.
• END: Executes once after all input has been processed.

awk 'BEGIN { sum=0 } { sum+=$1 } END { print sum }' numbers.txt

Calculate the sum of the 1st column in the file.

# Calculate average salary
awk 'BEGIN { sum=0; count=0 } { sum += $3; count++ } END { print "Average Salary:", sum/count }' employees.txt
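
If the input is empty, the division in the END block fails with a division-by-zero error. A hedged sketch of one way to guard against that, using NR as the record count (the shell function name `average` is hypothetical):

```shell
# Average of the 3rd column; the NR > 0 test protects against empty input.
average() {
    awk '{ sum += $3 } END { if (NR > 0) print sum / NR; else print "no data" }'
}

avg=$(printf 'Alice Johnson 54000 HR\nBob Smith 60000 Engineering\n' | average)
empty=$(printf '' | average)

echo "$avg"     # prints: 57000
echo "$empty"   # prints: no data
```

Uninitialized awk variables already behave as 0 in numeric context, so the explicit `sum=0` in BEGIN is optional.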

6. Conditions and Loops

1. Conditional Statements

awk '{ if ($1 > 100) print "Large:", $0; else print "Small:", $0 }' file.txt

Give employees a raise and categorize them:

awk '{
    if ($3 > 60000) {
        status = "Senior";
        new_salary = $3 * 1.05;
    } else {
        status = "Junior";
        new_salary = $3 * 1.10;
    }
    print $1, $3, "->", new_salary, "(" status ")";
}' employees.txt

Output:

Alice 54000 -> 59400 (Junior)
Bob 60000 -> 66000 (Junior)
Carol 58000 -> 63800 (Junior)
David 52000 -> 57200 (Junior)
Eve 65000 -> 68250 (Senior)

2. Loop Statements

awk '{ for(i=1; i<=NF; i++) print $i }' file.txt  # Print each field on its own line
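
Loops can also count downward. As a small illustration, this sketch reverses the field order of each line; the input is made up.

```shell
# Reverse the field order of each line with a for loop counting down.
reversed=$(printf 'a b c\n1 2\n' | awk '{
    out = $NF
    for (i = NF - 1; i >= 1; i--) out = out " " $i
    print out
}')

echo "$reversed"
```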

7. Arrays

1. Count word frequency

awk '{ for(i=1; i<=NF; i++) count[$i]++ } END { for(word in count) print word, count[word] }' file.txt

2. Group statistics by column

awk -F',' '{ sum[$1] += $2 } END { for(key in sum) print key, sum[key] }' data.csv
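
The order in which `for (key in array)` visits keys is unspecified, so grouped results are often piped through sort for stable output. A minimal sketch with made-up category,amount data:

```shell
# Made-up CSV: category,amount
data='fruit,3
veg,5
fruit,4
veg,1
dairy,2'

# Sum per key, then sort because for-in traversal order is unspecified.
totals=$(printf '%s\n' "$data" |
    awk -F',' '{ sum[$1] += $2 } END { for (key in sum) print key, sum[key] }' |
    sort)

echo "$totals"
```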

8. Functions

1. Built-in Functions

• String Functions:

length(str)         # Length of the string
substr(s, m, n)     # Extract substring (n characters starting at position m; positions start at 1)
index(s, t)         # Position of t in s, or 0 if not found
split(s, arr, sep)  # Split string s into array arr by sep; returns the number of elements
tolower(s)          # Lowercase copy of s

awk '{ print length($1), substr($1, 1, 3), tolower($1) }' employees.txt

Output:

5 Ali alice
3 Bob bob
5 Car carol
5 Dav david
3 Eve eve
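
split() and index() are easiest to understand in combination. The sketch below is illustrative only: the input line and the address in it are made up.

```shell
result=$(echo '2023-10-10 alice@example.com' | awk '{
    n = split($1, parts, "-")      # parts[1]="2023", parts[2]="10", parts[3]="10"
    at = index($2, "@")            # position of "@" within the second field
    user = substr($2, 1, at - 1)   # text before "@"
    print n, parts[1], at, toupper(user)
}')

echo "$result"   # prints: 3 2023 6 ALICE
```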

• Mathematical Functions:

sqrt(), log(), rand(), etc.

awk '{ print sqrt($3), log($3), int($3), rand() }' employees.txt

Output (the rand() column varies between runs and awk implementations):

232.379 10.8967 54000 0.924046
244.949 11.0021 60000 0.593909
240.832 10.9682 58000 0.306394
228.035 10.859 52000 0.578941
254.951 11.0821 65000 0.740133
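
rand() returns a value in the range [0, 1), and srand() controls the seed. Many awk implementations produce the same rand() sequence on every run unless srand() is called, although this varies by implementation. A small sketch of both seeding styles (the dice-roll use case is just an example):

```shell
# A fixed seed gives a reproducible sequence within one awk implementation.
first=$(awk 'BEGIN { srand(42); print int(rand() * 6) + 1 }')
second=$(awk 'BEGIN { srand(42); print int(rand() * 6) + 1 }')

# srand() with no argument seeds from the current time.
roll=$(awk 'BEGIN { srand(); print int(rand() * 6) + 1 }')

echo "$first $second $roll"
```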

2. Custom Functions

awk 'function double(x) { return x*2 } { print double($1) }' numbers.txt
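
awk has no local-variable declaration; by convention, extra parameters that the caller never passes serve as locals. A sketch with a hypothetical helper named max_field:

```shell
# max_field is a hypothetical helper. The parameters after the wide gap
# (i, best) are never passed by the caller and act as local variables.
largest=$(printf '3 9 4\n10 2 7\n' | awk '
function max_field(n,    i, best) {
    best = $1 + 0                       # "+ 0" forces numeric comparison
    for (i = 2; i <= n; i++) if ($i + 0 > best) best = $i + 0
    return best
}
{ print max_field(NF) }')

echo "$largest"
```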

9. Common Options

Option	Function
-F	Specify field separator
-v var=value	Pass variable
-f script.awk	Read AWK script from file

Example of passing variables:

awk -v name="Alice" 'BEGIN {print "Hello, " name}'
# Output: Hello, Alice
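
The -v option is also the usual way to hand a shell variable to awk without splicing it into the program text. A minimal sketch, assuming a made-up threshold and input:

```shell
# Hypothetical threshold held in a shell variable.
min_salary=58000

# -v assigns the shell value to the awk variable "min" before BEGIN runs.
names=$(printf 'Alice 54000\nBob 60000\nCarol 58000\n' |
    awk -v min="$min_salary" '$2 >= min { print $1 }')

echo "$names"
```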

When the logic is complex, it can be written as a script file:

BEGIN {
    FS = ":"
    print "System Users Report"
    print "--------------------"
}
$3 >= 1000 && $3 < 65534 {
    print "User:", $1, "Home:", $6
}
END {
    # strftime() is not part of POSIX awk; it is available in GNU awk (gawk)
    print "Report generated on", strftime("%Y-%m-%d")
}

Run:

awk -f script.awk /etc/passwd
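
Where strftime() is unavailable, one possible substitute is to read the output of the external date command through getline. This is a sketch that assumes a POSIX date utility is on the PATH; the variable names are illustrative.

```shell
today=$(awk 'BEGIN {
    cmd = "date +%Y-%m-%d"
    cmd | getline stamp     # read the first output line of the command
    close(cmd)
    print stamp
}')

echo "Report generated on $today"
```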

10. Practical Cases

1. Remove duplicates

awk '!seen[$0]++' file.txt  # Remove duplicate lines (keep the first occurrence)
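
The idiom works because seen[$0]++ returns the count before incrementing: 0 (false) on the first occurrence, so the negation is true and the line is printed. Changing the array key deduplicates by a column instead of the whole line, as in this sketch with made-up input:

```shell
# Keep only the first line for each distinct value of the 1st field.
unique=$(printf 'a 1\nb 2\na 3\nc 4\nb 5\n' | awk '!seen[$1]++')

echo "$unique"
```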

2. Extract specific lines:

awk 'NR==10' file.txt  # Extract the 10th line
awk 'FNR==1' *.txt     # Extract the 1st line from each file

3. Analyze Nginx access logs

Assuming the log format is: 127.0.0.1 - - [10/Oct/2023:14:32:01 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0..."

# 1. Count the number of accesses for each IP
awk '{ ip_count[$1]++ } END { for(ip in ip_count) print ip, ip_count[ip] }' access.log

# 2. Find the top 5 IPs with the most accesses
awk '{ ip_count[$1]++ } END { for(ip in ip_count) print ip_count[ip], ip | "sort -nr | head -n 5"}' access.log

# 3. List the request paths that returned a 404 status code
awk '$9 == 404 { print $7 }' access.log

# 4. Calculate total traffic (field 10 is the response byte count)
awk '{ sum += $10 } END { print "Total Bytes Sent:", sum }' access.log
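
The same field positions can be combined into a per-status summary. The sketch below runs against three made-up log lines in the format shown above, so the IPs, paths, and sizes are illustrative only.

```shell
# Three made-up log lines in the format shown above.
log='127.0.0.1 - - [10/Oct/2023:14:32:01 +0800] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
127.0.0.2 - - [10/Oct/2023:14:32:02 +0800] "GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"
127.0.0.1 - - [10/Oct/2023:14:32:03 +0800] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"'

# Requests and bytes per status code ($9 = status, $10 = bytes).
per_status=$(printf '%s\n' "$log" |
    awk '{ hits[$9]++; bytes[$9] += $10 }
         END { for (s in hits) print s, hits[s], bytes[s] }' |
    sort)

echo "$per_status"
```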

11. Advanced Techniques

awk’s multi-file processing is a very powerful and practical feature.

When processing multiple files, these variables are particularly useful:

• FILENAME: The name of the currently processed file
• NR: The current line number being processed (cumulative across all files)
• FNR: The line number in the current file
• ARGV: Command line argument array (ARGV[0] is the command name)
• ARGC: Number of command line arguments (including the command name)
• ARGIND: The index in ARGV of the file currently being processed (GNU awk only)
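
Because ARGIND exists only in GNU awk, a portable script can keep its own file counter by incrementing a variable whenever FNR is 1. A minimal sketch with two made-up files in a scratch directory; file_idx is a hypothetical variable name.

```shell
# Work in a scratch directory with two small made-up files.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\n' > "$tmpdir/two.txt"

# FNR resets to 1 at the start of each file, so counting those resets
# numbers the files in the order they are read.
indexed=$(awk 'FNR == 1 { file_idx++ } { print file_idx, FNR, NR, $0 }' \
    "$tmpdir/one.txt" "$tmpdir/two.txt")

echo "$indexed"
rm -r "$tmpdir"
```

Note that an empty file contributes no records, so this counter skips it, whereas ARGIND would still advance.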

1. Simple Example

# Print filename and each line content
awk '{ print FILENAME ": " $0 }' file1.txt file2.txt file3.txt

Output:

file1.txt: First line file1
file1.txt: Second line file1
file2.txt: First line file2
file2.txt: Second line file2
file3.txt: First line file3
file3.txt: Second line file3

When FNR == 1, it indicates the start of processing a new file:

awk 'FNR == 1 { print "--- Start processing: " FILENAME " ---" } { print FNR ": " $0 }' file1.txt file2.txt

Output:

--- Start processing: file1.txt ---
1: First line file1
2: Second line file1
--- Start processing: file2.txt ---
1: First line file2
2: Second line file2

Using BEGINFILE and ENDFILE:

# Requires GNU awk (gawk)
awk 'BEGINFILE { print "Start processing: " FILENAME } { print FNR ": " $0 } ENDFILE { print "Finished processing: " FILENAME "\n" }' file1.txt file2.txt

Output:

Start processing: file1.txt
1: First line file1
2: Second line file1
Finished processing: file1.txt

Start processing: file2.txt
1: First line file2
2: Second line file2
Finished processing: file2.txt

2. Classic Application Patterns

Basic Syntax

# Using FILENAME (works in any awk)
awk 'FILENAME == "file1.txt" { action1 } FILENAME == "file2.txt" { action2 }' file1.txt file2.txt
awk 'FILENAME == "file1.txt" { action1; next } { action2 }' file1.txt file2.txt

# Using FNR and NR (works in any awk; NR==FNR holds only while the first file is being read)
awk 'NR==FNR { action1 } NR>FNR { action2 }' file1.txt file2.txt
awk 'NR==FNR { action1; next } { action2 }' file1.txt file2.txt

# Using ARGIND (GNU awk only)
awk 'ARGIND==1 { action1 } ARGIND==2 { action2 }' file1.txt file2.txt
awk 'ARGIND==1 { action1; next } { action2 }' file1.txt file2.txt
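
As a concrete instance of the NR==FNR pattern, the sketch below prints rows of the second file whose key does not appear in the first (an "anti-join"). The file names and contents are made up.

```shell
tmpdir=$(mktemp -d)
printf '101\n103\n' > "$tmpdir/paid.txt"
printf '101 Alice\n102 Bob\n103 Carol\n' > "$tmpdir/staff.txt"

# First file: remember each key. Second file: print rows whose key was not seen.
missing=$(awk 'NR == FNR { seen[$1]; next } !($1 in seen)' \
    "$tmpdir/paid.txt" "$tmpdir/staff.txt")

echo "$missing"   # prints: 102 Bob
rm -r "$tmpdir"
```

One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so the first rule would consume the second file instead.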

3. Practical Cases

(1) Assume there are two files:

staff.txt (Employee Information):

101 Alice Johnson
102 Bob Smith
103 Carol Davis

salary.txt (Salary Information):

101 54000
103 58000
102 60000

# Load staff.txt into an array, then match with salary.txt (ARGIND requires GNU awk)
awk 'ARGIND==1 { emp[$1] = $2 " " $3; next } $1 in emp { print emp[$1] " - Salary: $" $2 }' staff.txt salary.txt

Output:

Alice Johnson - Salary: $54000
Carol Davis - Salary: $58000
Bob Smith - Salary: $60000

(2) Merge two files

file1.txt:

Apple 10
Banana 20
Orange 15

file2.txt:

Apple 5
Banana 3
Pear 8

# Merge values from two files
awk 'ARGIND==1 { count[$1] += $2; next } { count[$1] += $2 } END { for(item in count) print item, count[item] }' file1.txt file2.txt

Output (for-in traversal order is unspecified, so the line order may differ):

Apple 15
Pear 8
Banana 23
Orange 15

(3) Get the actual filename being processed

awk 'FNR==1 { print "Current file index:", ARGIND, "Current filename:", ARGV[ARGIND], "Complete command line arguments:", ARGC, "arguments" }' file1.txt file2.txt

Output:

Current file index: 1 Current filename: file1.txt Complete command line arguments: 3 arguments
Current file index: 2 Current filename: file2.txt Complete command line arguments: 3 arguments

(4) Process multiple reference files

# Process three files: the first and second as references, the third as main data
awk '
ARGIND==1 {ref1[$1]=$2; next}
ARGIND==2 {ref2[$1]=$3; next}
{ 
    # Process the third file, using data from the first two files
    if ($1 in ref1 && $1 in ref2) {
        print $0, "Reference1:" ref1[$1], "Reference2:" ref2[$1]
    }
}' ref1.txt ref2.txt main_data.txt

(5) Merge multiple log files

# Merge multiple log files by timestamp (the sort key is the first two fields of each log line, assumed to hold the date and time)
awk '{print FILENAME, $0}' *.log | sort -k2,3

(6) Data association query

users.txt:

1 Alice 25
2 Bob 30
3 Carol 28

emails.txt:

1 [email protected]
2 [email protected]
3 [email protected]

main_data.txt:

1 Project A Completed
2 Project B In Progress
3 Project C Completed

awk '
ARGIND==1 { user_name[$1] = $2; user_age[$1] = $3; next }
ARGIND==2 { user_email[$1] = $2; next }
{
    user_id = $1
    if (user_id in user_name && user_id in user_email) {
        # The project name spans fields 2-3; the status is everything after that
        status = $4
        for (i = 5; i <= NF; i++) status = status " " $i
        print "User:", user_name[user_id], "(" user_age[user_id] " years old)"
        print "Email:", user_email[user_id]
        print "Project:", $2, $3, "- Status:", status
        print "---"
    } else {
        print "Error: User ID", user_id, "data is incomplete"
    }
}' users.txt emails.txt main_data.txt

Output:

User: Alice (25 years old)
Email: [email protected]
Project: Project A - Status: Completed
---
User: Bob (30 years old)
Email: [email protected]
Project: Project B - Status: In Progress
---
User: Carol (28 years old)
Email: [email protected]
Project: Project C - Status: Completed
---

(7) Configuration file overriding mechanism

base_config.txt:

theme=dark
language=zh-CN
timeout=30
debug=false

user_config.txt:

theme=light
timeout=60

awk '
ARGIND==1 {
    # Load base configuration
    split($0, parts, "=")
    config[parts[1]] = parts[2]
    next
}
ARGIND==2 {
    # User configuration overrides base configuration
    split($0, parts, "=")
    config[parts[1]] = parts[2]
    next
}
END {
    print "Final Configuration:"
    for (key in config) {
        print "  " key "=" config[key]
    }
}' base_config.txt user_config.txt

Output (for-in traversal order is unspecified, so the key order may differ):

Final Configuration:
  debug=false
  language=zh-CN
  timeout=60
  theme=light
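
Since both rules above do the same thing, the override can also be written without ARGIND, which makes it portable to any awk. The sketch below additionally keeps keys in first-seen order; it assumes values contain no "=" characters, and the array names order and config are illustrative.

```shell
tmpdir=$(mktemp -d)
printf 'theme=dark\nlanguage=zh-CN\ntimeout=30\ndebug=false\n' > "$tmpdir/base_config.txt"
printf 'theme=light\ntimeout=60\n' > "$tmpdir/user_config.txt"

# Later files overwrite earlier values; order[] remembers first-seen key order.
merged=$(awk -F'=' '
!($1 in config) { order[++n] = $1 }
{ config[$1] = $2 }
END { for (i = 1; i <= n; i++) print order[i] "=" config[order[i]] }
' "$tmpdir/base_config.txt" "$tmpdir/user_config.txt")

echo "$merged"
rm -r "$tmpdir"
```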

With these techniques you can handle most text analysis tasks efficiently with awk. For more advanced features, refer to man awk or the GNU awk documentation.
