Mastering the Linux Triad: AWK – The Swiss Army Knife of Data Processing

1. Overview of AWK Basics

# Basic Structure
awk 'BEGIN{preprocessing} {line processing} END{postprocessing}' filename
# Common Variables
NR: current record (line) number | NF: number of fields in the current line | $0: entire line content | $1: first field
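These variables are easiest to see on a tiny inline sample (the names and numbers below are illustrative, not from a real file):

```shell
# Print the record number, the field count, and the first field of each line
printf 'alice 30\nbob 25\n' | awk '{print NR, NF, $1}'
# line 1 → "1 2 alice", line 2 → "2 2 bob"
```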

2. High-Frequency Practical Scenarios

1. Data Deduplication

Example: Retaining Unique Lines

# Deduplicate entire lines (keep the first occurrence)
awk '!seen[$0]++' data.log
# Deduplicate by specific column (keep the last occurrence per key)
awk -F',' '{a[$1]=$0} END{for(i in a) print a[i]}' data.csv
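The whole-line variant is easy to verify against an inline sample (the letters are illustrative data):

```shell
# seen[$0]++ evaluates to 0 (false) only the first time a line appears,
# so !seen[$0]++ selects each distinct line once, preserving input order
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# prints: a, b, c (one per line)
```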

2. Data Statistics

Example: Sales Data Statistics

date,product,sales
2023-08-01,phone,15
2023-08-01,headphones,30
2023-08-02,phone,20
# Total sales by product
awk -F',' 'NR>1 {sum[$2]+=$3} END{for(k in sum) print k,sum[k]}' sales.csv

# Calculate average daily sales (count distinct dates, not data rows)
awk -F',' 'NR>1 {total+=$3; if (!($1 in days)) {days[$1]=1; n++}} END{print "Average daily:", total/n}' sales.csv
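The per-product totals can be checked against the sample rows above; piping through sort pins down the otherwise unspecified `for (k in sum)` output order:

```shell
# Feed the article's sample rows inline and sum sales per product
printf 'date,product,sales\n2023-08-01,phone,15\n2023-08-01,headphones,30\n2023-08-02,phone,20\n' |
awk -F',' 'NR>1 {sum[$2]+=$3} END{for (k in sum) print k, sum[k]}' | sort
# prints: headphones 30, then phone 35
```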

3. Merging Two Files

Example: Merging Orders and User Information

# orders.txt
1001 2023-08-01 300
1002 2023-08-02 150

# users.txt
1001 Alice
1002 Bob

# Associate by user ID
awk 'NR==FNR {user[$1]=$2; next} $1 in user {print $0, user[$1]}' users.txt orders.txt
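The version above silently drops orders whose user ID has no match. If every order should be printed regardless, a left-join variant works (a sketch; the UNKNOWN placeholder and the extra 1003 row are assumptions for illustration):

```shell
# Recreate the sample files, plus one order with no matching user
printf '1001 Alice\n1002 Bob\n' > users.txt
printf '1001 2023-08-01 300\n1002 2023-08-02 150\n1003 2023-08-03 75\n' > orders.txt

# First pass (NR==FNR) loads users; second pass prints every order,
# falling back to a placeholder when the ID is unknown
awk 'NR==FNR {user[$1]=$2; next}
     {print $0, ($1 in user ? user[$1] : "UNKNOWN")}' users.txt orders.txt
```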

4. Generating Outlines

Example: Log Level Statistics

[ERROR] 2023-08-01 DB Connection Failed
[INFO] 2023-08-01 User login
[ERROR] 2023-08-02 API Timeout
# Generate error type statistics outline
awk '
BEGIN {FS="] "; print "## Error Log Statistics\n"}
$1 ~ /ERROR/ {
    split($2, parts, " ");
    date=parts[1];
    message=substr($2, length(date)+2);
    errors[date][message]++    # true multidimensional arrays: GNU awk (gawk) 4.0+
}
END {
    for (d in errors) {
        print "### " d;
        for (msg in errors[d]) {
            print "- " msg " (Count: " errors[d][msg] ")"
        }
    }
}' app.log
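Because `errors[date][message]` relies on gawk-only multidimensional arrays, a POSIX awk needs a different approach. The same grouping can be approximated with SUBSEP-joined keys (a sketch over inline sample lines; the log messages are illustrative):

```shell
printf '[ERROR] 2023-08-01 DB down\n[INFO] 2023-08-01 login\n[ERROR] 2023-08-01 DB down\n' |
awk -F'] ' '$1 ~ /ERROR/ {
    split($2, p, " ")
    # join date and message into one composite key via SUBSEP
    key = p[1] SUBSEP substr($2, length(p[1]) + 2)
    errors[key]++
}
END {
    for (k in errors) {
        split(k, kk, SUBSEP)       # unpack date and message again
        print kk[1] ": " kk[2] " (" errors[k] ")"
    }
}'
# prints: 2023-08-01: DB down (2)
```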

3. Advanced Techniques

  1. Handling Multiple Delimiters: -F '[,:]' splits on either comma or colon
  2. Field Rearrangement: awk '{print $3,$1}' adjusts column order
  3. Regular Expression Filtering: awk '/ERROR/ && $2 > 100' combines a pattern with a conditional test
  4. Calling External Commands: system("echo " $1) executes shell commands (beware passing untrusted field data unquoted)
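Technique 1 in action: with -F '[,:]' a key:value,key:value line splits into alternating keys and values (the host/cpu layout is illustrative):

```shell
# Fields after splitting on , or : are: host | web01 | cpu | 85
printf 'host:web01,cpu:85\nhost:db02,cpu:42\n' | awk -F'[,:]' '{print $2, $4}'
# prints: web01 85, then db02 42
```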

4. Performance Optimization Suggestions

  1. Prefer using mawk for processing large files (a faster AWK implementation)
  2. Reduce pipeline operations, try to complete data processing within AWK
  3. Pre-split fields that are used repeatedly
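Point 3, sketched: split a compound field once into an array and reuse the pieces, instead of calling split() or substr() repeatedly on every reference (the key=value layout is illustrative):

```shell
printf 'id=7;name=ann\nid=9;name=bo\n' | awk -F';' '{
    split($1, a, "=")   # split each field once...
    split($2, b, "=")
    print a[2], b[2]    # ...then reuse the cached parts
}'
# prints: 7 ann, then 9 bo
```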

Practical Advice: Save the examples in this article, and adjust the -F delimiter and the $n field numbers to match your data's field positions when using them. Mastering these techniques can increase your text processing efficiency by more than 10 times!

Further Reading:

  • [SED Magic: Advanced Text Replacement Techniques]
  • [18 Weapons of GREP: Precision Search Guide]

