AWK is an acronym for the names of developers Aho, Weinberger, and Kernighan, and we would like to express our gratitude to these pioneers.The history of AWK can be traced back to the early Unix era. It is part of the POSIX standard and should be available on any Unix-like system, and possibly on other systems as well.When analyzing data, such as obtaining raw BED, VCF, MAF files, or GTF text files, we may want to perform statistical analysis on basic information and some filtering variables in Linux, which showcases the power of AWK. In this tutorial, we will use VCF, MAF, and BED files, which are commonly used formats in bioinformatics, as examples:
In genomic analysis, we often encounter text files like VCF, MAF, BAM, and BED. It is crucial to analyze results promptly after obtaining them; therefore, mastering AWK fluently can significantly enhance the speed and quality of genomic analysis compared to AI.
Example Data
We can store the example data locally for use, and the example files can also be obtained from GitHub: (1) VCF example file:
##fileformat=VCFv4.2
##for test by Hu
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 100 . A T 30 PASS AC=1;AN=2
chr1 200 . C G 25 PASS AC=2;AN=2
chr2 150 . T C 20 PASS AC=1;AN=2
chr10 150 . G C 20 Low AC=1;AN=2
chrY 150 . T AC 20 PASS AC=0;AN=2
(2) MAF example file:
Hugo_Symbol Entrez_Gene_Id NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
TP53 7157 GRCh38 chr17 7674409 7674409 + Missense_Mutation SNP C A T
BRCA1 672 GRCh38 chr17 41246000 41246000 + Nonsense_Mutation SNP A T T
(3) BED example file:
chr1 100 200 TP53,LOC
chr1 300 400 TTLL
chr2 150 250 TP63,Loc2
Additionally, a complex file test.txt has been added for testing:
# This is a complex example file for learning AWK commands
# Each line represents a record, fields are separated by tabs
Name Age Occupation Salary Location Notes
Alice 25 Engineer 50000 New York Has 5 years of experience
Bob 30 Doctor 70000 Los Angeles Specializes in cardiology
Charlie 28 Teacher 45000 Chicago Teaches mathematics
Diana 35 Lawyer 80000 Houston Works in corporate law
Eve 22 Student N/A Boston Majoring in Computer Science
Frank 40 Entrepreneur 100000 San Francisco Founded a tech startup
# Below are some special records for testing AWK's processing capabilities
SpecialRecord1 99 Unknown 0 Nowhere This is a special record with unusual values
SpecialRecord2 101 Alien -1 Space From another galaxy
# Comment lines and empty lines
# This is a comment line
# Another comment line
# More data
Gina 27 Designer 55000 Seattle Specializes in UI/UX design
Hank 33 Pilot 60000 Miami Commercial airline pilot
Predefined and Automatic Variables in AWK
AWK supports several predefined and automatic variables to help you write programs. You will often encounter: (1) RS – Record Separator AWK processes data line by line. By default, it is a newline character. Therefore, if not specified, a record is one line of the input file.
(2) NR – Current Input Record Number This will match the current line number of your data stream.
(3) FS/OFS – Characters used as field separators When AWK reads a line, it splits it into different fields based on the value of <span>FS</span>. When AWK prints a line in the output, it will use the <span>OFS</span> separator instead of the <span>FS</span> separator to recombine these fields. Typically, <span>FS</span> and <span>OFS</span> are the same, but this is not mandatory. “Space” is the default for both.
(4) NF – Number of Fields in the Current Record If a line of data is split, it will match the number of fields after splitting.
There are others, but these four are commonly used, and mastering them can solve 99% of problems.
Basic Usage of AWK
<span>awk</span> scripts consist of <span>BEGIN</span> block, main processing block, and <span>END</span> block. The execution order is:<span>BEGIN</span> (before processing the file) → main processing block (process the file line by line) → <span>END</span> (after processing the file)
The command structure of AWK is as follows:
awk options 'selection_criteria {action }' input-file > output-file
(1) Options are the predefined functional fields mentioned above. (2) The ‘selection_criteria { action }’ part is used to select what to process, such as which rows and columns, and which fields. (3) Action is what to do when the condition is met. (4) The input file is directly attached to the command, obtaining the output file in a chained manner.
Practical Usage Examples
1. VCF Files
VCF (Variant Call Format) is a standard format for storing genetic variation data, and we will demonstrate how to extract and analyze information from it through multiple examples.
(1): Extract all filtered variants (FILTER=PASS)
awk 'BEGIN {FS="\t"} !/^##/ {if ($7 == "PASS") print $0}' input.vcf
This command:
- Sets the field separator to a tab character
- Skips metadata lines starting with
<span>##</span>(retaining the header line) <span>/^##/</span>indicates “match lines starting with<span>##</span>“- Only prints records where the FILTER field is “PASS”
(2) Extract variants on chromosome chr1 with quality value (QUAL) greater than 25
awk 'BEGIN {FS="\t"} !/^#/ {if ($1 == "chr1" && $6 > 25) print $1 "\t" $2 "\t" $4 "\t" $5 "\t" $6}' input.vcf
This command:
- Skips all comment lines (starting with #)
- Only processes variants on chr1
- Filters records with quality values greater than 25, using && to indicate “and”
- Only outputs chromosome, position, reference allele, variant allele, and quality value, reordering the output after print
- In the previous command, $0 indicates outputting all columns that meet the condition
(3) Count the number of variants on each chromosome
awk 'BEGIN {FS="\t"} !/^#/ {count[$1]++} END {for (chr in count) print chr "\t" count[chr]}' input.vcf | sort
This command:
- Uses an associative array count to tally the occurrences on each chromosome
<span>count</span>is an associative array in<span>awk</span>(similar to a dictionary), with<span>$1</span>(chromosome name) as the key, and<span>++</span>increments the count by 1 for each occurrence of the same key.- After processing all lines, it prints each chromosome and its corresponding variant count
<span>for (chr in count)</span>: iterates through all keys (chromosome names) in the associative array<span>count</span>.- Finally, the sort command is used to sort the output results
(4) Extract the AC (allele count) value from the INFO field
awk 'BEGIN {FS="\t"} !/^#/ {split($8, info, /;/); for (i in info) {if (info[i] ~ /^AC=/) {split(info[i], ac, /=/); print $1 "\t" $2 "\t" ac[2]}}}' input.vcf
This command:
- Splits the INFO field to extract the AC value
<span>split(string, array, separator)</span>: splits the 8th column into an array<span>info</span>based on<span>;</span>.- Outputs chromosome, position, and corresponding AC value
<span>print $1 "\t" $2 "\t" ac[2]</span>prints the result: the 1st column (chromosome), the 2nd column (position), and the AC value (<span>ac[2]</span>), separated by tabs
2. MAF Files
MAF (Mutation Annotation Format) files are used to store gene mutation annotation information, and here are common processing methods.
(1) Extract all mutations of the TP53 gene
awk 'BEGIN {FS="\t"} NR==1; $1 == "TP53"' input.maf
This command:
- Retains the first line header information using NR.
- Extracts all records where Hugo_Symbol is TP53
(2) Filter all missense mutations (Missense_Mutation)
awk 'BEGIN {FS="\t"} NR==1; $8 == "Missense_Mutation"' input.maf
Having gone through VCF files, processing ordinary text files like MAF files becomes quite simple.
(3) Extract all mutations on chromosome 17 and display gene name, position, and variant type
awk 'BEGIN {FS="\t"} NR==1 {print "Gene\tChromosome\tPosition\tVariant_Type"}; $4 == "chr17" {print $1 "\t" $4 "\t" $5 "\t" $9}' input.maf
This command:
- Customizes the output header
- Only processes mutations on chr17
- Outputs gene name, chromosome, position, and variant type
(4) Count the number of different variant types
awk 'BEGIN {FS="\t"} NR>1 {count[$9]++} END {print "Variant_Type\tCount"; for (type in count) print type "\t" count[type]}' input.maf
This is similar to counting how many mutations are on different chromosomes in VCF.
3. BED Files
BED files are used to describe genomic feature regions, typically containing chromosome, start position, end position, and name information.
(1) Calculate the length of each region and add it to the output
awk 'BEGIN {FS="\t"; OFS="\t"} {length = $3 - $2; print $0, length}' input.bed
This command:
- Calculates the length of each region (end position – start position)
- Adds the length as a new column to the output
- Uses OFS to ensure the output field separator is a tab character
(2) Filter regions longer than 100
awk 'BEGIN {FS="\t"} ($3 - $2) > 100 {print $0}' input.bed
Here, we find that as long as the processing steps do not cause ambiguity, they can be freely manipulated.
(3) Extract regions containing the TP53 gene
awk 'BEGIN {FS="\t"} $4 ~ /TP53/ {print $0}' input.bed
Uses a regular expression to match regions containing the TP53 gene
4. Demonstration of Other Complex Situations
For complex text files containing annotations, special values, and different formats, AWK provides flexible processing capabilities.
Skip comment lines and empty lines, only process data lines
awk 'BEGIN {FS="\t"} !/^#/ &&& NF > 0 {print $0}' input.txt
- Skips all comment lines starting with #
- Skips empty lines (NF > 0 indicates the number of fields is greater than 0)
- Only processes data lines
When to Use Curly Braces, Parentheses, and Semicolons?
1. Curly Braces `{}`: Used to Enclose Code Blocks
The core function of curly braces is to group a set of statements (actions) into a logical block, typically bound to a “pattern” or specific structure (such as functions, loops).
1. Actions in the Pattern-Action Pair
The basic syntax of AWK is <span>pattern { action }</span>, where the action part must be enclosed in curly braces, indicating “a series of operations to execute when the pattern matches.” Example:
# Execute the action of printing the first column for each line (pattern is empty, matches all lines)
{ print $1 }
# When the third column is greater than 100, execute the action of printing the first and third columns
$3 > 100 { print $1, $3 }
2. BEGIN/END Blocks
<span>BEGIN</span> (executed before processing the file) and <span>END</span> (executed after processing the file) blocks must have their code enclosed in curly braces:
BEGIN {
FS = "\t"
print "Starting to process the file"
}
END { print "Processing complete, total", NR, "lines" }
3. Bodies of Loops and Conditional Statements
<span>if</span>, <span>for</span>, <span>while</span> structures require their execution bodies to be enclosed in curly braces (even if there is only one statement, using braces is more standard):
{
if ($2 > 30) {
print $1, "is older than 30"
count++
}
for (i=1; i<=3; i++) {
print "Column", i, ":", $i
}
}
4. Function Definitions
The implementation body of a custom function must be enclosed in curly braces:
function sum(a, b) {
return a + b
}
{ print sum($2, $3) } # Call the function with parameters in parentheses
2. Parentheses `()`: Used for Grouping or Parameter Lists
Parentheses are mainly used for grouping expressions, function parameters, or regular expression capture groups.
1. Grouping Expressions (Changing Operation Precedence)
In mathematical or logical expressions, parentheses change the order of operations:
# Calculate 5+3 first, then multiply by 2 (otherwise defaults to 5+(3*2))
{ print (5 + 3) * 2 }
# First check if $2>20, then do logical AND with $3<100
{ if (($2 > 20) &&& ($3 < 100)) print $0 }
2. Parameter Lists for Function Calls
When calling built-in functions (like <span>print</span>, <span>split</span>) or custom functions, parameters must be enclosed in parentheses:
# Parameters for the split function: string, array, separator
{ split($8, info, /;/) }
# Split the 8th column into the info array by ;
# Parameters for the custom function sum
{ total = sum($2, $3) }
3. Capture Groups in Regular Expressions
In regular patterns, parentheses are used to define capture groups (which can be referenced by <span>\1</span>, <span>\2</span> etc.):
# Match "AC=number" and extract the number (\1 references the first group)
$8 ~ /AC=([0-9]+)/ { print "AC value:", gensub(/AC=([0-9]+)/, "\1", "g", $8) }
4. Array Traversal (for Loop)
When traversing an array, the variable before the <span>in</span> keyword needs to be in parentheses (fixed syntax):
# Traverse the keys of the count array
END {
for (key in count) {
# key in parentheses
print key, count[key]
}
}
3. Semicolon `;`: Used to Separate Statements
The semicolon is a statement separator, used to distinguish multiple independent statements on the same line.
1. Multiple Statements on the Same Line
When multiple statements are written on the same line, they must be separated by semicolons:
# Multiple statements written in one line in the BEGIN block
BEGIN { FS="\t"; OFS=","; print "Starting" }
# Multiple operations executed in one line in the action { $1 = toupper($1); print $0 }
# Print the entire line after converting the first column to uppercase
2. Single-Line Writing of Loops or Conditions
If the execution body of a loop/condition has only one statement, the curly braces can be omitted, but the statement must be separated from the structure by a semicolon (not recommended, as it reduces readability):
$2 > 30 { print $1 } # Standard writing (with curly braces)
$2 > 30; print $1 # Incorrect! Will be parsed as two independent actions
3. Empty Statements
A standalone semicolon represents an empty statement (no operation), sometimes used as a placeholder in loops:
# Skip the first 3 lines (empty statement means "do nothing")
NR <= 3 { ; } # Equivalent to not processing the first 3 lines
NR > 3 { print $0 }
Conclusion
AWK provides powerful text processing capabilities. By flexibly utilizing its pattern matching, field processing, and flow control functions, various structured and semi-structured data can be efficiently processed. The examples above demonstrate common applications in bioinformatics and general text processing. In practical use, these techniques can be combined according to specific needs to build more complex processing logic.
Mastering AWK commands can significantly improve data processing efficiency, especially when dealing with large datasets or requiring automated analysis processes, where AWK is often an indispensable tool.