A Comprehensive Guide to AWK and Its Applications in Genomics

AWK is an acronym for the names of developers Aho, Weinberger, and Kernighan, and we would like to express our gratitude to these pioneers.The history of AWK can be traced back to the early Unix era. It is part of the POSIX standard and should be available on any Unix-like system, and possibly on other systems as well.When analyzing data, such as obtaining raw BED, VCF, MAF files, or GTF text files, we may want to perform statistical analysis on basic information and some filtering variables in Linux, which showcases the power of AWK. In this tutorial, we will use VCF, MAF, and BED files, which are commonly used formats in bioinformatics, as examples:

In genomic analysis, we often encounter text files like VCF, MAF, BAM, and BED. It is crucial to analyze results promptly after obtaining them; therefore, mastering AWK fluently can significantly enhance the speed and quality of genomic analysis compared to AI.

Example Data

We can store the example data locally for use, and the example files can also be obtained from GitHub: (1) VCF example file:

##fileformat=VCFv4.2
##for test by Hu
#CHROM    POS ID  REF ALT QUAL    FILTER  INFO
chr1    100 .   A   T   30  PASS    AC=1;AN=2
chr1    200 .   C   G   25  PASS    AC=2;AN=2
chr2    150 .   T   C   20  PASS    AC=1;AN=2
chr10    150 .   G   C   20  Low AC=1;AN=2
chrY    150 .   T   AC  20  PASS    AC=0;AN=2

(2) MAF example file:

Hugo_Symbol    Entrez_Gene_Id  NCBI_Build  Chromosome  Start_Position  End_Position    Strand  Variant_Classification  Variant_Type    Reference_Allele    Tumor_Seq_Allele1   Tumor_Seq_Allele2
TP53    7157    GRCh38  chr17   7674409 7674409 +   Missense_Mutation   SNP C   A   T
BRCA1    672 GRCh38  chr17   41246000    41246000    +   Nonsense_Mutation   SNP A   T   T

(3) BED example file:

chr1    100 200 TP53,LOC
chr1    300 400 TTLL
chr2    150 250 TP63,Loc2

Additionally, a complex file test.txt has been added for testing:

# This is a complex example file for learning AWK commands
# Each line represents a record, fields are separated by tabs

Name    Age  Occupation    Salary   Location    Notes
Alice   25   Engineer      50000    New York    Has 5 years of experience
Bob     30   Doctor        70000    Los Angeles Specializes in cardiology
Charlie 28   Teacher       45000    Chicago     Teaches mathematics
Diana   35   Lawyer        80000    Houston     Works in corporate law
Eve     22   Student       N/A      Boston      Majoring in Computer Science
Frank   40   Entrepreneur  100000   San Francisco    Founded a tech startup

# Below are some special records for testing AWK's processing capabilities
SpecialRecord1    99   Unknown    0    Nowhere    This is a special record with unusual values
SpecialRecord2    101  Alien      -1   Space      From another galaxy

# Comment lines and empty lines
# This is a comment line
# Another comment line

# More data
Gina    27   Designer      55000    Seattle     Specializes in UI/UX design
Hank    33   Pilot         60000    Miami       Commercial airline pilot

Predefined and Automatic Variables in AWK

AWK supports several predefined and automatic variables to help you write programs. You will often encounter: (1) RS – Record Separator AWK processes data line by line. By default, it is a newline character. Therefore, if not specified, a record is one line of the input file.

(2) NR – Current Input Record Number This will match the current line number of your data stream.

(3) FS/OFS – Characters used as field separators When AWK reads a line, it splits it into different fields based on the value of <span>FS</span>. When AWK prints a line in the output, it will use the <span>OFS</span> separator instead of the <span>FS</span> separator to recombine these fields. Typically, <span>FS</span> and <span>OFS</span> are the same, but this is not mandatory. “Space” is the default for both.

(4) NF – Number of Fields in the Current Record If a line of data is split, it will match the number of fields after splitting.

There are others, but these four are commonly used, and mastering them can solve 99% of problems.

Basic Usage of AWK

<span>awk</span> scripts consist of <span>BEGIN</span> block, main processing block, and <span>END</span> block. The execution order is:<span>BEGIN</span> (before processing the file) → main processing block (process the file line by line) → <span>END</span> (after processing the file)

The command structure of AWK is as follows:

awk options 'selection_criteria {action }' input-file &gt; output-file

(1) Options are the predefined functional fields mentioned above. (2) The ‘selection_criteria { action }’ part is used to select what to process, such as which rows and columns, and which fields. (3) Action is what to do when the condition is met. (4) The input file is directly attached to the command, obtaining the output file in a chained manner.

Practical Usage Examples

1. VCF Files

VCF (Variant Call Format) is a standard format for storing genetic variation data, and we will demonstrate how to extract and analyze information from it through multiple examples.

(1): Extract all filtered variants (FILTER=PASS)

awk 'BEGIN {FS="\t"} !/^##/ {if ($7 == "PASS") print $0}' input.vcf

This command:

  • Sets the field separator to a tab character
  • Skips metadata lines starting with <span>##</span> (retaining the header line)
  • <span>/^##/</span> indicates “match lines starting with <span>##</span>
  • Only prints records where the FILTER field is “PASS”

(2) Extract variants on chromosome chr1 with quality value (QUAL) greater than 25

awk 'BEGIN {FS="\t"} !/^#/ {if ($1 == "chr1" &amp;&amp; $6 &gt; 25) print $1 "\t" $2 "\t" $4 "\t" $5 "\t" $6}' input.vcf

This command:

  • Skips all comment lines (starting with #)
  • Only processes variants on chr1
  • Filters records with quality values greater than 25, using && to indicate “and”
  • Only outputs chromosome, position, reference allele, variant allele, and quality value, reordering the output after print
  • In the previous command, $0 indicates outputting all columns that meet the condition

(3) Count the number of variants on each chromosome

awk 'BEGIN {FS="\t"} !/^#/ {count[$1]++} END {for (chr in count) print chr "\t" count[chr]}' input.vcf | sort

This command:

  • Uses an associative array count to tally the occurrences on each chromosome
  • <span>count</span> is an associative array in <span>awk</span> (similar to a dictionary), with <span>$1</span> (chromosome name) as the key, and <span>++</span> increments the count by 1 for each occurrence of the same key.
  • After processing all lines, it prints each chromosome and its corresponding variant count
  • <span>for (chr in count)</span>: iterates through all keys (chromosome names) in the associative array <span>count</span>.
  • Finally, the sort command is used to sort the output results

(4) Extract the AC (allele count) value from the INFO field

awk 'BEGIN {FS="\t"} !/^#/ {split($8, info, /;/); for (i in info) {if (info[i] ~ /^AC=/) {split(info[i], ac, /=/); print $1 "\t" $2 "\t" ac[2]}}}' input.vcf

This command:

  • Splits the INFO field to extract the AC value
  • <span>split(string, array, separator)</span>: splits the 8th column into an array <span>info</span> based on <span>;</span>.
  • Outputs chromosome, position, and corresponding AC value
  • <span>print $1 "\t" $2 "\t" ac[2]</span> prints the result: the 1st column (chromosome), the 2nd column (position), and the AC value (<span>ac[2]</span>), separated by tabs

2. MAF Files

MAF (Mutation Annotation Format) files are used to store gene mutation annotation information, and here are common processing methods.

(1) Extract all mutations of the TP53 gene

awk 'BEGIN {FS="\t"} NR==1; $1 == "TP53"' input.maf

This command:

  • Retains the first line header information using NR.
  • Extracts all records where Hugo_Symbol is TP53

(2) Filter all missense mutations (Missense_Mutation)

awk 'BEGIN {FS="\t"} NR==1; $8 == "Missense_Mutation"' input.maf

Having gone through VCF files, processing ordinary text files like MAF files becomes quite simple.

(3) Extract all mutations on chromosome 17 and display gene name, position, and variant type

awk 'BEGIN {FS="\t"} NR==1 {print "Gene\tChromosome\tPosition\tVariant_Type"}; $4 == "chr17" {print $1 "\t" $4 "\t" $5 "\t" $9}' input.maf

This command:

  • Customizes the output header
  • Only processes mutations on chr17
  • Outputs gene name, chromosome, position, and variant type

(4) Count the number of different variant types

awk 'BEGIN {FS="\t"} NR&gt;1 {count[$9]++} END {print "Variant_Type\tCount"; for (type in count) print type "\t" count[type]}' input.maf

This is similar to counting how many mutations are on different chromosomes in VCF.

3. BED Files

BED files are used to describe genomic feature regions, typically containing chromosome, start position, end position, and name information.

(1) Calculate the length of each region and add it to the output

awk 'BEGIN {FS="\t"; OFS="\t"} {length = $3 - $2; print $0, length}' input.bed

This command:

  • Calculates the length of each region (end position – start position)
  • Adds the length as a new column to the output
  • Uses OFS to ensure the output field separator is a tab character

(2) Filter regions longer than 100

awk 'BEGIN {FS="\t"} ($3 - $2) &gt; 100 {print $0}' input.bed

Here, we find that as long as the processing steps do not cause ambiguity, they can be freely manipulated.

(3) Extract regions containing the TP53 gene

awk 'BEGIN {FS="\t"} $4 ~ /TP53/ {print $0}' input.bed

Uses a regular expression to match regions containing the TP53 gene

4. Demonstration of Other Complex Situations

For complex text files containing annotations, special values, and different formats, AWK provides flexible processing capabilities.

Skip comment lines and empty lines, only process data lines

awk 'BEGIN {FS="\t"} !/^#/ &amp;&& NF &gt; 0 {print $0}' input.txt
  • Skips all comment lines starting with #
  • Skips empty lines (NF > 0 indicates the number of fields is greater than 0)
  • Only processes data lines

When to Use Curly Braces, Parentheses, and Semicolons?

1. Curly Braces `{}`: Used to Enclose Code Blocks

The core function of curly braces is to group a set of statements (actions) into a logical block, typically bound to a “pattern” or specific structure (such as functions, loops).

1. Actions in the Pattern-Action Pair

The basic syntax of AWK is <span>pattern { action }</span>, where the action part must be enclosed in curly braces, indicating “a series of operations to execute when the pattern matches.” Example:

# Execute the action of printing the first column for each line (pattern is empty, matches all lines)
{ print $1 }
# When the third column is greater than 100, execute the action of printing the first and third columns
$3 &gt; 100 { print $1, $3 }

2. BEGIN/END Blocks

<span>BEGIN</span> (executed before processing the file) and <span>END</span> (executed after processing the file) blocks must have their code enclosed in curly braces:

BEGIN { 
    FS = "\t" 
    print "Starting to process the file" 
} 
END { print "Processing complete, total", NR, "lines" }

3. Bodies of Loops and Conditional Statements

<span>if</span>, <span>for</span>, <span>while</span> structures require their execution bodies to be enclosed in curly braces (even if there is only one statement, using braces is more standard):

{ 
    if ($2 &gt; 30) { 
    print $1, "is older than 30" 
    count++ 
    } 
    for (i=1; i&lt;=3; i++) {  
        print "Column", i, ":", $i 
    } 
}

4. Function Definitions

The implementation body of a custom function must be enclosed in curly braces:

function sum(a, b) {  
    return a + b 
} 
{ print sum($2, $3) } # Call the function with parameters in parentheses

2. Parentheses `()`: Used for Grouping or Parameter Lists

Parentheses are mainly used for grouping expressions, function parameters, or regular expression capture groups.

1. Grouping Expressions (Changing Operation Precedence)

In mathematical or logical expressions, parentheses change the order of operations:

# Calculate 5+3 first, then multiply by 2 (otherwise defaults to 5+(3*2))
{ print (5 + 3) * 2 }
# First check if $2&gt;20, then do logical AND with $3&lt;100
{ if (($2 &gt; 20) &amp;&& ($3 &lt; 100)) print $0 }

2. Parameter Lists for Function Calls

When calling built-in functions (like <span>print</span>, <span>split</span>) or custom functions, parameters must be enclosed in parentheses:

# Parameters for the split function: string, array, separator
{ split($8, info, /;/) }
# Split the 8th column into the info array by ;

# Parameters for the custom function sum
{ total = sum($2, $3) }

3. Capture Groups in Regular Expressions

In regular patterns, parentheses are used to define capture groups (which can be referenced by <span>\1</span>, <span>\2</span> etc.):

# Match "AC=number" and extract the number (\1 references the first group)
$8 ~ /AC=([0-9]+)/ { print "AC value:", gensub(/AC=([0-9]+)/, "\1", "g", $8) }

4. Array Traversal (for Loop)

When traversing an array, the variable before the <span>in</span> keyword needs to be in parentheses (fixed syntax):

# Traverse the keys of the count array
END { 
    for (key in count) { 
    # key in parentheses 
        print key, count[key] 
    } 
}

3. Semicolon `;`: Used to Separate Statements

The semicolon is a statement separator, used to distinguish multiple independent statements on the same line.

1. Multiple Statements on the Same Line

When multiple statements are written on the same line, they must be separated by semicolons:

# Multiple statements written in one line in the BEGIN block
BEGIN { FS="\t"; OFS=","; print "Starting" }

# Multiple operations executed in one line in the action { $1 = toupper($1); print $0 }
# Print the entire line after converting the first column to uppercase

2. Single-Line Writing of Loops or Conditions

If the execution body of a loop/condition has only one statement, the curly braces can be omitted, but the statement must be separated from the structure by a semicolon (not recommended, as it reduces readability):

$2 &gt; 30 { print $1 }  # Standard writing (with curly braces)
$2 &gt; 30; print $1    # Incorrect! Will be parsed as two independent actions

3. Empty Statements

A standalone semicolon represents an empty statement (no operation), sometimes used as a placeholder in loops:

# Skip the first 3 lines (empty statement means "do nothing")
NR &lt;= 3 { ; }  # Equivalent to not processing the first 3 lines
NR &gt; 3 { print $0 }

Conclusion

AWK provides powerful text processing capabilities. By flexibly utilizing its pattern matching, field processing, and flow control functions, various structured and semi-structured data can be efficiently processed. The examples above demonstrate common applications in bioinformatics and general text processing. In practical use, these techniques can be combined according to specific needs to build more complex processing logic.

Mastering AWK commands can significantly improve data processing efficiency, especially when dealing with large datasets or requiring automated analysis processes, where AWK is often an indispensable tool.

Leave a Comment