Overview of the awk Command in Linux for Bioinformatics Applications

This post gives an overview of the awk command in Linux for bioinformatics work: its core principles, common use cases, and worked examples. awk is well suited to the structured text formats that genomic and transcriptomic analyses produce, such as SAM, VCF, BED, GFF/GTF, and FASTA/FASTQ.

1. Core Principles of awk

awk = Text Scanning Engine + Programming Language

Line-by-line Processing: Treats the input file as a stream of records separated by \n.
Field Splitting: Each line is automatically split into $1 (1st column), $2 (2nd column) … $NF (last column)
Pattern-Action Rules: pattern { action } → Executes action when the pattern matches
Built-in Variables: FS (Field Separator), OFS (Output Field Separator), NR (Current Line Number), NF (Number of Fields in Current Line)
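
The four ideas above can be seen working together in a minimal sketch. The three-column input (chromosome, position, depth) is made up for illustration, and the threshold of 20 is an arbitrary assumption.

```shell
# Toy tab-separated input: chromosome, position, depth (made-up values).
out=$(printf 'chr1\t100\t35\nchr1\t200\t12\nchr2\t50\t48\n' |
  awk 'BEGIN { FS = "\t"; OFS = "," }      # set input and output separators once
       $3 > 20 { print NR, $1, $2, $NF }   # pattern { action }: runs only when depth > 20
       END { print "records: " NR }')      # NR still holds the last record number here
echo "$out"
```

This should print the two rows with depth above 20 (prefixed by their line numbers and joined with commas), followed by `records: 3`.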

2. Frequently Used Commands in Bioinformatics

1. Basics: Data Extraction and Statistics

# (1) Extract chromosome and position from SAM file (columns 3 and 4), skipping header lines and unmapped reads
awk '!/^@/ && $3!="*" {print $3, $4}' alignments.sam

# (2) Count total sequence length in FASTQ file (the sequence is the 2nd line of every 4-line record)
awk 'NR%4==2 {sum+=length($1)} END{print sum}' reads.fastq

# (3) Calculate average quality value (QUAL, column 6) of variant sites in VCF file, skipping missing values (".")
awk '!/^#/ && $6!="." {total+=$6; count++} END{if (count) print total/count}' variants.vcf

# (4) Filter exon entries in GFF file (column 3=exon)
awk '$3 == "exon" {print $0}' annotations.gff
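
As a worked example of the `NR%4` idiom used for FASTQ above, here is a small sketch that also reports the read count and mean read length. The two reads are invented, and the script assumes strict 4-line records (no wrapped sequences).

```shell
# Two made-up reads in 4-line FASTQ layout: header, sequence, "+", qualities.
out=$(printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nGGCC\n+\nIIII\n' |
  awk 'NR % 4 == 2 { n++; sum += length($0) }   # line 2 of each record is the sequence
       END { if (n) printf "reads=%d bases=%d mean=%.1f\n", n, sum, sum / n }')
echo "$out"
```

With these two reads (8 and 4 bases) the output should be `reads=2 bases=12 mean=6.0`.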

2. Intermediate: Conditional Filtering and Data Transformation

# (1) Extract SNPs in VCF with PASS filter and depth>20 (column 7 is PASS, column 8 contains DP=<depth>); requires gawk for the 3-argument match()
awk '$7=="PASS" && length($4)==1 && length($5)==1 && match($8,/(^|;)DP=([0-9]+)/, d) && d[2]+0>20 {print $1,$2,$4,$5}' variants.vcf
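
The three-argument form of match() is a gawk extension, so on systems whose awk is mawk or BusyBox awk the depth filter needs another route. One possible portable sketch splits the INFO column on `;` and looks for an entry that starts with `DP=`. The three records below are made up for illustration.

```shell
# Toy VCF body lines (made-up values); column 7 = FILTER, column 8 = INFO.
out=$(printf 'chr1\t100\t.\tA\tG\t50\tPASS\tADP=99;DP=35\nchr1\t200\t.\tC\tT\t40\tPASS\tDP=12;AF=0.5\nchr2\t300\t.\tG\tA\t60\tq10\tDP=80\n' |
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       /^#/ { next }                              # skip header lines
       $7 == "PASS" {
           dp = -1
           n = split($8, info, ";")               # INFO is a list of KEY=VALUE entries
           for (i = 1; i <= n; i++)
               if (index(info[i], "DP=") == 1)    # key must start the entry, so ADP= is ignored
                   dp = substr(info[i], 4) + 0    # "+ 0" forces a numeric value
           if (dp > 20) print $1, $2, $4, $5
       }')
echo "$out"
```

Only the first record passes: the second has DP=12 and the third is not PASS.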

# (2) Merge overlapping regions in BED file (assumed sorted by chromosome, then start)
awk 'BEGIN{OFS="\t"} NR>1 && $1==chr && $2<=end {if ($3>end) end=$3; next} {if (NR>1) print chr,start,end; chr=$1; start=$2; end=$3} END{if (NR) print chr,start,end}' regions.bed

# (3) Convert single-line FASTA sequences to multi-line (60 characters per line)
awk '/^>/ {print; next} {for (i=1; i<=length($0); i+=60) print substr($0,i,60)}' sequences.fasta

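# (4) Convert SAM to BED (retain Read name); the end coordinate is estimated from read length and ignores indels and clipping in the CIGAR string
awk 'BEGIN{OFS="\t"} !/^@/ && $3!="*" {print $3, $4-1, $4-1+length($10), $1}' alignments.sam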
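
Using the read length as the interval length in the SAM-to-BED conversion is only exact for ungapped, unclipped alignments. A sketch of a CIGAR-aware variant follows, on the assumption that the reference span is the sum of the M, D, N, = and X operation lengths, as the SAM specification defines reference-consuming operations. The alignment line is invented.

```shell
# One made-up alignment: POS=100, CIGAR 5M2D3M1I4M2S (reference span = 5+2+3+4 = 14).
out=$(printf 'r1\t0\tchr1\t100\t60\t5M2D3M1I4M2S\t*\t0\t0\tACGTACGTACGTACG\t*\n' |
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       /^@/ || $3 == "*" { next }                     # skip headers and unmapped reads
       {
           cigar = $6; span = 0
           while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
               len = substr(cigar, 1, RLENGTH - 1) + 0
               op  = substr(cigar, RLENGTH, 1)
               if (index("MDN=X", op)) span += len    # only these operations consume the reference
               cigar = substr(cigar, RLENGTH + 1)     # drop the operation just parsed
           }
           print $3, $4 - 1, $4 - 1 + span, $1        # BED: 0-based start, exclusive end
       }')
echo "$out"
```

For this read the interval is chr1 99 to 113, whereas the read-length shortcut would give 99 to 114.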

3. Advanced: Association Analysis and Complex Statistics

# (1) Normalize gene expression matrix (TSV): divide each column by its mean (the whole matrix is held in memory)
awk 'BEGIN{FS=OFS="\t"} NR==1 {print; next} {name[NR]=$1; nf=NF; for(i=2;i<=NF;i++) {sum[i]+=$i; a[NR,i]=$i}} END{for(r=2;r<=NR;r++) {printf "%s", name[r]; for(i=2;i<=nf;i++) printf "\t%.4f", (sum[i] ? a[r,i]/(sum[i]/(NR-1)) : 0); print ""}}' expression_matrix.tsv

# (2) Find co-occurring mutations (common chr:pos in two VCFs)
awk '/^#/ {next} NR==FNR {a[$1":"$2]=1; next} ($1":"$2) in a' file1.vcf file2.vcf

# (3) Report ChIP-seq peaks that fall entirely within a gene extended by 5 kb upstream and downstream
awk 'BEGIN{OFS="\t"} 
    NR==FNR {if ($3=="gene") {chr=$1; start=$4-5000; end=$5+5000; genes[chr,++count[chr]]=start OFS end}; next} 
    {for(i=1; i<=count[$1]; i++) {split(genes[$1,i], s); if ($2>=s[1] && $3<=s[2]) {print $0; break}}}' genes.gtf peaks.bed
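
The two-file commands in this section rely on the `NR==FNR` idiom: NR counts records across all inputs while FNR restarts for each file, so the two are equal only while the first file is being read. A minimal sketch with two invented VCF-like files (the file names and contents are hypothetical):

```shell
# Two tiny made-up VCF-like files; only chr1:200 is present in both.
dir=$(mktemp -d)
printf '#CHROM\tPOS\nchr1\t100\nchr1\t200\n' > "$dir/a.vcf"
printf '#CHROM\tPOS\nchr1\t200\nchr2\t100\n' > "$dir/b.vcf"
out=$(awk '/^#/ { next }                               # ignore headers in both files
           FNR == NR { seen[$1 ":" $2] = 1; next }     # first file: remember each chr:pos
           ($1 ":" $2) in seen { print $1 ":" $2 }     # second file: report shared sites
          ' "$dir/a.vcf" "$dir/b.vcf")
rm -r "$dir"
echo "$out"
```

The only shared site, `chr1:200`, is printed. Note that the idiom misbehaves if the first file is empty, because `FNR == NR` then also holds for the second file.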

3. Performance Optimization Techniques

Preprocess Data: Use grep to discard irrelevant lines (and sort to order the input when a script requires it) before processing with awk
Reduce Pipeline Operations: Combine multiple awk operations into a single script (using ; to separate commands)
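Use Bitwise Operations: Quickly check FLAG bits in a SAM file. POSIX awk has no & operator; in gawk, and($2, 0x40) tests whether a read is the first in its pair, and and($2, 0x80) tests whether it is the second
Avoid the Regex Engine for Fixed Strings: Use index() instead of ~ matching when searching large files for a literal substring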
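
Bitwise functions are a gawk extension, so here is a sketch of a FLAG test that should also work in other awk implementations, using integer division and modulo. The helper name `has_flag` is my own, and the three SAM-like lines are made up. The same rule uses index() for a fixed-string search in place of a regex.

```shell
# Made-up SAM-like lines: read name, FLAG, and a tag column.
out=$(printf 'r1\t99\tNM:i:0\nr2\t147\tNM:i:2\nr3\t83\tNM:i:0\n' |
  awk 'BEGIN { FS = "\t" }
       function has_flag(flag, bit) {               # bit must be a power of two, e.g. 64 for 0x40
           return int(flag / bit) % 2 == 1
       }
       has_flag($2, 64) && index($3, "NM:i:0") {    # first-in-pair reads containing a literal string
           print $1
       }')
echo "$out"
```

FLAG values 99 and 83 both include 64 (first in pair), while 147 includes 128 (second in pair), so `r1` and `r3` are printed.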

4. Bioinformatics Data Practical Case Library

Case 1: Compute Overall GC Content in FASTQ

awk 'BEGIN{total=0; gc=0}
    NR%4==2 {seq=$1; len=length(seq); total+=len; gc+=gsub(/[GCgc]/, "", seq)}
    END{if (total) print "GC%:", (gc/total)*100}' input.fastq
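
If a per-read distribution is wanted rather than a single overall percentage, one possible sketch bins each read's GC fraction into 10%-wide bins. The bin width, the choice to fold 100% into the top bin, and the three reads are all assumptions made for illustration.

```shell
# Three made-up reads with GC fractions 0.5, 1.0 and 0.0.
out=$(printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nATAT\n+\nIIII\n' |
  awk 'NR % 4 == 2 && length($0) > 0 {
           seq = $0
           gc = gsub(/[GCgc]/, "", seq)                # gsub returns the number of replacements
           bin = int(10 * gc / length($0)) * 10        # lower edge of a 10%-wide bin
           if (bin == 100) bin = 90                    # fold 100% into the top bin
           count[bin]++
       }
       END { for (b = 0; b < 100; b += 10) if (count[b]) print b "-" (b + 10) "%", count[b] }')
echo "$out"
```

Each of the three reads lands in its own bin: 0-10%, 50-60%, and 90-100%.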

Case 2: Filter Low-Quality Reads (keep reads in which at least 90% of bases have Phred+33 quality ≥20)

awk 'BEGIN{qual_offset=33; for(i=33;i<127;i++) ord[sprintf("%c",i)]=i}
    {rec[NR%4]=$0}
    NR%4==0 {poor=0; len=length($0); for(i=1;i<=len;i++) if (ord[substr($0,i,1)]-qual_offset<20) poor++; if (len && poor/len<=0.1) print rec[1] ORS rec[2] ORS rec[3] ORS rec[0]}' input.fastq

Case 3: Extract Homozygous Genotypes from VCF (GT field of the first sample is 1/1 or 0/0)

awk '/^#/ {print; next}
    {gt=""; split($9, fmt, ":"); split($10, val, ":"); for (i in fmt) if (fmt[i]=="GT") gt=val[i]}
    gt ~ /^(1\/1|0\/0)$/' variants.vcf

5. Commonly Used awk Built-in Functions Quick Reference

Function	Purpose	Bioinformatics Application Example
`gsub(r,s,t)`	Globally replace regex r with s in t	Correct chromosome naming (`gsub(/^chrM$/,"chrMT",$1)`)
`split(s,a,fs)`	Split string into array	Parse INFO field (`split($8,info,/;/)`)
`substr(s,p,n)`	Extract substring	Extract first 10bp of Read sequence (`substr($10,1,10)`)
`system(cmd)`	Execute Shell command	Compress a finished output file (`system("gzip -f output.txt")`); to compress while writing, pipe instead (`print | "gzip > output.gz"`)
`sprintf(fmt,expr)`	Format output	Retain decimal places (`sprintf("%.2f", $6)`)
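
A small sketch applying several of these functions to one record may make the table more concrete. The four-column input (CHROM, POS, QUAL, INFO) is a simplified, made-up layout rather than a full VCF line.

```shell
# One made-up record: CHROM, POS, QUAL, INFO.
out=$(printf 'chrM\t100\t37.456\tDP=35;AF=0.5\n' |
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       {
           chrom = $1
           gsub(/^chrM$/, "chrMT", chrom)         # anchored, so an existing chrMT is left alone
           n = split($4, info, ";")               # info[1] is DP=35, info[2] is AF=0.5
           key = substr(info[1], 1, 2)            # first two characters of the first entry
           print chrom, key, n, sprintf("%.2f", $3)
       }')
echo "$out"
```

The output is the renamed chromosome, the key `DP`, the number of INFO entries (2), and QUAL rounded to 37.46, separated by tabs.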

Efficiency Summary: For large NGS data sets (e.g., hundreds of GB of FASTQ), prefer bioawk (an extended version of awk that supports FASTA/Q parsing), or combine awk with GNU parallel to process chunks concurrently.

Tips: The above content and commands are for reference only; in practice, adapt them to your actual data structure and server configuration.