Overview of the awk Command in Linux for Bioinformatics Applications

This post gives an overview of the awk command in Linux for bioinformatics work: its core principles, frequently used scenarios, and worked examples. awk is well suited to the structured, column-oriented text formats that are common in genomic and transcriptomic data.

1. Core Principles of awk

awk = Text Scanning Engine + Programming Language

  • Line-by-line Processing: Treats the input as a stream of records, separated by newlines (\n) by default.
  • Field Splitting: Each record is automatically split into fields: $1 (1st column), $2 (2nd column), … $NF (last column); $0 is the whole record.
  • Pattern-Action Rules: pattern { action } → the action runs for every record that matches the pattern.
  • Built-in Variables: FS (input field separator), OFS (output field separator), NR (number of records read so far, i.e. the current line number), NF (number of fields in the current record).
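
The four ideas above can be seen together in one small command. The following is a minimal sketch on a made-up, tab-separated input (the gene names and values are invented for illustration and the variable name demo_out is arbitrary), not output from a real pipeline:

```shell
# Made-up tab-separated input: name, chromosome, length (illustrative values only)
demo_out=$(printf 'geneA\tchr1\t1200\ngeneB\tchr2\t800\ngeneC\tchr1\t300\n' |
  awk 'BEGIN { FS = "\t"; OFS = "," }         # BEGIN runs before any input is read
       $2 == "chr1" { print NR, NF, $1, $NF }  # pattern { action }: only chr1 records
       END { print "records", NR }')           # END runs after the last record
echo "$demo_out"
```

Each printed line carries the record number (NR), the field count (NF), the first field, and the last field, joined by the OFS set in BEGIN.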

2. Frequently Used Commands in Bioinformatics

1. Basics: Data Extraction and Statistics
# (1) Extract reference name and position from a SAM file (columns 3 and 4), skipping header lines and unmapped reads
awk '!/^@/ && $3!="*" {print $3, $4}' alignments.sam

# (2) Count total sequence length in FASTQ file (every 4 lines take the 2nd line)
awk 'NR%4==2 {sum+=length($1)} END{print sum}' reads.fastq

# (3) Calculate the average QUAL value of variant records in a VCF file (column 6), ignoring missing values (.)
awk '!/^#/ && $6!="." {total+=$6; count++} END{if (count) print total/count}' variants.vcf

# (4) Filter exon entries in GFF file (column 3=exon)
awk '$3 == "exon" {print $0}' annotations.gff
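
To make the "every 4 lines, take the 2nd line" rule from command (2) concrete, here is a small sketch on two made-up reads (8 bp and 4 bp). The read names, sequences, and the variable name fastq_stats are invented for illustration; it assumes a strict 4-line-per-record FASTQ layout with no wrapped sequences:

```shell
# Two made-up reads (8 bp and 4 bp); assumes exactly 4 lines per FASTQ record
fastq_stats=$(printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nGGCC\n+\nIIII\n' |
  awk 'NR % 4 == 2 { n++; sum += length($0) }   # line 2 of each record is the sequence
       END { if (n) printf "%d reads, %d bases, mean %.1f\n", n, sum, sum / n }')
echo "$fastq_stats"
```

The same NR % 4 test selects the header (remainder 1), the separator (remainder 3), or the quality string (remainder 0).
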
2. Advanced: Conditional Filtering and Data Transformation
# (1) Extract biallelic SNPs in VCF with PASS filter and depth>20 (column 7 is PASS, column 8 contains DP=<n>; the 3-argument match() requires gawk)
awk '$7=="PASS" && $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ && match($8,/(^|;)DP=([0-9]+)/, d) && d[2]+0>20 {print $1,$2,$4,$5}' variants.vcf

# (2) Merge overlapping regions in BED file (assumed sorted by chromosome, then by start)
awk 'BEGIN{FS=OFS="\t"} NR>1 && $1==chr && $2<=end {if ($3>end) end=$3; next} {if (NR>1) print chr,start,end; chr=$1; start=$2; end=$3} END{if (NR) print chr,start,end}' regions.bed

# (3) Wrap single-line FASTA sequences to multiple lines (60 characters per line)
awk '/^>/ {print; next} {for (i=1; i<=length($0); i+=60) print substr($0,i,60)}' sequences.fasta

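# (4) Convert SAM to BED (retain read name); the end coordinate is approximated from the read length and ignores CIGAR indels and clipping
awk 'BEGIN{OFS="\t"} !/^@/ && $3!="*" {print $3, $4-1, $4-1+length($10), $1}' alignments.sam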
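
Interval merging, as in command (2) above, has two cases that are easy to get wrong: an interval fully contained in the previous one, and a change of chromosome. The sketch below runs one possible merge rule on made-up intervals that exercise both cases. It assumes the input is sorted by chromosome and then by start, and treats book-ended intervals (start equal to the previous end) as mergeable; the variable name merged is arbitrary:

```shell
# Made-up, position-sorted intervals: a contained interval and a chromosome change
merged=$(printf 'chr1\t100\t200\nchr1\t150\t180\nchr1\t190\t300\nchr1\t400\t500\nchr2\t100\t200\n' |
  awk 'BEGIN { FS = OFS = "\t" }
       # Same chromosome and start inside the current block: extend it if needed
       NR > 1 && $1 == chr && $2 <= end { if ($3 > end) end = $3; next }
       # Otherwise emit the finished block (if any) and start a new one
       { if (NR > 1) print chr, start, end; chr = $1; start = $2; end = $3 }
       END { if (NR) print chr, start, end }')
echo "$merged"
```

The first three chr1 intervals collapse into 100-300, while chr1 400-500 and chr2 100-200 stay separate.
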
3. Advanced: Association Analysis and Complex Statistics
# (1) Normalize gene expression matrix (TSV): divide each column by its mean (the whole matrix is held in memory)
awk 'BEGIN{FS=OFS="\t"} NR==1 {print; next} {name[NR]=$1; nf=NF; for(i=2;i<=NF;i++) {sum[i]+=$i; a[NR,i]=$i}} END{for(r=2;r<=NR;r++) {printf "%s", name[r]; for(i=2;i<=nf;i++) printf "\t%.4f", (sum[i] ? a[r,i]/(sum[i]/(NR-1)) : 0); print ""}}' expression_matrix.tsv

# (2) Find co-occurring mutations (chr:pos present in both VCFs; prints the matching records of file2.vcf)
awk '/^#/ {next} NR==FNR {a[$1":"$2]=1; next} ($1":"$2) in a' file1.vcf file2.vcf

# (3) Report ChIP-seq peaks that lie entirely within a gene extended by 5 kb upstream and downstream (chromosome names must match between the two files)
awk 'BEGIN{OFS="\t"} 
    NR==FNR {if ($3=="gene") {chr=$1; start=$4-5000; end=$5+5000; genes[chr,++count[chr]]=start OFS end}; next} 
    {for(i=1; i<=count[$1]; i++) {split(genes[$1,i], s); if ($2>=s[1] && $3<=s[2]) {print $0; break}}}' genes.gtf peaks.bed
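
Commands (2) and (3) above both rely on the two-file idiom: while the first file is being read, NR (records overall) equals FNR (records in the current file), so that condition can be used to load a lookup table before the second file is scanned. A minimal sketch on made-up VCF-like files follows; the file names a.vcf and b.vcf, the temporary directory, and the variant records are all invented for illustration:

```shell
# Made-up VCF-like inputs; only CHROM and POS matter for this illustration
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\nchr1\t100\t.\tA\tG\nchr1\t250\t.\tC\tT\n' > "$tmpdir/a.vcf"
printf '##fileformat=VCFv4.2\nchr1\t250\t.\tC\tT\nchr2\t100\t.\tG\tA\n' > "$tmpdir/b.vcf"

shared=$(awk '/^#/ { next }                            # skip header lines in both files
              FNR == NR { seen[$1 ":" $2] = 1; next }  # first file: remember chr:pos keys
              ($1 ":" $2) in seen { print $1, $2 }     # second file: keep shared sites
             ' "$tmpdir/a.vcf" "$tmpdir/b.vcf")
echo "$shared"
rm -rf "$tmpdir"
```

Only chr1:250 is reported; chr2:100 is not, even though position 100 also occurs on chr1 in the first file, because the key combines chromosome and position. The parentheses around the concatenated key are needed so that "in" applies to the whole key.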

3. Performance Optimization Techniques

  1. Preprocess Data: Use grep to discard irrelevant lines and sort to order the input before processing with awk
  2. Reduce Pipeline Operations: Combine multiple awk invocations into a single script (rules can follow one another; statements within an action are separated by ;)
  3. Use Bitwise Operations: Quickly check the FLAG (e.g., in a SAM file, and($2, 0x40) tests whether the read is the first in its pair; 0x80 marks the second). POSIX awk has no bitwise operators, so this requires gawk
  4. Avoid the Regex Engine: When processing large files, use index() instead of ~ matching wherever a fixed substring is sufficient
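
Tips 3 and 4 can be illustrated without depending on gawk. The sketch below assumes a plain POSIX awk, where bitwise operators are unavailable, and tests a FLAG bit with integer arithmetic instead; it also uses index() for a fixed-prefix check. The three SAM-like records are made up and truncated to three columns, and the variable name flag_out is arbitrary:

```shell
# Made-up SAM-like records: name, FLAG, reference (other columns omitted for brevity)
flag_out=$(printf 'r1\t99\tchr1\nr1\t147\tchr1\nr2\t4\t*\n' |
  awk -F'\t' '
    # Portable bit test without gawk: int(FLAG / 64) % 2 is 1 when bit 0x40 is set
    int($2 / 64) % 2 == 1 { print $1, "first-in-pair" }
    int($2 / 128) % 2 == 1 { print $1, "second-in-pair" }
    # index() does a plain substring search, so no regex is compiled or run
    index($3, "chr") == 1 { n++ }
    END { print "on chr*:", n }')
echo "$flag_out"
```

FLAG 99 has bit 0x40 set and FLAG 147 has bit 0x80 set, so the two mates are labelled accordingly, and two of the three records have a reference name that starts with "chr".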

4. Bioinformatics Data Practical Case Library

Case 1: Compute Overall GC Content in FASTQ
awk 'BEGIN{total=0; gc=0} 
    NR%4==2 {seq=$1; len=length(seq); total+=len; gc+=gsub(/[GCgc]/, "", seq)} 
    END{if (total) print "GC%:", (gc/total)*100}' input.fastq
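
If a per-read distribution is wanted rather than a single overall figure, the same gsub() counting trick can feed a histogram. The following is a sketch only: the 10% bin width, the choice to fold reads with exactly 100% GC into the top bin, the variable name gc_hist, and the three made-up reads are all assumptions made for illustration:

```shell
# Three made-up reads with 0%, 50% and 100% GC
gc_hist=$(printf '@r1\nATATATAT\n+\nIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n@r3\nGGCC\n+\nIIII\n' |
  awk 'NR % 4 == 2 && length($0) > 0 {
         seq = $0
         gc = gsub(/[GCgc]/, "", seq)           # gsub returns the number of replacements
         bin = int(gc * 10 / length($0)) * 10   # lower edge of the 10% bin
         if (bin == 100) bin = 90               # fold exactly 100% into the top bin
         hist[bin]++
       }
       END { for (b = 0; b < 100; b += 10) if (b in hist) print b "-" (b + 10) "%", hist[b] }')
echo "$gc_hist"
```

Each output line gives a GC range and the number of reads that fall into it; empty bins are skipped.
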
Case 2: Filter Low-Quality Reads (keep reads in which more than 90% of bases have Phred+33 quality ≥20)
awk 'BEGIN{qual_offset=33; for(i=33;i<127;i++) ord[sprintf("%c",i)]=i} 
    NR%4==1 {head=$0} NR%4==2 {seq=$0} NR%4==3 {plus=$0} 
    NR%4==0 {poor=0; len=length($0); for(i=1;i<=len;i++) if (ord[substr($0,i,1)]-qual_offset <20) poor++; if (len && poor/len <0.1) print head ORS seq ORS plus ORS $0}' input.fastq
Case 3: Extract Homozygous Genotypes from VCF (GT field of the first sample is 1/1 or 0/0, unphased or phased; 0/0 is homozygous reference)
awk '/^#/ {print; next} 
    {gt=""; split($9, fmt, ":"); split($10, val, ":"); for (i in fmt) if (fmt[i]=="GT") gt=val[i]} 
    gt=="1/1" || gt=="0/0" || gt=="1|1" || gt=="0|0"' variants.vcf

5. Commonly Used awk Built-in Functions Quick Reference

Function | Purpose | Bioinformatics Application Example
gsub(r,s,t) | Globally replace regex r with s in t | Correct chromosome naming (gsub(/^chrM$/,"chrMT",$1))
split(s,a,fs) | Split string into array | Parse INFO field (split($8,info,";"))
substr(s,p,n) | Extract substring | Extract first 10 bp of read sequence (substr($10,1,10))
system(cmd) | Execute shell command | Compress a finished output file (system("gzip output.txt"))
sprintf(fmt,expr) | Format output | Retain decimal places (sprintf("%.2f", $6))

Efficiency Summary: For NGS data processing (e.g., FASTQ files of hundreds of GB), prefer bioawk (an extended version of awk that parses FASTA/FASTQ records natively) or combine awk with parallel to run several awk processes concurrently on chunks of the input.

Tip: Treat the commands above as starting points; check them against the actual structure of your data and adapt them to your server configuration before running them on real files.
