Overview of the Linux grep Command for Bioinformatics

This post gives an overview of the Linux grep command for bioinformatics work. It covers the core principles, frequently used commands, and worked examples for quickly filtering text data such as gene and transcriptome files.

1. Core Principles of grep

grep = Pattern Matching Engine

  • Line-by-line Scanning: Reads input line by line and prints the lines that match a regular expression (regex)
  • Pattern Types:
    • Basic Regular Expressions (BRE): The default; requires escaping certain symbols (e.g., \+, \?, \|, \{\} in GNU grep)
    • Extended Regular Expressions (ERE): Enabled with -E; supports +, ?, {}, and | without escaping
    • Perl Compatible Regular Expressions (PCRE): Enabled with -P (requires GNU grep built with PCRE support)
  • Controlling Output:
    • -v Invert the match (print only non-matching lines)
    • -o Print only the matching part of each line
    • -A n / -B n Print n lines after/before each matching line
  • Performance Keys:
    • -F Fixed-string matching (disables regex, faster)
    • LC_ALL=C Use the C locale to speed up matching (effective for ASCII data)
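
To make the pattern types and output options above concrete, here is a minimal sketch that runs them on a toy FASTA file. The file contents and the names geneA, geneB, and geneC are invented for illustration, and GNU grep is assumed.

```shell
# Hypothetical toy FASTA; names and sequences are invented for this demo.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>geneA kinase
ATGGCCGCTTAA
>geneB transporter
ATGAAATTTTAG
>geneC kinase-like
ATGGGGCCCTGA
EOF

# ERE (-E): '+' works as an operator without a backslash.
headers_kinase=$(grep -E '^>gene[A-Z]+ kinase' "$demo_fa")

# -v inverts the match, -c counts lines: number of sequence lines.
n_seq=$(grep -vc '^>' "$demo_fa")

# -o prints only the matched part: the stop codon at the end of each sequence.
stops=$(grep -v '^>' "$demo_fa" | grep -oE '(TAA|TAG|TGA)$')

# -F treats the pattern as a literal string (no regex interpretation).
literal_hits=$(grep -cF 'kinase-like' "$demo_fa")

# -c on the header pattern gives the number of records.
n_records=$(grep -c '^>' "$demo_fa")

printf '%s\n' "$headers_kinase" "$stops" "records=$n_records seqs=$n_seq literal=$literal_hits"
rm -f "$demo_fa"
```

Note that the header pattern also matches "kinase-like", because grep reports any line that contains a match unless the pattern is anchored at both ends.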

2. High-Frequency Commands

1. Basic: Quickly Locate Key Data
# (1) Extract header lines from FASTA files (starting with >)
grep '^>' sequences.fasta

# (2) Count low-quality reads in FASTQ files (every 4th line is the quality line; '!' is Phred quality 0 in Phred+33 encoding)
awk 'NR%4 == 0' reads.fastq | grep -c '[!]'

# (3) Filter PASS variants in VCF files (FILTER is the 7th column; header lines are retained)
grep -P '^#|^(?:[^\t]*\t){6}PASS(?:\t|$)' variants.vcf

# (4) Find reads mapped to chr1 in SAM files (3rd column)
grep -w 'chr1' alignments.sam | awk '$3 == "chr1"'
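
grep matches anywhere on a line, so a pattern that is not tied to a column can hit the wrong field. The sketch below compares an unanchored match with a column-aware test on a toy VCF. The records, including the NOTE tag in the INFO column, are invented for illustration.

```shell
# Hypothetical toy VCF (tab-separated); the records are invented for this demo.
demo_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tT\t50\tPASS\tDP=30\n'
  printf 'chr1\t200\t.\tG\tC\t10\tLowQual\tNOTE=PASS_in_other_caller\n'
  printf 'chr2\t300\t.\tC\tG\t60\tPASS\tDP=25\n'
} > "$demo_vcf"

# Unanchored: counts every data line that contains "PASS" anywhere.
loose=$(grep -v '^#' "$demo_vcf" | grep -c 'PASS')

# Column-aware: FILTER (column 7) must be exactly PASS.
strict=$(awk -F'\t' '!/^#/ && $7 == "PASS"' "$demo_vcf" | wc -l | tr -d ' ')

echo "loose=$loose strict=$strict"   # prints: loose=3 strict=2
rm -f "$demo_vcf"
```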

2. Advanced: Complex Patterns and Piping Operations
# (1) Filter reads in FASTQ with GC content ≥60% (requires multi-line processing)
# grep cannot compute a percentage; this pattern only finds reads made up entirely of G/C/N:
grep -B1 -A2 '^[GCgcNn]\+$' input.fastq | grep -v '^--$'
# Precise version (awk counts G/C per read and prints the whole 4-line record):
awk 'NR%4==1 {h=$0} NR%4==2 {s=$0; gc=gsub(/[GCgc]/,"",s); keep=(length($0)>0 && gc/length($0)>=0.6); if (keep) {print h; print $0}} NR%4==3 || NR%4==0 {if (keep) print}' input.fastq

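# (2) Extract the parent transcript (mRNA) IDs of exon features (column 3 = exon) from GFF files
# \t only means "tab" with -P; the Parent value runs up to the next semicolon
grep -P '\texon\t' annotation.gff | grep -oP 'Parent=\K[^;]+'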

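# (3) Merge data (non-header) lines from multiple VCF files and drop exact duplicate records
# (sort -u -k1,2 would keep only one record per CHROM/POS and silently drop other alleles)
grep -hv '^#' *.vcf | sort -k1,1 -k2,2n | uniq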

# (4) Check whether a BED file is sorted (by chromosome name, then start position)
grep -v '^#' peaks.bed | awk '{c = $1 ""} NR > 1 && (c < p || (c == p && $2+0 < s)) {print "Unsorted at record " NR; exit} {p = c; s = $2+0}'
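
grep can test which characters a line contains, but it cannot compute a fraction such as GC content. The following sketch shows one way to do the per-read calculation with awk on a toy FASTQ. The read names and sequences are invented, the helper name gc_filter is hypothetical, and the 0.6 threshold follows the example above.

```shell
# Hypothetical toy FASTQ with two reads; IDs and sequences are invented.
demo_fq=$(mktemp)
cat > "$demo_fq" <<'EOF'
@read1
GCGCGCGTAT
+
IIIIIIIIII
@read2
ATATATATGC
+
IIIIIIIIII
EOF

# Keep whole 4-line records whose sequence line is at least 60% G/C.
gc_filter() {
  awk '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 {
      s = $0
      gc = gsub(/[GCgc]/, "", s)     # gsub returns the number of G/C bases removed from the copy
      keep = (length($0) > 0 && gc / length($0) >= 0.6)
      if (keep) { print header; print $0 }
    }
    NR % 4 == 3 || NR % 4 == 0 { if (keep) print }
  ' "$1"
}

kept=$(gc_filter "$demo_fq")
printf '%s\n' "$kept"
rm -f "$demo_fq"
```

Only read1 (7 of 10 bases are G or C) is printed; read2 (2 of 10) is dropped together with its quality line.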

3. Advanced: Precise Matching with Regular Expressions
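# (1) Match SAM FLAG 5 exactly (paired 0x1 + unmapped 0x4 = 5); FLAG is column 2, not the end of the line
grep -P '^[^\t]+\t5\t' alignments.sam
# Or test the bits directly (safer: also catches FLAGs with additional bits set; and() requires gawk):
awk '!/^@/ && and($2,5)==5' alignments.sam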

# (2) Filter substitution variants written in HGVS coding-DNA notation (e.g., c.123A>T)
grep -P 'c\.\d+[ACGT]>[ACGT]' clinical_variants.txt

# (3) Search for primer sequence matches (allowing degenerate bases, e.g., R=A/G)
grep -iE 'AG[CGT]TA[AG]A[AT]G' primer_sequences.fasta

# (4) Count reads in BAM files with mapping quality ≥30 (requires samtools); MAPQ is column 5
samtools view input.bam | grep -cP '^(?:[^\t]*\t){4}(?:[3-9][0-9]|[1-9][0-9]{2})\t'
# Equivalent without grep: samtools view -c -q 30 input.bam
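
Writing bracket expressions for degenerate primers by hand is error-prone. As a minimal sketch, the helper below expands IUPAC codes into a pattern for grep -E. The function name iupac_to_regex is hypothetical, and uppercase input is assumed.

```shell
# Translate IUPAC degenerate codes into bracket expressions for grep -E.
# Plain A/C/G/T pass through unchanged; input is assumed to be uppercase.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
    -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
    -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

pattern=$(iupac_to_regex 'AGBTARAWG')
echo "$pattern"   # prints: AG[CGT]TA[AG]A[AT]G

# Count matching lines in two invented sequences (case-insensitive).
hits=$(printf 'ccAGCTAGATGcc\nAGATAGATG\n' | grep -ciE "$pattern")
echo "hits=$hits"
```

The substitutions are safe to chain because every replacement contains only A, C, G, and T, which none of the later rules rewrite.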

3. Performance Optimization Tips

  1. Disable Regex: Use grep -F for fixed strings; skipping the regex engine is faster
  2. Parallel Processing: Use GNU parallel to split large files (e.g., large FASTQ) into chunks that are searched concurrently
    parallel --pipepart -a bigfile.fastq --block 1G grep 'pattern' > matches.txt
    
  3. Locale: Prefer LC_ALL=C grep for pure ASCII data
  4. Combine Conditions: Avoid multiple grep passes; with -E, separate alternative patterns with | (OR semantics)
    grep -E 'condition1|condition2' file
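
The alternation in tip 4 expresses OR, not AND. The sketch below, on three invented lines, shows that one pass with -E gives the same result as two merged passes, and that an AND condition needs a different construct.

```shell
# Hypothetical toy annotation lines, invented for this demo.
demo_txt=$(mktemp)
printf 'chr1 exon gene1\nchr1 intron gene1\nchr2 exon gene2\n' > "$demo_txt"

# One pass with alternation (OR): lines mentioning chr2 or intron.
one_pass=$(LC_ALL=C grep -cE 'chr2|intron' "$demo_txt")

# The same result with two separate passes, merged and deduplicated.
two_pass=$( { grep 'chr2' "$demo_txt"; grep 'intron' "$demo_txt"; } | sort -u | wc -l | tr -d ' ')

# AND is different: both patterns must hit the same line.
both=$(awk '/chr1/ && /exon/' "$demo_txt" | wc -l | tr -d ' ')

echo "one_pass=$one_pass two_pass=$two_pass both=$both"   # prints: one_pass=2 two_pass=2 both=1
rm -f "$demo_txt"
```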
    

4. Practical Data Examples

Case 1: Extract Sequences with Specified ID from FASTQ
# (1) Single ID match (-F treats the dot literally, -w stops SRR1234567.10 from matching)
grep -A3 -Fw '@SRR1234567.1' reads.fastq

# (2) Multiple ID list (ids.txt with one ID per line)
grep -A3 -Fw -f ids.txt reads.fastq | grep -v '^--$' > target.fastq
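
Read IDs often share prefixes, and the dot in an ID is a regex wildcard. As a small illustration, the sketch below compares a plain pattern with a literal, whole-word match on a toy FASTQ. The two records are invented and GNU grep is assumed.

```shell
# Hypothetical toy FASTQ; the second ID has the first one as a prefix.
demo_fq=$(mktemp)
cat > "$demo_fq" <<'EOF'
@SRR1234567.1 length=4
ACGT
+
IIII
@SRR1234567.10 length=4
TTTT
+
IIII
EOF

# Plain regex: matches both headers, because the ID is also a prefix of the second one.
loose=$(grep -c '@SRR1234567.1' "$demo_fq")

# -F makes the ID literal, -w requires it to end at a word boundary.
strict=$(grep -A3 -Fw '@SRR1234567.1' "$demo_fq")

echo "loose header hits: $loose"
printf '%s\n' "$strict"
rm -f "$demo_fq"
```

With the strict form only the first record (four lines) is returned.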

Case 2: Count Sequencing Data Contamination (match vector sequences)
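# First, extract the vector sequence IDs from vector.fa (text after '>' up to the first space)
grep '^>' vector.fa | cut -c2- | cut -d' ' -f1 > vector_ids.txt
# Then count, per value in column 1, the BLAST result lines that mention a vector ID
grep -F -f vector_ids.txt blast_results.tsv | cut -f1 | sort | uniq -c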

Case 3: Filter Exon Splice Sites (GT/AG Rule)
# Find GT...AG patterns in genomic sequence (at least 50 bases between GT and AG)
# grep works per line, so matches that span wrapped FASTA lines are missed unless the sequence is unwrapped first
grep -oP 'GT[ACGT]{50,}?AG' genome.fa
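
FASTA sequences are usually wrapped at a fixed width, and grep never matches across line breaks. The sketch below joins each record into one line before searching. The helper name unwrap_fasta and the toy record are hypothetical, and a 10-base gap replaces the 50-base gap so the toy data stays short.

```shell
# Hypothetical toy FASTA, wrapped at 10 bases per line.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>chrDemo
AAGTACGTAC
GTACGTAGCC
EOF

# Join the wrapped sequence lines of each record into a single line.
unwrap_fasta() {
  awk '/^>/ { if (s != "") print s; print; s = ""; next }
       { s = s $0 }
       END { if (s != "") print s }' "$1"
}

# grep -c exits non-zero when nothing matches, hence the "|| true".
wrapped_hits=$(grep -cE 'GT[ACGT]{10,}AG' "$demo_fa" || true)
unwrapped_hits=$(unwrap_fasta "$demo_fa" | grep -cE 'GT[ACGT]{10,}AG' || true)

echo "wrapped=$wrapped_hits unwrapped=$unwrapped_hits"   # prints: wrapped=0 unwrapped=1
rm -f "$demo_fa"
```

For whole chromosomes the joined lines become very long, so a dedicated sequence tool may be a better fit than grep at that scale.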

Case 4: Separate Paired-End Sequencing Data (SAM/BAM FLAG Check)
# A regex would have to enumerate every FLAG value with the bit set, so test the bit with samtools view -f instead
# First-end read (0x40=64): FLAG & 64 != 0
samtools view -f 64 input.bam > read1.sam
# Second-end read (0x80=128): FLAG & 128 != 0
samtools view -f 128 input.bam > read2.sam
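
FLAG is a bit field, so the check "FLAG & 64 != 0" is arithmetic rather than a text pattern. The sketch below does the bit test in portable awk on three invented SAM-like records. The helper name flag_has_bit is hypothetical.

```shell
# Toy SAM-like records (tab-separated; only the first columns matter here).
demo_sam=$(mktemp)
{
  printf 'r1\t99\tchr1\t100\t60\n'    # 99  = 1+2+32+64  -> first in pair
  printf 'r1\t147\tchr1\t300\t60\n'   # 147 = 1+2+16+128 -> second in pair
  printf 'r2\t77\t*\t0\t0\n'          # 77  = 1+4+8+64   -> first in pair, unmapped
} > "$demo_sam"

# Portable bit test: int(FLAG / bit) % 2 is 1 exactly when that bit is set.
# Usage: flag_has_bit FILE BIT   (BIT is a power of two, e.g. 64 or 128)
flag_has_bit() {
  awk -F'\t' -v bit="$2" '!/^@/ && int($2 / bit) % 2 == 1' "$1"
}

first=$(flag_has_bit "$demo_sam" 64 | cut -f1,2)
second=$(flag_has_bit "$demo_sam" 128 | cut -f1,2)

printf 'first-end:\n%s\nsecond-end:\n%s\n' "$first" "$second"
rm -f "$demo_sam"
```

This selects FLAG 99 and 77 as first-end reads and FLAG 147 as the second-end read, which a pattern over the decimal digits could not express compactly.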

5. grep Family Tool Extensions

Tool               | Usage                                                  | Bioinformatics Example
egrep (= grep -E)  | Extended regex                                         | Match complex variant types
fgrep (= grep -F)  | Fast string matching                                   | Quickly screen known mutation IDs
agrep              | Fuzzy matching (requires compilation for installation) | Tolerant search for similar primers
rgrep              | Recursive directory search                             | Batch search for mutations across samples
bio-grep           | Supports FASTA/Q syntax                                | Directly filter by sequence length/quality (requires installation)

Summary Recommendations: For extremely large files (e.g., whole genome SAM), prefer using samtools view with the -L/-M options to filter regions, which is more efficient than pure grep.

Tips: The above content and commands are for reference only; for practical operations, please refer to the actual data structure, code, and server configuration used.
