Overview of the Linux grep Command for Bioinformatics

Today, I will share an overview of the Linux grep command tailored for bioinformatics scenarios, covering core principles, high-frequency usage, and examples to quickly filter text data such as genes and transcriptomes.

1. Core Principles of grep

grep = Pattern Matching Engine

Line-by-line Scanning: Matches file content based on regular expressions (RegEx)
Pattern Types:

Basic Regular Expressions (BRE): Supported by default, requires escaping certain symbols
Extended Regular Expressions (ERE): enabled with the -E option; supports + ? {} | without escaping
Perl Compatible Regular Expressions (PCRE): enabled with the -P option (requires GNU grep built with PCRE support)
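
The three flavors differ mainly in which metacharacters need a backslash. Here is a minimal sketch, assuming GNU grep and a hypothetical four-line list of gene symbols held in a shell variable:

```shell
# Hypothetical sample data: one identifier per line
sample=$(printf '%s\n' 'BRCA1' 'BRCA2' 'TP53' 'TP5')

# BRE (default): interval braces must be escaped
bre=$(printf '%s\n' "$sample" | grep '^BRCA[0-9]\{1\}$')

# ERE (-E): grouping, alternation and intervals work unescaped
ere=$(printf '%s\n' "$sample" | grep -E '^(BRCA[12]|TP53)$')

# PCRE (-P, GNU grep only): adds shorthand classes such as \d
pcre=$(printf '%s\n' "$sample" | grep -P '^TP\d{2}$')

printf 'BRE:\n%s\nERE:\n%s\nPCRE:\n%s\n' "$bre" "$ere" "$pcre"
```

The PCRE pattern requires exactly two digits, so the truncated entry TP5 is not reported.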

Controlling Output:

-v Inverted match (print only the lines that do not match)
-o Only output matching parts
-A n/-B n Output n lines after/before matching lines
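
As a quick illustration of these output options, the sketch below runs each one on a hypothetical two-record FASTA stored in a variable (the record names are made up):

```shell
# Hypothetical two-record FASTA
fasta=$(printf '%s\n' '>seq1 geneA' 'ACGTACGT' '>seq2 geneB' 'GGGCCC')

# -v: drop header lines, keeping only sequence lines
seqs=$(printf '%s\n' "$fasta" | grep -v '^>')

# -o: print only the matched text (here, ">" plus the ID up to the first space)
ids=$(printf '%s\n' "$fasta" | grep -o '^>[^ ]*')

# -A 1: print each matching line plus the one line that follows it
rec=$(printf '%s\n' "$fasta" | grep -A 1 'geneB')

printf '%s\n---\n%s\n---\n%s\n' "$seqs" "$ids" "$rec"
```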

Performance Keys:

-F Fixed string matching (disables regex, faster)
LC_ALL=C Set locale to speed up (effective for ASCII data)
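
Besides speed, -F also changes what matches, because regex metacharacters become literal. The sketch below uses three hypothetical read IDs to show the difference; LC_ALL=C is set only for the single grep invocation:

```shell
# Hypothetical read IDs; only the first one is the ID we actually want
ids=$(printf '%s\n' 'SRR1.1' 'SRR1x1' 'SRR1.10')

# As a regex, "." matches any character and the match may be a substring
regex_hits=$(printf '%s\n' "$ids" | grep -c 'SRR1.1')

# -F makes the dot literal; -x additionally requires the whole line to match
fixed_hits=$(printf '%s\n' "$ids" | LC_ALL=C grep -c -F -x 'SRR1.1')

echo "regex=$regex_hits fixed=$fixed_hits"
```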

2. High-Frequency Commands

1. Basic: Quickly Locate Key Data

# (1) Extract header lines from FASTA files (starting with >)
grep '^>' sequences.fasta

# (2) Count reads containing at least one Phred-0 base in FASTQ files (every 4th line is the quality line; '!' encodes Q0 in Phred+33)
awk 'NR%4 == 0' reads.fastq | grep -c '[!]'

# (3) Keep header lines plus variants whose FILTER (7th column) is exactly PASS
grep -P '^#|^(?:[^\t]*\t){6}PASS(?:\t|$)' variants.vcf  # A bare 'PASS' would also match other columns such as INFO

# (4) Find reads mapped to chr1 in SAM files (3rd column)
grep -w 'chr1' alignments.sam | awk '$3 == "chr1"'
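
A substring match on a column value can also hit other columns. The sketch below contrasts a loose match with a column-anchored one on a hypothetical three-line VCF; the `tab` variable is just a portable way to put a literal tab into an ERE pattern:

```shell
# Hypothetical VCF: the second variant fails FILTER but mentions PASS in INFO
tab=$(printf '\t')
vcf=$(printf '##fileformat=VCFv4.2\nchr1\t100\t.\tA\tT\t50\tPASS\tDP=10\nchr1\t200\t.\tG\tC\t10\tLowQual\tNOTE=PASS_nearby\n')

# Loose: any line containing PASS, so the LowQual record is kept too
loose=$(printf '%s\n' "$vcf" | grep -c -E '^#|PASS')

# Anchored: skip six tab-delimited fields, then require the 7th to be exactly PASS
strict=$(printf '%s\n' "$vcf" | grep -c -E "^#|^([^${tab}]*${tab}){6}PASS(${tab}|\$)")

echo "loose=$loose strict=$strict"
```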

2. Advanced: Complex Patterns and Piping Operations

# (1) Filter reads in FASTQ with GC content ≥60% (requires multi-line processing)
grep -E -B1 -A2 '^[GCgcNn]+$' input.fastq | grep -v '^--$'  # Rough screen only: keeps reads made up entirely of G/C/N
# More precise version (with awk), computing the GC fraction of each read:
awk 'NR%4==1 {h=$0} NR%4==2 {s=$0; gc=gsub(/[GCgc]/,"",s); keep=(length($0)>0 && gc/length($0)>=0.6); if (keep) {print h; print}} NR%4==3 || NR%4==0 {if (keep) print}' input.fastq

# (2) Extract mRNA_ID of gene exons (type=exon) from GFF files
grep -P '\texon\t' annotation.gff | grep -oP 'Parent=\K[^;]+'

# (3) Merge non-header lines from multiple VCF files (cross-file deduplication; records sharing CHROM and POS collapse to one)
grep -hv '^#' *.vcf | sort -u -k1,1 -k2,2n

# (4) Check if BED file is ordered (chromosome order, then start position)
grep -v '^#' peaks.bed | awk '{if (NR > 1 && ($1 < prev_chr || ($1 == prev_chr && $2+0 < prev_start))) {print "Unsorted at line " NR; exit 1} prev_chr=$1 ""; prev_start=$2+0}'
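
The sort check can be wrapped in a small function that reports its result through the exit status. This is a sketch under stated assumptions: the function name `check_bed_sorted` is hypothetical, chromosomes are compared as plain strings (so chr10 sorts before chr2), and the two BED files are toy data written to a temporary directory.

```shell
# check_bed_sorted FILE: exit 0 if sorted by chromosome then start, 1 otherwise
check_bed_sorted() {
    grep -v '^#' "$1" | awk '
        NR > 1 && ($1 < prev_chr || ($1 == prev_chr && $2 + 0 < prev_start)) {
            print "Unsorted at data line " NR; bad = 1; exit
        }
        { prev_chr = $1 ""; prev_start = $2 + 0 }   # "" forces string comparison
        END { exit bad + 0 }'
}

tmp=$(mktemp -d)
printf '#chrom\tstart\tend\nchr1\t10\t20\nchr1\t15\t30\nchr2\t5\t9\n' > "$tmp/sorted.bed"
printf 'chr1\t50\t60\nchr1\t10\t20\n' > "$tmp/unsorted.bed"

check_bed_sorted "$tmp/sorted.bed";   sorted_rc=$?
check_bed_sorted "$tmp/unsorted.bed"; unsorted_rc=$?
echo "sorted.bed -> $sorted_rc, unsorted.bed -> $unsorted_rc"
```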

3. Advanced: Precise Matching with Regular Expressions

# (1) Match SAM FLAG value for paired and unmapped (0x1+0x4=5); FLAG is the 2nd column
grep -P '^[^\t]+\t5\t' alignments.sam
# Or test the bits directly (requires gawk for and(); also catches FLAGs with additional bits set):
awk '!/^@/ && and($2,5)==5' alignments.sam
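
Where gawk is not available, the same bit test can be written with plain integer arithmetic. The sketch below is one possible way to do it; the function name `sam_with_bits` and the toy SAM records are hypothetical.

```shell
# sam_with_bits MASK: read SAM on stdin, print alignments whose FLAG has every bit of MASK set
sam_with_bits() {
    awk -F'\t' -v mask="$1" '
        /^@/ { next }                               # skip header lines
        {
            flag = $2 + 0; m = mask + 0; ok = 1
            while (m > 0) {                         # walk both numbers bit by bit
                if (m % 2 == 1 && flag % 2 == 0) { ok = 0; break }
                m = int(m / 2); flag = int(flag / 2)
            }
            if (ok) print
        }'
}

# Toy SAM with FLAG values 5, 69, 4 and 99
sam=$(printf '@HD\tVN:1.6\nr1\t5\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr2\t69\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\nr4\t99\tchr1\t10\t60\t4M\t=\t50\t44\tACGT\tIIII\n')

hits=$(printf '%s\n' "$sam" | sam_with_bits 5 | cut -f1)
echo "$hits"
```

FLAG 69 (64+4+1) is reported alongside FLAG 5, which an exact regex on the value 5 would miss.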

# (2) Filter variants according to HGVS naming rules (e.g., c.123A>T)
grep -P 'c\.\d+[ACGTU]>[ACGTU]' clinical_variants.txt

# (3) Search for primer sequence matches (allowing degenerate bases, e.g., R=A/G)
grep -iE 'AG[CGT]TA[AG]A[AT]G' primer_sequences.fasta

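# (4) Count reads in BAM files with mapping quality ≥30 (requires samtools; MAPQ is the 5th column)
samtools view input.bam | grep -cP '^(?:[^\t]*\t){4}(?:[3-9][0-9]|[1-9][0-9]{2})\t'
# Equivalent without a regex:
samtools view -c -q 30 input.bam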

3. Performance Optimization Tips

Disable Regex: Use grep -F for fixed strings to speed up matching
Parallel Processing: Combine with GNU parallel to split large files (e.g., large FASTQ) and search the chunks concurrently
```
parallel --pipepart -a bigfile.fastq --block 1G grep 'pattern' > matches.txt
```
Preprocessing Filters: Prefer LC_ALL=C grep for pure ASCII data
Combine Conditions: Avoid chaining multiple grep calls for OR conditions; use | alternation with -E instead
```
grep -E 'condition1|condition2' file
```
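
A small sketch of the options for combining conditions, using hypothetical variant annotations. OR conditions fit into a single grep call, while AND conditions still need chaining (or one ordered regex):

```shell
# Hypothetical annotation lines
annot=$(printf '%s\n' 'var1 missense' 'var2 synonymous' 'var3 nonsense' 'var4 intron')

# OR in one pass: ERE alternation
either=$(printf '%s\n' "$annot" | grep -E 'missense|nonsense')

# The same OR with repeated -e patterns (no -E needed)
either_e=$(printf '%s\n' "$annot" | grep -e 'missense' -e 'nonsense')

# AND: chain two greps
both=$(printf '%s\n' "$annot" | grep 'var3' | grep 'nonsense')

printf '%s\n---\n%s\n' "$either" "$both"
```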

4. Practical Data Examples

Case 1: Extract Sequences with Specified ID from FASTQ

# (1) Single ID match (escape the dot and anchor the ID so that SRR1234567.10 is not matched too)
grep -A3 -E '^@SRR1234567\.1( |$)' reads.fastq

# (2) Multiple ID list (ids.txt with one ID per line, matched literally as whole words)
grep -A3 --no-group-separator -F -w -f ids.txt reads.fastq > target.fastq
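
ID lists are a common source of over-matching, since an ID such as r1 is a prefix of r10. The sketch below shows the effect on a toy FASTQ, assuming GNU grep (for --no-group-separator) and an ID file whose entries include the leading @; all file contents are made up.

```shell
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r10\nGGGG\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > "$tmp/reads.fastq"
printf '@r1\n' > "$tmp/ids.txt"

# Plain pattern matching: "@r1" also hits "@r10", so two records come back
loose=$(grep -A3 -f "$tmp/ids.txt" "$tmp/reads.fastq" | grep -c '^@r')

# -F (literal) plus -w (whole word) keeps only the exact ID
exact=$(grep -A3 --no-group-separator -F -w -f "$tmp/ids.txt" "$tmp/reads.fastq")

echo "records matched loosely: $loose"
printf '%s\n' "$exact"
```

A quality line that happens to contain an ID string would still match, so for production use a record-aware tool is the safer choice.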

Case 2: Count Sequencing Data Contamination (match vector sequences)

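# First, build a list of vector sequence IDs from vector.fa (header text up to the first space)
grep '^>' vector.fa | cut -c2- | cut -d' ' -f1 > vector_ids.txt
# Then count vector hits per query in the BLAST tabular output
grep -F -w -f vector_ids.txt blast_results.tsv | cut -f1 | sort | uniq -c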
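
To make the data flow concrete, here is a sketch on toy inputs. The vector names, the three-column BLAST-style table (query, subject, percent identity), and the choice to tally hits per vector (column 2) rather than per query are all assumptions for illustration.

```shell
tmp=$(mktemp -d)
printf '>pUC19 cloning vector\nACGT\n>pBR322 plasmid\nGGCC\n' > "$tmp/vector.fa"
printf 'read1\tpUC19\t99.0\nread2\tchr1\t98.0\nread3\tpUC19\t97.5\nread4\tpBR322\t99.9\n' > "$tmp/blast_results.tsv"

# Keep only the ID part of each header (text between ">" and the first space)
grep '^>' "$tmp/vector.fa" | cut -c2- | cut -d' ' -f1 > "$tmp/vector_ids.txt"

# Tally hits per vector, printed as name:count
hits=$(grep -F -w -f "$tmp/vector_ids.txt" "$tmp/blast_results.tsv" \
    | cut -f2 | sort | uniq -c | awk '{print $2 ":" $1}')

printf '%s\n' "$hits"
```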

Case 3: Filter Exon Splicing Sites (GT/AG Rule)

# Find GT...AG candidates with at least 50 bases in between (matches only within a single line, so unwrap multi-line FASTA first)
grep -oP 'GT[ACGT]{50,}AG' genome.fa
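
Because grep works line by line, a GT...AG span that crosses a FASTA line wrap is invisible to it. The sketch below demonstrates this on a hypothetical single-record FASTA wrapped at 6 columns, with the minimum spacing reduced from 50 to 5 to keep the toy data small. Joining lines this way would also merge neighbouring records, so it assumes one sequence per file.

```shell
# Toy FASTA: the candidate intron spans the first two sequence lines
fasta=$(printf '>toy\nAAGTCC\nCCCCAG\nTT\n')

# Line-by-line search: no single line contains GT...AG
per_line=$(printf '%s\n' "$fasta" | grep -c -E 'GT[ACGT]{5,}AG')

# Drop the header, join the sequence lines, then search the unwrapped sequence
unwrapped=$(printf '%s\n' "$fasta" | grep -v '^>' | tr -d '\n' | grep -o -E 'GT[ACGT]{5,}AG')

echo "per-line matches: $per_line, unwrapped match: $unwrapped"
```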

Case 4: Separate Paired-End Sequencing Data (SAM/BAM FLAG Check)

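# First-end read (0x40=64): FLAG & 64 != 0. A regex would have to list every qualifying FLAG value, so test the bit arithmetically (samtools view -f 64 does the same)
samtools view input.bam | awk -F'\t' 'int($2/64)%2 == 1' > read1.sam
# Second-end read (0x80=128): FLAG & 128 != 0 (samtools view -f 128 does the same)
samtools view input.bam | awk -F'\t' 'int($2/128)%2 == 1' > read2.sam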

5. grep Family Tool Extensions

Tool	Usage	Bioinformatics Example
egrep	= grep -E, extended regex (deprecated alias in recent GNU grep)	Match complex variant types
fgrep	= grep -F, fast fixed-string matching (deprecated alias in recent GNU grep)	Quickly screen known mutation IDs
agrep	Fuzzy matching (requires separate installation)	Error-tolerant search for similar primers
rgrep	Recursive directory search (= grep -r)	Batch search for mutations across samples
bio-grep	Supports FASTA/Q syntax	Directly filter by sequence length/quality (requires installation)

Summary Recommendations: For extremely large files (e.g., whole-genome SAM/BAM), prefer samtools view with the -L/-M options to filter by region, which is more efficient than pure grep!

Tips: The above content and commands are for reference only; for practical operations, please refer to the actual data structure, code, and server configuration used.