Today, I will share an overview of the Linux grep command tailored for bioinformatics scenarios, covering core principles, high-frequency usage, and examples for quickly filtering text data such as sequence, annotation, variant, and alignment files.
1. Core Principles of grep
grep = Pattern Matching Engine
- Line-by-line Scanning: Matches file content based on regular expressions (RegEx)
- Pattern Types:
  - Basic Regular Expressions (BRE): Supported by default, requires escaping certain symbols
  - Extended Regular Expressions (ERE): `-E` option activates, supports `+ ? {} |` without escaping
  - Perl Compatible Regular Expressions (PCRE): `-P` option (requires GNU grep support)
- Controlling Output:
  - `-v` Reverse filtering (exclude matching lines)
  - `-o` Only output matching parts
  - `-A n` / `-B n` Output n lines after/before matching lines
- Performance Keys:
  - `-F` Fixed string matching (disables regex, faster)
  - `LC_ALL=C` Set locale to speed up (effective for ASCII data)
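The options listed above can be tried side by side on a toy file. The following is a minimal sketch, assuming GNU grep (needed for `-P`); the file `demo.fasta` and its two records are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy FASTA in a temporary directory (hypothetical file name and records).
tmp=$(mktemp -d)
printf '%s\n' '>gene1 kinase' 'ATGGCCTTTAAA' '>gene2 transporter' 'ATGCCCGGGTAA' > "$tmp/demo.fasta"

# BRE (default): "+" must be escaped to act as a quantifier (GNU extension).
bre=$(grep -c 'C\+G' "$tmp/demo.fasta")
# ERE (-E): "+" works unescaped.
ere=$(grep -cE 'C+G' "$tmp/demo.fasta")
# PCRE (-P): \K drops the ">" from the reported match; -o prints only the match.
ids=$(grep -oP '^>\K\S+' "$tmp/demo.fasta")
# -v inverts the match: keep sequence lines only.
seqs=$(grep -v '^>' "$tmp/demo.fasta")
# -A 1 prints the matching header plus one line after it.
rec=$(grep -A 1 '^>gene2' "$tmp/demo.fasta")
# -F treats the pattern as a literal string; LC_ALL=C forces byte-wise matching.
lit=$(LC_ALL=C grep -cF 'ATG' "$tmp/demo.fasta")

printf 'BRE=%s ERE=%s literal=%s\n' "$bre" "$ere" "$lit"
printf '%s\n' "$ids" "$seqs" "$rec"
rm -rf "$tmp"
```

On real data the same options apply unchanged; only the patterns and file names differ.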
2. High-Frequency Commands
1. Basic: Quickly Locate Key Data
# (1) Extract header lines from FASTA files (starting with >)
grep '^>' sequences.fasta
# (2) Count reads in FASTQ files that contain at least one base of quality 0 (every 4th line is the quality line; '!' encodes Phred score 0 in Phred+33)
awk 'NR%4 == 0' reads.fastq | grep -c '[!]'
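# (3) Filter PASS variants in VCF files (FILTER is the 7th column); header lines are retained
# Note: a bare 'PASS' pattern also matches the word in other columns such as INFO, so anchor it to column 7
grep -P '^#|^(?:[^\t]*\t){6}PASS(?:\t|$)' variants.vcf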
# (4) Find reads mapped to chr1 in SAM files (3rd column)
grep -w 'chr1' alignments.sam | awk '$3 == "chr1"'
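A pattern without a column anchor matches anywhere on the line, which matters for column-oriented formats such as VCF. The sketch below compares a loose `PASS` match with a column-aware filter on a made-up, tab-separated toy VCF (the records and the `NOTE` INFO key are hypothetical).

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# The second record fails its filter but mentions PASS inside the INFO column.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=30\n'
  printf 'chr1\t200\t.\tC\tT\t10\tLowQual\tNOTE=PASS_in_other_caller\n'
  printf 'chr2\t300\t.\tG\tA\t60\tPASS\tDP=25\n'
} > "$tmp/toy.vcf"

# Loose: PASS anywhere on the line (headers kept, then excluded from the count).
loose=$(grep -E '^#|PASS' "$tmp/toy.vcf" | grep -vc '^#')
# Strict: column 7 must be exactly PASS.
strict=$(awk -F'\t' '/^#/ || $7 == "PASS"' "$tmp/toy.vcf" | grep -vc '^#')

echo "loose=$loose strict=$strict"
rm -rf "$tmp"
```

The loose pattern keeps the `LowQual` record because of its INFO text, while the column test does not.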
2. Advanced: Complex Patterns and Piping Operations
# (1) Filter reads in FASTQ by GC content (requires multi-line processing)
# grep alone can only select reads made up entirely of G/C/N (i.e. 100% GC), not a ≥60% threshold:
grep -B1 -A2 --no-group-separator -E '^[GCNgcn]+$' input.fastq
# GC content ≥60% needs counting, e.g. with awk (gsub on a copy, so the sequence itself is printed unchanged):
awk 'NR%4==1 {h=$0; next} NR%4==2 {s=$0; n=length(s); gc=gsub(/[GCgc]/,"",s); keep=(n>0 && gc/n>=0.6); if (keep) print h} keep' input.fastq
# (2) Extract the parent transcript (mRNA) IDs of exon features (type=exon) from GFF files
# (\t is only understood as a tab by -P, not by -E)
grep -P '\texon\t' annotation.gff | grep -oP 'Parent=\K[^;]+'
# (3) Merge non-header lines from multiple VCF files and deduplicate across files by CHROM and POS (only one record is kept per position)
grep -hv '^#' *.vcf | sort -u -k1,1 -k2,2n
# (4) Check if BED file is sorted (chromosome order, then start position)
grep -v '^#' peaks.bed | awk 'NR>1 && (($1 "") < prev_chr || (($1 "") == prev_chr && $2+0 < prev_start)) {print "Unsorted at record " NR; bad=1; exit} {prev_chr=$1 ""; prev_start=$2+0} END {if (!bad) print "Sorted"}'
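Because a GC threshold depends on counting characters rather than matching a pattern, it is easier to verify on a tiny input. This is a minimal sketch with two made-up reads; the helper name `gc_filter` and the file `toy.fastq` are hypothetical, and a strict 4-line FASTQ layout is assumed.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Two made-up reads: r1 is 75% GC, r2 is 25% GC.
printf '%s\n' '@r1' 'GCGCGCAT' '+' 'IIIIIIII' '@r2' 'ATATATGC' '+' 'IIIIIIII' > "$tmp/toy.fastq"

gc_filter() {
  # $1 = FASTQ file, $2 = minimum GC fraction (0 to 1)
  awk -v min="$2" '
    NR % 4 == 1 { header = $0; next }
    NR % 4 == 2 {
      seq = $0
      n = length(seq)
      gc = gsub(/[GCgc]/, "", seq)   # gsub returns the number of replacements made in the copy
      keep = (n > 0 && gc / n >= min)
      if (keep) { print header; print $0 }
      next
    }
    keep                             # "+" line and quality line of a kept read
  ' "$1"
}

kept=$(gc_filter "$tmp/toy.fastq" 0.6)
printf '%s\n' "$kept"
rm -rf "$tmp"
```

Counting on a copy of the sequence avoids modifying the line that is printed and guards against division by zero for empty reads.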
3. Advanced: Precise Matching with Regular Expressions
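# (1) Match SAM records whose FLAG (2nd column) is exactly 5: paired and unmapped (0x1+0x4=5)
grep -P '^[^\t]+\t5\t' alignments.sam
# Or test the bits directly (safer, also catches FLAGs with additional bits set; and() requires gawk):
awk 'and($2,5)==5' alignments.sam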
# (2) Filter variants according to HGVS naming rules (e.g., c.123A>T)
grep -P 'c\.\d+[ACGTU]?>[ACGTU]' clinical_variants.txt
# (3) Search for primer sequence matches (allowing degenerate bases, e.g., R=A/G)
grep -iE 'AG[CGT]TA[AG]A[AT]G' primer_sequences.fasta
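# (4) Count reads in BAM files with mapping quality ≥30 (MAPQ is the 5th column; requires samtools)
samtools view input.bam | grep -cP '^(?:[^\t]*\t){4}(?:[3-9][0-9]|[1-9][0-9]{2,})\t'
# Equivalent without grep: samtools view -c -q 30 input.bam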
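Numeric thresholds on one column are a common source of false positives, since other columns (FLAG, POS) also hold numbers. The sketch below uses made-up SAM text records, so samtools is not needed, and contrasts an unanchored numeric pattern with one tied to column 5.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Minimal tab-separated SAM records (made up). Columns: QNAME FLAG RNAME POS MAPQ CIGAR ...
{
  printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t16\tchr1\t5000\t10\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r3\t0\tchr2\t42\t30\t4M\t*\t0\t0\tACGT\tIIII\n'
} > "$tmp/toy.sam"

# Unanchored: any tab-delimited number >= 30 matches, including POS.
any_col=$(grep -cP '\t(?:[3-9][0-9]|[0-9]{3,})\t' "$tmp/toy.sam")
# Anchored: skip exactly four fields, then test the fifth (MAPQ).
col5=$(grep -cP '^(?:[^\t]*\t){4}(?:[3-9][0-9]|[1-9][0-9]{2,})\t' "$tmp/toy.sam")
# Same question asked numerically with awk.
awk_n=$(awk -F'\t' '$5 >= 30 { n++ } END { print n + 0 }' "$tmp/toy.sam")

echo "any_col=$any_col col5=$col5 awk=$awk_n"
rm -rf "$tmp"
```

Read `r2` has MAPQ 10 but position 5000, so only the unanchored pattern counts it.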
3. Performance Optimization Tips
- Disable Regex: Use `grep -F` for fixed strings to speed up
- Parallel Processing: Combine with GNU `parallel` to split large files (e.g., large FASTQ) into chunks that are searched concurrently
  parallel --pipepart -a bigfile.fastq --block 1G grep 'pattern' > matches.txt
- Locale: Prefer `LC_ALL=C grep` for pure ASCII data
- Combine Conditions: When any one of several patterns should match, avoid multiple grep calls and separate the alternatives with `|`
  grep -E 'condition1|condition2' file
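Alternation with `|` expresses OR, whereas chaining greps expresses AND, so the two are not interchangeable. A minimal sketch on a made-up three-line file (`toy.txt` is hypothetical):

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf '%s\n' 'geneA exon chr1' 'geneB intron chr1' 'geneC exon chr2' > "$tmp/toy.txt"

# OR in a single pass: alternation, or equivalently several -e patterns.
or_alt=$(LC_ALL=C grep -cE 'exon|chr2' "$tmp/toy.txt")
or_e=$(LC_ALL=C grep -c -e 'exon' -e 'chr2' "$tmp/toy.txt")

# AND: alternation cannot express this; chain two greps or use one awk call.
and_chain=$(grep 'exon' "$tmp/toy.txt" | grep -c 'chr1')
and_awk=$(awk '/exon/ && /chr1/ { n++ } END { print n + 0 }' "$tmp/toy.txt")

echo "OR: $or_alt $or_e  AND: $and_chain $and_awk"
rm -rf "$tmp"
```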
4. Practical Data Examples
Case 1: Extract Sequences with Specified ID from FASTQ
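# (1) Single ID match (anchored and escaped so that .1 does not also match .10, .11, ...)
grep -A3 -E '^@SRR1234567\.1([[:space:]]|$)' reads.fastq
# (2) Multiple ID list (ids.txt with one ID per line; -F -w matches the IDs as whole literal words)
grep -A3 --no-group-separator -F -w -f ids.txt reads.fastq > target.fastq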
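Read IDs that share a prefix (for example `.1` and `.10`) are easy to over-match. The sketch below shows the effect on a made-up FASTQ and an exact lookup with awk; the helper `extract_by_id` is hypothetical and assumes 4-line records and a non-empty ID list without the leading `@`.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
printf '%s\n' '@SRR1234567.1 len=4'  'ACGT' '+' 'IIII' \
              '@SRR1234567.10 len=4' 'TTTT' '+' 'IIII' \
              '@SRR1234567.2 len=4'  'GGGG' '+' 'IIII' > "$tmp/toy.fastq"
printf '%s\n' 'SRR1234567.1' > "$tmp/ids.txt"

# IDs used as unanchored regexes: SRR1234567.10 is matched as well.
loose=$(grep -c -f "$tmp/ids.txt" "$tmp/toy.fastq")
# Literal whole-word matching excludes the longer ID.
whole=$(grep -c -F -w -f "$tmp/ids.txt" "$tmp/toy.fastq")

extract_by_id() {
  # $1 = ID list, $2 = FASTQ; compares the first header token (minus "@") exactly.
  awk 'NR == FNR { want[$1] = 1; next }
       FNR % 4 == 1 { id = substr($1, 2); keep = (id in want) }
       keep' "$1" "$2"
}
exact=$(extract_by_id "$tmp/ids.txt" "$tmp/toy.fastq")

echo "loose=$loose whole=$whole"
printf '%s\n' "$exact"
rm -rf "$tmp"
```

The awk variant only inspects header lines, so a sequence or quality line can never be mistaken for an ID.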
Case 2: Count Sequencing Data Contamination (match vector sequences)
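# First, build a list of vector sequence IDs from vector.fa (first word of each header, without ">")
grep '^>' vector.fa | cut -c2- | cut -d' ' -f1 > vector_ids.txt
# Then count vector hits per query (column 1) in the tabular BLAST results
grep -F -w -f vector_ids.txt blast_results.tsv | cut -f1 | sort | uniq -c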
Case 3: Filter Exon Splicing Sites (GT/AG Rule)
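# Find GT...AG patterns in genomic sequences (at least 50 bases between GT and AG)
# Note: grep works line by line, so this only finds sites that lie within a single line of genome.fa; the lazy quantifier reports the shortest candidates
grep -oP 'GT[ACGT]{50,}?AG' genome.fa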
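FASTA sequences are often wrapped at a fixed width, and a line-based tool cannot see a motif that spans a line break. One possible workaround, sketched here on a made-up record, is to join each record into a single line first; the helper `linearize_fasta` and the file `wrapped.fa` are hypothetical, and very long chromosomes may be better served by a dedicated sequence tool.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# Made-up record: TTGT + 50 C bases + AGTT, wrapped at 20 columns.
mid=$(printf 'C%.0s' $(seq 1 50))
printf '>chrToy\n%s\n' "TTGT${mid}AGTT" | fold -w 20 > "$tmp/wrapped.fa"

linearize_fasta() {
  # Print each header on its own line, followed by its sequence joined into one line.
  awk '/^>/ { if (NR > 1) print ""; print; next }
       { printf "%s", $0 }
       END { if (NR > 0) print "" }' "$1"
}

# On the wrapped file no single line is long enough to match.
wrapped_hits=$(grep -cP 'GT[ACGT]{50,}AG' "$tmp/wrapped.fa" || true)
# After joining lines, the GT...AG span is found.
hit=$(linearize_fasta "$tmp/wrapped.fa" | grep -oP 'GT[ACGT]{50,}?AG')

echo "wrapped_hits=$wrapped_hits hit_length=${#hit}"
rm -rf "$tmp"
```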
Case 4: Separate Paired-End Sequencing Data (SAM/BAM FLAG Check)
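# First-end read (0x40=64): FLAG & 64 != 0
# A regex would have to enumerate every FLAG value with this bit set:
samtools view input.bam | grep -P '\t(7[0-9]|1[1-9][0-9]|...) ' > read1.sam
# samtools can test the bit directly with -f:
samtools view -f 64 input.bam > read1.sam
# Second-end read (0x80=128): FLAG & 128 != 0
samtools view input.bam | grep -P '\t(...valid FLAG...) ' > read2.sam
samtools view -f 128 input.bam > read2.sam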
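FLAG is a bit field, so a bit test is more direct than a list of decimal values. Where samtools is not at hand, the test can be sketched in portable awk with integer division instead of gawk's `and()`. The helper `flag_filter` and the toy SAM records below are made up for illustration.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
{
  printf '@HD\tVN:1.6\n'
  printf 'p1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n'    # 99  = 1+2+32+64  (first in pair)
  printf 'p1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n'  # 147 = 1+2+16+128 (second in pair)
  printf 's1\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\n'         # unpaired read
} > "$tmp/toy.sam"

flag_filter() {
  # $1 = single bit value (e.g. 64), $2 = SAM file.
  # int(FLAG / bit) % 2 is 1 exactly when that bit is set; header lines are skipped.
  awk -F'\t' -v bit="$1" '!/^@/ && int($2 / bit) % 2 == 1' "$2"
}

read1=$(flag_filter 64  "$tmp/toy.sam" | cut -f1,2)
read2=$(flag_filter 128 "$tmp/toy.sam" | cut -f1,2)

printf 'read1: %s\nread2: %s\n' "$read1" "$read2"
rm -rf "$tmp"
```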
5. grep Family Tool Extensions
| Tool | Usage | Bioinformatics Example |
|---|---|---|
| egrep | = `grep -E` Extended regex | Match complex variant types |
| fgrep | = `grep -F` Fast string matching | Quickly screen known mutation IDs |
| agrep | Fuzzy matching (requires compilation for installation) | Tolerant search for similar primers |
| rgrep | Recursive directory search | Batch search for mutations across samples |
| bio-grep | Supports FASTA/Q syntax | Directly filter by sequence length/quality (requires installation) |
Summary Recommendations: For extremely large alignment files (e.g., whole-genome SAM/BAM), prefer `samtools view` with the `-L`/`-M` options to filter by region, which is more efficient than pure grep!
Tips: The above content and commands are for reference only; for practical operations, please refer to the actual data structure, code, and server configuration used.