Introduction
Reference Tutorials:
https://www.gnu.org/software/make/manual/make.html
https://makefiletutorial.com/
https://www.biostarhandbook.com/books/scripting/index.html
In upstream bioinformatics analysis, typing every command by hand each time is impractical. We usually collect the commands in a script file and run the script instead. However, when the work changes, we have to add or delete commands in the script, which increases our workload and reduces code reusability. A Makefile lets us modularize the analysis workflow instead.
A Makefile is a file used to automate a build process, commonly on Unix and Unix-like systems. It is read and executed by a tool called <span>make</span>. The Makefile defines a series of rules and dependencies that describe how to compile and link programs.
Using a Makefile in bioinformatics analysis helps us automate the workflow without a steep learning curve. Below is a simple example of a Makefile:
First, we create a file named <span>Makefile</span> and write some content in it. The first line, <span>.RECIPEPREFIX = ></span>, tells <span>make</span> that command lines start with the > character instead of the default tab.
.RECIPEPREFIX = >
# Command 1
foo:
> echo Hello world!
# Command 2
bar:
> echo Hello world!
> echo Hello Everyone!
In this Makefile, we define two targets, <span>foo</span> and <span>bar</span>, each with its own commands. We can execute them with <span>make foo</span> and <span>make bar</span>.
$ make -f Makefile foo
echo Hello world!
Hello world!
$ make -f Makefile bar
echo Hello world!
Hello world!
echo Hello Everyone!
Hello Everyone!
When the file is named <span>Makefile</span>, you can run <span>make</span> directly without the <span>-f</span> parameter.
Typically, a Makefile will output both commands and results to the terminal. If we only want to output the results, we can prefix the command with an @ symbol. For example:
.RECIPEPREFIX = >
# Command 1
foo:
> @echo Hello world!
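The difference is easy to see side by side. The following is a minimal sketch, assuming GNU make is installed; the file name echo.mk and the target names loud and quiet are made up for this illustration. It also shows the -s (silent) option, which hides the commands for a whole run without editing the Makefile.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Write a small Makefile; the quoted 'EOF' keeps the shell from expanding anything.
cat > echo.mk <<'EOF'
.RECIPEPREFIX = >
loud:
> echo Hello world!
quiet:
> @echo Hello world!
EOF

# Without @, make prints the command and then its output.
loud_out=$(make -f echo.mk loud)
# With @, only the output of the command is printed.
quiet_out=$(make -f echo.mk quiet)
# -s silences command echoing for every recipe in this run.
silent_out=$(make -s -f echo.mk loud)

printf 'loud:\n%s\n\nquiet:\n%s\n\nloud with -s:\n%s\n' "$loud_out" "$quiet_out" "$silent_out"
```

Using @ is a per-command choice stored in the Makefile, while -s is a per-run choice made on the command line.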
Like bash scripts, Makefiles can also use variables, for example:
.RECIPEPREFIX = >
NAME = xiaoming
hello:
> @echo Hello ${NAME}
The output will be:
$ make -f anyfile hello
Hello xiaoming
<span>make</span> is designed to track dependencies between targets and to run only the steps that are needed when files change. For example, if you already have a genome index, the <span>make</span> command will skip the indexing step.
# Set the prefix from tabs to >
.RECIPEPREFIX = >
counts.txt:
> echo 100 > counts.txt
names.txt:
> echo Joe > names.txt
results.txt: counts.txt names.txt
> cat counts.txt names.txt > results.txt
In the above Makefile, running <span>make results.txt</span> will first generate <span>counts.txt</span> and <span>names.txt</span> if they do not exist. Then it will generate <span>results.txt</span> from these two files. If you run <span>make results.txt</span> again, it will do nothing, because <span>results.txt</span> already exists and neither of its prerequisites is newer than it.
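The rebuild decision is based on file timestamps, which can be checked with a short experiment. This is a sketch assuming GNU make is installed; deps.mk is a made-up file name, and the one-second sleep is only there so the modified file is clearly newer.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

cat > deps.mk <<'EOF'
.RECIPEPREFIX = >
counts.txt:
> echo 100 > counts.txt
names.txt:
> echo Joe > names.txt
results.txt: counts.txt names.txt
> cat counts.txt names.txt > results.txt
EOF

# First run: all three files are missing, so all three recipes execute.
make -f deps.mk results.txt

# Second run: nothing changed, so no recipe is executed.
second=$(make -f deps.mk results.txt)
echo "$second"

# Make one prerequisite newer than results.txt; only the final step is repeated.
sleep 1
echo 200 > counts.txt
third=$(make -f deps.mk results.txt)
echo "$third"
cat results.txt
```

After the edit, only the cat step runs again: counts.txt and names.txt have no prerequisites of their own, so their recipes are skipped as long as the files exist.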
Tips for Makefile
Default Settings
Adding these settings to your Makefile can make it more powerful:
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory
<span>.DELETE_ON_ERROR:</span> This special target tells <span>make</span> to delete a generated target file when the command that builds it fails, so incomplete or corrupted files are not left behind.
<span>SHELL := bash</span> This line specifies that <span>make</span> should use <span>bash</span> to execute commands. By default, <span>make</span> uses <span>/bin/sh</span>, but this allows you to specify a different shell.
<span>.ONESHELL:</span> This special target instructs <span>make</span> to execute all commands for a single target in the same shell session, rather than starting a new shell for each command. This is useful when several commands need to share environment or state.
<span>.SHELLFLAGS := -eu -o pipefail -c</span> These flags configure the behavior of <span>bash</span>:
<span>-e</span>: If any command fails (returns a non-zero status), <span>bash</span> exits.
<span>-u</span>: If an undefined variable is used, <span>bash</span> will report an error and exit.
<span>-o pipefail</span>: If any command in a pipeline fails, the entire pipeline returns a failure status.
<span>-c</span>: Read commands from a string and execute them.
<span>MAKEFLAGS += --warn-undefined-variables --no-print-directory</span>
<span>--warn-undefined-variables</span>: Warn about the use of undefined variables to help catch typos or missing variable definitions.
<span>--no-print-directory</span>: Stop <span>make</span> from printing directory information during recursive calls, which makes the output cleaner.
“Dry Run” Mode
On the command line, we can run a command, check whether it is correct, and run it again if it is not. With a Makefile, we would rather get things right in one go, so we can use the <span>-n</span> parameter for “dry run” mode. <span>make</span> then prints the commands that would be executed without actually executing them, which helps us verify that the commands are correct.
$ make hello
Hello xiaoming
$ make hello -n
echo Hello xiaoming
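A dry run also walks through prerequisites, so it shows the whole plan rather than just the last step. Here is a small sketch, assuming GNU make is installed; dry.mk is a made-up file name and the two targets mirror the earlier counts and results example.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

cat > dry.mk <<'EOF'
.RECIPEPREFIX = >
counts.txt:
> echo 100 > counts.txt
results.txt: counts.txt
> cat counts.txt > results.txt
EOF

# -n prints the plan, including the prerequisite step, without running it.
plan=$(make -n -f dry.mk results.txt)
echo "$plan"

# The directory still contains only the Makefile: nothing was built.
ls
```

This makes -n a cheap way to review a long pipeline, because no download or alignment step is started.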
Variable Naming
In a Makefile, you can give a variable a default value with <span>?=</span>, which assigns the value only if the variable is not already defined:
.RECIPEPREFIX = >
NAME ?= xiaoming
hello:
> @echo Hello ${NAME}
Thus, if we do not define the <span>NAME</span> variable on the command line, it will be set to <span>xiaoming</span>.
$ make hello
Hello xiaoming
$ make hello NAME=lihua
Hello lihua
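A variable can also arrive from the environment, and it helps to know which source wins. The sketch below assumes GNU make is installed; name.mk is a made-up file name and zhangsan is just a placeholder value. It runs the same target three ways.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

cat > name.mk <<'EOF'
.RECIPEPREFIX = >
NAME ?= xiaoming
hello:
> @echo Hello ${NAME}
EOF

# Make sure a NAME inherited from the calling environment does not interfere.
unset NAME

# Nothing defines NAME, so the ?= default applies.
default_out=$(make -f name.mk hello)
# An environment variable counts as already defined, so ?= leaves it alone.
env_out=$(NAME=zhangsan make -f name.mk hello)
# A command-line assignment overrides both the environment and the default.
cli_out=$(NAME=zhangsan make -f name.mk hello NAME=lihua)

printf '%s\n%s\n%s\n' "$default_out" "$env_out" "$cli_out"
```

The practical consequence is that a stray environment variable with the same name as a Makefile variable can silently change a run, so distinctive variable names are safer.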
Text Replacement
In some cases, you may want to create a new string from an existing string, for example to extract the directory name or the filename from a path such as:
PATH = data/reads/abc.txt
The Makefile syntax provides functions to perform this operation, for example:
DIR = $(dir ${PATH}) will contain data/reads/
FNAME = $(notdir ${PATH}) will contain abc.txt
# Sets the prefix for commands.
.RECIPEPREFIX = >
# Set a filename
FILE = data/refs/ebola.fa.gz
demo:
# Prints: data/refs/ebola.fa.gz
> @echo ${FILE}
# Prints: data/refs/
> @ echo $(dir ${FILE})
# Prints: ebola.fa.gz
> @ echo $(notdir ${FILE})
# Prints: data/refs/human.fa.gz
> @ echo $(subst ebola,human,${FILE})
# Prints: data/refs/ebola.fa
> @echo $(patsubst %.gz,%,${FILE})
$ make demo
data/refs/ebola.fa.gz
data/refs/
ebola.fa.gz
data/refs/human.fa.gz
data/refs/ebola.fa
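These functions can be nested and combined to derive new file names from one input. The following sketch assumes GNU make is installed; names.mk and the variables BASE, UNZIPPED, STEM, and INDEX are invented for the illustration, and the .fai name is only an example of a derived path. It uses the basename function and the substitution reference form, which is shorthand for patsubst.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

cat > names.mk <<'EOF'
.RECIPEPREFIX = >
FILE = data/refs/ebola.fa.gz
# basename drops only the last suffix.
BASE = $(basename ${FILE})
# Substitution reference: replace a trailing .gz with nothing.
UNZIPPED = ${FILE:.gz=}
# Nested calls: strip the directory, then both suffixes.
STEM = $(basename $(basename $(notdir ${FILE})))
# Assemble a new path from the pieces.
INDEX = $(dir ${FILE})${STEM}.fai
demo:
> @echo ${BASE}
> @echo ${UNZIPPED}
> @echo ${STEM}
> @echo ${INDEX}
EOF

out=$(make -f names.mk demo)
echo "$out"
```

The comments sit on their own lines on purpose: a comment placed after a value on the same line would leave trailing spaces inside the variable.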
Modularizing Bioinformatics Workflows
A single script that accomplishes every task is achievable, but it tends to become overly long, difficult to maintain, and hard to reuse. Therefore, we can break the script into multiple modules, each responsible for a specific task. This enhances the readability, maintainability, and reusability of the code.
Once the requirements are understood, following good coding practices makes this modularization straightforward. Next, let’s write a Makefile to download metadata from the SRA database.
Step 1: Default Settings
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory
Step 2: Print Help
We often forget what our scripts do. Therefore, it is best to make the first target print help, because <span>make</span> runs the first target when no target is given.
# Print usage information.
usage:
> @echo"#"
> @echo"# metadata.mk: downloads metadata from SRA"
> @echo"#"
> @echo"# SRA=${SRA}"
> @echo"#"
> @echo"# make run|keys|clean"
> @echo"#"
Step 3: Define Variables
This step may not be necessary at first, but as you become more proficient, you can define commonly used variables here.
# Sets the default SRA project accession.
SRA ?= PRJEB31790
Step 4: Add Code
Here, we add code for downloading, extracting information, and cleaning up.
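# The CSV file is a real target, so make can build it whenever it is missing.
${SRA}.csv:
> @bio search ${SRA} -H --csv --all > ${SRA}.csv
# run is a convenient name for "make sure the CSV file exists".
run: ${SRA}.csv
keys: ${SRA}.csv
> @cat ${SRA}.csv | csvcut -c run_accession,sample_title
clean:
> @rm -f ${SRA}.csv
# Targets that are not files.
.PHONY: usage run keys clean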
Step 5: Run Commands
Put all the above code into a file named metadata.mk and then run the make command.
# Use default variable
make -f metadata.mk run
# Use custom variable
make -f metadata.mk run SRA=PRJNA932187
# Extract key information. keys lists ${SRA}.csv as a prerequisite,
# so the CSV file is needed before keys can print anything.
make -f metadata.mk keys
# Clean up
make -f metadata.mk clean
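The link between a file target and the targets that depend on it can be tried out without installing bio or csvcut. The following is only a sketch under stated assumptions: metadata_demo.mk is a made-up file, printf stands in for the download step, cut stands in for csvcut, and the CSV rows are placeholder values, not real SRA records.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

cat > metadata_demo.mk <<'EOF'
.RECIPEPREFIX = >
SRA ?= PRJEB31790

# Stand-in for the download step: writes a tiny CSV with placeholder rows.
${SRA}.csv:
> @printf 'run_accession,sample_title,other\n' > $@
> @printf 'RUN0001,sample one,x\nRUN0002,sample two,y\n' >> $@

run: ${SRA}.csv

# Stand-in for csvcut: keep the first two comma-separated columns.
keys: ${SRA}.csv
> @cut -d, -f1,2 ${SRA}.csv

clean:
> @rm -f ${SRA}.csv

.PHONY: run keys clean
EOF

# keys is requested first; make notices the CSV is missing and creates it.
keys_out=$(make -f metadata_demo.mk keys)
echo "$keys_out"
ls

make -f metadata_demo.mk clean
```

Because the CSV is its own target, asking for keys is enough: make creates the file when it is missing and reuses it when it is already there.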
Extensions
I frequently use the software <span>bio</span>: https://www.bioinfo.help/index.html. It bundles many commonly used bioinformatics tools, and I recommend it!
After installing the software, running the following command will download many Makefile scripts. Familiarizing yourself with them will make bioinformatics analysis much simpler.
bio code
For example, here is a Makefile I always use to download data:
#
# Downloads sequencing reads from SRA.
#
# The directory that stores FASTQ reads that we operate on.
DIR ?= reads
# SRR number (sequencing run from the Ebola outbreak data of 2014)
SRR ?= SRR1553425
# The name of the unpacked reads
P1 ?= ${DIR}/${SRR}_1.fastq
P2 ?= ${DIR}/${SRR}_2.fastq
# The name of the reads
# You can rename reads to be more descriptive.
R1 ?= ${P1}
R2 ?= ${P2}
# How many reads to download (N=ALL downloads everything).
N ?= 10000
# Makefile customizations.
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory
# Print usage information.
usage::
@echo "#"
@echo "# sra.mk: downloads FASTQ reads from SRA"
@echo "#"
@echo "# SRR=${SRR}"
@echo "# N=${N} (use N=ALL to download all reads)"
@echo "#"
@echo "# R1=${R1}"
@echo "# R2=${R2}"
@echo "#"
@echo "# make run|test|aria|clean"
@echo "#"
# Determine if we download all reads or just a subset.
ifeq ($(N), ALL)
FLAGS ?= -F --split-files
else
FLAGS ?= -F --split-files -X ${N}
endif
# Obtain the reads from SRA.
${R1}:
# Create the directory.
	mkdir -p ${DIR}
# Download the reads.
	fastq-dump ${FLAGS} -O ${DIR} ${SRR}
# Rename the first in pair if final name is different.
	if [ "${P1}" != "${R1}" ]; then mv -f ${P1} ${R1}; fi
# Rename the second pair if exists and is different.
	if [ -f ${P2} ] && [ "${P2}" != "${R2}" ]; then mv -f ${P2} ${R2}; fi
# List the data. We don't know if the reads are paired or not.
run: ${R1}
	@if [ -f ${R2} ]; then
	@ls -lh ${R1} ${R2}
	else
	@ls -lh ${R1}
	fi
# Removes the SRA files.
clean:
	rm -f ${P1} ${P2} ${R1} ${R2}
# A synonym for clean.
run!: clean
# Download via aria2c command line too.
# The process may be more reliable than fastq-dump.
# Will not rename files to R1 and R2!
aria:
# Extract URLs from the search output
# Then use aria2c to download the reads for the SRR number.
	bio search ${SRR} | jq -r '.[].fastq_url[]' | \
	parallel -j 1 --lb make -f src/run/aria.mk URL={} DIR=${DIR} run
# Shows the resulting files.
	ls -l ${DIR}
# Run the test suite.
test: clean run
install::
	@echo micromamba install sra-tools
# Targets that are not files.
.PHONY: usage run run! test install
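The ifeq block in this Makefile decides at parse time which fastq-dump flags are used. Its effect can be previewed without downloading anything. The sketch below assumes GNU make is installed; flags.mk and the show target are made up for the illustration, and the recipe only echoes the command line instead of running fastq-dump.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

cat > flags.mk <<'EOF'
.RECIPEPREFIX = >
N ?= 10000
# Pick the flags once, while the Makefile is being read.
ifeq ($(N),ALL)
FLAGS ?= -F --split-files
else
FLAGS ?= -F --split-files -X ${N}
endif
show:
> @echo fastq-dump ${FLAGS}
EOF

# Default: download a subset of 10000 reads.
subset=$(make -f flags.mk show)
# N=ALL drops the -X limit entirely.
all=$(make -f flags.mk show N=ALL)
# Any other value of N is passed through to -X.
custom=$(make -f flags.mk show N=500)

printf '%s\n%s\n%s\n' "$subset" "$all" "$custom"
```

Since FLAGS is also assigned with ?=, the whole flag string can still be replaced from the command line when neither branch fits.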