Introduction

Reference Tutorials:

https://www.gnu.org/software/make/manual/make.html

https://makefiletutorial.com/

https://www.biostarhandbook.com/books/scripting/index.html

In upstream bioinformatics analysis, typing every command by hand each time is impractical. We usually collect the commands in a script file and run the script. But whenever the work changes we have to edit that script, adding or deleting commands, which adds to the workload and makes the code harder to reuse. A Makefile lets us split the analysis workflow into modules instead.

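A Makefile is a file that automates a build process, most commonly on Unix and Unix-like systems. It is read and executed by a tool called make. The Makefile defines a series of rules, each naming a target, the files the target depends on, and the commands that produce it. The tool was originally designed for compiling and linking programs, but the rules can run any commands.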

Using a Makefile in bioinformatics analysis lets us automate a workflow without learning much beyond the basics. Below is a simple example of a Makefile:

First, we create a file named Makefile and write some content in it.

.RECIPEPREFIX = >

# Command 1 
foo:
> echo Hello world!

# Command 2
bar:
> echo Hello world!
> echo Hello Everyone!

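In this Makefile, we define two targets, foo and bar, each with its own commands. The .RECIPEPREFIX line makes > the marker for command lines in place of the default tab character (this setting needs GNU Make 3.82 or newer). We can execute the commands with make foo and make bar.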

$ make -f Makefile foo
echo Hello world!
Hello world!

$ make -f Makefile bar
echo Hello world!
Hello world!
echo Hello Everyone!
Hello Everyone!

When the file is named Makefile, you can run make directly without the -f option.

By default, make prints each command before running it, followed by the command's output. To print only the output, prefix the command with an @ symbol. For example:

.RECIPEPREFIX = >

# Command 1 
foo:
> @echo Hello world!
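As a small sketch of the difference, the script below writes a throwaway Makefile and runs the same command with and without the @ prefix. The file name echo.mk and the target names loud and quiet are made up for this illustration, and GNU Make 3.82 or newer is assumed because of .RECIPEPREFIX.

```shell
# Work in a scratch directory so nothing in the current project is touched.
workdir=$(mktemp -d)
cd "$workdir"

# echo.mk is a hypothetical file name used only for this demo.
cat > echo.mk <<'EOF'
.RECIPEPREFIX = >

# make prints the command, then the command prints its output.
loud:
> echo Hello world!

# The leading @ hides the command; only its output is shown.
quiet:
> @echo Hello world!
EOF

make -f echo.mk loud      # two lines: the command and its output
make -f echo.mk quiet     # one line: the output only
make -s -f echo.mk loud   # -s (silent) hides the commands for the whole run
```

The -s option is handy when you want quiet output once without editing the Makefile.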

Like bash scripts, Makefiles can also use variables, for example:

.RECIPEPREFIX = >

NAME = xiaoming

hello:
> @echo Hello ${NAME}

The output will be:

$ make -f anyfile hello
Hello xiaoming

make tracks the dependencies between targets and compares file modification times, so it reruns only the steps whose inputs have changed. For example, if you already have an up-to-date genome index, make skips the indexing step.

# Set the prefix from tabs to >
.RECIPEPREFIX = >

counts.txt:
> echo 100 > counts.txt

names.txt:
> echo Joe > names.txt

results.txt: counts.txt names.txt
> cat counts.txt names.txt > results.txt

In the Makefile above, running make results.txt first generates counts.txt and names.txt if they do not exist, then builds results.txt from the two files. If you run make again, it does nothing, because results.txt already exists and is newer than both of its prerequisites. If you later modify counts.txt or names.txt, the next run rebuilds only results.txt.
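The rebuild behavior can be checked with a short script. This is a sketch under a few assumptions: the file name deps.mk is made up, GNU Make 3.82 or newer is installed, and the one-second sleep is only there so that the touched file ends up clearly newer than results.txt.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

cat > deps.mk <<'EOF'
.RECIPEPREFIX = >

counts.txt:
> echo 100 > counts.txt

names.txt:
> echo Joe > names.txt

results.txt: counts.txt names.txt
> cat counts.txt names.txt > results.txt
EOF

# First run: all three files are missing, so all three recipes run.
make -f deps.mk results.txt

# Second run: nothing is out of date, so make runs no commands.
make -f deps.mk results.txt

# Make one prerequisite newer than results.txt.
sleep 1
touch counts.txt

# -q (question mode) runs nothing; its exit status is non-zero
# when the target needs to be rebuilt.
if make -q -f deps.mk results.txt; then
    echo "results.txt is up to date"
else
    echo "results.txt is stale and will be rebuilt"
fi

# Third run: only the recipe for results.txt runs again.
make -f deps.mk results.txt
```

Question mode is a convenient way to ask make about the state of a target from another script without changing any files.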

Tips for Makefile

Default Settings

Adding these settings to your Makefile makes it stricter about errors and more predictable:

.RECIPEPREFIX = >
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory

.DELETE_ON_ERROR:

This special target tells make to delete a generated target file when the commands that build it fail, preventing incomplete or corrupted files from being left behind.

SHELL := bash

This line specifies that make should use bash to execute commands. By default, make uses /bin/sh, but this setting lets you specify a different shell.

.ONESHELL:

This special target instructs make to execute all commands for a single target in the same shell session, rather than starting a new shell for each command. This is useful when you need to share environment or state between multiple commands.

.SHELLFLAGS := -eu -o pipefail -c

These flags configure the behavior of bash:

-e: If any command fails (returns a non-zero status), bash exits.

-u: If an undefined variable is used, bash reports an error and exits.

-o pipefail: If any command in a pipeline fails, the entire pipeline returns a failure status.

-c: Read commands from a string and execute them.

MAKEFLAGS += --warn-undefined-variables --no-print-directory

--warn-undefined-variables: Warn about the use of undefined variables to help catch typos or missing variable definitions.

--no-print-directory: Stop make from printing directory information during recursive calls, which makes the output cleaner.
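To see two of these settings at work, here is a minimal sketch. The file name strict.mk and the targets shared and partial.txt are invented for the demo, and it assumes bash and GNU Make 3.82 or newer are available.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

cat > strict.mk <<'EOF'
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c

# With .ONESHELL both lines run in one bash process, so the shell
# variable set on the first line is still visible on the second.
# ($$ passes a literal $ through make to the shell.)
shared:
> sample=SRR1553425
> echo "sample is $$sample"

# The pipeline fails because of pipefail (false is not its last command),
# -e then stops the recipe, and .DELETE_ON_ERROR removes the partial file.
partial.txt:
> echo "first line" > partial.txt
> false | cat
> echo "never reached" >> partial.txt
EOF

make -f strict.mk shared

# make exits with an error here; "|| true" lets this script continue.
make -f strict.mk partial.txt || true

# The half-written file is gone.
ls partial.txt 2>/dev/null || echo "partial.txt was deleted"
```

Without .DELETE_ON_ERROR the half-written partial.txt would remain on disk, and the next run of make would treat it as a finished result.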

“Dry Run” Mode

At an interactive bash prompt, we can run a command, see whether it worked, and run it again if it did not. A Makefile may chain many long-running steps, so it is worth checking the commands before executing them. The -n option enables “dry run” mode: make prints the commands it would execute without actually executing them, which lets us verify that they are correct.

$ make hello
Hello xiaoming

$ make hello -n
echo Hello xiaoming
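Dry runs are most useful when targets depend on each other, because the output shows the whole chain in execution order. The sketch below assumes a made-up file name (dry.mk) and GNU Make 3.82 or newer.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

cat > dry.mk <<'EOF'
.RECIPEPREFIX = >

counts.txt:
> echo 100 > counts.txt

results.txt: counts.txt
> cat counts.txt > results.txt
EOF

# Prints both commands in the order make would run them...
make -n -f dry.mk results.txt

# ...but runs neither, so the directory still holds only dry.mk.
ls
```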

Variable Naming

In a Makefile, you can give a variable a default value with ?=, which assigns the value only if the variable is not already defined:

.RECIPEPREFIX = >

NAME ?= xiaoming

hello:
> @echo Hello ${NAME}

Thus, if we do not define the NAME variable on the command line (or in the environment), it is set to xiaoming.

$ make hello
Hello xiaoming

$ make hello NAME=lihua
Hello lihua
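The script below is a small sketch of where a value can come from. The file names name.mk and fixed.mk are made up; fixed.mk uses a plain = assignment for contrast. It assumes GNU Make 3.82 or newer.

```shell
# Work in a scratch directory and start with a clean environment variable.
workdir=$(mktemp -d)
cd "$workdir"
unset NAME

cat > name.mk <<'EOF'
.RECIPEPREFIX = >

NAME ?= xiaoming

hello:
> @echo Hello ${NAME}
EOF

# Same rule, but with a plain assignment instead of ?=.
cat > fixed.mk <<'EOF'
.RECIPEPREFIX = >

NAME = xiaoming

hello:
> @echo Hello ${NAME}
EOF

make -f name.mk hello               # Hello xiaoming (the ?= default)
NAME=lihua make -f name.mk hello    # Hello lihua (from the environment)
make -f name.mk hello NAME=lihua    # Hello lihua (from the command line)

# A plain = assignment in the Makefile wins over the environment.
NAME=lihua make -f fixed.mk hello   # Hello xiaoming
```

Using ?= for every setting a user might want to change keeps the Makefile usable both interactively and from wrapper scripts that export variables.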

Text Replacement

In some cases, you may want to create a new string from an existing string.

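For example, you may want to extract the directory name or the filename from a path such as

PATH = data/reads/abc.txt

The Makefile syntax provides functions to perform this operation, for example:

DIR = $(dir ${PATH}) will contain data/reads/

FNAME = $(notdir ${PATH}) will contain abc.txt

(In a real Makefile, choose a name other than PATH for such a variable: make passes it on to the shell, which would then no longer find its programs.)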

# Sets the prefix for commands.
.RECIPEPREFIX = >

# Set a filename
FILE = data/refs/ebola.fa.gz

demo:

# Prints: data/refs/ebola.fa.gz
> @echo ${FILE}

# Prints: data/refs/
> @ echo $(dir ${FILE})

# Prints: ebola.fa.gz
> @ echo $(notdir ${FILE})

# Prints: data/refs/human.fa.gz
> @ echo $(subst ebola,human,${FILE})

# Prints: data/refs/ebola.fa
> @echo $(patsubst %.gz,%,${FILE})

$ make demo
data/refs/ebola.fa.gz
data/refs/
ebola.fa.gz
data/refs/human.fa.gz
data/refs/ebola.fa
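These functions are typically used to derive output names from input names. The following sketch shows one way to do that. The file name names.mk, the variable names SAMPLE and BAM, and the bam/ directory are assumptions made for this illustration; GNU Make 3.82 or newer is assumed as well.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

cat > names.mk <<'EOF'
.RECIPEPREFIX = >

# An example path to the first read file of a pair.
R1 = reads/SRR1553425_1.fastq

# Swap the _1 suffix for _2 to get the mate file.
R2 = $(subst _1.fastq,_2.fastq,${R1})

# Drop the directory and the read suffix to get the sample name.
SAMPLE = $(patsubst %_1.fastq,%,$(notdir ${R1}))

# Build an output name in another directory (bam/ is a made-up location).
BAM = bam/${SAMPLE}.bam

show:
> @echo ${R2}
> @echo ${SAMPLE}
> @echo ${BAM}
EOF

# Prints:
#   reads/SRR1553425_2.fastq
#   SRR1553425
#   bam/SRR1553425.bam
make -f names.mk show
```

Deriving every name from a single variable means that overriding R1 on the command line updates all the dependent names at once.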

Modularizing Bioinformatics Workflows

A single script that accomplishes every task is achievable, but it tends to become overly long, hard to maintain, and hard to reuse. It is better to break the script into multiple modules, each responsible for one specific task, which improves the readability, maintainability, and reusability of the code.

Once the requirements are understood, good coding practices make this modularization straightforward. Next, let’s write a Makefile script that downloads metadata from the SRA database.

Step 1: Default Settings

# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory

Step 2: Print Help

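We often forget what our scripts do, so it is best to have the first target print help. make runs the first target in the file when no target is named, so running the file without arguments then shows the usage text.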

# Print usage information.
usage:
> @echo "#"
> @echo "# metadata.mk: downloads metadata from SRA"
> @echo "#"
> @echo "# SRA=${SRA}"
> @echo "#"
> @echo "# make run|keys|clean"
> @echo "#"

Step 3: Define Variables

This step may not be necessary at first, but as you become more proficient, you can define commonly used variables here.

# Sets the default project accession.
SRA ?= PRJEB31790

Step 4: Add Code

Here, we add code for downloading, extracting information, and cleaning up.

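# Download the metadata only when the CSV file is missing.
${SRA}.csv:
> @bio search ${SRA} -H --csv --all > ${SRA}.csv

# run is a convenient name for the CSV file.
run: ${SRA}.csv

# Print the run accession and sample title columns.
keys: ${SRA}.csv
> @cat ${SRA}.csv | csvcut -c run_accession,sample_title

# Remove the downloaded metadata.
clean:
> @rm -f ${SRA}.csv

.PHONY: usage run keys clean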

Step 5: Run Commands

Put all of the code above into a file named metadata.mk, then run it with the make command.

# Use default variable
make -f metadata.mk run

# Use custom variable
make -f metadata.mk run SRA=PRJNA932187

# Extract key information
make -f metadata.mk keys
# keys can also be run on its own: it lists ${SRA}.csv as a prerequisite,
# so make looks for that file first and builds it if a rule for it exists

# Clean up
make -f metadata.mk clean
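For reference, here is a sketch of the whole file assembled from Steps 1 to 4, wrapped in a script that checks it without contacting SRA. One assumption is made beyond the steps as listed: the download recipe is attached to the ${SRA}.csv file target, so that keys can trigger the download when the file is missing. The dry runs below do not call bio or csvcut, so neither tool needs to be installed to try this.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
unset SRA

cat > metadata.mk <<'EOF'
# Makefile customizations.
.RECIPEPREFIX = >
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# The project accession to look up.
SRA ?= PRJEB31790

# Print usage information (first target, so it is the default).
usage:
> @echo "#"
> @echo "# metadata.mk: downloads metadata from SRA"
> @echo "#"
> @echo "# SRA=${SRA}"
> @echo "#"
> @echo "# make run|keys|clean"
> @echo "#"

# Assumption: the download is tied to the file it creates.
${SRA}.csv:
> @bio search ${SRA} -H --csv --all > ${SRA}.csv

run: ${SRA}.csv

keys: ${SRA}.csv
> @cat ${SRA}.csv | csvcut -c run_accession,sample_title

clean:
> @rm -f ${SRA}.csv

.PHONY: usage run keys clean
EOF

# No target given: make runs the first one and prints the help text.
make -f metadata.mk

# Dry runs show what would be executed for each accession.
make -n -f metadata.mk keys
make -n -f metadata.mk keys SRA=PRJNA932187
```

Because the CSV file is a real target, a second make run skips the download until make clean removes the file.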

Extensions

I frequently use the bio package: https://www.bioinfo.help/index.html. It bundles many commonly used bioinformatics tools, and I recommend it.

After installing the software, running the following command will download many Makefile scripts. Familiarizing yourself with them will make bioinformatics analysis much simpler.

bio code

For example, here is a Makefile I always use to download data:

#
# Downloads sequencing reads from SRA.
#

# The directory that stores FASTQ reads that we operate on.
DIR ?= reads

# SRR number (sequencing run from the Ebola outbreak data of 2014)
SRR ?= SRR1553425

# The name of the unpacked reads
P1 ?= ${DIR}/${SRR}_1.fastq
P2 ?= ${DIR}/${SRR}_2.fastq

# The name of the reads
# You can rename reads to be more descriptive.
R1 ?= ${P1}
R2 ?= ${P2}

# How many reads to download (N=ALL downloads everything).
N ?= 10000

# Makefile customizations.
.DELETE_ON_ERROR:
SHELL := bash
.ONESHELL:
.SHELLFLAGS := -eu -o pipefail -c
MAKEFLAGS += --warn-undefined-variables --no-print-directory

# Print usage information.
usage::
	@echo "#"
	@echo "# sra.mk: downloads FASTQ reads from SRA"
	@echo "#"
	@echo "# SRR=${SRR}"
	@echo "# N=${N} (use N=ALL to download all reads)"
	@echo "#"
	@echo "# R1=${R1}"
	@echo "# R2=${R2}"
	@echo "#"
	@echo "# make run|test|aria|clean"
	@echo "#"


# Determine if we download all reads or just a subset.
ifeq ($(N), ALL)
FLAGS ?= -F --split-files
else
FLAGS ?= -F --split-files -X ${N}
endif

# Obtain the reads from SRA.
${R1}:
	# Create the directory.
	mkdir -p ${DIR}

	# Download the reads.
	fastq-dump ${FLAGS} -O ${DIR} ${SRR}

	# Rename the first in pair if final name is different.
	if [ "${P1}" != "${R1}" ]; then mv -f ${P1} ${R1}; fi

	# Rename the second pair if exists and is different.
	if [ -f ${P2} ] && [ "${P2}" != "${R2}" ]; then mv -f ${P2} ${R2}; fi

# List the data. We don't know if the reads are paired or not.
run: ${R1}
	@if [ -f ${R2} ]; then 
		@ls -lh ${R1} ${R2}
	else
		@ls -lh ${R1}
	fi

# Removes the SRA files.
clean:
	rm -f ${P1} ${P2} ${R1} ${R2}

# A synonym for clean.
run!: clean

# Download via aria2c command line too.
# The process may be more reliable than fastq-dump.
# Will not rename files to R1 and R2!
aria:
	# Extract URLs from the search output
	# Then use aria2c to download the reads for the SRR number.
	bio search ${SRR} | jq -r '.[].fastq_url[]' | \
		parallel -j 1 --lb make -f src/run/aria.mk URL={} DIR=${DIR} run

	# Shows the resulting files.
	ls -l ${DIR}

# Run the test suite.
test: clean run

install::
	@echo micromamba install sra-tools

# Targets that are not files.
.PHONY: usage run run! test install clean aria
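The ifeq block in this Makefile decides, while the file is being read, which flags fastq-dump receives. A reduced sketch of just that part follows. The file name flags.mk and the show target are made up, and the recipe only prints the command line instead of running fastq-dump, so sra-tools is not needed to try it.

```shell
# Work in a scratch directory and start with clean environment variables.
workdir=$(mktemp -d)
cd "$workdir"
unset N FLAGS

cat > flags.mk <<'EOF'
.RECIPEPREFIX = >

# How many reads to download (N=ALL downloads everything).
N ?= 10000

# ifeq is evaluated while make reads the file, before any recipe runs.
ifeq ($(N),ALL)
FLAGS ?= -F --split-files
else
FLAGS ?= -F --split-files -X ${N}
endif

# Print the command line instead of running fastq-dump.
show:
> @echo fastq-dump ${FLAGS} -O reads SRR1553425
EOF

make -f flags.mk show          # uses the default, -X 10000
make -f flags.mk show N=500    # uses -X 500
make -f flags.mk show N=ALL    # no -X limit at all
```

Since FLAGS itself is set with ?=, it can still be replaced entirely from the command line when neither branch fits.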
