Getting Started with Genome Annotation: Linux Guide

Genome Annotation refers to the analysis of genomes to predict the locations of genes, which is crucial for genomic analysis and gene function studies.

Genome annotation is a multi-step process that involves many software tools. Errors can arise while installing and using these tools, leading to low-quality annotations or even a failed annotation run.

This series of posts will provide a step-by-step learning approach to genome annotation and document potential issues to facilitate better analysis. We also welcome discussions and exchanges from everyone.

01 Linux System Information Management

1.1 System Information

arch # Display the machine's processor architecture
yf@host:~$ arch
x86_64

uname -r # Display the current kernel version
yf@host:~$ uname -r
6.8.0-49-generic

shutdown # Shut down the system (use shutdown -r to reboot)
exit # Logout

1.2 Directory and Resource Management

cd /home # Enter the /home directory
cd .. # Go up to the parent directory
cd ../.. # Go up two directory levels
cd # Enter your personal home directory

ls # List files in the directory
ls -l # Show detailed information about files and directories
mkdir dir1 # Create a directory named dir1
rm -f file1 # Delete a file named file1
rm -rf dir1 # Delete a directory named dir1 and its contents
mv dir1 new_dir # Rename/move a directory

which halt # Show the full path of a binary or executable file
yf@host:~$ which halt
/usr/sbin/halt

whereis halt # Show the location of a binary file, source, or man page
yf@host:~$ whereis halt
halt: /usr/sbin/halt /usr/share/man/man8/halt.8.gz

df -h # Show disk usage of mounted file systems in human-readable units
du -sh dir1 # Estimate the disk space used by directory dir1

chmod 777 file1 # Set file1's permissions to read/write/execute (rwx) for User, Group, Other
# Each digit is the sum of 4 (r), 2 (w), and 1 (x), so 7 = rwx
chmod 700 file1 # Set file1's permissions to read/write/execute (rwx) only for User
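
The octal digits can be derived mechanically from the symbolic form. The following is a minimal sketch, not a standard tool: perm_to_octal is a hypothetical helper name, and it assumes a plain 9-character permission string such as rwxr-x--- (no setuid or sticky bits).

```shell
# Hypothetical helper: convert a 9-character symbolic permission string
# (user, group, other) into the 3-digit octal form accepted by chmod.
perm_to_octal() (
    perm=$1
    out=""
    for start in 1 4 7; do
        # Take the three characters belonging to one class (user/group/other).
        chunk=$(printf '%s' "$perm" | cut -c"${start}-$((start + 2))")
        n=0
        case $chunk in r??) n=$((n + 4)) ;; esac   # read    = 4
        case $chunk in ?w?) n=$((n + 2)) ;; esac   # write   = 2
        case $chunk in ??x) n=$((n + 1)) ;; esac   # execute = 1
        out="${out}${n}"
    done
    printf '%s\n' "$out"
)

perm_to_octal rwxrwxrwx   # prints 777
perm_to_octal rwx------   # prints 700
perm_to_octal rwxr-x---   # prints 750
```
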

renice [-n] priority [[-p] pid [...]] [[-g] pgrp [...]] [[-u] user [...]] # Change the priority of running processes

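renice -n -5 -u username # Adjust the nice value of all processes owned by username using -5. A negative nice value RAISES priority and normally requires root. To lower priority, use a positive value (e.g. renice -n 5 -u username); this reduces competition for system resources but may slow those processes down.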
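
The direction of the nice scale is easy to get backwards, so here is a small sketch that needs no root privileges. It uses the related nice command rather than renice, and assumes GNU coreutils, where nice without arguments prints the current niceness.

```shell
# Niceness of the current shell (0 by default on most systems).
current=$(nice)

# Run a child command with its niceness increased by 5, i.e. at LOWER priority.
# The inner `nice` simply prints the niceness it is running with.
lowered=$(nice -n 5 nice)

echo "current niceness: $current, child niceness: $lowered"
```

A larger number means the process is "nicer" to others and gets less CPU priority; the scale tops out at 19.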

grep pattern files # Search for a pattern in files
grep -r pattern dir # Recursively search for a pattern in files in a directory
command | grep pattern # Search for a pattern in the output of a command
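
As an illustration of grep in an annotation context, this sketch counts sequences in a small FASTA file. The file contents and names are made up for the example.

```shell
# Build a tiny demo FASTA file in a temporary directory (made-up data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fasta" <<'EOF'
>seq1 hypothetical protein
ATGGCGTACGTTAG
>seq2 kinase
ATGCCCGGGTTTAA
>seq3 hypothetical protein
ATGAAATTTGGGCC
EOF

# -c counts matching lines; header lines in FASTA start with ">".
n_seqs=$(grep -c '^>' "$tmpdir/demo.fasta")
n_hypo=$(grep -c 'hypothetical' "$tmpdir/demo.fasta")
echo "sequences: $n_seqs, hypothetical: $n_hypo"   # prints: sequences: 3, hypothetical: 2

rm -rf "$tmpdir"
```
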

tar -cf file.tar files # Create a tar archive containing files
tar -xf file.tar # Extract files from a tar archive
tar -xzf file.tar.gz # Extract a gzip-compressed tar archive
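
A round trip makes the tar flags concrete. This sketch works entirely inside a temporary directory; the file names are placeholders.

```shell
work=$(mktemp -d)
mkdir -p "$work/src" "$work/out"
printf 'ATGC\n' > "$work/src/a.txt"

# -c create, -z gzip, -f archive name; -C changes directory before adding files.
tar -czf "$work/demo.tar.gz" -C "$work/src" a.txt

# -t lists the archive contents without extracting.
listing=$(tar -tzf "$work/demo.tar.gz")

# -x extracts; -C selects the destination directory.
tar -xzf "$work/demo.tar.gz" -C "$work/out"
restored=$(cat "$work/out/a.txt")

echo "archive contains: $listing, restored content: $restored"
rm -rf "$work"
```
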

zip archive.zip file1.txt # Compress a single file
zip -r archive.zip dir1/ # Recursively compress an entire directory
unzip archive.zip # Extract files from a zip archive
unzip archive.zip -d /path/to/directory # Specify a directory to extract to
unzip -l archive.zip # List files in the zip archive

^C # Terminate the current command
^Z # Suspend the current command
^D # Exit the current session

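top -u username # Real-time monitoring of username's processes; press x to highlight the sort column, b to toggle bold/reverse highlighting, M to sort by memory usage, q to exit
htop -u username -t # Display username's processes in a tree format
pstree -u username # Display the process trees owned by username (-u also shows UID transitions)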

bg # Resume a suspended job (e.g. after ^Z) in the background
fg # Bring a background or suspended job to the foreground

02 Creating and Logging in as a New User on Linux

2.1 Creating a New User on Linux

useradd [OPTIONS] USERNAME
# Only the superuser (root) or users with sudo privileges can use useradd to create a new user account. When executing the useradd command, it creates a new user account based on options specified on the command line and defaults in the /etc/default/useradd file.

sudo useradd yf
# If no options are specified, useradd will create a new user account using the defaults from /etc/default/useradd. This command will add corresponding entries in /etc/passwd, /etc/shadow, /etc/group, and /etc/gshadow.

id yf
# You can verify if the user has been created and view the user's details by running the id command:
uid=1017(yf) gid=1019(yf) groups=1019(yf),27(sudo)

sudo passwd yf
# To allow the newly created user to log in, you need to set a password for them. You can do this by running the passwd command followed by the username. This command will prompt you to enter and confirm the password; make sure to use a strong password:
Changing password for user yf.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

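sudo useradd -m -r -d /data/yf -G sudo yf
# This example creates a new user named yf as a system account (-r), sets the home directory to /data/yf (-d) and creates it (-m), and adds the user to the sudo group (-G sudo), which grants administrator privileges on systems where that group is enabled in sudoers.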
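
Before creating an account it can help to check whether the name is already taken. The sketch below is one way to do that; user_exists is a hypothetical helper name, and the useradd call is only printed, not executed, because it requires root.

```shell
# Hypothetical helper: succeeds if the given account name is known to the system.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

new_user="yf"   # account name used in this guide
if user_exists "$new_user"; then
    echo "user $new_user already exists: $(id "$new_user")"
else
    # Printed only; run it yourself with sudo privileges.
    echo "would run: sudo useradd -m $new_user && sudo passwd $new_user"
fi
```
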

2.2 Remote Login to Host

SSH Protocol:
ssh -p port user@host # Log in on a specific port (port: port number, host: host address, user: username). Note that ssh takes a lowercase -p.

SFTP Protocol:
sftp -P port user@host # Same as above
get remote_file_name local_file_name # Download a file from the remote server to local and rename it
get -Pr directory_name # Download files and directories while preserving original file attributes and dates

put localfile # Upload a local file to the remote server
put -r local_directory_name # Upload files and directories

2.3 New User Remote Login Guide

When a new user logs in to the host for the first time, the command line may show only a bare $ prompt.

You can directly enter: bash
In bash mode, enter: chsh
After the prompt Shell [/bin/sh]: appears, enter: /bin/bash
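
To confirm which login shell an account currently has, the seventh field of /etc/passwd can be inspected. This is a minimal sketch: login_shell_of is a hypothetical helper name, and it assumes a local account (users served by LDAP or NIS do not appear in /etc/passwd).

```shell
# Hypothetical helper: print the login shell recorded for a local account.
login_shell_of() {
    awk -F: -v u="$1" '$1 == u { print $7; exit }' /etc/passwd
}

current_user=$(id -un 2>/dev/null || true)
echo "login shell of ${current_user:-unknown}: $(login_shell_of "$current_user")"

# Non-interactive alternative to the chsh dialog (asks for your password):
#   chsh -s /bin/bash
```
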

03 Environment Management Software Conda and Mamba Configuration

3.1 Installing Conda

Simply download the latest version

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Run the installation script

bash Miniconda3-latest-Linux-x86_64.sh

After installation, reopen the terminal, and you will enter the base environment, which will be displayed at the beginning of the command line:

(base) yf@host:~$
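
For scripted setups, the installer can also run without prompts. The following is a sketch under assumptions: the install prefix is a choice made for this example, and the block deliberately does nothing unless the installer file from the wget step is present and the prefix does not exist yet.

```shell
installer="Miniconda3-latest-Linux-x86_64.sh"   # file downloaded by the wget step
prefix="${HOME:-$PWD}/miniconda3"               # assumed install location

if [ -f "$installer" ] && [ ! -e "$prefix" ]; then
    # -b: batch mode (accepts the license, no prompts); -p: install prefix.
    bash "$installer" -b -p "$prefix"
    # Batch mode does not touch shell startup files, so register conda explicitly.
    "$prefix/bin/conda" init bash
    install_status="installed"
else
    install_status="skipped"
fi
echo "Miniconda install: $install_status"
```
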

3.2 Creating Isolated Environments with Conda

Create an environment

conda create -n env # Create an isolated environment named env

Enter/Exit an environment

conda activate # Enter the base environment
conda activate env # Enter the env environment (the name is required because several environments may exist)
conda deactivate # Exit the current environment and return to the previously active one, usually base (no name is needed)

List environments

# List existing environments; both commands work
conda env list # List conda environments
conda info --envs # List environments via conda info

Pre-specifying environment dependencies

# Specify dependency versions when creating an environment
conda create -n py python=3.7 # Create an environment named py with Python pinned to version 3.7

Deleting environments

# Delete an environment together with its installed packages
conda remove -n env --all # Remember to include --all to delete the environment completely
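
In scripts it is useful to test whether an environment exists before creating or deleting it. The sketch below parses text in the format printed by conda env list; env_in_list is a hypothetical helper name, and the sample listing is made up so the example runs without conda installed.

```shell
# Hypothetical helper: read `conda env list`-style text on stdin and succeed
# if an environment with the given name is listed (comment lines are ignored).
env_in_list() {
    awk -v name="$1" '!/^#/ && $1 == name { found = 1 } END { exit !found }'
}

# Made-up sample of what `conda env list` typically prints.
sample_listing='# conda environments:
#
base                  *  /home/yf/miniconda3
py                       /home/yf/miniconda3/envs/py'

if printf '%s\n' "$sample_listing" | env_in_list py; then
    echo "environment py exists"
fi

# With conda installed, the same check works on live output:
#   conda env list | env_in_list py || conda create -n py -y
```
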

3.3 Using Conda to Install Bioinformatics Software

Direct method

conda create -n py
conda activate py
conda install -c conda-forge -c bioconda busco=5.6.0 -y # If installation succeeds, "done" is printed three times
busco --help # If the help text is displayed, the software was installed successfully; otherwise the installation failed and you can try installing again

Indirect method

Export the current environment:

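conda env export -n envname > env.yml # YAML export intended for use across Windows, Mac, and Linux; adding --from-history keeps only the packages you explicitly requested, which tends to transfer better between platforms
conda list --explicit > env.txt # Exact package list; only for the same platform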

Import environment:

conda env create --name <envname> --file env.yml # Rebuild from the YAML export
conda create --name <envname> --file env.txt # Rebuild from the explicit package list
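
An env.yml file can also be written by hand instead of exported. This sketch writes a minimal one for the busco example into a temporary directory; the layout (name, channels, dependencies) follows the usual conda environment file format, and the creation command is only printed so the block runs without conda.

```shell
workdir=$(mktemp -d)

# Minimal hand-written environment file for the busco example in this guide.
cat > "$workdir/env.yml" <<'EOF'
name: py
channels:
  - conda-forge
  - bioconda
dependencies:
  - busco=5.6.0
EOF

# Read back the environment name that conda would use.
yml_name=$(awk '/^name:/ { print $2 }' "$workdir/env.yml")
n_deps=$(grep -c '^  - busco=' "$workdir/env.yml")
echo "env.yml defines environment '$yml_name'"
echo "to build it: conda env create --file $workdir/env.yml"

rm -rf "$workdir"
```
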

Deleting software:

# If the -n parameter is not specified, you must enter the environment before performing the delete operation; similarly, you can use the -y parameter to skip the confirmation step
conda remove -n py busco -y

3.4 Mamba

Mamba is commonly installed through conda, but many users find this hard to get working; I therefore recommend uninstalling conda and installing mamba from scratch.

Important environments in conda can be exported using conda env export and then reconstructed with mamba, completing the migration.

Go to https://github.com/conda-forge/miniforge/releases/ to download the installation package

wget https://github.com/conda-forge/miniforge/releases/download/24.5.0-0/Mambaforge-24.5.0-0-Linux-x86_64.sh

Installation is similar to conda and is straightforward

bash Mambaforge-24.5.0-0-Linux-x86_64.sh

You can also use conda to install

$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ sh Miniconda3-latest-Linux-x86_64.sh
# setup channels
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge

$ conda install mamba -n base -c conda-forge  # "mamba" is much much faster than the "conda" command!

# It is a good idea to install MitoZ into an independent environment, i.e. 'mitozEnv' here!
$ mamba create -n mitozEnv -c bioconda -c conda-forge mitoz=3.6 # It's recommended to specify the version you want to install!

The command usage is also the same as conda; just replace conda with mamba, for example mamba install XXX. The mirror configuration also lives in .condarc. Similarly, the environment management concepts are the same as in conda, so I won’t elaborate further.

Compared to conda, mamba offers significant improvements in speed and dependency resolution, so it is highly recommended for managing bioinformatics environments.
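
Because the two tools share a command syntax, a script can prefer mamba and fall back to conda. This is a minimal sketch: the variable name pkg_tool is arbitrary, and the install command is only printed rather than executed.

```shell
# Pick mamba when it is on PATH, otherwise conda, otherwise report none.
if command -v mamba >/dev/null 2>&1; then
    pkg_tool=mamba
elif command -v conda >/dev/null 2>&1; then
    pkg_tool=conda
else
    pkg_tool=none
fi
echo "package tool: $pkg_tool"

if [ "$pkg_tool" != "none" ]; then
    # Same arguments work for either tool; printed here instead of run.
    echo "would run: $pkg_tool install -c conda-forge -c bioconda busco=5.6.0 -y"
fi
```
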