Understanding Essential Systems for Machine Learning: Linux (With Common & Advanced Commands)

Source: Machine Heart

This article is 3208 words, and it is recommended to read it in 5 minutes.

This article outlines the basic features of the Linux system and introduces some commonly used Linux commands to help you get started quickly.

Linux has gained favor among many developers due to its stability and has become the operating system for most servers. For machine learning developers, using a Mac/Linux system is almost a necessity. However, due to its steep learning curve, many people hesitate to dive in. This article will introduce some commonly used Linux commands to help you get started quickly.

As software systems continue to evolve, today, different operating systems cater to different user groups: Windows targets office and commercial users, Mac targets creative individuals, while Linux targets software developers. For operating system providers, this market segmentation greatly simplifies the investment in product technical requirements, user experience, and product direction. However, it also exacerbates compatibility issues, leading different businesses into narrow, incompatible domains: business professionals cannot provide insights into creativity, and developers cannot delve into business decisions.

In reality, knowledge and skills are fluid, crossing multiple disciplines and fields. The idea that “you can only excel at one thing” is better described as a form of premature optimization. Only after sampling a large number of tasks can you know what you excel at, and you might even discover that you excel at many of them.

For modern business analysts, bridging the gap between business and software is especially important. Business analysts must be “dual-platform”: able to use command-line tools available only on Linux (or OS X) while still benefiting from Microsoft Office on Windows. It is understandable that Linux can intimidate those with business degrees. Fortunately, as with most things, you only need to learn 20% of the material to accomplish 80% of the work. Here is my 20%.

Business analysis is data-driven, and machine learning is a powerful data analysis tool. The best environment for analyzing data using machine learning models is precisely the Linux system, not only because it supports a wide range of Python machine learning libraries but also due to the simplicity and clarity of environment configuration and management. Therefore, this article will outline the basic features and commands of the Linux system for machine learning readers.

Why Machine Learning Analysts Need to Understand Linux

Due to its open-source foundation, Linux continuously benefits from contributions by thousands of developers. The programs and tools they build not only simplify their work but also streamline the jobs of programmers who follow them. As a result, open-source development creates a network effect: the more developers build tools on the platform, the more other developers can instantly write their programs using these tools.

The result is an extensive suite of programs and utilities (collectively referred to as software) written for Linux, many of which were never designed to run on Windows. One example is the version control system (VCS) known as git. Its developers could have written it for Windows first, but they did not; they built it for the Linux command line because the ecosystem already provided all the necessary tools.

Specifically, there are two main issues with development on Windows:

  • Basic tasks, such as file parsing, job scheduling, and text searching, are more involved than simply running a command-line tool.

  • Programming languages (such as Python and C++) and their associated code libraries can trigger errors because they expect Linux-specific parameters or file system locations.

This means that if we want to develop on Windows, we need to spend more time rewriting basic tools that already exist in Linux and eliminating operating system compatibility errors. This is not surprising, since the Windows ecosystem was not originally designed with the needs of software development in mind.

With the case for developing on Linux made, let’s start with the basics.

The Basic Unit of Linux: “Shell”

The “shell” (also referred to as the terminal, console, or command line) is a text-based user interface through which commands are sent to the machine. On most Linux distributions, the default shell is bash. Unlike Windows users, who work primarily by clicking inside windows, Linux developers insist on using the keyboard to type commands into the shell. For those without a programming background, this transition may initially feel unnatural, but the benefits of developing in Linux easily outweigh the initial learning investment.

Learning a Few Important Concepts

Compared to full-fledged programming languages, bash requires learning only a few key concepts. Once this step is completed, learning bash is just a matter of memorization. More precisely, to master bash you only need to remember 20 to 30 commands and their most commonly used arguments.

For non-developers, Linux can be perplexing because developers seem to use obscure terminal commands effortlessly. The truth is, they only remember a small number of commands—when faced with more complex issues, they (like everyone else) also need to Google it.

Here are the main concepts in bash.

1. Command Syntax

  • Commands in bash are case-sensitive and follow the syntax {command} {arguments}.

For example, in ‘grep -inr’, grep is the command (it searches text for a string), and -inr is a group of flags, a kind of argument that changes grep’s default behavior. To find out what a command and its flags do, search online or run ‘man grep’. I recommend learning commands together with their most commonly used arguments; learning the function of each flag individually can be quite laborious.
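
As a minimal sketch of the command-plus-flags structure, the snippet below runs grep with the -i, -n, and -r flags against a scratch directory. The directory and file names are hypothetical and exist only for this demonstration.

```shell
# Hypothetical scratch directory and files, created only for this demonstration.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/notes"
printf 'Linux is stable\nwindows is common\n' > "$demo_dir/a.txt"
printf 'I use LINUX daily\n' > "$demo_dir/notes/b.txt"

# grep is the command. The flags are -i (ignore case), -n (show line numbers)
# and -r (search subdirectories). "linux" and the directory are plain arguments.
matches=$(grep -inr "linux" "$demo_dir")
echo "$matches"

# Count how many lines matched.
match_count=$(echo "$matches" | wc -l)
echo "matching lines: $match_count"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

Each output line has the form file:line_number:matching_text, which is what the -n and -r flags add to grep's default output.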

2. Relative Directory Addresses

  • Current directory: .

  • Parent directory: ..

  • User’s home directory: ~

  • Root directory of the file system: /

For example, to change from the current directory to the parent directory, you need to input: “cd ..”. Similarly, to copy the file located at “/path/to/file.txt” to the current directory, you need to input “cp /path/to/file.txt .” (note the dot at the end of the command). Here “.” and “..” are relative paths, while “/path/to/file.txt” is an absolute path; either kind can be used wherever a path is expected.
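
The path shortcuts above can be tried safely in a scratch directory. This is only a sketch: the project/data layout and file.txt are hypothetical names made up for the example.

```shell
# Hypothetical directory layout in a scratch location, for illustration only.
base=$(mktemp -d)
mkdir -p "$base/project/data"
echo "hello" > "$base/project/file.txt"

cd "$base/project/data"
start_dir=$(pwd)

# "." is the current directory: copy file.txt from the parent directory into it.
cp ../file.txt .
copied=$(cat ./file.txt)

# ".." is the parent directory: move up one level.
cd ..
parent_dir=$(pwd)

# "cd -" returns to the previous directory (its printed path is discarded here).
cd - > /dev/null
back_dir=$(pwd)

echo "started in:  $start_dir"
echo "parent:      $parent_dir"
echo "returned to: $back_dir"

# Leave the scratch directory before deleting it.
cd /
rm -rf "$base"
```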

3. Standard Input (STDIN) / Standard Output (STDOUT)

  • Anything typed into the terminal and submitted (by pressing ENTER) is referred to as standard input (STDIN).

  • Anything a program prints to the terminal (e.g., text from a file) is referred to as standard output (STDOUT).

4. Piping

  • |

A pipe takes the STDOUT of the command on its left and passes it as the STDIN of the command on its right.

For example: echo 'test text' | wc -l

  • >

The greater-than sign takes the STDOUT of the command on its left and writes it to the file on its right, creating the file if it does not exist and overwriting it if it does.

For example: ls > tmp.txt

  • >>

Two greater-than signs take the STDOUT of the command on their left and append it to the file on their right, creating the file if it does not exist.

For example: date >> tmp.txt
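
The three operators can be compared side by side in a short sketch. The scratch file comes from mktemp and the words written to it are arbitrary; nothing here is specific to the original examples beyond the operators themselves.

```shell
# Hypothetical scratch file, used only for this demonstration.
tmp_file=$(mktemp)

# "|" : the STDOUT of echo becomes the STDIN of wc, which counts lines.
line_count=$(echo 'test text' | wc -l)
echo "lines piped to wc: $line_count"

# ">" : creates the file or overwrites it, so only 'second' survives.
echo 'first' > "$tmp_file"
echo 'second' > "$tmp_file"

# ">>" : appends to the end of the file.
echo 'third' >> "$tmp_file"

contents=$(cat "$tmp_file")
echo "$contents"

rm -f "$tmp_file"
```

The file ends up holding two lines, second and third, because the second ">" discarded the earlier content while ">>" kept it.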

5. Wildcards

The * wildcard is similar to the % symbol in SQL; for example, “WHERE first_name LIKE 'John%'” matches all names starting with John.

In bash, the corresponding pattern is “John*”. If you want to list all files ending with “.json” in a folder, you can input: “ls *.json”.
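
A small sketch of wildcard expansion, assuming a scratch directory with a handful of made-up file names:

```shell
# Hypothetical file names in a scratch directory, for illustration only.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.json b.json notes.txt John_Smith.csv Johnny.csv

# "*.json" is expanded by the shell to every name ending in .json before ls runs.
json_files=$(ls *.json)
echo "$json_files"

# "John*" matches every name starting with John, much like LIKE 'John%' in SQL.
john_count=$(ls John* | wc -l)
echo "names starting with John: $john_count"

# Leave the scratch directory before deleting it.
cd /
rm -rf "$demo_dir"
```

Note that the shell, not ls, performs the expansion: ls simply receives the list of matching names as arguments.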

6. TAB Key Autocomplete

If we type the beginning of a command or file name and press the TAB key, bash will autocomplete it. Because it is hard to remember every command and its parameters, it is also worth considering shells such as zsh or fish, which offer richer autocompletion. In particular, they can suggest complete commands based on our command-line history.

7. Exit

Sometimes we get stuck in some programs and do not know how to exit them. This is a common problem among Linux beginners and can greatly diminish their enthusiasm. Generally, exit commands have some relation to the letter “q”, so remembering the following exit commands or shortcuts is very useful.

  • Bash: CTRL+c, q, or exit

  • Python: quit() or CTRL+d

  • Nano: CTRL+x

  • Vim: <Esc> :q!

Common Bash Commands

Below are the most commonly used commands in Linux, and remembering these commands is crucial for getting started quickly with a new system.

  • cd {directory}: Change the current directory

  • ls -lha: List directory files (detailed information)

  • vim or nano: Command line text editor

  • touch {file}: Create a new empty file

  • cp -R {original_name} {new_name}: Copy a file or directory (including all internal files)

  • mv {original_name} {new_name}: Move or rename a file

  • rm {file}: Delete a file

  • rm -rf {file/folder}: Permanently delete a file or folder (use with caution)

  • pwd: Print the current working directory

  • cat or less or tail or head -n10 {file}: Print the contents of a file to standard output

  • mkdir {directory}: Create an empty directory

  • grep -inr {string}: Search for a string in files in the current directory or subdirectories

  • column -s, -t <delimited_file>: Display a comma-separated file in columnar format

  • ssh {username}@{hostname}: Connect to a remote machine

  • tree -LhaC 3: Display the directory structure three levels deep (with file sizes and hidden files)

  • htop (or top): Task manager

  • pip install --user {pip_package}: Python package manager; installs packages under ~/.local (executables go in ~/.local/bin)

  • pushd . ; popd ; dirs ; cd -: Push, pop, and view directories on the directory stack, and return to the previous directory

  • sed -i "s/{find}/{replace}/g" {file}: Replace a string in a file

  • find . -type f -name '*.txt' -exec sed -i "s/{find}/{replace}/g" {} \;: Replace a string in files with a .txt extension in the current directory and subdirectories

  • tmux new -s session, tmux attach -t session: Create a named terminal session, or reattach to it, without opening a new window [Advanced command]

  • wget {link}: Download a webpage or web resource

  • curl -X POST -d "{key: value}" http://www.google.com: Send an HTTP request to a website server

  • find <directory>: Recursively list all directories and their contents
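
The two sed entries in the list above are the easiest to get wrong, so here is a hedged sketch that runs them on throwaway files. It assumes GNU sed, as found on most Linux distributions (BSD/macOS sed expects an extra argument after -i), and the file names and contents are hypothetical.

```shell
# Hypothetical scratch files; names and contents are made up for this example.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
echo "red apple, red car" > "$demo_dir/top.txt"
echo "red door" > "$demo_dir/sub/nested.txt"
echo "red flag" > "$demo_dir/skip.md"

# Replace a string in a single file, in place (assumes GNU sed).
sed -i "s/apple/pear/g" "$demo_dir/top.txt"

# Replace a string in every .txt file under the directory, including subdirectories.
# find substitutes each matching path for {} and runs sed once per file.
find "$demo_dir" -type f -name '*.txt' -exec sed -i "s/red/blue/g" {} \;

top=$(cat "$demo_dir/top.txt")
nested=$(cat "$demo_dir/sub/nested.txt")
skipped=$(cat "$demo_dir/skip.md")
echo "top.txt:    $top"
echo "nested.txt: $nested"
echo "skip.md:    $skipped"

rm -rf "$demo_dir"
```

The .md file is left untouched because the -name '*.txt' test filters it out before sed is ever invoked.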

Advanced & Less Common Commands

It is also necessary to keep a useful command list for rare situations, even if these situations do not occur often (like a process blocking several network ports). Below we will list a few less common commands:

  • lsof -i :8080: List open files belonging to network connections on port 8080 (-i selects Internet sockets)

  • netstat | head -n20: List currently open Internet/UNIX sockets and related information

  • dstat -a: Output current disk, network, CPU activity, etc.

  • nslookup <IP address>: Find the hostname for a remote IP address

  • strace -f -e <syscall> <cmd>: Trace a program’s system calls (-e flag is used to filter certain system calls)

  • ps aux | head -n20: Output currently active processes

  • file <file>: Check file type (e.g., executable file, binary file, ASCII text file)

  • uname -a: Kernel information

  • lsb_release -a: System information

  • hostname: View your machine’s hostname (the name that other computers can search for)

  • pstree: Visualize the process tree (parent and child processes)

  • time <cmd>: Execute a command and report the time taken

  • CTRL + z ; bg ; jobs ; fg: Suspend the current foreground process, resume it in the background, list jobs, and bring a job back to the foreground

  • cat file.txt | xargs -n1 | sort | uniq -c: Count how many times each unique word appears in a file

  • wc -l <file>: Count the number of lines in a file

  • du -ha: Show the size of directories and their contents on disk

  • zcat <file.gz>: Display the contents of a compressed text file

  • scp <user@remote_host>:<remote_path> <local_path>: Copy a file from a remote server to the local machine (swap the two arguments to copy in the other direction)

  • man {command}: Show the manual (documentation) for a command, but usually not as useful as Google search
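
A few of the text-processing entries above can be combined in one short sketch. The words.txt file and its contents are hypothetical, and zcat is assumed to be the gzip-provided version that ships with most Linux distributions.

```shell
# Hypothetical scratch file with a few repeated words, for illustration only.
work_dir=$(mktemp -d)
printf 'apple banana apple\ncherry banana apple\n' > "$work_dir/words.txt"

# xargs -n1 prints one word per line, sort groups identical words together,
# and uniq -c prefixes each distinct word with its number of occurrences.
word_counts=$(cat "$work_dir/words.txt" | xargs -n1 | sort | uniq -c)
echo "$word_counts"

# Count the number of lines in the file.
line_count=$(wc -l < "$work_dir/words.txt")
echo "lines: $line_count"

# Compress a copy of the file, then read it back without unpacking it to disk.
gzip -c "$work_dir/words.txt" > "$work_dir/words.txt.gz"
first_line=$(zcat "$work_dir/words.txt.gz" | head -n1)
echo "first line of compressed file: $first_line"

rm -rf "$work_dir"
```

The sort step matters: uniq only merges adjacent duplicate lines, so unsorted input would produce several partial counts for the same word.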

Editor: Wenjing

Proofreader: Wang Hongyu
