Linux Shell | Master Core Commands for Cluster Monitoring and Troubleshooting in 5 Minutes


In a Linux cluster environment, quickly identifying issues and monitoring system performance are core skills for operations personnel. This article compiles 17 high-frequency commands covering system monitoring, process management, resource analysis, and more to help you troubleshoot efficiently.

1. Overall System Monitoring

1. top/htop

Real-time view of CPU, memory, and process status, supporting dynamic sorting.

top: Press P to sort by CPU usage, M to sort by memory usage.

htop: Interactive interface, supports tree view of processes (requires installation).

Applicable scenario: Quickly locate high-load processes.

2. vmstat

Monitor virtual memory, disk I/O, and CPU status.

vmstat 2 5  # Refresh every 2 seconds, for a total of 5 times

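Focus on us (user CPU time), wa (CPU time spent waiting for I/O), and free (idle memory). The first sample reports averages since boot, so judge the current state from the later samples.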
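Reading these columns by eye gets tedious when comparing several nodes. A minimal sketch, assuming the standard vmstat header row (the one that starts with r), averages a single column looked up by name. The helper name vmstat_avg is hypothetical and not part of vmstat.

```shell
# vmstat_avg is a hypothetical helper: average one vmstat column, chosen by header name.
# The first data row is skipped because vmstat reports averages since boot there.
vmstat_avg() {
  awk -v col="$1" '
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }  # header row
    idx && $1 ~ /^[0-9]+$/ { if (seen++) { sum += $idx; n++ } }           # data rows
    END { if (n) printf "%.1f\n", sum / n; else exit 1 }
  '
}

# Usage: vmstat 2 5 | vmstat_avg wa
```

Looking the column up by name rather than by position keeps the sketch working if a vmstat version adds or reorders columns.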

3. iostat

Analyze disk I/O performance and troubleshoot slow disk issues.

iostat -x 2  # Display extended statistics, including %util (disk utilization)
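To pick out saturated devices from the extended report, the %util column can be filtered against a threshold. This is a sketch only: busy_disks is a hypothetical helper, the 80% default is an arbitrary assumption, and it relies on the header row beginning with "Device" as in current sysstat releases.

```shell
# busy_disks is a hypothetical helper: print devices whose %util exceeds a limit.
busy_disks() {
  awk -v limit="${1:-80}" '
    /^avg-cpu/ { u = 0; next }                                             # CPU block: ignore its rows
    $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") u = i; next }
    u && NF >= u && $u + 0 > limit { print $1, $u }                        # device rows above the limit
  '
}

# Usage: iostat -x 2 3 | busy_disks 80
```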

2. Process and Thread Troubleshooting

1. ps checks process status; the aux options list every process along with its owner, CPU, and memory usage.

ps aux | grep '[n]ginx'  # Find nginx processes (the brackets keep grep from matching its own command line)

2. pstree displays process relationships in a tree structure, which helps trace a zombie process back to its parent.

pstree -p  # Display process PID

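3. pkill terminates processes matched by name or user (-u). By default it sends SIGTERM, which lets the process clean up; -9 sends SIGKILL, which cannot be caught, so treat it as a last resort.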

pkill -9 java  # Forcefully terminate all java processes
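A gentler pattern than going straight to -9 is to send SIGTERM first and escalate only if the process is still alive after a grace period. The sketch below shows this for a single PID using the shell's built-in kill; the same escalation applies to pkill when matching by name. The function name stop_gracefully and the 10-second default are assumptions for illustration.

```shell
# stop_gracefully is a hypothetical helper: SIGTERM first, SIGKILL after a timeout.
stop_gracefully() {
  pid="$1"
  timeout="${2:-10}"                            # grace period in seconds (assumed default)
  kill -TERM "$pid" 2>/dev/null || return 0     # nothing to do if the process is already gone
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0      # process exited after SIGTERM
    sleep 1
    timeout=$((timeout - 1))
  done
  kill -KILL "$pid" 2>/dev/null                 # still alive: force termination
}

# Usage: stop_gracefully 1234 15
```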

3. Memory Analysis

1. free checks memory usage, focusing on the available field (available memory).

free -h  # Show memory usage in human-readable units
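For alerting, the available figure is easier to use as a percentage of total memory. A minimal sketch, assuming a version of free that prints the available column; mem_available_pct is a hypothetical helper name. It reads plain free output (without -h) so the values stay numeric.

```shell
# mem_available_pct is a hypothetical helper: available memory as a percentage of total.
mem_available_pct() {
  awk '
    $1 == "total" { for (i = 1; i <= NF; i++) if ($i == "available") a = i + 1 }  # data rows carry a leading label
    $1 == "Mem:" && a { printf "%d\n", $a * 100 / $2 }
  '
}

# Usage: free | mem_available_pct
```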

2. vmstat with the -m parameter shows slab allocator (kernel object cache) statistics.

vmstat -m  # Spot kernel caches that keep growing (usually requires root)

3. pmap analyzes process memory mapping to troubleshoot memory leaks.

pmap -x 1234  # View memory distribution of process with PID=1234
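A leak usually shows up as resident memory that keeps growing between samples. As a sketch, the total line that pmap -x prints last can be reduced to a single RSS number and logged over time. pmap_rss_kb is a hypothetical helper name, and the field position assumes the procps-ng layout (total kB, then Kbytes, RSS, Dirty).

```shell
# pmap_rss_kb is a hypothetical helper: extract total RSS (in kB) from `pmap -x` output.
pmap_rss_kb() {
  awk '$1 == "total" { print $4 }'
}

# Usage (sample every 60 seconds):
#   while sleep 60; do echo "$(date +%T) $(pmap -x 1234 | pmap_rss_kb)"; done
```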

4. Disk and Network Diagnostics

1. df/du checks disk space and locates large files.

df -h /data  # View usage of /data partition
du -sh * | sort -h  # Sort current directory sizes
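To flag nearly full filesystems automatically, the POSIX output format (df -P) is easier to parse than -h because each filesystem stays on one line. A minimal sketch; full_filesystems is a hypothetical helper, the 90% default is an assumption, and mount points containing spaces are not handled.

```shell
# full_filesystems is a hypothetical helper: list mount points at or above a usage limit.
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)                       # "95%" -> "95"
      if (use + 0 >= limit) print $6, $5      # mount point and usage
    }
  '
}

# Usage: df -P | full_filesystems 90
```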

2. netstat/ss monitors network connections and troubleshoots port usage.

netstat -antp | grep ':80 '  # View port 80 usage (matching ':80 ' avoids hits on 8080 or PIDs containing 80)
ss -s  # Summary of socket counts by protocol and TCP state
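A per-state connection count often reveals problems such as a pile-up of TIME-WAIT or CLOSE-WAIT sockets. The sketch below tallies the first column of ss -ant output; tcp_state_counts is a hypothetical helper name.

```shell
# tcp_state_counts is a hypothetical helper: count TCP sockets per state.
tcp_state_counts() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage: ss -ant | tcp_state_counts
```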

3. nmap checks which cluster nodes are alive.

nmap -sn 192.168.1.0/24  # Ping scan for live hosts in the subnet (older nmap versions call this option -sP)

5. Log and Performance Tuning

1. grep/awk for quick filtering and statistics of logs.

grep "ERROR" /var/log/syslog | awk '{print $1, $2, $3}'  # Extract error timestamps (month, day, time in the traditional syslog format)
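Beyond extracting fields, the same grep/awk pairing can produce quick statistics. As a sketch, the function below counts ERROR lines per hour, assuming the traditional syslog layout where the first three fields are month, day, and time. errors_per_hour is a hypothetical helper name.

```shell
# errors_per_hour is a hypothetical helper: count ERROR lines per hour, busiest hour first.
errors_per_hour() {
  grep "ERROR" | awk '
    { split($3, t, ":"); n[$1 " " $2 " " t[1] ":00"]++ }   # key: month day hour
    END { for (k in n) print n[k], k }
  ' | sort -rn
}

# Usage: errors_per_hour < /var/log/syslog
```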

2. sar for historical performance analysis, supporting multi-dimensional backtracking of CPU, memory, and I/O.

sar -u 1 5  # Sample CPU usage once per second, 5 times; omit the interval to read today's recorded history
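For backtracking, sar can read a saved daily file with -f, and the report can then be filtered for the intervals when the CPU was busiest. This is a sketch under assumptions: busy_intervals is a hypothetical helper, the 20% idle threshold is arbitrary, and the data file location varies by distribution (the path in the usage comment is only an example).

```shell
# busy_intervals is a hypothetical helper: print sar -u samples whose %idle is below a limit.
busy_intervals() {
  awk -v limit="${1:-20}" '
    /%idle/ { for (i = 1; i <= NF; i++) if ($i == "%idle") c = i; next }     # header row
    c && NF >= c && $1 != "Average:" && $c + 0 < limit { print $1, $c }      # time and %idle
  '
}

# Usage (example path, adjust for your distribution):
#   sar -u -f /var/log/sa/sa15 | busy_intervals 20
```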

3. strace traces process system calls to locate blocking issues.

strace -p 1234  # Trace system calls of process with PID=1234

6. Cluster-Specific Tools

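1. parallel-ssh runs a command on many nodes in parallel, which makes it easy to check them all at once.

parallel-ssh -h hosts.txt -l root -A -i -e /tmp/pssh-errors -p 10 "df -h"

-h hosts.txt: host list file
-l root: remote username
-A: prompt for a password (omit when using key authentication)
-i: print each host's output inline as that host completes
-e /tmp/pssh-errors: directory where each host's standard error is saved
-p 10: number of concurrent connections
"df -h": command to execute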

The above command executes df -h on all hosts listed in hosts.txt to check disk usage across nodes.
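With many nodes, the hosts that failed matter more than the ones that succeeded. A minimal sketch, assuming the status lines parallel-ssh prints per host look like "[2] 10:45:34 [FAILURE] node2 ..."; failed_hosts is a hypothetical helper name.

```shell
# failed_hosts is a hypothetical helper: list hosts that parallel-ssh reported as failed.
# Assumes status lines of the form: [N] HH:MM:SS [SUCCESS|FAILURE] host ...
failed_hosts() {
  awk '$1 ~ /^\[[0-9]+\]$/ && $3 == "[FAILURE]" { print $4 }'
}

# Usage: parallel-ssh -h hosts.txt -l root -i -p 10 "df -h" | failed_hosts
```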

2. collectl centrally collects performance data from cluster nodes.

collectl -scm -i10 -c5  # Monitor CPU (c) and memory (m): 5 samples at 10-second intervals
