In a Linux cluster environment, quickly pinpointing problems and monitoring system performance are core skills for operations engineers. This article compiles 17 frequently used commands covering system monitoring, process management, resource analysis, and more to help you work efficiently.
1. Overall System Monitoring
1. top/htop
Real-time view of CPU, memory, and process status, supporting dynamic sorting.
top: Sort by CPU usage with P, by memory usage with M.
htop: Interactive interface with a tree view of processes (requires installation).
Applicable scenario: Quickly locate high-load processes.
2. vmstat
Monitor virtual memory, disk I/O, and CPU status.
vmstat 2 5 # Refresh every 2 seconds, for a total of 5 times
Focus on us (user CPU time), wa (CPU time spent waiting for I/O), and free (idle memory).
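As a rough sketch of how these columns could be watched automatically, the function below reads vmstat output and prints only the samples whose wa value exceeds a threshold. The function name high_iowait and the 20% default are illustrative choices, not part of vmstat.

```shell
# Print vmstat samples whose I/O wait exceeds a threshold (default 20%).
# Column positions are looked up from the header row rather than hard-coded.
high_iowait() {
    awk -v limit="${1:-20}" '
        $1 == "r" {                      # header row: r b swpd free ...
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("wa" in col) && $1 ~ /^[0-9]+$/ && $(col["wa"]) + 0 > limit {
            printf "wa=%s us=%s free=%s\n", $(col["wa"]), $(col["us"]), $(col["free"])
        }
    '
}
# Usage: vmstat 2 5 | high_iowait 20
```

Note that the first sample vmstat prints is an average since boot, so a high value there does not necessarily reflect the current load.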
3. iostat
Analyze disk I/O performance and troubleshoot slow disk issues.
iostat -x 2 # Display extended statistics, including %util (disk utilization)
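To pick out saturated disks from the extended report, a small filter along the following lines may help. It is a sketch that assumes the report has a header row starting with Device and a %util column; the function name busy_disks and the 80% default are hypothetical.

```shell
# Print devices whose %util exceeds a threshold (default 80%).
# The %util column is located from the "Device" header row, because its
# position differs between sysstat versions.
busy_disks() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            next
        }
        u && NF >= u && $u + 0 > limit { print $1, $u }
    '
}
# Usage: iostat -x 2 3 | busy_disks 80
```

A device that stays close to 100% in %util across several samples is a likely bottleneck; a single spike is usually less meaningful.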
2. Process and Thread Troubleshooting
1. Use ps to check process status, combining the aux parameter to get complete information.
ps aux | grep nginx # Find nginx process
2. pstree displays process relationships in a tree structure, making it easy to trace the parent of a zombie process.
pstree -p # Display process PID
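pstree shows the hierarchy but not process states, so one possible approach is to find the zombies with ps first and then inspect their parent in the tree. The sketch below assumes input in the format produced by ps -eo pid,ppid,stat,comm; the function name list_zombies is illustrative.

```shell
# Read `ps -eo pid,ppid,stat,comm` output on stdin and list zombie processes
# (state starting with Z) together with the parent PID that has not reaped them.
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { printf "zombie pid=%s ppid=%s cmd=%s\n", $1, $2, $4 }'
}
# Usage: ps -eo pid,ppid,stat,comm | list_zombies
```

Running pstree -p on the reported ppid then shows which service is leaving the zombies behind.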
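3. pkill terminates processes by name or by user. It sends SIGTERM by default, which lets a process shut down cleanly; reserve -9 (SIGKILL) for processes that do not respond.
pkill -9 java # Forcefully terminate every process whose name matches java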
3. Memory Analysis
1. free checks memory usage, focusing on the available field (available memory).
free -h # Display memory usage in human-readable units
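For scripted checks, the available field can be turned into a percentage of total memory. The following is a minimal sketch that assumes a version of free with an available column and numeric output (for example free -m rather than free -h, which adds unit suffixes); the function name mem_available_pct is hypothetical.

```shell
# Read `free -m` output on stdin and print available memory as a whole-number
# percentage of total memory. The header has no label for the first column,
# so the data column is one position to the right of the header column.
mem_available_pct() {
    awk '
        /available/ {
            for (i = 1; i <= NF; i++) if ($i == "available") a = i + 1
            next
        }
        $1 == "Mem:" && a { printf "%d\n", $a * 100 / $2 }
    '
}
# Usage: free -m | mem_available_pct
```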
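2. vmstat with the -m parameter shows slab allocator statistics (the kernel object caches).
vmstat -m # Display slab cache usage (usually requires root); a steadily growing cache can point to a kernel memory leak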
3. pmap analyzes process memory mapping to troubleshoot memory leaks.
pmap -x 1234 # View memory distribution of process with PID=1234
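When a process has hundreds of mappings, sorting them by resident size makes the large ones stand out. This sketch assumes the usual pmap -x column order (Address, Kbytes, RSS, Dirty, Mode, Mapping); the function name top_mappings is illustrative.

```shell
# Read `pmap -x PID` output on stdin and print the N mappings with the
# largest RSS (default 5) as: RSS-in-kB  address  mapping-name.
top_mappings() {
    awk '$1 ~ /^[0-9a-f]+$/ && $3 ~ /^[0-9]+$/ {
        name = $6
        for (i = 7; i <= NF; i++) name = name " " $i   # e.g. "[ anon ]"
        print $3, $1, name
    }' | sort -k1,1nr | head -n "${1:-5}"
}
# Usage: pmap -x 1234 | top_mappings 10
```

An anonymous mapping whose RSS keeps growing between runs is a common sign of a heap leak in the process.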
4. Disk and Network Diagnostics
1. df/du checks disk space and locates large files.
df -h /data # View usage of the /data partition
du -sh * | sort -h # Sort the sizes of entries in the current directory
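The same df data can drive a simple fullness check. The sketch below assumes POSIX-format output (df -P), where the fifth column is the usage percentage and the sixth is the mount point; the function name full_filesystems and the 80% default are illustrative, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print mount points whose usage is at or
# above a threshold (default 80%).
full_filesystems() {
    awk -v limit="${1:-80}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)               # strip the trailing percent sign
        if (use + 0 >= limit) print $6, $5
    }'
}
# Usage: df -P | full_filesystems 90
```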
2. netstat/ss monitors network connections and troubleshoots port usage.
netstat -antp | grep ':80 ' # View connections on port 80 (run as root to see process names)
ss -s # Summary of socket counts by protocol and state
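For a per-state breakdown of TCP connections, the output of ss -tan can be tallied with awk. This is a sketch under the assumption that the first column of each row is the connection state; the function name count_tcp_states is hypothetical.

```shell
# Read `ss -tan` output on stdin and count connections per TCP state,
# most frequent first.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -k1,1nr -k2,2
}
# Usage: ss -tan | count_tcp_states
```

A large TIME-WAIT or CLOSE-WAIT count relative to ESTAB is often the first hint of a connection-handling problem.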
3. nmap scans the cluster subnet to check which nodes are alive.
nmap -sn 192.168.1.0/24 # Ping-scan the subnet for live hosts without port scanning (older nmap versions call this option -sP)
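To feed the scan result into other scripts, nmap's grepable output (-oG -) is easier to parse than the default report. The sketch below assumes that format, in which each live host appears on a line beginning with Host: and containing Status: Up; it uses -sn, the current name for the ping scan. The function name live_hosts is illustrative.

```shell
# Read nmap grepable output on stdin and print the address of every host
# reported as up.
live_hosts() {
    awk '$1 == "Host:" && /Status: Up/ { print $2 }'
}
# Usage: nmap -sn -oG - 192.168.1.0/24 | live_hosts
```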
5. Log and Performance Tuning
1. grep/awk for quick filtering and statistics of logs.
grep "ERROR" /var/log/syslog | awk '{print $1, $2, $3}' # Extract error timestamps (month, day, time in the traditional syslog format)
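Going one step further, awk can aggregate the matches instead of just printing them. The following sketch counts ERROR lines per hour, assuming the traditional syslog layout in which the first three fields are month, day, and HH:MM:SS (systems configured for ISO 8601 timestamps would need a different key). The function name errors_per_hour is hypothetical.

```shell
# Count lines containing ERROR per hour, busiest hour first.
# Reads the files given as arguments, or stdin when none are given.
errors_per_hour() {
    awk '/ERROR/ { n[$1 " " $2 " " substr($3, 1, 2) ":00"]++ }
         END { for (h in n) print n[h], h }' "$@" |
        sort -k1,1nr -k2
}
# Usage: errors_per_hour /var/log/syslog
```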
2. sar for performance analysis, both live and historical (from the daily data files collected by sysstat), covering CPU, memory, and I/O.
sar -u 1 5 # Sample CPU usage every 1 second, 5 times
3. strace traces process system calls to locate blocking issues.
strace -p 1234 # Trace system calls of process with PID=1234
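When the goal is to find where a process blocks, strace's -T option, which appends the time spent in each call as <seconds>, gives something concrete to filter on. The sketch below keeps only calls that took longer than a threshold; the function name slow_syscalls and the 0.5 second default are illustrative.

```shell
# Read `strace -T` output on stdin and print calls that took at least
# `limit` seconds (default 0.5), prefixed with the duration.
slow_syscalls() {
    awk -v limit="${1:-0.5}" '
        match($0, /<[0-9]+\.[0-9]+>$/) {
            secs = substr($0, RSTART + 1, RLENGTH - 2)   # text between < and >
            if (secs + 0 >= limit) print secs, $0
        }
    '
}
# Usage: strace -T -p 1234 2>&1 | slow_syscalls 0.5
```

strace writes its trace to standard error, hence the 2>&1 in the usage line.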
6. Cluster-Specific Tools
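1. parallel-ssh executes a command on many nodes at once, so the whole cluster can be checked in a single step.
parallel-ssh -h hosts.txt -l root -A -i -e /tmp/pssh-err -p 10 "df -h"
-h hosts.txt: Host list file
-l root: Remote username
-A: Prompt for a password (omit when using key-based authentication)
-i: Print each host's output inline as it completes
-e /tmp/pssh-err: Directory in which each host's standard error is saved
-p 10: Maximum number of concurrent connections
"df -h": Command to execute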
The above command executes df -h on all hosts listed in hosts.txt to check disk usage across nodes.
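On a machine where parallel-ssh is not installed, a plain loop over the same host list gives a similar result, only sequentially. This is a rough sketch rather than a replacement: it assumes hosts.txt holds one host per line, relies on key-based authentication, and uses a hypothetical function name run_on_hosts plus an SSH_CMD variable introduced here only so the ssh client can be swapped out.

```shell
# Run a command on every host listed in a file, one host at a time.
# Blank lines and lines starting with # are skipped.
run_on_hosts() {
    hosts_file=$1
    remote_cmd=$2
    while IFS= read -r host; do
        case $host in ''|'#'*) continue ;; esac
        printf '== %s ==\n' "$host"
        # -n keeps ssh from consuming the rest of the host list on stdin;
        # BatchMode fails fast instead of prompting for a password.
        "${SSH_CMD:-ssh}" -n -o BatchMode=yes -o ConnectTimeout=5 "$host" "$remote_cmd" ||
            printf 'FAILED: %s\n' "$host" >&2
    done < "$hosts_file"
}
# Usage: run_on_hosts hosts.txt "df -h"
```

For more than a handful of nodes the concurrent execution of parallel-ssh is much faster, which is the main reason to prefer it.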
2. collectl centrally collects performance data from cluster nodes.
collectl -scm -i10 -c5 # Sample CPU and memory every 10 seconds, 5 times