Guide to Troubleshooting High CPU Usage in Linux When No High Usage Applications Are Found
In Linux systems, we often encounter a “strange” situation: when using conventional tools like top or ps, the CPU usage remains persistently high (even spiking to 100%), yet we cannot find the corresponding application that is consuming high CPU resources. When refreshing the process list, all processes show a normal % CPU usage, as if the high CPU consumption has “vanished into thin air”.
In fact, these issues mostly stem from short-lived processes: they run for a very short time (possibly only milliseconds) or frequently restart after crashing, making it difficult for conventional tools to capture their running state. This article will address two typical scenarios of short-lived processes and teach you how to use professional tools to locate the root of the problem and completely resolve the “invisible CPU consumption”.
1. Understanding the Nature of Two Types of “Invisible” High CPU Processes
The core reason conventional tools cannot capture high CPU applications is that the processes have either “too short a lifespan” or “too rapid state transitions”. Specifically, they can be divided into the following two scenarios:
1. Scenario One: Applications Calling Short-Lived Binary Programs
Many business applications (such as backend services and scheduled task scripts) will indirectly call other binary programs during execution (such as curl, grep, awk, or custom lightweight tools). These called programs typically perform a single task (like initiating network requests or processing text data) and run for a very short time (possibly only tens of milliseconds), leading to:
- When monitoring in real time with top, the process list refreshes only every 3 seconds by default, so a short-lived program can start and exit between two refreshes without ever being displayed;
- When viewing statically with ps, you get only a snapshot of the processes at the moment of execution, which will likely miss a short-lived program while it is running;
- If the call frequency is extremely high (e.g., dozens of times per second), the CPU consumption of numerous short-lived processes can “stack up”, causing the overall CPU usage of the system to spike, while the consumption of each individual process remains hard to detect due to its “quick in and out” nature.
Typical Case: A Python backend service calls curl 10 times per second using the subprocess module to fetch external interface data. The CPU usage of a single curl process is only 0.5%, but the frequency of 10 calls per second causes the system’s CPU usage to remain above 80%, while conventional top monitoring fails to identify any high-usage processes.
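Before reaching for tracing tools, a quick sanity check for this scenario is to watch how fast the kernel is creating new tasks. The following is a minimal sketch, assuming a Linux /proc/stat that contains a "processes" line (the cumulative number of forks since boot, which also counts thread creation). The function names forks_total and fork_rate are illustrative, not part of any standard tool.

```shell
# Print the cumulative number of forks since boot, read from the
# "processes" line of a /proc/stat-style file (default: /proc/stat).
forks_total() {
    awk '$1 == "processes" { print $2 }' "${1:-/proc/stat}"
}

# Print forks per second, given two counter readings and the number of
# seconds between them (integer division; illustrative helper).
fork_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Example use on a live system (5-second window):
#   a=$(forks_total); sleep 5; b=$(forks_total); fork_rate "$a" "$b" 5
```

A sustained rate of dozens of forks per second on a system whose process list looks idle is a hint, not proof, that short-lived processes are responsible for the CPU usage.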
2. Scenario Two: Application Crashes and Restarts Leading to Stacked CPU Consumption
Some applications may enter a **“crash – restart – crash again” cycle** due to code defects (such as memory leaks or configuration errors) or insufficient resources. During the startup process, applications typically need to perform resource initialization (such as loading configuration files, initializing database connections, preloading caches), which consumes a certain amount of CPU resources; if the restart frequency is extremely high (e.g., restarting 1-2 times per second), the CPU consumption during the startup phase will continue to accumulate, leading to increased overall CPU usage.
The insidiousness of this type of problem lies in:
- After an application crashes, it exits quickly, and the new process gets a different PID after the restart, making it difficult to recognize “multiple restarts of the same application” in top;
- The CPU consumption of a single startup may not be high (e.g., 5%-10%), but high-frequency restarts make the startup phases run back to back, so the system appears to be under constant high load;
- If the application is started with “automatic restart logic” (such as the systemd Restart=always setting), the “crash – restart” cycle continues indefinitely and masks the underlying problem.
Typical Case: A Java application crashes due to a configuration file error, triggering a null pointer exception within 1 second of startup, and systemd is configured for automatic restarts, causing the application to restart once per second. Each startup consumes 8% CPU for loading classes and initializing the Spring container, leading to the system’s CPU usage remaining around 70%, while the % CPU of the Java process in top always shows “0.3%” (only capturing the low-load state after startup).
2. Practical Tools: Accurately Capturing Short-Lived Processes and Their Parent Processes
For the above two scenarios, conventional tools (top, ps) are insufficient, and specialized tools that capture “process creation / exit” or “system calls” are needed to investigate from the “process” rather than the “result”.
1. Step One: Use execsnoop to Capture the Creation and Execution of Short-Lived Processes
execsnoop is an eBPF-based tracing tool shipped with the bcc and bpftrace toolkits (it is not part of the kernel itself). It traces exec() system calls (the core system call for starting a program) across the whole system in real time. No matter how briefly a process runs, execsnoop records its command line, PID, and parent process PID, plus a timestamp when requested, making it a “weapon” for troubleshooting short-lived processes.
(1) Install Dependency Tools
execsnoop ships with both bcc and bpftrace, so install one of the two packages first:
- CentOS/RHEL:
# Install bpftrace (recommended, more comprehensive functionality)
yum install -y bpftrace
# Or install bcc (the tools are typically installed under /usr/share/bcc/tools, which may not be on PATH)
yum install -y bcc bcc-tools
- Ubuntu/Debian:
apt install -y bpftrace
# Or install bcc (the package is named bpfcc-tools, and the command is execsnoop-bpfcc)
apt install -y bpfcc-tools
(2) Use execsnoop to Track Short-Lived Processes
Run execsnoop directly (requires root privileges) to capture all process startup information in real-time:
# Run execsnoop; it prints one line per exec() as it happens (stop with Ctrl-C)
execsnoop
(3) Key Output Field Interpretation
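The default output of the bcc version of execsnoop contains the following columns (column sets differ slightly between versions and between the bcc and bpftrace variants):
| Field | Meaning |
|---|---|
| PCOMM | Command name of the newly executed process |
| PID | ID of the newly executed (short-lived) process |
| PPID | Parent process ID (the PID of the application calling the short-lived process, which is the core clue for troubleshooting) |
| RET | Return value of exec() (0 means the program started; failed exec() calls are shown only with the -x option) |
| ARGS | Complete execution command of the short-lived process (e.g., curl http://example.com, /usr/local/bin/process_data) |
execsnoop itself does not report a process's exit code or runtime. To obtain them, run bcc's exitsnoop alongside it: its EXIT_CODE column shows how each process ended (0 indicates normal exit, non-0 may indicate a crash), and its AGE(s) column shows how long the process lived, in seconds.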
(4) Scenario One Investigation: Locating High-Frequency Calls to Short-Lived Programs
If you suspect that “the application calls short-lived binary programs”, after running execsnoop, focus on:
- High-Frequency Occurrences of ARGS: If a command (like curl or grep) appears multiple times per second, its call frequency is too high; record the corresponding PPID (parent process ID);
- Long-Running Short-Lived Processes: If each run of a short-lived process lasts more than about 100 milliseconds (execsnoop does not report runtime; the AGE column of bcc's exitsnoop does) and the call frequency is also high, its CPU consumption will be more significant.
Operation Example:
Suppose the execsnoop output shows the command curl http://api.example.com/data appearing 10 times per second with PPID 1234. Check which application the parent process belongs to using ps -p 1234 -o cmd:
# View the command of the parent process with PPID=1234
ps -p 1234 -o cmd
# Output example: /usr/bin/python3 /opt/service/backend.py (the Python backend service calling curl)
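Spotting a repeated command by eye gets difficult once the output scrolls quickly. The sketch below counts exec events per parent PID and command name from captured execsnoop text. It assumes the default bcc column layout (PCOMM PID PPID RET ARGS) and command names without spaces; count_execs is an illustrative name.

```shell
# Count exec events per (PPID, command name) from execsnoop-style text on
# stdin and print "count ppid command", most frequent first.
# Assumed column layout (bcc default): PCOMM PID PPID RET ARGS...
count_execs() {
    awk '$1 != "PCOMM" && NF >= 4 { n[$3 " " $1]++ }
         END { for (k in n) print n[k], k }' | sort -k1,1nr -k2,2n
}

# Example use (root required): capture to a file, stop with Ctrl-C after a
# while, then summarize:
#   execsnoop > /tmp/exec.log
#   count_execs < /tmp/exec.log | head
```

The top lines of the summary give the PPID to feed into ps -p, as in the example above.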
2. Step Two: Use pstree and systemctl to Locate Crashing and Restarting Applications
If you suspect “the application is crashing and restarting”, you need to combine the parent process tree and process management tools (like systemd) to identify “frequently restarting applications”.
(1) Use pstree to View the Process Tree and Locate Parent Process Relationships
pstree can display the parent-child relationships between processes in a tree structure, helping us find stable parent processes through “frequently changing child processes” (usually process management tools or startup scripts):
# View all process trees, focusing on branches with frequently changing PIDs
pstree -p
# Or specify the investigation direction, such as viewing child processes of PID=1234 (suspected parent process)
pstree -p 1234
Key Tips:
- If the child PIDs under a given parent process (for example, systemd or one of its children) change frequently, the child process is restarting frequently;
- Combine with the watch command to observe process tree changes in real time: watch -n 1 pstree -p. If a branch’s PID changes every second, the application is likely crashing and restarting.
(2) Use systemctl to Check Service Restart Status
If the application is managed by systemd (the default on mainstream Linux distributions), systemctl can quickly reveal services that are crashing and restarting:
# List all services and keep those that are currently failed or waiting to be restarted (SUB state auto-restart)
systemctl list-units --type=service --all | grep -E 'restart|failed'
# View detailed status of a specific service (e.g., xxx.service), including its state, start time, and recent log lines
systemctl status xxx.service
Key Output Interpretation:
- If the journal repeatedly shows restart messages for xxx.service (recent systemd versions log a line such as “Scheduled restart job, restart counter is at N”) and the counter climbs quickly (e.g., 5 times per minute), the service is restarting frequently; systemctl show -p NRestarts xxx.service prints the same counter;
- Check the Active line: if it shows active (running) but the “since” timestamp keeps resetting, that is also a typical sign of restarts;
- Use journalctl -u xxx.service -f to view real-time logs of the service, which can help locate the cause of the crash (e.g., “configuration file error”, “memory overflow”).
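When the suspect service is not yet known, the restart counter can be checked for every service at once. This is a minimal sketch, assuming a systemd version that exposes the NRestarts property and supports the --value option; the function names flag_restarts and list_restart_counts are illustrative.

```shell
# Read "unit restart_count" pairs on stdin and print "count unit" for those
# whose count exceeds the threshold given as $1 (default 3), highest first.
flag_restarts() {
    awk -v limit="${1:-3}" '$2 + 0 > limit { print $2, $1 }' | sort -k1,1nr
}

# Emit "unit restart_count" for every loaded service (requires systemd).
list_restart_counts() {
    systemctl list-units --type=service --all --no-legend --plain |
    awk '{ print $1 }' |
    while read -r unit; do
        printf '%s %s\n' "$unit" "$(systemctl show -p NRestarts --value "$unit")"
    done
}

# Example use:  list_restart_counts | flag_restarts 3
```

A service near the top of this list is a good candidate for the journalctl check described above.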
(3) Use ps to Track Process Restart Frequency
If the application is not managed by systemd (e.g., it is launched by a custom startup script), periodically print the PIDs matching the process name with ps and watch whether they change:
# Print the PIDs of processes named "java" once per second; changing PIDs indicate restarts
while true; do ps -C java -o pid=; sleep 1; echo "===="; done
- If the output PIDs are different each time, the process is restarting about once per second;
- Adding the ppid field helps find the PID of the startup script: ps -C java -o pid,ppid,cmd.
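To turn the visual check into a number, the PID samples can be piped into a small counter. The sketch below assumes one sample per line, with an empty line meaning the process was not running at that moment; count_pid_changes is an illustrative name.

```shell
# Count how many times the sampled PID differs from the previous sample.
# Input: one PID per line; an empty line means "not running".
count_pid_changes() {
    awk 'NR > 1 && $1 != prev { changes++ }
         { prev = $1 }
         END { print changes + 0 }'
}

# Example use: sample the oldest process named exactly "java" once per
# second for 30 seconds, then report how often its PID changed.
#   for i in $(seq 30); do pgrep -o -x java || echo; sleep 1; done | count_pid_changes
```

A result close to the number of samples means the process is being replaced almost every second.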
3. Step Three: Use perf or strace to Deeply Investigate Parent Process Issues
After identifying the “root parent process” with high CPU consumption, further analysis is needed to understand why the parent process frequently calls short-lived programs or why it frequently crashes. At this point, you can use perf (performance analysis) and strace (system call tracing) tools.
(1) Scenario One: Analyze Why the Parent Process Frequently Calls Short-Lived Programs
If the parent process frequently calls short-lived binary programs (like curl), you need to investigate whether “the calling logic is necessary” or “if there is room for optimization”:
# Use perf to sample the system calls of the parent process, analyzing the call frequency
perf trace -p parent_process_PID -e execve # Only track execve system calls (process startup)
- If the execve call frequency is extremely high (e.g., more than 10 times per second), check the parent process code or configuration to determine whether the number of calls can be reduced (e.g., batch processing requests, reusing connections);
- If the short-lived program can be replaced with internal logic of the parent process (e.g., using Python’s requests library instead of curl), doing so avoids the overhead of process creation and reduces CPU consumption.
(2) Scenario Two: Analyze the Reasons for Frequent Crashes and Restarts of the Parent Process
If the parent process frequently crashes, you need to locate the root cause of the crash through logs or strace:
# Use strace to trace the system calls of the parent process, capturing abnormal behavior before the crash
strace -f -p parent_process_PID -o strace.log # -f traces child processes, -o outputs to log file
- Check strace.log: if exit_group(1) (a non-zero exit) appears just before the crash, preceded by failing read()/write() calls (e.g., “Bad file descriptor”), the cause may be a file or network connection problem;
- If the parent process is a Java/Python application, combine this with application logs (e.g., the JVM’s hs_err_pid*.log fatal error file, Python’s traceback) to locate code defects (e.g., null pointer, memory leak).
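An strace log can run to thousands of lines, so it helps to reduce it to the failing calls and abnormal exits first. The following is a rough sketch, assuming the log was written with -f -o (each line prefixed by a PID) and that failed calls end in the usual "= -1 ERRNO (message)" form; summarize_strace is an illustrative name.

```shell
# Summarize an strace log ($1): failed syscalls grouped by name and errno,
# plus any processes that exited with a non-zero status.
summarize_strace() {
    awk '
        # Failed calls look like: 1234 read(3, ...) = -1 EBADF (Bad file descriptor)
        / = -1 E[A-Z0-9]+/ {
            call = $0
            sub(/^[0-9]+ +/, "", call)   # drop the PID column added by -f -o
            sub(/\(.*/, "", call)        # keep only the syscall name
            match($0, /= -1 E[A-Z0-9]+/)
            err = substr($0, RSTART + 5, RLENGTH - 5)
            fail[call " " err]++
        }
        # Exit lines look like: 1234 +++ exited with 1 +++
        /\+\+\+ exited with [1-9]/ { exits[$0]++ }
        END {
            for (k in fail)  print "FAIL", fail[k], k
            for (k in exits) print "EXIT", k
        }' "$1" | sort
}

# Example use:  summarize_strace strace.log
```

Each FAIL line reads "FAIL count syscall errno", which points to the calls worth inspecting in the full log.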
3. Solutions: Optimization Strategies for Short-Lived Process Issues
After identifying the root cause, take targeted measures for each scenario to reduce CPU usage, either by cutting the cost of short-lived processes or by resolving the crash-and-restart cycle.
1. Scenario One: Optimize Parent Process Calls to Short-Lived Programs
(1) Reduce Call Frequency
- Batch Processing Tasks: Change “many single calls per second” to “one batch call every few seconds”. For example, a script that calls curl once per second to fetch 1 piece of data can instead fetch 3 pieces of data in a single call every 3 seconds. This cuts process creation from 3 launches to 1 launch per 3-second window (about two-thirds fewer), and the process-startup share of CPU consumption falls accordingly;
- Merge Duplicate Calls: If multiple pieces of business logic need to call the same short-lived program, merge them into a single call and share the result. For example, two Python threads that each call grep to analyze the same log file can be changed so that one thread makes the call and shares the result with the other.
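In a shell script, batching can be as simple as letting xargs group the inputs. The sketch below is one way to do it; run_batched is an illustrative name, and the curl usage assumes the URLs are listed one per line in a hypothetical urls.txt.

```shell
# Run a command once per batch of N input items instead of once per item.
# $1 = batch size; remaining arguments = the command to run.
run_batched() {
    n=$1
    shift
    xargs -n "$n" "$@"
}

# Example use: curl accepts several URLs per invocation, so 9 URLs cost
# 3 process creations instead of 9.
#   run_batched 3 curl -s < urls.txt
```

The batch size is a trade-off: larger batches create fewer processes but delay the first result.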
(2) Replace with Internal Logic of the Process
- Use the native libraries of the parent process instead of external binary programs: for example, a Python application can use the requests library instead of curl to initiate network requests, and a Java application can use java.net.HttpURLConnection instead of wget, avoiding the overhead of process creation and switching;
- For custom short-lived tools, integrate their logic into the parent process. For example, rewriting the originally independent process_data binary program as a function or module of the parent process completely eliminates the creation of short-lived processes.
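The same idea applies inside shell scripts: a pipeline such as df piped through grep and awk creates three processes per call, while parsing with shell builtins creates only one. A minimal sketch, assuming `df -P` style input (six columns, mount point last); usage_for_mount is an illustrative name.

```shell
# Print the usage percentage (without the % sign) for the mount point $1,
# reading `df -P`-style output on stdin. Parsing uses only shell builtins,
# so no grep or awk process is created.
usage_for_mount() {
    target=$1
    while read -r fs blocks used avail pct mount; do
        if [ "$mount" = "$target" ]; then
            echo "${pct%\%}"
            return 0
        fi
    done
    return 1
}

# Example use:  df -P | usage_for_mount /
```

Only df itself remains as an external process; whether that saving matters depends on how often the check runs.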
(3) Use Process Pools to Reuse Processes
If it is necessary to call external programs (such as relying on specific binary tools), you can use process pools to reuse processes, reducing the overhead of process creation/destruction:
- Python can use multiprocessing.Pool to keep a fixed set of worker processes alive and reuse them for short-lived tasks, rather than starting a new process per task;
- Shell scripts can use xargs -P to cap the number of concurrent processes, avoiding the simultaneous creation of numerous short-lived processes.
2. Scenario Two: Resolve Parent Process Crash and Restart Issues
(1) Emergency Mitigation: Pause Automatic Restarts to Avoid Continuous CPU Consumption
If application crashes and restarts lead to high CPU load, you can temporarily disable automatic restarts to prevent the problem from worsening:
- If managed by systemd: run systemctl stop xxx.service, change Restart= to Restart=no in the service configuration (/etc/systemd/system/xxx.service), then run systemctl daemon-reload;
- If restarted by a custom script: terminate the startup script process (e.g., kill script_PID, resorting to kill -9 script_PID only if it does not exit) to avoid triggering further restarts.
(2) Thorough Resolution: Locate and Fix the Cause of Crashes
- Configuration Errors: Check the application configuration files (such as database connection addresses, ports, permissions) and use journalctl or application logs to view configuration-related error messages (e.g., “Connection refused”);
- Code Defects: For Java applications, analyze the JVM fatal error log (hs_err_pid*.log) to locate memory exhaustion or thread deadlock; for Python applications, check traceback logs to locate the line of code causing the exception; for C/C++ applications, use gdb to debug the crash core file (core dump);
- Insufficient Resources: If crashes are caused by insufficient memory, check memory usage with free -h, then increase server memory or optimize application memory usage (e.g., reduce cache size, release unused objects).
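When insufficient memory is suspected, the kernel log usually records which processes the OOM killer terminated. The sketch below extracts the victim names; it assumes messages of the form "Out of memory: Killed process 1234 (java) ..." (older kernels print "Kill process"), and the exact wording may differ on your kernel. oom_victims is an illustrative name.

```shell
# Print "count name" for processes terminated by the kernel OOM killer,
# reading kernel log text on stdin.
# Assumed message shape: "... Out of memory: Killed process 1234 (java) ..."
oom_victims() {
    sed -n 's/.*Out of memory: Kill\(ed\)\{0,1\} process [0-9]* (\([^)]*\)).*/\2/p' |
    sort | uniq -c | awk '{ print $1, $2 }'
}

# Example use:  journalctl -k --no-pager | oom_victims
#          or:  dmesg | oom_victims
```

If the crashing application appears in this list, memory pressure rather than a code defect is the more likely trigger.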
(3) Optimize Startup Process: Reduce CPU Consumption During Startup Phase
If CPU consumption during the application startup phase is too high, you can optimize the startup logic:
- Delay initialization of non-core resources: change non-essential resources (like non-core caches, secondary database connections) to be initialized “on first use” rather than during startup;
- Simplify startup checks: reduce time-consuming operations such as configuration checks and version checks during startup, or change them to be executed asynchronously;
- Preload resources into shared memory: if multiple instances are started, common resources (like configuration files, static data) can be preloaded into shared memory to avoid repeated loading by each instance.
4. Preventive Measures: Avoid Recurrence of Short-Lived Process Issues
After resolving the high CPU usage caused by short-lived processes, preventive mechanisms should be established to avoid recurrence:
1. Monitor Short-Lived Processes and Restart Behavior
- Deploy execsnoop for Long-Term Tracking: On core servers, use execsnoop combined with log collection tools (like ELK) to record short-lived process calls over the long term, setting alert rules (e.g., “the call frequency of a certain command exceeds 5 times/second”);
- Monitor Service Restart Counts: For services managed by systemd, use systemctl show -p NRestarts xxx.service to get the restart count, setting alerts (e.g., “more than 3 restarts within 1 minute”);
- Track Process PID Changes: For core applications, periodically record their PIDs; if the PIDs change frequently (e.g., more than 5 changes within 10 minutes), trigger an alert.
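An alert rule such as “more than 5 calls per second” can be prototyped on recorded execsnoop output before wiring it into a log pipeline. This is a minimal sketch, assuming the output was captured with execsnoop -T, which adds an HH:MM:SS time column in front of the default bcc columns; exec_rate_alerts is an illustrative name.

```shell
# Print "HH:MM:SS command count" for every (second, command) pair whose
# exec count exceeds the threshold $1 (default 5).
# Assumed input layout (execsnoop -T): TIME PCOMM PID PPID RET ARGS...
exec_rate_alerts() {
    awk -v limit="${1:-5}" '
        $1 != "TIME" && NF >= 5 { n[$1 " " $2]++ }
        END { for (k in n) if (n[k] > limit) print k, n[k] }' | sort
}

# Example use on a recorded capture:
#   exec_rate_alerts 5 < /var/log/execsnoop.log
```

The log path in the usage line is hypothetical; any file holding the captured output works.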
2. Standardize Application Development and Deployment
- Limit External Program Calls: In application development specifications, clearly state “avoid frequent calls to external binary programs”, prioritizing the use of native libraries or modules; if external calls are necessary, evaluate the call frequency and performance impact;
- Improve Crash Handling Mechanisms: Add crash retry logic to applications (e.g., “restart after a 5-second delay” instead of immediate restart) to avoid high-frequency restarts; also, add crash alerts;
- Audit Startup Processes: Before applications go live, audit CPU consumption and initialization steps during the startup phase to avoid overly complex startup logic.
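For services managed by systemd, the “restart after a delay” policy can be expressed as a drop-in file rather than application code. The sketch below writes such a drop-in; it assumes a systemd version that supports StartLimitIntervalSec= in the [Unit] section, and the function name, file name, and chosen values are illustrative.

```shell
# Write a systemd drop-in that delays restarts and caps how many may occur
# within a time window. $1 = drop-in directory, for example
# /etc/systemd/system/xxx.service.d
write_restart_policy() {
    mkdir -p "$1" || return 1
    # [Unit]: allow at most 3 starts per 60 seconds, then stay failed.
    # [Service]: restart only after failures, waiting 5 seconds each time.
    printf '%s\n' \
        '[Unit]' \
        'StartLimitIntervalSec=60' \
        'StartLimitBurst=3' \
        '' \
        '[Service]' \
        'Restart=on-failure' \
        'RestartSec=5' > "$1/restart-policy.conf"
}

# Example use (as root), followed by a reload:
#   write_restart_policy /etc/systemd/system/xxx.service.d && systemctl daemon-reload
```

With a start limit in place, a service that keeps crashing ends up in the failed state instead of restarting indefinitely, which also makes the problem visible in systemctl.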
3. Regularly Investigate Hidden High CPU Issues
- Periodically Execute execsnoop Analysis: Run execsnoop for 10-15 minutes during off-peak hours each week to check for hidden high-frequency short-lived processes;
- Analyze System Call Logs: Regularly trace the system calls of core applications with perf trace to check for abnormal execve (high-frequency calls) or exit (frequent exits) behavior;
- Check for Abnormal Exit Information in Application Logs: Regularly search application logs for “crash”, “exit”, and “restart” messages to proactively locate potential crash risks;
- Compare Historical CPU Data: Regularly compare historical system CPU usage (e.g., the same time period each week). If CPU usage rises abnormally without significant business growth, check for hidden short-lived process issues.
5. Practical Case: From “Unexplained High CPU” to “Complete Resolution” Process
To make the troubleshooting method more concrete, here is a real case that walks through the whole process from discovering the problem to solving it.
Case Background
A Linux server of an e-commerce platform (CentOS 8) maintained a CPU usage of 75%-85% during off-peak hours (2-4 AM), but top and ps showed every process below 5% CPU, with no obvious high-usage process. During peak periods, CPU usage climbed above 95%, pushing the response time of some interfaces beyond 2 seconds and affecting user experience.
Troubleshooting Process
1. Step One: Use execsnoop to Capture Short-Lived Processes
Given the “invisible high CPU” symptom, the first step was to run execsnoop to track short-lived processes:
execsnoop
After running for 5 minutes, the command /usr/bin/python3 /opt/monitor/check_disk.py was found to appear 8-10 times per second with PPID 2345. Each run lasted approximately 150 milliseconds (a short-lived process) and exited normally (exit code 0).
2. Step Two: Locate Parent Process and Calling Logic
Checking the parent process with ps -p 2345 -o cmd showed that it was a custom monitoring service, /usr/bin/python3 /opt/monitor/main.py. A review of the main.py code revealed that the service called the check_disk.py script via subprocess.Popen to check disk usage, with the call interval set to “every 0.1 seconds”. This was the core reason for the high-frequency calls.
Further analysis of check_disk.py revealed that it ran the df -h command to obtain disk information and then used grep and awk to parse the results. Each check therefore depended on 3 external binary programs, each requiring a new process, in addition to the Python interpreter started for the script itself. At roughly 10 checks per second, this constant process creation drove the high CPU usage.
3. Step Three: Optimize Calling Logic and Script
To address the issue, the following optimization measures were taken:
- Extend Call Intervals: Change the call interval of check_disk.py from 0.1 seconds to 30 seconds (disk usage changes slowly, so high-frequency detection is unnecessary);
- Replace External Program Calls: Change the logic in check_disk.py that relies on df, grep, and awk to use the Python library psutil (pip install psutil) to obtain disk information, avoiding the creation of external processes;
- Merge Monitoring Tasks: Integrate the logic of check_disk.py into main.py and delete the independent script, completely eliminating process creation overhead.
4. Optimization Results
After optimization, the server’s CPU usage dropped to 15%-20% during off-peak hours and stabilized at 40%-50% during peak periods, with interface response times back within 500 milliseconds, completely resolving the issue.
6. Common Misconceptions and Precautions
When troubleshooting high CPU usage caused by short-lived processes, it is easy to fall into a few common misconceptions. Here are some key precautions to help everyone avoid “pitfalls”.
1. Misconception One: Relying Solely on top/ps Tools, Ignoring Professional Tracking Tools
Many people are accustomed to using top and ps to troubleshoot CPU issues, but for short-lived processes these two tools have significant limitations (they cannot capture processes that start and exit quickly). In this situation, switch your approach and use tools like execsnoop and strace, focusing on “tracking how processes are created” rather than “viewing the current process state”.
2. Misconception Two: Directly Terminating Short-Lived Processes After Finding Them Without Analyzing the Root Cause
After discovering high-frequency short-lived processes, some people simply kill the parent process. While this can temporarily reduce CPU usage, it does not solve the fundamental problem (the parent process may restart, or the business may depend on it). The correct approach is to first analyze the calling logic or crash reasons of the parent process and then optimize accordingly, to avoid “treating the symptoms rather than the root cause”.
3. Misconception Three: Ignoring the Importance of Process Exit Codes and Logs
When tracing short-lived processes, many people only look at PID and ARGS and neglect the exit code (the EXIT_CODE column of exitsnoop; the RET column of execsnoop only reports whether exec() itself succeeded). A non-0 exit code (like 1 or 255) indicates that the short-lived process exited abnormally, which may be due to errors in the parent process's calling logic (like incorrect parameter passing) and requires further investigation together with the parent process logs.
4. Precautions: Be Aware of Permissions and Performance Impact When Using Tools
- Permission Issues: Tools like execsnoop, strace, and perf generally require root privileges; ordinary users do not have permission to trace system processes or processes owned by other users;
- Performance Impact: strace and perf trace track all system calls of the process, which may have a noticeable performance impact on high-concurrency applications (e.g., temporarily increasing CPU usage by 5%-10%). It is recommended to use them during off-peak hours or to limit the tracing duration (e.g., timeout 30 strace -p 1234 -o strace.log, which traces for only 30 seconds).
7. Conclusion
The issue of “high CPU usage in Linux without finding high usage applications” is fundamentally due to the “insidiousness of short-lived processes”, which prevents conventional tools from capturing them. The core idea for solving such problems is:
- Identify Scenarios: Determine whether it is “the application frequently calls short-lived programs” or “the application crashes and restarts”;
- Tool Breakthrough: Use execsnoop to capture short-lived processes, use pstree/systemctl to locate parent processes, and use perf/strace to analyze the root cause of the problem;
- Targeted Optimization: Reduce the call frequency of short-lived programs, replace them with internal logic, and fix crash reasons to fundamentally reduce CPU consumption;
- Long-Term Prevention: Monitor short-lived processes, standardize application development, and conduct regular inspections to avoid recurrence of issues.
We hope that through the methods and cases presented in this article, everyone can quickly locate and resolve issues when encountering similar problems, ensuring the stable operation of Linux servers. If you encounter special scenarios or have questions during actual operations, feel free to discuss in the comments!