Summary of Troubleshooting 100% CPU in Linux

What should you do when a server's CPU hits 100%? This article walks through one real incident, and a shell script that helps troubleshoot 100% CPU on Linux is introduced later on.

Yesterday afternoon I received an operations alert email: CPU utilization on the data platform server had reached 98.94% and had stayed above 70% for some time. At first glance this looks like a hardware resource bottleneck that calls for more capacity. On reflection, though, our business system is neither high-concurrency nor CPU-intensive, so this level of utilization is out of proportion, and a hardware limit should not have been reached so soon. There must be a problem somewhere in the business code logic.

2. Troubleshooting Approach

2.1 Identify High Load Process PID

First, log in to the server and run the top command to see what is actually happening, then analyze from there.

Comparing the load average against the load evaluation criteria for an 8-core system confirms that the server is under high load.
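As a rough sketch of that comparison, the snippet below divides a load average by the core count. The helper names `load_per_core` and `is_overloaded` are made up for illustration, and the "load above core count" threshold is a common rule of thumb assumed here, not necessarily the criteria used in this incident.

```shell
# Hypothetical helpers; the threshold (load > cores) is an assumed rule of thumb.

# Print the load average divided by the number of cores, to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Succeed (exit status 0) when the load average exceeds the core count.
is_overloaded() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}
```

On a Linux host, the live values can be passed in with `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.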

Observing the resource usage of each process, it can be seen that the process with PID 682 has a high CPU usage ratio.

2.2 Identify Specific Abnormal Business Logic

Next, use the pwdx command, which prints the current working directory of a process given its PID, to find the path the business process runs from, and from that the project and the person responsible for it.

It can be concluded that the process corresponds to the web service of the data platform.
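For reference, pwdx reports the directory that Linux exposes as the `/proc/<pid>/cwd` symlink, so the lookup can be sketched without pwdx as well. The function name `proc_cwd` is hypothetical, and the snippet assumes a Linux host with /proc mounted.

```shell
# Hypothetical helper (Linux only): print the current working directory
# of the process whose PID is given as $1, by resolving /proc/<pid>/cwd.
proc_cwd() {
    readlink "/proc/$1/cwd"
}
```

Calling `proc_cwd 682` prints the directory alone, while `pwdx 682` prints the same directory prefixed with the PID.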

2.3 Identify Abnormal Threads and Specific Code Lines

The traditional approach generally involves four steps:

1. top, then press P to sort by CPU usage: 1040 // First, sort processes by load to find the PID with the highest load

2. top -Hp <process PID>: 1073 // Find the ID of the thread with the highest load

3. printf "0x%x\n" <thread ID>: 0x431 // Convert the thread ID to hexadecimal in preparation for the jstack lookup

4. jstack <process PID> | vim +/<hexadecimal thread ID> - // For example: jstack 1040 | vim +/0x431 -
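The four steps above can be strung together roughly as follows. This is only a sketch: the function names are hypothetical, `busiest_tid` assumes the procps version of ps, and `stack_for_nid` assumes that jstack prints each thread's native ID as a hexadecimal `nid=0x...` field and separates threads with blank lines, as in the example above.

```shell
# Hypothetical helpers for steps 2 to 4; step 1 (finding the process) is left to top.

# Step 2: print the ID of the thread using the most CPU in process $1.
busiest_tid() {
    ps -L -o tid=,pcpu= -p "$1" | sort -k2 -nr | head -n 1 | awk '{print $1}'
}

# Step 3: convert a decimal thread ID to the hex form used in jstack output.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Step 4: read a jstack dump on stdin and print only the stack of thread $1.
# The trailing space in the pattern keeps 0x431 from also matching 0x4310.
stack_for_nid() {
    awk -v nid="nid=$1 " '
        index($0, nid) { show = 1 }   # header line of the wanted thread
        show && /^$/   { exit }       # a blank line ends that thread
        show           { print }
    '
}
```

With a process ID in `pid`, the whole lookup becomes `jstack "$pid" | stack_for_nid "$(tid_to_nid "$(busiest_tid "$pid")")"`.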

When locating problems in production, however, every minute counts, and the four steps above are still too tedious and slow. I have previously introduced show-busy-java-threads.sh, a script written by oldratlee of Taobao, which makes it easy to pinpoint this kind of production problem.

Its output shows that a method in a time utility class accounts for a large share of the CPU. With the method located, the next step is to check its code logic for performance problems.

※ If the production issue is urgent, you can skip 2.1 and 2.2 and go straight to 2.3. The multi-angle analysis here is only meant to show the complete line of reasoning.

3. Root Cause Analysis

The analysis above ultimately traced the problem to a time utility class, which was driving up server load and CPU usage.

  • Abnormal method logic: converts a timestamp into the corresponding date-time format;

  • Upper-layer call: takes every second from midnight to the current time, converts each one into the corresponding format, stores the results in a set, and returns the set;

  • Logic layer: the real-time report query logic of the data platform, where real-time reports are queried at fixed intervals and a single query calls the method multiple (n) times.

It follows that if the current time is 10 AM, a single query performs 10 × 60 × 60 × n = 36,000 × n calculations, and the count grows linearly as the day goes on, peaking just before midnight. Because modules such as real-time query and real-time alerting issue a large number of query requests, the method was called very frequently, consuming and wasting a great deal of CPU.
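The arithmetic can be checked with a small sketch. The function name `conversions_per_query` is hypothetical, and n is passed in as a parameter because its actual value is not stated.

```shell
# Hypothetical helper: number of per-second conversions one query triggers.
# $1 = hour of day, $2 = minute, $3 = n (method calls per query).
# Pass plain integers without leading zeros (08 would be read as octal).
conversions_per_query() {
    echo $(( ($1 * 3600 + $2 * 60) * $3 ))
}
```

For example, `conversions_per_query 10 0 1` gives the 36,000 conversions per call at 10 AM, and `conversions_per_query 23 59 1` gives 86,340 one minute before midnight.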

4. Solution

Once the problem was located, the first idea was to reduce the number of calculations by optimizing the abnormal method. Further investigation showed that the logic layer never used the contents of the returned set, only its size. After confirming this, we wrote a new method that computes the value directly (current seconds - seconds at midnight) and used it in place of the original call, which eliminated the excess computation. After the release, server load and CPU utilization both fell by a factor of 30 compared with the abnormal period and returned to normal, and the problem was resolved.
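The difference between the two approaches can be illustrated in shell, keeping to the language of the commands used earlier. This is not the original business code, which is not shown here. Both function names are hypothetical, and whether the original set included the current second is an assumption (it is excluded below).

```shell
# Old shape (assumed): produce one entry per elapsed second, then count them.
# $1 = current time in seconds, $2 = time at midnight in seconds.
count_by_enumeration() {
    [ "$1" -gt "$2" ] || { echo 0; return; }
    seq "$2" "$(( $1 - 1 ))" | wc -l | tr -d ' '
}

# New shape: the same number from a single subtraction.
count_by_subtraction() {
    echo $(( $1 - $2 ))
}
```

Both functions return the same count, but the first does work proportional to the time elapsed since midnight while the second is a single arithmetic operation, which is why replacing the call removed the load.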

5. Summary

When writing code, pay attention to performance as well as to implementing the business logic. Implementing a business requirement, and implementing it efficiently and elegantly, reflect two very different levels of engineering ability, and the latter is an engineer's core competitiveness.

After finishing the code, review it again and ask whether there is a better way to achieve the same result.

Don't overlook small details in production issues! Details matter: engineers need the curiosity to get to the bottom of a problem and the drive to pursue excellence. Only then can they keep growing and improving.

Original link:

https://my.oschina.net/leejun2005/blog/1602482
