Linux Kernel Performance Optimization: Secrets to Speeding Up Your System

In today’s digital age, the Linux system is widely used in servers, cloud computing, big data, and many other fields due to its strong stability, openness, and flexibility. However, with the continuous growth of business volume and the increasing complexity of application scenarios, the performance of the Linux kernel faces tremendous challenges. Even minor performance bottlenecks can snowball under high load, leading to a series of serious issues.

Imagine an e-commerce website during peak shopping periods, where poor Linux kernel performance causes slow server responses. Users click on product details, but the page takes a long time to load; after submitting an order, they receive no feedback for a long time. This not only severely impacts user experience but can also lead to significant loss of potential customers, resulting in immeasurable economic losses for the business. Similarly, a real-time data analysis system may fail to process vast amounts of data in a timely manner due to kernel performance issues, causing analysis results to lag seriously. Decisions made by the business decision-makers based on outdated data may diverge from the actual market situation, affecting the company’s strategic layout and development direction.

Thus, optimizing Linux kernel performance is no small matter; it directly relates to system stability, reliability, and the normal operation of the business. By reasonably adjusting kernel parameters, optimizing system resource allocation, and employing efficient scheduling algorithms, we can significantly enhance the performance of the Linux kernel, allowing the system to handle various complex workloads with ease. Next, let us dive into the exciting world of Linux kernel performance optimization.

1. Revealing Key System Performance Metrics

Before delving into methods for optimizing Linux kernel performance, we must first clarify the key metrics for measuring system performance. Just as we focus on speed, fuel consumption, and handling when evaluating a car’s performance, metrics such as throughput, system latency, CPU usage, memory usage, and disk I/O can help us comprehensively understand the operational status of the Linux system.

1.1 Throughput: The Speed of Data Processing

Throughput refers to the amount of data processed or the number of requests successfully handled by the system within a unit time; it intuitively reflects the data processing capability of the system. For example, during a promotional event, if an e-commerce website can handle 1000 product query requests and 200 order submission requests per second, the number of requests processed reflects the website’s system throughput. If the website’s throughput is low, a large number of requests may pile up under high concurrency, leading to slow system responses or even crashes. It’s like a narrow road where a slight increase in traffic can cause congestion. Improving system throughput allows the system to handle more tasks within a unit time, meeting the demands of business growth.

1.2 System Latency: A Key Factor Affecting User Experience

System latency refers to the time taken from the moment a request is sent until a response is received. Taking the e-commerce website as an example, if the product detail page loads within 1 second, users hardly notice the delay and enjoy a smooth experience. However, if the loading time extends to 5 seconds or more, users may become impatient and leave the site. Low latency is crucial for enhancing user experience, especially in scenarios requiring real-time interactions, such as online gaming and financial transactions. In online games, every action taken by players needs to be reflected promptly in the game interface. If system latency is too high, an attack command issued by a player may take several seconds to take effect, severely affecting the fairness and enjoyment of the game.

Generally, system performance is constrained by both of these metrics, and neither can be ignored. For instance, if a system can handle one million concurrent requests but with a latency of over two minutes, that one-million-request capacity is meaningless. Conversely, very low latency means little if throughput is also low. Therefore, a good performance test must measure both at the same time.

Those with experience know some relationships between these two metrics:

  • The greater the throughput, the worse the latency. As the request volume increases, the system becomes too busy, naturally slowing down response speed.

  • The better the latency, the higher the supported throughput. Short latency indicates fast processing speed, allowing for more requests to be handled.

2. Practical Methods for Identifying Performance Bottlenecks

2.1 System Performance Testing

From the above discussion, we understand that testing system performance requires us to collect the values of throughput and latency.

  • First, we need to define the value of latency. For example, the response time for a website system must be within 5 seconds (for some real-time systems, it may need to be defined even shorter, such as within 5ms, depending on different business requirements).

  • Second, develop performance testing tools—one tool to generate high-intensity throughput and another to measure latency. For the first tool, you can refer to “Ten Free Web Stress Testing Tools.” Regarding measuring latency, you can measure it in the code, but this will affect program execution and only test internal latency. The actual latency includes the operating system and network delays; you can use Wireshark to capture network packets for measurement. How to implement these two tools is left for everyone to think about.

  • Finally, start performance testing. You need to continually increase the tested throughput and observe the system’s load. If the system can handle it, then observe the latency value. This way, you can find the system’s maximum load and know the response delay.
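
As a rough illustration, assuming the wrk load generator is installed and http://example.com/api is a placeholder endpoint, a single test run might look like this:

# hold roughly 200 concurrent connections for 15 minutes and print the latency distribution
wrk -t4 -c200 -d15m --latency http://example.com/api
# a simpler alternative with Apache Bench: 100,000 requests at concurrency 200
ab -n 100000 -c 200 http://example.com/api/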

To elaborate further:

  • Regarding latency, if throughput is low, this value is likely to be very stable. As throughput increases, the system’s latency may experience significant fluctuations. Therefore, when measuring latency, we need to pay attention to its distribution, meaning we need to know what percentage falls within our acceptable range, what percentage exceeds it, and what percentage is entirely unacceptable. Perhaps the average latency meets the standard, but only 50% fall within our acceptable range, which is meaningless.

  • For performance testing, we also need to define a time period. For example, maintain a certain throughput for 15 minutes. When load peaks, the system may become unstable, and it may take a minute or two for the system to stabilize. Additionally, it’s possible that your system performs normally under this load for the first few minutes and then becomes unstable or even crashes. Therefore, this time period is necessary. We call this value the peak limit.

  • Performance testing also requires conducting a soak test, where the system can continuously run under a certain throughput for a week or longer. This value is referred to as the system’s normal operating load limit.

Performance testing involves numerous critical elements, such as burst testing, etc. Here, we cannot elaborate on everything; we only mention some aspects related to performance optimization. In summary, performance testing is both a meticulous and laborious task.

With the above groundwork laid, we can test the system’s performance. Before tuning, let’s discuss how to identify performance bottlenecks. I’ve seen many friends think this is easy, but upon further questioning, they often lack a systematic approach.

2.2 Checking Operating System Load

When troubleshooting Linux kernel performance issues, checking the operating system load is a crucial first step. Through this operation, we can understand the current workload of the system and identify potential performance bottlenecks.

Using the top command, we can view the overall operating status of the system in real-time. By entering “top” in the command line, a dynamically updated interface will appear, displaying various important information. The top line details CPU usage, including the percentage of CPU time used by user space processes (us), kernel space processes (sy), user processes adjusted for priority (ni), idle CPU time percentage (id), CPU time waiting for I/O operations to complete (wa), CPU time used by hardware interrupts (hi), CPU time used by software interrupts (si), and the percentage of time the virtual machine manager waits for other virtual CPUs to run (st). For example, if the us value is high, it indicates that user space processes are consuming a significant amount of CPU resources, necessitating a deeper investigation into these processes to determine if there is any abnormal or inefficient code.

The htop command provides a more intuitive and richer display of information. It visually presents CPU usage, with different colored bar graphs representing various types of CPU load, such as green for user processes and red for kernel processes. It also clearly displays detailed information about each process, including process ID, user, priority, and memory usage. When we suspect that a particular process significantly impacts system performance, using htop can facilitate easier identification and analysis of that process.

SystemTap is a powerful dynamic kernel and user space tracing tool. It allows us to write scripts to capture and analyze various events during system operation. For example, if we want to understand the resource usage of a specific process during execution, we can write a SystemTap script to monitor that process. Through the script, we can obtain detailed information about the process’s CPU utilization, memory access frequency, etc., allowing us to pinpoint the issue accurately.

LatencyTOP focuses on detecting latency issues within the system. When running LatencyTOP, it analyzes the latency of various processes and highlights those with high latency. For instance, if LatencyTOP reports excessive latency for a particular driver process, it may indicate performance deficiencies in that driver, requiring further optimization or updating.

First, when our system encounters issues, we should not rush to investigate our code, as this is meaningless. Our first priority is to review the operating system’s reports. Check the operating system’s CPU utilization, memory usage, I/O operations, network I/O, the number of network connections, etc. Perfmon is an excellent tool under Windows, and there are many related commands and tools under Linux, such as SystemTap, LatencyTOP, vmstat, sar, iostat, top, tcpdump, etc. By observing this data, we can determine where our software’s performance primarily lies. For example:

  • First, check CPU utilization. If CPU utilization is low but throughput and latency do not improve, it indicates that our program is not busy computing but is occupied with other tasks, such as I/O. (Additionally, CPU utilization must also consider kernel and user modes; if kernel mode utilization rises, overall system performance will decline. For multi-core CPUs, CPU 0 is particularly critical; if CPU 0’s load is high, it can affect the performance of other cores, as scheduling between CPU cores relies on CPU 0.)

  • Next, examine I/O levels. Generally, CPU utilization is inversely related to I/O. High CPU utilization usually corresponds to low I/O, while high I/O results in lower CPU utilization. We need to consider three aspects of I/O: disk file I/O, driver I/O (e.g., network cards), and memory page swapping rates. These three aspects can all impact system performance.

  • Then, check network bandwidth usage. In Linux, you can use commands such as iftop, iptraf, ntop, tcpdump to monitor this. Alternatively, you can use Wireshark for analysis.

  • If CPU, I/O, memory usage, and network bandwidth are all low, yet system performance remains stagnant, it suggests that your program may have issues, such as being blocked. This could be due to waiting for a lock, waiting for a resource, or being stuck in context switching.

By understanding the performance of the operating system, we can identify performance issues, such as insufficient bandwidth, inadequate memory, insufficient TCP buffers, etc. Often, no adjustments to the program are needed; simply tweaking hardware or operating system configurations suffices.
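
As a minimal first-pass triage sketch on Linux (the tools and flags are only suggestions; iostat and sar come from the sysstat package), the following covers CPU, memory, I/O, and network in one sweep:

top -b -n 1 | head -20    # one-shot snapshot: load average, us/sy/wa CPU split, top processes
vmstat 1 5                # run queue, swapping (si/so), context switches, CPU usage per second
iostat -x 1 3             # per-device I/O utilization and wait times
sar -n DEV 1 3            # per-interface network throughput
ss -s                     # socket summary: established connections, TIME-WAIT count, etc.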

2.3 Using Profilers for Testing

Next, we need to use performance testing tools, specifically a profiler, to assess our program’s runtime performance. Examples include Java’s JProfiler/TPTP/CodePro Profiler, GNU’s gprof, IBM’s PurifyPlus, Intel’s VTune, AMD’s CodeAnalyst, and Linux’s OProfile/perf. The latter two can optimize your code down to the CPU micro-instruction level. If you’re concerned about CPU L1/L2 cache tuning, you should consider using VTune. Using these profiling tools allows us to gather various metrics about our program’s modules and functions, such as runtime, call count, CPU utilization, etc. These metrics are invaluable for us.

We should focus on functions and instructions with the most runtime and call counts. It’s important to note that for functions called frequently but with short execution times, minor optimizations can lead to significant performance improvements (for example, if a function is called 1 million times in one second, improving its execution time by 0.01 milliseconds can yield substantial performance gains).
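
On Linux, a quick pass with perf illustrates this workflow (a sketch; the PID 1234 is a placeholder and the binary should be built with symbols):

perf top                                    # live view of the hottest functions system-wide
perf record -F 99 -g -p 1234 -- sleep 30    # sample PID 1234 at 99 Hz with call graphs for 30 seconds
perf report                                 # browse the recorded profile, sorted by CPU time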

There is a caveat when using profilers: they can degrade your program’s performance. Tools like PurifyPlus insert numerous lines of code into your program, lowering its execution efficiency, thus making it difficult to test performance under high throughput. To identify system bottlenecks, there are generally two methods:

  • Implement your own statistics in your code, using microsecond-level timers and function call counters, logging statistics to a file every 10 seconds.

  • Segment your code blocks by commenting out certain functions to allow others to run, performing hard-coded mocks, and then testing whether throughput and latency exhibit qualitative changes. If they do, the commented-out functions are likely the performance bottlenecks. You can then comment out sections of code within those functions until you identify the most resource-intensive statements.

Lastly, it’s worth mentioning that different throughputs yield different test results, and varying test data also leads to different outcomes. Therefore, the data used for performance testing is crucial, and we need to observe the results across different throughputs.

2.4 Analyzing I/O Conditions

I/O operations play a crucial role in system performance, and in-depth analysis can help us uncover many potential performance issues.

For disk file I/O, the iostat command is our reliable assistant. By executing “iostat -d -x 1” (where “-d” displays only disk-related information, “-x” shows more detailed information, and “1” means output data every second), we can obtain detailed statistics for each disk device. Here, “r/s” represents the number of reads completed per second, “w/s” represents the number of writes completed per second, “rkB/s” indicates the amount of data read per second, “wkB/s” indicates the amount of data written per second, and “%util” indicates the device’s busy status. When “%util” approaches 100%, it indicates that the disk I/O system is operating at full capacity and may become a performance bottleneck. For instance, in a data storage server, if a certain disk’s “%util” consistently remains above 95%, it may be necessary to optimize that disk, such as replacing it with a higher-performance disk or adjusting the data storage method.

The iotop command allows us to see which processes are consuming significant disk I/O resources. After executing “iotop,” it lists the I/O usage for each process, including read and write rates. If we find that a particular process has an abnormally high I/O read/write rate, we need to analyze that process’s business logic to determine if there are unreasonable I/O operations. For example, if a backup process frequently reads and writes small files during data backup, it may degrade disk I/O performance. In this case, we could consider optimizing the backup strategy by using larger file reads/writes or batch operations.

The impact of driver I/O on system performance should not be underestimated. If a driver has issues, it can lead to inefficient data transfer between the device and the system. For instance, if a network driver is outdated, it may not fully utilize the network card’s performance, limiting network transmission speeds. In such scenarios, updating the driver promptly can significantly enhance system performance.

Memory page swapping rates are also an important metric for measuring system performance. The vmstat command can help us monitor memory swapping conditions. In the output, “si” indicates the number of swap pages transferred from disk to memory, while “so” indicates the number of pages swapped from memory to disk. If both “si” and “so” have high values, it indicates that the system is frequently performing memory swapping operations, which can severely impact system performance. This is often due to insufficient memory. For example, in a server running multiple large applications, if the memory swapping rate is excessively high, it may be necessary to increase physical memory or optimize the memory usage of the applications.
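
A quick way to confirm whether memory pressure is the culprit (a sketch; acceptable thresholds depend on the workload):

free -h        # memory used vs. available, and how much swap is already in use
vmstat 1 10    # watch the si/so columns: sustained non-zero values mean active swapping
sar -B 1 10    # paging statistics (pgpgin/s, pgpgout/s, majflt/s) from the sysstat package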

2.5 Insights into Network Bandwidth Usage

In a networked environment, understanding network bandwidth usage is crucial for identifying performance bottlenecks.

iftop is an excellent real-time traffic monitoring tool. By executing “iftop -i eth0” (where “-i” specifies the network interface to monitor, and “eth0” is a common interface name), we can intuitively see the real-time traffic on the specified network interface. The iftop interface clearly displays the traffic transfer between various IP addresses, including sending traffic (TX), receiving traffic (RX), and total traffic (TOTAL). We can also see average traffic over different periods, such as the last 2 seconds, 10 seconds, and 40 seconds. This helps us quickly identify if certain IP addresses are generating excessive traffic, indicating potential network bottlenecks. For example, in a corporate network, if we find that traffic between a particular IP address and an external server consistently exceeds 80% of the network bandwidth, we need to investigate the business associated with that IP address for any abnormal data transfer.

iptraf is also a powerful network traffic monitoring tool. It can not only monitor network interface traffic in real-time but also provide detailed network connection information, such as TCP and UDP connections. By using the command “iptraf -g,” we can enter a graphical interface to conveniently view traffic statistics for each network interface. Additionally, iptraf supports viewing traffic by protocol type, which is very helpful for analyzing the usage of different protocols within the network. For instance, in a network environment dominated by HTTP protocol, if we find that HTTP traffic occupies an excessive proportion, squeezing other business network bandwidth, we may consider optimizing HTTP operations, such as employing caching techniques and optimizing page loading methods to reduce network bandwidth consumption.

3. Methods for Performance Optimization

The following are some issues I have encountered; they may not be exhaustive or entirely correct, and I welcome any additions or corrections. Generally speaking, performance optimization can be summarized as follows:

  • Using space to exchange time. Various caches, such as CPU L1/L2/RAM to hard drives, are strategies that exchange space for time. This strategy essentially saves or caches the computation process step by step, eliminating the need to recalculate each time, such as data buffering and CDN. This strategy also manifests as redundant data, such as data mirroring and load balancing.

  • Using time to exchange space. Sometimes, a small amount of space may yield better performance. For example, in network transmission, certain data compression algorithms (like the “Huffman coding algorithm” and the core algorithm of “rsync”) may be time-consuming, but since the bottleneck is in network transmission, using time to save space can actually save time.

  • Simplifying code. The most efficient program is one that executes no code at all, so less code leads to better performance. There are many examples of code-level optimization techniques in university textbooks. For instance, reducing loop nesting levels, minimizing recursion, declaring fewer variables within loops, reducing memory allocation and deallocation operations within loops, moving expressions out of the loop body, carefully ordering multiple conditions in conditional expressions, preparing some elements during program startup, being mindful of function call overhead (stack overhead), and being cautious with temporary objects in object-oriented languages, etc.

  • Parallel processing. If a CPU has only one core, using multiple processes or threads may actually slow it down for compute-intensive software (due to the significant overhead of operating system scheduling and context switching). The advantages of multi-process and multi-threading can only be realized with multi-core CPUs. Parallel processing requires our programs to have scalability; programs that cannot scale horizontally or vertically cannot be parallelized.

From an architectural perspective, this raises the question: Can performance improvement be achieved simply by adding machines without altering code?

In summary, according to the 80/20 principle, 20% of the code consumes 80% of the performance. By identifying that 20% of the code, you can optimize that 80% of performance. The following are some of my experiences; I will only list a few of the most valuable performance optimization methods for your reference, and I welcome any additions.

3.1 Algorithm Optimization

Algorithms are very important; good algorithms yield better performance. Here are a few examples from projects I’ve experienced:

  • One is a filtering algorithm. The system needs to filter incoming requests. We configured filterable items in a file. Initially, the filtering algorithm traversed the filtering configuration. Later, we found a way to sort this filtering configuration, allowing us to filter using binary search, resulting in a 50% performance increase.

  • Another is a hashing algorithm. The function for calculating the hash algorithm is inefficient; it is both time-consuming and prone to collisions. High collision rates lead to performance similar to that of a singly linked list (see Hash Collision DoS issue). We know that algorithms are closely related to the data being processed. Even the often-ridiculed “bubble sort” can be more efficient in certain situations (e.g., when most data is already sorted) than all other sorting algorithms. The same applies to hashing algorithms; well-known hashing algorithms are tested using English dictionaries, but our business has specific data characteristics. Therefore, we need to select suitable hashing algorithms based on our data. In one of my previous projects, a talented colleague provided me with a hashing algorithm that improved our system’s performance by 150% (you must check out the article on various hashing algorithms on StackExchange).

Divide and conquer and pre-processing. Previously, a program needed to compute long monthly reports, sometimes taking almost an entire day. We found a way to implement an incremental algorithm, meaning we computed that day’s data and merged it with the previous day’s report. This significantly reduced computation time; daily data computation only took 20 minutes, but computing an entire month’s data took over 10 hours (SQL statements experience exponential performance degradation with large data volumes). This divide-and-conquer approach is beneficial for performance in the face of big data, similar to merge sort. SQL statement and database performance optimization also follow this strategy, such as using nested selects instead of Cartesian product selects and utilizing views, etc.

3.2 Code Optimization

From my experience, code optimization includes the following points:

String operations. These are among the most performance-consuming tasks, whether strcpy, strcat, strlen, or especially substring matching. Therefore, use integers instead of strings whenever possible. For example, years ago when I worked at a bank, a colleague liked to store dates as strings (e.g., 2012-05-29 08:30:02), and a SELECT with a BETWEEN condition in its WHERE clause over such a column was incredibly time-consuming. Another colleague stored status codes as strings because they could be displayed directly on the interface. During performance optimization, I changed all of these status codes to integers and used bitwise operations to check statuses; since one function was called 150K times per second and needed to check statuses in three places, performance improved by about 30% after the change. I also recall a programming specification from a product I worked on that required defining the function name in every function, such as const char fname[] = "functionName()". This was meant for logging, but why not declare it static?

Multi-threading optimization. Some say threads are evil, and this can be a performance issue for systems at times. The bottleneck in multi-threading lies in mutexes and synchronization locks, as well as the cost of thread context switching. Minimizing or eliminating locks (for instance, the application of multi-version concurrency control (MVCC) in distributed systems can address performance issues) is fundamental. Additionally, read-write locks can solve most concurrency performance issues involving read operations. I should also mention that in C++, we might use thread-safe smart pointers like AutoPtr or other containers. However, if they are thread-safe, they will invariably require locking, which is a costly operation. Using AutoPtr can significantly decrease system performance. If you can ensure that there are no concurrent thread issues, you should avoid using AutoPtr.

I remember a colleague removing reference counting from smart pointers, which improved system performance by over 50%. Regarding Java object reference counting, if I’m not mistaken, it involves numerous locks, which is why Java performance has always been an issue. Moreover, having more threads does not necessarily mean better performance: the cost of thread scheduling and context switching can be severe. It is best to do as much as possible within a single thread and avoid synchronizing between threads; this can yield significant performance gains.

Memory allocation. Do not underestimate program memory allocation. Operations like malloc/realloc/calloc are very time-consuming, especially when memory fragmentation occurs. In a previous company, we encountered an issue where our program became unresponsive on a user’s site. After using GDB to investigate, we found the system hung on a malloc operation for 20 seconds. Restarting some systems resolved the issue. This was a memory fragmentation problem. Many people complain that STL has serious memory fragmentation issues due to excessive small memory allocations and deallocations. Some believe that using memory pools can solve this problem, but in reality, they are merely reinventing the memory management mechanisms of Runtime-C or the operating system, which does not help at all.

Of course, addressing memory fragmentation issues still involves using memory pools. Specifically, it requires a series of memory pools of different sizes (this is left for everyone to think about). Moreover, minimizing dynamic memory allocation is the best approach. Speaking of memory pools, we need to mention pooling techniques, such as thread pools and connection pools. Pooling techniques are very effective for short jobs (like HTTP services) as they can reduce connection establishment and thread creation overhead, thus improving performance.

Asynchronous operations. We know that file operations in Unix can be blocking or non-blocking, and some system calls are also blocking, such as Socket’s select or Windows’ WaitforObject. If our program uses synchronous operations, it will significantly impact performance. We can switch to asynchronous methods, but this will complicate your program. Asynchronous methods typically involve queues, and we need to pay attention to queue performance issues. Additionally, state notifications under asynchronous conditions can often be problematic, such as message event notification methods or callback methods; these can also affect performance. However, generally speaking, asynchronous operations can significantly enhance throughput while sacrificing system response time (latency). This requires business support.

Language and library. We need to be familiar with the performance of the language and the functions or class libraries we use. For instance, many STL containers do not return memory to the system even after elements are deleted, which can create the illusion of a memory leak and potentially cause memory fragmentation. Also, in some STL containers size() and empty() do not cost the same: size() can be O(n) while empty() is O(1), so be careful. JVM tuning in Java involves specific parameters such as -Xms, -Xmx, -Xmn, -XX:SurvivorRatio, and -XX:MaxTenuringThreshold, and we also need to pay attention to GC behavior; everyone knows how disruptive a full GC can be (it also compacts memory, reducing fragmentation), since it stops the whole world while it runs.

3.3 Network Optimization

Regarding network optimization, especially TCP tuning (you can find many articles online using these two keywords), there is a lot to discuss. Just look at the numerous parameters for TCP/IP under Linux.

⑴ TCP Tuning

We know that TCP connections have many overheads, including consuming file descriptors and cache. Generally, the number of TCP connections a system can support is limited, and we need to recognize that TCP connections impose significant resource costs. This is why many attacks aim to create a large number of TCP connections on your system, exhausting your system’s resources, such as the famous SYN Flood attack.

Therefore, we need to configure the KeepAlive parameters, which define a time limit; if no data is transmitted over the connection during this time, the system sends a probe packet. If no response is received, TCP considers the connection broken and closes it, freeing up system resources. (Note: There is also a KeepAlive parameter at the HTTP level.) For short connections like HTTP, setting a KeepAlive of 1-2 minutes is crucial. This can help mitigate DoS attacks to some extent. The following parameters are relevant (the values are for reference only):

net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_fin_timeout = 30
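
Note that the idle time before the first keepalive probe is sent is controlled by a separate parameter, net.ipv4.tcp_keepalive_time (7200 seconds by default), which usually needs to be lowered as well if keepalive is to take effect within 1-2 minutes, for example:

net.ipv4.tcp_keepalive_time = 120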

Regarding TCP’s TIME_WAIT state, the side that actively closes the connection enters TIME_WAIT, which lasts for 2 MSL (Maximum Segment Lifetime); classically this works out to about 4 minutes, and Linux hardcodes the TIME_WAIT duration to 60 seconds. While a socket is in TIME_WAIT, its resources cannot be reclaimed. A large number of TIME_WAIT connections typically occurs on HTTP servers. For this, two parameters need attention:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1

The former enables reusing sockets in TIME_WAIT; the latter enables aggressive reclamation of TIME_WAIT resources (be aware that tcp_tw_recycle breaks connections from clients behind NAT and was removed entirely in Linux 4.12, so it should generally stay off). Another important concept in TCP is RWIN (TCP Receive Window Size), which is the maximum amount of data a TCP connection can receive without sending an ack to the sender. This is crucial because if the sender does not receive an ack from the receiver, it will stop sending data and wait for a while; if it times out, it will retransmit. This is why TCP connections are reliable. Retransmission is not the most severe issue; if packet loss occurs, TCP’s bandwidth utilization is immediately affected (the congestion window is roughly halved), and if packet loss continues, the bandwidth keeps halving until it gradually recovers. Relevant parameters are as follows:

net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Keeping that much data in flight only pays off when little of it has to be retransmitted. (Of course, in poor network conditions, high performance becomes irrelevant anyway.) Therefore, for high-performance networking, ensuring a very low packet loss rate (ideally in a LAN) is crucial. If the network is generally reliable, using larger buffers improves network transmission performance, since frequent small back-and-forth exchanges significantly hurt throughput.
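
Before and after touching any of these parameters, it helps to check what the kernel is actually doing; a few read-only commands (a sketch, all standard on modern distributions):

sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_reuse    # print the current values
ss -s                                                    # socket summary, including the TIME-WAIT count
ss -tan state time-wait | wc -l                          # count TIME-WAIT sockets explicitly
netstat -s | grep -i retrans                             # cumulative retransmission counters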

Additionally, consider whether to use faster UDP if network quality is excellent and occasional packet loss is acceptable. Have you thought about this question?

⑵ UDP Tuning

When discussing UDP tuning, I want to emphasize one thing: MTU—Maximum Transmission Unit (this applies to TCP as well, as it is a link-layer issue). The maximum transmission unit can be imagined as a bus on a road. Suppose a bus can carry a maximum of 70 people, and the bandwidth is like the number of lanes on the road. If a road can accommodate a maximum of 100 buses, that means I can transport 7000 people. However, if the buses are not full, say each bus carries only 20 people, then I only transport 2000 people, wasting road resources (bandwidth resources). Therefore, for a UDP packet, we should try to fill it up to the MTU before transmitting it over the network to maximize bandwidth utilization. For Ethernet, the MTU is 1500 bytes, for FDDI (fiber), it is 4352 bytes, and for 802.11 wireless, it is 7981 bytes.

However, when sending TCP/UDP packets, our effective payload must be lower than this value because the IP header adds 20 bytes and the UDP header adds 8 bytes (TCP adds more). Therefore, over Ethernet the maximum payload for a UDP packet is 1500 - 20 - 8 = 1472 bytes of your data. Of course, if you’re using fiber, this value can be larger. (By the way, for some high-end gigabit Ethernet cards, if the hardware detects that your packet exceeds the MTU, it will handle fragmentation itself, with reassembly at the destination, so you do not need to manage it in your program.)
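
To verify the usable MTU along a path, a common trick is to ping with the don’t-fragment bit set (a sketch; example.com is a placeholder host):

ip link show eth0                 # shows the local interface MTU
ping -M do -s 1472 example.com    # 1472 + 8 (ICMP header) + 20 (IP header) = 1500, should succeed on Ethernet
ping -M do -s 1473 example.com    # one byte more should fail if the path MTU is 1500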

Moreover, when using socket programming, you can use setsockopt() to set the sizes of SO_SNDBUF/SO_RCVBUF, TTL, and KeepAlive, among other key settings. There are many more, which you can check in the socket manual.

Finally, one of UDP’s greatest advantages is multicast, which is very convenient and efficient for notifying multiple nodes within an internal network. Additionally, multicast technology is beneficial for horizontal scalability (requiring more machines to listen for multicast information).

⑶ Network Card Tuning

Network cards can also be tuned, which is particularly necessary for gigabit and faster cards. In Linux, we can use ifconfig to view network statistics. If we see a non-zero overruns counter, we may need to increase the txqueuelen size (the default is usually 1000), for example: ifconfig eth0 txqueuelen 5000. The ethtool command in Linux can also be used to adjust the network card’s ring buffer sizes. In Windows, we can adjust related parameters (such as Receive Buffers, Transmit Buffer, etc.) in the advanced tab of the network adapter settings, though different cards have different parameters. Increasing the buffer size is very effective for large data transmissions over the network.
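
For example (a sketch, assuming the interface is eth0 and the driver supports these operations):

ip link set eth0 txqueuelen 5000    # the modern equivalent of the ifconfig command above
ethtool -g eth0                     # show current and maximum RX/TX ring buffer sizes
ethtool -G eth0 rx 4096 tx 4096     # enlarge the ring buffers (must not exceed the reported maximums)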

⑷ Other Network Performance

Regarding multiplexing technology, which uses a single thread to manage all TCP connections, three system calls require special attention: one is select, which only supports a maximum of 1024 connections; the second is poll, which can exceed that limit but is based on a polling mechanism, leading to poor performance with many connections due to its O(n) algorithm. Thus, epoll emerged, which is supported by the operating system kernel and only callbacks when connections are active. This is triggered by the operating system, but it is only supported in Linux Kernel 2.6 and later (specifically introduced in 2.5.44). However, if all connections are active, excessive use of epoll_ctl may impact performance more than polling, though the impact is minimal.

Additionally, be cautious with certain system calls related to DNS Lookup, such as gethostbyaddr/gethostbyname, as they can be quite time-consuming due to network lookups. DNS recursive queries can lead to severe timeouts, and you cannot set timeout parameters for them. To speed things up, you can configure the hosts file or manage a corresponding table in memory during program startup instead of querying it each time during runtime. Moreover, in multi-threaded environments, gethostbyname can lead to severe issues; if one thread’s gethostbyname blocks, all other threads will also block at that point, which can be frustrating.

3.4 System Tuning

⑴ I/O Models

We previously discussed the three system calls: select/poll/epoll. We know that in Unix/Linux, all devices are treated as files for I/O, so these three operations should be considered I/O-related system calls. When discussing I/O models, this is crucial for our I/O performance. The classic I/O methods in Unix/Linux include (for more on Linux I/O models, you can read the article “Significantly Improving Performance with Asynchronous I/O”):

  • The first is synchronous blocking I/O, which needs no further explanation.

  • The second is synchronous non-blocking I/O, achieved by setting O_NONBLOCK using fcntl.

  • The third involves select/poll/epoll, which are non-blocking for I/O but block on events, making them I/O asynchronous and event-synchronous calls.

  • The fourth is AIO (Asynchronous I/O), a model that processes I/O in parallel. I/O requests return immediately, indicating that the request has been successfully initiated. While the I/O operation completes in the background, the application receives notifications through one of two methods: generating a signal or executing a callback function based on threads to complete the I/O processing.

The fourth model has no blocking at all, whether for I/O or event notification, allowing the CPU to be fully utilized. Compared to the second model (synchronous non-blocking), the advantage is that there is no need to poll repeatedly. Nginx is highly efficient because it uses epoll and AIO for its I/O.

Now, let’s discuss the I/O models in Windows:

  • One is the WriteFile system call, which can be either synchronous blocking or synchronous non-blocking, depending on whether the file is opened as Overlapped. For synchronous non-blocking, you need to set the last parameter as Overlapped. Microsoft calls this Overlapped I/O, and you need to use WaitForSingleObject to determine if the write is complete. The performance of this system call is predictable.

  • The other system call is WriteFileEx, which can achieve asynchronous I/O and allows you to pass in a callback function, which is invoked once the I/O is complete. However, this callback process is placed in the APC (Asynchronous Procedure Calls) queue by Windows, and will only be called when the application’s current thread becomes alterable, which can be a lengthy process.

  • Another is IOCP (I/O Completion Port), which places I/O results in a queue. However, the thread that listens to this queue is not the main thread; instead, it is a dedicated thread or multiple threads (older platforms require you to create threads manually, while newer platforms allow you to create a thread pool). IOCP is a thread pool model, similar to the AIO model in Linux, but the implementation and usage methods differ.

Ultimately, improving I/O performance involves minimizing the number of I/O operations with external devices, ideally eliminating them altogether. For reads, memory caching can significantly enhance performance since memory is much faster than external devices. For writes, caching data to be written can reduce the number of writes, but this introduces the challenge of real-time responsiveness, meaning latency will increase. We need to balance the number of writes with the corresponding response times.

⑵ Multi-core CPU Tuning

Regarding multi-core CPU technology, we know that CPU 0 is critical. If CPU 0 is heavily utilized, the performance of other CPUs may decline since CPU 0 is responsible for load balancing. Therefore, we should not rely solely on the operating system for load balancing; we understand our programs better. We can manually assign CPU cores to processes, ensuring that CPU 0 is not overburdened and that critical processes do not compete with other processes.

  • For Windows, we can set and limit which cores a process can run on via the “Set Affinity” option in the right-click menu of “Processes” in the Task Manager.

  • For Linux, we can use the taskset command to set this (taskset ships with util-linux on modern distributions; on older ones it was packaged as schedutils, installable via apt-get install schedutils), as sketched below.
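
For example (a sketch; the core list and the PID 1234 are placeholders):

taskset -c 2,3 ./myprogram arg1    # start a program pinned to cores 2 and 3
taskset -cp 2,3 1234               # re-pin the already running process with PID 1234
taskset -cp 1234                   # show which cores PID 1234 is currently allowed to run on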

Multi-core CPUs also utilize NUMA (Non-Uniform Memory Access) technology. Traditional multi-core processing uses SMP (Symmetric Multi-Processor) mode, where multiple processors share a centralized memory and I/O bus. This leads to memory access consistency issues, which can affect performance. In NUMA mode, processors are divided into multiple nodes, each with its own local memory space. For more details on NUMA technology in Linux, you can read the article “NUMA Technology in Linux.” The command for NUMA tuning in Linux is numactl. For instance, the command below runs the command “myprogram arg1 arg2” on node 0, allocating memory from nodes 0 and 1:

numactl --cpubind=0 --membind=0,1 myprogram arg1 arg2

This command is not optimal since it accesses memory across two nodes, which is undesirable. The best approach is to allow the program to access memory only from the node where it runs, such as:

numactl --membind 1 --cpunodebind 1 --localalloc myapplication

⑶ File System Tuning

Regarding the file system, it also has cache, so to maximize file system performance, the first priority is to allocate sufficient memory. This is crucial; in Linux, you can use the free command to check free/used/buffers/cached memory. Ideally, buffers and cached memory should be around 40%. Next, a fast disk controller is essential, with SCSI being much better. The fastest is Intel SSDs, which are incredibly fast but have limited write cycles.

Next, we can optimize file system configurations. For Linux’s Ext3/4 file systems, one parameter that helps in nearly all cases is disabling file system access time updates: check your file system’s entry in /etc/fstab for the noatime option (it should generally be present). Additionally, the delalloc (delayed allocation) option lets the system decide at the last moment which blocks to use for a write, optimizing the write process. Pay attention to the three journaling modes: data=journal, data=ordered, and data=writeback. The default setting, data=ordered, provides the best balance between performance and protection.
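
For instance, a typical ext4 entry in /etc/fstab with access-time updates disabled might look like the line below (the UUID and mount point are placeholders), and the option can also be applied without a reboot:

UUID=xxxx-xxxx   /data   ext4   defaults,noatime   0   2
mount -o remount,noatime /data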

Here is a command for viewing I/O in Linux—iotop, which allows you to see the disk read and write load for each process.

There are also optimizations for NFS and XFS; you can search for related articles on Google for more information. For various file systems, refer to the article “Linux Log File Systems and Performance Analysis.”

3.5 Database Tuning

Database tuning is not my strong suit, but I will share my limited knowledge. Note that the following points may not be universally applicable, as different business scenarios and database designs can yield completely opposite conclusions. Thus, I will provide some general guidelines, but specific issues must be analyzed in context.

⑴ Database Engine Tuning

I am not an expert in database engines, but there are a few aspects I believe are essential to understand.

The way databases handle locks is extremely important. In concurrent situations, locks significantly affect performance. Various isolation levels, row locks, table locks, page locks, read/write locks, transaction locks, and various mechanisms prioritizing either reads or writes all play a role. The highest performance is achieved without locks, so partitioning databases and tables, using redundant data, and minimizing consistency transactions can effectively enhance performance. NoSQL sacrifices consistency and transactions in favor of redundancy, achieving distributed systems and high performance.

Understanding the storage mechanism of databases is equally important. It is vital to comprehend how various field types are stored, how data is partitioned, and how it is managed, such as Oracle’s data files, tablespaces, segments, etc. Understanding this mechanism can alleviate much I/O load. For example, in MySQL, you can use show engines; to see the support for various storage engines, each with different focuses, which can lead to varying performance based on business needs or database design.

Database distributed strategies are also crucial. The simplest methods involve replication or mirroring, requiring knowledge of distributed consistency algorithms, master-master synchronization, and master-slave synchronization. Understanding the mechanisms behind these technologies enables horizontal scaling at the database level.

⑵ SQL Statement Optimization

For SQL statement optimization, the first step is to use tools like MySQL SQL Query Analyzer, Oracle SQL Performance Analyzer, or Microsoft SQL Query Analyzer. Virtually all RDBMS systems offer such tools to help identify performance issues in your SQL applications. You can also use the explain command to examine the execution plan for your SQL statements.

Another important point is that various database operations require a significant amount of memory, so the server’s memory must be sufficient, especially when dealing with multi-table queries.

Based on my limited SQL knowledge, here are a few SQL statements that may lead to performance issues:

Full table scans. For example: select * from user where lastname = “xxxx”; such SQL statements generally result in full table scans, leading to linear O(n) complexity. The more records there are, the worse the performance (e.g., searching 100 records takes 50ms, while searching one million records takes 5 minutes). In this case, we can improve performance in two ways: by partitioning tables to reduce record counts or by creating indexes (e.g., indexing the lastname field). Indexes function like key-value data structures, where the key is the field after where, and the value is the physical row number. The complexity of searching an index is generally O(log(n)), achieved through B-Tree indexing (e.g., searching 100 records takes 50ms, while searching one million records takes 100ms).

Indexes. For indexed fields, avoid performing calculations, type conversions, functions, null checks, or field concatenations on these fields, as these operations can degrade index performance. Generally, indexes appear in WHERE or ORDER BY clauses, so avoid performing calculations within WHERE and ORDER BY clauses, or using NOT, or functions.

Multi-table queries. The most common operations in relational databases involve multi-table queries, primarily using the EXISTS, IN, and JOIN keywords (for the various JOIN types, refer to the illustrated SQL JOIN article). Modern database engines are generally good at optimizing SQL statements, and although EXISTS and IN are not always semantically equivalent (for example, NOT IN and NOT EXISTS behave differently when NULLs are involved), their performance is usually similar. Some argue that EXISTS performs better than IN, and IN better than JOIN; however, this depends on your data, schema, and SQL statement complexity. For simple cases they are generally comparable, so avoid excessive nesting and keep SQL statements simple. It is better to use several simpler SQL statements than one enormous nested one.

JOIN operations. It is said that the order of JOIN operations can affect performance, but as long as the results of the JOIN are the same, performance is generally unaffected, because the database engine reorders them during optimization. There are three common JOIN implementation algorithms: nested loop, sort-merge, and hash join (MySQL traditionally supported only the first; a hash join implementation was added in MySQL 8.0.18).

The nested loop is akin to common multi-level nested loops. Remember that the database indexing algorithm uses B-Tree, which is O(log(n)). Thus, the overall algorithm complexity should be O(log(n)) * O(log(m)).

Hash joins aim to address the O(log(n)) complexity of nested loops by using a temporary hash table.

Sorted merge means that both tables are sorted by the query field and then merged. Naturally, indexed fields are usually sorted.

As always, the best method depends on the data and SQL statements you are working with.

Partial result sets. In MySQL, the LIMIT keyword, Oracle’s rownum, and SQL Server’s TOP are used to limit the number of returned results. This gives the database engine ample optimization space. Generally, returning the top n records requires using ORDER BY. Ensure that indexes are created on the ORDER BY fields. Having indexed ORDER BY will prevent the performance of your select statement from being affected by the number of records. This technique is typically displayed on the frontend via pagination, with MySQL using OFFSET and SQL Server using FETCH NEXT. This fetch method is not ideal due to linear complexity, so if we can determine the starting value for the second page of the ORDER BY field, we can directly use a >= expression in the WHERE clause to select. This technique is known as seeking rather than fetching, and seeking is significantly more efficient than fetching.

  • Strings. As mentioned earlier, string operations are extremely performance-consuming, so use integers whenever possible, such as timestamps or employee IDs.

  • Full-text searches. Avoid using LIKE for full-text searches. If you need to perform full-text searches, consider using Sphinx.

  • Others.

  • Avoid using select *; explicitly specify the fields. When multiple tables are involved, always prefix the field names with the table name to prevent the engine from needing to calculate.

  • Avoid using HAVING, as it traverses all records, leading to extremely poor performance.

  • Whenever possible, use UNION ALL instead of UNION.

  • Having too many indexes can slow down insertions and deletions. Updates that affect many indexes will also slow down performance, but updating only one index will only impact that specific index table.

  • And more.

4. Practical Techniques for Kernel Performance Tuning

4.1 Adjusting Memory-Related Parameters

In Linux systems, memory management plays a critical role in system performance. Reasonably adjusting memory-related parameters can significantly enhance the system’s operational efficiency and stability.

The vm.swappiness parameter is a key memory parameter that controls the system’s tendency to swap pages to disk swap space when memory is low. Its value ranges from 0 to 100, with a default of 60. When vm.swappiness is set high, close to 100, the system will use disk swap space more frequently. Although this can temporarily meet the system’s memory needs, disk I/O operations are much slower than memory access speeds, leading to a significant decrease in system performance. Conversely, setting vm.swappiness to a lower value, such as 10 or 20, allows the system to avoid using swap space as much as possible, prioritizing physical memory. This is particularly important for applications that require high performance and frequent memory usage.

For example, in a Linux system running a database server, if database operations are frequent and response speed is critical, setting vm.swappiness to 10 can reduce disk swapping operations, improving database read and write performance and enhancing overall system response speed. Adjusting the vm.swappiness parameter is straightforward; you can modify the /etc/sysctl.conf file to add or change the line “vm.swappiness = [desired value]” and then execute “sysctl -p” to apply the configuration.

The vm.overcommit_memory parameter controls whether the system allows memory overcommitment. It has three possible values: 0, 1, and 2. When set to 0, the default setting, the kernel attempts to estimate the remaining available memory on the system and only allows memory allocation requests if they are deemed not to cause memory shortages. When set to 1, the kernel permits overcommitting memory until physical memory is exhausted. This setting is suitable for applications with clearly estimated memory needs that are conservative in memory usage, such as scientific computing tasks where applications can accurately control their memory usage to improve computational efficiency.

When set to 2, the kernel employs a strict memory allocation algorithm to ensure that the entire memory address space (including physical memory and swap space) does not exceed “swap + 50% of RAM.” This is a very conservative setting that effectively prevents system crashes due to excessive memory allocation. Adjusting the vm.overcommit_memory parameter can also be done by modifying the /etc/sysctl.conf file, adding or changing the line “vm.overcommit_memory = [desired value],” and executing “sysctl -p” to apply the changes.
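
Putting the two parameters together, the workflow is simply the following (illustrative values only):

# /etc/sysctl.conf
vm.swappiness = 10
vm.overcommit_memory = 0

sysctl -p                       # reload the file
sysctl -w vm.swappiness=10      # or change a single value at runtime without editing the file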

4.2 Adjusting Network-Related Parameters

In today’s interconnected world, the quality of network performance directly impacts the overall system performance. By reasonably adjusting network-related kernel parameters, we can effectively enhance network throughput, reduce latency, and ensure high efficiency and stability in network communications.

The net.core.somaxconn parameter defines the maximum number of TCP connection requests queued. The default value is usually 128, which may be too small in high-concurrency network application scenarios. For instance, in a large e-commerce website server during promotional events, a massive number of users may simultaneously access the server, initiating a plethora of TCP connection requests. If the value of net.core.somaxconn remains at the default of 128, any subsequent connection requests exceeding this value may be dropped, preventing users from accessing the website normally and severely impacting user experience and business operations.

To address such high concurrency situations, we can appropriately increase the value of net.core.somaxconn based on the server’s actual performance and estimated concurrent connection numbers, such as setting it to 1024 or higher. This allows the server to accommodate more waiting connection requests, avoiding connection failures due to queue overflow. To adjust the net.core.somaxconn parameter, we can edit the /etc/sysctl.conf file, adding or modifying the line “net.core.somaxconn = [desired value],” and then executing “sysctl -p” to apply the new configuration.

The net.ipv4.tcp_syncookies parameter is a critical setting for mitigating SYN flood attacks. SYN flood attacks are a common type of network attack where attackers send a large number of forged SYN requests to the target server, exhausting the server’s connection resources and preventing it from processing legitimate connection requests. When net.ipv4.tcp_syncookies is set to 1, the system enables the syncookies mechanism. Under this mechanism, when the server receives a SYN request and finds that the SYN queue is full, it does not drop the request directly. Instead, it calculates a special cookie value based on the information in the received SYN packet and sends it as the sequence number in the SYN + ACK packet back to the client.

The client, upon receiving the SYN + ACK packet, includes this cookie value in the ACK packet returned to the server. The server verifies the cookie value to confirm the legitimacy of the connection request, effectively defending against SYN flood attacks without consuming excessive system resources. Like the previous parameter adjustment methods, we can modify the /etc/sysctl.conf file to add or change the line “net.ipv4.tcp_syncookies = 1” and execute “sysctl -p” to apply the settings.
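
A minimal sketch of the two settings discussed above, plus the half-open SYN queue length that is usually raised alongside them (net.ipv4.tcp_max_syn_backlog is an additional suggestion, not covered above):

# /etc/sysctl.conf
net.core.somaxconn = 1024
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 2048    # length of the half-open (SYN_RECV) queue

sysctl -p

Note that net.core.somaxconn only raises the kernel-side ceiling; the application must also pass a correspondingly large backlog value to its listen() call.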

4.3 Adjusting File System-Related Parameters

The file system is the core of data storage and management in Linux systems, and its performance directly affects the efficiency of file read and write operations, thereby influencing the overall system speed. By optimizing file system-related kernel parameters, we can significantly enhance file system performance to meet high-efficiency demands in various application scenarios.

The fs.file-max parameter specifies the maximum number of file handles that can be opened across all processes in the system. A file handle is a resource used by the system to identify and manage open files, and each process needs to obtain the corresponding file handle while performing file operations. In large-scale data processing applications, such as a data warehouse system, it may be necessary to simultaneously process a large number of files, including reading data files for analysis and writing result files. If the value of fs.file-max is set too low, subsequent file opening operations will fail once the number of opened file handles reaches this limit, preventing the application from functioning normally.

To ensure that such applications can proceed smoothly, we need to increase the value of fs.file-max based on actual business needs and system resource conditions. For example, if the system has ample memory resources and is expected to open tens of thousands of file handles during peak periods, we can set fs.file-max to a large value, such as 1048576. To adjust the fs.file-max parameter, we need to add or modify the line “fs.file-max = [desired value]” in the /etc/sysctl.conf file and execute “sysctl -p” to apply the new configuration, allowing the system to support more file handle opening operations.
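A minimal sketch of the adjustment (1048576 is the example value used above; pick a limit that matches your memory budget):

```bash
# System-wide limit, and current usage: allocated handles, free handles, maximum.
cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr

# Raise the system-wide limit and persist it.
echo 'fs.file-max = 1048576' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Each process is still bound by its own descriptor limit (ulimit -n /
# limits.conf), which usually needs to be raised as well.
ulimit -n
```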

The fs.aio-max-nr parameter mainly controls the maximum number of concurrent asynchronous I/O requests allowed in the system. Asynchronous I/O is an efficient file I/O operation method that allows applications to continue executing other tasks after initiating I/O requests without waiting for them to complete, thereby improving the system’s concurrent processing capability and overall performance. In scenarios requiring high performance for I/O operations, such as database read and write operations and real-time processing of big data, a large number of concurrent asynchronous I/O requests can fully utilize system resources and accelerate data transfer speeds. If the value of fs.aio-max-nr is set too low, the number of asynchronous I/O requests the system can handle simultaneously will be limited, preventing it from fully exploiting the advantages of asynchronous I/O.

For instance, in a high-performance database server, it may be necessary to handle thousands of concurrent asynchronous I/O requests to meet the read and write demands of numerous users. In this case, setting fs.aio-max-nr to a large value, such as 102400, can ensure the system has sufficient capacity to manage these concurrent requests, improving the database’s response speed and throughput. Adjusting the fs.aio-max-nr parameter also requires editing the /etc/sysctl.conf file to add or modify the line “fs.aio-max-nr = [desired value],” followed by executing “sysctl -p” to enable the system to manage concurrent asynchronous I/O requests as per the new configuration.
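A short sketch of the same procedure for asynchronous I/O (102400 is the example value from above):

```bash
# Current ceiling on concurrent AIO requests, and how many are in use.
cat /proc/sys/fs/aio-max-nr
cat /proc/sys/fs/aio-nr

# Raise the ceiling for an I/O-heavy database host and persist it.
echo 'fs.aio-max-nr = 102400' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```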

5. In-Depth Analysis of Tuning Cases

5.1 Case Background Introduction

An online education platform has experienced explosive user growth due to rapid business expansion. The system, which previously operated smoothly, began to reveal performance issues under high-concurrency access pressure. Users reported frequent buffering while watching course videos, slow video loading times, and occasional long delays in loading. Response times for course interactions, such as submitting assignments and participating in discussions, also noticeably lengthened, severely affecting the user learning experience.

The platform’s servers are built on the Linux system using a typical LAMP architecture (Linux + Apache + MySQL + PHP). Faced with escalating performance challenges, the platform’s technical team decided to conduct a thorough investigation and optimize the Linux kernel to enhance overall system performance and stability.

5.2 Problem Diagnosis Process

The technical team began by analyzing the system’s operational logs in detail. By reviewing the Apache server logs, they discovered numerous timeout records, indicating that the server struggled to process user requests promptly. Simultaneously, the MySQL database logs contained several slow query records, suggesting that database query performance might be affected.

To further pinpoint performance bottlenecks, the team used the top command to monitor the system’s resource usage in real-time. They found that CPU usage remained high, especially during peak user access periods, nearly reaching 100%. Analyzing the output from the top command revealed that some processes related to video processing and database queries consumed significant CPU resources.

Next, the team utilized the iostat command to check disk I/O conditions. The output indicated that disk read and write speeds were slow, particularly when reading video files, with the disk’s busy percentage (% util) approaching 100%, suggesting that disk I/O might be a performance bottleneck.

In terms of network performance, the team used the iftop command to monitor bandwidth usage. They found that during high-concurrency periods the network link was close to saturation, with video transmission consuming most of the bandwidth and crowding out requests from other parts of the platform.
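The commands below are a minimal sketch of the kind of monitoring the team describes; the sampling interval and the interface name eth0 are illustrative assumptions:

```bash
# CPU: interactive per-process view; press 1 inside top to see per-core load.
top

# Disk: extended per-device statistics every 2 seconds (sysstat package);
# watch the %util and await columns for saturated devices.
iostat -x 2

# Network: per-connection bandwidth on a given interface.
sudo iftop -i eth0
```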

5.3 Implementation of Tuning Measures

In response to the identified issues, the technical team implemented a series of targeted tuning measures.

For CPU optimization, they refined the video-processing code paths, adjusting algorithms and logic to eliminate unnecessary computation, and parallelized the work across the CPU’s cores so that tasks ran concurrently, improving overall CPU utilization.

To address the disk I/O issue, they replaced the disk storing video files with a higher-performance solid-state drive (SSD), significantly enhancing read and write speeds. Additionally, they optimized the database’s query statements to minimize unnecessary disk access and added appropriate indexes to accelerate data retrieval.

In terms of network optimization, they switched video delivery to streaming, transmitting content in segments and caching it to reduce peak bandwidth usage. They also adjusted network-related kernel parameters, such as increasing net.core.somaxconn to enhance the server’s capacity to handle concurrent connections.

5.4 Display of Tuning Results

After implementing the series of tuning measures, the system’s performance improved significantly. Video loading speeds increased noticeably, buffering issues virtually disappeared, and users enjoyed a smooth experience while watching course videos. Response times for course interactions, such as submitting assignments and participating in discussions, were greatly reduced, allowing users to receive timely feedback.

From a performance metrics perspective, system throughput improved significantly, enabling the handling of more user requests in the same timeframe. System latency also decreased markedly, with average response times dropping from several seconds to under 1 second. CPU usage during high concurrency situations remained within a reasonable range, avoiding prolonged periods of full load. Disk I/O performance improved dramatically, with disk busy percentages (% util) consistently remaining at lower levels. Network bandwidth usage became more balanced, ensuring effective support for network requests across various business operations.

Through this Linux kernel performance tuning, the online education platform successfully met the challenges posed by increased business volume, providing users with higher-quality services while laying a solid foundation for the platform’s continued development.
