Linux Network Optimization: A Systematic Guide from Hardware to Kernel

  • 1. Introduction
  • 2. Hardware Layer Optimization
    • 1. Multi-Queue
    • 2. Ring Buffer
    • 3. Network Card Offload Feature Tuning
    • 4. Jumbo Frames (MTU)
  • 3. Kernel Network Stack Optimization
    • 1. Socket Buffer
    • 2. Backlog Queue
    • 3. TIME_WAIT and Short Connection Optimization
    • 4. TCP Congestion Control Algorithms
  • 4. Firewall and NAT Layer Optimization
  • 5. Conclusion

1. Introduction

By default, Linux network configurations prioritize “generality” and compatibility, without specific optimizations for high throughput and high concurrency scenarios. However, in 10 Gigabit internal networks, border gateways, high-performance virtualization platforms, etc., these “default values” often become performance bottlenecks.

This article works through actionable optimizations layer by layer, covering the hardware layer, the kernel network stack, and the NAT/firewall layer, to help you get the most out of Linux network performance.

2. Hardware Layer Optimization

1. Multi-Queue

  • Significance: Distributing packet sending and receiving across multiple queues increases parallelism and avoids single-queue bottlenecks.
# Check the maximum and currently configured queue (channel) counts for a network card
ethtool -l <nic>

# Set the number of combined queues for a network card
ethtool -L <nic> combined <number>

Running ethtool -l on the PVE host shows that the network card supports up to 32 queues, but only 20 are enabled. We tried to raise the count to 32 with ethtool -L, but the change did not take effect. The host has 20 logical CPU cores, and the number of queues on a network card typically does not exceed the number of available cores: the multi-queue mechanism binds each queue's interrupt to its own CPU core to process packets in parallel and balance load, so the upper limit is capped to avoid wasting resources.

In practice, set the number of queues according to the number of logical cores so that interrupts are spread evenly, which improves concurrent packet processing.
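The sizing rule above can be sketched as a pair of small shell helpers. This is only a sketch under stated assumptions: the function names max_combined and pick_queue_count are made up for illustration, and the parser assumes the "Pre-set maximums" / "Combined:" layout that ethtool -l commonly prints, which can differ between drivers and ethtool versions (some report "n/a").

```shell
# Hypothetical helpers; the names are illustrative and not part of ethtool.

# Read `ethtool -l <nic>` output on stdin and print the maximum number of
# combined queues the NIC advertises.
max_combined() {
    awk '
        /^Pre-set maximums/          { in_max = 1; next }
        /^Current hardware settings/ { in_max = 0 }
        in_max && $1 == "Combined:"  { print $2 }
    '
}

# Print the smaller of the NIC maximum ($1) and the logical CPU count ($2).
pick_queue_count() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Example usage (needs root; eth0 is a placeholder interface name):
#   n=$(pick_queue_count "$(ethtool -l eth0 | max_combined)" "$(nproc)")
#   ethtool -L eth0 combined "$n"
```

With the values from the host above (maximum 32, 20 logical cores), the helper would pick 20, which matches what the system enforced.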

2. Ring Buffer

  • Significance: Enlarging the queue buffer reduces burst packet loss.

In the network card driver, the Ring Buffer is a circular structure used to hold network packets, divided into receive queues (RX) and transmit queues (TX). Each queue consists of multiple fixed-size buffer slots; packets are placed in these slots on their way between the hardware and the kernel protocol stack.

The Ring Buffer size determines how well the network card copes with pressure in high concurrency scenarios. If the queue is too small, the buffer may fill up during sudden traffic spikes, leading to packet loss; conversely, setting it too large may increase memory usage and latency. Adjusting the RX/TX queue sizes sensibly is one of the important means of high-performance network tuning, especially for 10 Gigabit networks, virtualization platforms, and other high-throughput scenarios.

# Check the supported and configured RX/TX ring parameter sizes for a specific network card
ethtool -g <nic>

# Adjust the RX/TX ring parameter sizes for a specific network card
ethtool -G <nic> rx <number> tx <number>

On the current PVE host, ethtool -g reports a maximum supported ring buffer size of 8160, while the configured value was only 512. To increase buffering capacity under high concurrency, we raised the ring buffer size from 512 to 8160 with ethtool -G. This reduces the risk of packet loss during sudden traffic peaks and improves network stability and throughput.
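As a rough illustration of the adjustment described above, the helper below reads ethtool -g output and prints, without running it, the matching ethtool -G command. The function name ring_resize_cmd is hypothetical, and the parser assumes the common "Pre-set maximums" / "RX:" / "TX:" layout, which may vary by driver and ethtool version.

```shell
# Hypothetical helper: read `ethtool -g <nic>` output on stdin and print
# (not execute) the ethtool -G command that raises RX/TX to the advertised
# maximum. $1 is the interface name to put into the printed command.
ring_resize_cmd() {
    nic=$1
    awk -v nic="$nic" '
        /^Pre-set maximums/          { in_max = 1; next }
        /^Current hardware settings/ { in_max = 0 }
        in_max && $1 == "RX:"        { rx = $2 }
        in_max && $1 == "TX:"        { tx = $2 }
        END {
            # Only print a command when both maximums are plain numbers.
            if (rx ~ /^[0-9]+$/ && tx ~ /^[0-9]+$/)
                printf "ethtool -G %s rx %s tx %s\n", nic, rx, tx
        }
    '
}

# Example usage (eth0 is a placeholder interface name):
#   ethtool -g eth0 | ring_resize_cmd eth0
```

Printing the command instead of executing it leaves room to review it first. Comparing the drop counters from ip -s link show <nic> before and after the change is one way to judge whether the larger ring actually helps.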

3. Network Card Offload Feature Tuning

The default Maximum Transmission Unit (MTU) for Ethernet is 1500 bytes, which limits the size of each transmitted frame. For example, a 3200-byte payload must be split into at least three packets, and each packet adds header and processing overhead. With Tx/Rx offload features enabled, the protocol stack can handle large packets of up to 64 KiB and leave segmentation and reassembly to the network card or to a late stage of the stack, which reduces interrupts and per-packet overhead and improves throughput.

| Offload Type | Description | Effect of Enabling | Recommended Scenarios |
| --- | --- | --- | --- |
| GSO (Generic Segmentation Offload) | Keeps large packets intact through the protocol stack and segments them in software just before they are handed to the driver. | Reduces CPU load and increases throughput. | Enabled by default, usually no changes needed. |
| TSO (TCP Segmentation Offload) | Delegates segmentation of large TCP data to the hardware. | Reduces kernel processing costs and increases sending efficiency. | Suitable for large data transmission scenarios. |
| GRO (Generic Receive Offload) | Aggregates multiple small received packets into a large packet, merged by the kernel. | Reduces protocol stack load and increases receiving efficiency. | Enabled by default. |
| LRO (Large Receive Offload) | The network card or its driver merges multiple received packets into a large packet (before the kernel). | Reduces the number of packets the protocol stack has to process. | Not suitable for bridging, virtualization, or networks with kernel forwarding enabled; recommended to disable. |
| Checksum Offload (Tx/Rx) | Calculates and verifies checksums in hardware during sending or receiving. | Reduces the CPU checksum burden. | Most network cards support and enable this by default. |

  • Most offloads are enabled by default in modern kernels and require no manual intervention.
  • LRO is not suitable for bridging or virtualization scenarios (such as KVM/container networks): merged packets lose their original boundaries and cannot be forwarded correctly, so it is recommended to disable it there.
  • If you need to debug network issues or investigate forwarding anomalies, you can temporarily disable related offload features to assist in troubleshooting.
  • Use ethtool -k <nic> to check which offloads the network card supports and their current status.
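To make the check in the last bullet easier to read, the sketch below filters ethtool -k output down to the offloads discussed in this section. The function name offload_summary is hypothetical; the feature names are the ones ethtool commonly prints, but drivers may expose a different set.

```shell
# Hypothetical helper: filter `ethtool -k <nic>` output (stdin) down to the
# offloads discussed above and print "feature state" pairs.
offload_summary() {
    awk -F': *' '
        $1 ~ /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|large-receive-offload|rx-checksumming|tx-checksumming)$/ {
            split($2, state, " ")   # drop a trailing "[fixed]" marker
            print $1, state[1]
        }
    '
}

# Example usage (eth0 is a placeholder interface name):
#   ethtool -k eth0 | offload_summary
# Disable LRO on a bridging or virtualization host (needs root):
#   ethtool -K eth0 lro off
```

Note that the lowercase -k only reads the settings, while the uppercase -K changes them.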

4. Jumbo Frames (MTU)

  • Significance: In high throughput environments, raising the MTU to 9000 reduces per-packet CPU overhead; every device on the path (NICs, switches, peers) must support the larger frame size.
ip link set dev <nic> mtu 9000
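
Because jumbo frames only work when the whole path supports them, it helps to verify the path after changing the MTU. The sketch below computes the largest ICMP payload that fits in one unfragmented IPv4 packet; the function name icmp_payload_for_mtu and the peer address in the usage comment are illustrative, and the arithmetic assumes an IPv4 header without options.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Example usage (10.0.0.2 is a placeholder for a peer on the jumbo-frame path):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" 10.0.0.2
# -M do sets the Don't Fragment bit, so the ping fails if any hop on the
# path cannot carry the full-size frame.
```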

3. Kernel Network Stack Optimization

1. Socket Buffer

  • Significance: Increasing the buffer can enhance the maximum value of the TCP sliding window, allowing full utilization of bandwidth in high latency + high bandwidth networks.
# Check the default socket buffer range (in bytes)
sysctl net.core.rmem_max        # Maximum receive buffer
sysctl net.core.wmem_max        # Maximum send buffer
sysctl net.ipv4.tcp_rmem        # TCP receive buffer triplet: min default max
sysctl net.ipv4.tcp_wmem        # TCP send buffer triplet: min default max

# Adjust socket buffer range
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
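
How large the maximum needs to be depends on the bandwidth-delay product (BDP) of the path: the window must hold at least one round trip's worth of data to keep the link full. A minimal sketch of that arithmetic follows; the function name bdp_bytes is hypothetical.

```shell
# Bandwidth-delay product in bytes.
#   $1 = link bandwidth in Mbit/s, $2 = round-trip time in milliseconds
# bytes = (Mbit/s * 10^6 / 8) * (ms / 1000) = Mbit/s * ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Example usage: a 10 Gbit/s path with 100 ms RTT
#   bdp_bytes 10000 100
```

For a 10 Gbit/s path with a 100 ms round trip this gives 125000000 bytes, which is just under the 134217728 (128 MiB) maximum used in the example values above. Paths with a smaller BDP can use a correspondingly smaller maximum.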

2. Backlog Queue

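When the server receives a large number of concurrent connection requests, the Linux kernel uses two queues to handle them:

  • SYN queue (half-open queue): connections that have sent a SYN but have not yet completed the three-way handshake;
  • Accept queue (fully established queue): connections that have completed the handshake and are waiting for the program to accept() them.

For a socket in the LISTEN state, ss reports the accept queue in two columns:

  • Recv-Q: Number of connections currently waiting for the program to accept();
  • Send-Q: Maximum number of connections allowed to queue (i.e., the backlog).

If Recv-Q is close to or equal to Send-Q, the accept queue is full, and new connections will be dropped by the system.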

You can view each listening port's queue status in real time with ss -lnt:

root@pve:~# ss -lnt
State              Recv-Q             Send-Q                           Local Address:Port                           Peer Address:Port
LISTEN             0                  2048                                   0.0.0.0:9221                                0.0.0.0:*

Optimization methods:

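# Raise the accept-queue ceiling and the SYN (half-open) queue limit to reduce dropped connections
# somaxconn only caps the backlog passed to listen(); the application must also request a larger value
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_max_syn_backlog=4096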
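
To spot listeners that are approaching their limit, the Recv-Q/Send-Q comparison described above can be scripted. The sketch below is one possible way to do it; the function name busy_listeners and the 80% default threshold are arbitrary choices for illustration, and it assumes the column order shown in the ss -lnt output above.

```shell
# Hypothetical helper: read `ss -lnt` output on stdin and print listeners
# whose accept queue (Recv-Q) is at or above a percentage of the backlog
# (Send-Q).
#   $1 = threshold in percent (default 80)
busy_listeners() {
    awk -v pct="${1:-80}" '
        $1 == "LISTEN" && $3 > 0 && $2 * 100 >= $3 * pct {
            print $4, $2 "/" $3
        }
    '
}

# Example usage:
#   ss -lnt | busy_listeners 80
# Cumulative overflow counters since boot:
#   nstat -az TcpExtListenOverflows TcpExtListenDrops
```

The ss snapshot only shows the queue at one instant, so the nstat counters are a useful complement: if they keep growing, connections are being dropped even when a single snapshot looks healthy.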

Note: Performance tuning should be a gradual process. Different devices, drivers, and business scenarios vary greatly, and any parameter changes should first be validated in a test environment or during low-peak periods (monitoring Recv-Q/Send-Q, CPU, latency, etc.), gradually increasing the load while retaining rollback plans and configuration backups. This article only provides ideas and example values; actual deployment should be based on observed data.
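
One simple way to keep a rollback point, as suggested in the note above, is to record the current values before changing anything. The sketch below reads them directly from /proc/sys and prints them in the key = value form that sysctl -p accepts. The function name snapshot_sysctls, the PROC_SYS override (present only so the helper can be tried against a test directory), and the file path in the usage comment are all illustrative assumptions; keys that contain dots inside an interface name are not handled.

```shell
# Hypothetical helper: print the current values of the given sysctl keys in
# "key = value" form, read straight from /proc/sys.
# PROC_SYS is an illustrative override used only for testing.
snapshot_sysctls() {
    root=${PROC_SYS:-/proc/sys}
    for key in "$@"; do
        # net.core.somaxconn -> net/core/somaxconn
        path="$root/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            # Turn tabs into spaces so multi-value keys such as tcp_rmem
            # stay readable on one line.
            printf '%s = %s\n' "$key" "$(tr '\t' ' ' < "$path")"
        fi
    done
}

# Example usage (the file path is a placeholder):
#   snapshot_sysctls net.core.somaxconn net.ipv4.tcp_max_syn_backlog > /root/sysctl-before.conf
#   ...apply and test changes...
#   sysctl -p /root/sysctl-before.conf    # roll back to the saved values
```

Values set with sysctl -w are lost at reboot; once a setting has been validated, it can be made persistent by placing it in a file under /etc/sysctl.d/.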

3. TIME_WAIT and Short Connection Optimization

To reduce the buildup of TIME_WAIT sockets caused by a large number of short connections:

sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=10
  • net.ipv4.tcp_tw_reuse:
    • Effect: Allows a socket in TIME_WAIT state to be reused for a new outgoing connection when it is safe to do so (this relies on TCP timestamps, net.ipv4.tcp_timestamps=1). This improves port reuse in high concurrency short connection scenarios on the side that initiates connections; it does not affect incoming connections.
    • Default value: 0 (disabled) on older kernels; 2 (loopback traffic only) on recent kernels
  • net.ipv4.tcp_fin_timeout:
    • Effect: Specifies how long (in seconds) an orphaned connection is kept in FIN-WAIT-2 state while waiting for the peer's FIN. A smaller value releases such sockets sooner. It does not change the length of the TIME_WAIT state itself.
    • Default value: 60

Note: These two parameters are usually tuned when high concurrency and large numbers of short connections leave so many sockets in TIME_WAIT that ports run out. Tune them cautiously, based on actual connection lifecycles and the service protocol (for example, whether it supports connection reuse).
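
Before changing these parameters, it is worth checking whether TIME_WAIT is actually a problem on the host. The sketch below counts TIME-WAIT sockets and computes the size of the ephemeral port range for a rough comparison. The function names are hypothetical, and the comparison is only approximate because ports are consumed per destination address and port, not globally.

```shell
# Count TIME-WAIT sockets from `ss -tan` output on stdin.
count_time_wait() {
    awk '$1 == "TIME-WAIT" { n++ } END { print n + 0 }'
}

# Number of ephemeral ports in a range, such as the two values stored in
# /proc/sys/net/ipv4/ip_local_port_range.
#   $1 = first port, $2 = last port
ephemeral_port_count() {
    echo $(( $2 - $1 + 1 ))
}

# Example usage:
#   ss -tan | count_time_wait
#   ephemeral_port_count $(cat /proc/sys/net/ipv4/ip_local_port_range)
```

If the TIME-WAIT count stays far below the number of ephemeral ports, port exhaustion is unlikely to be the bottleneck and the defaults may be fine.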

4. TCP Congestion Control Algorithms

Note: BBR has been part of the mainline kernel since Linux 4.9, but it is not enabled by default and must be switched on manually.

sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
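
Setting the algorithm fails if the kernel does not currently offer it, so a quick availability check can come first. The sketch below tests whether an algorithm appears in the kernel's list of available congestion control algorithms; the function name has_cc is hypothetical, and loading the tcp_bbr module assumes BBR was built as a module on your kernel.

```shell
# Return success if algorithm $1 appears in the space-separated list $2,
# e.g. the contents of /proc/sys/net/ipv4/tcp_available_congestion_control.
has_cc() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example usage (loading the module needs root):
#   avail=$(cat /proc/sys/net/ipv4/tcp_available_congestion_control)
#   has_cc bbr "$avail" || modprobe tcp_bbr
#   sysctl net.ipv4.tcp_congestion_control    # confirm the active algorithm
```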

4. Firewall and NAT Layer Optimization

Connection tracking tuning:

sysctl -w net.netfilter.nf_conntrack_max=262144
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=600
  • net.netfilter.nf_conntrack_max: Sets the maximum number of connections allowed in the conntrack table.
  • net.netfilter.nf_conntrack_tcp_timeout_established: Sets the timeout, in seconds, for tracked TCP connections in the ESTABLISHED state. The default is 432000 (5 days), so 600 shortens it considerably.

Applicable scenarios:

  • High concurrency gateways, NAT, firewalls;
  • Medium to large Kubernetes cluster nodes, etc.

Note: Evaluate tuning together with available memory and the type of traffic; a larger nf_conntrack table consumes more memory. Monitor usage regularly: sysctl net.netfilter.nf_conntrack_count shows the current number of entries, and cat /proc/net/nf_conntrack or conntrack -L lists them.
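
For regular monitoring, the ratio of current entries to the limit is more telling than either number alone. A minimal sketch of that calculation follows; the function name conntrack_usage_pct is hypothetical.

```shell
# Conntrack table usage as a whole-number percentage (rounded down).
#   $1 = current entries (nf_conntrack_count), $2 = limit (nf_conntrack_max)
conntrack_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Example usage:
#   conntrack_usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                       "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

When the table fills up, the kernel logs "nf_conntrack: table full, dropping packet", so a usage figure that stays close to 100 is a sign that the limit or the timeouts need attention.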

5. Conclusion

  • Hardware Layer (ethtool): Multi-Queue, Ring Buffer, Offload, MTU
  • Kernel Stack: Socket Buffer, Backlog, TIME_WAIT, Congestion Control
  • Firewall/NAT: nf_conntrack table and timeout management
  • Systematic tuning approach: Observe first → Layered optimization → Test and validate by scenario
