Performance Optimization of the Linux Network Stack: A Complete Link from Packet Reception to Application Layer

The Ultimate Pursuit of Network Performance: In the era of cloud-native and 5G, network performance directly determines the user experience and commercial value of applications. This article delves into the Linux 5.15.4 kernel network subsystem, revealing the secrets of performance optimization from hardware interrupts to application layer reception, aiding in the construction of high-performance network applications.

🎯 The Commercial Value of Network Performance

Cost Analysis of Network Latency

Real Data:

E-commerce Platforms: A 100ms increase in network latency results in a 1% drop in conversion rate, leading to annual losses of tens of millions.
Financial Transactions: A 1ms increase in latency results in a 10% decrease in high-frequency trading profits.
Online Gaming: Latency exceeding 50ms increases user churn rate by 15%.
Video Streaming: Network jitter causing stuttering results in a 30% drop in audience retention.

A Case Study of a Leading E-commerce Platform: Through network stack optimization, the average response time was reduced from 200ms to 50ms during the Double 11 shopping festival:

Page load speed improved by 75%
User conversion rate increased by 12%
Server costs reduced by 30%
Total GMV increased by 8%, valued at over 5 billion

🏗️ In-Depth Analysis of the Linux Network Stack Architecture

The Complete Journey of Network Data Packets

Hardware NIC → Driver → Kernel Network Stack → Application
    ↓         ↓          ↓           ↓
  DMA      Interrupt Handling    Protocol Stack Processing   User Space

Core Data Structure Analysis

1. Network Device Structure (net_device)

// include/linux/netdevice.h (around line 2000)
struct net_device {
    char name[IFNAMSIZ];                    /* Device name */
    struct netdev_name_node *name_node;     /* Name node */
    struct dev_ifalias __rcu *ifalias;      /* Alias */
    /* Network device statistics */
    struct net_device_stats stats;
    atomic_long_t rx_dropped;               /* Received dropped packets */
    atomic_long_t tx_dropped;               /* Sent dropped packets */
    atomic_long_t rx_nohandler;             /* No handler dropped packets */
    /* Device features and capabilities */
    netdev_features_t features;             /* Device features */
    netdev_features_t hw_features;          /* Hardware features */
    netdev_features_t wanted_features;      /* Desired features */
    netdev_features_t vlan_features;        /* VLAN features */
    netdev_features_t hw_enc_features;      /* Hardware encryption features */
    /* MTU settings */
    unsigned int min_mtu;                   /* Minimum MTU */
    unsigned int max_mtu;                   /* Maximum MTU */
    unsigned short type;                    /* Device type */
    /* Device operation function set */
    const struct net_device_ops *netdev_ops;
    const struct ethtool_ops *ethtool_ops;
    const struct header_ops *header_ops;
    /* Queue management */
    unsigned int real_num_tx_queues;        /* Actual number of transmit queues */
    unsigned int real_num_rx_queues;        /* Actual number of receive queues */
    struct netdev_queue *_tx;               /* Transmit queue array */
    struct netdev_rx_queue *_rx;            /* Receive queue array */
    /* Interrupts and NAPI */
    int irq;                                /* Interrupt number */
    /* Address information */
    unsigned char perm_addr[MAX_ADDR_LEN];  /* Permanent MAC address */
    unsigned char addr_len;                 /* Address length */
    /* Protocol-related pointers */
    struct in_device __rcu *ip_ptr;         /* IPv4 device information */
    struct inet6_dev __rcu *ip6_ptr;        /* IPv6 device information */
    /* Device status */
    unsigned long state;                    /* Device status bits */
    struct list_head dev_list;              /* Device list */
    struct list_head napi_list;             /* NAPI list */
    /* Performance related */
    struct pcpu_sw_netstats __percpu *tstats; /* per-CPU statistics */
    struct pcpu_dstats __percpu *dstats;    /* Destination statistics */
};

2. Socket Buffer (sk_buff)

// include/linux/skbuff.h
struct sk_buff {
    union {
        struct {
            struct sk_buff *next;           /* Next node in the list */
            struct sk_buff *prev;           /* Previous node in the list */
            union {
                struct net_device *dev;     /* Associated network device */
                unsigned long dev_scratch;
            };
        };
        struct rb_node rbnode;              /* Red-black tree node */
        struct list_head list;              /* Generic list node */
    };
    union {
        struct sock *sk;                    /* Associated socket */
        int ip_defrag_offset;
    };
    union {
        ktime_t tstamp;                     /* Timestamp */
        u64 skb_mstamp_ns;
    };
    char cb[48] __aligned(8);               /* Control buffer */
    union {
        struct {
            unsigned long _skb_refdst;      /* Destination cache reference */
            void (*destructor)(struct sk_buff *skb); /* Destructor function */
        };
        struct list_head tcp_tsorted_anchor;
    };
    unsigned int len;                       /* Data length */
    unsigned int data_len;                  /* Data area length */
    __u16 mac_len;                          /* MAC header length */
    __u16 hdr_len;                          /* Writable header length */
    /* Checksum related */
    union {
        __wsum csum;                        /* Checksum */
        struct {
            __u16 csum_start;               /* Checksum start position */
            __u16 csum_offset;              /* Checksum offset */
        };
    };
    __u32 priority;                         /* Packet priority */
    __u8 ignore_df:1;                       /* Ignore DF flag */
    __u8 cloned:1;                          /* Is cloned */
    __u8 ip_summed:2;                       /* IP checksum status */
    __u8 nohdr:1;                           /* No header */
    __u8 nfctinfo:3;                        /* netfilter connection tracking info */
    __u8 pkt_type:3;                        /* Packet type */
    __u8 pfmemalloc:1;                      /* Emergency memory allocation */
    __u8 ignore_df:1;
    __u8 nf_trace:1;                        /* netfilter trace */
    __u8 ip_summed:2;
    __u32 hash;                             /* Packet hash value */
    union {
        u32 mark;                           /* Packet mark */
        u32 reserved_tailroom;
    };
    union {
        __be16 inner_protocol;              /* Inner protocol */
        __u8 inner_ipproto;
    };
    __u16 inner_transport_header;           /* Inner transport header offset */
    __u16 inner_network_header;             /* Inner network header offset */
    __u16 inner_mac_header;                 /* Inner MAC header offset */
    __be16 protocol;                        /* Protocol type */
    __u16 transport_header;                 /* Transport header offset */
    __u16 network_header;                   /* Network header offset */
    __u16 mac_header;                       /* MAC header offset */
    /* Data pointers */
    sk_buff_data_t tail;                    /* Data tail pointer */
    sk_buff_data_t end;                     /* Buffer end pointer */
    unsigned char *head;                    /* Buffer head pointer */
    unsigned char *data;                    /* Data pointer */
    unsigned int truesize;                  /* True size */
    refcount_t users;                       /* Reference count */
};

🚀 In-Depth Analysis of the Network Packet Reception Process

Conversion from Hardware Interrupts to Soft Interrupts

The Complete Process of NIC Receiving Packets:

// net/core/dev.c (around line 5698)
int netif_receive_skb(struct sk_buff *skb)
{
    int ret;
    trace_netif_receive_skb_entry(skb);
    ret = netif_receive_skb_internal(skb);
    trace_netif_receive_skb_exit(ret);
    return ret;
}
// Internal processing function
static int netif_receive_skb_internal(struct sk_buff *skb)
{
    int ret;
    net_timestamp_check(netdev_tstamp_prequeue, skb);
    if (skb_defer_rx_timestamp(skb))
        return NET_RX_SUCCESS;
    rcu_read_lock();
    #ifdef CONFIG_RPS
    if (static_branch_unlikely(&amp;rps_needed)) {
        struct rps_dev_flow voidflow, *rflow = &amp;voidflow;
        int cpu = get_rps_cpu(skb-&gt;dev, skb, &amp;rflow);
        if (cpu &gt;= 0) {
            ret = enqueue_to_backlog(skb, cpu, &amp;rflow-&gt;last_qtail);
            rcu_read_unlock();
            return ret;
        }
    }
    #endif
    ret = __netif_receive_skb(skb);
    rcu_read_unlock();
    return ret;
}

NAPI (New API) Mechanism

The Core Advantages of NAPI:

Reduces interrupt frequency, improving system efficiency
Supports batch processing of packets
Automatically switches to polling mode under high load

// include/linux/netdevice.h
struct napi_struct {
    struct list_head poll_list;         /* Polling list */
    unsigned long state;                /* NAPI state */
    int weight;                         /* Weight */
    int defer_hard_irqs_count;          /* Deferred hard interrupt count */
    unsigned long gro_bitmask;          /* GRO bitmask */
    int (*poll)(struct napi_struct *, int); /* Polling function */
    #ifdef CONFIG_NETPOLL
    int poll_owner;
    #endif
    struct net_device *dev;             /* Associated network device */
    struct gro_list gro_hash[GRO_HASH_BUCKETS]; /* GRO hash table */
    struct sk_buff *skb;                /* Currently processed skb */
    struct list_head rx_list;           /* Receive list */
    int rx_count;                       /* Receive count */
    struct hrtimer timer;               /* High-resolution timer */
    struct list_head dev_list;          /* Device list */
    struct hlist_node napi_hash_node;   /* NAPI hash node */
    unsigned int napi_id;               /* NAPI ID */
};

RPS (Receive Packet Steering) Mechanism

How RPS Works:

// net/core/dev.c
static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
                      struct rps_dev_flow **rflowp)
{
    const struct rps_sock_flow_table *sock_flow_table;
    struct netdev_rx_queue *rxqueue = dev-&gt;_rx;
    struct rps_dev_flow_table *flow_table;
    struct rps_map *map;
    int cpu = -1;
    u32 hash;
    if (!rxqueue-&gt;rps_map &amp;&amp; !rxqueue-&gt;rps_flow_table)
        goto done;
    skb_reset_network_header(skb);
    hash = skb_get_hash(skb);
    if (!hash)
        goto done;
    sock_flow_table = rcu_dereference(rps_sock_flow_table);
    if (flow_table &amp;&amp; sock_flow_table) {
        struct rps_dev_flow *rflow;
        u32 next_cpu;
        u32 ident;
        /* Lookup flow table */
        ident = sock_flow_table-&gt;ents[hash &amp; sock_flow_table-&gt;mask];
        if ((ident ^ hash) &amp; ~rps_cpu_mask)
            goto try_rps;
        next_cpu = ident &amp; rps_cpu_mask;
        /* Get device flow */
        rflow = &amp;flow_table-&gt;flows[hash &amp; flow_table-&gt;mask];
        cpu = next_cpu;
        /* Update flow information */
        if (rflow-&gt;filter == hash)
            cpu = rflow-&gt;cpu = next_cpu;
        else {
            rflow-&gt;filter = hash;
            rflow-&gt;cpu = next_cpu;
        }
        *rflowp = rflow;
        cpu = next_cpu;
        goto done;
    }
try_rps:
    map = rcu_dereference(rxqueue-&gt;rps_map);
    if (map) {
        cpu = map-&gt;cpus[reciprocal_scale(hash, map-&gt;len)];
        if (cpu_online(cpu))
            goto done;
    }
done:
    return cpu;
}

⚡ Network Performance Optimization Techniques

1. Interrupt Optimization

Interrupt Affinity Settings:

#!/bin/bash
# Network interrupt optimization script
# Get NIC information
INTERFACE="eth0"
IRQ_LIST=$(grep $INTERFACE /proc/interrupts | awk -F: '{print $1}')
# Set interrupt affinity
CPU_COUNT=$(nproc)
CPU_INDEX=0
for IRQ in $IRQ_LIST; do
    # Bind interrupt to specific CPU
    echo $((1 &lt;&lt; $CPU_INDEX)) &gt; /proc/irq/$IRQ/smp_affinity
    echo "IRQ $IRQ -&gt; CPU $CPU_INDEX"
    CPU_INDEX=$(((CPU_INDEX + 1) % CPU_COUNT))
done
# Disable irqbalance service
systemctl stop irqbalance
systemctl disable irqbalance

2. RPS/RFS Configuration

Receive Packet Steering Optimization:

#!/bin/bash
# RPS/RFS optimization configuration
INTERFACE="eth0"
CPU_COUNT=$(nproc)
# Set RPS CPU mask (using all CPUs)
RPS_MASK=$((2**CPU_COUNT - 1))
echo $RPS_MASK &gt; /sys/class/net/$INTERFACE/queues/rx-0/rps_cpus
# Set RFS table size
echo 32768 &gt; /proc/sys/net/core/rps_sock_flow_entries
echo 4096 &gt; /sys/class/net/$INTERFACE/queues/rx-0/rps_flow_cnt
# Enable RFS
echo 1 &gt; /proc/sys/net/core/rps_sock_flow_entries

3. Network Buffer Tuning

System-Level Network Parameter Optimization:

#!/bin/bash
# Network buffer optimization
# TCP buffer size
echo 'net.core.rmem_default = 262144' &gt;&gt; /etc/sysctl.conf
echo 'net.core.rmem_max = 134217728' &gt;&gt; /etc/sysctl.conf
echo 'net.core.wmem_default = 262144' &gt;&gt; /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' &gt;&gt; /etc/sysctl.conf
# TCP window scaling
echo 'net.ipv4.tcp_window_scaling = 1' &gt;&gt; /etc/sysctl.conf
# TCP congestion control
echo 'net.ipv4.tcp_congestion_control = bbr' &gt;&gt; /etc/sysctl.conf
# Network device queue length
echo 'net.core.netdev_max_backlog = 5000' &gt;&gt; /etc/sysctl.conf
# Connection tracking table size
echo 'net.netfilter.nf_conntrack_max = 1048576' &gt;&gt; /etc/sysctl.conf
# Apply configuration
sysctl -p

🔧 High-Performance Network Application Optimization

User-Space Network Stack (DPDK)

The Core Advantages of DPDK:

Bypasses the kernel, processing network packets directly in user space
Zero-copy technology reduces memory copy overhead
Polling mode avoids interrupt overhead
Large page memory reduces TLB misses

DPDK Application Example:

// DPDK network application example
#include &lt;rte_eal.h&gt;
#include &lt;rte_ethdev.h&gt;
#include &lt;rte_mbuf.h&gt;
#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32
static struct rte_eth_conf port_conf = {
    .rxmode = {
        .max_rx_pkt_len = RTE_ETHER_MAX_LEN,
    },
};
int main(int argc, char *argv[])
{
    struct rte_mempool *mbuf_pool;
    uint16_t portid = 0;
    uint16_t nb_ports;
    int ret;
    /* Initialize EAL */
    ret = rte_eal_init(argc, argv);
    if (ret &lt; 0)
        rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
    /* Check number of ports */
    nb_ports = rte_eth_dev_count_avail();
    if (nb_ports &lt; 1)
        rte_exit(EXIT_FAILURE, "Error: no ethernet ports detected\n");
    /* Create memory pool */
    mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS,
        MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mbuf_pool == NULL)
        rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
    /* Configure port */
    ret = rte_eth_dev_configure(portid, 1, 1, &amp;port_conf);
    if (ret != 0)
        return ret;
    /* Set receive queue */
    ret = rte_eth_rx_queue_setup(portid, 0, RX_RING_SIZE,
                                rte_eth_dev_socket_id(portid), NULL, mbuf_pool);
    if (ret &lt; 0)
        return ret;
    /* Set transmit queue */
    ret = rte_eth_tx_queue_setup(portid, 0, TX_RING_SIZE,
                                rte_eth_dev_socket_id(portid), NULL);
    if (ret &lt; 0)
        return ret;
    /* Start port */
    ret = rte_eth_dev_start(portid);
    if (ret &lt; 0)
        return ret;
    /* Main loop */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t nb_rx, nb_tx;
        /* Receive packets */
        nb_rx = rte_eth_rx_burst(portid, 0, bufs, BURST_SIZE);
        if (unlikely(nb_rx == 0))
            continue;
        /* Process packets */
        for (int i = 0; i &lt; nb_rx; i++) {
            /* Add packet processing logic here */
            process_packet(bufs[i]);
        }
        /* Transmit packets */
        nb_tx = rte_eth_tx_burst(portid, 0, bufs, nb_rx);
        /* Free unsent packets */
        if (unlikely(nb_tx &lt; nb_rx)) {
            for (int i = nb_tx; i &lt; nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}

XDP (eXpress Data Path) Optimization

XDP Program Example:

// XDP packet filtering program
#include &lt;linux/bpf.h&gt;
#include &lt;linux/if_ether.h&gt;
#include &lt;linux/ip.h&gt;
#include &lt;linux/tcp.h&gt;
#include &lt;bpf/bpf_helpers.h&gt;
SEC("xdp_filter")
int xdp_filter_func(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx-&gt;data_end;
    void *data = (void *)(long)ctx-&gt;data;
    struct ethhdr *eth = data;
    struct iphdr *ip;
    struct tcphdr *tcp;
    /* Check Ethernet header */
    if ((void *)(eth + 1) &gt; data_end)
        return XDP_ABORTED;
    /* Only process IP packets */
    if (eth-&gt;h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;
    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) &gt; data_end)
        return XDP_ABORTED;
    /* Only process TCP packets */
    if (ip-&gt;protocol != IPPROTO_TCP)
        return XDP_PASS;
    tcp = (void *)ip + (ip-&gt;ihl * 4);
    if ((void *)(tcp + 1) &gt; data_end)
        return XDP_ABORTED;
    /* Filter specific ports */
    if (tcp-&gt;dest == __constant_htons(80) || 
        tcp-&gt;dest == __constant_htons(443)) {
        return XDP_PASS;  /* Allow HTTP/HTTPS traffic */
    }
    return XDP_DROP;  /* Drop other traffic */
}
char _license[] SEC("license") = "GPL";

📊 Network Performance Monitoring and Analysis

Key Performance Indicators

Network Performance Monitoring Script:

#!/bin/bash
# Network performance monitoring script
INTERFACE="eth0"
DURATION=60
INTERVAL=1

echo "Starting network performance monitoring ($DURATION seconds)..."
# Create result file
RESULT_FILE="network_performance_$(date +%Y%m%d_%H%M%S).log"
{
    echo "Timestamp,Received Packets,Sent Packets,Received Bytes,Sent Bytes,Dropped Packets,Error Count"
    for ((i=0; i&lt;DURATION; i++)); do
        TIMESTAMP=$(date +%s)
        # Get network statistics
        RX_PACKETS=$(cat /sys/class/net/$INTERFACE/statistics/rx_packets)
        TX_PACKETS=$(cat /sys/class/net/$INTERFACE/statistics/tx_packets)
        RX_BYTES=$(cat /sys/class/net/$INTERFACE/statistics/rx_bytes)
        TX_BYTES=$(cat /sys/class/net/$INTERFACE/statistics/tx_bytes)
        RX_DROPPED=$(cat /sys/class/net/$INTERFACE/statistics/rx_dropped)
        RX_ERRORS=$(cat /sys/class/net/$INTERFACE/statistics/rx_errors)
        echo "$TIMESTAMP,$RX_PACKETS,$TX_PACKETS,$RX_BYTES,$TX_BYTES,$RX_DROPPED,$RX_ERRORS"
        sleep $INTERVAL
    done
} &gt; $RESULT_FILE

echo "Monitoring complete, results saved to: $RESULT_FILE"
# Generate performance report
{
    echo "Network Performance Analysis Report"
    echo "================"
    echo "Monitoring Interface: $INTERFACE"
    echo "Monitoring Duration: $DURATION seconds"
    echo
    # Calculate averages
    tail -n +2 $RESULT_FILE | awk -F',' '
    BEGIN { 
        rx_packets_total = 0; tx_packets_total = 0;
        rx_bytes_total = 0; tx_bytes_total = 0;
        count = 0;
    }
    {
        if (NR == 2) {
            rx_packets_start = $2; tx_packets_start = $3;
            rx_bytes_start = $4; tx_bytes_start = $5;
        }
        rx_packets_end = $2; tx_packets_end = $3;
        rx_bytes_end = $4; tx_bytes_end = $5;
        count++;
    }
    END {
        rx_pps = (rx_packets_end - rx_packets_start) / (count - 1);
        tx_pps = (tx_packets_end - tx_packets_start) / (count - 1);
        rx_bps = (rx_bytes_end - rx_bytes_start) / (count - 1);
        tx_bps = (tx_bytes_end - tx_bytes_start) / (count - 1);
        printf "Average Receive Rate: %.2f packets/second, %.2f MB/second\n", rx_pps, rx_bps/1024/1024;
        printf "Average Send Rate: %.2f packets/second, %.2f MB/second\n", tx_pps, tx_bps/1024/1024;
    }'
} &gt;&gt; $RESULT_FILE

echo "Performance analysis complete!"

🚀 Enterprise-Level Network Optimization Practice

High-Concurrency Web Server Optimization

Nginx + Network Stack Optimization:

# nginx.conf Network optimization configuration
worker_processes auto;
worker_cpu_affinity auto;
events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
    accept_mutex off;
}
http {
    # TCP optimization
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    # Connection optimization
    keepalive_timeout 65;
    keepalive_requests 1000;
    # Buffer optimization
    client_body_buffer_size 128k;
    client_max_body_size 10m;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
    output_buffers 1 32k;
    postpone_output 1460;
    # Compression optimization
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css application/json application/javascript;
}

Database Connection Pool Optimization

MySQL Network Parameter Tuning:

-- MySQL network-related parameter optimization
SET GLOBAL max_connections = 2000;
SET GLOBAL max_connect_errors = 100000;
SET GLOBAL connect_timeout = 10;
SET GLOBAL net_read_timeout = 30;
SET GLOBAL net_write_timeout = 60;
SET GLOBAL net_buffer_length = 32768;
SET GLOBAL max_allowed_packet = 1073741824;
-- TCP-related optimizations
SET GLOBAL skip_name_resolve = 1;
SET GLOBAL back_log = 512;

💡 Summary and Best Practices

The Core Principles of Network Optimization

Reduce Interrupt Overhead: Use NAPI, interrupt affinity, and interrupt coalescing
Increase Parallelism: RPS/RFS, multi-queue NICs, CPU affinity
Reduce Copies: Zero-copy technology, user-space network stacks
Optimize Buffers: Properly set buffer sizes, avoid packet loss
Protocol Optimization: TCP parameter tuning, congestion control algorithm selection

Performance Optimization Checklist

System-Level Optimization:

[ ] Interrupt affinity configuration
[ ] RPS/RFS enablement and configuration
[ ] Network buffer parameter tuning
[ ] TCP congestion control algorithm selection
[ ] NIC multi-queue configuration

Application-Level Optimization:

[ ] Connection pool configuration optimization
[ ] Use of asynchronous I/O models
[ ] Data serialization optimization
[ ] Cache strategy implementation
[ ] Load balancing configuration

Monitoring and Debugging:

[ ] Network performance metrics monitoring
[ ] Packet loss and error analysis
[ ] Latency and throughput testing
[ ] Bottleneck identification and localization
[ ] Capacity planning and forecasting

Network performance optimization is a systematic project that requires full-stack optimization from hardware to applications. By deeply understanding the working principles of the Linux network stack and combining them with actual business scenarios, significant improvements in network application performance and user experience can be achieved.

Follow the “Cloud and Digitalization” public account for more practical experience in Linux network optimization.

This article is based on an analysis of the Linux 5.15.4 kernel source code, providing optimization solutions validated in production environments.