Monitoring and Tuning the Linux Network Stack: Receiving Data – Moving Up the Network Stack with netif_receive_skb
Advancing packets through the network protocol stack using `netif_receive_skb`.
Continuing our previous discussion of `netif_receive_skb`: it is called in several places, the two most common of which (both discussed earlier) are:
- From `napi_skb_finish`, if the packet is not going to be merged into an existing Generic Receive Offload (GRO) flow;
- From `napi_gro_complete`, if the protocol layers indicate that it is time to flush that flow.
Note: `netif_receive_skb` and the functions it subsequently calls run in the context of the softirq processing loop. When using tools like `top`, the time spent in this part is reported as `sitime` or `si`.
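For example, you can watch per-CPU softirq time with mpstat (from the sysstat package, assuming it is installed); the %soft column is time spent in softirq context:
$ mpstat -P ALL 1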
`netif_receive_skb` first checks a sysctl value to determine whether the user has requested that packets be timestamped before or after they are queued to the backlog. If the setting is enabled, the timestamp is taken immediately, before the packet reaches Receive Packet Steering (RPS) and the target CPU's backlog queue. If the setting is disabled, the timestamp is taken only after the packet has been pulled off the queue. When RPS is enabled, deferring the timestamp this way spreads the timestamping load across multiple CPUs, but introduces some latency.
Tuning: Timestamping Received (RX) Packets
You can tune when received packets are timestamped by adjusting a sysctl named `net.core.netdev_tstamp_prequeue`:
Example: disable pre-queue timestamping of received packets (the timestamp is then taken after the packet is pulled off the backlog queue):
$ sudo sysctl -w net.core.netdev_tstamp_prequeue=0
The default value for this parameter is 1. For the exact meaning of this setting, refer to the explanation in the previous section.
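You can check the current value before changing it:
$ sysctl net.core.netdev_tstamp_prequeue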
The netif_receive_skb Function
After handling timestamping, `netif_receive_skb` behaves differently depending on whether Receive Packet Steering (RPS) is enabled. We will start with the simpler case: RPS disabled.
RPS Disabled (Default Setting)
If RPS is not enabled, `__netif_receive_skb` is called; it does some bookkeeping and then calls `__netif_receive_skb_core` to move the data further along toward the protocol stack.
We will look at how `__netif_receive_skb_core` works in detail shortly, but first let's trace the RPS-enabled code path, since it also ends up calling `__netif_receive_skb_core`.
RPS Enabled
If RPS is enabled, then after the timestamping options described above have been handled, `netif_receive_skb` computes which CPU's backlog queue should be used. It does this by calling `get_rps_cpu`. From `net/core/dev.c`:
cpu = get_rps_cpu(skb->dev, skb, &rflow);

if (cpu >= 0) {
        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
        rcu_read_unlock();
        return ret;
}
`get_rps_cpu` takes the Receive Flow Steering (RFS) and accelerated RFS (aRFS) settings described earlier into account, and the data is then queued to the chosen CPU's backlog queue by calling `enqueue_to_backlog`.
The enqueue_to_backlog Function
This function first obtains a pointer to the remote CPU's `softnet_data` structure, which contains a pointer to its `input_pkt_queue`. Next, it checks the length of the remote CPU's `input_pkt_queue`. From `net/core/dev.c`:
qlen = skb_queue_len(&sd->input_pkt_queue);
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
First, the length of the `input_pkt_queue` is compared with `netdev_max_backlog`. If the queue is longer than this value, the data is dropped. Similarly, the flow limit is checked, and if it has been exceeded, the data is also dropped. In both cases, the drop count on the `softnet_data` structure is incremented. Note that this is the `softnet_data` structure of the CPU the data was about to be queued to. Refer to the earlier section on `/proc/net/softnet_stat` to learn how to obtain drop counts for monitoring purposes.
`enqueue_to_backlog` is not called in many places. It is called during RPS-enabled packet processing and also from `netif_rx`. Most drivers should not be using `netif_rx` and should instead be using `netif_receive_skb`. If you are not using RPS and your driver does not use `netif_rx`, increasing the backlog queue length will have no noticeable effect on the system, because it is never used.
Note: check the driver you are using. If it calls `netif_receive_skb` and you are not using RPS, increasing `netdev_max_backlog` will not yield any performance improvement, because no data will ever reach the `input_pkt_queue`.
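One rough way to check which call your driver makes (a sketch, assuming you have the driver source available; the igb path below is only an example) is to count the calls to `netif_rx` versus `netif_receive_skb`/`napi_gro_receive` in its source:
$ grep -c 'netif_rx(' drivers/net/ethernet/intel/igb/*.c
$ grep -c -E 'netif_receive_skb|napi_gro_receive' drivers/net/ethernet/intel/igb/*.c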
Assuming the `input_pkt_queue` is small enough and the flow limit (described in more detail below) has not been reached (or is disabled), the data can be queued. The logic here is a bit tricky, but can be summarized as follows:
- If the queue is empty: check whether NAPI has been started on the remote CPU. If not, check whether an Inter-Processor Interrupt (IPI) is already queued to be sent. If not, queue an IPI and start the NAPI processing loop by calling `____napi_schedule`. Then proceed to enqueue the data.
- If the queue is not empty, or the operations above have completed, enqueue the data.
The code uses a `goto`, which can make it a bit tricky to follow, so read carefully. From `net/core/dev.c`:
if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
        __skb_queue_tail(&sd->input_pkt_queue, skb);
        input_queue_tail_incr_save(sd, qtail);
        rps_unlock(sd);
        local_irq_restore(flags);
        return NET_RX_SUCCESS;
}

/* Schedule NAPI for backlog device
 * We can use non atomic operation since we own the queue lock
 */
if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
        if (!rps_ipi_queued(sd))
                ____napi_schedule(sd, &sd->backlog);
}
goto enqueue;
Flow Limiting
RPS (Receive Packet Steering) distributes packet processing load across multiple CPUs, but a single large flow can monopolize CPU processing time and starve smaller flows. Flow limiting is a feature that can be used to cap the number of packets each flow may queue to the backlog. This helps ensure that smaller flows are still processed even while larger flows are pushing packets.
The `if` statement from `net/core/dev.c` shown above checks the flow limit by calling `skb_flow_limit`:
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
This code checks that there is still room in the queue and that the flow limit has not been reached. By default, flow limiting is disabled. To enable it, you must specify a bitmap (similar to the RPS bitmap).
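You can check whether flow limiting is currently enabled, and the size of its table, by reading the corresponding /proc files (assuming your kernel was built with flow limit support, CONFIG_NET_FLOW_LIMIT):
$ cat /proc/sys/net/core/flow_limit_cpu_bitmap
$ cat /proc/sys/net/core/flow_limit_table_len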
RPS Flow Limiting
RPS distributes kernel receive processing across CPUs without introducing packet reordering. The trade-off of sending all packets from the same flow to the same CPU is an unbalanced CPU load if flows vary in packet rate. In the extreme case, a single flow dominates traffic. Especially on common server workloads with many concurrent connections, such behavior indicates a problem, such as a misconfiguration or a denial-of-service attack with spoofed source addresses.
Flow limiting is an optional RPS feature that prioritizes small flows during CPU contention by dropping packets from large flows slightly ahead of those from small flows. It is active only when an RPS or RFS target CPU approaches saturation. Once a CPU's input packet queue exceeds half of the maximum queue length (as set by the sysctl net.core.netdev_max_backlog), the kernel starts counting packets per flow over the last 256 packets. If a flow exceeds the configured ratio of those packets (half, by default) when a new packet arrives, the new packet is dropped. Packets from other flows are still only dropped once the input packet queue reaches netdev_max_backlog. No packets are dropped while the input packet queue length is below the threshold, so flow limiting does not cut off connections outright: even large flows maintain connectivity.
https://github.com/torvalds/linux/blob/v3.13/Documentation/networking/scaling.txt#L166-L188
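As a concrete worked example, with the default net.core.netdev_max_backlog of 1000 and the default ratio of one half: flow limiting starts acting once a CPU's input packet queue holds more than 500 packets, and from that point a new packet is dropped if its flow already accounts for more than 128 of the last 256 packets counted on that CPU.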
Monitoring: Packet Loss Due to a Full Input Packet Queue or Flow Limiting
Refer to the earlier section on monitoring `/proc/net/softnet_stat`. The `dropped` field is a counter that is incremented each time data is dropped instead of being queued to a CPU's input packet queue.
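As a quick sketch (assuming GNU awk for the hexadecimal conversion), each line of /proc/net/softnet_stat corresponds to one CPU and the second column is that CPU's drop counter; on the kernel version linked above (3.13), the eleventh and final column is the flow limit counter:
$ cat /proc/net/softnet_stat
$ awk '{ printf "cpu%d dropped=%d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat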
Tuning
Tuning: Adjusting netdev_max_backlog to Prevent Packet Loss
Before adjusting this tuning parameter value, please review the notes in the previous section.
If you are using RPS, or if your driver calls `netif_rx`, increasing `netdev_max_backlog` can help prevent packet drops in `enqueue_to_backlog`.
Example: use `sysctl` to increase the backlog queue length to 3000.
$ sudo sysctl -w net.core.netdev_max_backlog=3000
The default value is 1000.
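To make a setting like this survive reboots, a common approach (the file name and path below are only an example; distributions vary) is to put it in a sysctl configuration file and reload it:
$ echo 'net.core.netdev_max_backlog = 3000' | sudo tee /etc/sysctl.d/90-netdev-backlog.conf
$ sudo sysctl -p /etc/sysctl.d/90-netdev-backlog.conf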
Tuning: Adjusting NAPI Weight for Backlog Queue Polling Loop
You can adjust the weight of the NAPI poller for the backlog queue by setting the `net.core.dev_weight` sysctl. This value determines how much of the overall budget the backlog poll loop can consume (see the earlier section on adjusting `net.core.netdev_budget`):
Example: use `sysctl` to increase the NAPI poll weight for backlog processing.
$ sudo sysctl -w net.core.dev_weight=600
The default value is 64.
Remember, the backlog queue processing and the polling functions registered by the device driver run in the context of soft interrupts and are subject to overall budget and time constraints, as mentioned earlier.
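Because these two knobs interact, it can be useful to read them side by side (a read-only check):
$ sysctl net.core.dev_weight net.core.netdev_budget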
Tuning: Enabling Flow Limiting and Adjusting Flow Limit Hash Table Size
Example: set the size of the flow limit table using sysctl.
$ sudo sysctl -w net.core.flow_limit_table_len=8192
The default value is 4096.
This change only affects newly allocated flow hash tables. Therefore, if you want to increase the size of the table, you should do so before enabling flow limiting.
To enable flow limiting, you need to specify a bitmap in /proc/sys/net/core/flow_limit_cpu_bitmap (similar to the RPS bitmap) indicating which CPUs have flow limiting enabled.
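For example (a sketch; the value shown assumes you want flow limiting enabled on CPUs 0-3, i.e. a bitmask of 0xf):
$ echo f | sudo tee /proc/sys/net/core/flow_limit_cpu_bitmap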
Backlog Queue NAPI Poller
Each CPU's backlog queue plugs into NAPI in the same way a device driver does: a poll function is provided that processes packets from the softirq context, and a weight is provided as well, just as with a device driver.
This NAPI structure is set up during initialization of the networking system. From the `net_dev_init` function in `net/core/dev.c`:
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
sd->backlog.gro_list = NULL;
sd->backlog.gro_count = 0;
The backlog queue's NAPI structure differs from the device driver's NAPI structure in that its weight parameter is adjustable, whereas device drivers typically hard-code their NAPI weight to 64. We saw how to adjust this weight with a sysctl in the tuning section above.
Processing the Backlog Queue (process_backlog)
`process_backlog` is a loop that runs until its weight (described in the previous section) has been consumed or there is no more data left in the backlog queue.
Each piece of data in the backlog queue is removed from the queue and passed to `__netif_receive_skb`. The code path once the data reaches `__netif_receive_skb` is the same as described above for the RPS-disabled case: `__netif_receive_skb` does some bookkeeping before calling `__netif_receive_skb_core` to pass the network data to the protocol layers.
`process_backlog` follows the same NAPI contract as device drivers: NAPI is disabled if the full weight is not used. The poller is restarted by the call to `____napi_schedule` from `enqueue_to_backlog`, as described above.
The function returns the amount of work done, which `net_rx_action` (described earlier) subtracts from the budget (adjustable via `net.core.netdev_budget`, as mentioned earlier).
__netif_receive_skb_core Passes Data to Packet Capture Points and the Protocol Layer
`__netif_receive_skb_core` is responsible for delivering data to the protocol stacks. Before doing this, it checks whether any packet capture points are installed that want to see all incoming packets. A typical example is the `AF_PACKET` address family, usually used via the `libpcap` library.
If such a capture point exists, the data is first passed there before being passed to the protocol layer.
Packet Capture Point Delivery
If a packet capture point is installed (usually via `libpcap`), the packet is delivered there by the following code in `net/core/dev.c`:
list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
                if (pt_prev)
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = ptype;
        }
}
If you want to understand the path data takes through pcap, read `net/packet/af_packet.c`.
Note: The capture points for packet capture tools like tcpdump or wireshark are located here.
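For instance, a capture like the following (the interface name is only an example) receives its packets via the AF_PACKET tap described above:
$ sudo tcpdump -i eth0 -c 5 -n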
Protocol Layer Delivery
Once any packet capture points have handled the data, `__netif_receive_skb_core` delivers the data to the protocol layers. It does this by extracting the protocol field from the data and iterating over the list of delivery functions registered for that protocol type.
This can be seen in the `__netif_receive_skb_core` function in `net/core/dev.c`:
type = skb->protocol;
list_for_each_entry_rcu(ptype,
                &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
                if (pt_prev)
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = ptype;
        }
}
The `ptype_base` identifier above is defined in `net/core/dev.c` as a hash table of lists:
struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
Each protocol layer uses a helper function called `ptype_head` to compute the hash table slot for a given entry and adds a filter to the corresponding list:
static inline struct list_head *ptype_head(const struct packet_type *pt)
{
        if (pt->type == htons(ETH_P_ALL))
                return &ptype_all;
        else
                return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}
Filters are added to the list by calling `dev_add_pack`. This is how protocol layers register themselves to receive network data for their protocol type.
Now you know how network data travels from the network interface card to the protocol layer.
Protocol Layer Registration
Now that we understand how data is passed from the network device subsystem to the protocol stack, let’s look at how the protocol layer registers itself.
This blog post will analyze the IP protocol stack as an example, as it is a commonly used protocol relevant to most readers.
IP Protocol Layer
The IP protocol layer inserts itself into the `ptype_base` hash table so that the network device layer described above will deliver data to it.
This happens in the `inet_init` function in `net/ipv4/af_inet.c`:
dev_add_pack(&ip_packet_type);
This line registers the IP packet type structure defined in `net/ipv4/af_inet.c`:
static struct packet_type ip_packet_type __read_mostly = {
        .type = cpu_to_be16(ETH_P_IP),
        .func = ip_rcv,
};
`__netif_receive_skb_core` calls `deliver_skb` (as seen above), and `deliver_skb` calls `func` (in this case, `ip_rcv`).

Source
https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-receiving-data