Monitoring and Tuning the Linux Networking Stack: Receiving Data
In Brief
This blog post explains how computers running the Linux kernel receive packets and how to monitor and tune various components of the networking stack as packets flow from the network to user-space programs.
Update: We have published a corresponding article: “Monitoring and Tuning the Linux Networking Stack: Sending Data”1.
Update: You can check out the “Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data”2, which adds some charts to the information presented below.
1 https://packagecloud.io/blog/monitoring-tuning-linux-networking-stack-sending-data/
2 https://packagecloud.io/blog/illustrated-guide-monitoring-tuning-linux-networking-stack-receiving-data/
It is impossible to tune or monitor the Linux networking stack without reading the kernel source code and gaining a deep understanding of what is happening.
I hope this blog post serves as a reference for those in need.
Special Thanks
Special thanks to everyone at Private Internet Access, who hired us to research this information in conjunction with other network studies and generously allowed us to expand and publish this information based on that research.
The information presented here is based on work done for Private Internet Access, which was originally published in a series of five articles, with the first article available here3.
3 https://www.privateinternetaccess.com/blog/2016/01/linux-networking-stack-from-the-ground-up-part-1/
General Advice on Monitoring and Tuning the Linux Networking Stack
Update: We have published a corresponding article: “Monitoring and Tuning the Linux Networking Stack: Sending Data”.
Update: You can check out the “Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data”, which adds some charts to the information presented below.
The networking stack is very complex, and there is no one-size-fits-all solution. If the performance and health of the network are critical to you or your business, you have no choice but to invest a significant amount of time, effort, and money to understand how the various parts of the system interact.
Ideally, you should consider measuring packet loss at every layer of the networking stack. This way, you can identify and narrow down the components that need tuning.
This is where I think many operators go astray: they assume they can copy and paste a set of sysctl settings or /proc values used elsewhere. This may work in some cases, but the entire system is so nuanced and intertwined that if you want to do meaningful monitoring or tuning, you must strive to understand how the system functions at a deeper level. Otherwise, you can simply use the default settings, which should be good enough until further optimization (and the requisite investment to deduce the appropriate settings) is needed.
Many of the example settings provided in this blog post are used solely for illustrative purposes and are neither a recommendation for nor against a particular configuration or default. Before adjusting any setting, you should develop a frame of reference around what you need to monitor in order to notice a meaningful change.
Adjusting networking settings while connected to the machine over the network is risky; you can easily lock yourself out or take down your networking entirely. Do not adjust these settings on production machines; if possible, make adjustments on new machines before bringing them into production.
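To make the layer-by-layer measurement idea above a bit more concrete, here is one small example: the kernel exposes per-CPU software-interrupt receive counters, including a drop counter, in /proc/net/softnet_stat. The sketch below simply reads and prints the first two hexadecimal columns of that file. It is an illustrative example added here, not part of the original tuning advice; the field layout assumed is the one used by kernel 3.13.

/* softnet_stat_sketch.c: print per-CPU "processed" and "dropped" counters
 * from /proc/net/softnet_stat. Each row is one CPU; values are hexadecimal.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/net/softnet_stat", "r");
        char line[512];
        int cpu = 0;

        if (!f) {
                perror("fopen");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                unsigned int processed, dropped;

                if (sscanf(line, "%x %x", &processed, &dropped) == 2)
                        printf("cpu %d: processed=%u dropped=%u\n",
                               cpu, processed, dropped);
                cpu++;
        }

        fclose(f);
        return 0;
}

A steadily increasing dropped column is one example of a measurement that can narrow a problem down to a specific layer of the stack.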
Overview
For reference, you may want to have a device data sheet handy. This article will examine the Intel I350 Ethernet controller, controlled by the igb driver. The data sheet can be found here4 (warning: it is a large PDF file).
4 http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/ethernet-controller-i350-datasheet.pdf
The high-level path a packet takes from its arrival at the network card (NIC) to the socket receive buffer is as follows:
- The driver is loaded and initialized.
- A packet arrives at the network card (NIC) from the network.
- The packet is copied into a ring buffer in kernel memory via direct memory access (DMA).
- A hardware interrupt is generated to let the system know a packet is in memory.
- The driver calls into the New API (NAPI) to start a poll loop, if one is not already running.
- ksoftirqd processes run on each CPU on the system; they are registered at boot time. The ksoftirqd processes pull packets off the ring buffer by calling the NAPI poll function that the device driver registered during initialization.
- Memory regions in the ring buffer that have had network data written to them are unmapped.
- The data that was DMA'd into memory is passed up to the network layer as a "socket buffer" (skb) for further processing.
- Incoming network data frames are distributed among multiple CPUs if packet steering is enabled or if the NIC has multiple receive queues.
- Network data frames are handed to each protocol layer.
- The protocol layers process the data.
- The protocol layers add the data to the receive buffers attached to sockets.
The following sections will explore the entire process in detail.
The protocol layers to be discussed next are the IP and UDP protocol layers. Much of the information presented in this article can also serve as a reference for other protocol layers.
Detailed Analysis
Update: We have published a sister article: “Monitoring and Tuning the Linux Networking Stack: Sending Data”.
Update: Check out the “Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data”, which adds some charts to the information presented below.
This blog post will analyze the Linux kernel version 3.13.0, providing links to code on GitHub and code snippets.
To fully understand how the Linux kernel receives packets, a deep dive into the network device drivers is necessary. Studying carefully how the drivers work first will make the rest of the networking stack much clearer later.
This blog post will focus on the igb network driver. This driver is used for a common server NIC – the Intel Ethernet Controller I350. So, let’s start by understanding how the igb network driver works.
Network Device Driver
Initialization
The driver registers an initialization function, which the kernel calls when the driver is loaded. This function is registered using the module_init macro.
The igb initialization function (igb_init_module) and its registration with module_init can be found in the drivers/net/ethernet/intel/igb/igb_main.c file.
Both are quite straightforward:
/**
 * igb_init_module - Driver Registration Routine
 *
 * igb_init_module is the first routine called when the driver is
 * loaded. All it does is register with the PCI subsystem.
 **/
static int __init igb_init_module(void)
{
        int ret;

        pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
        pr_info("%s\n", igb_copyright);

        /* ... */

        ret = pci_register_driver(&igb_driver);
        return ret;
}

module_init(igb_init_module);
As we will see next, most of the work of initializing the device happens in the call to pci_register_driver.
PCI Initialization
The Intel I350 NIC is a PCI Express device.
PCI devices identify themselves through a series of registers in the PCI configuration space.
When the device driver is compiled, a macro called MODULE_DEVICE_TABLE (from include/linux/module.h) is used to export a table of PCI device IDs identifying the devices that this driver can control. Later, we will see that this table is also registered as part of a structure.
The kernel uses this table to determine which device driver should be loaded to control the device.
This is how the operating system determines which devices are connected to the system and which driver should be used to communicate with them.
This table and the PCI device IDs for the igb driver can be found in the drivers/net/ethernet/intel/igb/igb_main.c and drivers/net/ethernet/intel/igb/e1000_hw.h files, respectively:
static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = {
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_1GBPS) },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_SGMII) },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 },
        { PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 },
        /* ... */
};

MODULE_DEVICE_TABLE(pci, igb_pci_tbl);
As mentioned in the previous section, pci_register_driver is called in the driver's initialization function.
This function registers a structure of pointers. Most of them are function pointers, but the PCI device ID table is also registered. The kernel uses these functions, registered by the driver, to start the PCI device.
From the drivers/net/ethernet/intel/igb/igb_main.c file:
static struct pci_driver igb_driver = {
        .name     = igb_driver_name,
        .id_table = igb_pci_tbl,
        .probe    = igb_probe,
        .remove   = igb_remove,
        /* ... */
};
PCI Probe
Once the device is identified by its PCI ID, the kernel can choose the appropriate driver to control that device. Each PCI driver registers a probe function in the kernel’s PCI system. For devices that have not yet been claimed by any device driver, the kernel will call this function. Once a device is claimed, no further inquiries will be made to other drivers about that device. Most drivers have a lot of code to prepare the device for use. What needs to be done varies by driver.
Some typical operations include:
- Enabling the PCI device.
- Requesting memory ranges and I/O ports.
- Setting the DMA mask.
- Registering the ethtool functions the driver supports (described in more detail later).
- Setting up any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
- Other device-specific operations, such as workarounds for hardware quirks or other hardware-specific features.
- Creating, initializing, and registering a struct net_device_ops structure. This structure contains function pointers for various operations such as opening the device, sending data over the network, setting the MAC address, and more.
- Creating, initializing, and registering a high-level struct net_device structure representing the network device.
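Tying those steps together, here is a condensed, hypothetical probe routine. It is only a sketch: the example_* names are invented for illustration and the error handling is abbreviated, while the kernel helpers it calls (pci_enable_device_mem, dma_set_mask_and_coherent, alloc_etherdev, register_netdev) are real APIs a probe routine would typically use.

/* Hypothetical, condensed probe routine illustrating the typical steps above.
 * The example_* names are invented for illustration; the kernel helpers
 * are the real APIs a probe routine would typically call.
 */
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/etherdevice.h>

struct example_priv {
        struct pci_dev *pdev;   /* driver-private state would live here */
};

/* function pointers would be filled in as shown in the sections below */
static const struct net_device_ops example_netdev_ops;

static void example_set_ethtool_ops(struct net_device *netdev) { }

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
        struct net_device *netdev;
        int err;

        err = pci_enable_device_mem(pdev);              /* enable the PCI device */
        if (err)
                return err;

        /* set the DMA mask: the device can address 64 bits of memory */
        err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
        if (err)
                goto err_disable;

        netdev = alloc_etherdev(sizeof(struct example_priv)); /* struct net_device */
        if (!netdev) {
                err = -ENOMEM;
                goto err_disable;
        }

        netdev->netdev_ops = &example_netdev_ops;       /* register net_device_ops */
        example_set_ethtool_ops(netdev);                /* register ethtool operations */

        err = register_netdev(netdev);                  /* make the device usable */
        if (err)
                goto err_free;

        return 0;

err_free:
        free_netdev(netdev);
err_disable:
        pci_disable_device(pdev);
        return err;
}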
Let’s take a quick look at some of these operations in the igb driver’s igb_probe function.
Diving into PCI Initialization
The following code from the igb_probe function performs some basic PCI configuration. It comes from the drivers/net/ethernet/intel/igb/igb_main.c file:
err = pci_enable_device_mem(pdev);

/* ... */

err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));

/* ... */

err = pci_request_selected_regions(pdev, pci_select_bars(pdev,
                                   IORESOURCE_MEM),
                                   igb_driver_name);

pci_enable_pcie_error_reporting(pdev);

pci_set_master(pdev);
pci_save_state(pdev);
First, the device is initialized with pci_enable_device_mem. This awakens the device if it is in a suspended state, enables memory resources, and so on.
Next, the DMA mask is set. This device can read and write to 64-bit memory addresses, so dma_set_mask_and_coherent is called with DMA_BIT_MASK(64).
Memory regions are reserved with a call to pci_request_selected_regions, PCI Express Advanced Error Reporting is enabled (provided the PCI AER driver is loaded), DMA is enabled with a call to pci_set_master, and the PCI configuration space is saved with a call to pci_save_state.
Wow, that’s quite a lot.
More Information on Linux PCI Drivers
A detailed explanation of how PCI devices work is beyond the scope of this article, but the following resources are excellent, including this great talk5, this wiki page6, and this text file in the Linux kernel7.
5 https://bootlin.com/doc/pci-drivers.pdf
6 http://wiki.osdev.org/PCI
7 https://github.com/torvalds/linux/blob/v3.13/Documentation/PCI/pci.txt
Network Device Initialization
The igb_probe function performs some important network device initialization. In addition to the PCI-specific work, it also performs more general networking and network device work:
- Registering a struct net_device_ops structure.
- Registering ethtool operations.
- Obtaining the default MAC address from the NIC.
- Setting net_device feature flags.
- And many other operations.
Let’s walk through some of these operations; they will be interesting later.
Network Device Operations Structure (struct net_device_ops)
The struct net_device_ops structure contains function pointers to many important operations that the network subsystem needs to control the device. We will refer to this structure many times throughout the rest of this article.
In the igb_probe function, this net_device_ops structure is attached to a struct net_device. (The code is from drivers/net/ethernet/intel/igb/igb_main.c.)
static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
        /* ... */

        netdev->netdev_ops = &igb_netdev_ops;
The functions that this net_device_ops structure points to are set in the same file. The file path is: drivers/net/ethernet/intel/igb/igb_main.c.
static const struct net_device_ops igb_netdev_ops = {
        .ndo_open               = igb_open,
        .ndo_stop               = igb_close,
        .ndo_start_xmit         = igb_xmit_frame,
        .ndo_get_stats64        = igb_get_stats64,
        .ndo_set_rx_mode        = igb_set_rx_mode,
        .ndo_set_mac_address    = igb_set_mac,
        .ndo_change_mtu         = igb_change_mtu,
        .ndo_do_ioctl           = igb_ioctl,
        /* ... */
As you can see, this structure contains several interesting fields, such as ndo_open, ndo_stop, ndo_start_xmit, and ndo_get_stats64, which hold the addresses of the functions implemented by the igb driver.
We will explore some of these fields in more detail later.
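To see how these pointers are used, consider what happens when an interface is brought up (for example with ip link set eth0 up): the networking core looks up the device’s net_device_ops and calls the registered ndo_open function. The snippet below is a paraphrased sketch of that pattern (modeled on __dev_open in net/core/dev.c), not a verbatim kernel excerpt:

/* Paraphrased sketch of how the networking core invokes a driver's
 * registered ndo_open callback; not a verbatim excerpt of net/core/dev.c.
 */
#include <linux/netdevice.h>

static int example_dev_open(struct net_device *dev)
{
        const struct net_device_ops *ops = dev->netdev_ops;
        int ret = 0;

        if (ops->ndo_open)
                ret = ops->ndo_open(dev);   /* for igb, this calls igb_open */

        return ret;
}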
Ethtool Registration
ethtool is a command-line program you can use to get and set various driver and hardware options. On Ubuntu systems, you can install it by running apt-get install ethtool.
Common uses of ethtool include collecting detailed statistics from network devices. Other noteworthy ethtool settings will be introduced later.
The ethtool program talks to device drivers using the ioctl system call. Device drivers register a series of functions that run when ethtool operations are invoked, and the kernel acts as the intermediary.
When an ioctl call is made from ethtool, the kernel finds the ethtool structure registered by the appropriate driver and executes the registered function. The driver’s ethtool function implementations can do anything from changing a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.
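As a small illustration of that ioctl path, the user-space sketch below queries the driver name and version much like ethtool -i does, using the SIOCETHTOOL ioctl with the ETHTOOL_GDRVINFO command (which the kernel routes to the driver’s registered get_drvinfo function). This example is added for illustration; the interface name eth0 is just a placeholder.

/* Query driver information over the same ioctl path ethtool uses;
 * roughly equivalent to "ethtool -i eth0". "eth0" is a placeholder.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
        struct ethtool_drvinfo drvinfo;
        struct ifreq ifr;
        int fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

        memset(&drvinfo, 0, sizeof(drvinfo));
        drvinfo.cmd = ETHTOOL_GDRVINFO;        /* "get driver info" command */
        ifr.ifr_data = (char *)&drvinfo;

        /* the kernel routes this to the driver's registered ethtool ops */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                perror("SIOCETHTOOL");
                close(fd);
                return 1;
        }

        printf("driver: %s\nversion: %s\nbus-info: %s\n",
               drvinfo.driver, drvinfo.version, drvinfo.bus_info);

        close(fd);
        return 0;
}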
The igb driver registers its ethtool operations in its igb_probe function by calling igb_set_ethtool_ops:
static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
        /* ... */

        igb_set_ethtool_ops(netdev);
All of the igb driver’s ethtool code, along with the igb_set_ethtool_ops function, can be found in the drivers/net/ethernet/intel/igb/igb_ethtool.c file.
File path: drivers/net/ethernet/intel/igb/igb_ethtool.c
void igb_set_ethtool_ops(struct net_device *netdev)
{
        SET_ETHTOOL_OPS(netdev, &igb_ethtool_ops);
}
Above that, you can find the igb_ethtool_ops structure, where the ethtool functions supported by the igb driver are set in the appropriate fields.
File path: drivers/net/ethernet/intel/igb/igb_ethtool.c
static const struct ethtool_ops igb_ethtool_ops = {
        .get_settings   = igb_get_settings,
        .set_settings   = igb_set_settings,
        .get_drvinfo    = igb_get_drvinfo,
        .get_regs_len   = igb_get_regs_len,
        .get_regs       = igb_get_regs,
        /* ... */
It is up to each driver to decide which ethtool functions are relevant and which should be implemented. Unfortunately, not all drivers implement all ethtool functions.
One interesting ethtool function is get_ethtool_stats, which (if implemented by the driver) produces detailed statistics counters that are tracked either in software in the driver or by the device itself.
The monitoring section below will demonstrate how to use ethtool to obtain these detailed statistics.
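To give a sense of what implementing get_ethtool_stats involves on the driver side, here is a minimal, hypothetical sketch. The example_* names and the two counters are invented for illustration (real drivers such as igb export far more); get_sset_count, get_strings, and get_ethtool_stats are the actual ethtool_ops hooks involved.

/* Hypothetical driver-side sketch of the hooks behind "ethtool -S".
 * The example_* names and the two counters are invented for illustration.
 */
#include <linux/netdevice.h>
#include <linux/ethtool.h>
#include <linux/string.h>

struct example_priv {
        u64 rx_packets;                 /* hypothetical software counters */
        u64 rx_dropped;
};

static const char example_stat_strings[][ETH_GSTRING_LEN] = {
        "rx_packets",
        "rx_dropped",
};

static int example_get_sset_count(struct net_device *netdev, int sset)
{
        /* tell the ethtool core how many statistics strings/values exist */
        return (sset == ETH_SS_STATS) ? ARRAY_SIZE(example_stat_strings)
                                      : -EOPNOTSUPP;
}

static void example_get_strings(struct net_device *netdev, u32 sset, u8 *data)
{
        if (sset == ETH_SS_STATS)
                memcpy(data, example_stat_strings, sizeof(example_stat_strings));
}

static void example_get_ethtool_stats(struct net_device *netdev,
                                      struct ethtool_stats *stats, u64 *data)
{
        /* copy counters maintained in the driver's private data */
        struct example_priv *priv = netdev_priv(netdev);

        data[0] = priv->rx_packets;
        data[1] = priv->rx_dropped;
}

static const struct ethtool_ops example_ethtool_ops = {
        .get_sset_count    = example_get_sset_count,
        .get_strings       = example_get_strings,
        .get_ethtool_stats = example_get_ethtool_stats,
};

Running ethtool -S <device name> reads these strings and values back out from user space via the ETHTOOL_GSTRINGS and ETHTOOL_GSTATS ioctl commands.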
Interrupt Requests (IRQ)
How does the NIC inform the rest of the system that data is ready for processing when data frames are written to RAM via direct memory access (DMA)?
Traditionally, a NIC generates an interrupt request (IRQ) to indicate that data has arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs; these will be touched on shortly. Generating an IRQ whenever data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive, a large number of IRQs will also be generated. The more IRQs that are generated, the less CPU time is available for user processes and other higher-level tasks.
The new API (NAPI) was introduced as a mechanism to reduce the number of interrupt requests generated by network devices when packets arrive. While NAPI can reduce the number of interrupt requests, it cannot eliminate them entirely.
We will explore the reasons for this in the following sections.
New API (NAPI)
NAPI differs from traditional data collection methods in several important ways. NAPI allows device drivers to register a polling function, which the NAPI subsystem will call to collect data frames.
The expected usage of NAPI in network device drivers is as follows:
- NAPI is enabled by the driver, but it starts in the off state.
- A packet arrives and is DMA’d to memory by the NIC.
- The NIC generates an IRQ, which triggers the IRQ handler in the driver.
- The driver wakes up the NAPI subsystem with a softirq (more on this later). This begins harvesting packets by calling the driver’s registered poll function in a separate thread of execution.
- The driver should disable further IRQs from the NIC. This is done so that the NAPI subsystem can process packets without interruption from the device.
- Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
- The process starts back at step 2.
This method of collecting data frames reduces overhead compared to traditional methods, as multiple data frames can be processed at once without having to handle an IRQ for each individual data frame.
The device driver implements a poll function and registers it with NAPI by calling netif_napi_add. When registering a NAPI poll function with netif_napi_add, the driver also specifies a weight. Most drivers hard-code a value of 64. This value and its meaning will be described in more detail later.
Typically, drivers register their NAPI poll functions during driver initialization.
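Before looking at how igb wires this up, here is a minimal, hypothetical sketch of the pattern just described: an IRQ handler that schedules NAPI, and a poll function that harvests packets and re-enables interrupts when it runs out of work. The example_* names and the helpers example_clean_rx_irq and example_enable_irqs are invented; napi_schedule, napi_complete, and netif_napi_add are the real kernel APIs (as of kernel 3.13).

/* Hypothetical sketch of the NAPI pattern described above. The example_*
 * names and helpers are invented; the NAPI calls are real kernel APIs.
 */
#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct example_q_vector {
        struct napi_struct napi;
        /* RX ring pointers, counters, etc. would live here */
};

/* implemented elsewhere in the hypothetical driver */
static int example_clean_rx_irq(struct example_q_vector *q_vector, int budget);
static void example_enable_irqs(struct example_q_vector *q_vector);

/* IRQ handler: the NIC raised an interrupt because data was DMA'd to RAM */
static irqreturn_t example_intr(int irq, void *data)
{
        struct example_q_vector *q_vector = data;

        /* hand the work off to NAPI; further IRQs from this queue stay
         * masked while polling is in progress
         */
        napi_schedule(&q_vector->napi);
        return IRQ_HANDLED;
}

/* poll function: called by the NAPI subsystem from softirq context */
static int example_poll(struct napi_struct *napi, int budget)
{
        struct example_q_vector *q_vector =
                container_of(napi, struct example_q_vector, napi);
        int work_done;

        /* harvest up to "budget" packets from the RX ring */
        work_done = example_clean_rx_irq(q_vector, budget);

        if (work_done < budget) {
                /* no more work: leave polling mode and re-enable IRQs */
                napi_complete(napi);
                example_enable_irqs(q_vector);
        }

        return work_done;
}

/* at initialization time, the driver registers the poll function with a
 * weight of 64, for example:
 *
 *   netif_napi_add(netdev, &q_vector->napi, example_poll, 64);
 */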
NAPI Initialization in the igb Driver
The igb driver accomplishes this through a long chain of calls:
- igb_probe calls igb_sw_init.
- igb_sw_init calls igb_init_interrupt_scheme.
- igb_init_interrupt_scheme calls igb_alloc_q_vectors.
- igb_alloc_q_vectors calls igb_alloc_q_vector.
- igb_alloc_q_vector calls netif_napi_add.
This call process triggers some high-level operations:
- If MSI-X is supported by the device, it will be enabled with a call to pci_enable_msix.
- Various settings are computed and initialized, most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.
- igb_alloc_q_vector is called once for every transmit and receive queue that will be created.
- Each call to igb_alloc_q_vector calls netif_napi_add to register a poll function for that queue, along with an instance of struct napi_struct that will be passed to the poll function when it is called to harvest packets.
Let’s take a look at igb_alloc_q_vector to understand how the poll callback and its private data are registered.
File path: drivers/net/ethernet/intel/igb/igb_main.c
static int igb_alloc_q_vector(struct igb_adapter *adapter,
                              int v_count, int v_idx,
                              int txr_count, int txr_idx,
                              int rxr_count, int rxr_idx)
{
        /* ... */

        /* allocate q_vector and rings */
        q_vector = kzalloc(size, GFP_KERNEL);
        if (!q_vector)
                return -ENOMEM;

        /* initialize NAPI */
        netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

        /* ... */
}
The code above allocates memory for a receive queue and registers the igb_poll function with the NAPI subsystem. It provides a reference to the struct napi_struct associated with this newly created RX (receive) queue (the &q_vector->napi above).
The NAPI subsystem passes this reference to igb_poll when it calls the function to harvest packets from this RX queue.
This will be very important when we later study the flow of data from the driver up through the networking protocol stack.
Source
https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-receiving-data/