In-Depth Analysis of Linux NUMA Systems: From Principles to Practice

1 Overview of NUMA Systems

1.1 What is NUMA?

Non-Uniform Memory Access (NUMA) is a memory architecture designed for multiprocessor systems. In traditional Symmetric Multiprocessing (SMP) systems, all processors share a single centralized memory controller, and all memory accesses have the same latency, a structure known as Uniform Memory Access (UMA). However, as the number of processors increases, the memory controller and system bus become performance bottlenecks, limiting the scalability of the system.

NUMA architecture addresses this scalability issue by dividing processors and memory into multiple nodes. In a NUMA system, each node contains one or more processors and locally installed memory, with nodes connected via high-speed interconnects (such as crossbar switches or point-to-point links). The speed at which processors access local node memory is significantly faster than accessing memory from remote nodes, resulting in the “non-uniform” memory access characteristic.

From the perspective of Linux, the NUMA platforms of interest are primarily the so-called Cache-Coherent NUMA (ccNUMA) systems. In ccNUMA systems, all memory is visible and can be accessed from any CPU connected to any unit, with cache coherence handled in hardware by processor caches and/or system interconnects.

1.2 Metaphor for NUMA Hardware Architecture

To better understand the NUMA architecture, we can use a vivid metaphor: comparing a NUMA system to a large library system:

• Single Library (UMA System): All readers (processors) must access the same central library (memory), regardless of where the readers are located. As the number of readers increases, the roads to the library become congested, increasing access latency.
• Library Network (NUMA System): Each city area (node) has its own library (local memory), and readers (processors) first access the local library. If the required book is not in the local library, it can be obtained from other libraries through inter-library loan services (interconnect), but this takes additional time.

In this metaphor, local access corresponds to readers accessing the library in their area, which is fast; while remote access corresponds to borrowing books from other libraries through inter-library loans, which is slow and increases the burden on the system. An excellent library management system (NUMA optimization) will try to ensure that readers find books in their local library as much as possible, reducing the frequency of inter-library loans.

1.3 Linux Software Abstraction for NUMA

The Linux kernel divides the system’s hardware resources into multiple software abstractions called “nodes“, which map to the physical units of the hardware platform while abstracting some architectural details. Like physical units, software nodes may contain zero or more CPUs, memory, and/or IO buses.

For each memory node, Linux builds an independent memory management subsystem, which has its own free page list, used page list, usage statistics, and locks to mediate access. This design allows the kernel to manage memory resources more finely, optimizing local access and reducing remote access.

Linux supports the simulation of NUMA nodes, allowing a NUMA environment to be simulated even on non-NUMA hardware. For NUMA simulation, Linux will partition existing nodes or the system memory of non-NUMA platforms into multiple nodes. This is very useful for testing NUMA kernel and application functionalities and can also be used as a memory resource management mechanism.

2 Linux NUMA Core Architecture

2.1 Node Management and Data Structures

In the Linux kernel, each NUMA node is represented by a pglist_data structure (defined as pg_data_t via typedef). This structure is central to NUMA memory management and contains all relevant information for managing the node’s memory. The same structure is used even in UMA architectures, referred to as contig_page_data.

Below is a key definition of the pglist_data structure (a simplified version based on the kernel source code):

typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];          // Array of memory zones in the node
    struct zonelist node_zonelists[MAX_ZONELISTS]; // List of zones used for allocation
    int nr_zones;                                  // Number of zones in the node
    struct page *node_mem_map;                     // Array of page descriptors in the node
    unsigned long node_start_pfn;                  // Starting page frame number of the node
    unsigned long node_present_pages;              // Total number of physical pages in the node
    unsigned long node_spanned_pages;              // Total number of pages spanned by the node (including holes)
    int node_id;                                   // Node ID
    struct pglist_data *pgdat_next;                // Next node
    wait_queue_head_t kswapd_wait;                 // kswapd wait queue
    struct task_struct *kswapd;                    // Memory reclaim thread
    int kswapd_order;                              // kswapd reclaim order
    enum zone_type kswapd_highest_zoneidx;         // Highest zone index reclaimed by kswapd
} pg_data_t;

Each node is divided into multiple zones, which represent memory ranges with specific characteristics. The main zone types defined in Linux include:

• ZONE_DMA: Memory area suitable for DMA operations, used by peripherals that cannot access all addressable memory.
• ZONE_DMA32: Similar to ZONE_DMA, but suitable for devices that can only access 32-bit address space.
• ZONE_NORMAL: Normal memory area, accessible by the kernel at any time.
• ZONE_HIGHMEM: High memory, referring to those parts of physical memory that are not permanently mapped in the kernel page table (needed only on 32-bit architectures).
• ZONE_MOVABLE: Movable memory area, where contents can be moved between different physical pages, used for memory hotplugging and fragmentation control.
• ZONE_DEVICE: Memory for persistent memory (PMEM) and devices like GPUs.

The following diagram illustrates the relationship between nodes, zones, and pages:

In-Depth Analysis of Linux NUMA Systems: From Principles to Practice

2.2 Memory Zones and Zonelist

When the kernel needs to allocate memory, it must determine from which zone to allocate. To efficiently manage memory access across different zones, Linux establishes a zonelist for each node, which specifies the zones/nodes to access when the selected zone/node cannot satisfy the allocation request.

The sorting strategy of the zonelist is an important design decision. By default, Linux uses a Node-ordered zonelist, meaning that before using remote nodes sorted by NUMA distance, the kernel will attempt to fall back to other zones in the same node. This strategy balances local allocation needs with the scarcity of special zones (like DMA).

Below is a simplified representation of the zonelist construction process:

// Simplified zonelist construction process
struct zonelist *build_zonelists(pg_data_t *pgdat) {
    struct zonelist *zonelist;
    int local_node = pgdat->node_id;
    
    // First, try to allocate from the local node
    for (enum zone_type zone = ZONE_NORMAL; zone >= ZONE_DMA; zone--) {
        struct zone *z = &pgdat->node_zones[zone];
        if (zone_managed(z)) {
            zonelist->_zonerefs[zonelist->_nr_zonerefs++] = {
                .zone = z,
                .zone_idx = zone
            };
        }
    }
    
    // Then try remote nodes in order of distance
    for (int distance = 1; distance < MAX_DISTANCE; distance++) {
        for (each node at this distance) {
            for (enum zone_type zone = ZONE_NORMAL; zone >= ZONE_DMA; zone--) {
                // Add remote zones to zonelist
            }
        }
    }
}

2.3 Page Allocation Mechanism

Page allocation in Linux NUMA systems follows the principle of local allocation priority. By default, Linux attempts to satisfy memory allocation requests from the node where the requesting CPU is allocated. Specifically, the kernel will try to allocate from the first node in the appropriate zone list of the requesting source node.

This local allocation strategy helps maintain the “localization” of subsequent accesses to the allocated memory, reducing performance losses from remote access. However, the effectiveness of this optimization depends on whether the task continues to run on the initial node. If the task migrates between nodes, it may move away from its initial node and kernel data structures.

When local node memory is insufficient, the kernel will look for other nodes in the order defined by the zonelist. This process is transparent to applications but may lead to performance degradation. The core process of page allocation is as follows:

1. Local Node Allocation: Attempt to allocate memory from the appropriate zone of the current CPU’s node.
2. Fallback Mechanism: If the local node cannot satisfy the request, try other nodes in the order of the zonelist.
3. Reclaim Mechanism: If all nodes are out of memory, trigger the memory reclaim mechanism (kswapd) to attempt to free memory.
4. OOM Handling: If reclaim still cannot satisfy the demand, trigger the Out-of-Memory killer to terminate processes to free memory.

3 NUMA Scheduling Mechanism

3.1 Scheduling Domains and Scheduling Groups

The Linux scheduler understands the NUMA topology of the system through the concepts of scheduling domains (sched_domain) and scheduling groups (sched_group). Scheduling domains form a hierarchy, from the physical CPU core level to the NUMA node level, and then to higher-level domains (such as multi-socket systems).

Each scheduling domain contains one or more scheduling groups, which represent the processor resources available for load balancing within that domain. For NUMA systems, NUMA domains are particularly important, as they represent all CPUs within a single NUMA node.

Below is a simplified representation of the key data structures of scheduling domains:

struct sched_domain {
    struct sched_domain *parent;          // Parent domain
    struct sched_domain *child;           // Child domain
    struct sched_group *groups;           // Scheduling groups in this domain
    unsigned long span[];                 // CPU bitmap covered by this domain
    int flags;                           // Domain attribute flags
    // ... other fields
};

struct sched_group {
    struct sched_group *next;             // Next group in the list
    unsigned long span[];                 // CPU bitmap covered by this group
    int group_weight;                     // Group weight
    // ... other fields
};

The following diagram illustrates a typical scheduling domain hierarchy in a NUMA system:

3.2 NUMA Load Balancing

The Linux scheduler ensures a reasonable distribution of tasks across the system through a load balancing mechanism. For NUMA systems, load balancing strategies are particularly important, as improper task migration can lead to a significant amount of remote memory access.

The main features of NUMA load balancing include:

• Hierarchical Balancing: Load balancing occurs at various levels of the scheduling domain hierarchy, from SMT domains to NUMA domains.
• NUMA Affinity: The scheduler tries to keep tasks on their initial NUMA node to maintain memory access locality.
• Periodic Balancing: The system periodically checks CPU load conditions and migrates tasks when imbalances are detected.

For NUMA domains, the load balancing strategy is more conservative, as cross-node task migration is costly. The scheduler weighs the pros and cons of computational load balancing and memory access locality, prioritizing the latter.

The core algorithm process for NUMA load balancing is as follows:

// Simplified NUMA balancing algorithm
void numa_balancing(void) {
    for_each_node(node) {
        // Check for imbalance between nodes
        if (check_imbalance(node)) {
            // Find migratable tasks
            struct task_struct *task = find_migrable_task(node);
            if (task) {
                // Calculate migration cost
                int cost = calculate_migration_cost(task, node);
                if (cost < THRESHOLD) {
                    // Perform task migration
                    migrate_task(task, node);
                }
            }
        }
    }
}

3.3 NUMA-Aware Scheduling

The modern Linux kernel implements NUMA-aware scheduling, optimizing task execution in NUMA systems through multiple mechanisms:

1. Automatic NUMA Balancing: Kernel threads kswapd and specific NUMA balancing threads periodically scan task address spaces to detect whether pages are correctly placed or if data should be migrated to the local memory node where the task is running.
2. Page Placement Policy: When a task first accesses a page, the kernel allocates it in the local memory of the current running node.
3. Page Migration: If the scheduler detects that a task frequently accesses pages from remote nodes, it may decide to migrate the pages to the node where the task is currently running or migrate the task to the node where the pages reside.
4. Load Hints: The kernel collects memory access patterns of tasks to guide scheduling and migration decisions.

Automatic NUMA balancing can be tuned through files in the /sys/kernel/debug/sched/numa_balancing directory:

• scan_period_min_ms: Minimum time (in milliseconds) to scan the task’s virtual memory.
• scan_delay_ms: Initial scan delay when a task is created.
• scan_period_max_ms: Maximum time (in milliseconds) to scan the task’s virtual memory.
• scan_size_mb: Number of MB to scan in a single scan.

4 NUMA Memory Allocation Strategies

4.1 Memory Policy Interface

Linux provides a memory policy mechanism that allows user-space applications and kernel components to control the node location of memory allocations. Memory policies are key optimization tools in NUMA systems, defining the rules the kernel should follow when allocating memory.

The main types of memory policies include:

• MPOL_DEFAULT: Default policy, using local allocation.
• MPOL_BIND: Restricts memory allocation to a specific set of nodes.
• MPOL_PREFERRED: Prefers to allocate from a specified node but allows fallback to other nodes.
• MPOL_INTERLEAVE: Interleaves memory pages across multiple nodes.
• MPOL_LOCAL (Linux 4.17+): Prefers to allocate from the local node, similar to the default policy but stricter.

User-space applications can manage memory policies using system calls like mbind(), set_mempolicy(), and move_pages(). Additionally, the libnuma library provides a more convenient interface for managing NUMA policies.

4.2 Binding Policies and CPUsets

Binding policies allow processes or memory regions to be restricted to specific NUMA nodes for execution and memory allocation. This is particularly useful for performance-sensitive applications, ensuring memory access locality.

Linux provides a CPUsets mechanism that allows system administrators to partition the system into multiple independent CPU and memory node partitions. Each cpuset can contain specific CPUs and memory nodes, and tasks can be restricted to run within a specific cpuset.

Below is an example of using cpuset:

# Create cpuset
mkdir /dev/cpuset/application-set
echo 0-3 > /dev/cpuset/application-set/cpus
echo 0 > /dev/cpuset/application-set/mems
echo 1234 > /dev/cpuset/application-set/tasks

CPUsets combined with NUMA memory policies can provide fine-grained resource control, which is particularly useful in virtualization and containerization environments.

4.3 Interleaved Allocation Policy

Interleaved allocation policy allocates memory pages in a round-robin fashion across multiple nodes, similar to RAID 0 striping techniques. This policy is suitable for streaming workloads or applications with random access patterns that cannot benefit from data locality.

Interleaved allocation can balance memory loads across multiple nodes, preventing a single node from becoming a bottleneck. Below is an example code for using the interleaved policy:

#include <numa.h>
#include <numaif.h>

void enable_interleave(void) {
    struct bitmask *bm = numa_allocate_nodemask();
    numa_bitmask_setbit(bm, 0);  // Node 0
    numa_bitmask_setbit(bm, 1);  // Node 1
    numa_set_membind(bm);        // Set interleaved policy
    numa_free_nodemask(bm);
}

4.4 Preferred Node Policy

Preferred node policy indicates that the kernel should prefer to allocate memory from a specified node, but allows fallback to other nodes if that node runs out of memory. This policy provides a balance between flexibility and locality.

Preferred node policy is suitable for applications that perform best on specific nodes but can tolerate occasional remote memory access.

5 NUMA Performance Optimization and Debugging

5.1 Performance Monitoring Tools

Effective performance monitoring is the foundation of NUMA optimization. Linux provides a range of tools to monitor the performance of NUMA systems:

1. numastat: Displays statistics on NUMA memory allocation, including the number of local and remote allocations for each node.

$ numastat
node0           node1
numa_hit        12345678        12345678
numa_miss       1234            1234
numa_foreign    1234            1234
interleave_hit  5678            5678
local_node      12345678        12345678
other_node      1234            1234

2. numactl: Used to view NUMA topology and control the NUMA policy of processes.

# View system NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16384 MB
node 0 free: 8192 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 8192 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

3. lstopo: Generates a graphical topology of the system.
4. perf: Performance analysis tool that can detect NUMA-related performance events, such as remote memory access.

# Monitor remote memory access
$ perf stat -e node-loads,node-load-misses <command>

5.2 NUMA Debugging Techniques

The Linux kernel provides various mechanisms for debugging NUMA issues:

1. /proc filesystem: View the NUMA memory distribution of processes through /proc/*/numa_maps.

$ cat /proc/self/numa_maps
00400000 default file=/bin/cat mapped=6 N0=6
7f33b26c2000 default file=/bin/cat mapped=27 N0=27
...

2. debugfs interface: Control automatic NUMA balancing parameters through /sys/kernel/debug/sched/numa_balancing.
3. Tracing Points: Use ftrace to trace NUMA-related events.

# Enable NUMA balancing tracing
$ echo 1 > /sys/kernel/debug/tracing/events/sched/sched_move_numa/enable
$ cat /sys/kernel/debug/tracing/trace_pipe

5.3 Performance Optimization Best Practices

Depending on the characteristics of the application and the workload of the system, different NUMA optimization strategies can be adopted:

The following table summarizes the optimization strategies for different workload types:

Workload Type	Characteristics	Recommended Strategy	Tools/Techniques
CPU Intensive	High computation density, low memory access	Local allocation, CPU binding	numactl –membind, taskset
Memory Intensive	High memory bandwidth demand	Memory interleaving, large pages	numactl –interleave, hugepages
Data Locality Sensitive	Stable working set, predictable access patterns	Bind to a single node	numactl –cpunodebind
Mixed Workload	Diverse access patterns	Hierarchical strategy, dynamic balancing	auto-NUMA balancing, MPOL_PREFERRED

When optimizing in practice, the following steps should be followed:

1. Performance Analysis: Use monitoring tools to identify current performance bottlenecks and the proportion of remote access.
2. Strategy Selection: Choose the appropriate memory policy based on application characteristics.
3. Testing and Validation: Test the effectiveness of the strategy under simulated loads.
4. Production Deployment: Deploy the optimization strategy in the production environment and continue monitoring.

6 Conclusion

Linux’s support for NUMA systems is a complex and mature subsystem that enables applications to fully leverage the performance potential of NUMA architectures through multi-layered abstractions and optimization strategies. From node management to memory allocation, from scheduling strategies to load balancing, the Linux NUMA implementation reflects the adaptability of modern operating systems in complex hardware environments.

Key implementation points include:

1. Hierarchical Abstraction: Managing physical memory through a three-level structure of nodes, zones, and pages.
2. Locality Priority: Default local allocation strategy and scheduling domain-aware load balancing.
3. Policy Flexibility: A rich memory policy interface to meet different application needs.
4. Automatic Optimization: Automatic NUMA balancing mechanisms adapt to dynamic workloads.

For developers and system administrators, understanding how Linux NUMA works is crucial for optimizing high-performance applications. By effectively utilizing NUMA strategies and monitoring tools, significant performance improvements can be achieved for memory-intensive workloads on multiprocessor systems.

As computing architectures evolve, the importance of NUMA optimization will continue to grow. New hardware platforms such as ARM64 and RISC-V provide good support for NUMA, and the Linux NUMA subsystem will continue to evolve to meet the demands of future computing.