Abstract

On the Linux 6.15 test kernel, a series of patches submitted by Google engineer Zecheng Li significantly reduces pointer jumps and cache misses by merging the cfs_rq (Completely Fair Scheduler run queue) with the sched_entity (scheduling entity) structure and switching to a per-CPU allocator. In large-scale cgroup hierarchical scenarios, this patch reduces LLC (Last Level Cache) miss rates by up to 31% on AMD machines and improves kernel IPC by 25%, while other benchmarks show almost no regressions. This patch is expected to be merged in Linux 6.18.

Background: CFS Scheduler and Data Locality Issues

The Completely Fair Scheduler (CFS) in Linux is the default CPU scheduler. In CFS, each task_group maintains a struct cfs_rq (run queue) on each CPU to store scheduling-related data. At the same time, each cfs_rq corresponds to a struct sched_entity (scheduling entity), which indicates the “position” of this run queue in the parent runqueue.

In the original implementation, the task_group managed these structures through two arrays of pointer arrays tg->cfs_rq and tg->se, requiring dereferencing pointers and jumping to the corresponding memory location for each access. Although this memory is allocated locally in NUMA via kzalloc_node, frequent access to different CPU/task_group structures under multi-level cgroup hierarchies still leads to a large number of LLC misses.

Core Improvements: Structure Co-location + Per-CPU Allocation

The two key optimizations in the patch series are:

1. Embedding `sched_entity` into `cfs_rq`

Previously, accessing the parent run queue’s sched_entity required multiple pointer jumps:

cfs_rq-&gt;tg-&gt;se[cpu]

After optimization, the sched_entity field is directly embedded into the cfs_rq, allowing for direct offset calculations to read it, significantly reducing indirect pointer access. The trade-off is that the root task_group (root cgroup) also needs to allocate sched_entity memory, which was previously unnecessary, but the additional memory overhead is negligible at CPU count * sizeof(sched_entity).

2. Using a per-CPU allocator for `cfs_rq`

In hot paths, accessing cfs_rq often involves iterating over multiple task_groups on the same CPU, so per-CPU allocation can improve cache affinity and hit rates while also eliminating the original pointer array memory.

Comparison of Memory Layout Before and After Optimization

Original Layout

tg -&gt; cfs_rq pointers -&gt; cfs_rq[]
tg -&gt; se pointers -&gt; sched_entity[]

Optimized Layout

tg percpu offset -&gt;
  [CPU0] cfs_rq + sched_entity
  [CPU1] cfs_rq + sched_entity
  ...

This “structure co-location” + per-CPU layout significantly improves data locality.

Performance Testing: Significant Decrease in Cache Miss Rates

The author constructed a tree-like cgroup hierarchy (adjusting width/depth) on Intel and AMD machines, running the schbench load with an 80% CPU quota for throttling tests, with a bandwidth period of 10ms. The results (excerpted from the patch description):

Kernel LLC Misses	depth=3 width=10	depth=5 width=4
AMD-orig	[2218.98, 2241.89]M	[2599.80, 2645.16]M
AMD-opt	[1957.62, 1981.55]M	[2380.47, 2431.86]M
Change	-11.69%	-8.24%
Intel-orig	[1580.53, 1604.90]M	[2125.37, 2208.68]M
Intel-opt	[1066.94, 1100.19]M	[1543.77, 1570.83]M
Change	-31.96%	-28.13%

Additionally, the kernel IPC on the AMD platform improved by 25%, and there was also a 3% improvement on the Intel platform. In other workloads without CPU share limitations (sysbench, hackbench, ebizzy), there were almost no noticeable performance regressions.

Application Scenarios: Scheduling Optimization for Large-Scale Cgroup Hierarchies

This optimization primarily targets the following scenarios:

Scheduling and throttling functions under large-scale cgroup hierarchies (thousands of instances)
Hot paths with frequent access to scheduling queues and entities (such as the scheduler’s pick_next_task function)

For small-scale, simple scenarios, the benefits may not be significant; however, in cloud platforms, multi-tenant scheduling, and container quotas, the improvement in cache hit rates can directly translate to lower latency and higher throughput.

Conclusion and Outlook

Zecheng Li’s patch series reduces cache misses in the scheduler’s critical path through “structure co-location + per-CPU allocation,” achieving considerable performance improvements with almost no side effects. Future attention can be focused on:

Further layout optimizations of scheduler data structures
Further adjustments in NUMA cross-node scenarios
Regression testing to ensure stability under various edge cases

This patch series v4 is actively discussed on LKML and is expected to be merged in Linux 6.18.

Patch and discussion links:

https://patchew.org/linux/[email protected]/
https://www.spinics.net/lists/kernel/msg5835594.html
https://patchew.org/linux/[email protected]/[email protected]/
https://lkml.rescloud.iu.edu/2507.0/01653.html

Thank you for reading. Feedback and comments are welcome.

New Optimization in Linux Scheduler: Merging cfs_rq and sched_entity for Dramatic Cache Hit Rate Improvement

Abstract

Background: CFS Scheduler and Data Locality Issues

Core Improvements: Structure Co-location + Per-CPU Allocation

1. Embedding `<span><span>sched_entity</span></span>` into `<span><span>cfs_rq</span></span>`

2. Using a per-CPU allocator for `<span><span>cfs_rq</span></span>`

Comparison of Memory Layout Before and After Optimization

Performance Testing: Significant Decrease in Cache Miss Rates

Application Scenarios: Scheduling Optimization for Large-Scale Cgroup Hierarchies

Conclusion and Outlook

Leave a Comment Cancel reply

Abstract

Background: CFS Scheduler and Data Locality Issues

Core Improvements: Structure Co-location + Per-CPU Allocation

1. Embedding <span><span>sched_entity</span></span> into <span><span>cfs_rq</span></span>

2. Using a per-CPU allocator for <span><span>cfs_rq</span></span>

Comparison of Memory Layout Before and After Optimization

Performance Testing: Significant Decrease in Cache Miss Rates

Application Scenarios: Scheduling Optimization for Large-Scale Cgroup Hierarchies

Conclusion and Outlook

Related posts

Leave a Comment Cancel reply

1. Embedding `<span><span>sched_entity</span></span>` into `<span><span>cfs_rq</span></span>`

2. Using a per-CPU allocator for `<span><span>cfs_rq</span></span>`