New Optimization in Linux Scheduler: Merging cfs_rq and sched_entity for Dramatic Cache Hit Rate Improvement

Abstract

On the Linux 6.15 test kernel, a series of patches submitted by Google engineer Zecheng Li significantly reduces pointer jumps and cache misses by merging the <span><span>cfs_rq</span></span> (Completely Fair Scheduler run queue) with the <span><span>sched_entity</span></span> (scheduling entity) structure and switching to a per-CPU allocator. In large-scale cgroup hierarchical scenarios, this patch reduces LLC (Last Level Cache) miss rates by up to 31% on AMD machines and improves kernel IPC by 25%, while other benchmarks show almost no regressions. This patch is expected to be merged in Linux 6.18.

Background: CFS Scheduler and Data Locality Issues

The Completely Fair Scheduler (CFS) in Linux is the default CPU scheduler. In CFS, each task_group maintains a <span><span>struct cfs_rq</span></span> (run queue) on each CPU to store scheduling-related data. At the same time, each <span><span>cfs_rq</span></span> corresponds to a <span><span>struct sched_entity</span></span> (scheduling entity), which indicates the “position” of this run queue in the parent runqueue.

In the original implementation, the task_group managed these structures through two arrays of pointer arrays <span><span>tg->cfs_rq</span></span> and <span><span>tg->se</span></span>, requiring dereferencing pointers and jumping to the corresponding memory location for each access. Although this memory is allocated locally in NUMA via <span><span>kzalloc_node</span></span>, frequent access to different CPU/task_group structures under multi-level cgroup hierarchies still leads to a large number of LLC misses.

Core Improvements: Structure Co-location + Per-CPU Allocation

The two key optimizations in the patch series are:

1. Embedding <span><span>sched_entity</span></span> into <span><span>cfs_rq</span></span>

Previously, accessing the parent run queue’s <span><span>sched_entity</span></span> required multiple pointer jumps:

cfs_rq-&gt;tg-&gt;se[cpu]

After optimization, the <span><span>sched_entity</span></span> field is directly embedded into the <span><span>cfs_rq</span></span>, allowing for direct offset calculations to read it, significantly reducing indirect pointer access. The trade-off is that the root task_group (root cgroup) also needs to allocate <span><span>sched_entity</span></span> memory, which was previously unnecessary, but the additional memory overhead is negligible at CPU count * sizeof(sched_entity).

2. Using a per-CPU allocator for <span><span>cfs_rq</span></span>

In hot paths, accessing <span><span>cfs_rq</span></span> often involves iterating over multiple task_groups on the same CPU, so per-CPU allocation can improve cache affinity and hit rates while also eliminating the original pointer array memory.

Comparison of Memory Layout Before and After Optimization

Original Layout

tg -&gt; cfs_rq pointers -&gt; cfs_rq[]
tg -&gt; se pointers -&gt; sched_entity[]

Optimized Layout

tg percpu offset -&gt;
  [CPU0] cfs_rq + sched_entity
  [CPU1] cfs_rq + sched_entity
  ...

This “structure co-location” + per-CPU layout significantly improves data locality.

Performance Testing: Significant Decrease in Cache Miss Rates

The author constructed a tree-like cgroup hierarchy (adjusting width/depth) on Intel and AMD machines, running the schbench load with an 80% CPU quota for throttling tests, with a bandwidth period of 10ms. The results (excerpted from the patch description):

Kernel LLC Misses depth=3 width=10 depth=5 width=4
AMD-orig [2218.98, 2241.89]M [2599.80, 2645.16]M
AMD-opt [1957.62, 1981.55]M [2380.47, 2431.86]M
Change -11.69% -8.24%
Intel-orig [1580.53, 1604.90]M [2125.37, 2208.68]M
Intel-opt [1066.94, 1100.19]M [1543.77, 1570.83]M
Change -31.96% -28.13%

Additionally, the kernel IPC on the AMD platform improved by 25%, and there was also a 3% improvement on the Intel platform. In other workloads without CPU share limitations (sysbench, hackbench, ebizzy), there were almost no noticeable performance regressions.

Application Scenarios: Scheduling Optimization for Large-Scale Cgroup Hierarchies

This optimization primarily targets the following scenarios:

  • Scheduling and throttling functions under large-scale cgroup hierarchies (thousands of instances)
  • Hot paths with frequent access to scheduling queues and entities (such as the scheduler’s pick_next_task function)

For small-scale, simple scenarios, the benefits may not be significant; however, in cloud platforms, multi-tenant scheduling, and container quotas, the improvement in cache hit rates can directly translate to lower latency and higher throughput.

Conclusion and Outlook

Zecheng Li’s patch series reduces cache misses in the scheduler’s critical path through “structure co-location + per-CPU allocation,” achieving considerable performance improvements with almost no side effects. Future attention can be focused on:

  • Further layout optimizations of scheduler data structures
  • Further adjustments in NUMA cross-node scenarios
  • Regression testing to ensure stability under various edge cases

This patch series v4 is actively discussed on LKML and is expected to be merged in Linux 6.18.

Patch and discussion links:

Thank you for reading. Feedback and comments are welcome.

Leave a Comment