Linux 6.6 replaced the CFS (Completely Fair Scheduler) with the EEVDF (Earliest Eligible Virtual Deadline First) scheduler, and Linux 6.12 completed EEVDF by improving the latency handling of sleeping processes and adding support for per-process time slice settings. The EEVDF scheduler treats scheduling latency as one of the factors in process scheduling: while still allocating run time fairly, it prioritizes the earliest virtual deadline among processes that have not yet received their deserved run time. This bounds scheduling latency and addresses the limitation of the original CFS scheduler, which could only allocate run time fairly and had no way to meet per-process latency requirements.
The difference between the run time a process deserves and the run time it actually receives is called lag. A positive lag indicates that the process has not received enough time, while a negative lag indicates that the process has received more time than it deserves. A process whose lag is greater than or equal to 0 is considered eligible. A process’s virtual deadline is its virtual run time plus its virtual time slice. The EEVDF scheduler selects, among eligible processes, the one with the earliest virtual deadline. Latency-sensitive processes with shorter time slices therefore get earlier deadlines and are scheduled preferentially, which improves their response times.
The EEVDF scheduler builds on the CFS scheduler: it still uses the fair scheduling class (fair_sched_class) and calculates virtual run time the same way CFS does. I recommend reading my previous article on the “Completely Fair Scheduling Algorithm” first.
1. Weighted Average Virtual Run Time
cfs_rq.avg_vruntime is the sum, over all scheduling entities in the fair run queue, of the weighted relative virtual run time “(se->vruntime - cfs_rq.min_vruntime) × se->load.weight” (the relative baseline being cfs_rq.min_vruntime), where se->vruntime is the virtual run time of scheduling entity se, cfs_rq.min_vruntime is the minimum virtual run time in the fair run queue, and se->load.weight is the weight of scheduling entity se.
cfs_rq.avg_load is the total weight of all scheduling entities. In my opinion, the names “avg_vruntime” and “avg_load” are poorly chosen and would be clearer as “weighted_rel_vruntime_sum” and “weight_sum”.
“cfs_rq.avg_vruntime / cfs_rq.avg_load” is the weighted average relative virtual run time of the fair run queue, and adding cfs_rq.min_vruntime gives the weighted average virtual run time of the fair run queue.
What is the purpose of the weighted average virtual run time?
Assuming process 1 has a virtual run time of 100 and a weight of 2048, while process 2 has a virtual run time of 200 and a weight of 1024. The weighted average virtual run time is (100×2048+200×1024)/(2048+1024)=400/3≈133.
The total actual run time T = 100×2048/1024 + 200×1024/1024 = 400 (each virtual run time converts back to actual run time by multiplying by weight/1024).
Process 1’s deserved actual run time t1 = T × 2048/(2048+1024) = 800/3.
Process 1’s deserved virtual run time v1 = t1 × 1024/2048 = 400/3 ≈ 133.
Thus process 1’s deserved virtual run time equals the weighted average virtual run time; in general, the weighted average virtual run time is the virtual run time each process deserves.
When adding a scheduling entity to the fair run queue, add “(se->vruntime - cfs_rq.min_vruntime) × se->load.weight” to cfs_rq.avg_vruntime and add its weight to cfs_rq.avg_load, as shown in the code below.
enqueue_task_fair() -> enqueue_entity() -> __enqueue_entity() -> avg_vruntime_add()
kernel/sched/fair.c
static void
avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);
	s64 key = entity_key(cfs_rq, se);

	cfs_rq->avg_vruntime += key * weight;
	cfs_rq->avg_load += weight;
}
When removing a scheduling entity from the fair run queue, subtract “(se->vruntime - cfs_rq.min_vruntime) × se->load.weight” from cfs_rq.avg_vruntime and subtract its weight from cfs_rq.avg_load, as shown in the code below.
dequeue_task_fair() -> dequeue_entities() -> dequeue_entity() -> __dequeue_entity() -> avg_vruntime_sub()
kernel/sched/fair.c
static void
avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned long weight = scale_load_down(se->load.weight);
	s64 key = entity_key(cfs_rq, se);

	cfs_rq->avg_vruntime -= key * weight;
	cfs_rq->avg_load -= weight;
}
When cfs_rq.min_vruntime increases (it grows monotonically), subtract “cfs_rq.avg_load × delta” from cfs_rq.avg_vruntime so that the stored sum stays relative to the new baseline, as shown in the code below.
update_min_vruntime() -> __update_min_vruntime()
kernel/sched/fair.c
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
	u64 min_vruntime = cfs_rq->min_vruntime;
	s64 delta = (s64)(vruntime - min_vruntime);

	if (delta > 0) {
		avg_vruntime_update(cfs_rq, delta);
		min_vruntime = vruntime;
	}
	return min_vruntime;
}

static inline
void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
{
	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
}
2. Virtual Lag and Virtual Deadline
Calculating virtual lag: se->vlag = (cfs_rq.min_vruntime + cfs_rq.avg_vruntime / cfs_rq.avg_load) - se->vruntime, i.e. the weighted average virtual run time of the fair run queue minus the virtual run time of the scheduling entity. The result is clamped to the range [-limit, limit], where limit = (n × 1024) / se->load.weight and n is the maximum of twice the time slice and TICK_NSEC (the length of one tick in nanoseconds). Refer to the function entity_lag() in the file “kernel/sched/fair.c”.
Calculating virtual deadline: se->deadline = se->vruntime + ((se->slice × 1024) / se->load.weight), where se->slice is the time slice length.
An eligible scheduling entity is one that meets the condition “se->vruntime <= (cfs_rq.min_vruntime + (cfs_rq.avg_vruntime / cfs_rq.avg_load))”, meaning that the virtual run time of the scheduling entity is less than or equal to the weighted average virtual run time of the fair run queue, which implies that se->vlag is greater than or equal to 0. Refer to the function vruntime_eligible() in the file “kernel/sched/fair.c”.
Take the same example as before: process 1 has a virtual run time of 100 and a weight of 2048, while process 2 has a virtual run time of 200 and a weight of 1024. The weighted average virtual run time is (100×2048+200×1024)/(2048+1024) = 400/3 ≈ 133.
Process 1’s virtual lag is 133-100=33, and the actual lag is (33×2048/1024)=66. Process 2’s virtual lag is 133-200=-67, and the actual lag is (-67×1024/1024)=-67.
The sum of the actual lags is 66 + (-67) = -1, which is close to 0; the -1 comes from rounding, since in exact arithmetic the actual lags of all processes sum to 0.
3. Time Slice Length
sched_entity.slice is the time slice length for normal processes, with a default value of sysctl_sched_base_slice. sysctl_sched_base_slice is the base time slice, measured in nanoseconds, with a default value of 0.75 milliseconds, which can be modified through the file “/sys/kernel/debug/sched/base_slice_ns”.
Setting a shorter time slice for latency-sensitive processes gives them earlier virtual deadlines, so the process scheduler selects them sooner. Starting with Linux 6.12, the sched_runtime field of struct sched_attr can be passed to the system call sched_setattr() to set the time slice of a normal process, in the range of 0.1 to 100 milliseconds (inclusive).
The handling of the system call sched_setattr() is shown below; se->custom_slice == 1 indicates that a custom time slice is in use, i.e. the process has set its own time slice.
SYSCALL_DEFINE3(sched_setattr) -> sched_setattr() -> __sched_setscheduler() -> __setscheduler_params()
kernel/sched/syscalls.c
static void __setscheduler_params(struct task_struct *p,
		const struct sched_attr *attr)
{
	...
	if (dl_policy(policy)) {
		...
	} else if (fair_policy(policy)) {
		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
		if (attr->sched_runtime) {
			p->se.custom_slice = 1;
			p->se.slice = clamp_t(u64, attr->sched_runtime,
					      NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
					      NSEC_PER_MSEC*100); /* HZ=100  / 10 */
		} else {
			p->se.custom_slice = 0;
			p->se.slice = sysctl_sched_base_slice;
		}
	}
	...
}
4. Selecting Processes
The fair run queue’s red-black tree in Linux 6.6 is sorted by virtual run time.
se->min_deadline is the minimum virtual deadline of the subtree rooted at se, equal to the minimum of se->deadline, left->min_deadline, and right->min_deadline, where left is the left child of se and right is the right child.
The function pick_eevdf() selects the process with the earliest virtual deadline from eligible scheduling entities.
In Linux 6.8, the fair run queue’s red-black tree was modified to sort by virtual deadline (the patch “sched/eevdf: Sort the rbtree by virtual deadline” modified the function entity_before() in the file “kernel/sched/fair.c”).
se->min_vruntime is the minimum virtual run time of the subtree rooted at se, equal to the minimum of se->vruntime, left->min_vruntime, and right->min_vruntime, where left is the left child of se and right is the right child.
The function pick_eevdf(): if the leftmost scheduling entity in the red-black tree (the one with the earliest virtual deadline) is eligible, select it. Otherwise, walk down from the root: if se has a left subtree whose minimum virtual run time is eligible, descend into the left subtree; otherwise, if se itself is eligible, select se; otherwise descend into the right subtree.
5. Setting Virtual Run Time and Virtual Deadline on Enqueue
When a normal process sleeps or is migrated to another processor, its virtual lag is preserved as it is removed from the fair run queue. When the process is woken up or migrated to the current processor, its virtual run time is set to the weighted average virtual run time of the fair run queue minus the process’s virtual lag, and its virtual deadline is set to that virtual run time plus the virtual time slice. With the scheduling feature PLACE_LAG (enabled by default), the virtual lag is first corrected to “se->vlag × (weight_sum + se->load.weight) / weight_sum”, where weight_sum is the total weight of the fair run queue. The code is as follows.
enqueue_task_fair() -> enqueue_entity() -> place_entity()
kernel/sched/fair.c
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 vslice, vruntime = avg_vruntime(cfs_rq);
	s64 lag = 0;

	if (!se->custom_slice)
		se->slice = sysctl_sched_base_slice;
	vslice = calc_delta_fair(se->slice, se);

	/*
	 * Scheduling feature PLACE_LAG is enabled by default:
	 * lag = se->vlag * (weight_sum + se->load.weight) / weight_sum
	 */
	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
		struct sched_entity *curr = cfs_rq->curr;
		unsigned long load;

		lag = se->vlag;

		load = cfs_rq->avg_load;
		if (curr && curr->on_rq)
			load += scale_load_down(curr->load.weight);

		lag *= load + scale_load_down(se->load.weight);
		if (WARN_ON_ONCE(!load))
			load = 1;
		lag = div_s64(lag, load);
	}

	se->vruntime = vruntime - lag;
	...
	se->deadline = se->vruntime + vslice;
}
6. Virtual Lag of Sleeping Processes
How is the virtual lag of sleeping processes managed? The implementation in Linux 6.12 is: when an ineligible process sleeps, it remains in the run queue but is marked for delayed dequeue; its virtual lag increases as virtual run time elapses, and once the lag becomes 0 or positive, the scheduler removes the process from the run queue. The effect is that briefly sleeping processes cannot escape their negative lag, while processes that sleep for a long time eventually have their lag debt forgiven. Positive lag is retained indefinitely until the process runs again.
When an ineligible process sleeps, it is marked for delayed dequeue, as shown in the code below.
__schedule() -> block_task() -> dequeue_task(rq, p, DEQUEUE_SLEEP | flags)
-> p->sched_class->dequeue_task()= dequeue_task_fair()
-> dequeue_entities() -> dequeue_entity()
kernel/sched/fair.c
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	bool sleep = flags & DEQUEUE_SLEEP;
	...
	if (flags & DEQUEUE_DELAYED) {
		...
	} else {
		bool delay = sleep;
		...
		/* scheduling feature DELAY_DEQUEUE is enabled by default */
		if (sched_feat(DELAY_DEQUEUE) && delay &&
		    !entity_eligible(cfs_rq, se)) {
			...
			se->sched_delayed = 1;
			return false;
		}
	}
	...
}
When the process scheduler selects a delayed dequeue process, it removes it from the run queue, as shown in the code below.
pick_task_fair() -> pick_next_entity()
kernel/sched/fair.c
static struct sched_entity *
pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
	...
	struct sched_entity *se = pick_eevdf(cfs_rq);

	if (se->sched_delayed) {
		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
		return NULL;
	}
	return se;
}
dequeue_entities() -> dequeue_entity()
kernel/sched/fair.c
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	...
	if (flags & DEQUEUE_DELAYED) {
		SCHED_WARN_ON(!se->sched_delayed);
	} else {
		...
	}
	...
	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);
	se->on_rq = 0;
	...
	if (flags & DEQUEUE_DELAYED)
		finish_delayed_dequeue_entity(se);
	...
}

static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
{
	se->sched_delayed = 0;

	/* scheduling feature DELAY_ZERO is enabled by default */
	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
		se->vlag = 0;
}
When a delayed-dequeue sleeping process is woken up, it is requeued, as shown in the code below.
wake_up_process() -> try_to_wake_up() -> ttwu_queue() -> ttwu_do_activate() -> activate_task() -> enqueue_task()
-> p->sched_class->enqueue_task() = enqueue_task_fair()
kernel/sched/fair.c
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	...
	for_each_sched_entity(se) {
		if (se->on_rq) {
			if (se->sched_delayed)
				requeue_delayed_entity(se);
			break;
		}
		...
	}
	...
}
7. Retaining Relative Virtual Deadline When Migrating Processes
By default, the scheduling feature PLACE_REL_DEADLINE is enabled, which retains the relative virtual deadline when migrating a normal process from one processor to another, as shown in the code below.
kernel/sched/fair.c
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	...
	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
		se->deadline -= se->vruntime;
		se->rel_deadline = 1;
	}
	...
}

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	...
	if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
		se->deadline += se->vruntime;
		se->rel_deadline = 0;
		return;
	}
	...
}
8. References
(1) An EEVDF CPU scheduler for Linux, https://lwn.net/Articles/925371/
(2) Completing the EEVDF scheduler, https://lwn.net/Articles/969062/
(3) Kernel documentation “Documentation/scheduler/sched-eevdf.rst”