Linux Process Scheduling – Real-Time Scheduler

Background

  • Read the fucking source code! –By Lu Xun
  • A picture is worth a thousand words. –By Gorky

Note:

  1. Kernel version: 4.14
  2. ARM64 processor, Cortex-A53, dual-core
  3. Tools used: Source Insight 3.5, Visio

1. Overview

In the Linux kernel, real-time processes always have a higher priority than ordinary processes. The scheduling of real-time processes is managed by the Real Time Scheduler (RT scheduler), while ordinary processes are managed by the CFS scheduler. The scheduling policies supported for real-time processes are: SCHED_FIFO and SCHED_RR.

The previous series of articles analyzed the CFS scheduler, covering topics such as CPU load, group scheduling, and bandwidth control. This article will analyze the RT scheduler from these perspectives, making it easier to understand if you have read the previous articles.

Without further ado, let’s get to the point.

2. Data Structures

It is necessary to list the key structures:

  • struct rq: Run queue, one for each CPU;
  • struct rt_rq: Real-time run queue, used to manage scheduling entities for real-time tasks;
  • struct sched_rt_entity: Real-time scheduling entity, used for participating in scheduling, functions similarly to struct sched_entity;
  • struct task_group: Group scheduling structure;
  • struct rt_bandwidth: Bandwidth control structure;

As usual, let’s start with a diagram to clarify the relationships between these structures:

Linux Process Scheduling - Real-Time Scheduler

  • From the structural organization in the diagram, it is basically consistent with the CFS scheduler. The difference is that the CFS scheduler organizes scheduling entities using a red-black tree, while the RT scheduler uses a priority queue to organize real-time scheduling entities;
  • The rt_rq run queue maintains 100 priority queues (linked lists), with priorities ranging from 0 to 99, from high to low;
  • The objects managed by the scheduler are scheduling entities; the task task_struct and task group task_group both participate in scheduling management through the embedded scheduling entity data structure;
  • The task_group task group scheduling maintains its own rt_rq for each CPU to store its sub-tasks or sub-task groups, which can be cascaded downwards, thus forming a tree structure;

Among the structures mentioned above, struct rq and struct task_group have been analyzed in previous articles. Below, I will provide brief comments on the key structures related to the RT run queue:

struct sched_rt_entity {    struct list_head    run_list;   // Used to add to the priority queue    unsigned long            timeout;    // Set timeout    unsigned long            watchdog_stamp;     // Used to record jiffies value    unsigned int            time_slice;     // Time slice, 100ms,    unsigned short            on_rq;    unsigned short            on_list;    struct sched_rt_entity        *back;  // Temporarily used for connecting RT scheduling entities from top to bottom    #ifdef CONFIG_RT_GROUP_SCHED    struct sched_rt_entity        *parent;    // Points to the parent RT scheduling entity    /* rq on which this entity is (to be) queued: */    struct rt_rq            *rt_rq;     // RT scheduling entity's real-time run queue, being scheduled    /* rq "owned" by this entity/group: */    struct rt_rq            *my_q;  // RT scheduling entity's owned real-time run queue, used to manage sub-tasks or sub-groups    #endif} __randomize_layout;    /* Real-Time classes' related field in a runqueue: */struct rt_rq {    struct rt_prio_array active;    // Priority queue, a linked list of 100 priorities, and defines a bitmap for quick querying    unsigned int rt_nr_running; // Number of active tasks in the RT run queue    unsigned int rr_nr_running;    #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED    struct {        int curr; /* highest queued rt task prio */     // Current highest priority of RT task    #ifdef CONFIG_SMP        int next; /* next highest */        // Priority of the next RT task to run, if two tasks have the highest priority, then curr == next    #endif    } highest_prio;    #endif    #ifdef CONFIG_SMP    unsigned long rt_nr_migratory;      // This value increases or decreases when the task is not bound to a specific CPU, used for task migration    unsigned long rt_nr_total;      // Used for overload check    int overloaded;     // If the RT run queue is overloaded, tasks will be pushed to other CPUs    struct plist_head pushable_tasks;   // Priority list, used to push overloaded tasks    #endif /* CONFIG_SMP */    int rt_queued;  // Indicates that the RT run queue has been added to rq queue    int rt_throttled;   // Used for throttling operations    u64 rt_time;    // Accumulated run time, exceeds local rt_runtime, then throttling occurs    u64 rt_runtime;    // Allocated run time for local pool    /* Nests inside the rq lock: */    raw_spinlock_t rt_runtime_lock;    #ifdef CONFIG_RT_GROUP_SCHED    unsigned long rt_nr_boosted;    // Used to solve priority inversion issues    struct rq *rq;      // Points to the run queue    struct task_group *tg;  // Points to the task group    #endif};struct rt_bandwidth {    /* nests inside the rq lock: */    raw_spinlock_t        rt_runtime_lock;    ktime_t            rt_period;      // Time period    u64            rt_runtime;     // Run time within a time period, exceeds limits, default value is 0.95ms    struct hrtimer        rt_period_timer;    // Time period timer    unsigned int        rt_period_active;};

3. Process Analysis

3.1 Runtime Statistics

The update of runtime statistics is completed in the update_curr_rt function:

Linux Process Scheduling - Real-Time Scheduler

  • The update_curr_rt function mainly includes two parts:
  1. Statistics update processing of runtime;
  2. If the runtime exceeds the allocated time, perform time balancing processing and determine whether throttling is needed. If throttling is performed, the RT queue will be dequeued and rescheduled;

To understand it more intuitively, let’s look at two more images:

The update of sched_rt_avg_update is illustrated as follows:

Linux Process Scheduling - Real-Time Scheduler

  • rq->age_stamp: Set the starting time when the run queue runs for the first time after the CPU starts, and subsequently updated periodically;
  • rt_avg: Accumulated RT average runtime, halved every 0.5 seconds, used to calculate the percentage of time used by RT in CFS load balancing;

3.2 Group Scheduling

The RT scheduler is basically similar to the CFS scheduler in group scheduling. For group scheduling of the CFS scheduler, please refer to (IV) Linux Process Scheduling - Group Scheduling and Bandwidth Control.

Let’s take a look at the organizational structure diagram of the RT scheduler group scheduling:

Linux Process Scheduling - Real-Time Scheduler

  • The system allocates an RT run queue and an RT scheduling entity for each CPU. The task group participates in scheduling through the included RT scheduling entity;
  • The RT queue of the task group task_group is used to store tasks or sub-task groups belonging to that group, thus forming a cascading structure;

Let’s take a look at the actual organizational diagram:

Linux Process Scheduling - Real-Time Scheduler

3.3 Bandwidth Control

Please refer to (IV) Linux Process Scheduling - Group Scheduling and Bandwidth Control.

In bandwidth control, the scheduling time period for the RT scheduler is set to 1 second, and the runtime is set to 0.95 seconds:

/* * period over which we measure -rt task CPU usage in us. * default: 1s */unsigned int sysctl_sched_rt_period = 1000000;/* * part of the period that we allow rt tasks to run in us. * default: 0.95s */int sysctl_sched_rt_runtime = 950000;

These two values can be set in user mode through /sys/fs/cgroup/cpu/rt_runtime_us and /sys/fs/cgroup/cpu/rt_period_us.

Let’s look at the function call flow:

Linux Process Scheduling - Real-Time Scheduler

  • The init_rt_bandwidth function is called when creating and allocating RT task groups, and its job is to initialize the related fields of the rt_bandwidth structure: setting the time period rt_period and runtime limit rt_runtime, both set to default values;
  • These can be set from user mode by operating the corresponding nodes under /sys/fs/cgroup/cpu for rt_period and rt_runtime; the final function called is tg_set_rt_bandwidth, which traverses the task group from bottom to top to set the time period and limit runtime;
  • When enqueue_rt_entity queues the RT scheduling entity, it ultimately triggers the execution of the start_rt_bandwidth function. When the high-precision timer expires, it calls the do_sched_rt_period_timer function;
  • The do_sched_rt_period_timer function checks the relationship between the accumulated runtime rt_time of the RT run queue and the set limit runtime rt_runtime to determine whether throttling is needed. If throttling has already been performed in this function, it will call balance_time to balance time across multiple CPUs. In simple terms, this means taking some time from the rt_rq queue of other CPUs and adding it to the current CPU’s rt_rq queue, thus increasing the limit runtime rt_runtime of the current RT run queue while decreasing the limit runtime of other CPUs’ rt_rq run queues.

Let’s look at an illustrative effect diagram:

Linux Process Scheduling - Real-Time Scheduler

3.4 Scheduler Function Analysis

Let’s look at a diagram from earlier:

Linux Process Scheduling - Real-Time Scheduler

Now let’s check the code for the RT scheduler instance:

const struct sched_class rt_sched_class = {    .next            = &fair_sched_class,    .enqueue_task        = enqueue_task_rt,    .dequeue_task        = dequeue_task_rt,    .yield_task        = yield_task_rt,    .check_preempt_curr    = check_preempt_curr_rt,    .pick_next_task        = pick_next_task_rt,    .put_prev_task        = put_prev_task_rt,    #ifdef CONFIG_SMP    .select_task_rq        = select_task_rq_rt,    .set_cpus_allowed       = set_cpus_allowed_common,    .rq_online              = rq_online_rt,    .rq_offline             = rq_offline_rt,    .task_woken        = task_woken_rt,    .switched_from        = switched_from_rt,    #endif    .set_curr_task          = set_curr_task_rt,    .task_tick        = task_tick_rt,    .get_rr_interval    = get_rr_interval_rt,    .prio_changed        = prio_changed_rt,    .switched_to        = switched_to_rt,    .update_curr        = update_curr_rt,};

3.4.1 pick_next_task_rt

The pick_next_task_rt function is used by the scheduler to select the next task to execute. The process is as follows:

Linux Process Scheduling - Real-Time Scheduler

  • Unlike the CFS scheduler, the RT scheduler processes tasks in a domain consisting of multiple CPUs using pull/push methods. This means that if there are no high-priority tasks in the current CPU’s run queue, it will consider finding a higher-priority task from another CPU’s run queue to execute, ensuring that tasks are processed according to priority. Additionally, the current CPU will push tasks to other lower-priority CPU run queues.
  • The processing logic of _pick_next_task_rt is relatively simple. If the real-time scheduling entity is a task, it directly looks for the highest priority task in the bitmap of the priority queue. If the real-time scheduling entity is a task_group, it continues to traverse downwards;

Regarding task pull/push, Linux provides struct plist, which is a priority-based doubly linked list, with the task organization relationship illustrated in the following diagram:

Linux Process Scheduling - Real-Time Scheduler

The general illustration of pull_rt_task is as follows:

Linux Process Scheduling - Real-Time Scheduler

  • If the priority tasks on the current CPU are not high, find a higher-priority task from another CPU’s pushable_tasks list to execute;

3.4.2 enqueue_task_rt/dequeue_task_rt

When RT tasks are dequeued and enqueued, the two interfaces enqueue_task_rt/dequeue_task_rt are used to complete the process. The calling process is as follows:

Linux Process Scheduling - Real-Time Scheduler

  • enqueue_task_rt and dequeue_task_rt will both call the dequeue_rt_stack interface. When the requested rt_se corresponds to a task group, it will dequeue the scheduling entity from the top to the requested rt_se;
  • When tasks are added to the RT run queue, if multiple tasks can be allocated to multiple CPUs, the overload is set for task migration;

Feeling a bit tired, let’s wrap it up.

Leave a Comment