Background
-
Read the fucking source code!
–By Lu Xun -
A picture is worth a thousand words.
–By Gorky
Note:
-
Kernel version: 4.14 -
ARM64 processor, Cortex-A53, dual-core -
Tools used: Source Insight 3.5, Visio
1. Overview
In the Linux kernel, real-time processes always have a higher priority than ordinary processes. The scheduling of real-time processes is managed by the Real Time Scheduler (RT scheduler)
, while ordinary processes are managed by the CFS scheduler
. The scheduling policies supported for real-time processes are: SCHED_FIFO
and SCHED_RR
.
The previous series of articles analyzed the CFS scheduler
, covering topics such as CPU load
, group scheduling
, and bandwidth control
. This article will analyze the RT scheduler
from these perspectives, making it easier to understand if you have read the previous articles.
Without further ado, let’s get to the point.
2. Data Structures
It is necessary to list the key structures:
-
struct rq
: Run queue, one for each CPU; -
struct rt_rq
: Real-time run queue, used to manage scheduling entities for real-time tasks; -
struct sched_rt_entity
: Real-time scheduling entity, used for participating in scheduling, functions similarly tostruct sched_entity
; -
struct task_group
: Group scheduling structure; -
struct rt_bandwidth
: Bandwidth control structure;
As usual, let’s start with a diagram to clarify the relationships between these structures:
-
From the structural organization in the diagram, it is basically consistent with the CFS scheduler
. The difference is that theCFS scheduler
organizes scheduling entities using a red-black tree, while theRT scheduler
uses a priority queue to organize real-time scheduling entities; -
The rt_rq
run queue maintains 100 priority queues (linked lists), with priorities ranging from 0 to 99, from high to low; -
The objects managed by the scheduler are scheduling entities; the task task_struct
and task grouptask_group
both participate in scheduling management through the embedded scheduling entity data structure; -
The task_group
task group scheduling maintains its ownrt_rq
for each CPU to store its sub-tasks or sub-task groups, which can be cascaded downwards, thus forming a tree structure;
Among the structures mentioned above, struct rq
and struct task_group
have been analyzed in previous articles. Below, I will provide brief comments on the key structures related to the RT run queue:
struct sched_rt_entity { struct list_head run_list; // Used to add to the priority queue unsigned long timeout; // Set timeout unsigned long watchdog_stamp; // Used to record jiffies value unsigned int time_slice; // Time slice, 100ms, unsigned short on_rq; unsigned short on_list; struct sched_rt_entity *back; // Temporarily used for connecting RT scheduling entities from top to bottom #ifdef CONFIG_RT_GROUP_SCHED struct sched_rt_entity *parent; // Points to the parent RT scheduling entity /* rq on which this entity is (to be) queued: */ struct rt_rq *rt_rq; // RT scheduling entity's real-time run queue, being scheduled /* rq "owned" by this entity/group: */ struct rt_rq *my_q; // RT scheduling entity's owned real-time run queue, used to manage sub-tasks or sub-groups #endif} __randomize_layout; /* Real-Time classes' related field in a runqueue: */struct rt_rq { struct rt_prio_array active; // Priority queue, a linked list of 100 priorities, and defines a bitmap for quick querying unsigned int rt_nr_running; // Number of active tasks in the RT run queue unsigned int rr_nr_running; #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED struct { int curr; /* highest queued rt task prio */ // Current highest priority of RT task #ifdef CONFIG_SMP int next; /* next highest */ // Priority of the next RT task to run, if two tasks have the highest priority, then curr == next #endif } highest_prio; #endif #ifdef CONFIG_SMP unsigned long rt_nr_migratory; // This value increases or decreases when the task is not bound to a specific CPU, used for task migration unsigned long rt_nr_total; // Used for overload check int overloaded; // If the RT run queue is overloaded, tasks will be pushed to other CPUs struct plist_head pushable_tasks; // Priority list, used to push overloaded tasks #endif /* CONFIG_SMP */ int rt_queued; // Indicates that the RT run queue has been added to rq queue int rt_throttled; // Used for throttling operations u64 rt_time; // Accumulated run time, exceeds local rt_runtime, then throttling occurs u64 rt_runtime; // Allocated run time for local pool /* Nests inside the rq lock: */ raw_spinlock_t rt_runtime_lock; #ifdef CONFIG_RT_GROUP_SCHED unsigned long rt_nr_boosted; // Used to solve priority inversion issues struct rq *rq; // Points to the run queue struct task_group *tg; // Points to the task group #endif};struct rt_bandwidth { /* nests inside the rq lock: */ raw_spinlock_t rt_runtime_lock; ktime_t rt_period; // Time period u64 rt_runtime; // Run time within a time period, exceeds limits, default value is 0.95ms struct hrtimer rt_period_timer; // Time period timer unsigned int rt_period_active;};
3. Process Analysis
3.1 Runtime Statistics
The update of runtime statistics is completed in the update_curr_rt
function:
-
The update_curr_rt
function mainly includes two parts:
-
Statistics update processing of runtime; -
If the runtime exceeds the allocated time, perform time balancing processing and determine whether throttling is needed. If throttling is performed, the RT queue will be dequeued and rescheduled;
To understand it more intuitively, let’s look at two more images:
The update of sched_rt_avg_update
is illustrated as follows:
-
rq->age_stamp
: Set the starting time when the run queue runs for the first time after the CPU starts, and subsequently updated periodically; -
rt_avg
: Accumulated RT average runtime, halved every 0.5 seconds, used to calculate the percentage of time used by RT in CFS load balancing;
3.2 Group Scheduling
The RT scheduler
is basically similar to the CFS scheduler
in group scheduling. For group scheduling of the CFS scheduler
, please refer to (IV) Linux Process Scheduling - Group Scheduling and Bandwidth Control
.
Let’s take a look at the organizational structure diagram of the RT scheduler
group scheduling:
-
The system allocates an RT run queue and an RT scheduling entity for each CPU. The task group participates in scheduling through the included RT scheduling entity; -
The RT queue of the task group task_group
is used to store tasks or sub-task groups belonging to that group, thus forming a cascading structure;
Let’s take a look at the actual organizational diagram:
3.3 Bandwidth Control
Please refer to (IV) Linux Process Scheduling - Group Scheduling and Bandwidth Control
.
In bandwidth control, the scheduling time period for the RT scheduler
is set to 1 second, and the runtime is set to 0.95 seconds:
/* * period over which we measure -rt task CPU usage in us. * default: 1s */unsigned int sysctl_sched_rt_period = 1000000;/* * part of the period that we allow rt tasks to run in us. * default: 0.95s */int sysctl_sched_rt_runtime = 950000;
These two values can be set in user mode through /sys/fs/cgroup/cpu/rt_runtime_us
and /sys/fs/cgroup/cpu/rt_period_us
.
Let’s look at the function call flow:
-
The init_rt_bandwidth
function is called when creating and allocating RT task groups, and its job is to initialize the related fields of thert_bandwidth
structure: setting the time periodrt_period
and runtime limitrt_runtime
, both set to default values; -
These can be set from user mode by operating the corresponding nodes under /sys/fs/cgroup/cpu
forrt_period
andrt_runtime
; the final function called istg_set_rt_bandwidth
, which traverses the task group from bottom to top to set the time period and limit runtime; -
When enqueue_rt_entity
queues the RT scheduling entity, it ultimately triggers the execution of thestart_rt_bandwidth
function. When the high-precision timer expires, it calls thedo_sched_rt_period_timer
function; -
The do_sched_rt_period_timer
function checks the relationship between the accumulated runtimert_time
of the RT run queue and the set limit runtimert_runtime
to determine whether throttling is needed. If throttling has already been performed in this function, it will callbalance_time
to balance time across multiple CPUs. In simple terms, this means taking some time from the rt_rq queue of other CPUs and adding it to the current CPU’s rt_rq queue, thus increasing the limit runtimert_runtime
of the current RT run queue while decreasing the limit runtime of other CPUs’ rt_rq run queues.
Let’s look at an illustrative effect diagram:
3.4 Scheduler Function Analysis
Let’s look at a diagram from earlier:
Now let’s check the code for the RT scheduler instance:
const struct sched_class rt_sched_class = { .next = &fair_sched_class, .enqueue_task = enqueue_task_rt, .dequeue_task = dequeue_task_rt, .yield_task = yield_task_rt, .check_preempt_curr = check_preempt_curr_rt, .pick_next_task = pick_next_task_rt, .put_prev_task = put_prev_task_rt, #ifdef CONFIG_SMP .select_task_rq = select_task_rq_rt, .set_cpus_allowed = set_cpus_allowed_common, .rq_online = rq_online_rt, .rq_offline = rq_offline_rt, .task_woken = task_woken_rt, .switched_from = switched_from_rt, #endif .set_curr_task = set_curr_task_rt, .task_tick = task_tick_rt, .get_rr_interval = get_rr_interval_rt, .prio_changed = prio_changed_rt, .switched_to = switched_to_rt, .update_curr = update_curr_rt,};
3.4.1 pick_next_task_rt
The pick_next_task_rt
function is used by the scheduler to select the next task to execute. The process is as follows:
-
Unlike the CFS scheduler
, theRT scheduler
processes tasks in adomain
consisting of multiple CPUs usingpull/push
methods. This means that if there are no high-priority tasks in the current CPU’s run queue, it will consider finding a higher-priority task from another CPU’s run queue to execute, ensuring that tasks are processed according to priority. Additionally, the current CPU will push tasks to other lower-priority CPU run queues. -
The processing logic of _pick_next_task_rt
is relatively simple. If the real-time scheduling entity is atask
, it directly looks for the highest priority task in the bitmap of the priority queue. If the real-time scheduling entity is atask_group
, it continues to traverse downwards;
Regarding task pull/push
, Linux provides struct plist
, which is a priority-based doubly linked list, with the task organization relationship illustrated in the following diagram:
The general illustration of pull_rt_task
is as follows:
-
If the priority tasks on the current CPU are not high, find a higher-priority task from another CPU’s pushable_tasks
list to execute;
3.4.2 enqueue_task_rt/dequeue_task_rt
When RT tasks are dequeued and enqueued, the two interfaces enqueue_task_rt/dequeue_task_rt
are used to complete the process. The calling process is as follows:
-
enqueue_task_rt
anddequeue_task_rt
will both call thedequeue_rt_stack
interface. When the requestedrt_se
corresponds to a task group, it will dequeue the scheduling entity from the top to the requestedrt_se
; -
When tasks are added to the RT run queue, if multiple tasks can be allocated to multiple CPUs, the overload is set for task migration;
Feeling a bit tired, let’s wrap it up.