Concurrency Control in Linux Device Drivers (Part 1)
Note: The content of this article is excerpted from “Detailed Explanation of Linux Device Driver Development – Based on the Latest Linux 4.0 Kernel”. This series will record the key points of knowledge focused on while reading the book. Interested readers are recommended to read the original book.
7.1 Concurrency and Race Conditions
Concurrency refers to multiple execution units being executed simultaneously and in parallel, while access to shared resources (hardware resources and software global variables, static variables, etc.) by concurrent execution units can easily lead to race conditions.
In the Linux kernel, the main race conditions occur in the following situations:
1. Multiple CPUs in Symmetric Multiprocessing (SMP)
SMP is a tightly coupled, shared memory system model characterized by multiple CPUs using a common system bus, allowing access to shared peripherals and memory. In the case of SMP, race conditions may occur between processes on CPU0 and CPU1, between a process on CPU0 and an interrupt on CPU1, and between interrupts on CPU0 and CPU1.
2. Processes on a Single CPU and Preempting Processes
The Linux kernel supports preemptive scheduling, where a process executing in the kernel may exhaust its timeslice or be interrupted by a higher-priority process. The situation where a process accesses shared resources while being preempted is similar to multiple CPUs in SMP.
3. Interrupts (hard interrupts, soft interrupts, Tasklets, bottom halves) and Processes
Interrupts can interrupt a running process, and if the interrupt service routine accesses resources that the process is accessing, a race condition can occur. Additionally, interrupts can be interrupted by new higher-priority interrupts. Therefore, multiple interrupts can also lead to concurrency and race conditions. After Linux 2.6.35, nested interrupts were eliminated, and the new kernel does not nest interrupts by default. Apart from SMP, which is true parallelism, other cases are “macro parallelism, micro serial” on a single core, but the issues they cause are similar to those in SMP.
The way to solve race conditions is to ensure mutual exclusion access to shared resources. Mutual exclusion access means that when one execution unit accesses shared resources, other execution units are prohibited from accessing them. The code area that accesses shared resources is called the Critical Section, and critical sections need to be protected by some mutual exclusion mechanism. Interrupt masking, atomic operations, spinlocks, semaphores, and mutexes are mutual exclusion methods that can be used in Linux device drivers.
7.2 Compiler Reordering and Execution Reordering
To understand the locking mechanism of the Linux kernel, it is also necessary to understand the characteristics of compilers and processors. For the following code, the writer allocates a new <span>struct foo</span> structure and initializes its <span>a,b,c</span>, then assigns the structure’s address to the global <span>gp</span> pointer.
struct foo {
int a;
int b;
int c;
};
struct foo *gp = NULL;
...
p = kmalloc(sizeof(*p), GFP_KERNEL);
p->a = 1;
p->b = 2;
p->c = 3;
gp = p;
However, if the reader simply does the following, the program’s execution may not meet expectations:
p = gp;
if (p != NULL) {
do_something_with(p->a, p->b, p->c);
}
There are two possible reasons that could cause the program to fail: one possibility is compiler reordering, and the other possibility is execution reordering. Regarding compilation, the order of instructions generated from the C language sequence <span>p->a=1; p->b=2; p->c=3; gp=p;</span> may have the assignment instruction for <span>gp</span> occurring before the assignments to <span>a,b,c</span>. Modern high-performance compilers have the capability to optimize instruction order in the target code.
Compilers can reorder memory access instructions to reduce logically unnecessary memory accesses and improve cache hit rates and CPU <span>Load/Store</span> efficiency. Therefore, after enabling compiler optimization, it is normal to see that the generated assembly code does not strictly follow the logical order of the code.To solve the compiler reordering issue, barriers must be used. Barriers can be set in the code using <span>barrier()</span><span> to prevent compiler optimization. For the compiler, setting a barrier ensures that the statements before and after the barrier do not get reordered.</span>
#define barrier() __asm__ __volatile__("": : : "memory")
Regarding the issue of compiler reordering, the C language <span>volatile</span> keyword has a weak effect; it mainly just prevents the merging of memory access behaviors. For the C compiler, <span>volatile</span> indicates that besides the current execution thread, there may be other threads that can also change that memory, so its meaning is “volatile”. Additionally, <span>volatile</span> does not protect critical resources.
Compiler reordering is a behavior of the compiler, while execution reordering is a behavior of the processor at runtime. Execution reordering means that even if the order of compiled binary instructions is arranged according to <span>p->a=1; p->b=2; p->c=3; gp=p;</span>, during execution on the processor, later issued instructions may still be limited; this is the processor’s “Out-of-Order Execution” strategy.
Advanced CPUs can reorder memory access instructions based on their cache organization characteristics. Accessing consecutive addresses may be executed first because it has a higher cache hit rate. Some also allow non-blocking memory access, meaning that if a previous memory access instruction misses the cache and causes a long delay, subsequent memory access instructions can be executed first to fetch data from the cache. Therefore, even if the instructions appear to be in the correct order from an assembly perspective, their execution order is unpredictable. For example, ARM v6/v7 processors will optimize the following instruction order:
LDR r0, [r1]
STR r2, [r3]
Assuming the first <span>LDR</span> instruction causes a cache miss, the cache will fill the line and require more clock cycles to complete. The ARM v6/v7 processor will recognize the next instruction <span>STR</span> and will not wait for the first instruction <span>LDR</span> to complete, thus executing the <span>STR</span> instruction first instead of waiting for the <span>LDR</span> instruction to finish.
For most architectures, although each CPU executes out of order, this reordering is not visible for single-core program execution because a single CPU will wait when encountering a dependency point (where subsequent instructions depend on the execution results of previous instructions), so we do not perceive this out-of-order process. However, this waiting process at the dependency point is not visible to other cores in SMP processors. For example, if executed on CPU0:
while (f == 0);
print x;
And on CPU1:
x = 42;
f = 1;
At this point, it cannot be assumed that the value of <span>x</span> printed on CPU0 is necessarily <span>42</span>, because even if on CPU1 <span>f=1</span> is compiled after <span>x=42</span>, the execution may still complete before <span>x=42</span> is finished, so the value printed on CPU0 may not be <span>42</span>.
To solve the visibility issue of one core’s memory behavior to another core, processors introduce some memory barrier instructions, such as ARM processor barrier instructions including:
- • DMB (Data Memory Barrier): Ensures that all explicit memory accesses before the DMB instruction are completed before any explicit memory accesses after the DMB.
- • DSB (Data Synchronization Barrier): Waits for all instructions before the DSB instruction to complete (all explicit memory accesses before this instruction are completed, and all cache, branch prediction, and TLB maintenance operations before this instruction are completed).
- • ISB (Instruction Synchronization Barrier): Flushes the pipeline so that all instructions executed after the ISB are fetched from cache or memory.
The spinlocks, mutexes, and other mutual exclusion logic in the Linux kernel require the use of the above instructions: barrier instructions are called when requesting to acquire a lock; barrier instructions are also needed when unlocking. In the Linux kernel, read and write barriers are defined as <span>mb()</span><span>, read barriers </span><code><span>rmb()</span><span>, write barriers </span><code><span>wmb()</span><span>, and APIs for register read and write such as </span><code><span>__iormb()</span><span>, </span><code><span>__iowmb()</span><span>, etc. The difference between the APIs for reading and writing registers </span><code><span>readl_relaxed()</span><span> and </span><code><span>readl()</span><span>, </span><code><span>write_relaxed()</span><span> and </span><code><span>writel()</span><span> lies in the presence or absence of barriers.</span>
#define readb(c) ({ u8 __v = readb_relaxed(c); __iormb(); __v; })
#define readw(c) ({ u16 __v = readw_relaxed(c); __iormb(); __v; })
#define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(); __v; })
#define writeb(v,c) ({ __iowmb(); writeb_relaxed(v,c); })
#define writew(v,c) ({ __iowmb(); writew_relaxed(v,c); })
#define writel(v,c) ({ __iowmb(); writel_relaxed(v,c); })
If you write the start address, end address, and size of DMA using <span>writel_relaxed()</span><span>, you must call </span><code><span>writel()</span><span> to start the DMA.</span>
writel_relaxed(DMA_SRC_REG, src_addr);
writel_relaxed(DMA_DST_REG, dst_addr);
writel_relaxed(DMA_SIZE_REG, size);
writel (DMA_ENABLE, 1);
7.3 Interrupt Masking
A simple and effective method to avoid race conditions within a single CPU is to mask system interrupts before entering the critical section, but this is not recommended in driver programming, as drivers usually need to consider cross-platform characteristics and do not assume they are running on a single core. CPUs generally have the capability to mask and unmask interrupts, which can ensure that the currently executing kernel execution path is not preempted by interrupt handlers, preventing certain race conditions from occurring.
Specifically, interrupt masking will prevent concurrency between interrupts and processes, and since Linux kernel operations such as process scheduling rely on interrupts, concurrency between kernel preempted processes is also avoided.
Usage of interrupt masking:
local_irq_disable();// Disable interrupts
...
critical_section // Critical section
...
local_irq_enable(); // Enable interrupts
The underlying implementation principle is to prevent the CPU itself from responding to interrupts. For example, for ARM processors, the underlying implementation masks the <span>ARM CPSR</span><span> I bit:</span>
static inline void arch_local_irq_disable(void)
{
asm volatile(
" cpsid i @ arch_local_irq_disable"
:
:
: "memory", "cc");
}
Since many important operations in Linux, such as asynchronous I/O and process scheduling, rely on interrupts, interrupts are crucial for kernel operation. During the period when interrupts are masked, all interrupts cannot be processed, so long-term masking of interrupts is very dangerous and may lead to data loss or even system crashes. This requires that after masking interrupts, the current kernel execution path should quickly complete the critical section code.
<span>local_irq_disable()</span><span> and </span><code><span>local_irq_enable()</span><span> can only disable and enable interrupts for the current CPU, so they do not solve race conditions caused by SMP multi-CPU. Therefore, using interrupt masking alone is usually not a recommended method to avoid race conditions (in other words, using </span><code><span>local_irq_disable/enable()</span><span> in drivers usually indicates a </span><code><span>bug</span><span>), and it is suitable to be used in conjunction with the spinlocks that will be introduced later.</span><span> Unlike </span><code><span>local_irq_disable()</span><span>, </span><code><span>local_irq_save(flags)</span><span> not only disables interrupts but also saves the current CPU's interrupt flag information, while </span><code><span>local_irq_restore(flags)</span><span> performs the opposite operation of </span><code><span>local_irq_save</span><span>. For ARM processors, this actually saves and restores </span><code><span>CPSR</span><span>.</span><span> If you only want to disable the bottom half of interrupts, you should use </span><code><span>local_bh_disable()</span><span>, and the bottom half that is enabled by </span><code><span>local_bh_disable()</span><span> should call </span><code><span>local_bh_enable()</span><span>.</span>
7.4 Atomic Operations
Atomic operations can ensure that modifications to an integer data type are exclusive. The Linux kernel provides a series of functions to implement atomic operations in the kernel, which are divided into two categories, targeting bits and integer variables for atomic operations. Both bit and integer variable atomic operations depend on the underlying CPU’s atomic operations, so all these functions are closely related to the CPU architecture. For ARM processors, the underlying implementation uses <span>LDREX</span><span> and </span><code><span>STREX</span><span> instructions. For example, the implementation of </span><code><span>atomic_inc()</span><span> will call </span><code><span>atomic_add()</span><span>, which is as follows:</span>
static inline void atomic_add(int i, atomic_t *v) {
unsigned long tmp;
int result;
prefetchw(&v->counter);
__asm__ __volatile__("@ atomic_add\n"
"1: ldrex %0, [%3]\n"
" add %0, %0, %4\n"
" strex %1, %0, [%3]\n"
" teq %1, #0\n"
" bne 1b"
: "=&r" (result), "=&r" (tmp), "+Qo"(v->counter)
: "r" (&v->counter), "Ir" (i)
: "cc");
}
<span>LDREX</span> and <span>STREX</span> instructions are used in pairs, allowing the bus to monitor whether there are other entities accessing that address between <span>LDREX</span> and <span>STREX</span><span>. If there is concurrent access, when executing the </span><code><span>STREX</span> instruction, the first register’s value is set to 1 (Non-Exclusive Access) and the store operation fails; if there is no concurrent access, <span>STREX</span> sets 0 in the first register (Exclusive Access) and the store operation succeeds.
In this example, if two concurrent entities call <span>LDREX+STREX</span> simultaneously, as shown in Figure 7.6, at time T3, CPU0’s <span>STREX</span> will fail, while at time T4, CPU1’s <span>STREX</span> will succeed. Thus, only CPU1 executes successfully, and the failure of CPU0’s <span>STREX</span> will not satisfy the condition of the <span>teq%1, #0</span> judgment statement, so the failed CPU0 will re-enter <span>LDREX</span>.<span>LDREX</span> and <span>STREX</span> process applies not only to concurrency between multiple cores but also to concurrency within the same core.
Integer Atomic Operations
1. Set the value of an atomic variable
void atomic_set(atomic_t *v, int i); /* Set the atomic variable's value to i */ atomic_t v = ATOMIC_INIT(0); /* Define atomic variable v and initialize to 0 */
2. Get the value of an atomic variable
atomic_read(atomic_t *v);
3. Atomic variable increment/decrement
void atomic_add(int i, atomic_t *v); /* Increment atomic variable by i */
void atomic_sub(int i, atomic_t *v); /* Decrement atomic variable by i */
4. Atomic variable increment/decrement by 1
void atomic_inc(atomic_t *v); /* Increment atomic variable by 1 */
void atomic_dec(atomic_t *v); /* Decrement atomic variable by 1 */
5. Test and operate
int atomic_inc_and_test(atomic_t *v);
int atomic_dec_and_test(atomic_t *v);
int atomic_sub_and_test(int i, atomic_t *v);
The above operations perform increment, decrement, and subtraction on the atomic variable, and after the operation (note that there is no addition), test whether it is 0, returning true if it is, otherwise false.
6. Operate and return
int atomic_add_return(int i, atomic_t *v);
int atomic_sub_return(int i, atomic_t *v);
int atomic_inc_return(atomic_t *v);
int atomic_dec_return(atomic_t *v);
The above operations perform addition, subtraction, increment, and decrement on the atomic variable and return the new value.
Bit Atomic Operations
1. Set a bit
void set_bit(nr, void *addr);
The above operation sets the <span>nr</span> bit of the address <span>addr</span>, meaning to write the bit as 1.
2. Clear a bit
void clear_bit(nr, void *addr);
The above operation clears the <span>nr</span> bit of the address <span>addr</span>, meaning to write the bit as 0.
3. Change a bit
void change_bit(nr, void *addr);
The above operation toggles the <span>nr</span> bit of the address <span>addr</span>.
4. Test a bit
int test_bit(nr, void *addr);
The above operation returns the <span>nr</span> bit of the address <span>addr</span>.
5. Test and operate a bit
int test_and_set_bit(nr, void *addr);
int test_and_clear_bit(nr, void *addr);
int test_and_change_bit(nr, void *addr);
The above <span>test_and_xxx_bit(nr, void*addr)</span> operations are equivalent to executing <span>test_bit(nr, void*addr)</span> followed by <span>xxx_bit(nr, void*addr)</span>. Using atomic variables ensures that a device can only be opened by one process:
static atomic_t xxx_available = ATOMIC_INIT(1); /* Define atomic variable */
static int xxx_open(struct inode *inode, struct file *filp)
{
if (!atomic_dec_and_test(&xxx_available)) {
atomic_inc(&xxx_available);
return - EBUSY; /* Already opened */
}
return 0; /* Success */
}
static int xxx_release(struct inode *inode, struct file *filp)
{
atomic_inc(&xxx_available); /* Release device */
return 0;
}
The next part will share content on spinlocks, semaphores, mutexes, and more.