Typically, a line of code like a = a + 1 translates into three assembly instructions:
ldr x0, &a
add x0, x0, #1
str x0, &a
That is, (1) load the variable a from memory into register X0, (2) add 1 to X0, (3) write the value of X0 back to the memory location of a.
Because this takes three separate instructions, another CPU (or an interrupting thread) can access a in between them, so the final result may not be what is expected. For example, if two CPUs both read the old value of a before either writes back, one increment is lost.
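As a concrete illustration (a user-space sketch, not kernel code; the worker function is made up for this example), two threads incrementing a shared counter without atomicity usually lose updates:

#include <pthread.h>
#include <stdio.h>

static int a;                      /* shared, unprotected counter */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        a = a + 1;                 /* ldr/add/str: not atomic, increments can be lost */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2000000, but typically prints a smaller number. */
    printf("a = %d\n", a);
    return 0;
}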
To solve this problem, the Linux kernel provides APIs for operating on atomic variables (atomic_t). Some of the atomic operation APIs are listed below (a short usage sketch follows the list):
atomic_read(v)
atomic_add_return(i,v)
atomic_add(i,v)
atomic_inc(v)
atomic_add_unless(v,a,u)
atomic_inc_not_zero(v)
atomic_sub_return(i,v)
atomic_sub_and_test(i,v)
atomic_sub(i,v)
atomic_dec(v)
atomic_cmpxchg(v,old,new)
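As a rough usage sketch (kernel-style code, using only APIs from the list above; the refcount variable and functions here are hypothetical):

static atomic_t refcount = ATOMIC_INIT(1);   /* atomic integer, initialised to 1 */

void take_ref(void)
{
	atomic_inc(&refcount);                   /* atomically: refcount += 1 */
}

void drop_ref(void)
{
	/* atomically: refcount -= 1; returns true if the result is zero */
	if (atomic_sub_and_test(1, &refcount))
		pr_info("last reference dropped\n");
}

int current_refs(void)
{
	return atomic_read(&refcount);           /* read the current value */
}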
So how does the operating system, which is just software, guarantee that these operations are atomic? It still relies on the hardware. So what is the hardware mechanism?
At a lower level, the above API functions end up in the __lse_atomic_add_return##name macro shown below. The core of this code is the ldadd instruction, part of LSE (Large System Extensions), a feature added in Armv8.1.
(linux/arch/arm64/include/asm/atomic_lse.h)
/* Body of a macro in atomic_lse.h: ##name, #mb and cl are macro parameters
 * selecting the barrier variant (relaxed/acquire/release/full). */
static inline int __lse_atomic_add_return##name(int i, atomic_t *v)
{
	u32 tmp;

	asm volatile(
	__LSE_PREAMBLE
	/* tmp = v->counter; v->counter += i  -- done atomically by one instruction */
	"	ldadd" #mb "	%w[i], %w[tmp], %[v]\n"
	/* i = i + tmp, i.e. the new value, which is returned */
	"	add	%w[i], %w[i], %w[tmp]"
	: [i] "+r" (i), [v] "+Q" (v->counter), [tmp] "=&r" (tmp)
	: "r" (v)
	: cl);

	return i;
}
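The semantics of ldadd Ws, Wt, [Xn] are: atomically add Ws to the word at [Xn] and return the original value in Wt, all in a single instruction with no retry loop. As a minimal standalone sketch (not kernel code; the function name is made up, and it assumes a toolchain targeting -march=armv8.1-a or later):

/* Minimal sketch of an LSE add-and-return, assuming Armv8.1-A (LSE) support. */
static inline int lse_add_return(int i, int *counter)
{
	int old;

	asm volatile(
	/* old = *counter; *counter += i  -- atomic, acquire+release ordering */
	"	ldaddal	%w[i], %w[old], %[ctr]\n"
	: [old] "=&r" (old), [ctr] "+Q" (*counter)
	: [i] "r" (i)
	: "memory");

	return old + i;		/* return the new value, like atomic_add_return() */
}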
So what if the system does not have the LSE extension, i.e., it is Armv8.0? The fallback implementation is as follows, and the core of this code is the ldxr and stxr (load-exclusive / store-exclusive) instructions.
(linux/arch/arm64/include/asm/atomic_ll_sc.h)
/* Body of a macro in atomic_ll_sc.h: ##op, #asm_op and constraint are macro
 * parameters naming the operation (add, sub, ...). */
static inline void __ll_sc_atomic_##op(int i, atomic_t *v)
{
	unsigned long tmp;
	int result;

	asm volatile("// atomic_" #op "\n"
	__LL_SC_FALLBACK(
	"	prfm	pstl1strm, %2\n"		/* prefetch the cache line for store */
	"1:	ldxr	%w0, %2\n"			/* load-exclusive: result = v->counter */
	"	" #asm_op "	%w0, %w0, %w3\n"	/* result = result <op> i */
	"	stxr	%w1, %w0, %2\n"			/* store-exclusive: tmp = 0 on success */
	"	cbnz	%w1, 1b\n")			/* lost exclusivity: retry from label 1 */
	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
	: __stringify(constraint) "r" (i));
}
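The ##op and #asm_op tokens are macro parameters; the kernel instantiates this macro once per operation. For the add case the generated function is roughly equivalent to the following (a hand-expanded sketch, with __LL_SC_FALLBACK and __stringify already expanded):

static inline void __ll_sc_atomic_add(int i, atomic_t *v)
{
	unsigned long tmp;
	int result;

	asm volatile("// atomic_add\n"
	"	prfm	pstl1strm, %2\n"	/* prefetch for store */
	"1:	ldxr	%w0, %2\n"		/* result = v->counter (exclusive) */
	"	add	%w0, %w0, %w3\n"	/* result += i */
	"	stxr	%w1, %w0, %2\n"		/* try to write back; tmp = 1 on failure */
	"	cbnz	%w1, 1b\n"		/* another CPU intervened: retry */
	: "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
	: "Ir" (i));
}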
So how was this implemented before Armv8, for example on Armv7? The implementation is as follows, and the core of this code is the ldrex and strex instructions.
(linux/arch/arm/include/asm/atomic.h)
/* Body of a macro in the 32-bit Arm atomic.h: ##op and #asm_op are macro
 * parameters naming the operation. */
static inline void atomic_##op(int i, atomic_t *v)
{
	unsigned long tmp;
	int result;

	prefetchw(&v->counter);			/* prefetch the counter for writing */
	__asm__ __volatile__("@ atomic_" #op "\n"
	"1:	ldrex	%0, [%3]\n"		/* load-exclusive: result = v->counter */
	"	" #asm_op "	%0, %0, %4\n"	/* result = result <op> i */
	"	strex	%1, %0, [%3]\n"		/* store-exclusive: tmp = 0 on success */
	"	teq	%1, #0\n"		/* did the store succeed? */
	"	bne	1b"			/* no: another CPU intervened, retry */
	: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
	: "r" (&v->counter), "Ir" (i)
	: "cc");
}
Summary:
In the early days (Armv7 and earlier), atomic operations were implemented with the Arm exclusive-access mechanism, whose instructions are ldrex and strex. In Armv8 (AArch64), the exclusive-access instructions became ldxr and stxr. In a large system with many processors, however, contention is intense, and an exclusive load/store pair may have to retry many times before it succeeds, which hurts performance. To address this, Armv8.1 introduced dedicated atomic instructions such as ldadd and related instructions (LSE), which perform the whole read-modify-write in a single instruction.
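The same contrast is visible in user space. With the GCC/Clang __atomic built-ins, a plain Armv8.0 target typically compiles the call below into an ldxr/stxr retry loop, while building with -march=armv8.1-a (or with LSE atomics otherwise enabled) lets the compiler emit a single ldadd instead. A small sketch; exact code generation depends on the compiler and flags such as -moutline-atomics:

#include <stdio.h>

static int counter;

int main(void)
{
	/* Atomic counter += 1: ldxr/stxr loop on Armv8.0, ldadd on Armv8.1+ (LSE). */
	__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);

	printf("counter = %d\n", __atomic_load_n(&counter, __ATOMIC_RELAXED));
	return 0;
}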