10. Memory Ordering
Older ARM processors executed all instructions in program order, completing each instruction fully before starting the next.
Newer processors apply various optimizations to the order in which instructions execute and the way memory is accessed. As we have seen, the core can execute instructions much faster than external memory can respond; caches and write buffers are used to partially hide the latency caused by this speed difference. One side effect of this is the reordering of memory accesses: the order in which the core executes load and store instructions may not match the order observed by external devices.
Consider three memory-access instructions in program order. The first performs a write to external memory which, in this example, enters the write buffer (access 1). It is followed by two reads: one that misses in the cache (access 2) and one that hits in the cache (access 3). Both reads can complete before the write of access 1 drains from the write buffer. Cache hit-under-miss behavior means that a load that hits in the cache (access 3) can complete before a load that missed earlier in the program (access 2).
The hardware can still preserve the illusion that instructions execute in the order you wrote them, and generally you only need to worry about these effects in a small number of cases. For instance, if you are modifying CP15 registers, or copying or otherwise changing code in memory, you may need to explicitly make the core wait for such operations to complete.
For high-performance cores that gain additional speed from speculative data accesses, out-of-order execution, and cache coherency protocols, the potential for reordering is greater. In a single-core system the effects of such reordering are generally invisible to you: the hardware handles the many potential hazards, ensuring that data dependencies are respected and that reads return the correct value, allowing for any modification made by an earlier write.
However, in cases where multiple cores communicate through shared memory (or otherwise share data), memory ordering issues become more critical. Generally, the exact memory ordering you are most concerned about is at the points where multiple execution threads must synchronize.
Processors that implement the ARMv7-A architecture use a weakly ordered memory model, meaning that the order of memory accesses is not required to match the program order of the load and store operations. The model permits reordering of memory reads (such as LDR, LDM, and LDRD instructions) with respect to each other, to stores, and to certain other instructions. Reads and writes to Normal memory can be reordered by the hardware, constrained only by data dependencies and explicit memory barrier instructions. Where stronger ordering rules are required, this is communicated to the core through the memory type attributes in the translation table entries that describe the memory. Enforcing ordering rules on the core limits the hardware optimizations it can apply, so it reduces performance and increases power consumption.
10.1 ARM Memory Ordering Model
Cortex-A series processors adopt a weakly ordered memory model. However, within this model, specific memory regions can be marked as strongly ordered. In this case, memory transactions are guaranteed to occur in the order they are issued.
Three mutually exclusive memory types are defined. All memory regions are configured as one of these three types:
- Strongly-ordered
- Device
- Normal
Additionally, Normal memory can be marked as shareable (accessible by other agents) or non-shareable, and its inner and outer cacheability attributes can be specified.
Consider two operations, A1 and A2, that access different addresses, with A1 occurring before A2 in program order. Whether the accesses may be issued out of order depends on the memory types involved: accesses to Normal memory may be reordered with respect to other accesses (subject to data dependencies and barriers), whereas accesses to Device and Strongly-ordered memory are not reordered with respect to each other.
10.1.1 Strongly-Ordered and Device Memory
Accesses to strongly ordered and device memory have the same memory ordering model. The access rules for this memory are as follows:
- The number and size of accesses are preserved. Accesses are atomic and will not be interrupted part-way through.
- Both read and write accesses can have side effects on the system. Accesses are never cached, and speculative accesses are never performed.
- Accesses cannot be unaligned.
- The order of accesses to Device memory is guaranteed to correspond to the program order of the instructions that access it. This guarantee applies only to accesses within the same peripheral or block of memory; the size of such a block is implementation defined, but is at least 1KB.
- In the ARMv7 architecture, the core may reorder Normal memory accesses around accesses to Strongly-ordered or Device memory.
The only difference between device memory and strongly ordered memory is:
- A write to Strongly-ordered memory completes only when it reaches the peripheral or memory component being accessed.
- A write to Device memory is permitted to complete before it reaches the peripheral or memory component being accessed.
System peripherals are almost always mapped as device memory.
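Because a write to Device memory is allowed to complete before it reaches the peripheral, code whose next step depends on the side effect of that write must wait for it explicitly. A minimal sketch, assuming a hypothetical memory-mapped peripheral at UART_BASE with a control register at offset 4:

    LDR  R1, =UART_BASE        @ hypothetical peripheral base (Device memory)
    MOV  R0, #0
    STR  R0, [R1, #4]          @ write the (assumed) control register
    DSB                        @ wait here until the write has reached the
                               @ device before relying on its side effect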
Regions of Device memory can also be given shareable attributes. On some ARMv6 processors, the shareable attribute of a Device access was used to select the memory interface used for the access: accesses to non-shareable Device memory used a dedicated interface, the private peripheral port. This mechanism is not used on ARMv7 processors.
Note: These memory ordering rules only provide guarantees regarding explicit memory accesses (those caused by load and store instructions). The architecture does not provide similar guarantees regarding the ordering of instruction fetch or translation table traversals with respect to such explicit memory accesses.
10.1.2 Normal Memory
Normal memory is used to describe most regions of the memory system. All ROM and RAM devices are considered normal memory.
Normal memory has the following properties:
- The core can repeat read accesses and certain write accesses.
- The core can prefetch or speculatively access additional memory locations, with no side effects (provided the MMU access permissions allow it). The core does not, however, perform speculative writes.
- Unaligned accesses are permitted.
- The core hardware can merge multiple accesses into a smaller number of larger accesses; for example, multiple byte writes can be merged into a single double-word write.
Normal memory regions must also have cacheability attributes. The ARM architecture supports two levels of cacheability attributes for Normal memory: inner and outer. The mapping of these attributes onto the physical cache levels is implementation defined. Inner refers to the innermost caches, and always includes the core’s level 1 cache. An implementation might have no outer cache at all, or it might apply the outer cacheability attributes to a level 2 or level 3 cache. For instance, in a system containing a Cortex-A9 processor and the L2C-310 level 2 cache controller, the L2C-310 is regarded as an outer cache. The level 2 cache of the Cortex-A8 can be configured to use either inner or outer cache policies.
Shareability: Normal memory must also be specified as shareable or non-shareable. A non-shareable Normal region is one used only by that core: the core is not required to make its accesses to that location coherent with other cores. If other cores do share the memory, any coherency issues must be handled in software, for example by having each core perform cache maintenance and barrier operations.
The shareable attributes enable the description of systems with multiple levels of coherency control. For example, an inner shareable domain might consist of a Cortex-A15 cluster and a Cortex-A7 cluster. Within a cluster, the data caches of the cores are coherent for all data accesses marked inner shareable. An outer shareable domain might comprise this group of clusters together with a multi-core graphics processor. An outer shareable domain can contain multiple inner shareable domains, but an inner shareable domain can belong to only one outer shareable domain.
Regions marked with the shareable attribute are ones that can be accessed by other agents in the system. Accesses by other processors within the same shareability domain to memory in such a region are coherent, so you do not need to worry about the effects of the data caches. Without the shareable attribute, you would have to maintain coherency yourself, because the caches are not kept coherent for you.
The ARMv7 architecture allows you to specify shareable memory as inner shareable or outer shareable (the latter means the location is both inner and outer shareable).
10.2 Memory Barriers
A memory barrier is an instruction that requires the core to apply an ordering constraint between memory operations that occur before and after the barrier instruction in the program: operations issued before the barrier are guaranteed to be observed before operations issued after it. In other architectures, such instructions are sometimes called “memory fences.”
The term “memory barrier” can also refer to a compiler mechanism that prevents the compiler from reordering memory access instructions across the barrier when optimizing. For example, in GCC you can use the inline assembly memory clobber to indicate that the instruction changes memory, so that the optimizer may not reorder memory accesses across it. The syntax is as follows:
asm volatile("" ::: "memory");
ARM’s RVCT includes a similar intrinsic called __schedule_barrier().
However, here we are discussing hardware memory barriers provided by dedicated ARM assembly language instructions. As we have seen, optimizations such as caching, write buffers, and out-of-order execution may lead to the execution order of memory operations being inconsistent with the order in the code. Typically, this reordering is invisible to you. Application developers generally do not need to worry about memory barrier issues. However, in some cases, such as in device drivers or when you have multiple data observers, you may need to deal with these ordering issues to ensure data synchronization.
The ARM architecture specifies several memory barrier instructions that allow you to force the core to wait for memory accesses to complete. These instructions are available in both ARM and Thumb code, and in both user and privileged modes. In earlier versions of the architecture, these barriers were performed using CP15 operations in ARM code; those operations are now deprecated, but are retained for compatibility.
Let us begin by describing the practical effects of these instructions in a single-core system. This is a simplified version described in the ARM Architecture Reference Manual, aimed at introducing the use of these instructions. The term “explicit access” refers to data accesses caused by load or store instructions in the program, excluding instruction fetches.
Data Synchronization Barrier (DSB)
This instruction forces the core to wait for all pending explicit data accesses to complete before any further instructions can execute. It has no effect on instruction prefetching.
Data Memory Barrier (DMB)
This instruction ensures that all memory accesses in program order before the barrier are observed in the system before any explicit memory accesses that appear in program order after the barrier. It does not affect the ordering of other instructions executing on the core, or of instruction fetches.
Instruction Synchronization Barrier (ISB)
This instruction flushes the core’s pipeline and prefetch buffers, so that all instructions following the ISB are fetched, from cache or memory, only after the ISB completes. It ensures that the effects of context-changing operations executed before the ISB, such as CP15 writes, ASID changes, or TLB and branch-predictor maintenance operations, are visible to instructions fetched after the ISB. It does not by itself synchronize the data and instruction caches, but it is a required part of code sequences that do.
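For example, a common place an ISB is required is after a write to a CP15 register whose effect must be visible to the instructions that follow. A minimal sketch (enabling the instruction cache via the SCTLR I bit; the specific bit manipulated here is illustrative):

    MRC  p15, 0, R0, c1, c0, 0    @ read SCTLR (system control register)
    ORR  R0, R0, #(1 << 12)       @ set the I bit (instruction cache enable)
    MCR  p15, 0, R0, c1, c0, 0    @ write SCTLR
    ISB                           @ ensure instructions fetched after this
                                  @ point see the context change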
Several options can be specified with the DMB or DSB instructions, indicating the access type and the shareability domain to which the barrier applies (a usage sketch follows this list):
- SY: The default; the barrier applies to the full system, including all cores and peripherals.
- ST: A barrier that waits only for store operations to complete.
- ISH: A barrier that applies only to the inner shareable domain.
- ISHST: Combines ST and ISH; a barrier that waits only for stores, and applies only to the inner shareable domain.
- NSH: A barrier that applies only out to the point of unification (PoU).
- NSHST: A barrier that waits only for stores to complete, and applies only out to the point of unification.
- OSH: A barrier that applies only to the outer shareable domain.
- OSHST: A barrier that waits only for stores to complete, and applies only to the outer shareable domain.
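As an illustration, a producer whose only requirement is that two stores to Normal, inner shareable memory are observed in order by the other cores of its cluster could use the cheaper store-only, inner shareable variant. A sketch with illustrative register assignments:

    STR  R5, [R1]        @ store the data
    DMB  ISHST           @ order the two stores within the inner shareable domain
    STR  R0, [R2]        @ store the flag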
To understand these operations better, you need the broader definitions of the DMB and DSB operations that apply in multi-core systems. In the text that follows, the term “processor (or agent)” does not necessarily mean a core; it can also refer to a DSP, DMA controller, hardware accelerator, or any block that accesses shared memory.
The effect of the DMB instruction is to enforce the ordering of memory accesses within the shareability domain. All processors within the shareability domain are guaranteed to observe all explicit memory accesses before the DMB instruction before they observe any of the explicit memory accesses after it.
The DSB instruction has the same effect as DMB, but in addition, it synchronizes memory accesses with the entire instruction stream, not just other memory accesses. This means that when DSB is issued, execution will pause until all outstanding explicit memory accesses are complete. Execution resumes normally once all outstanding reads are complete and the write buffer is empty.
It is easier to understand the effects of barriers through an example. Consider the case of a four-core Cortex-A9 cluster. The cluster forms a single internally shareable domain. When a core in the cluster executes a DMB instruction, that core will ensure that all data memory accesses in program order before the barrier complete before any explicit memory accesses after the barrier. This guarantees that all cores in the cluster see accesses on either side of the barrier in the same order. If using the DMB ISH variant, there is no guarantee that external observers (such as DMA controllers or DSPs) will also see the same order.
10.2.1 Memory Barrier Use Example
Consider two cores A and B, and two addresses (Addr1 and Addr2) in Normal memory, held in core registers. Each core executes two instructions, as in the sketch below. In this scenario there is no ordering requirement, and no order can be assumed for any of the transactions: the addresses Addr1 and Addr2 are independent, neither core is required to execute its load and store in program order, and neither core needs to be concerned with the activity of the other.
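A hedged reconstruction of the example (the register numbers are chosen to match the discussion of results below; Addr1 and Addr2 stand for the two independent addresses held in registers):

    @ Core A:
    STR  R0, [Addr1]     @ write a new value to Addr1
    LDR  R1, [Addr2]     @ read Addr2 into R1

    @ Core B:
    STR  R2, [Addr2]     @ write a new value to Addr2
    LDR  R3, [Addr1]     @ read Addr1 into R3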
Therefore, this code can have four legal results, meaning that core A’s register R1 and core B’s register R3 can ultimately obtain four different combinations of values from memory:
- A gets the old value, B gets the old value.
- A gets the old value, B gets the new value.
- A gets the new value, B gets the old value.
- A gets the new value, B gets the new value.
If a third core C is introduced, note that there is no requirement for it to observe the stores in the same order as either of the other cores. It is entirely permissible for A and B to see an old value in Addr1 and Addr2 while C sees the new values.
Now consider a scenario where core B waits for core A to set a flag before reading from memory, for example when core A passes a message to core B. Without barriers, this may not execute as expected: there is nothing to prevent core B from speculatively reading [Msg] before it reads [Flag]. This is Normal, weakly ordered memory, and the core has no knowledge of any dependency between the two locations. You must enforce the dependency explicitly by inserting memory barriers. In this example you actually need two: core A requires a DMB between its two stores, to ensure they are observed in the order you wrote them, and core B requires a DMB before its LDR R0, [Msg], to ensure the message is not read before the flag is set. A sketch of the resulting sequence follows.
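A hedged reconstruction of the corrected sequence, with both barriers in place (label and register choices are illustrative):

    @ Core A:
    STR  R0, [Msg]       @ write the message to the shared postbox
    DMB                  @ ensure the message is observed before the flag
    STR  R1, [Flag]      @ set the flag: message ready

    @ Core B:
    poll:
    LDR  R1, [Flag]      @ wait for the flag
    CMP  R1, #0
    BEQ  poll
    DMB                  @ do not read the message before seeing the flag
    LDR  R0, [Msg]       @ read the message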
10.2.2 Avoiding Deadlocks with a Barrier
Another situation that can cause deadlock if barriers are not used is when one core writes to an address and then polls for an acknowledgment value applied by another core or a peripheral. Without the multiprocessing extensions, the ARMv7 architecture does not strictly require that the store to [Addr] ever completes (it could sit in the write buffer while the memory system is busy servicing the flag reads), so the two parties may deadlock, each waiting for the other. Inserting a DSB after the store forces the core's store to be observed before it begins polling Flag. Cores that implement the multiprocessing extensions are required to complete their accesses in finite time (their write buffers must drain), so the barrier is not needed there. A sketch of the scenario, with the DSB in place, follows.
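A hedged sketch of the problem case, showing where the DSB prevents the deadlock (names and register assignments are illustrative):

    @ Core A:
    STR  R0, [Addr]      @ post a value for core B
    DSB                  @ without this, the store may sit in the write
    wait_ack:            @ buffer indefinitely while A polls below
    LDR  R1, [Flag]      @ wait for B's acknowledgment
    CMP  R1, #0
    BEQ  wait_ack

    @ Core B:
    wait_val:
    LDR  R0, [Addr]      @ wait for A's value
    CMP  R0, #0
    BEQ  wait_val
    STR  R1, [Flag]      @ acknowledge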
10.2.3 WFE and WFI Interaction with Barriers
The WFE (Wait For Event) and WFI (Wait For Interrupt) instructions allow you to stop execution and enter a low-power state. To ensure that all memory accesses before executing WFI or WFE are complete (and visible to other cores), you must insert a DSB instruction.
Another consideration involves using WFE and SEV (Send Event) in multi-processing systems. These instructions allow you to reduce power consumption associated with lock acquisition loops (spinlocks). A core attempting to acquire a mutex may find that another core already holds the lock. Rather than having the core repeatedly poll the lock, it can pause execution and enter a low-power state using the WFE instruction.
The core is woken when an interrupt or other asynchronous exception is detected, or when another core sends an event with the SEV (Send Event) instruction. A core holding a lock uses SEV after releasing the lock to wake other cores waiting in WFE state. For the purposes of memory barrier instructions, event signaling is not treated as an explicit memory access, so you must ensure that the memory update that releases the lock is actually visible to other processors before the SEV executes. This requires a DSB. DMB is not sufficient, because it only orders memory accesses relative to each other without synchronizing them against the instruction stream, whereas DSB prevents the SEV from executing until all preceding memory accesses have been observed by the other cores.
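A minimal sketch of the release side (assuming R0 holds the address of the lock variable and 0 means unlocked; the convention is illustrative):

    MOV  R1, #0
    STR  R1, [R0]        @ release the lock
    DSB                  @ ensure the release is visible to the other cores
    SEV                  @ only then wake any cores waiting in WFE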
10.2.4 Linux Use of Barriers
Barriers are required to enforce the order of memory operations. Generally, you do not need to understand or explicitly use memory barriers, as they are already included in the locking and scheduling primitives of the kernel. However, writers of device drivers or those who wish to understand kernel operations may find detailed descriptions useful.
Optimizations in the compiler and in the core microarchitecture allow instructions, and their associated memory operations, to be reordered. Sometimes, however, you need to enforce a particular order of memory operations. For instance, a write to a memory-mapped peripheral register may have side effects elsewhere in the system. Memory operations before or after it might appear safe to reorder, since they operate on different locations, but in some cases you want all outstanding memory operations to complete before the peripheral write occurs, or you want no further memory operations to start until the peripheral write has completed.
Linux provides several functions to achieve this, as follows:
- barrier(): indicates to the compiler that memory operations must not be reordered across this point. This controls only the compiler’s code generation and optimization, and has no effect on hardware reordering.
- Memory barrier functions that map to ARM barrier instructions, enforcing a specific hardware ordering. In a Linux kernel compiled with Cortex-A SMP support, the available barriers are:
— rmb(): a read memory barrier; ensures that any reads before the barrier complete before any reads after it.
— wmb(): a write memory barrier; ensures that any writes before the barrier complete before any writes after it.
— mb(): a full memory barrier; ensures that any memory accesses before the barrier complete before any memory accesses after it.
- SMP variants of the above, called smp_mb(), smp_rmb(), and smp_wmb(). These enforce the ordering of Normal cacheable memory between cores within the same cluster, for example the cores of a Cortex-A15 cluster. They can also be used with devices and even with Normal non-cacheable memory. When the kernel is compiled without CONFIG_SMP, each call to these functions expands to a barrier() statement. A rough sketch of typical expansions follows this list.
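As a rough illustration only (the exact expansions vary between kernel versions and configurations, so consult the kernel sources for your version), on an ARMv7 SMP kernel these primitives have commonly expanded to barrier instructions along the following lines:

    @ Illustrative mappings only; treat as an assumption, not a reference:
    @   mb()       ->  dsb         @ full barrier
    @   rmb()      ->  dsb
    @   wmb()      ->  dsb st      @ waits for stores only
    @   smp_mb()   ->  dmb ish     @ inner shareable domain only
    @   smp_rmb()  ->  dmb ish
    @   smp_wmb()  ->  dmb ishst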
All locking primitives provided by Linux include any required barriers. Memory barriers almost always need to be used in pairs; for more information, see the kernel’s memory-barriers documentation.
10.3 Cache Coherency Implications
Caching is mostly invisible to the application programmer. It can become visible, however, when memory locations are changed elsewhere in the system, or when memory updates made in application code must be made visible to other parts of the system.
A simple system containing a core and an external DMA device illustrates the possible problems. Coherency can fail in two ways: if the DMA reads from main memory while newer data is held in the core’s cache, the DMA reads stale data; conversely, if the DMA writes new data to main memory while stale data is held in the core’s cache, the core can continue to use the old data.
Therefore, dirty data in the core’s data cache must be cleaned explicitly before the DMA begins. Similarly, if the DMA is copying data for the core to read, you must make sure that no stale copies of the affected addresses remain in the core’s data cache: DMA writes to memory do not update the cache, so the core may need to clean or invalidate the affected memory region before initiating the DMA. And because all ARMv7-A processors can perform speculative memory accesses, it is also necessary to invalidate again after the DMA completes. An example clean loop is sketched below.
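For example, before starting a DMA transfer out of a buffer, the core might clean the buffer’s lines from its data cache to the point of coherency. A hedged sketch for privileged ARMv7 code (the 32-byte line size is an assumption; production code should query the cache geometry):

    @ R0 = buffer start address, R1 = buffer end address
    clean_loop:
    MCR  p15, 0, R0, c7, c10, 1   @ DCCMVAC: clean D-cache line by MVA to PoC
    ADD  R0, R0, #32              @ assumed cache-line size
    CMP  R0, R1
    BLO  clean_loop
    DSB                           @ ensure all cleans complete before the
                                  @ DMA transfer is started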
10.3.1 Issues with Copying Code
Boot code, kernel code, or a JIT compiler may copy a program from one location to another, or modify code in memory. There is no hardware mechanism that maintains coherency between the instruction cache and the data cache. You must clean the affected region of the data cache, so that the code you have written actually reaches main memory (or the point of unification), and invalidate the affected region of the instruction cache. If the core then intends to branch to the modified code, a specific code sequence, including barriers, is required, as sketched below.
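The usual per-line sequence cleans the data cache to the point of unification, invalidates the instruction cache and branch predictor, and synchronizes before the new code is executed. A hedged sketch for privileged ARMv7 code (R0 = address of a modified line; the loop over the whole region is omitted for brevity):

    MCR  p15, 0, R0, c7, c11, 1   @ DCCMVAU: clean D-cache line to PoU
    DSB                           @ ensure the clean has completed
    MCR  p15, 0, R0, c7, c5, 1    @ ICIMVAU: invalidate I-cache line to PoU
    MCR  p15, 0, R0, c7, c5, 6    @ BPIALL: invalidate branch predictors
    DSB                           @ ensure the invalidations have completed
    ISB                           @ flush the pipeline so that later fetches
                                  @ see the new code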
10.3.2 Compiler Re-Ordering Optimizations
It is important to understand that memory barrier instructions apply only to hardware reordering of memory accesses: inserting a hardware memory barrier instruction does not directly prevent the compiler from reordering operations. In C, the volatile type qualifier tells the compiler that a variable can be changed by something other than the code currently accessing it. It is often used for accessing memory-mapped I/O, allowing such devices to be accessed safely through a pointer to a volatile variable. The C standard does not provide rules about the use of volatile in multi-core systems. So, while you can be sure that volatile loads and stores occur in the order specified in the program, there is no such guarantee about their ordering relative to non-volatile loads or stores. This means that volatile does not provide a shortcut for implementing a mutex.