Understanding ARM64 Architecture and Programming: Essential Cache Concepts

From season three of the video course on ARM64 architecture and programming: cache fundamentals (part 1)

Why should system software engineers deeply understand cache?

Cache is ubiquitous in a system, and as a systems programmer you cannot escape it. The diagram below shows a classic ARM64 system architecture: a Cortex-A72 cluster and a Cortex-A53 cluster form a big.LITTLE system, where each CPU core has its own L1 cache and each cluster shares an L2 cache, alongside a Mali GPU and DMA-capable peripherals.

[Figure: an ARM64 big.LITTLE system with Cortex-A72 and Cortex-A53 clusters, a Mali GPU, and DMA peripherals]

For system software engineers, the following questions often arise:

  1. What is the internal organization of the cache? Can you draw a layout diagram of a cache? What are set and way?

  2. What are the differences between direct-mapped, fully associative, and set associative caches? What are their pros and cons?

  3. How does the aliasing problem occur?

  4. How does the synonym problem occur?

  5. Will VIPT cause aliasing issues?

  6. What are inner shareability and outer shareability? How do they differ?

  7. What is PoU? What is PoC?

  8. What is cache coherence? What methods are used in the industry to solve cache coherence?

  9. How do you read the MESI state transition diagram?

  10. What is cache false sharing? How does it occur, and how can it be avoided?

  11. Why do DMA and cache have cache coherence issues?

  12. How should a network card driver manage the cache when receiving and sending data via DMA?

  13. For self-modifying code, how can we ensure consistency between data cache and instruction cache?

Therefore, cache is extremely important for us systems programmers.

Not understanding cache well or not fully grasping it can significantly impact system programming. Sometimes, a small change in a line of code can affect the performance of the entire system. Therefore, I recommend that system programmers systematically learn about cache.

This series of articles by Ben mainly stems from the third season of the video course “ARM64 Architecture and Programming” and will have three parts:

Part one: Introduces background knowledge related to cache, such as what cache is, the layout structure of cache, cache hierarchy, VIPT/PIPT/VIVT, aliasing and synonym issues, and cache strategies.

Part two: Mainly introduces the ARM-specific concepts of inner shareability and outer shareability, what PoU and PoC are, and the format of cache maintenance instructions; why cache coherence issues arise; the evolution of ARM’s solutions to cache coherence; and common industry solutions for cache coherence.

Part three: Mainly introduces the MESI protocol, how to read the MESI protocol state diagram, cache coherence issues between DMA and cache, consistency issues caused by self-modifying code, cache false sharing, and other problems.

Inner share and outer share

Inner and outer shareability are concepts defined by ARM. An important point to note is that only memory with the normal memory attribute can be configured as inner or outer shareable; device memory cannot have its shareability set.
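As a concrete (and hedged) illustration of where this attribute is configured, the sketch below shows the shareability (SH) field of a VMSAv8-64 stage-1 block/page descriptor, which occupies bits [9:8]; the macro and function names are mine, not from any particular kernel.

```c
#include <stdint.h>

/* Shareability (SH) field of a VMSAv8-64 stage-1 block/page
 * descriptor, bits [9:8]. Only normal memory uses these encodings;
 * 0b01 is reserved. */
#define PTE_SH_NON_SHAREABLE    (0ULL << 8)
#define PTE_SH_OUTER_SHAREABLE  (2ULL << 8)
#define PTE_SH_INNER_SHAREABLE  (3ULL << 8)

/* An SMP kernel typically maps normal cacheable memory as inner
 * shareable, so all cores in the inner domain stay coherent. */
static inline uint64_t with_inner_shareable(uint64_t pte)
{
    return (pte & ~(3ULL << 8)) | PTE_SH_INNER_SHAREABLE;
}
```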

How do we distinguish inner share from outer share? The ARM manual notes that different SoC designs draw the line differently, but there is a general rule: inner share usually covers the caches integrated into the CPU IP, including the L1 data caches and any L2 cache integrated with the CPU IP, while outer share covers caches attached to the bus, such as an external L3 cache.

The diagram below illustrates this clearly. The dashed line divides the system into two parts: the left side is inner shareable and the right side is outer shareable. The left side contains the caches integrated into the CPU IP; the cores at the top integrate both L1 and L2 caches, while the cores below integrate only an L1 cache. The area enclosed by the dashed line is therefore the inner shareable domain. On the right side of the dashed line, L2 or L3 caches are attached via the bus; this side is the outer shareable domain.

[Figure: inner shareable and outer shareable domains separated by a dashed line]

In the ARMv8.6 manual, there is a description of inner share and outer share in section B2.7.1.

[Figure: section B2.7.1 of the ARMv8.6 manual, describing inner and outer shareability]

In example B2-1, it states that in a system with two clusters:

  1. Each cluster has data cache and unified cache that are inner shareable.

  2. The two clusters together are outer shareable in this system: each cluster forms its own inner shareable domain, while the entire system forms a single outer shareable domain.

[Figure: Example B2-1 from the ARMv8.6 manual]

In the Programmer’s Guide, section 11.3, there is also a description of inner and outer shareability that is easier to digest.

[Figure: section 11.3 of the Programmer’s Guide, on inner and outer shareability]

PoU and PoC

Next, let’s introduce two important concepts defined by ARM: PoC, which stands for Point of Coherency, and PoU, which stands for Point of Unification.

What is PoU?

PoU: the point at which the instruction cache, data cache, and translation system (MMU/TLB) of a CPU see the same copy of a memory location.

  1. PoU for a PE means that the PE’s instruction cache, data cache, and MMU see the same copy. Most of the time, PoU is observed from the perspective of a single core.

  2. PoU for inner share means that all CPU cores within the inner share see the same copy.

Thus, PoU has two observation points: one is PoU for a PE (where PE, processing element, refers to a CPU core), and the other is PoU for inner share. Detailed descriptions of PoU can be found in section D4.4.7 of the ARMv8.6 manual.

[Figure: section D4.4.7 of the ARMv8.6 manual, describing PoU]

The example states that for self-modifying code, when the PoU lies within the inner shareable domain, the two simple instructions below (a data cache clean by VA to PoU, followed by an instruction cache invalidate by VA to PoU) are enough to keep the data cache and instruction cache consistent. If the PoU is not within the inner shareable domain, additional memory barrier instructions are needed to guarantee that consistency.
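For reference, here is a minimal sketch of that sequence as C with GCC-style inline assembly. It assumes the modified bytes fit in one cache line (a real implementation loops over every line in the modified range) and that the code runs at EL1, or at EL0 with SCTLR_EL1.UCI set:

```c
/* Make code newly written at 'va' visible to the instruction stream,
 * per the PoU discussion above. */
static inline void sync_icache_line(void *va)
{
    __asm__ volatile(
        "dc cvau, %0 \n"  /* clean the D-cache line by VA to PoU      */
        "dsb ish     \n"  /* wait for the clean to reach the PoU      */
        "ic ivau, %0 \n"  /* invalidate the I-cache line by VA to PoU */
        "dsb ish     \n"  /* wait for the invalidate to complete      */
        "isb         \n"  /* resynchronize this PE's fetch stream     */
        : : "r"(va) : "memory");
}
```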

What is PoC?

PoC: the point at which all observers in the system, such as DSPs, GPUs, CPUs, and DMA engines, see the same copy of a memory location.

[Figure: a single-master system and a multi-master system]

The diagram on the left shows a single master, which could be a CPU, a GPU, a DMA engine, or any device capable of accessing memory. In a single-master system, the point at which the instruction cache, data cache, TLB, and so on all see the same copy is the PoU, because from the perspective of that one master they are all the same copy. The diagram on the right has two masters; for master A and master B to see the same copy, cache coherence must be maintained between them, and this observation point is the PoC.


Differences between PoU and PoC

Let’s examine the differences between PoU and PoC. The biggest difference is that PoC is a system-wide concept tied to the system configuration: it includes every device capable of accessing memory, including CPUs, GPUs, and DMA engines, which are referred to as observers. PoU, in contrast, is a local concept. Note also that the system configuration can affect the range of the PoU. For example, a Cortex-A53 can be configured with or without an L2 cache, which may change the PoU’s range, because an important PoU observation point is PoU for inner share, and the inner share boundary depends on whether the cache is integrated into the CPU IP.

For instance, in the diagram below: on the left, without an integrated L2 cache and with no other masters, the PoU equals the PoC. On the right, where the core integrates an L2 cache, the core together with its L1 and L2 caches constitutes PoU for inner, while master A, master B, and system memory together constitute the PoC.

[Figure: PoU and PoC ranges with and without an integrated L2 cache]

Cache Maintenance Instructions

Next, let’s discuss cache maintenance. Previously, we talked extensively about the background knowledge of cache; now we will introduce cache maintenance, which refers to manual maintenance, i.e., software intervention in cache behavior.

In ARMv8, there are three types of cache management operations defined:

  1. Invalidate: Invalidate the entire cache or a specific cache line. Data in the cache will be discarded.

  2. Clean: Clean the entire cache or a specific cache line. Cache lines marked as dirty are written back to the next level of cache or to main memory, and then marked clean.

  3. Zero: In some situations, zeroing a region through the cache serves as a prefetch and acceleration mechanism. For example, when a program needs a large block of temporary memory, performing a zero operation during initialization makes the cache controller write the zeroed data directly into cache lines. If a program makes active use of the cache zero operation, it greatly reduces the consumption of internal system bus bandwidth, since the zeros need not travel to or from memory first (see the sketch below).
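To make the zero operation concrete, here is a hedged sketch built on the DC ZVA instruction. It assumes `buf` is aligned to the zeroing block size, `len` is a multiple of it, and DCZID_EL0.DZP reads as 0 (i.e., DC ZVA is permitted); the function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* DCZID_EL0.BS (bits [3:0]) is log2 of the DC ZVA block size in
 * 4-byte words, so the block size in bytes is 4 << BS. */
static inline size_t dc_zva_block_size(void)
{
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    return 4UL << (dczid & 0xf);
}

/* Zero 'len' bytes at 'buf' one block at a time; each DC ZVA
 * allocates and zeroes cache lines without reading memory first. */
static void zero_range_with_dc_zva(void *buf, size_t len)
{
    size_t blk = dc_zva_block_size();
    for (uintptr_t p = (uintptr_t)buf; p < (uintptr_t)buf + len; p += blk)
        __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
}
```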

Cache operations can specify different ranges:

  1. The entire cache.

  2. A specific virtual address.

  3. A specific cache line or set and way.

Additionally, the ARMv8 architecture supports up to seven levels of cache, L1 through L7. When performing an operation on a cache line, we need to know how far the operation must propagate. For this purpose, the ARMv8 architecture defines the following observation points between the processor and memory:

  1. PoC (Point of Coherency, global cache coherence perspective)

  2. PoU (Point of Unification, processor cache coherence perspective)

Cache Instruction Format

Now let’s discuss the format of cache maintenance instructions. ARMv8 provides two families of cache instructions: instruction cache maintenance instructions, abbreviated IC (instruction cache), and data cache maintenance instructions, abbreviated DC (data cache).

[Figure: format of the IC and DC cache maintenance instructions]

Each instruction consists of two parts: the operation, and Xt, a register that passes a parameter such as the virtual address to operate on. The operation itself is built from up to four fields:

  1. Function: what you want the instruction to do. We previously discussed the three maintenance functions: invalidate, clean, and zero. The first two can be combined into clean & invalidate, which first writes the cache line back to memory and then invalidates it.

  2. Type: VA means the operation targets a virtual address, SW means a set and way, and ALL means the entire cache.

  3. Point: the observation point of the operation, PoU or PoC; as discussed earlier, their ranges differ.

  4. IS: whether the operation extends to the inner shareable domain. PoU and inner shareable can be combined here.
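To make the naming scheme concrete, here is a small hedged sketch decomposing two real mnemonics along those four fields, assuming EL1 privilege:

```c
/* DC CIVAC = Data Cache + Clean & Invalidate + by VA + to PoC:
 * write the line holding 'va' back to the point of coherency,
 * then invalidate it. */
static inline void dc_civac(void *va)
{
    __asm__ volatile("dc civac, %0" : : "r"(va) : "memory");
}

/* IC IALLUIS = Instruction Cache + Invalidate + ALL + to PoU +
 * Inner Shareable: invalidate all instruction caches to PoU,
 * broadcast across the inner shareable domain; no Xt parameter. */
static inline void ic_ialluis(void)
{
    __asm__ volatile("ic ialluis" : : : "memory");
}
```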

The table below shows the cache management instructions supported in ARMv8, which I have briefly translated for you.

[Table: cache maintenance instructions supported in ARMv8 (translated)]

You can refer to table D4-3 in section D4.4.8 of the ARMv8.6 manual for the most authoritative information.

[Table: table D4-3 from section D4.4.8 of the ARMv8.6 manual]

Why is cache coherence important?

Why is cache coherence necessary, and how does the problem arise in the first place? To answer this, we need to look at the evolution from single-core CPUs to multi-core processors. Taking ARM as an example, the Cortex-A8 was a single-core processor, while the Cortex-A9 introduced multi-core designs. (ARM11 also had multi-core designs, but they were not very mature at the time.) In a multi-core processor, each core has its own L1 cache, and the cores may share an L2 cache. Suppose core0 and core1 each have a private L1 cache backed by memory. If core0 accesses an address first and loads its data into core0’s cache, what should core1 do when it wants that data? Read it from memory, or ask core0 for it? This is where the cache coherence problem arises.

[Figure: two cores with private L1 caches sharing memory]

Cache coherence concerns the consistency of the same data across multiple caches and memory. The primary methods for resolving cache coherence issues are bus snooping protocols, such as the MESI protocol. Therefore, a key point of this lesson is to introduce the MESI protocol.

Why should we, as system software engineers, care about cache coherence? Although the MESI protocol mentioned earlier is transparent to software, meaning it is entirely implemented in hardware, some scenarios require software intervention. Here are a few examples:

  1. Using DMA in drivers (data cache and memory inconsistency). This is a common example, especially for driver writers: when your device driver uses DMA, you must be particularly careful. Devices typically have a FIFO; when you need to move the device’s FIFO data into a DMA buffer in memory, how should you operate on your cache? Conversely, when you move data from the memory DMA buffer to the device’s FIFO via DMA, how should you handle the cache? (See the sketch after this list.)

  2. Self-modifying code (data in the data cache may be newer than in the instruction cache).

  3. Modifying page tables (data saved in the TLB may be outdated).
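Here is the sketch promised in item 1: bare-metal style routines for the two DMA directions, written against the ARMv8 instructions discussed above. It assumes EL1 privilege (DC IVAC is not usable at EL0) and derives the line size from CTR_EL0; the function names are mine.

```c
#include <stdint.h>
#include <stddef.h>

/* CTR_EL0.DminLine (bits [19:16]) is log2 of the smallest D-cache
 * line size in 4-byte words, so the line size is 4 << DminLine. */
static inline size_t dcache_line_size(void)
{
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    return 4UL << ((ctr >> 16) & 0xf);
}

/* Memory -> device (TX): clean the buffer to PoC before starting
 * DMA, so the device reads the CPU's latest data, not stale DRAM. */
static void clean_dcache_range(void *start, size_t len)
{
    size_t line = dcache_line_size();
    uintptr_t p = (uintptr_t)start & ~(line - 1);
    for (; p < (uintptr_t)start + len; p += line)
        __asm__ volatile("dc cvac, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb sy" ::: "memory"); /* wait for completion */
}

/* Device -> memory (RX): invalidate the buffer before the CPU reads
 * it, so stale cache lines do not mask the data DMA just wrote. */
static void invalidate_dcache_range(void *start, size_t len)
{
    size_t line = dcache_line_size();
    uintptr_t p = (uintptr_t)start & ~(line - 1);
    for (; p < (uintptr_t)start + len; p += line)
        __asm__ volatile("dc ivac, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb sy" ::: "memory"); /* wait for completion */
}
```

A real driver would also ensure the buffer does not share cache lines with unrelated data, since invalidation discards whole lines.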

The Evolution of ARM’s Cache Coherence

Let’s take a look at the evolution of cache coherence in ARM. This diagram is quite interesting.

[Figure: the evolution of cache coherence in ARM processors]

  1. In 2006, the Cortex-A8 processor was released. The Cortex-A8 is a single-core design, so it has no multi-core cache coherence problem, although coherence issues between DMA and the cache still exist.

  2. With the Cortex-A9, multi-core (MPCore) designs emerged. Multi-core designs require hardware to ensure cache coherence between cores, and the usual approach is to implement a MESI-like protocol.

  3. By the time of the Cortex-A15, the big.LITTLE architecture appeared, with two clusters, each containing multiple cores. The cores within a cluster rely on the MESI protocol for coherence, while between clusters a cache coherence controller implementing the AMBA coherency extensions is needed. This is the system-level cache coherence problem. ARM has done considerable work in this area and provides off-the-shelf IP such as the CCI-400 and CCI-500.

In a single-core CPU, as mentioned earlier, there is only one L1 cache and one L2 cache, and no second CPU accessing them, so single-core processors have no multi-core cache coherence problem. Note that “cache coherence” here means coherence between the cores of a multi-core processor; coherence issues between DMA and the CPU still exist. Furthermore, in a single-core system, the scope of cache maintenance instructions is limited to that single core.

Now consider multi-core processors, such as the Cortex-A9 MPCore. Here, hardware supports multi-core cache coherence by implementing the MESI protocol; in ARM manuals this hardware is generally called the Snoop Control Unit (SCU). Another difference from single-core systems is that in a multi-core system, cache maintenance instructions are broadcast to the other CPU cores.

The following three diagrams illustrate the situation: the first diagram shows a single-core processor with only one CPU core and a single cache, with no multi-core cache coherence issues. The second diagram shows a dual-core processor, where each core has its own cache, necessitating a hardware unit to resolve coherence issues between the two caches, typically referred to as the SCU hardware unit. The third diagram is more complex, consisting of two clusters, each with two cores. In one cluster, there is a hardware unit for cache coherence between cores. At the bottom, there is a cache coherence bus or cache coherence controller to ensure cache coherence between the two clusters.

[Figures: a single-core system; a dual-core system with an SCU; a dual-cluster system with a cache coherence bus]

System-Level Cache Coherence Issues

Now let’s look at system-level cache coherence. As ARM systems grow more complex, evolving from multi-core to multi-cluster architectures such as big.LITTLE, consider the diagram below, which depicts a typical big.LITTLE system. The little cores are Cortex-A53s and the big cores are Cortex-A72s. The two A53 cores form a cluster in which each CPU has its own independent L1 cache and the pair share an L2 cache; the cluster connects to the cache coherence bus via an ACE interface (AXI Coherency Extensions, defined in the AMBA 4 protocol). The big cluster likewise consists of two cores, each with its own L1 cache and a shared L2 cache, also connected through an ACE interface. Besides the CPUs, the system includes a GPU, such as ARM’s Mali, and other DMA-capable peripherals, all of which attach to the cache coherence bus through ACE interfaces. This bus is responsible for system-level cache coherence, which we will discuss in more detail later.

[Figure: big.LITTLE system connected to a cache coherence bus via ACE interfaces]

Solutions for Cache Coherence

Let’s briefly look at common solutions for cache coherence. Cache coherence requires that all CPUs and bus masters in the system, such as GPUs and DMA engines, observe consistent memory. For example, suppose a peripheral uses DMA, and your software generates data on the CPU that should be transferred to the peripheral via DMA. If the CPU and the DMA engine see different data, say the CPU’s freshly generated data still sits in the cache while DMA reads directly from memory, then DMA sees stale data and the transfer is inconsistent. Here the most recent data lives in the CPU’s cache, since the CPU is the producer responsible for generating the data.

Generally, there are three approaches to achieving cache coherence in systems:

First Approach: Disable Cache

The first approach is to disable the cache. This is the simplest method, but it severely hurts performance. In the earlier example, if the CPU generates data into a DMA buffer with the cache disabled, the CPU cannot use the cache at all while producing the data; every access goes to DDR, which significantly reduces performance and increases power consumption.

Second Approach: Software Management of Cache Coherence

The second approach is software-managed cache coherence. This is the most commonly used method: software must clean/flush dirty cache lines or invalidate stale data at the right moments, which demands particular care from driver engineers. The advantage is a simple hardware RTL implementation; the disadvantages are listed below (a sketch of how the Linux kernel packages these operations follows the list):

  1. Increased software complexity. Software needs to manually clean/flush cache or invalidate cache.

  2. Increased debugging difficulty. Cleaning and invalidating the cache must happen at exactly the right time; done incorrectly, your DMA may transfer wrong data, and this is hard to debug because the corruption surfaces at some random later point rather than at a crash, making the issue difficult to pinpoint. Data corruption bugs are among the hardest to locate.

  3. Reduced performance and increased power consumption. Some may wonder why software-managed cache maintenance costs performance and power. Flushing the cache takes time, because dirty lines must be written back to memory; in the worst case you may need to flush the entire cache. This raises memory traffic, reducing performance and increasing power consumption. Flushing the cache frequently is bad practice and significantly hurts performance.
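For perspective, the Linux kernel hides exactly these clean/invalidate decisions behind its streaming DMA-mapping API. Below is a hedged sketch of a TX path; the device and buffer are assumed to come from the surrounding driver. The RX path is symmetric, using DMA_FROM_DEVICE, which corresponds to the invalidate step described above.

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* TX: hand a CPU-filled buffer to the device. With DMA_TO_DEVICE,
 * dma_map_single() performs any needed cache clean on non-coherent
 * systems before the device's DMA engine reads the buffer. */
static int send_buffer_to_device(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the device's DMA engine with 'handle' and wait ... */

    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
    return 0;
}
```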

Third Approach: Hardware Management of Cache Coherence

The third approach is hardware-managed cache coherence, which is transparent to software. For multi-core coherence, the usual method is to implement the MESI protocol in hardware as a snoop control unit. For system-level coherence, a coherent bus protocol is needed: in 2011, ARM introduced the AXI Coherency Extensions (ACE) in the AMBA 4 protocol for this purpose. The ACE interface handles cache coherence between clusters, while the ACE-Lite interface handles coherence for I/O devices such as DMA engines and GPUs.

Exciting Content in the Next Episode

In the next episode, we will focus on introducing the MESI protocol state diagram and teach you how to interpret the MESI state diagram. We will also explain how understanding the MESI state diagram is useful in our practical work. The following cases illustrate this best:

  1. Cache false sharing

  2. Cache coherence issues with DMA

  3. Self-modifying code issues

See you in the next episode! If Ben’s articles leave you wanting more, feel free to subscribe to the third season of the flagship video course “ARM64 Architecture and Programming”, where Ben will guide you through the Raspberry Pi, hands-on experiments, and steady improvement!

ARM64 Video Course

The job market is tough. Those of us who, like me, graduated from second-tier universities must work hard; otherwise we will go hungry, because all the good food is taken by classmates from top-tier universities who are both talented and diligent.

Come on, give yourself a reason to keep learning and improving. Join Ben to learn ARM64, play with the Raspberry Pi, run experiments, and make progress together! Globally original ARM64 experiments and the world’s first hands-on video course interpreting the ARMv8 manual: you deserve it!

Click “Read the original text” to enter the WeChat store to subscribe!
