While writing the articles on virtualization, I touched on the memory model and went back and forth on whether to summarize the storage side systematically, since it is both important and interesting. In the end, I decided to open a dedicated series on the ARM architecture. The ARM architecture is vast and complex, covering both software and hardware, so we will work through it slowly, and I hope everyone will share feedback. Where should we start? Let's get the book-signing and sales pitch out of the way, haha, and first talk about cache. Drawing lessons from earlier articles, I will try to keep each installment compact so it is easier to read. The Cache series will be divided into the following parts:
(1) The basic architecture of Cache, introducing the background of Cache’s birth, the basic structure of Cache, and related terminology.
(2) The structure of Cache, detailing how the cache is organized and introducing related design schemes such as VIPT, PIPT, etc.
(3) Cache strategies, introducing some related strategies in the use of Cache, such as how to replace data in Cache, etc.
(4) Cache consistency issues, for example, how CPU cores share data and maintain consistency, and how to keep data consistent in Cache and devices (DMA), etc.
In computer systems, memory is the electronic component that temporarily or permanently stores instructions and data. Before discussing cache, let's first survey the memory devices around the SoC to give everyone an overall picture.
From the perspective of the CPU, memory can be divided into five levels, as shown in Figure 1-1.
Figure 1-1 Classification of Memory
Now let’s briefly introduce the various types of memory:
• Registers: Registers are more a part of the CPU itself than memory. They can hold only an extremely limited amount of information, but they are very fast, operating almost in lockstep with the CPU's clock cycle. The clock cycle is tied to the CPU's clock frequency; for a 2 GHz CPU, the cycle is 1/(2 GHz) = 0.5 ns. So they really are very fast.
• CPU Cache: Uses static random access memory (SRAM) chips. This is the main focus of today’s discussion, typically divided into three levels: L1, L2, and L3 in the ARM architecture. The access speed of Cache is also very fast; L1 usually takes 2 to 4 clock cycles, L2 usually takes 10 to 20 clock cycles, and L3 usually takes 20 to 60 clock cycles.
• Main memory (DRAM): Uses dynamic random access memory (DRAM) chips, which have higher density and larger capacity than SRAM and are also significantly cheaper than SRAM chips. However, they are slower, with access times of roughly 200 to 300 clock cycles (a small calculation after this list converts these cycle counts into nanoseconds).
• Hard Disk: Such as solid-state drives (SSD) and hard disk drives (HDD). Hard disks are well-known for their large storage capacity but the slowest access speed. Memory is 10 to 1000 times faster than SSDs and 100,000 times faster than HDDs.
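To make these cycle counts a bit more tangible, here is a tiny C snippet of my own (purely illustrative; the cycle counts are the rough figures quoted above and real latencies vary widely between implementations) that converts them into nanoseconds assuming a 2 GHz clock:

```c
#include <stdio.h>

/* Illustrative only: cycle counts are the rough figures from the text,
 * and the 2 GHz clock is an assumption. */
int main(void)
{
    const double freq_ghz = 2.0;                  /* assumed core clock */
    const double ns_per_cycle = 1.0 / freq_ghz;   /* 0.5 ns per cycle   */

    const struct { const char *name; int cycles; } levels[] = {
        { "L1 cache", 4   },
        { "L2 cache", 20  },
        { "L3 cache", 60  },
        { "DRAM",     300 },
    };

    for (int i = 0; i < 4; i++)
        printf("%-8s ~%3d cycles  ~%6.1f ns\n",
               levels[i].name, levels[i].cycles,
               levels[i].cycles * ns_per_cycle);
    return 0;
}
```

Even the worst case here, a few hundred nanoseconds for DRAM, is still orders of magnitude faster than a disk access.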
Next, we summarize the characteristics of memory with a diagram, as shown in Figure 1-2. Following the direction of the arrows downward, storage capacity grows, access speed drops, and cost falls. Each type of storage device only interacts with its adjacent levels. For example, the CPU Cache loads data from memory and writes data back to memory; it never writes directly to the hard disk or loads directly from the hard disk into the CPU Cache. Data is first loaded from disk into memory and only then from memory into the Cache.
Figure 1-2 Characteristics of Memory
Sugarcane is never sweet at both ends: the faster the memory, the more expensive it is, while slower memory is naturally cheaper. We can compare them with a diagram, as shown in Figure 1-3.
Figure 1-3 Comparison of Memory Costs
As we can see, cache is still very expensive. Here is an example to make this tangible: the cache configuration of a certain mass-produced 8×55-series processor. The L1 size is confidential, and L3 is 2 MB. Because of the big.LITTLE architecture, the L2 cache of each CPU core differs, as shown in Figure 1-4. From the figure, we can see that the stronger a core's computational ability, the larger the data and instruction throughput it requires, and therefore the larger the cache it needs. The following sections focus on cache in AArch64.
Figure 1-4 Mainstream AArch64 Chip L2 Cache Size
We spent some time introducing the storage system around the SoC. Now let's review the architecture of an SoC (for details, see the article on our site <[V-02] Virtualization Basics – CPU Architecture (Based on AArch64)>), as shown in Figure 1-5. It can be seen that the A-Profile (CPU), main memory (RAM), peripherals (Peripheral), etc., are connected together through the Interconnect (AMBA) bus.
Figure 1-5 Typical ARM-based SoC Architecture
Under the synchronization of the clock, all nodes on the bus work according to the agreed high and low level signalling. The A-Profile CPU runs at a very high clock frequency, with a cycle time of at most a few nanoseconds, while main memory and peripherals cannot reach such speeds: their access times run to hundreds of nanoseconds, and for some peripherals far longer, a gap of several orders of magnitude. In this situation, CPU performance is limited by the access speed of external memory and peripherals. Therefore, the CPU architecture needs a mechanism to ease this limitation and get the most out of the CPU. That mechanism is a memory unit inside the CPU that supports access at speeds close to the CPU's clock frequency: the cache. Here we quote a passage from the manual to state the concept of cache more precisely.
A cache is a small, fast block of memory that sits between the core and main memory. It holds copies of items in main memory. Accesses to the cache memory occur significantly faster than those to main memory. Whenever the core reads or writes a particular address, it first looks for it in the cache. If it finds the address in the cache, it uses the data in the cache, rather than performing an access to main memory.
The cache stores instructions and data that are copies of what is in memory: data is moved from slower external storage into this faster internal storage. With this design, the CPU core first searches the high-speed cache whenever it fetches an instruction or loads data; only on a miss does it go out to external memory. ARM processors adopt a multi-level cache mechanism, as shown in Figure 1-6.
Figure 1-6 ARM Multi-Level Cache Block Diagram
In ARM processors there are usually multiple levels of cache: L1, L2, and L3. The L1 cache is closest to the processor, with the smallest capacity but the fastest access; the L2 cache is larger but slightly slower; the L3 cache is usually the largest and the slowest of the three, and is typically shared among multiple processor cores. We have already seen that introducing a cache eases the memory-wall problem and improves CPU performance. Now we should ask: why introduce a multi-level cache mechanism? The short answer is that the multi-level design lets ARM processors strike a balance between high performance and low power consumption across different application scenarios, which can be elaborated in the following aspects:
(1) Reducing power consumption: With multiple levels, cache sizes grow gradually from level to level, and the processor can manage data access more intelligently. When it needs some data, it checks each level in turn for a copy; a hit in a small, nearby cache means the data is read directly from there, avoiding a search through the larger, more power-hungry structures and external memory, which reduces power consumption.
(2) Improving cache hit rate: The multi-level design lets each level of cache focus on holding a portion of the data. When the processor needs some data, it first tries the level closest to it (usually the L1 cache); on a miss it checks the next level, and so on, until the data is found or main memory is finally accessed (a minimal sketch of this probe order follows this list). This design improves the overall hit rate and further enhances system performance.
(3) Flexibility: The multi-level design allows different cache levels to have different characteristics, such as capacity, access latency, and replacement strategy. This lets ARM tune the cache design to specific application scenarios and needs to reach the best balance of performance and power, and, just as importantly, to control the cost of the chip itself.
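To make the probe order in point (2) concrete, here is a toy C sketch of my own. The "is it in this level?" checks are crude stand-ins (real hardware compares TAGs, as we will see shortly); the only thing the sketch shows is the order in which the levels are consulted:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model: each probe function is a stand-in for "does this level hold
 * the address?". The interesting part is only the order of the probes. */
typedef bool (*probe_fn)(unsigned long addr);

static bool l1_probe(unsigned long a) { return (a & 0xff) < 0x40; } /* pretend hit */
static bool l2_probe(unsigned long a) { return (a & 0xff) < 0x80; }
static bool l3_probe(unsigned long a) { return (a & 0xff) < 0xc0; }

static const char *served_from(unsigned long addr)
{
    const probe_fn levels[] = { l1_probe, l2_probe, l3_probe };
    const char *names[]     = { "L1", "L2", "L3" };

    for (int i = 0; i < 3; i++)          /* closest (fastest) level first */
        if (levels[i](addr))
            return names[i];
    return "DRAM";                       /* missed at every cache level   */
}

int main(void)
{
    const unsigned long addrs[] = { 0x1000, 0x1050, 0x1090, 0x10f0 };
    for (int i = 0; i < 4; i++)
        printf("0x%lx served from %s\n", addrs[i], served_from(addrs[i]));
    return 0;
}
```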
Why focus on power consumption:
The power consumption of a chip is the energy it consumes during operation, usually measured in watts (W). As power consumption rises, the chip generates more heat and its temperature climbs; the higher temperature can in turn push power consumption up further, although at excessively high temperatures power consumption may fall again because of this "temperature feedback" effect. This affects not only hardware design, such as on-chip circuit design and the heat dissipation system, but also software design. Whether in mobile devices, car cockpits, or intelligent driving systems, keeping the chip stable, extending its lifespan, and controlling cost all require closely monitoring its power consumption and temperature and taking effective heat-dissipation measures.
Basic Structure and Terminology of Cache
We know that the cache holds copies of content from main memory, so naturally there must be a mapping that lets the cached data and the original data in memory find each other. Memory stores instructions and data byte by byte at encoded addresses, so the memory address is the natural link between the cache and the data in memory. With that background, let's look at the cache in ARM, as shown in Figure 1-7.
Figure 1-7 Basic Structure of ARM Cache
Looking at the structure of the cache, we find that data is organized quite differently in the cache than in memory. Data in the cache is grouped more tightly, and a set of new concepts was designed to describe this form of storage.
In the following text, unless otherwise specified, "data" in the cache refers to both the data and the instructions that come from memory.
These new concepts include: TAG, Line, Index, Set, Way, and Offset. Below, we will explain each one based on Figure 1-7.
(1) TAG:
The upper portion of the memory address encoding is called the TAG. In Figure 1-7, T-x denotes the individual TAGs. Note that the storage occupied by the TAGs is not counted in the cache size; for example, a CPU core's 256 KB L2 Cache does not include the area that stores the TAGs.
(2) Line (Cache Line):
To operate on data more efficiently, data in the cache is organized in blocks, and each block sits under one TAG. For example, in Figure 1-7, A-1 to A-4 hold cached data from memory and are organized as one block, known as a Cache Line. Likewise, A-5 to A-8 is a Cache Line, A-9 to A-12 is a Cache Line, and so on. Note that some bits must also be reserved to record the state of the data in the cache, such as whether it is still consistent with the data in main memory.
(3) Index:
The middle portion of the memory address encoding, used to index which line within a way the address falls into. In Figure 1-7, A-1, A-5, A-9, and A-13 are in different Cache Lines, distinguished by their Index.
(4) WAY:
There were no roads in the world to begin with; when enough people walk the same ground, a road forms. It is the same here: Cache Lines stacked together form a way. In Figure 1-7, the Cache Lines containing A-1, A-5, A-9, and A-13 together make up one way (Way). Note that all the ways in a cache must be the same size.
(5) Set:
Each way in Figure 1-7 has 4 Cache Lines, so the Index within each way ranges from 0 to 3 (0b00, 0b01, 0b10, 0b11), and lines with the same Index across the ways form a set. For example, A-13, B-13, C-13, and D-13 in ways A, B, C, and D all have Index 3 (0b11), so together they form one set.
(6) Offset:
The low-order bits of the memory address are called the Offset. This is the offset within the Cache Line, allowing the CPU core to address the contents of the line by byte or word.
The core concept here is the Cache Line, since it is the actual storage that backs the data in memory. Commonly seen Cache Lines are 32 or 64 bytes in size. The other fields exist so that the cache control logic can quickly locate data in the cache. Below, we quote the manual's description to further explain the Cache Line.
It would be inefficient to hold one word of data for each tag address, so several locations are typically grouped together under the same tag. This logical block is commonly known as a cache line, and refers to the smallest loadable unit of a cache, a block of contiguous words from main memory. A cache line is said to be valid when it contains cached data or instructions, and invalid when it does not.
The capacity of the cache is referred to as the cache size, the maximum amount of data it can hold. For example, if a CPU core's L2 Cache Size is 256 KB and the Cache Line Size is 64 bytes, then this L2 Cache has a total of 256 x 1024 / 64 = 4096 Cache Lines. If we group every 4 Cache Lines into one way, as in Figure 1-7, this L2 Cache has 4096 / 4 = 1024 ways, which is a lot of ways.
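The arithmetic above is easy to reproduce in code. The following C snippet simply recomputes the numbers from the text, using the hypothetical 256 KB / 64-byte-line geometry and the 4-lines-per-way layout of Figure 1-7 (real cores use different splits):

```c
#include <stdio.h>

/* Hypothetical geometry from the text: 256 KB L2, 64-byte lines,
 * 4 lines per way as drawn in Figure 1-7. */
int main(void)
{
    const unsigned cache_size    = 256 * 1024;  /* 256 KB           */
    const unsigned line_size     = 64;          /* bytes per line   */
    const unsigned lines_per_way = 4;           /* as in Figure 1-7 */

    unsigned num_lines = cache_size / line_size;     /* 4096 lines  */
    unsigned num_ways  = num_lines / lines_per_way;  /* 1024 ways   */

    printf("cache lines: %u\n", num_lines);
    printf("ways (4 lines each): %u\n", num_ways);
    return 0;
}
```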
Set Associative Mapping and Ways
In the previous section, we learned the basic structure of the cache and clarified the basic concepts of cache design in the ARM architecture. The point of all these elements is to map the contents of memory into the cache by splitting the memory address. We already understand the basic concept of a way: the cache is divided into equally sized slices, and each slice is one way. So in practical engineering, how many slices should the cache be divided into, and on what basis? Let's introduce a model, shown in Figure 1-8, as the basis for discussion. Its configuration is: 2 ways, each way has 4 Cache Lines, and each Cache Line holds 16 bytes of data.
Based on this model, let's first assume Cache way 1 does not exist and see what happens during computation; at this point there is only Cache way 0. When we access data in the range [0x0000.0000, 0x0000.0010), whether it is one byte or several, we must first load all 16 bytes from main memory into Cache way 0 at index (0b00). As the program keeps accessing nearby addresses (the principle of locality), Cache way 0 gradually fills up.
At this point, the data in Cache way 0 and the data in main memory have the following mapping relationship:
(1) Data in the continuous range of [0x0000.0000, 0x0000.0010) is placed in Cache way 0’s index (0b00).
(2) Data in the continuous range of [0x0000.0010, 0x0000.0020) is placed in Cache way 0’s index (0b01).
(3) Data in the continuous range of [0x0000.0020, 0x0000.0030) is placed in Cache way 0’s index (0b10).
(4) Data in the continuous range of [0x0000.0030, 0x0000.0040) is placed in Cache way 0’s index (0b11).
As long as we keep operating on the data in [0x0000.0000, 0x0000.0010), the contents of Cache way 0 remain relatively stable. However, if an instruction jump or a data access needs data from [0x0000.0040, 0x0000.0050), we must repeat the action described above and replace the data in Cache way 0.
After replacement, the data in Cache way 0 and the data in main memory have the following mapping relationship:
(1) Data in the continuous range of [0x0000.0040, 0x0000.0050) is placed in Cache way 0’s index (0b00).
(2) Data in the continuous range of [0x0000.0050, 0x0000.0060) is placed in Cache way 0’s index (0b01).
(3) Data in the continuous range of [0x0000.0060, 0x0000.0070) is placed in Cache way 0’s index (0b10).
(4) Data in the continuous range of [0x0000.0070, 0x0000.0080) is placed in Cache way 0’s index (0b11).
If the program now needs data from main memory [0x0000.0080, 0x0000.0090), a third round of replacement is required. After that, if we again need data from [0x0000.0000, 0x0000.0010), we repeat the first round and replace the data in the cache once more. The cache thus keeps replacing data as the system's memory accesses shift, and every replacement costs clock cycles, especially when data that was previously loaded into the cache has to be fetched again. This frequent, continuous replacement is referred to as Cache Thrashing.
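To put numbers on the thrashing, here is a minimal C model of my own for the single-way cache of Figure 1-8 (4 lines of 16 bytes). It tracks only TAGs, which is enough to count the refills caused by alternating between two blocks that both land on index (0b00):

```c
#include <stdio.h>

/* Single-way (direct-mapped) model of Figure 1-8: 4 lines x 16 bytes.
 * Only tags are tracked; we just count line refills (misses). */
#define LINE_SIZE  16u
#define NUM_LINES  4u

static unsigned long tags[NUM_LINES];
static int           valid[NUM_LINES];
static unsigned      refills;

static void access_addr(unsigned long addr)
{
    unsigned      index = (addr / LINE_SIZE) % NUM_LINES;   /* bits [5:4] */
    unsigned long tag   = addr / (LINE_SIZE * NUM_LINES);   /* bits above */

    if (!valid[index] || tags[index] != tag) {   /* miss: refill the line */
        valid[index] = 1;
        tags[index]  = tag;
        refills++;
    }
}

int main(void)
{
    /* Alternate between two blocks that share index (0b00). */
    for (int i = 0; i < 8; i++)
        access_addr(i % 2 ? 0x0040 : 0x0000);

    printf("8 accesses, %u refills\n", refills);   /* every access misses */
    return 0;
}
```

Only two 16-byte blocks are involved, yet all eight accesses miss; that is thrashing in miniature.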
Optimizing Replacement Process
Clearly, we do not want Cache Thrashing, so we need to optimize. Here we add one more way, as shown in Figure 1-8, turning the model into a two-way cache, and see the effect. First, assume the CPU needs to access data in main memory [0x0000.0000, 0x0000.0010), and let's walk through the first round of loading.
At this point, the data in Cache way 0 and the data in main memory have the following mapping relationship:
(1) Data in the continuous range of [0x0000.0000, 0x0000.0010) is placed in Cache way 0’s index (0b00).
(2) Data in the continuous range of [0x0000.0010, 0x0000.0020) is placed in Cache way 0’s index (0b01).
(3) Data in the continuous range of [0x0000.0020, 0x0000.0030) is placed in Cache way 0’s index (0b10).
(4) Data in the continuous range of [0x0000.0030, 0x0000.0040) is placed in Cache way 0’s index (0b11).
If we then need to operate on data in main memory [0x0000.0040, 0x0000.0050), the extra way in this optimized cache structure lets us leave the data in Cache way 0 untouched and place the data from [0x0000.0040, 0x0000.0050) directly into Cache way 1.
After this load, the data in Cache way 1 and the data in main memory have the following mapping relationship:
(1) Data in the continuous range of [0x0000.0040, 0x0000.0050) is placed in Cache way 1’s index (0b00).
(2) Data in the continuous range of [0x0000.0050, 0x0000.0060) is placed in Cache way 1’s index (0b01).
(3) Data in the continuous range of [0x0000.0060, 0x0000.0070) is placed in Cache way 1’s index (0b10).
(4) Data in the continuous range of [0x0000.0070, 0x0000.0080) is placed in Cache way 1’s index (0b11).
At this point, if the program needs to use data in [0x0000.0000, 0x0000.0010), there is no need to reload the data into Cache, as this data is still in Cache way 0 and can be used directly.
From this thought experiment we can conclude that adding a way to the cache effectively reduces Cache Thrashing and thus improves overall system performance. Is it enough, then, to simply keep adding ways? Not really: every additional way increases hardware complexity, and with it cost and power consumption. Returning to the question at the start of this section, how many ways are appropriate? Without going into details, here is the conclusion (specifics vary between Cortex-xx implementations): the L1 instruction cache and data cache are separate and generally have 2 or 4 ways, L2 generally has 16 ways, and L3 is generally larger still.
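Incidentally, you do not have to take these numbers on faith: on Linux the real geometry is exposed through sysfs. Here is a small C sketch; the paths below are what I see on typical Linux systems, so treat them as an assumption — they may differ or be absent on yours:

```c
#include <stdio.h>

/* Print cache geometry for cpu0's index0 cache from sysfs.
 * index0/index1 are usually L1 data/instruction, index2 L2, index3 L3. */
static void show(const char *path)
{
    char  buf[64];
    FILE *f = fopen(path, "r");
    if (!f)
        return;                      /* file may not exist on this system */
    if (fgets(buf, sizeof(buf), f))
        printf("%-70s %s", path, buf);
    fclose(f);
}

int main(void)
{
    const char *base    = "/sys/devices/system/cpu/cpu0/cache/index0/";
    const char *files[] = { "level", "type", "size",
                            "ways_of_associativity",
                            "coherency_line_size", "number_of_sets" };
    char path[256];

    for (int i = 0; i < 6; i++) {
        snprintf(path, sizeof(path), "%s%s", base, files[i]);
        show(path);
    }
    return 0;
}
```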
At this point, we can also understand why code in the Linux Kernel pays so much attention to alignment, especially for frequently used data structures: use a union where one fits, and align where alignment is possible. The underlying logic is execution efficiency. If an access can be satisfied within one Cache Line, it should not be scattered across several Cache Lines; if it can be satisfied within one way, it should not be spread across several ways. Those situations all increase the overhead of the CPU core when it accesses the data.
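As a small user-space illustration of the alignment point (a sketch only, not kernel code; inside the kernel the equivalent is done with helpers such as ____cacheline_aligned from <linux/cache.h>), assuming a 64-byte line:

```c
#include <stdio.h>

#define CACHE_LINE 64   /* assumed line size; 64 bytes is common on AArch64 */

/* Keep fields that are accessed together inside one line, and start the
 * structure on a line boundary so it does not straddle two lines. */
struct hot_counters {
    unsigned long rx_packets;
    unsigned long rx_bytes;
    unsigned long tx_packets;
    unsigned long tx_bytes;
} __attribute__((aligned(CACHE_LINE)));

int main(void)
{
    printf("sizeof = %zu, alignof = %zu\n",
           sizeof(struct hot_counters),
           _Alignof(struct hot_counters));
    return 0;
}
```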
Now let's summarize the mapping methods of the cache. The CPU cannot bypass the cache when accessing data in memory, so a relationship must be established between blocks of memory data and blocks of cache data, that is, a mapping relationship. Once the significance of the mapping is clear, the next question is how to establish it, which comes down to three methods: Direct Mapped Cache, Fully Associative Cache, and N-way Set Associative Cache.
The specific analysis is as follows:
(1) Direct Mapped Cache
The most notable feature of this mapping method is that there is a fixed mapping relationship between memory blocks and Cache blocks.
For example, in Figure 1-8, if the cache keeps only one way, Cache way 0, then memory blocks [0x0000.0000, 0x0000.0010), [0x0000.0040, 0x0000.0050), [0x0000.0080, 0x0000.0090), and so on are all mapped to the same position, Cache way 0's index (0b00). Here is why. Each Cache Line is 16 bytes, so the low 4 bits of the memory address, [0b0000, 0b1111], address bytes within the Cache Line; this fixes the Offset. Cache way 0 has 4 lines, so the next 2 bits of the address, [0b00, 0b11], select the Cache Line; this fixes the Index. All memory blocks with the same index bits therefore map to the same Cache Line: for example, [0x0000.0000, 0x0000.0010), [0x0000.0040, 0x0000.0050), and [0x0000.0080, 0x0000.0090) all have 0b00 in address bits [5:4]. The mapping model thus looks like this:
(1) Data in the continuous range of [0x0000.0000, 0x0000.0010), [0x0000.0040, 0x0000.0050), … is placed in Cache way 0’s index (0b00).
(2) Data in the continuous range of [0x0000.0010, 0x0000.0020), [0x0000.0050, 0x0000.0060), … is placed in Cache way 0’s index (0b01).
(3) Data in the continuous range of [0x0000.0020, 0x0000.0030), [0x0000.0060, 0x0000.0070), … is placed in Cache way 0’s index (0b10).
(4) Data in the continuous range of [0x0000.0030, 0x0000.0040), [0x0000.0070, 0x0000.0080), … is placed in Cache way 0’s index (0b11).
The advantage of this mapping method is that the hardware circuitry is simple; the disadvantage is that it easily leads to Cache Thrashing.
(2) Fully Associative Cache
The most notable feature of this method is that the mapping between memory blocks and Cache blocks is completely flexible: a memory block may be placed in any Cache block.
For example, in Figure 1-8, suppose again that the cache keeps only one way, Cache way 0, and the memory block size is 16 bytes, so the low four bits of the physical address, [0b0000, 0b1111], address bytes within the Cache Line. To allow the flexible mapping, however, no Index field is needed to select a row within Cache way 0: memory blocks [0x0000.0000, 0x0000.0010) and [0x0000.0040, 0x0000.0050) can be mapped to any position within Cache way 0. In other words, the low 4 bits of the memory address are used for addressing within the Cache Line, and all the remaining high bits become the TAG.
The advantage of this mapping method is that, if the cache is large enough and has enough lines, it greatly reduces the risk of Cache Thrashing. The disadvantage is that the hardware circuitry is complex, expensive, and difficult to build in engineering practice. ARM's explanation is as follows:
Increasing the associativity of the cache reduces the probability of thrashing. The ideal case is a fully associative cache, where any main memory location can map anywhere within the cache. However, building such a cache is impractical for anything other than very small caches, for example, those associated with MMU TLBs.
(3) N-way Set Associative Cache
This mapping method combines the advantages of the previous two while discarding their disadvantages; it is a compromise between them.
Still using the example of Figure 1-8: when we discussed the optimization that reduces Cache Thrashing, we were in fact already using set associative mapping. On the one hand, we improved on the fixed mapping by increasing the number of ways (each way must still contain several lines; adding ways without lines would be meaningless), allowing a memory block to be mapped to more than one possible Cache block. For example:
(1) Data in the continuous range of [0x0000.0000, 0x0000.0010), [0x0000.0040, 0x0000.0050), … can be placed at index (0b00) of either Cache way 0 or Cache way 1.
(2) Data in the continuous range of [0x0000.0010, 0x0000.0020), [0x0000.0050, 0x0000.0060), … can be placed at index (0b01) of either Cache way 0 or Cache way 1.
(3) Data in the continuous range of [0x0000.0020, 0x0000.0030), [0x0000.0060, 0x0000.0070), … can be placed at index (0b10) of either Cache way 0 or Cache way 1.
(4) Data in the continuous range of [0x0000.0030, 0x0000.0040), [0x0000.0070, 0x0000.0080), … can be placed at index (0b11) of either Cache way 0 or Cache way 1.
In this way we draw on the strengths of both of the previous mapping methods, and in fact the ARM Cache mechanism adopts exactly this set associative mapping. The manual adds:
In practice, performance improvements are minimal for above 8-way, with 16-way associativity being more useful for larger L2 caches.
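To close the loop, here is a 2-way version of the earlier toy model (again my own sketch of the Figure 1-8 geometry, with a deliberately naive fill policy): the Index selects the set, the TAGs of the ways in that set are then compared, and the two blocks that thrashed in the single-way case can now live side by side:

```c
#include <stdio.h>

/* 2-way model of Figure 1-8: 4 sets x 2 ways, 16-byte lines.
 * Hardware compares the ways' tags in parallel; here it is just a loop. */
#define LINE_SIZE 16u
#define NUM_SETS  4u
#define NUM_WAYS  2u

struct line { int valid; unsigned long tag; };
static struct line cache[NUM_SETS][NUM_WAYS];

static int access_addr(unsigned long addr)       /* returns 1 on a hit */
{
    unsigned      set = (addr / LINE_SIZE) % NUM_SETS;
    unsigned long tag = addr / (LINE_SIZE * NUM_SETS);

    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return 1;                             /* hit in one of the ways */

    /* Miss: naive fill policy, use the first free way, else evict way 0. */
    unsigned victim = (cache[set][0].valid && !cache[set][1].valid) ? 1 : 0;
    cache[set][victim].valid = 1;
    cache[set][victim].tag   = tag;
    return 0;
}

int main(void)
{
    const unsigned long seq[] = { 0x0000, 0x0040, 0x0000, 0x0040 };
    for (int i = 0; i < 4; i++)
        printf("0x%04lx -> %s\n", seq[i], access_addr(seq[i]) ? "hit" : "miss");
    return 0;
}
```

With only one way, the same four accesses would all have missed; with two ways, the last two hit.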
This article introduced the storage-related parts of a computer system, described the basic architecture of the cache, and explained its basic structural elements. Finally, it discussed how memory blocks are mapped into the cache and the background for ARM's adoption of set associative mapping. This lays the foundation for further discussion of other aspects of the cache.
The material on caches is admittedly rather dry, but it is very useful for software engineers: without a feel for the cache mechanism, it is hard to understand the design background of many mechanisms in the kernel, such as locking. This article concludes here; in future articles we will continue to analyze the cache's structure, working principles, the strategies it uses, and its synchronization issues.
[01] <DDI0487K_a_a-profile_architecture_reference_manual.pdf>
[02] <DEN0024A_v8_architecture_PG.pdf>
[03] <80-LX-MEM-yk0008_CPU-Cache-RAM-Disk关系.pdf>
[04] <80-ARM-ARCH-HK0001_一文搞懂CPU工作原理.pdf>
[05] <80-ARM-MM-Cache-wx0003_Arm64-Cache.pdf>
[06] <80-ARM-MM-HK0002_一文搞懂cpu-cache工作原理.pdf>
SRAM – Static Random-Access Memory
DRAM – Dynamic Random-Access Memory
AMBA – Advanced Microcontroller Bus Architecture