Understanding ARM Cortex-M7 Cache: Basics and Principles

Summary

Introduction

1. Why introduce cache in CPU architecture?

2. The role of cache

3. Memory system of ARM Cortex-M4 / Cortex-M7 CPU architecture

4. Detailed explanation of basic concepts related to cache

Conclusion

Introduction
For many years, most embedded MCUs based on RISC CPUs (such as the 8051 core, PIC core, and ARM Cortex-M0/M0+ cores) typically did not come equipped with high-speed caches. With the introduction of the ARMv7 architecture, cache support was provided in the ARMv7-A series (e.g., Cortex-A8), but it was not supported in the core designs of ARMv7-M microcontrollers (like Cortex-M3 and Cortex-M4). However, when the Cortex-M7 was released, it broke this pattern by providing cache support for smaller embedded microcontrollers.
This series will be introduced in three parts:
1. Basics of cache
2. Cache replacement strategies
3. How to optimize application software to better utilize cache
1. Why introduce cache in CPU architecture?
With the advancement of semiconductor technology, Moore's Law has brought higher integration, allowing CPU cores to run at higher frequencies. However, traditional memory (also known as main memory) based on Flash and SRAM is limited by its structure and operating principle: its operating frequency and read/write access speeds cannot match the high-speed core's demand for instructions and data, leading to performance bottlenecks in controller chips.
The purpose of cache is to improve the average speed and efficiency of high-speed CPU cores accessing memory. One of the most direct and obvious benefits is the significant improvement in application performance, which in turn can lead to enhanced power consumption models. High-speed caches have been used in processor-based high-end systems for many years (dating back to the 1960s).
The driving force behind the development and use of caches is the principle of program locality.
Caches operate based on two locality principles:
  • Spatial locality: Compiled application code and data are stored sequentially, so accessing one memory location is likely to be followed by accesses to adjacent locations.
  • Temporal locality: During the execution of applications/code, it is likely that the same memory area will be accessed repeatedly within a short period of time.
It is also worth noting sequentiality: given that a specific location s has been referenced, it is likely that the next few references will be to location s + 1. Sequentiality is a constrained type of spatial locality and can be seen as a subset of it.
In high-end modern chip systems, there can be various forms of cache, such as network caches and disk caches commonly found on x86 architecture PCs, but this article will focus on the main memory (Flash and SRAM) caches of embedded microcontrollers (MCUs) represented by ARM Cortex-M7.
Additionally, main memory caches can also be hierarchical, meaning there are multiple caches between the processor and main memory, often referred to as L1, L2, L3, etc., with L1 being closest to the processor core.
2. The role of cache
The simplest way to understand a high-speed cache is as a small high-speed buffer placed between the CPU and main memory, used to store recently referenced blocks of main memory.
Once we use the cache, every memory read will result in one of the following two outcomes:
  • Cache hit—The memory address is already in the cache.
  • Cache miss—The memory address is not in the cache, so we must access main memory.
The core processor architecture typically provides support for cache implementation. In its simplest form, there are two main CPU processor architectures: Harvard and Von Neumann architecture. The Von Neumann architecture only has one bus for data transfer and instruction fetching, so different fetches must be interleaved as they cannot be executed simultaneously. The Harvard architecture has separate instruction and data buses, allowing transfers to be executed simultaneously on both buses.
The Von Neumann architecture typically has a unified cache that stores both instructions and data. Since the Harvard architecture has separate instruction and data buses, logically they usually have separate instruction and data caches.
Cortex-M7 is a variant of the Harvard architecture known as Modified Harvard. Like Harvard, it provides independent instruction and data buses, but these buses access a unified memory space, allowing instruction memory to be accessed in the same way as data space. This means the Modified Harvard architecture can support both unified and Harvard (separate) caches.
It is also important to note that all ARMv7-M series cores are based on a load/store architecture: the only permitted data memory accesses are load and store operations (LDR/STR), and all arithmetic and logic operations on data are performed in general-purpose registers (r0-r12).
3. Memory system of ARM Cortex-M4 / Cortex-M7 CPU architecture
Unlike other ARM series, the ARMv7-M architecture pre-partitions memory mapping into 8 x 512MB sections, which are then allocated to code/Flash, RAM, peripherals, and system space.
Cortex-M4 and Cortex-M7 share the same system memory map but have completely different memory systems. On Cortex-M4, users will see on-chip Flash located at address 0x00000000 and on-chip SRAM located at address 0x20000000. For Cortex-M7, however, the instruction and data address regions differ in two respects.
First, the Cortex-M7 core is equipped with local (L1) instruction cache (I-Cache) and data cache (D-Cache).
Second, the Cortex-M7 memory system also supports connection to local tightly coupled memory (TCM) for storing instructions and data, referred to as ITCM and DTCM, respectively.
4. Detailed explanation of basic concepts related to cache
As mentioned earlier, caches are local high-speed buffers between main memory and the central processing unit. In the ARM Thumb-2 ISA (Instruction Set Architecture), load and store operations can transfer bytes, half-words (2 bytes), or words (4 bytes) and even double words (8 bytes, supported by Cortex-M7) to and from memory. However, due to spatial locality, the cache controller does not only buffer a single read to a specific location but pulls in many words around the currently accessed memory. The number of words in a single transfer from main memory to cache is referred to as the cache line length, and the process of reading into the cache is called line fill.
The line length varies by design but is fixed at 8 words (i.e., 32 bytes) on Cortex-M7.
For efficient memory access, each cache line is aligned to a 32-byte boundary. Therefore, memory reads from address 0x0000000c and address 0x00000018 fall in the same cache line.
This means that for an 8-word cache line, if we mask the lowest 5 bits, all addresses in the same cache line will evaluate to the same result. Right-shifting this result by five bits will yield a unique index for each address (e.g., (address & ~0x1f) >> 5).
In the remaining bottom 5 bits, bits 4-2 index the word within the cache line, while bits 1-0 provide a byte within the word (when byte and half-word accesses are needed).
Thus, reading from address 0x0000004c has the same word offset as 0x0000000c but resides in a different cache line; likewise for addresses 0x00000078 and 0x00000018.
Of course, this model only works when the cache is the same size as the Flash/SRAM, which would defeat the purpose (and would not be cost-effective). In practice the cache is significantly smaller than the available Flash/SRAM, so the indexing model needs to be modified.
Caches on Cortex-M7 are optional and can be 4KB, 8KB, 16KB, 32KB, or 64KB in size. Note that the sizes of instruction and data caches may differ.
Assuming we have the smallest available cache size (4KB), we can calculate the number of independent cache lines. For a 32-byte cache line, this gives us a set of 128 unique cache lines, providing us with indices [0-127] for each cache line.
To calculate the index of an address, we mask bits 11-5 and right-shift by 5 positions ((address & 0xFE0) >> 5). However, this means that within a typical Flash address range, multiple addresses will map to the same index, for example:
(0x0000004c & 0xFE0) >> 5 = index 2, word 3
(0x0000204c & 0xFE0) >> 5 = index 2, word 3
Therefore, if we read address 0x0000004c, the cache controller will fill cache line 2 with bytes 0x00000040-0x0000005F. In addition to the stored memory values, the cache controller also stores an address tag for each line. The tag is the remaining upper part of the address (i.e., address >> 12).
So for the above two addresses, we now get:
0x0000004c : index 2, word 3, tag 0
0x0000204c : index 2, word 3, tag 2
If we subsequently read from the different address 0x0000204c, this maps to the same cache index. However, the cache controller also compares the cache line's stored address tag with the tag of the requested address; here they do not match, which ensures the cache controller does not return the wrong memory contents to the processor. When the tags do not match, this is a cache miss, in which case the cache controller needs an action plan, also known as a policy.
Conclusion
The above is the first part of "Understanding ARM® Cortex®-M7 Cache." I hope it helps you grasp why modern high-speed CPU cores are equipped with caches, along with the related basic concepts, as a foundation for the second part (cache replacement strategies) and the third part (how to optimize application software to better utilize cache).

References:

1. Introduction to the ARM® Cortex®-M7 Cache – Part 1: Cache Basics, by Niall Cooling
2. Introduction to the ARM® Cortex®-M7 Cache – Part 2: Cache Replacement Policy, by Niall Cooling
3. Introduction to the ARM® Cortex®-M7 Cache – Part 3: Optimising Software to Use Cache, by Niall Cooling
4. ARM® Cortex®-M7: Bringing High Performance to the Cortex-M Processor Series, by Ian Johnson, Senior Product Manager, ARM
5. "A Brief Discussion on the High Performance of Embedded MCU CPU Cores: Key Points of ARM Cortex-M7 Core Dual-Issue ISA Implementation", Enwei Hu, "Automotive Electronics Expert" WeChat public account

Enwei Hu (胡恩伟)

August 31, 2022, in Chongqing
