Introduction to ARMv8-A Architecture Virtual Memory (MMU/TLB/Cache)


Thoughts:
1. What is in a cache entry?
2. What is in a TLB entry?
3. What is in an MMU page table entry? What is in the L1 and L3 tables?

This article has the answers. Will you know them after reading? Let’s see how it goes, haha.

Note: Knowledge about the MMU, TLB, and cache is fragmented, and the modules are closely related, so they are introduced together. This article focuses on the working principle of the MMU; to learn about caches, please refer to <ARM Cache Learning Notes – One Article is Enough>.

The following questions come from the Uncle Ben’s public account:
1. What is the internal organizational structure of the cache? Can you draw a layout diagram of the cache? What are sets and ways?
2. What are the differences between direct-mapped, fully associative, and set-associative caches? What are their advantages and disadvantages?
3. How does the aliasing (duplicate-name) problem occur?
4. How does the homonym (same-name) problem occur?
5. Can a VIPT cache have aliasing problems?
6. What are inner shareability and outer shareability? How do you tell them apart?
7. What is PoU? What is PoC?
8. What is cache coherency? What methods does the industry use to maintain it?
9. I can’t understand the MESI state transition diagram.
10. What is cache false sharing? How does it occur, and how can it be avoided?
11. Why do DMA and the cache have coherency issues?
12. How should a network card driver maintain the cache when receiving and sending data via DMA?
13. For self-modifying code, how do you keep the data cache and the instruction cache consistent?

Article Directory

    • 1. Memory Attribute

    • 2. Some Basic Concepts of Cache

    • 3. Cache Memory Access Model:

    • 4. Introduction to MMU

    • 5. VMSA Related Terminology:

    • 6. Address Translation System (AT)

      • (1) The process of address translation

      • (2) System registers related to MMU

      • (3) Registers related to enabling MMU and endianness

      • (4) Address size configuration

      • (5) Granule sizes

      • (6) The impact of granule size on address translation

      • (7) Disable MMU

    • 7. Translation Table

      • (1) TTBR0/TTBR1

      • (2) What information is included in the page table entry

      • (3) Granule sizes

      • (4) Cache configuration

    • 8. The process of querying the ARM MMU three-level page table

    • 9. Translation Lookaside Buffer (TLB)

      • (1) What is in the TLB entry?

      • (2) Contiguous block entries

      • (3) TLB abort

      • (4) TLB consistency

    • 10. VMSAv8-64 Translation Table Format Descriptors

1. Memory Attribute

ARMv8 defines two types of memory: device memory and normal memory. Device memory is fixed as Outer Shareable and Non-cacheable, while normal memory has multiple attributes to choose from. To clarify: section B2.7.2 states “Data accesses to memory locations are coherent for all observers in the system, and correspondingly are treated as being Outer Shareable.” Here “treated as” means device memory behaves as Outer Shareable without actually carrying that attribute, so some articles consider device memory to have no shareability attribute. It can also be argued that once a region of memory is set to Non-cacheable, discussing its shareability is not very meaningful. In any case, this article treats device memory as fixed Outer Shareable and Non-cacheable.

Device memory has three characteristics:

➨ Gathering or non-Gathering (G or nG): whether multiple memory accesses may be merged. With nG, the processor must perform exactly the accesses written in the code and cannot merge two accesses into one. For example, if the code contains two reads of the same address, the processor must perform two read transactions.

➨ Reordering (R or nR): whether the processor is allowed to reorder memory access instructions. nR means program order must be strictly followed.

➨ Early Write Acknowledgement (E or nE): a PE accesses memory through request/response exchanges (the more precise term is transactions). For a write, the PE needs a write acknowledgement to confirm that the write transaction has completed. To speed up writes, intermediate stages of the system may contain write buffers. nE means the write acknowledgement must come from the final destination, not from an intermediate write buffer.

For the above three characteristics, the following four configurations are given:

  • Device-nGnRnE: The processor must perform exactly the memory accesses in the code (no gathering), must follow program order (no reordering), and the write acknowledgement must come from the final destination.

  • Device-nGnRE: The processor must perform exactly the memory accesses in the code (no gathering), must follow program order (no reordering), and the write acknowledgement may come from an intermediate write buffer.

  • Device-nGRE: The processor must perform exactly the memory accesses in the code (no gathering), memory access instructions may be reordered, and the write acknowledgement may come from an intermediate write buffer.

  • Device-GRE: The processor may merge multiple memory accesses, memory access instructions may be reordered, and the write acknowledgement may come from an intermediate write buffer.
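As a small illustration of how these four configurations are told apart in practice, the sketch below decodes one 8-bit MAIR_ELx attribute field. It assumes the base architecture encodings, in which Device memory has bits[7:4] and bits[1:0] equal to zero and bits[3:2] select the variant. The names `device_type`, `decode_device_attr`, and `device_type_name` are made up for this example and do not come from any real code base.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, used only for this sketch. */
enum device_type {
    NOT_DEVICE    = -1,
    DEVICE_nGnRnE = 0,   /* attribute byte 0x00 */
    DEVICE_nGnRE  = 1,   /* attribute byte 0x04 */
    DEVICE_nGRE   = 2,   /* attribute byte 0x08 */
    DEVICE_GRE    = 3    /* attribute byte 0x0C */
};

/* Decode one MAIR attribute byte. Device encodings have bits[7:4] == 0
 * and bits[1:0] == 0; bits[3:2] pick one of the four Device types. */
enum device_type decode_device_attr(uint8_t attr)
{
    if ((attr & 0xF3u) != 0)
        return NOT_DEVICE;
    return (enum device_type)((attr >> 2) & 0x3u);
}

/* Printable name for a Device type returned by decode_device_attr(). */
const char *device_type_name(enum device_type t)
{
    static const char *const names[] = {
        "Device-nGnRnE", "Device-nGnRE", "Device-nGRE", "Device-GRE"
    };
    assert(t >= DEVICE_nGnRnE && t <= DEVICE_GRE);
    return names[t];
}
```

For instance, an attribute byte of 0x04 decodes to Device-nGnRE, while 0xff is not a Device encoding at all.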

Meaning of Some Attributes
Shareable: memory is shared
Non-shareable: memory is not shared and is generally accessible only by a single PE
Cacheable: memory may be cached
Non-cacheable: memory is not cached

Definitions of PoU/PoC/Inner/Outer

In brief, PoU/PoC define how far down the cache and memory hierarchy an instruction or maintenance operation reaches; once it reaches that point, Inner/Outer Shareable defines the range over which it is broadcast. For example, suppose an instruction cache maintenance operation to the PoU is executed on one Cortex-A15 core. Clearly, the level 1 instruction caches of all four A15 cores are affected. Are other masters affected as well? That is decided by the Inner/Outer/Non-shareable setting.

With the above definitions, let’s look at inner Shareable and outer Shareable.

Shareable
Inner Shareable: memory is shared within the inner domain
Outer Shareable: memory is shared within the outer domain

Read Allocation When the CPU reads data and a cache miss occurs, it will allocate a cache line to cache the data read from main memory. By default, caches support read allocation.

Write Allocation When the CPU writes data and a cache miss occurs, write allocation strategy will be considered. When we do not support write allocation, the write instruction will only update the main memory data and then end. When supporting write allocation, we first load data from main memory into the cache line (equivalent to performing a read allocation action), and then update the data in the cache line.

Write Through When the CPU executes the store instruction and hits the cache, we update the data in the cache and also update the data in the main memory. The data in the cache and main memory remain consistent.


Write Back When the CPU executes a store instruction and hits the cache, only the data in the cache is updated. Each cache line has a bit that records whether its data has been modified, known as the dirty bit (usually drawn as a D next to the cache line), and the store sets it. The data in main memory is updated only when the cache line is replaced or explicitly cleaned. Therefore main memory may still hold the old, unmodified data while the modified data resides in the cache, and the two may not be consistent.
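The difference between the two write policies can be sketched with a toy model: a cache holding a single one-word line in front of a tiny memory, with write allocation on a miss. Everything here (`toy_cache`, `toy_store`, `toy_clean`, the 16-word memory) is invented for illustration and is far simpler than any real cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MEM_WORDS 16u

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* A one-line, one-word cache in front of a tiny memory (illustration only). */
struct toy_cache {
    enum write_policy policy;
    bool valid;
    bool dirty;                      /* the D bit: line differs from memory */
    uint32_t index;                  /* which memory word the line holds */
    uint32_t data;
    uint32_t memory[TOY_MEM_WORDS];
};

/* Clean: write a dirty line back to memory and clear the dirty bit. */
void toy_clean(struct toy_cache *c)
{
    if (c->valid && c->dirty) {
        c->memory[c->index] = c->data;
        c->dirty = false;
    }
}

/* Store with write allocation: a miss evicts the old line and refills. */
void toy_store(struct toy_cache *c, uint32_t index, uint32_t value)
{
    assert(index < TOY_MEM_WORDS);
    if (!c->valid || c->index != index) {   /* cache miss */
        toy_clean(c);                       /* evict: write back if dirty */
        c->index = index;
        c->data = c->memory[index];         /* line fill from memory */
        c->valid = true;
        c->dirty = false;
    }
    c->data = value;                        /* update the cache line */
    if (c->policy == WRITE_THROUGH)
        c->memory[index] = value;           /* memory updated on every store */
    else
        c->dirty = true;                    /* write-back: defer the update */
}
```

With `WRITE_BACK`, a store leaves `memory[]` untouched until `toy_clean()` runs or the line is evicted; with `WRITE_THROUGH`, memory and cache stay identical after every store.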


➨ For the registers that define these attributes in ARMv8 and examples from the Linux kernel or OP-TEE source code, see <Memory attributes and Cache policies in ARMV8 MMU memory management>.

2. Some Basic Concepts of Cache

A cache is a block of high-speed memory that contains many entries. Each entry contains memory address information (such as a tag) and the associated data.

Cache design relies on two principles. Spatial locality: after one location is accessed, nearby locations are likely to be accessed too, for example sequentially executed instructions or the fields of a structure. Temporal locality: a memory area is likely to be accessed again within a short time, for example when software executes a loop.

To reduce the number of cache reads and writes, multiple pieces of data are stored under the same tag; this unit is what we call a cache line. An access to data already in the cache is called a cache hit, while an access to data not in the cache is called a cache miss.
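To make the tag and cache line idea concrete, here is a minimal sketch that splits an address into tag, set index, and byte offset within the line. The set index is an assumption that applies to direct-mapped and set-associative caches; the line size and set count are parameters, and 64-byte lines with 256 sets are only example values. `cache_addr` and `split_address` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields a set-associative cache derives from an address. */
struct cache_addr {
    uint64_t tag;      /* compared against the tag stored in the entry */
    uint32_t set;      /* selects the set to search */
    uint32_t offset;   /* byte position inside the cache line */
};

/* Split an address for a cache with the given line size and set count. */
struct cache_addr split_address(uint64_t addr, uint32_t line_bytes,
                                uint32_t num_sets)
{
    struct cache_addr a;

    assert(line_bytes != 0 && num_sets != 0);
    a.offset = (uint32_t)(addr % line_bytes);
    a.set = (uint32_t)((addr / line_bytes) % num_sets);
    a.tag = addr / line_bytes / num_sets;
    return a;
}
```

Two addresses that fall in the same line produce the same tag and set and differ only in the offset, which is why a single tag can cover a whole line of data.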

Potential problems introduced by caches: memory accesses may not behave as the programmer expects, and a piece of data can exist in multiple physical locations at once.

3. Cache Memory Access Model:


Definitions of Memory Coherency Terms: Point of Unification (PoU), Point of Coherency (PoC), Point of Persistence (PoP), and Point of Deep Persistence (PoDP).

4. Introduction to MMU

In the ARMv8-A architecture, the hardware structure through which an ARM core accesses memory is shown below:

[Figure: hardware structure of an ARM core accessing memory]

Within it, the MMU consists of the TLB and the Table Walk Unit.

➨ TLB: the Translation Lookaside Buffer, which corresponds to the TLBI instructions.

➨ Table Walk Unit: also called the address translation system, which corresponds to the AT instructions.

5. VMSA Related Terminology:

➨ VMSA – Virtual Memory System Architecture
➨ VMSAv8
➨ VMSAv8-32
➨ VMSAv8-64

➨ Virtual address (VA)
➨ Intermediate physical address (IPA)
➨ Physical address (PA)

A translation stage that supports only a single VA range:
➨ 48-bit VA: 0x0000000000000000 – 0x0000FFFFFFFFFFFF
➨ 52-bit VA (ARMv8.2-LVA, 64KB granule): 0x0000000000000000 – 0x000FFFFFFFFFFFFF

A translation stage that supports two VA ranges:
➨ 48-bit VA: 0x0000000000000000 – 0x0000FFFFFFFFFFFF and 0xFFFF000000000000 – 0xFFFFFFFFFFFFFFFF
➨ 52-bit VA: 0x0000000000000000 – 0x000FFFFFFFFFFFFF and 0xFFF0000000000000 – 0xFFFFFFFFFFFFFFFF

Address tagging / Memory Tagging Extension / Pointer authentication

6. Address Translation System (AT)

(1) The Process of Address Translation

The MMU performs address translation automatically. Once software has filled in the page tables and configured the system registers, every virtual address read or write issued by the CPU is converted by the MMU into a physical address and then sent onto the AXI bus to complete the actual access to memory or a device. The figure below lists ARM’s address translation models at the different exception levels.

[Figure: address translation models by exception level]

In general, stage 2 is not enabled; it is enabled only when EL2 is in use and a hypervisor is implemented, as shown in the following figure:

[Figure: two-stage translation with a hypervisor]

(2) System Registers Related to MMU

[Figure: system registers related to the MMU]

In the ARMv8-A architecture, the TCR (Translation Control Register) registers are:

  • TCR_EL1

  • TCR_EL2

  • TCR_EL3

  • VTCR_EL2

Their meanings: control registers for address translation.

  • TCR_EL1: Translation Control Register (EL1) The control register for stage 1 of the EL1&0 translation regime.

  • TCR_EL3: Translation Control Register (EL3) The control register for stage 1 of the EL3 translation regime.

  • TCR_EL2: Translation Control Register (EL2) The control register for stage 1 of the EL2 or EL2&0 translation regime.

  • VTCR_EL2: Virtualization Translation Control Register The control register for stage 2 of the EL1&0 translation regime.

The corresponding bit map: [Figure: TCR register bit assignments]

(3) Registers Related to Enabling MMU and Endianness

In the ARMv8-A architecture, there are three SCTLR (System Control Register) registers. In each of them, the M bit enables stage 1 address translation and the EE bit selects the data endianness for that exception level:

  • SCTLR_EL1

  • SCTLR_EL2

  • SCTLR_EL3

(4) Address Size Configuration
  • Physical address size – tells software how many physical address bits the current system implements.

  • Output address size – tells the MMU how many bits of physical address to output.

  • Input address size – tells the MMU how many bits of the input address, normally a virtual address, are in use.

  • Supported IPA size – the size used by stage 2 translation, not introduced here.

a. Physical address size (reported in ID_AA64MMFR0_EL1.PARange): [Figure: physical address size encodings]

b. Output address size (configured in TCR_ELx.{I}PS): [Figure: output address size encodings]

c. Input address size

  • TCR_ELx.T0SZ defines the size of the VA region translated through TTBR0_ELx.

  • TCR_ELx.T1SZ defines the size of the VA region translated through TTBR1_ELx.

[Figure: input address size configuration]

Note that the maximum size is 2^(64-x) bytes, where x is TCR_ELx.T0SZ or TCR_ELx.T1SZ.

d. Supported IPA size is determined by the VTCR_EL2.SL0 and VSTCR_EL2.SL0 fields.
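The 2^(64-x) rule can be worked through with a few lines of C. This is only an arithmetic sketch: the function names are hypothetical, and the code does not check which TxSZ values a particular implementation actually accepts.

```c
#include <assert.h>
#include <stdint.h>

/* Number of virtual address bits covered by a region with the given TxSZ. */
unsigned va_bits_from_txsz(unsigned txsz)
{
    assert(txsz > 0 && txsz < 64);
    return 64u - txsz;
}

/* Highest address of the lower (TTBR0) region: 2^(64 - T0SZ) - 1. */
uint64_t ttbr0_region_top(unsigned t0sz)
{
    return (UINT64_C(1) << va_bits_from_txsz(t0sz)) - 1;
}

/* Lowest address of the upper (TTBR1) region: 2^64 - 2^(64 - T1SZ). */
uint64_t ttbr1_region_base(unsigned t1sz)
{
    return ~((UINT64_C(1) << va_bits_from_txsz(t1sz)) - 1);
}
```

With T0SZ = T1SZ = 16, this gives the familiar 48-bit layout: the lower region ends at 0x0000FFFFFFFFFFFF and the upper region starts at 0xFFFF000000000000.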

(5) Granule Sizes

a. Stage 1 Granule Sizes: [Figure: stage 1 granule size configuration]

b. Stage 2 Granule Sizes: [Figure: stage 2 granule size configuration]

(6) The Impact of Granule Size on Address Translation

The granule size affects how the page tables are built. Different granule sizes give different page table structures, as shown in the table below:

[Figure: page table structure for each granule size]

(7) Disable MMU

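When stage 1 translation is disabled for the EL1&0 regime and HCR_EL2.DC is 1, accesses are assigned the Normal memory, Non-shareable, Inner Write-Back Read-Allocate Write-Allocate, Outer Write-Back Read-Allocate Write-Allocate attributes. When HCR_EL2.DC is 0, data accesses are instead treated as Device-nGnRnE.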

7. Translation Table

(1) TTBR0/TTBR1

According to ARM documents: application programs need to switch page tables frequently, while the kernel does not, so ARM provides two page table base registers, TTBR0 and TTBR1. With a 48-bit VA, TTBR0 is used to translate the virtual address space 0x00000000_00000000 – 0x0000FFFF_FFFFFFFF, and TTBR1 is used to translate the virtual address space 0xFFFF0000_00000000 – 0xFFFFFFFF_FFFFFFFF.

EL2 (without VHE) and EL3 have only TTBR0 and no TTBR1, so their virtual address space is 0x00000000_00000000 – 0x0000FFFF_FFFFFFFF.
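The choice between the two base registers can be sketched as a check on the upper address bits. This is a simplified model for a regime with both TTBR0 and TTBR1: it assumes both regions use the same number of VA bits and ignores address tagging, and `va_region` and `select_ttbr` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum va_region { VA_TTBR0, VA_TTBR1, VA_INVALID };

/* Decide which base register translates a virtual address when both
 * regions are va_bits wide (for example 48). Addresses whose upper bits
 * are neither all 0s nor all 1s fall in the unmapped hole between the
 * two regions and would fault. */
enum va_region select_ttbr(uint64_t va, unsigned va_bits)
{
    uint64_t top;

    assert(va_bits >= 1 && va_bits <= 63);
    top = va >> va_bits;                      /* bits [63:va_bits] */
    if (top == 0)
        return VA_TTBR0;
    if (top == (UINT64_MAX >> va_bits))       /* all upper bits are 1 */
        return VA_TTBR1;
    return VA_INVALID;
}
```

With `va_bits` set to 48, a user-space style address such as 0x00007FFFFFFFF000 selects TTBR0, and a kernel style address such as 0xFFFF000000001000 selects TTBR1.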

(2) What Information is Included in the Page Table Entry

In addition to completing address translation, the MMU also controls access permissions, memory ordering, and cache policies.

As shown in the figure, three types of entry are listed:

[Figure: the three types of page table entry]

bits[1:0] indicate whether the output is a block address, the next level table address, or an invalid entry.

(3) Granule Sizes

There are three granule sizes for the page table: 4KB, 16KB, and 64KB.

(4) Cache Configuration

The MMU uses translation tables and translation registers to control cache policy, memory attributes, access permissions, and VA to PA conversion.

8. The Process of Querying the ARM MMU Three-Level Page Table

[Figure: three-level page table walk]

  • (1) After the MMU is enabled, every read/write address issued by the CPU is a 64-bit virtual address.

  • (2) The high 16 bits of the virtual address are either all 0s or all 1s. If all 0s, TTBR0_ELx supplies the base address of the L1 page table; if all 1s, TTBR1_ELx supplies it.

  • (3) TTBRx_ELn points to the L1 page table; indexing it with bits[47:42] yields the base address of the L2 page table.

  • (4) Indexing the L2 page table with bits[41:29] yields the base address of the L3 page table.

  • (5) Indexing the L3 page table with bits[28:16] yields the base address of the page.

  • (6) Finally, bits[15:0] give the offset within the page, which produces the final physical address.

These bit ranges correspond to a 64KB granule with a 48-bit VA.
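The index extraction in these steps can be sketched as follows. The sketch assumes a 64KB granule with a 48-bit VA, which is the configuration the bit ranges [41:29], [28:16], and [15:0] belong to; under that assumption the first-level index comes from bits[47:42]. `va_fields_64k` and `split_va_64k` are hypothetical names, and the code only splits the address without reading any page table.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 48-bit VA under a 64KB granule (13 index bits per full level). */
struct va_fields_64k {
    unsigned l1;       /* bits [47:42], index into the L1 table */
    unsigned l2;       /* bits [41:29], index into the L2 table */
    unsigned l3;       /* bits [28:16], index into the L3 table */
    unsigned offset;   /* bits [15:0],  byte offset inside the 64KB page */
};

struct va_fields_64k split_va_64k(uint64_t va)
{
    struct va_fields_64k f;

    /* The high 16 bits must be all 0s (TTBR0) or all 1s (TTBR1). */
    assert((va >> 48) == 0 || (va >> 48) == 0xFFFF);
    f.l1 = (unsigned)((va >> 42) & 0x3F);
    f.l2 = (unsigned)((va >> 29) & 0x1FFF);
    f.l3 = (unsigned)((va >> 16) & 0x1FFF);
    f.offset = (unsigned)(va & 0xFFFF);
    return f;
}
```

Each index is multiplied by 8 (the descriptor size) and added to the table base to locate the descriptor for the next step of the walk.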

9. Translation Lookaside Buffer (TLB)

(1) What is in the TLB Entry?

The TLB contains not only physical and virtual addresses but also some attributes such as memory type, cache policies, access permissions, ASID, VMID. Note: ASID – Address Space ID, VMID – Virtual Machine ID.
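One way to picture such an entry is as a C structure plus a lookup that matches on VA, ASID, and VMID. The real TLB layout is implementation defined, so every field name and width below is an assumption made for illustration, and global (non-ASID-tagged) entries are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one TLB entry. */
struct tlb_entry {
    bool valid;
    uint16_t asid;        /* Address Space ID the entry belongs to */
    uint16_t vmid;        /* Virtual Machine ID the entry belongs to */
    uint64_t vpn;         /* virtual address >> page_shift */
    uint64_t pfn;         /* physical address >> page_shift */
    uint8_t page_shift;   /* log2 of the page or block size */
    uint8_t attr_index;   /* memory type and cache policy selector */
    uint8_t ap;           /* access permissions */
};

/* Return the entry that translates va for this ASID and VMID, or NULL. */
const struct tlb_entry *tlb_lookup(const struct tlb_entry *tlb, size_t n,
                                   uint64_t va, uint16_t asid, uint16_t vmid)
{
    assert(tlb != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct tlb_entry *e = &tlb[i];

        if (!e->valid || e->asid != asid || e->vmid != vmid)
            continue;
        if ((va >> e->page_shift) == e->vpn)
            return e;
    }
    return NULL;
}

/* Combine the cached frame number with the page offset of va. */
uint64_t tlb_translate(const struct tlb_entry *e, uint64_t va)
{
    uint64_t mask = (UINT64_C(1) << e->page_shift) - 1;

    return (e->pfn << e->page_shift) | (va & mask);
}
```

Because the ASID and VMID take part in the match, two processes or two virtual machines can use the same virtual address without the TLB having to be flushed on every switch.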

(2) Contiguous Block Entries

The TLB has a fixed number of entries, so improving the TLB hit rate reduces the number of translation table walks to external memory. The ARMv8 architecture provides a TLB feature called contiguous block entries, in which one TLB entry can cover multiple blocks: a lookup hits the single entry and then uses an index to determine the specific block. Block entries in the page table have a contiguous bit; if this bit is 1, the entry may use the contiguous block entries feature. Contiguous block entries must be aligned, for example:

  • 16 × 4KB adjacent blocks giving a 64KB entry with a 4KB granule. Caching a 64KB range would otherwise need 16 entries.

  • 32 × 32MB adjacent blocks giving a 1GB entry for L2 descriptors, and 128 × 16KB adjacent blocks giving a 2MB entry for L3 descriptors, when using a 16KB granule.

  • 32 × 64KB adjacent blocks giving a 2MB entry with a 64KB granule.

If the contiguous bit is set, then: PA after TLB lookup = PA in TLB entry + index.
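As a worked example of this formula, the sketch below handles the first case in the list: sixteen contiguous 4KB pages cached as one 64KB range. Here the "index" is taken to be the byte distance from the start of the range, which is one reasonable reading of the formula; `contig_translate` and the macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K  UINT64_C(0x1000)
#define CONTIG_PAGES  16u
#define CONTIG_SPAN   (PAGE_SIZE_4K * CONTIG_PAGES)   /* 64KB */

/* Translate va through one contiguous 64KB TLB entry whose range starts
 * at base_va and maps to base_pa. Both bases must be 64KB aligned. */
uint64_t contig_translate(uint64_t base_va, uint64_t base_pa, uint64_t va)
{
    assert(base_va % CONTIG_SPAN == 0);
    assert(base_pa % CONTIG_SPAN == 0);
    assert(va >= base_va && va - base_va < CONTIG_SPAN);
    return base_pa + (va - base_va);
}
```

For a range with base VA 0x10000 mapped to base PA 0x80020000, the address 0x13456 lies in the fourth 4KB page of the range and translates to 0x80023456.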

(3) TLB Abort

If the contiguous bit is set but the table entries being translated are not contiguous, or an entry’s output lies outside the address range or is not aligned, a TLB abort results.

(4) TLB Consistency

If the OS modifies the page table (entries), the OS needs to inform the TLB to invalidate these TLB entries, which needs to be done by software. The instruction is as follows:

TLBI <type><level>{IS} {, <Xt>}
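As one concrete instance of this syntax, the sketch below invalidates a single page by VA and ASID with `TLBI VAE1IS`. It assumes a 4KB-aligned address and the usual operand layout for this operation (ASID in bits[63:48], VA[55:12] in bits[43:0]). The function names and the `TLBI_SKETCH_PRIVILEGED` guard macro are made up for the example; the instruction sequence is only compiled on AArch64 when that macro is defined, because it must run at EL1 or higher.

```c
#include <assert.h>
#include <stdint.h>

/* Build the Xt operand for TLBI VAE1IS from a page-aligned VA and an ASID. */
uint64_t tlbi_va_operand(uint64_t va, uint16_t asid)
{
    assert((va & 0xFFF) == 0);   /* sketch expects a 4KB-aligned address */
    return ((uint64_t)asid << 48) | ((va >> 12) & ((UINT64_C(1) << 44) - 1));
}

/* Invalidate one page in all PEs of the Inner Shareable domain. */
void tlb_invalidate_page(uint64_t va, uint16_t asid)
{
#if defined(__aarch64__) && defined(TLBI_SKETCH_PRIVILEGED)
    uint64_t op = tlbi_va_operand(va, asid);

    __asm__ volatile(
        "dsb ishst\n"            /* make the page table update visible */
        "tlbi vae1is, %0\n"      /* invalidate by VA, EL1, Inner Shareable */
        "dsb ish\n"              /* wait for the invalidation to complete */
        "isb\n"                  /* resynchronize the instruction stream */
        : : "r"(op) : "memory");
#else
    (void)va;                    /* nothing to do outside privileged AArch64 */
    (void)asid;
#endif
}
```

The barriers around the TLBI reflect the usual ordering requirement: the page table write must be visible before the invalidation, and the invalidation must complete before later accesses rely on the new mapping.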

10. VMSAv8-64 Translation Table Format Descriptors

The table format descriptors here are the entries mentioned in the “Thoughts” section at the beginning of this article. A page table entry can be one of the following:

  • An invalid or fault entry.

  • A table entry that points to the next-level translation table.

  • A block entry that defines the memory properties for the access.

  • A reserved format.

If it is a table entry, its attributes are described as follows: [Figure: table entry attributes]

If it is a block entry, its attributes are described as follows: [Figure: block entry attributes]

If it is a stage 2 entry, regardless of whether it is a table entry or a block entry, its attributes are described as follows: [Figure: stage 2 entry attributes]

In fact, the following image is clearer: [Figure: descriptor bit layout]

Here, bits[1:0] define the type of entry (invalid, block, table, or reserved), and bits[4:2] select one of the bytes in the MAIR_ELn register, which defines the type of memory (normal memory or device memory).

The MAIR_ELn register (Memory Attribute Indirection Register (ELn)) is split into 8 bytes, and each byte defines one type of memory. [Figure: MAIR_ELn register layout]

The meaning of each byte (Attr<n>) is as follows: [Figure: Attr<n> encodings]

The specific meaning of each bit is shown in the image: [Figure: Attr<n> bit meanings]

For example, the memory attribute configuration in OP-TEE is as follows:
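The role of bits[1:0] and bits[4:2] can be sketched as a small decoder. It assumes the VMSAv8-64 rule that the encoding 0b11 means a table entry at levels 0 to 2 and a page entry at level 3, and it ignores the fact that not every level permits a block entry. The names `desc_type`, `classify_descriptor`, and `desc_attr_index` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum desc_type { DESC_INVALID, DESC_BLOCK, DESC_TABLE, DESC_PAGE, DESC_RESERVED };

/* Classify a 64-bit translation table descriptor found at the given level. */
enum desc_type classify_descriptor(uint64_t desc, unsigned level)
{
    assert(level <= 3);
    if ((desc & 0x1) == 0)                       /* bit[0] == 0 */
        return DESC_INVALID;
    if ((desc & 0x2) == 0)                       /* bits[1:0] == 0b01 */
        return level == 3 ? DESC_RESERVED : DESC_BLOCK;
    return level == 3 ? DESC_PAGE : DESC_TABLE;  /* bits[1:0] == 0b11 */
}

/* bits[4:2] of a stage 1 block or page descriptor: index into MAIR_ELn. */
unsigned desc_attr_index(uint64_t desc)
{
    return (unsigned)((desc >> 2) & 0x7);
}
```

A walker would call `classify_descriptor()` at each level and stop at the first block, page, or invalid result; only then is `desc_attr_index()` meaningful.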

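/* Index of each memory type within MAIR_EL1 (selected by descriptor bits[4:2]) */
#define ATTR_DEVICE_INDEX 0x0
#define ATTR_IWBWA_OWBWA_NTR_INDEX 0x1
#define ATTR_INDEX_MASK 0x7

/* Attribute bytes: 0x4 is Device-nGnRE; 0xff is Normal memory, */
/* Inner/Outer Write-Back Non-transient, Read-Allocate Write-Allocate */
#define ATTR_DEVICE (0x4)
#define ATTR_IWBWA_OWBWA_NTR (0xff)

/* Place each attribute byte at its index, then program MAIR_EL1 */
mair = MAIR_ATTR_SET(ATTR_DEVICE, ATTR_DEVICE_INDEX);
mair |= MAIR_ATTR_SET(ATTR_IWBWA_OWBWA_NTR, ATTR_IWBWA_OWBWA_NTR_INDEX);
write_mair_el1(mair);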
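To see what value this configuration produces, the following stand-alone sketch packs attribute bytes into a 64-bit MAIR value and reads them back by index. `mair_set` and `mair_get` are hypothetical helpers written for this example; they are not OP-TEE functions, and the sketch assumes only that attribute n occupies byte n of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Place an 8-bit attribute into byte `index` of a MAIR value. */
uint64_t mair_set(uint8_t attr, unsigned index)
{
    assert(index < 8);
    return (uint64_t)attr << (index * 8);
}

/* Read back the attribute byte that descriptor bits[4:2] == index select. */
uint8_t mair_get(uint64_t mair, unsigned index)
{
    assert(index < 8);
    return (uint8_t)(mair >> (index * 8));
}
```

With the values from the configuration above, `mair_set(0x4, 0) | mair_set(0xff, 1)` evaluates to 0xff04, so a descriptor with attribute index 0 maps device memory and one with index 1 maps write-back cacheable normal memory.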
