Summary
1. Introduction to ARM Cortex-M7 Core
2. High Performance Implementation of ARM Cortex-M7 Core
2.1 Dual Issue 6-stage Instruction Pipeline with Branch Prediction
2.2 Instruction Set Architecture (ISA) Extensions vs. Dual Issue
2.3 Tightly Coupled Memory (TCM)
2.4 Dedicated DMA Access Interface for TCM (AHBS)
2.5 Cache
2.6 High Performance Peripheral Interface (AHBP)
2.7 High Performance Internal Bus Interconnect Matrix (AXIM)
Conclusion

1. Introduction to ARM Cortex-M7 Core
The ARM Cortex-M7 core, or CM7, is codenamed “Pelican”. The pelican is a large-bodied water bird, about 1.7 meters long with a wingspan of up to 3 meters, whose strong wings can easily lift its massive body into the sky. Weighing up to 13 kilograms, it is one of the largest birds in existence.
The codename declares the core's claim to supremacy among high-performance CPU cores for embedded MCU applications.
At the time of writing, the ARM Cortex-M7 is the highest-performance core in the ARM Cortex-M series, achieving up to 2.14 DMIPS/MHz and 5.01 CoreMark/MHz. It is optimized for advanced processes such as 40 nm (40LP), 28 nm (28HT), and 16 nm (16FFC), and can reach high clock frequencies (1.4 GHz @ 16FFC) at low power consumption (18.5 µW/MHz @ 16FFC), allowing for smaller die size and lower hardware implementation cost.
Next, this article walks through how the ARM Cortex-M7 core achieves its high performance, and in particular how dual issue is implemented.

2. High Performance Implementation of ARM Cortex-M7 Core
2.1 Dual Issue 6-stage Instruction Pipeline with Branch Prediction
The ARM Cortex-M7 core employs a dual-issue, 6-stage instruction pipeline with branch prediction, combined with high-performance, low-latency (zero wait cycle) on-chip tightly coupled memory (I-TCM & D-TCM) and instruction and data caches (I-Cache & D-Cache), ensuring that most instructions can be executed in parallel.
Specifically, every instruction passes through the same fetch and decode pipeline units in the first two stages. Instruction dispatch (issue) takes place in the third stage, where the instruction type is identified (memory access [load/store pipeline], arithmetic/logic operation [ALU pipeline], multiply-accumulate operation [MAC pipeline], floating-point operation [FPU pipeline]) so that instructions of different types can execute in parallel.
In addition, the ARM Cortex-M7 core implements two shifters and two ALUs, giving it two ALU pipelines, so that commonly used integer instructions can also be processed in parallel, greatly enhancing the computational performance of the CPU core.

2.2 Instruction Set Architecture (ISA) Extensions vs. Dual Issue
In terms of instructions, the ARM Cortex-M7 core, as the highest-end core of the ARMv7-M architecture, extends the integer, DSP, and floating-point instruction sets of the earlier Cortex-M0/M0+ and Cortex-M3/M4 cores. It supports multi-core exclusive access, doubleword (64-bit) memory read/write access (STRD, LDRD), more powerful DSP instructions, and double-precision floating-point processing (double-precision FPU).
The dual-issue pipeline of the ARM Cortex-M7 core has some limitations:
① Load operations that access memory through an address pointer need one additional instruction cycle;
② Two FPU, two SIMD, or two MAC instructions cannot be issued simultaneously, because the core has only one FPU, one MAC unit, and one SIMD-capable ALU;
③ Multi-byte and doubleword load and store instructions, exclusive access instructions, unaligned load and store instructions, integer and floating-point division instructions, floating-point square root instructions, all double-precision floating-point instructions, and special core instructions (e.g., SVC and the breakpoint instruction BKPT, which trigger core exceptions) cannot be executed in parallel with any other instruction;
④ For memory accesses, 32-bit/16-bit/8-bit accesses to data located behind different D-TCM ports can be dual-issued and executed in parallel, but accesses to system memory (SRAM and Flash) through the D-Cache cannot be dual-issued.
To complement the dual-issue instruction pipeline and sustain core performance, the ARM Cortex-M7 core also provides a dedicated tightly coupled memory interface (TCM interface), with one 64-bit wide instruction TCM (64-bit I-TCM) and two 32-bit wide data TCM (dual 32-bit D-TCM) expansion interfaces, as well as a high-performance bus interface (AXI master) with instruction and data caches (I/D-Cache) as the access path to system memory (SRAM & Flash).

2.3 Tightly Coupled Memory (TCM)
Tightly Coupled Memory (TCM) is a key configuration for the ARM Cortex-M7 core to achieve high performance. Through the dedicated TCM interface, the core supports up to 16 MB of I-TCM and 16 MB of D-TCM, and it provides a dedicated high-speed DMA access interface (AHBS):
I-TCM: Used for storing code, with a 64-bit data width;
D-TCM: Used for storing data, with two 32-bit ports (64 bits in total);
I-TCM and D-TCM use independent buses in a Harvard architecture, and accesses do not go through the chip’s system bus matrix (AXI, or CrossBar), so they incur zero wait cycles and keep the core running at full speed.
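To make limitation ④ from section 2.2 more concrete, here is a minimal sketch of how one might check whether two D-TCM accesses fall on different ports. It assumes the two 32-bit D-TCM ports are word-interleaved, with address bit 2 selecting the port; that interleaving, the helper names `dtcm_port` and `dtcm_can_pair`, and the example addresses are assumptions for illustration, so check the core and device documentation before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: D-TCM is word-interleaved across its two 32-bit ports,
 * with address bit 2 selecting the port (0 or 1). */
static inline unsigned dtcm_port(uint32_t addr)
{
    return (addr >> 2) & 1u;
}

/* Returns true if two aligned accesses of 1, 2 or 4 bytes land on
 * different D-TCM ports, i.e. they are candidates for dual issue.
 * Unaligned accesses are rejected because they cannot be dual-issued. */
static inline bool dtcm_can_pair(uint32_t addr_a, uint32_t addr_b, unsigned size)
{
    assert(size == 1u || size == 2u || size == 4u);
    assert((addr_a % size) == 0u && (addr_b % size) == 0u);
    return dtcm_port(addr_a) != dtcm_port(addr_b);
}
```

Under this assumption, two adjacent 32-bit words (for example at offsets 0x0 and 0x4) sit on different ports and may pair, while words at offsets 0x0 and 0x8 share a port and may not. Laying out data that is read together in adjacent words is therefore one way to give the pipeline a chance to dual-issue the loads.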
In addition, I-TCM and D-TCM support ECC error detection and correction, ensuring the reliability of code and data storage and helping to meet functional safety requirements.

2.4 Dedicated DMA Access Interface for TCM (AHBS)
The ARM Cortex-M7 core supports DMA access to data in TCM through a dedicated AHBS interface. Results that the core has processed in TCM can be moved by DMA to peripherals for transmission (e.g., CAN-FD, FlexRay, Ethernet), and data received by peripherals can likewise be moved into TCM by DMA, greatly improving the data processing bandwidth of the MCU/SoC system while reducing CPU load.
Tips: When DMA accesses TCM via the AHBS interface, it must use the backdoor access address of the TCM. The NXP S32K3 series MCUs, for example, map I-TCM and D-TCM to dedicated backdoor access addresses.

2.5 Cache
To maintain core performance when using slower memory such as SRAM and Flash, the ARM Cortex-M7 is equipped with 4~64 KB of instruction and data cache (I&D-Cache), reducing the number of times the core accesses SRAM and Flash through the system bus matrix.
The I-Cache and D-Cache are independent (Harvard architecture), so instruction fetches and data reads/writes can proceed simultaneously;
The I-Cache and D-Cache are disabled by default and must be enabled before use (usually done in the MCU/SoC startup code);
The caches of the ARM Cortex-M7 core support ECC error detection and correction:
D-Cache: Single-bit correction, double-bit detection (SEC-DED ECC32, 32-bit user data + 7-bit ECC check);
I-Cache: Single-bit correction, double-bit detection (SEC-DED ECC64, 64-bit user data + 8-bit ECC check);
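The 7-bit and 8-bit check widths above match the textbook minimum for a Hamming-based SEC-DED code. The sketch below derives them: the smallest r with 2^r >= m + r + 1 gives single-error correction over m data bits, and one extra overall parity bit adds double-error detection. Whether the core uses exactly this construction is an assumption; the function names are hypothetical and only the bit counts are taken from the article.

```c
#include <assert.h>

/* Minimum number of check bits for a SEC-DED (extended Hamming) code
 * protecting data_bits bits of user data. */
static unsigned secded_check_bits(unsigned data_bits)
{
    assert(data_bits > 0u && data_bits <= 1024u);
    unsigned r = 1u;
    /* Single-error correction needs 2^r >= data_bits + r + 1. */
    while ((1u << r) < data_bits + r + 1u) {
        r++;
    }
    /* One additional overall parity bit enables double-error detection. */
    return r + 1u;
}

/* Storage overhead of the check bits in parts per thousand of the data bits
 * (integer division, so the result is rounded down). */
static unsigned secded_overhead_permille(unsigned data_bits)
{
    return (secded_check_bits(data_bits) * 1000u) / data_bits;
}
```

For 32-bit data this gives 7 check bits and for 64-bit data 8 check bits, so protecting wider words costs proportionally less storage (about 21.8% versus 12.5%).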

2.6 High Performance Peripheral Interface (AHBP)
The ARM Cortex-M7 core also provides a 32-bit AHBP interface for connecting high-performance, low-latency peripherals. Read/write accesses to peripherals attached through this interface do not go through the system bus interconnect matrix, which ensures high performance and low latency.

2.7 High Performance Internal Bus Interconnect Matrix (AXIM)
The ARM Cortex-M7 core is equipped with AXIM, a bus interface based on ARM's 4th generation AMBA bus, which gives the MCU/SoC 64-bit memory and peripheral bus interconnect capabilities. When memory (SRAM and Flash) and peripherals are accessed through AXIM, the core’s MPU can be configured to enable the I-Cache and D-Cache for the accessed regions, accelerating access and further improving system efficiency.

Conclusion
The access latencies of the memory and peripheral interfaces of the ARM Cortex-M7 core introduced in this article compare as follows: TCM accesses and AXI accesses that hit in the cache both have zero wait cycles; AHBP accesses have a fixed latency of 2 cycles; and AXI accesses that miss in the cache, non-cacheable AXI memory accesses, and peripheral accesses through AXI have 4 wait cycles.
That concludes what I wanted to share with you today about the key configurations for achieving high performance in the ARM Cortex-M7 core. Since its introduction in 2014, many semiconductor companies have launched high-performance MCUs/MPUs based on the ARM Cortex-M7 core, such as ST’s STM32F7 and STM32H7 series high-performance general-purpose MCUs and NXP’s i.MX RT series crossover MCUs and S32K3 series high-performance general-purpose automotive MCUs.
In the future, I will introduce the specific TCM and Cache configurations and usage methods of the NXP S32K3xx series high-performance general-purpose automotive MCUs and their impact on performance, so stay tuned.
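As a back-of-the-envelope illustration of the latency figures in the conclusion, the following sketch estimates the average number of wait cycles per cacheable AXI access from the cache hit rate. It assumes zero wait cycles on a hit and a fixed miss penalty, as quoted in this article; the function name is hypothetical, and real devices add clock-ratio and memory-controller effects that this ignores.

```c
#include <assert.h>

/* Average wait cycles per cacheable AXI access, in hundredths of a cycle,
 * given the cache hit rate in percent and the miss penalty in wait cycles.
 * Hits are assumed to cost zero wait cycles. */
static unsigned avg_wait_centicycles(unsigned hit_rate_percent,
                                     unsigned miss_wait_cycles)
{
    assert(hit_rate_percent <= 100u);
    return (100u - hit_rate_percent) * miss_wait_cycles;
}
```

With a 95% hit rate and a 4-cycle miss penalty the result is 20, that is 0.20 wait cycles per access on average, whereas with the caches disabled (hit rate 0%) every access pays the full 4 wait cycles.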
That’s all I wanted to share with you today. I hope it helps and inspires you.
Disclaimer: All original technical articles of this public account are free to read. All views and conclusions in the articles are personal opinions and do not represent the official views of any company. All demo code and programs are for reference and learning only and come without any quality guarantee; anyone who uses them for commercial purposes does so at their own risk. The copyright of all articles of this public account belongs to the author, and any unauthorized reproduction is an infringement that will be pursued.