Understanding the ARM Cortex-M Processor Family

Today, the ARM Cortex-M processor family has eight members, and ARM’s product line includes many other processors besides. Beginners, and even experienced chip designers who are not familiar with ARM processors, can easily confuse these products, because different ARM processors have different instruction sets, system features, and performance. This article examines the key differences among the Cortex-M processors and how they differ from other ARM processors.

1.1 ARM Processor Family

Over the years, ARM has developed a considerable number of different processor products. As shown in Figure 1, ARM processor products are divided into the classic ARM processor series and the newer Cortex processor series. In addition, based on their target applications, ARM processors can be classified into three series.

Application Processors – High-end processors aimed at mobile computing, smartphones, servers, and other markets. These processors operate at very high clock frequencies (over 1GHz) and support memory management units (MMUs) required by complete operating systems like Linux, Android, MS Windows, and mobile operating systems. If the planned product needs to run one of the aforementioned operating systems, you need to choose an ARM application processor.

Real-time Processors – A high-performance processor series aimed at real-time applications, such as hard disk controllers, automotive transmission systems, and wireless communication baseband control. Most real-time processors do not support MMUs, but usually have MPUs, Cache, and other memory features designed for industrial applications. Real-time processors operate at relatively high clock frequencies (for example, 200MHz to >1GHz), with very low response latency. Although real-time processors cannot run the full versions of Linux and Windows operating systems, they support a large number of real-time operating systems (RTOS).

Microcontroller Processors – Microcontroller processors are typically designed to be very small in area and have a high energy efficiency ratio. Typically, these processors have short pipelines and operate at low maximum clock frequencies (although there are such processors on the market that can run above 200MHz). Furthermore, the new Cortex-M processor family is designed to be very easy to use. Therefore, ARM microcontroller processors are very successful and popular in the microcontroller and deeply embedded systems markets.

Figure 1: Processor Family

Table 1 summarizes the main characteristics of the three processor series.

Table 1: Processor Characteristics Summary

1.2 Cortex-M Processor Family

The Cortex-M processor family focuses more on the low-performance end, but these processors are still very powerful compared to many traditional processors used in microcontrollers. For example, the Cortex-M4 and Cortex-M7 processors are used in many high-performance microcontroller products, with maximum clock frequencies reaching 400MHz.

Of course, performance is not the only criterion for choosing a processor. In many applications, low power consumption and cost are key selection criteria. Therefore, the Cortex-M processor family includes various products to meet different needs:

Table 2: Cortex-M Processor Family

Unlike the older classic ARM processors (e.g., ARM7TDMI, ARM9), Cortex-M processors have a very different architecture. For example:

– Support only ARM Thumb® instructions, extended to support both 16-bit and 32-bit instructions in the Thumb-2 version

– Built-in nested vectored interrupt controller responsible for interrupt handling, automatically handling interrupt priorities, interrupt masking, interrupt nesting, and system exception handling.

– Interrupt handling functions can be programmed using standard C language, and the nested interrupt handling mechanism avoids using software to determine which interrupt needs to respond. At the same time, the interrupt response speed is deterministic and has low latency.

– The vector table changes from jump instructions to the starting addresses of interrupt and system exception handling functions.

– The register group and certain programming modes have also changed.

These changes mean that much assembly code written for classic ARM processors needs to be modified, and old projects need to be changed and recompiled to migrate to Cortex-M products. The specific details of software migration are documented in ARM documentation:

ARM Cortex-M3 Processor Software Development for ARM7TDMI Processor Programmers

http://www.arm.com/files/pdf/Cortex-M3_programming_for_ARM7_developers.pdf

1.3 Common Features of Cortex-M Series Processors

There are many similarities between Cortex-M0, M0+, M3, M4, and M7, such as:

– Basic programming model (Section 3.1)

– Nested vectored interrupt controller (NVIC) interrupt response management

– Sleep modes designed in architecture: sleep mode and deep sleep mode (Section 4.1)

– Operating system support features (Section 3.3)

– Debugging features (Section 6)

– Ease of use

For example, the nested vectored interrupt controller (NVIC) is a built-in interrupt controller, shown in Figure 2. It supports many peripheral interrupt inputs, a non-maskable interrupt (NMI) request, an interrupt request from the built-in timer (SysTick) (see Section 3.3), and a number of system exception requests. The NVIC and the exception handling model are described further in Section 3.2. The similarities and differences between the Cortex-M processors are discussed in the rest of this article.

Figure 2: Nested Vectored Interrupt Controller of Cortex-M Processors

2. Cortex-M Processor Instruction Set

2.1 Introduction to the Instruction Set

In most cases, application code can be written in C or other high-level languages. However, a basic understanding of the instruction set supported by Cortex-M processors helps developers choose the right Cortex-M processor for specific applications. The instruction set architecture (ISA) is part of the processor architecture, and Cortex-M processors can be divided into several architecture specifications, listed in Table 3.

Table 3: ARM Architecture Specifications of Cortex-M Processors

All Cortex-M processors support the Thumb instruction set. The entire Thumb instruction set becomes quite large when extended to the Thumb-2 version. However, different Cortex-M processors support different subsets of the Thumb instruction set, as shown in Figure 3.

Figure 3: Instruction Set of Cortex-M Processors

2.2 Cortex-M0/M0+/M1 Instruction Set

Cortex-M0/M0+/M1 processors are based on the ARMv6-M architecture. This is a small instruction set of only 56 instructions, most of them 16-bit, which occupies only a small portion of Figure 3. However, the registers and the data they process are 32 bits wide. For most simple I/O control tasks and general data processing, these instructions are sufficient. Such a small instruction set can be implemented with very few logic gates: the minimum configuration of Cortex-M0 and Cortex-M0+ requires only 12K gates. However, many of the instructions cannot use the high registers (R8 to R12), and the ability to generate immediate values is limited. This is the result of balancing ultra-low power consumption against performance requirements.

2.3 Cortex-M3 Instruction Set

Cortex-M3 processors are based on the ARMv7-M architecture and support a richer instruction set, including many 32-bit instructions that can efficiently use high registers. Additionally, M3 also supports:

Table jump instructions and conditional execution (using IT instructions)

Hardware division instructions

Multiply-accumulate instructions (MAC)

Various bit manipulation instructions

A richer instruction set enhances performance in several ways. For example, 32-bit Thumb instructions support larger ranges of immediate values, jump offsets, and address offsets for memory accesses. The instruction set also supports basic DSP operations (e.g., several MAC instructions that take multiple clock cycles to execute, and saturation instructions). Finally, the 32-bit instructions allow the barrel shifter to be combined with several data processing operations in a single instruction.

Supporting a richer instruction set costs more area and power. In typical microcontrollers, the gate count of Cortex-M3 is more than twice that of Cortex-M0 and Cortex-M0+. However, the processor occupies only a small part of most modern microcontrollers, so the extra area and power are often not significant.

2.4 Cortex-M4 Instruction Set

Cortex-M4 shares many similarities with Cortex-M3: pipeline, programming model. Cortex-M4 supports all the features of Cortex-M3 and additionally supports various DSP application-oriented instructions, such as SIMD, saturation operation instructions, a series of single-cycle MAC instructions (Cortex-M3 only supports a limited number of MAC instructions, which are executed over multiple cycles), and optional single-precision floating-point operation instructions.

Cortex-M4’s SIMD operations can process two 16-bit values or four 8-bit values in parallel. For example, the QADD8 and QADD16 operations are shown in Figure 4:

Figure 4: SIMD Instruction Examples: QADD8 and QADD16

The use of SIMD enables much faster computation on 16-bit and 8-bit data in certain DSP operations, because the calculation can be parallelized. However, in general programming, C compilers are unlikely to utilize the SIMD capability. That is why the typical benchmark results of Cortex-M3 and Cortex-M4 are similar. However, the internal data path of Cortex-M4 differs from that of Cortex-M3, which enables faster operations in a few cases (e.g., single-cycle MAC, and write-back of two registers in a single cycle).
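To make the lane-wise behavior of QADD16 and QADD8 concrete, the following is a minimal host-side C model, not target code. The function names (`model_qadd16`, `model_qadd8`, and the helpers) are hypothetical and chosen only for illustration; the model assumes signed saturating addition performed independently in each 16-bit or 8-bit lane.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of `value` to a 32-bit signed integer. */
static int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1u);
    return (int32_t)(value ^ sign) - (int32_t)sign;
}

/* Add each `bits`-wide lane of a and b independently, saturating every
 * lane to the signed range of that lane width. */
static uint32_t qadd_lanes(uint32_t a, uint32_t b, unsigned bits)
{
    const uint32_t mask = (1u << bits) - 1u;
    const int32_t max = (int32_t)(mask >> 1);   /* 0x7FFF or 0x7F */
    const int32_t min = -max - 1;
    uint32_t result = 0;

    for (unsigned shift = 0; shift < 32u; shift += bits) {
        int32_t sum = sign_extend((a >> shift) & mask, bits)
                    + sign_extend((b >> shift) & mask, bits);
        if (sum > max) sum = max;               /* saturate on overflow */
        if (sum < min) sum = min;               /* saturate on underflow */
        result |= ((uint32_t)sum & mask) << shift;
    }
    return result;
}

/* Model of QADD16: two signed 16-bit saturating additions. */
uint32_t model_qadd16(uint32_t a, uint32_t b) { return qadd_lanes(a, b, 16u); }

/* Model of QADD8: four signed 8-bit saturating additions. */
uint32_t model_qadd8(uint32_t a, uint32_t b) { return qadd_lanes(a, b, 8u); }
```

On a Cortex-M4 each of these calls corresponds to a single instruction, whereas the model needs a loop with several operations per lane, which is where the speed-up for suitable DSP code comes from.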

2.5 Cortex-M7 Instruction Set

Cortex-M7 supports an instruction set similar to that of Cortex-M4, adding:

· Floating-point data architecture based on FPv5, rather than FPv4 of Cortex-M4, so Cortex-M7 supports additional floating-point instructions

· Optional double-precision floating-point data processing instructions

· Support for cache data prefetch instructions (PLD)

The pipeline of Cortex-M7 is very different from that of Cortex-M4. Cortex-M7 has a 6-stage dual-issue pipeline that can achieve higher performance. Most software designed for Cortex-M4 can run directly on Cortex-M7. However, to fully leverage the pipeline differences for optimal performance, software needs to be recompiled, and in many cases, the software needs some minor upgrades to take full advantage of new features like Cache.

2.6 Cortex-M23 Instruction Set

The instruction set of Cortex-M23 is based on the ARMv8-M Baseline sub-specification, which is a superset of ARMv6-M. The extended instructions include:

· Hardware division instructions

· Comparison and jump instructions, 32-bit jump instructions

· Instructions supporting TrustZone security extensions

· Exclusive access instructions (commonly used for semaphore operations)

· 16-bit immediate value generation instructions

· Load acquire and store release instructions (supporting C11)

In some cases, these additional instructions can improve processor performance and are useful for SoC designs with multiple processors (e.g., exclusive accesses help with semaphore handling in multi-processor systems).

2.7 Cortex-M33 Instruction Set

Because the design of Cortex-M33 is highly configurable, some instructions are also optional. For example:

· DSP instructions (supported by Cortex-M4 and Cortex-M7) are optional

· Single-precision floating-point instructions are optional. These instructions are based on FPv5 and include more instructions than Cortex-M4 supports.

Cortex-M33 also supports new instructions introduced by ARMv8-M Mainline:

· Instructions supporting TrustZone security extensions

· Load acquire and store release instructions (supporting C11)

2.8 Summary of Instruction Set Features Comparison

The ARMv6-M, ARMv7-M, and ARMv8-M architectures have many instruction set features, so it is difficult to cover every detail. Table 4 summarizes the key differences.

Table 4: Summary of Instruction Set Features

The most important feature of the Cortex-M processor instruction set is upward compatibility. The instructions of Cortex-M3 are a superset of those of Cortex-M0/M0+/M1. Therefore, in theory, if the memory map is consistent, binary files that run on Cortex-M0/M0+/M1 can run directly on Cortex-M3. The same principle applies to Cortex-M4/M7 and other Cortex-M processors; the instructions supported by Cortex-M0/M0+/M1/M3 can also run on Cortex-M4/M7.

Although Cortex-M0/M0+/M1/M3/M23 processors do not have a floating-point unit option, they can still perform floating-point operations in software. This also applies to products based on Cortex-M4/M7/M33 that are configured without a floating-point unit. In this case, when floating-point numbers are used in the program, the compiler toolchain inserts the required runtime library functions during linking. Software floating-point operations take longer to run and slightly increase code size. However, if floating-point operations are not used frequently, this solution is suitable.

3. Architectural Features

3.1 Programming Model

The programming model of the Cortex-M processor family is highly consistent. For example, all Cortex-M processors support R0 to R15, PSR, CONTROL, and PRIMASK. Two special registers—FAULTMASK and BASEPRI—are only supported by Cortex-M3, Cortex-M4, Cortex-M7, and Cortex-M33; the floating-point register group and the FPSCR (floating-point status and control register) registers are used for optional floating-point unit in Cortex-M4/M7/M33.

Figure 5: Programming Model

The BASEPRI register allows programs to block interrupts and exceptions of a specified priority level and lower. This is important for ARMv7-M, as Cortex-M3, Cortex-M4, Cortex-M7, and Cortex-M33 have a large number of priority levels, while ARMv6-M and ARMv8-M Baseline have only 4 programmable priority levels. FAULTMASK is typically used in complex error handling (see Section 3.5).
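The masking rule for BASEPRI can be sketched as a small host-side C model. The function name `basepri_blocks` is hypothetical, and the model is a simplification: it assumes a full 8-bit priority value and ignores priority grouping and unimplemented low-order bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if an exception with the given programmable priority value
 * is blocked by BASEPRI. A lower value means a higher priority, and
 * BASEPRI == 0 means that BASEPRI masking is disabled. */
bool basepri_blocks(uint8_t basepri, uint8_t priority)
{
    if (basepri == 0u) {
        return false;            /* masking disabled */
    }
    return priority >= basepri;  /* same or lower priority is blocked */
}
```

For example, with BASEPRI set to 0x40, an interrupt at priority 0x80 is held pending, while one at priority 0x20 can still preempt.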

Support for non-privileged execution is optional in ARMv6-M and always available in ARMv7-M and ARMv8-M. On Cortex-M0+ it is optional, while Cortex-M0 and Cortex-M1 do not support it. This means that the CONTROL registers of the various Cortex-M processors differ slightly. The configuration of the FPU also affects the CONTROL register, as shown in Figure 6.

Figure 6: CONTROL Register

Another difference in the programming model is in the details of the PSR (program status register). For all Cortex-M processors, the PSR is divided into the Application Program Status Register (APSR), the Execution Program Status Register (EPSR), and the Interrupt Program Status Register (IPSR). ARMv6-M and ARMv8-M Baseline processors do not support the Q bit of APSR or the ICI/IT bits of EPSR. The ARMv7E-M series (Cortex-M4, Cortex-M7) and ARMv8-M Mainline (Cortex-M33 with the DSP extension configured) support the GE bits. In addition, the exception number field of IPSR is narrower on ARMv6-M processors, as shown in Figure 7.
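As a rough illustration of these fields, the sketch below decodes a combined PSR value on a host machine. The type and function names are hypothetical. The bit positions are taken from the ARM architecture manuals rather than from this article: N, Z, C, V, and Q in bits 31 to 27, GE in bits 19 to 16, and an exception number field of 6 bits on ARMv6-M versus 9 bits on ARMv7-M.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected fields of the combined program status register. */
typedef struct {
    bool n, z, c, v;            /* APSR condition flags */
    bool q;                     /* APSR saturation flag (not on ARMv6-M) */
    uint8_t ge;                 /* APSR GE[3:0], DSP extension only */
    uint16_t exception_number;  /* IPSR: 0 means Thread mode */
} xpsr_fields_t;

/* Decode a raw xPSR value. `armv6m` selects the narrower IPSR field. */
xpsr_fields_t decode_xpsr(uint32_t xpsr, bool armv6m)
{
    xpsr_fields_t f;
    f.n = ((xpsr >> 31) & 1u) != 0u;
    f.z = ((xpsr >> 30) & 1u) != 0u;
    f.c = ((xpsr >> 29) & 1u) != 0u;
    f.v = ((xpsr >> 28) & 1u) != 0u;
    f.q = ((xpsr >> 27) & 1u) != 0u;
    f.ge = (uint8_t)((xpsr >> 16) & 0xFu);
    f.exception_number = (uint16_t)(xpsr & (armv6m ? 0x3Fu : 0x1FFu));
    return f;
}
```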

Figure 7: PSR Differences

Please note that the programming model of Cortex-M is different from classic ARM processors like ARM7TDMI. In addition to the differences in the register groups, the definitions of “mode” and “state” in classic ARM processors are also different from those in Cortex-M. Cortex-M has only two modes: Thread mode and Handler mode, and Cortex-M processors always operate in Thumb state (ARM instructions are not supported).

3.2 Exception Handling Model and Nested Vectored Interrupt Controller NVIC

All Cortex-M processors include the NVIC and use the same exception handling model. If an exception or interrupt occurs, has a higher priority than the currently running level, and is not masked by any of the interrupt mask registers, the processor accepts it and pushes a set of registers onto the current stack. With this stacking mechanism, interrupt handlers can be written as ordinary C functions, and many small interrupt handlers can start work immediately without extra stack handling overhead.
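The registers pushed on exception entry can be pictured as a C structure. This is a sketch only: the type name `exception_frame_t` and the helper are hypothetical, the register list comes from the ARM architecture manuals rather than from this article, and the layout covers only the basic eight-word frame without any floating-point context.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic frame stacked by hardware on exception entry, lowest address
 * first. These are the registers that a C function is allowed to
 * overwrite, which is why a handler can be a plain C function. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address at which execution resumes */
    uint32_t xpsr;  /* program status of the interrupted code */
} exception_frame_t;

/* Read the resume address from a frame, e.g. inside a fault handler. */
static inline uint32_t frame_resume_address(const exception_frame_t *frame)
{
    return frame->pc;
}
```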

Some interrupts and system exceptions used by ARMv7-M/ARMv8-M Mainline series processors are not supported by ARMv6-M/ARMv8-M Baseline products, as shown in Figure 8. For example, the number of interrupts for Cortex-M0, M0+, and M1 is limited to 32, there is no debug monitor exception, and the only fault exception is HardFault (for error handling details, see Section 3.5). In contrast, Cortex-M23, Cortex-M3, Cortex-M4, and Cortex-M7 processors can support up to 240 peripheral device interrupts. Cortex-M33 supports up to 480 interrupts.

Another difference is the number of available priority levels:

ARMv6-M architecture – ARMv6-M supports 2 fixed levels (NMI and HardFault) and 4 programmable levels (represented by two bits of each priority level register). This is sufficient for most microcontroller applications.

ARMv7-M architecture – ARMv7-M series processors can configure the number of programmable priority levels from 8 levels (3 bits) to 256 levels (8 bits), depending on area constraints. ARMv7-M processors also have a feature called interrupt priority grouping, which can further divide interrupt priority registers into group priorities and sub-priorities, allowing detailed specification of preemptive priority behavior.

ARMv8-M Baseline – Similar to ARMv6-M, Cortex-M23 has 2-bit priority level registers. With the optional TrustZone security extension, secure software can shift the priority levels of non-secure interrupts into the lower half of the priority range, ensuring that selected secure interrupts/exceptions always have a higher priority than non-secure ones.

ARMv8-M Mainline – Similar to ARMv7-M. It can support 8 to 256 interrupt priority levels and interrupt priority grouping. It also supports the priority adjustment feature of ARMv8-M Baseline.
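The priority encodings described above can be sketched in host-side C. The names are hypothetical. The model assumes an 8-bit priority field whose split point is given by a PRIGROUP value from 0 to 7, as on ARMv7-M, and a 2-bit priority stored in the top two bits of the field, as on ARMv6-M.

```c
#include <stdint.h>

/* A priority value split into its preemption (group) part and its
 * sub-priority part. */
typedef struct {
    uint8_t group;  /* decides whether one exception can preempt another */
    uint8_t sub;    /* orders pending exceptions within the same group */
} split_priority_t;

/* Split an 8-bit priority value: bits [7:prigroup+1] form the group
 * priority and bits [prigroup:0] form the sub-priority. */
split_priority_t split_priority(uint8_t priority, unsigned prigroup)
{
    unsigned sub_bits = prigroup + 1u;
    split_priority_t p;
    p.group = (uint8_t)(priority >> sub_bits);
    p.sub = (uint8_t)(priority & ((1u << sub_bits) - 1u));
    return p;
}

/* Encode one of the 4 programmable levels (0 = highest) of a 2-bit
 * implementation into the top two bits of the 8-bit priority field. */
uint8_t encode_2bit_priority(unsigned level)
{
    return (uint8_t)((level & 3u) << 6);
}
```

With PRIGROUP set to 3, for instance, the value 0xA5 yields group priority 0xA and sub-priority 0x5, so two interrupts at 0xA5 and 0xA1 cannot preempt each other.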

Figure 8: Types of Exceptions and Interrupts in Cortex-M Processors

All Cortex-M processors rely on the vector table for exception handling. The vector table holds the starting addresses of the exception handling functions (as shown in Figure 8). The starting address of the vector table is determined by a register called the Vector Table Offset Register (VTOR).

· Cortex-M0+, Cortex-M3, and Cortex-M4: by default, the vector table is located at the start of the memory map (address 0x0).

· Cortex-M7, Cortex-M23, and Cortex-M33: the initial value of VTOR is defined by chip designers. Cortex-M23 and Cortex-M33 processors can have two separate vector tables, one for Secure and one for Non-secure exceptions/interrupts.

· Cortex-M0 and Cortex-M1: VTOR is not implemented, and the starting address of the vector table is always 0x00000000.

VTOR is optional in Cortex-M0+ and Cortex-M23 processors. If VTOR is implemented, the starting address of the vector table can be changed by setting VTOR, which is useful in the following situations:

Relocating the vector table to SRAM to dynamically change the entry point of the exception handling function

Relocating the vector table to SRAM for faster vector reading (if flash memory is slow)

Relocating the vector table to different locations in ROM (or Flash), allowing different exception handlers for different program execution stages
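The first use case, copying the table to SRAM and replacing one handler at run time, can be sketched as a host-side C model. Everything here is illustrative: `vtor_model` stands in for the real VTOR register, the table holds only 16 entries, and all function names are hypothetical. On real hardware the table would also need to meet the alignment rules of the device, and entry 0 holds the initial stack pointer rather than a handler.

```c
#include <stdint.h>
#include <string.h>

#define VECTOR_COUNT 16u

typedef void (*handler_t)(void);

static int last_handler;  /* records which handler ran, for illustration */

void default_handler(void)     { last_handler = 1; }
void new_systick_handler(void) { last_handler = 2; }

/* Original table, standing in for the one placed in flash. */
static const handler_t flash_vectors[VECTOR_COUNT] = {
    [3]  = default_handler,   /* HardFault */
    [15] = default_handler,   /* SysTick */
};

/* Writable copy, standing in for a table in SRAM. */
static handler_t ram_vectors[VECTOR_COUNT];

/* Stand-in for VTOR: points at the table currently in use. */
static const handler_t *vtor_model = flash_vectors;

/* Copy the table to RAM, replace one entry, then switch tables. */
void relocate_and_patch(unsigned exception_number, handler_t handler)
{
    memcpy(ram_vectors, flash_vectors, sizeof ram_vectors);
    ram_vectors[exception_number] = handler;
    vtor_model = ram_vectors;  /* on hardware: write the address to VTOR */
}

/* Model of the processor fetching a vector and running the handler. */
int dispatch(unsigned exception_number)
{
    last_handler = 0;
    if (vtor_model[exception_number] != NULL) {
        vtor_model[exception_number]();
    }
    return last_handler;
}
```

The copy is completed before the table pointer is switched, so an exception taken in between would still find a valid table.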

The NVIC programming model between different Cortex-M processors also has additional differences. The differences are summarized in Table 5:

Table 5: Differences in NVIC Programming Model and Features

In most cases, the interrupt control features of the NVIC are accessed through APIs provided by CMSIS-CORE, which are included in the device driver libraries supplied by microcontroller manufacturers. For Cortex-M3/M4/M7/M23/M33 processors, the priority of an interrupt can be changed even while it is enabled. ARMv6-M processors do not support dynamic priority changes; to change the priority of an interrupt, you need to disable the interrupt temporarily.

3.3 Operating System Support Features

The Cortex-M processor architecture was designed with operating system support in mind. The features for operating systems include:

Shadow stack pointer

System service call (SVC) and pendable service call (PendSV) exceptions

SysTick – A 24-bit decrementing timer that generates periodic exceptions for operating system timing and task management

Cortex-M0+/M3/M4/M7/M23/M33 support non-privileged execution and the memory protection unit (MPU)

The system service call (SVC) exception is triggered by the SVC instruction, which allows application tasks running in the non-privileged state to request privileged operating system services. The PendSV exception is very helpful for scheduling non-critical operations such as context switching in the operating system.

To fit Cortex-M1 into very small FPGA devices, all features used to support operating systems are optional for Cortex-M1. For Cortex-M0, Cortex-M0+, and Cortex-M23 processors, the SysTick timer is optional.
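Because SysTick is a 24-bit down-counter, the tick rate an operating system can request is bounded by the clock frequency. The sketch below computes the reload value on a host machine; the function name and the clock figures used with it are hypothetical. It assumes the counter runs from the reload value down to zero, so one period lasts reload + 1 clock cycles.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu  /* largest 24-bit value */

/* Compute the reload value for a given counter clock and tick rate.
 * Returns false if the tick rate is zero, faster than the clock, or
 * needs a period that does not fit in 24 bits. */
bool systick_reload_for(uint32_t clock_hz, uint32_t tick_hz, uint32_t *reload)
{
    if (tick_hz == 0u) {
        return false;
    }
    uint32_t cycles = clock_hz / tick_hz;   /* clock cycles per tick */
    if (cycles == 0u || cycles - 1u > SYSTICK_MAX_RELOAD) {
        return false;
    }
    *reload = cycles - 1u;
    return true;
}
```

For a 48 MHz clock and a 1 kHz operating system tick, this gives a reload value of 47999. A 10 Hz tick from a 400 MHz clock would need 40,000,000 cycles per period and is rejected, so such a system would need a slower clock source or a software divider.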

Generally, all Cortex-M processors support operating systems. Applications running on Cortex-M0+, Cortex-M3, Cortex-M4, Cortex-M7, Cortex-M23, and Cortex-M33 can run in the non-privileged execution state and can also use the optional memory protection unit (MPU) to prevent illegal memory accesses. This enhances the robustness of the system.

3.4 TrustZone Security Extensions

In recent years, the Internet of Things (IoT) has become a hot topic among embedded system developers. IoT system products have become more complex, and the pressure of time to market is increasing day by day. Embedded system products need better solutions to ensure system security while also being convenient for software developers. The traditional solution is to divide the software into privileged and non-privileged parts, with privileged software using MPU to prevent non-privileged applications from accessing critical system resources, including security-sensitive information. This solution is suitable for some IoT systems, but in some cases, only two layers of division are not enough. Especially for systems that contain many complex privileged software components, a flaw in privileged code can lead to hackers completely controlling the system.

The ARMv8-M architecture includes a security extension called TrustZone, which introduces an orthogonal division of secure and non-secure states.

· Ordinary applications are in non-secure state

· Software components and security-related resources (e.g., secure storage, cryptographic accelerators, true random number generators (TRNG)) are in secure state.

Figure 9: Isolation of Secure and Non-secure States

Software in non-secure state can only access non-secure state storage and peripheral devices, while secure software can access all resources in both states.

With this solution, software developers can develop applications in non-secure environments in the same way as before. At the same time, they can use secure communication software libraries provided by chip manufacturers to perform secure IoT connections. Moreover, even if privileged programs running in non-secure environments have vulnerabilities, the TrustZone security mechanism can prevent hackers from controlling the entire device, limiting the impact of the attack, and enabling remote recovery of the system. Additionally, the ARMv8-M architecture also introduces stack boundary checks and enhanced MPU designs, promoting the adoption of additional security measures.

The security architecture definition also extends to the system level, where each interrupt can be set to secure or non-secure attributes. Interrupt exception handlers will also automatically save and restore register data in secure environments to prevent leakage of security information. Thus, the TrustZone security extension allows the system to support the demands of real-time systems, providing a solid security foundation for IoT applications and making it easier for software development to create applications on this technology.

The TrustZone module is optional for Cortex-M23 and Cortex-M33 processors. For more information on ARMv8-M TrustZone, please refer to The Next Steps in the Evolution of Embedded Processors for the Smart Connected Era. For more TrustZone resources, please visit the “TrustZone for ARMv8-M Community” on community.arm.com.

3.5 Error Handling

One distinction between ARM processors and microcontrollers of other architectures is error handling capability. When an error is detected, an error exception handler is triggered to perform appropriate processing. Situations that can trigger errors may include:

Undefined instructions (e.g., Flash memory corruption)

Accessing illegal address space (e.g., after stack pointer corruption) or an access that violates MPU permissions

Illegal operations (e.g., when the processor attempts to trigger SVC exceptions in interrupts with higher priorities than SVC)

The error handling mechanism allows embedded systems to respond more quickly to various issues. Otherwise, if the system crashes, the watchdog timer would take a long time to restart the system.

In the ARMv6-M architecture, all error events trigger the HardFault handler, which has a priority of -1 (higher than all programmable exceptions but lower than non-maskable interrupts NMI). All error events are considered unrecoverable, and we typically only run error reporting in the HardFault handler and then trigger an automatic reset.

In the ARMv8-M Baseline architecture, similar to ARMv6-M, there is only one error exception (HardFault). However, the priority of the HardFault in ARMv8-M Baseline can be -1 or -3 when the TrustZone security extension is implemented.

ARMv7-M and ARMv8-M Mainline products have several configurable error exceptions in addition to HardFault:

MemManage (memory management fault)

BusFault (the bus returned an error response)

UsageFault (undefined instruction or other illegal operations)

SecureFault (only supported by ARMv8-M Mainline products, handling illegal operations in TrustZone security extension)

The priorities of these exceptions can be programmed to change and can be individually enabled and disabled. If needed, they can also use the FAULTMASK register to elevate their priorities to the same level as HardFault. ARMv7-M and ARMv8-M Mainline products also have several error status registers that can provide clues about the triggering error exception event and the error address registers to determine the access address that triggered this error exception, making debugging easier.
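As a sketch of how such status registers are read, the code below splits a Configurable Fault Status Register (CFSR) value into its three parts on a host machine. The register layout and flag positions are taken from the ARMv7-M architecture manual, not from this article, and the type and function names are hypothetical. Only a few of the defined flags are listed.

```c
#include <stdint.h>

/* The three sub-registers packed into the 32-bit CFSR. */
typedef struct {
    uint8_t mmfsr;   /* MemManage fault status, CFSR bits [7:0] */
    uint8_t bfsr;    /* BusFault status, CFSR bits [15:8] */
    uint16_t ufsr;   /* UsageFault status, CFSR bits [31:16] */
} fault_status_t;

/* A selection of flags, relative to each sub-register. */
#define MMFSR_IACCVIOL   (1u << 0)  /* instruction access violation */
#define MMFSR_DACCVIOL   (1u << 1)  /* data access violation */
#define MMFSR_MMARVALID  (1u << 7)  /* fault address register is valid */
#define BFSR_PRECISERR   (1u << 1)  /* precise data bus error */
#define BFSR_BFARVALID   (1u << 7)  /* fault address register is valid */
#define UFSR_UNDEFINSTR  (1u << 0)  /* undefined instruction */
#define UFSR_DIVBYZERO   (1u << 9)  /* divide by zero (if trap enabled) */

/* Split a raw CFSR value into its sub-registers. */
fault_status_t split_cfsr(uint32_t cfsr)
{
    fault_status_t s;
    s.mmfsr = (uint8_t)(cfsr & 0xFFu);
    s.bfsr = (uint8_t)((cfsr >> 8) & 0xFFu);
    s.ufsr = (uint16_t)(cfsr >> 16);
    return s;
}
```

A fault handler would typically read the register, decode it like this, and consult the matching address register only when the corresponding valid flag is set.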

Additional error handlers in the ARMv7-M and ARMv8-M Mainline product sub-specifications provide flexible error handling capabilities, and error status registers make it easier to locate and debug error events. Many commercial development kits’ debuggers have embedded functionality to diagnose error events using error status registers. Furthermore, error handlers can perform some repair work at runtime.

Table 6: Summary of Error Handling Features Comparison

4. System Features

4.1 Low Power Consumption

Low power consumption is a key advantage of Cortex-M processors. Low power consumption is an integral part of its architecture:

WFI and WFE instructions

Architecture-level sleep mode definitions

In addition, Cortex-M supports many other low-power features:

Sleep and deep sleep modes: features supported at the architecture level, which can be further extended through device-specific power management registers.

Sleep-on-exit mode: a low-power technique for interrupt-driven applications. When enabled, after the exception handler ends and there are no other pending exception interrupts to be processed, the processor automatically enters sleep mode. This avoids extra instruction execution in thread mode, thus saving power and reducing unnecessary stack read/write operations.

Wake-up Interrupt Controller (WIC): an optional feature that detects interrupt conditions by a small module independent of the processor in specific low-power states. For example, in the state-retaining power management (SRPG) design, when the processor is powered off.

Clock gating and architectural clock gating: saving power by turning off the clock inputs of the processor’s registers or sub-modules

All these features are supported by Cortex-M0, Cortex-M0+, Cortex-M3, Cortex-M4, Cortex-M7, Cortex-M23, and Cortex-M33. In addition, various low-power design techniques are used to reduce processor power consumption.
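The sleep options listed above are selected through bits in the System Control Register. The following host-side sketch shows only the bit manipulation; the function name is hypothetical, and the bit positions (SLEEPONEXIT in bit 1, SLEEPDEEP in bit 2) come from the ARM architecture manuals rather than from this article. On a real device the result would be written to the register, and the processor would then enter sleep by executing WFI or WFE.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCR_SLEEPONEXIT (1u << 1)  /* sleep again when the last handler returns */
#define SCR_SLEEPDEEP   (1u << 2)  /* select deep sleep instead of sleep */

/* Return a new System Control Register value with the two sleep options
 * set as requested, leaving all other bits unchanged. */
uint32_t scr_with_sleep_options(uint32_t scr, bool sleep_on_exit, bool deep_sleep)
{
    scr &= ~(SCR_SLEEPONEXIT | SCR_SLEEPDEEP);
    if (sleep_on_exit) {
        scr |= SCR_SLEEPONEXIT;
    }
    if (deep_sleep) {
        scr |= SCR_SLEEPDEEP;
    }
    return scr;
}
```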

Because they contain less circuitry, Cortex-M0 and Cortex-M0+ processors consume less power than Cortex-M3, Cortex-M4, and Cortex-M7. In addition, Cortex-M0+ has additional optimizations that reduce program memory accesses (e.g., around branches) to keep system-level power consumption low.

Cortex-M23 is not as small as Cortex-M0 and Cortex-M0+, but under the same configuration it is still as energy efficient as Cortex-M0+.

Due to better performance and low power optimizations, Cortex-M33 has better energy efficiency than Cortex-M4 under the same configuration.

4.2 Bit-band Feature

Cortex-M3 and Cortex-M4 processors support an optional feature called bit-banding, which makes two 1MB regions bit-addressable through bit-band alias addresses (one region starting at address 0x20000000 in SRAM space, the other starting at address 0x40000000 in peripheral space). Cortex-M0, M0+, and Cortex-M1 do not support bit-banding, but the functionality can be implemented at the system level using bus-level components in the ARM Cortex-M System Design Kit (CMSDK). Cortex-M7 does not support bit-banding, because the M7’s cache cannot be used together with it (the cache controller does not know about the alias addresses).
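The mapping from a bit in a bit-band region to its alias word can be sketched as follows. The function name is hypothetical. The alias base addresses (0x22000000 and 0x42000000) and the formula are taken from the Cortex-M3/M4 documentation rather than from this article: each bit is given its own 32-bit word in the alias region.

```c
#include <stdint.h>

#define SRAM_BITBAND_BASE    0x20000000u
#define SRAM_ALIAS_BASE      0x22000000u
#define PERIPH_BITBAND_BASE  0x40000000u
#define PERIPH_ALIAS_BASE    0x42000000u
#define BITBAND_REGION_SIZE  0x00100000u  /* 1MB per region */

/* Return the alias word address for bit `bit` (0 to 7) of the byte at
 * `byte_addr`, or 0 if the address is outside both bit-band regions. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t region_base;
    uint32_t alias_base;

    if (bit > 7u) {
        return 0u;
    }
    if (byte_addr - SRAM_BITBAND_BASE < BITBAND_REGION_SIZE) {
        region_base = SRAM_BITBAND_BASE;
        alias_base = SRAM_ALIAS_BASE;
    } else if (byte_addr - PERIPH_BITBAND_BASE < BITBAND_REGION_SIZE) {
        region_base = PERIPH_BITBAND_BASE;
        alias_base = PERIPH_ALIAS_BASE;
    } else {
        return 0u;
    }
    /* 8 bits per byte, 4 bytes per alias word: 32 alias bytes per byte. */
    return alias_base + (byte_addr - region_base) * 32u + bit * 4u;
}
```

Writing 1 or 0 to the returned address on a device with bit-banding sets or clears just that bit, without a read-modify-write sequence in software.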

ARMv8-M with TrustZone does not support bit-banding, because the two different addresses required for a bit-band alias may be in different security domains. For these systems, bit operations on peripheral data can instead be handled at the peripheral level (e.g., by adding bit set and clear registers).

4.3 Memory Protection Unit (MPU)

Except for Cortex-M0 and Cortex-M1, Cortex-M processors have an optional MPU that defines memory regions together with their access permissions and attributes. Embedded systems running a real-time operating system can define access permissions and memory configurations for each task, ensuring that a task cannot corrupt the address space of other tasks or of the operating system kernel. Cortex-M0+, Cortex-M3, and Cortex-M4 have 8 programmable regions and a very similar programming model. The main difference is that the MPU of Cortex-M3/M4 allows two levels of memory attributes (e.g., system-level cache types), while Cortex-M0+ supports only one level. Cortex-M7’s MPU can be configured to support 8 or 16 regions with two levels of memory attributes. Cortex-M0 and Cortex-M1 do not support an MPU.

Cortex-M23 and Cortex-M33 also support the MPU option, and if the TrustZone security extension is implemented, there can be up to two MPUs (one for secure software and another for non-secure software).

4.4 Single Cycle I/O Interface

The single-cycle I/O interface is a unique feature of the Cortex-M0+ processor that allows it to perform I/O control tasks quickly. The bus interfaces of most Cortex-M processors are based on the AHB Lite or AHB 5 protocol. These are pipelined bus protocols that allow high clock frequencies, but pipelining means that each transfer takes two clock cycles. The single-cycle I/O interface adds a simple non-pipelined bus interface that connects to a small set of device-specific peripherals such as GPIO (general-purpose input/output). Combined with the naturally low branch penalty of Cortex-M0+ (it has only two pipeline stages), this lets many I/O control operations run faster than on most other microcontroller architectures.

5. Performance Considerations

5.1 General Data Processing Capability

In the general microcontroller market, benchmark data is often used to measure the performance of microcontrollers. Table 7 shows the performance data of commonly used benchmark tests for Cortex-M processors:

Table 7: Performance Scores of Common Benchmark Tests for Cortex-M Processors

(Source: CoreMark.org website and ARM website)

Regarding Dhrystone, it is worth noting that the figures here were obtained by compiling the official source code without inlining or multi-file compilation (that is, they are official scores). Many microcontroller manufacturers, however, quote figures from fully optimized Dhrystone builds.
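Dhrystone results are usually quoted as DMIPS/MHz. As a rough sketch of how that figure is derived, the function below divides the measured Dhrystones per second by 1757 (the score of the VAX 11/780 reference machine, which defines 1 DMIPS) and then by the clock frequency. The function name and the sample numbers in the usage note are illustrative, not measured data.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, the machine that defines 1 DMIPS. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Convert a raw Dhrystone result into DMIPS/MHz. */
double dmips_per_mhz(double dhrystones_per_second, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND / clock_mhz;
}
```

A device that completes 175,700 Dhrystone loops per second at 100MHz would therefore score about 1.0 DMIPS/MHz. Normalizing by frequency is what makes scores from differently clocked devices comparable, but it does not remove the effect of compiler options described above.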

However, benchmark results may not accurately reflect the performance your application can achieve. For example, the effects of the single-cycle I/O interface, of SIMD acceleration in DSP applications, and of the FPU in Cortex-M4/M7 do not show up in these figures.

Generally, Cortex-M3 and Cortex-M4 provide higher data processing performance for the following reasons:

Richer instruction set

Harvard bus architecture

Write buffer (single-cycle write operations)

Speculative fetching of branch targets

Cortex-M33 is also based on a Harvard bus architecture with a rich instruction set. However, unlike Cortex-M3 and Cortex-M4, the pipeline of Cortex-M33 is a redesigned efficient pipeline that supports limited dual-issue (up to two instructions can be executed in one clock cycle).

Cortex-M7 delivers even higher performance because it has a six-stage dual-issue pipeline and supports branch prediction. It also achieves higher system-level performance through its instruction and data caches and tightly coupled memories, which avoid performance penalties even when the main memory is slow (e.g., embedded Flash).

However, certain I/O-intensive tasks may run faster on Cortex-M0+ due to:

Shorter pipeline (branches take only two cycles)

Single-cycle I/O ports

Of course, device-related factors also play a role. For example, system-level design and memory speed can also affect system performance.

Your own application is often the best benchmark you need. A CoreMark score that is double that of another processor does not mean that executing your application will also be twice as fast. For I/O-intensive applications, device-related system-level architecture has a significant impact on performance.

5.2 Interrupt Latency

Another performance-related metric is interrupt latency. This is typically measured by the number of clock cycles from the interrupt request to the execution of the first instruction of the interrupt service routine. Table 8 lists the interrupt latency comparisons of Cortex-M processors under zero-wait memory system conditions.

Table 8: Interrupt Latency Comparison Under Zero-Wait Memory System Conditions

In practice, the actual interrupt latency is affected by the wait states of the memory system. For example, many microcontrollers running at frequencies above 100MHz are paired with much slower Flash memory (e.g., 30 to 50MHz). Although Flash access acceleration hardware is used to improve performance, interrupt latency is still affected by the wait states of the Flash memory system. It is therefore entirely possible for a Cortex-M0/M0+ system with a zero-wait-state memory system to have shorter interrupt latency than a Cortex-M3/M4/M7 system.
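The effect of wait states can be illustrated with a very rough calculation. The model below simply adds an assumed number of stall cycles to the zero-wait-state latency and converts the total to nanoseconds. The stall count is a placeholder you would have to measure or estimate for a real device, and the cycle counts in the usage note are illustrative rather than taken from Table 8.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate interrupt latency in nanoseconds.
 *   base_cycles       - latency of the core with zero-wait-state memory
 *   extra_wait_cycles - ASSUMED total stall cycles caused by slow memory
 *                       during vector fetch, stacking, and first ISR fetch
 *   clock_mhz         - core clock frequency in MHz
 */
uint32_t irq_latency_ns(uint32_t base_cycles,
                        uint32_t extra_wait_cycles,
                        uint32_t clock_mhz)
{
    assert(clock_mhz > 0u);
    return (base_cycles + extra_wait_cycles) * 1000u / clock_mhz;
}
```

With both cores clocked at 100MHz, a core that needs 16 cycles and sees no stalls responds in 160ns, while a core that needs 12 cycles but suffers 8 stall cycles takes 200ns. The nominally faster core loses once memory wait states are counted.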

When evaluating performance, do not forget to consider the execution time of the interrupt handlers. Some 8-bit or 16-bit processor architectures may have very short interrupt latencies but take several times as many clock cycles to complete the interrupt handling. What matters in practice is the combination of a short interrupt response time and a short interrupt handling time.

6. Debugging and Tracing Features

6.1 Introduction to Debugging and Tracing Features

The debugging and tracing features differ in several ways between Cortex-M processors. These differences are summarized in Table 9.

Table 9: Comparison of Debugging and Tracing Features

The debugging architecture of Cortex-M processors is designed based on the ARM CoreSight debugging architecture, which is a highly extensible architecture that supports multi-processor systems.

Table 9 lists the features of typical designs. Under the CoreSight architecture, the debugging and tracing interface modules are separate from the processor, so the debugging and tracing connections of the device you use may differ from those in Table 9. Additional CoreSight debugging components may also be added to enhance debugging features.

6.2 Debug Connections

The debugging interface allows debuggers to:

– Access control registers for debugging and tracing features.

– Access memory space. For Cortex-M series processors, memory space access can be executed even when the processor is running. This is called real-time memory access.

– Access processor core registers. This is possible only while the processor is halted.

– Access the trace history generated by the micro-trace buffer (MTB) in Cortex-M0+ processors.

In addition, the debugging interface is also used for:

– Flash programming

Cortex-M series processors can use either the traditional JTAG interface, which needs 4 to 5 pins (TDI, TDO, TCK, TMS, and optional nTRST), or the newer Serial Wire Debug protocol interface, which needs only two pins and is well suited to devices with a limited number of pins.

Figure 10: Serial wire or JTAG debugging interface allows access to processor’s debug features and memory space including peripherals

The Serial Wire Debug protocol can handle all the debug features supported by JTAG and adds parity error detection. It is widely adopted by ARM tool vendors, and many debug adapters support both protocols, with the serial wire signals sharing the TCK and TMS pins on the debug connector.
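As an illustration of how parity appears in the serial wire protocol, the sketch below builds the 8-bit request header that starts every transfer. It assumes the packet layout from the ARM Debug Interface specification (start, APnDP, RnW, two address bits, parity, stop, park, sent least significant bit first); the function name swd_request is invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an SWD request header.
 *   bit 0: Start  = 1
 *   bit 1: APnDP  (1 = access port, 0 = debug port)
 *   bit 2: RnW    (1 = read, 0 = write)
 *   bit 3: A[2]   register address bit 2
 *   bit 4: A[3]   register address bit 3
 *   bit 5: Parity (even parity over APnDP, RnW, A[2], A[3])
 *   bit 6: Stop   = 0
 *   bit 7: Park   = 1
 */
uint8_t swd_request(int ap_access, int read, uint8_t reg_addr)
{
    unsigned apndp = ap_access ? 1u : 0u;
    unsigned rnw = read ? 1u : 0u;
    unsigned a2;
    unsigned a3;
    unsigned parity;

    /* Only the word-aligned register addresses 0x0, 0x4, 0x8, 0xC are valid. */
    assert((reg_addr & 0xF3u) == 0u);

    a2 = (reg_addr >> 2) & 1u;
    a3 = (reg_addr >> 3) & 1u;
    parity = (apndp + rnw + a2 + a3) & 1u;

    return (uint8_t)(1u | (apndp << 1) | (rnw << 2) | (a2 << 3)
                        | (a3 << 4) | (parity << 5) | (1u << 7));
}
```

A read of debug port register 0x0 produces the header 0xA5. The target checks the parity bit and does not acknowledge a request whose parity is wrong; the data phase of a transfer carries its own parity bit as well.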

6.3 Trace Interface

The trace interface allows debuggers to collect information about program execution in real time (with minimal delay). The information can come from the embedded trace macrocell (ETM) supported by Cortex-M3/M4/M7/M33, which generates program flow information (instruction tracing), from the data watchpoint and trace (DWT) unit, which generates data, event, and performance analysis information, or from the instrumentation trace macrocell (ITM), which generates software-controlled trace messages.

There are two types of trace interfaces available:

– Trace port – multiple data lines plus a clock signal line. It has a higher trace bandwidth than SWV and supports instruction tracing in addition to all the trace types that SWV supports. On devices with Cortex-M3/M4/M7 or Cortex-M33, the trace port usually has 4 data lines and one clock line (Figure 11).

– Serial wire viewer (SWV) – a single-pin trace interface that can selectively support data tracing, event tracing, performance analysis, and instrumentation tracing (Figure 12).

Figure 11: Trace port supports instruction tracing and other tracing functions with necessary bandwidth

The trace interface makes it possible to obtain a wealth of useful information while the processor is running. For example, the embedded trace macrocell (ETM) can capture the history of instruction execution, and the instrumentation trace macrocell (ITM) allows software to generate messages (e.g., via printf) that are captured through the trace interface. Additionally, Cortex-M3/M4/M7/M33 include a data watchpoint and trace (DWT) module that provides:

– Optional data tracing: Information about memory addresses (e.g., a combination of address, data, and timestamp) can be collected when the processor accesses this address

– Performance analysis tracing: The number of clock cycles used by the CPU for different operational tasks (e.g., memory access, sleep)

– Event tracing: Provides the history and duration of the interrupts/exceptions served by the processor

Figure 12: Serial wire viewer provides a low-cost, low-pin count tracing solution

These tracing features are widely supported by tool vendors, and the collected information can be visualized in various ways. For example, the data obtained by DWT can be displayed as waveforms in the Keil µVision debugger (part of the Keil microcontroller development tools), as shown in Figure 13.

Figure 13: Logic analyzer of Keil µVision debugger

Although Cortex-M0 and Cortex-M0+ do not support trace interfaces, Cortex-M0+ supports a feature called the micro-trace buffer (MTB, Figure 14). MTB allows users to allocate a small portion of system SRAM as a buffer for storing instruction trace, usually configured as a circular buffer, so that the most recent instruction execution history can be captured and displayed in the debugger.

This MTB tracing feature is also supported by Cortex-M23 and Cortex-M33.

Figure 14: MTB of Cortex-M0+/M23/M33 provides a low-cost instruction tracing solution
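The circular-buffer behaviour of the MTB can be sketched with a small host-side model. The code assumes that each trace record is a pair of addresses (the source and the destination of a branch) and uses a hypothetical four-record buffer; all type and function names are invented for illustration, and the real MTB does this in hardware using system SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 4u /* hypothetical buffer capacity, in branch records */

/* One trace record: where execution branched from and where it went. */
typedef struct {
    uint32_t source;
    uint32_t destination;
} branch_record_t;

typedef struct {
    branch_record_t slot[TRACE_SLOTS];
    size_t next;  /* slot that the next record will overwrite */
    size_t count; /* number of valid records, at most TRACE_SLOTS */
} trace_buffer_t;

/* Store a record, overwriting the oldest one once the buffer is full. */
void trace_record(trace_buffer_t *tb, uint32_t source, uint32_t destination)
{
    assert(tb != NULL);
    tb->slot[tb->next].source = source;
    tb->slot[tb->next].destination = destination;
    tb->next = (tb->next + 1u) % TRACE_SLOTS;
    if (tb->count < TRACE_SLOTS) {
        tb->count++;
    }
}

/* Return the i-th oldest record still in the buffer (0 = oldest). */
branch_record_t trace_get(const trace_buffer_t *tb, size_t i)
{
    size_t oldest;

    assert(tb != NULL);
    assert(i < tb->count);

    oldest = (tb->next + TRACE_SLOTS - tb->count) % TRACE_SLOTS;
    return tb->slot[(oldest + i) % TRACE_SLOTS];
}
```

After five branches have been recorded in the four-slot buffer, the first record is gone and the debugger sees the four most recent ones. This is why even a small SRAM allocation is enough to show what the program did just before a fault.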

7. Product Development Based on Cortex-M Processors

7.1 Why Cortex-M Series Processors are Easy to Use

Although the Cortex-M series processors have many features, they are easy to use. For example, almost all development can be done using high-level programming languages like C. Although products based on the Cortex-M series processors vary greatly (e.g., with different memory sizes, different peripherals, performance, and packaging, etc.), the consistency of the architecture allows developers to easily start using new Cortex-M processors once they have experience with one of them.

To make software development easier and to improve software reusability and portability, ARM has developed CMSIS-CORE, where CMSIS stands for Cortex Microcontroller Software Interface Standard. Through a set of APIs, CMSIS-CORE provides a standard hardware abstraction layer (HAL) for processor features such as interrupt management control. It is integrated into the device driver libraries provided by the various microcontroller vendors and is supported by a wide range of development tool suites.

In addition to CMSIS-CORE, CMSIS also includes a DSP software library (CMSIS-DSP). This library provides various DSP functions optimized for Cortex-M4 and Cortex-M7, and of course, it also supports other Cortex-M series processors. Both CMSIS-CORE and CMSIS-DSP libraries are free and can be downloaded from GitHub (CMSIS 4, CMSIS 5) and are supported by many tool vendors.

7.2 Processor Selection

For most microcontroller users, the selection criteria for microcontroller devices mainly depend on cost and peripheral support. However, many readers are chip designers choosing a processor for their next chip product, in which case the processor itself will be the focus of consideration.

Clearly, in such cases, performance, chip area, power consumption, and cost will be crucial factors. There are also various other factors to consider. For example, if you are developing an internet-connected product, you may need a processor with the TrustZone security extension and an MPU, so that you can use TrustZone to protect critical security features and data, run certain tasks at the unprivileged level, and use the MPU to protect memory space. On the other hand, if your product needs to be certified in some way, the instruction trace generated by the ETM, which is supported by Cortex-M23, Cortex-M33, Cortex-M3, Cortex-M4, and Cortex-M7, will be very helpful for demonstrating code coverage.

In other areas of chip design, if you are designing small sensors that run on power from energy-harvesting devices, then Cortex-M23 and Cortex-M0+ would be the best choices because they are very small and have state-of-the-art power optimizations.

7.3 Ecosystem

One of the key advantages of using ARM Cortex-M series processors is the extensive support from mature devices, development toolchains, and software libraries. Currently, there are:

– More than 15 microcontroller manufacturers selling microcontroller products based on ARM Cortex-M series cores

– More than 10 development kits supporting ARM Cortex-M series processors

– Operating systems from over 40 operating system vendors supporting Cortex-M series processors

This gives you a lot of choices, allowing you to obtain the best combination of devices, development tools, and middleware suitable for your target application.

8. Conclusion

There is always a need to balance performance, features, chip area, and power consumption. To this end, ARM has developed various Cortex-M processors with different levels of instruction set features, performance, system, and debugging features. This article introduces the various similarities and differences of the Cortex-M processor family.

Although there are differences, the consistency of the architecture and the standardized APIs of CMSIS-CORE allow for better portability and reusability of software in the Cortex-M series processors. At the same time, the Cortex-M series processors are very convenient to use. Therefore, the Cortex-M series processors have quickly become the most popular 32-bit processor architecture in the microcontroller market.

Source: ARM
