1 Understanding Thumb-2
Let's begin the discussion of energy-saving techniques at a seemingly non-obvious point: the instruction set. All Cortex-M CPUs use the Thumb-2 instruction set, which blends the 32-bit ARM instruction set with the 16-bit Thumb instruction set to balance raw performance against code size. A typical Thumb-2 application on a Cortex-M core can be up to 25% smaller than the same application built entirely from ARM instructions while retaining roughly 90% of the execution efficiency (when optimized for runtime). Thumb-2 also includes many powerful instructions that cut the number of clock cycles needed for basic operations, and fewer clock cycles means the same work is done with less CPU energy.
For example, consider a 16-bit multiplication (as shown in Figure 1). On an 8-bit 8051-class MCU this operation takes 48 clock cycles and occupies 48 bytes of Flash; on a 16-bit MCU such as the C166 it takes 8 clock cycles and 8 bytes of Flash; on a Cortex-M3 core using Thumb-2 it takes only 1 clock cycle and 2 bytes of Flash. The Cortex-M3 therefore saves energy twice over: it finishes the task in fewer clock cycles, and its smaller code footprint reduces how often Flash is accessed (smaller application code may also allow the system to use a smaller Flash device, lowering overall system power consumption further).
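As a quick illustration (a hedged sketch, not from the original article), the C function below typically compiles to a single 16-bit Thumb-2 MULS instruction on a Cortex-M3 when optimization is enabled; the function name is arbitrary, and the exact code generated depends on the compiler and its flags.

```c
#include <stdint.h>

/* 16-bit multiplication on a Cortex-M3: with optimization enabled this
 * usually reduces to a single 16-bit MULS instruction (1 cycle, 2 bytes
 * of Flash), versus dozens of cycles on an 8-bit core. */
uint32_t mul16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;   /* typically: MULS r0, r1, r0 */
}
```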
Figure 1 Comparison of Clock Cycles
Figure 2 Interrupt Response of ARM7 and Cortex-M3
2 Interrupt Controller Energy-Saving Techniques
The interrupt controller in the Cortex-M architecture, the Nested Vectored Interrupt Controller (NVIC), also plays a critical role in reducing CPU power consumption. The earlier ARM7-TDMI needed "up to" 42 clock cycles from the arrival of an interrupt request to the execution of the interrupt handling code, while the Cortex-M3 NVIC needs only 12, significantly improving CPU efficiency and reducing wasted CPU time. Besides entering the interrupt handler faster, the NVIC also makes switching between interrupts more efficient. On the ARM7-TDMI, several clock cycles are first spent returning from the current interrupt handler to the main program before the next handler can be entered, and the push-and-pop operations between interrupt service routines can consume up to 42 clock cycles. The Cortex-M NVIC achieves the same result with a more efficient mechanism known as "tail-chaining": the processor carries the required state directly into the next interrupt service routine with only 6 clock cycles of processing. Because a complete push-and-pop sequence is no longer needed, tail-chaining cuts the number of clock cycles spent managing back-to-back interrupts by about 65% (as shown in Figure 2).
3 Memory Energy-Saving Considerations
The memory interface and any memory accelerator can significantly affect CPU power consumption. Branches and jumps in code flush the pipeline that feeds instructions to the CPU, and the CPU then stalls for several clock cycles while the pipeline refills. The Cortex-M3 and Cortex-M4 cores have a 3-stage pipeline, so a full pipeline flush costs the CPU 3 clock cycles, and Flash wait states make the refill take even longer. These stall cycles burn power while doing no useful work. To reduce the penalty, the Cortex-M3 and M4 cores include speculative fetch: while fetching instructions along the current path, the core also fetches the likely branch target into the pipeline. If the predicted branch target is taken, speculative fetch cuts the stall to 1 clock cycle. Useful as this feature is, it is clearly not enough on its own, and many Cortex-M product vendors have added their own IP to strengthen it.
For example, even among popular ARM Cortex-M MCUs, the instruction-buffering schemes differ. MCUs with a simple instruction buffer, such as Silicon Labs' EFM32 products, can hold 128×32 bits (512 bytes) of the instructions currently being executed, with a logical check determining whether a requested instruction address is already in the buffer. The EFM32 reference manual states that typical applications see a hit rate above 70% in this buffer, which means fewer Flash accesses, faster code execution, and lower overall power consumption. Other ARM MCUs use a 64×128-bit branch buffer that stores the first few instructions at recently taken branch targets (a maximum of 8 and a minimum of 4 instructions per branch, depending on the mix of 16-bit and 32-bit instructions). A hit in this branch buffer can therefore fill the pipeline within 1 clock cycle, eliminating the stalled, wasted CPU cycles. Both buffering techniques deliver a significant performance improvement and power reduction compared with a CPU that has no buffering at all.
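To see what pipeline stalls and Flash wait states actually cost on a given part, the DWT cycle counter available on Cortex-M3/M4 cores can count clock cycles around a code region. The sketch below uses only standard CMSIS register names; "device.h" stands in for the vendor's CMSIS device header, and the helper names are ours, not from the article.

```c
#include <stdint.h>
#include "device.h"   /* placeholder for the vendor's CMSIS device header */

/* Enable the DWT cycle counter (Cortex-M3/M4) so code regions can be timed.
 * Fewer cycles for the same work generally means less energy spent. */
static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT unit */
    DWT->CYCCNT = 0;                                 /* reset the counter   */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting      */
}

/* Return the number of CPU cycles spent in fn(), stalls included. */
static uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t start = DWT->CYCCNT;
    fn();
    return DWT->CYCCNT - start;
}
```

Calling cycle_counter_init() once at startup and then comparing measure_cycles() for branch-heavy versus straight-line versions of a routine, or with a vendor's cache or prefetch feature switched on and off, makes the cost of refilling the pipeline visible.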
4 Exploring the M0+ Core
For power-sensitive applications where every nanowatt counts, the Cortex-M0+ core is an excellent choice. The M0+ is based on a von Neumann architecture (whereas the Cortex-M3 and Cortex-M4 are Harvard architecture cores), which keeps its gate count low and its overall power consumption down, with only a slight loss in performance (0.93 DMIPS/MHz for the Cortex-M0+ versus 1.25 DMIPS/MHz for the Cortex-M3/M4). It also uses a smaller subset of the Thumb-2 instruction set (as shown in Figure 3): almost all instructions have 16-bit opcodes (52 16-bit opcodes and 7 32-bit opcodes, while data operations are all 32 bits wide), which enables some interesting design options for reducing CPU power consumption.
Figure 3 Cortex-M0+ Instruction Table
The primary such option is reducing the frequency of Flash accesses. Because the instruction set is almost entirely 16-bit, the core can access Flash memory only on alternate clock cycles (as shown in Figure 4), fetching two instructions into the pipeline with each 32-bit Flash access whenever the two instructions are aligned within a single 32-bit word. If an instruction is not aligned this way, the Cortex-M0+ disables the unused half of the bus to save every last bit of energy.
Figure 4 Alternating Clock Cycle Flash Memory Access Based on Cortex-M0+
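Toolchains normally pack and align Thumb code sensibly on their own, but as a hedged illustration of the idea, a frequently executed routine can be explicitly given 4-byte alignment so that its 16-bit instruction pairs line up with 32-bit Flash words (GCC attribute syntax shown; the function and any measurable benefit are assumptions for illustration only).

```c
#include <stdint.h>

/* Illustrative only: request word alignment for a hot routine so pairs of
 * 16-bit Thumb instructions can be fetched with a single 32-bit Flash
 * access. Any real effect depends on toolchain defaults and on the
 * particular Cortex-M0+ Flash interface. */
__attribute__((aligned(4)))
void copy_samples(volatile uint16_t *dst, const uint16_t *src, int count)
{
    while (count--) {
        *dst = *src++;   /* loop body consists of 16-bit Thumb opcodes */
    }
}
```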
Additionally, the Cortex-M0+ core reduces power consumption by using a shorter, two-stage pipeline. In a typical pipelined processor, the next instruction is fetched while the CPU executes the current one; if the program branches and that fetched instruction cannot be used, the power spent fetching it (the "branch shadow") is wasted. With a two-stage pipeline this branch shadow shrinks, saving a small amount of power, and fewer clock cycles are needed to refill the pipeline after a flush (as shown in Figure 5).
Figure 5 Pipeline and Branch Shadow Buffer
Figure 6 Cortex-M Existing Low Power Modes
5 Using GPIO Ports for Energy Saving
The Cortex-M0+ core offers an energy-saving feature in another area: its high-speed GPIO port. On the Cortex-M3 and Cortex-M4 cores, toggling a bit on a GPIO port requires a read-modify-write of a 32-bit register. The Cortex-M0+ can also work this way, but it additionally has a dedicated 32-bit-wide I/O port that allows single-clock-cycle access to GPIO, making bit and pin toggling far more efficient. Note that this is an optional feature of the Cortex-M0+, and not all vendors include it.
6 CPU Sleep Modes
One of the most effective ways to reduce CPU power consumption is to switch off the CPU itself. The Cortex-M architecture offers several sleep modes, each trading power consumption against the wake-up time needed before code can run again (as shown in Figure 6). The core can also enter a sleep mode automatically when an interrupt service routine completes, without executing any code to do so, which saves CPU clock cycles on the kind of housekeeping common in ultra-low-power applications. In deep sleep mode, the wake-up interrupt controller (WIC) can further relieve the NVIC: with the WIC, an external interrupt can wake the CPU from a low-power mode without the NVIC being clocked at all.
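These sleep modes are reached through standard CMSIS calls; the snippet below is a minimal sketch of the core-level part only, with vendor-specific clock, regulator, and wake-source configuration omitted ("device.h" again stands in for the vendor's CMSIS device header).

```c
#include "device.h"   /* placeholder for the vendor's CMSIS device header */

/* Normal sleep: the CPU clock stops until the next interrupt arrives. */
void enter_sleep(void)
{
    __WFI();                               /* Wait For Interrupt */
}

/* Select deep sleep so the next WFI enters the deeper low-power state. */
void enter_deep_sleep(void)
{
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;
    __WFI();
}

/* Sleep-on-exit: the core drops back to sleep automatically whenever it
 * returns from an interrupt handler, so no code runs between interrupts. */
void enable_sleep_on_exit(void)
{
    SCB->SCR |= SCB_SCR_SLEEPONEXIT_Msk;
}
```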
7 Autonomous Peripherals to Lighten the CPU Load
Autonomous on-chip peripherals are another route to lower power consumption. Most MCU vendors have built autonomous peripheral-to-peripheral interaction into their product architectures; Silicon Labs' EFM32 MCUs, for example, use a Peripheral Reflex System (PRS). Autonomous peripherals can carry out quite complex chains of peripheral actions (triggering rather than data transfer) while the CPU stays in sleep mode. Using the PRS on an EFM32 MCU, for instance, an application can be configured so that when the on-chip comparator detects a voltage above its preset threshold it triggers a timer to start counting down, and when the timer reaches 0 it triggers the DAC to start outputting, all while the CPU remains asleep in a low-power mode. Automating such interactions lets the peripherals do a great deal of work without CPU involvement. Moreover, peripherals with built-in intelligence (such as sensor interfaces or pulse counters) can use preset conditions to interrupt and wake the CPU, for instance only after 10 pulses have accumulated. In that case, when the CPU is woken by the specific interrupt it knows exactly what to do, without having to read counters or registers to work out what happened, saving a considerable number of clock cycles for other important tasks.
We have covered several easy-to-implement ways to reduce CPU power consumption on Cortex-M devices. Other factors also affect power consumption, such as the process technology the device is built on and the memory technology used to store the application code; both can significantly influence run-time power and leakage in low-power modes, so they too belong in the embedded developer's overall power consumption considerations.