Low Power Design

5.1 Power Consumption Sources

Surge current, static power consumption, and dynamic power consumption are the three main sources of power consumption.

• Surge current refers to the maximum instantaneous input current generated when the device is powered on, and is also known as startup current in applications.
• Standby current refers to the current generated when the main power supply is turned off or the system enters standby mode. The power consumption caused by standby current is called standby power consumption. Standby power consumption is closely related to the electrical characteristics of the components, similar to surge power consumption. Standby power consumption is also referred to as static power consumption. It should be noted that static power consumption also includes the power consumption caused by the leakage current of transistors in the circuit.
• Dynamic power consumption, or switching power consumption, is the power consumed by logic transitions when the output of a gate circuit switches. Dynamic power consumption (P_dynamic) is defined as follows:

$$P_{dynamic} = \alpha \cdot C_L \cdot V_{DD}^{2} \cdot f$$

where $\alpha$ is the average number of transitions through the entire circuit per clock cycle, $C_L$ is the gate parasitic capacitance, $V_{DD}$ is the supply voltage, and $f$ is the clock frequency.

Dynamic power consumption plays a major role in large-scale IC design. In typical applications, dynamic power consumption accounts for 80% of total power consumption.

5.2 Reducing Power Consumption at Various Design Abstraction Levels

5.3 System-Level Low Power Techniques

System on Chip Approach:

For high-end chips at the nanometer level, the I/O operates at a higher voltage (typically 3.3V) than the chip core logic and accounts for more than 50% of total power consumption. If the entire system contains multiple chips, the interconnections between these chips consume a significant amount of power. In modern digital design practice, the system on chip methodology mainly focuses on reducing power consumption, minimizing area, and lowering cost.

Software/Hardware Partitioning

Low Power Software:

For embedded applications, existing industrial-grade C code is often used in the design. C code may use several loops. In some applications, 90% of runtime may be spent executing these loops. Several techniques can be used to optimize these loops. If two loops are executed sequentially under the same index, they can be merged, thereby reducing the number of executed instructions.

Choosing Processors

5.4 Architectural-Level Power Reduction Techniques

Advanced Gated Clock:

In synchronous digital systems, clock distribution contributes to a significant portion of the overall digital dynamic power. In many cases, most unused circuits can be turned off using gated clocks. Combinational gated clocks are based on the fact that there is a feedback loop from the output of registers to the input, hence they are also called “feedback loop-based gated clocks”.

Note that with combinational gated clocks, the functionality of the circuit is the same before and after the gated clock is inserted, so the change can be verified with equivalence checking tools.

Because combinational gated clock schemes disable the clock of flip-flops when the output remains unchanged, they can reduce dynamic power consumption by 5% to 10%. In contrast, sequential gated clocks change the design structure without affecting design functionality. Sequential gated clocks can reduce redundant switching in the design parts connected to registers with gated clocks.

Figure 5.5 shows a schematic of a combinational gated clock, while Figure 5.6 shows a schematic of the same circuit with a sequential gated clock. Note that when using sequential gated clocks, subsequent pipeline stages also use the same conditions for gating operations.

In Figure 5.6, it can be seen that the sequential gated clock introduces additional logic. Due to the need for this extra logic, this technique is not suitable for multi-bit data.

Dynamic Voltage Frequency Scaling (DVFS):

Reducing clock speed and supply voltage during frequency-insensitive application phases can significantly reduce power consumption with moderate performance loss.
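The size of the saving can be estimated with a rough calculation. The sketch below assumes the usual first-order model in which dynamic power scales with the square of the supply voltage and linearly with the clock frequency; the 20% scaling factors are illustrative and not taken from any specific device.

```latex
\frac{P_{scaled}}{P_{nominal}}
  = \left(\frac{0.8\,V_{DD}}{V_{DD}}\right)^{2} \cdot \frac{0.8\,f}{f}
  = 0.8^{3} \approx 0.51
```

Under these assumptions, lowering both voltage and frequency by 20% roughly halves dynamic power, while a purely compute-bound task takes about 25% longer to finish.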

Cache-Based System Architecture:

Relevant data is pre-fetched from main memory into the cache before the processor needs it. Using small-range caches can significantly reduce computational energy consumption and greatly improve system energy efficiency.

Logarithmic Number System:

For applications requiring large-scale computations, a logarithmic number system (LNS) can be a better choice than a linear system. LNS achieves higher efficiency by reducing average bit activity and by performing multiplication and division with addition and subtraction. Implementing an FFT based on LNS can therefore save a significant amount of power. The downside is that as the word width increases, the lookup tables needed for LNS addition and subtraction grow exponentially.
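The trade-off follows from the basic identities of the representation. As a sketch, assume positive operands a and b stored as base-2 logarithms x = log2(a) and y = log2(b); these symbols are introduced here only for illustration.

```latex
\log_2(a \cdot b) = x + y, \qquad
\log_2(a / b) = x - y, \qquad
\log_2(a + b) = x + \log_2\!\left(1 + 2^{\,y - x}\right)
```

Multiplication and division reduce to one addition or subtraction, but adding two values needs the nonlinear term log2(1 + 2^(y-x)), which is typically read from a lookup table indexed by y - x. That table is what grows exponentially with the word width.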

Asynchronous Clockless Design:

In synchronous clock-based designs, clock distribution consumes a large portion of energy. Traditional design methodologies form large-scale clock tree structures, which potentially increase the average power consumption of SoCs. The most likely issue is clock skew, which is the difference in time it takes for the clock to reach different parts of the circuit. When the circuit is large and slow, clock skew is not significant. However, as the process technology shrinks and speeds increase, the differences in clock skew become more pronounced, requiring additional design time and circuitry to address this issue. Distributing the clock across the entire chip is not an easy task and can lead to irregular layouts. Another issue is the considerable power consumption involved.

Due to the series of problems caused by clocks, removing them from the design is an attractive idea. This is the fundamental intention of asynchronous design. However, clocks cannot simply be removed directly, as some form of control over circuit operations is still required. Asynchronous circuits inherently perform self-control, hence they are also referred to as self-timed circuits.

Figure 5.8 shows an asynchronous system where two modules interact using a handshake interface.

Removing the clock improves energy efficiency. Clocks consume a great deal of energy, especially in large, high-speed systems, so removing them yields a significant saving. In addition, since inactive components consume almost no energy, the dynamic power consumption of an idle asynchronous circuit approaches zero.

Asynchronous circuits rely on delay-insensitive encoding for signal exchange interfaces, with the most popular being dual-rail encoding.

Dual-rail encoding transmits each data bit using two wires, hence it is called dual-rail (single-rail uses only one wire to transmit each data bit).

In dual-rail encoding, one wire represents logic 1, while the other represents logic 0. The two parts can reliably communicate with each other without being affected by delays on the wires. This protocol is insensitive to delays. The encoding is as follows:

“LL” = “No Data”, “LH” = “0”, “HL” = “1”, where L = logic “0” and H = logic “1”.

Here, “0” and “1” represent valid data. “HH” (both wires at logic “1”) is invalid.
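The encoding can be sketched as a pair of small combinational modules. This is a minimal illustration, not a complete handshake implementation; the module and port names (dual_rail_encode, dual_rail_decode, rail_1, rail_0, and so on) are hypothetical, and the wire order {rail_1, rail_0} is assumed to match the two-letter codes above.

```verilog
// Encoder: maps a single-rail bit plus a valid flag onto two wires.
// rail_1 is the wire that signals logic 1, rail_0 the wire that signals logic 0,
// so {rail_1, rail_0} = HL means 1, LH means 0, and LL means no data.
module dual_rail_encode (
  input  wire bit_in,
  input  wire valid_in,
  output wire rail_1,
  output wire rail_0
);
  assign rail_1 = valid_in &  bit_in;
  assign rail_0 = valid_in & ~bit_in;
endmodule

// Decoder: data is present when exactly one of the two rails is high.
module dual_rail_decode (
  input  wire rail_1,
  input  wire rail_0,
  output wire bit_out,    // meaningful only while valid_out is high
  output wire valid_out,
  output wire illegal     // both rails high: not a legal code word
);
  assign valid_out = rail_1 ^ rail_0;
  assign bit_out   = rail_1;
  assign illegal   = rail_1 & rail_0;
endmodule
```

Because the receiver detects valid data from the rails themselves, it does not need a clock edge to know when a bit has arrived; the return to “LL” separates one data item from the next.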

As clock speeds are likely to exceed 5GHz, and inter-chip communication costs occupy 5 to 20 clock cycles, a method to establish hierarchical clock speeds is needed to achieve local synchronization and global asynchronous connections. This requires tools that can handle asynchronous, multi-cycle interconnects as well as local synchronous, high-performance communication.

Power Gating:

• Fine-Grained Power Gating: In fine-grained power gating, there is a switch transistor between each gate and ground, so the connection to ground can be cut when a function is not in use. This control can be applied to each cell in the library. The power-gating switch must be sized to supply the switching current in all situations, and it must be large enough to avoid a measurable IR (voltage) drop. A choice must also be made between a header switch (P-MOS) and a footer switch (N-MOS), as shown in Figure 5.9. For the same switching current, the footer switch typically occupies less area. Dynamic power analysis tools can accurately measure the switching current and predict the required size of the power-gating switch.

• Coarse-Grained Power Gating: In coarse-grained power gating, the power gating transistor is part of the power network rather than a standard cell. Coarse-grained gating essentially creates a power switch network in which groups of switch transistors, operating in parallel, turn an entire module on or off, as shown in Figure 5.10.

Unlike fine-grained power gating, the coarse-grained method does not depend entirely on the quality of the library, but is more influenced by the capabilities of the EDA tools. The operation of coarse-grained power gating is similar to that of fine-grained power gating, but its implementation and analysis are considerably more complex. The size and number of standby (sleep) transistors placed in the power-gated region determine the drive capability of that region, which can cause variations in IR (voltage) drop and degrade performance. When all header switches are turned back on simultaneously, an instantaneous charging/discharging current and a short-circuit current flow from VDD to VSS. This behavior is illustrated in Figure 5.11.

Multi-Threshold Voltage:

Multi-threshold cell libraries help address leakage and dynamic power consumption issues. A typical multi-threshold library contains at least two sets of functionally identical cells with different threshold voltages. High-threshold voltage cells are slower but have less leakage; conversely, low-threshold voltage cells are faster but have more leakage. High-threshold voltage cells typically reduce leakage by 50% compared to low-threshold voltage cells, without side effects such as increased area.

Multi-Voltage Supply

Memory Power Gating:

In a typical SoC, SRAM consumes one-third of total power, with the remainder consumed by the clock tree and random logic. Memory architecture is therefore a key part of any good power management strategy.

The simplest approach is to turn off a memory array when it is not in use. Doing so calls for a power comparison between a single large memory array and multiple small memories.

Another technique is body biasing of the memory. Here designers reverse-bias the memory body when the memory is not in use, which effectively raises the threshold voltage and reduces leakage power consumption.

Another popular method is to use memory with several power supply modes. Many designs use dual-mode memory: when the processor reads or writes the memory, it operates at full voltage so that the accesses can proceed; when no read or write is needed, the processor can programmatically lower the supply voltage to the level that just maintains data integrity.

At the packaging level, stacked memory significantly reduces interconnect capacitance and lowers memory power consumption by nearly 30%. For performance-critical applications that require large memory bandwidth (such as imaging, multimedia, and modulation), the relevant memory can be stacked in the package, while the operating system and other applications are placed in external memory.

5.5 Reducing Power Consumption at the Register Transfer Level

In large-scale ASICs, at least 80% of power consumption is determined when RTL (Register Transfer Level) is completed. The backend process cannot resolve all power consumption issues.

State Machine Encoding and Decoding:

Among the various state machine encoding types, Gray code is the most suitable for low power design. Figure 5.12 compares binary-encoded state machines with Gray-coded state machines. With binary encoding, several flip-flops may toggle during a state transition, for example from state D (“011”) to state E (“100”), where all three bits change. This consumes more energy than Gray code, in which only one flip-flop changes on each state transition. Additionally, Gray-coded state machines eliminate the risk of glitches in state-dependent combinational equations.
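As an illustration of the idea, the sketch below shows a small four-state machine whose states are assigned Gray codes, so that each transition in its cycle toggles a single state flip-flop. The module name, the state names, and the start/busy ports are hypothetical and are not taken from Figure 5.12.

```verilog
// Four-state FSM with Gray-coded states: 00 -> 01 -> 11 -> 10 -> 00.
// Every transition in this cycle changes exactly one state bit.
module gray_fsm (
  input  wire clk,
  input  wire rst_n,
  input  wire start,
  output wire busy
);
  localparam [1:0] IDLE = 2'b00,
                   LOAD = 2'b01,
                   RUN  = 2'b11,
                   DONE = 2'b10;

  reg [1:0] state, next_state;

  // Next-state logic
  always @(*)
    case (state)
      IDLE:    next_state = start ? LOAD : IDLE;
      LOAD:    next_state = RUN;
      RUN:     next_state = DONE;
      DONE:    next_state = IDLE;
      default: next_state = IDLE;
    endcase

  // State register with asynchronous active-low reset
  always @(posedge clk or negedge rst_n)
    if (!rst_n)
      state <= IDLE;
    else
      state <= next_state;

  assign busy = (state != IDLE);
endmodule
```

The single-bit property holds only for transitions between states with adjacent codes, so a state graph with many branches may not allow a pure Gray assignment.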

Binary Number Representation:

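In most applications, two’s complement is the more common way to represent binary numbers compared with sign-magnitude representation. However, for certain special applications, sign-magnitude numbers cause fewer bit transitions. For example, going from 0 to -1 in an 8-bit value toggles all eight bits in two’s complement (“00000000” to “11111111”) but only two bits in sign-magnitude (“00000000” to “10000001”).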

Gated Clock Basics:

Consider a 32-bit register “test_ff” that will write a 32-bit input data value when the load enable “load_cond” is true; otherwise, the value of this register remains unchanged. Below is the RTL code for this logic. Figure 5.15 shows the corresponding circuit implementation.

always @(posedge clock or negedge reset_b)
  if(!reset_b)
    test_ff <= 32'b0;
  else
    test_ff <= test_nxt;

assign test_nxt = load_cond ? test_data : test_ff;

Note that with this RTL coding style, no gated clock can be inferred on the clock line. Some backend tools may generate gating at this level or after pre-flattening, but this should not be relied upon.

The following is an example of the same logic in another coding style, which allows for automatic inference of gated clocks.

always @(posedge clock or negedge reset_b)
  if(!reset_b)
    test_ff <= 32'b0;
  else if(load_cond)
    test_ff <= test_data;

With this RTL coding style, the HDL compiler can see the whole module and recognize that “load_cond” is shared across all 32 register bits. During backend implementation, a clock gating cell (an integrated library cell) then replaces the 32 multiplexers. Integrated clock gating cells are usually bypassed in scan mode (not shown in Figure 5.16). Where a synthesis flow fails to recognize the opportunity, gated clocks can be specified explicitly so that the gating is controlled directly.
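For reference, the behavior of a typical latch-based integrated clock gating cell can be modeled as below. This is a simulation sketch under common assumptions (enable captured while the clock is low, scan enable ORed in as a bypass); the module and port names are hypothetical, and a real design would instantiate the cell from the target library rather than this model.

```verilog
// Behavioral model of a latch-based clock gating cell.
module icg_model (
  input  wire clk,
  input  wire enable,       // functional enable, e.g. load_cond
  input  wire scan_enable,  // forces the clock on in scan mode
  output wire gated_clk
);
  reg en_latched;

  // Level-sensitive latch, transparent while clk is low. Changes on
  // enable while clk is high are ignored, so gated_clk cannot glitch.
  always @(clk or enable or scan_enable)
    if (!clk)
      en_latched <= enable | scan_enable;

  assign gated_clk = clk & en_latched;
endmodule
```

The latch is what distinguishes this from a plain AND gate on the clock: an enable that changes in the middle of the high phase takes effect only from the next cycle.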

One-Hot Code Multiplexer

There are many ways to infer multiplexers in RTL. “case” statements, “if” statements, and state machines can generally achieve this effect. The most common way to represent a multiplexer (MUX) is with binary encoding, as shown below.

case (SEL)
  2'b00: OUT = a;
  2'b01: OUT = b;
  2'b10: OUT = c;
  2'b11: OUT = d;
endcase

Note that if each input of the MUX is a multi-bit bus, significant switching activity will occur, resulting in power consumption. If the “case” conditions are encoded in one-hot encoding rather than binary encoding, the output will be faster and more stable, and the unselected buses can be masked out early, thus achieving low power effects.

case (SEL)
  4'b0001: OUT = a;
  4'b0010: OUT = b;
  4'b0100: OUT = c;
  4'b1000: OUT = d;
  default: OUT = 'bx;  // no valid one-hot select: output is don't-care
endcase

Eliminating Redundant Tasks

In the absence of a default state, bus data often undergoes meaningless transitions. If the transitioned data is not actually sampled, it is redundant, and removing such transitions can significantly reduce power consumption. Figure 5.19 shows an example with redundant transitions, which consumed energy while reading all input signals, but the final output was not used.

Note that if “load_out” is not asserted, the related “load_op” should not be asserted either, which saves some power.

Resource Sharing

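For designs involving many arithmetic operations, if the same operation is used in several places, the corresponding arithmetic logic should not be duplicated. Duplicate logic increases both area and power consumption. In the first version below, each branch of the “case” statement infers its own comparator; in the second version, two shared comparison results are reused by all branches.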

// Without resource sharing: each branch infers its own comparator.
always @(*)
  case (SEL)
    3'b000: OUT = 1'b0;
    3'b001: OUT = 1'b1;
    3'b010: OUT = (value1 == value2);
    3'b011: OUT = (value1 != value2);
    3'b100: OUT = (value1 >= value2);
    3'b101: OUT = (value1 <= value2);
    3'b110: OUT = (value1 < value2);
    3'b111: OUT = (value1 > value2);
  endcase

// With resource sharing: two comparators are shared by all branches.
assign cmp_equal   = (value1 == value2);
assign cmp_greater = (value1 >  value2);

always @(*)
  case (SEL)
    3'b000: OUT = 1'b0;
    3'b001: OUT = 1'b1;
    3'b010: OUT = cmp_equal;
    3'b011: OUT = !cmp_equal;
    3'b100: OUT = cmp_equal || cmp_greater;
    3'b101: OUT = !cmp_greater;
    3'b110: OUT = !cmp_equal && !cmp_greater;
    3'b111: OUT = cmp_greater;
  endcase

Using Ripple Counters to Reduce Power Consumption

Bus Inversion

When the Hamming distance between the current data and the next data is greater than N/2 (where N is the bus width), the next data is inverted before transmission to reduce the number of bits that change on the bus. This is known as bus inversion encoding. Toggling many I/O lines at once consumes significant power; an extra control bit that indicates whether the transmitted data is inverted keeps the number of toggling lines low. The receiver then uses the control bit to invert the data back before processing it.

As shown in the example in Figure 5.30, the difference in total transitions after bus inversion is significant.
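A transmitter for this scheme might look like the sketch below. It is a minimal illustration under stated assumptions: the module and port names are hypothetical, the bus width N is a parameter, and the Hamming distance is measured against the value currently driven on the bus wires, since those are the lines that would toggle.

```verilog
// Bus-invert encoder: sends data_in inverted when more than half of the
// bus lines would otherwise toggle, and flags this on the invert line.
module bus_invert_enc #(parameter N = 8) (
  input  wire         clk,
  input  wire         rst_n,
  input  wire [N-1:0] data_in,
  output reg  [N-1:0] bus_out,
  output reg          invert
);
  // Count the set bits of a vector (Hamming weight).
  function integer count_ones;
    input [N-1:0] v;
    integer i;
    begin
      count_ones = 0;
      for (i = 0; i < N; i = i + 1)
        count_ones = count_ones + v[i];
    end
  endfunction

  // Bits that would toggle if data_in were sent unmodified
  wire [N-1:0] diff      = data_in ^ bus_out;
  wire         do_invert = (2 * count_ones(diff) > N);

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      bus_out <= {N{1'b0}};
      invert  <= 1'b0;
    end else begin
      bus_out <= do_invert ? ~data_in : data_in;
      invert  <= do_invert;
    end
endmodule
```

The receiver recovers the original value with bus_out ^ {N{invert}}. With this rule at most N/2 data lines toggle per transfer, plus possibly the invert line itself.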

High Activity Networks

The approach to such designs is to separate high-activity nets from low-activity nets, and then move the high-activity nets as deep as possible into the logic cloud. The logic cloud shown in Figure 5.31 is a function of X and Y, where X changes infrequently and Y is a high-activity net. In the implementation on the right, the logic cloud is duplicated, with one copy assuming Y=0 and the other assuming Y=1, and the value of Y then selects which result to use. For a fixed value of Y, each of the two new logic clouds is usually smaller than the original.
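The restructuring can be illustrated with a small, made-up function; the expression below is hypothetical and is not the logic of Figure 5.31. Both outputs compute the same value, but in the second form the high-activity input y drives only the final multiplexer.

```verilog
module y_late_select (
  input  wire [7:0] x,           // low-activity inputs
  input  wire       y,           // high-activity net
  output wire [7:0] out_direct,
  output wire [7:0] out_split
);
  // Original form: y feeds the adder and the XOR stage, so every toggle
  // of y propagates through the whole logic cloud.
  assign out_direct = (x + y) ^ {8{y}};

  // Restructured form: the cloud is evaluated once for each value of y.
  wire [7:0] f_y0 = x;             // cloud with y = 0
  wire [7:0] f_y1 = ~(x + 8'd1);   // cloud with y = 1

  // y now only selects between two slowly changing results.
  assign out_split = y ? f_y1 : f_y0;
endmodule
```

Each copy simplifies once y is a constant, which is why the duplicated clouds are usually smaller than the original one.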

Enabling and Disabling Logic Clouds

Large-scale logic clouds (including wide adders, multipliers, etc.) are usually enabled only when their results are needed.
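One common way to do this is operand isolation, sketched below for a multiplier. The module name, port names, and the 16-bit width are assumptions made for illustration. The inputs of the multiplier are forced to zero while it is disabled, so activity on a and b does not propagate into the arithmetic logic.

```verilog
module isolated_mult (
  input  wire        clk,
  input  wire        enable,    // high only when a new product is needed
  input  wire [15:0] a,
  input  wire [15:0] b,
  output reg  [31:0] product
);
  // AND-based isolation: the multiplier sees zeros while disabled.
  wire [15:0] a_gated = a & {16{enable}};
  wire [15:0] b_gated = b & {16{enable}};

  // The result register is loaded only when the multiplier is enabled.
  always @(posedge clk)
    if (enable)
      product <= a_gated * b_gated;
endmodule
```

AND gating still causes one burst of activity each time enable changes; holding the previous operands in latches avoids that at the cost of extra cells, so the better choice depends on how often enable toggles.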

5.6 Low Power Techniques at the Transistor Level

Technology Level

Layout Optimization

Substrate Biasing

Since leakage current is a function of threshold voltage, as shown in Figure 5.33, substrate biasing, also known as “reverse biasing,” can reduce leakage power consumption. With this technique, the voltage of the substrate or the appropriate well region is biased to raise the transistor threshold voltage, thus reducing leakage. For PMOS, this means biasing the transistor body to a voltage above VDD. For NMOS, it means biasing the body to a voltage below VSS.
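The dependence of the threshold voltage on the substrate bias is commonly described by the first-order body-effect relation, shown here for an NMOS transistor as an illustration. The symbols follow the usual textbook convention and are not defined elsewhere in this chapter.

```latex
V_{T} = V_{T0} + \gamma \left( \sqrt{2\phi_F + V_{SB}} - \sqrt{2\phi_F} \right)
```

Here V_T0 is the threshold voltage at zero bias, γ is the body-effect coefficient, φ_F is the Fermi potential, and V_SB is the source-to-substrate voltage. A reverse bias (V_SB > 0) raises V_T, and because subthreshold leakage falls off exponentially with V_T, even a modest increase reduces leakage noticeably.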

Increasing the threshold voltage also reduces performance, so dynamic biasing can be used: a small bias voltage is maintained during active mode, and the bias is strengthened during standby mode. The effectiveness of substrate biasing depends on the process size and is greatly reduced in smaller processes.
