Impact and Efficiency Trade-offs of DMA Memory Address Alignment in ARM Cortex-M Cores

1. Introduction

In embedded systems based on high-performance microcontrollers such as ARM Cortex-M4/M7, Direct Memory Access (DMA) is a key technology for achieving high data throughput and reducing CPU load. However, how efficiently the DMA controller operates depends on whether the memory addresses it accesses are aligned. This article analyzes the issues that arise when DMA accesses unaligned memory addresses: performance degradation, hardware fault risks, and data integrity problems. It then compares the efficiency of two strategies, “directly using unaligned addresses” and “pre-copying to aligned buffers”, to clarify which is the better choice in different scenarios.

2. Definition of DMA and Unaligned Memory Addresses

Memory address alignment refers to the storage address of a data type being a multiple of its size. For example, an address of a 32-bit (4-byte) integer is considered aligned if it is a multiple of 4. DMA transfers also follow this principle, where the configured data width (typically byte, half-word, or word) determines the ideal alignment boundary. When DMA is configured to transfer in words (4 bytes), but its source or destination memory address is not a multiple of 4, it constitutes an unaligned DMA access.
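The definition above can be turned into a small check that runs before a transfer is started. The following is a minimal sketch: the `dma_width_t` enum and the function name `dma_addr_is_aligned` are hypothetical and do not belong to any vendor driver, which will define its own width encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transfer widths in bytes; real drivers use their own encodings. */
typedef enum {
    DMA_WIDTH_BYTE     = 1,
    DMA_WIDTH_HALFWORD = 2,
    DMA_WIDTH_WORD     = 4
} dma_width_t;

static_assert(DMA_WIDTH_WORD == sizeof(uint32_t), "a word is assumed to be 4 bytes");

/* Returns true when addr is a multiple of the transfer width.
 * Relies on every width being a power of two. */
bool dma_addr_is_aligned(uintptr_t addr, dma_width_t width)
{
    return (addr & ((uintptr_t)width - 1u)) == 0u;
}
```

With word width, `0x20001000` passes the check and `0x20001001` does not; with byte width, every address passes.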

Figure: aligned vs. unaligned word access.

• Aligned access (address 0x2000 1000): Word 0 at 0x2000 1000, Word 1 at 0x2000 1004, Word 2 at 0x2000 1008.
• Unaligned access (address 0x2000 1001): a word access at 0x2000 1001 covers bytes 0x2000 1001 to 0x2000 1004 and therefore crosses the word boundary at 0x2000 1004.

3. Core Impacts of Unaligned DMA Access on the System

On the ARM Cortex-M4/M7 architecture, unaligned DMA access can lead to various negative consequences, both directly and indirectly.

3.1 Significant Performance Degradation

This is the most direct impact. The Cortex-M4/M7 core supports unaligned load/store operations to Normal memory, but at the cost of execution efficiency: the hardware automatically breaks a single unaligned access into multiple aligned, smaller accesses. Whether a DMA controller does the same is implementation-specific, so the controller's reference manual should be consulted; where unaligned transfers are accepted, the same kind of splitting cost applies.

1. Increased Bus Cycles: A single 4-byte unaligned access may be split by the hardware into two or more aligned accesses of 1 or 2 bytes. Each additional access adds bus cycles. This article works with an overhead on the order of 1 to 2 clock cycles per access for Cortex-M4/M7; the actual figure depends on the device and bus and should be measured.
2. Cumulative Effect: In DMA transfers, the performance loss from a single access is multiplied by the number of accesses. For a large data block, the continuous unaligned penalty keeps the overall transfer bandwidth well below the theoretical peak.
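To make the splitting concrete, the sketch below counts how many naturally aligned 1-, 2- and 4-byte accesses are needed to cover a given address range. It is a simplified model of the behavior described above, not a description of any particular bus interface, and the function name `count_aligned_accesses` is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == 4, "the model assumes 4-byte words");

/* Counts the naturally aligned accesses (4, 2 or 1 bytes each) needed to
 * cover len bytes starting at addr, always taking the largest access that
 * is aligned at the current address and still fits in the remaining length. */
unsigned count_aligned_accesses(uintptr_t addr, size_t len)
{
    unsigned count = 0;

    while (len > 0) {
        size_t chunk = 4;

        /* Shrink the access until it is aligned and fits. */
        while (chunk > 1 && ((addr & (chunk - 1u)) != 0u || chunk > len)) {
            chunk >>= 1;
        }

        addr += chunk;
        len -= chunk;
        count++;
    }
    return count;
}
```

In this model a word at `0x20001000` needs one access, a word at `0x20001002` needs two half-word accesses, and a word at `0x20001001` needs three (byte, half-word, byte).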



Figure: handling of a DMA word access by the bus interface hardware.

• Address aligned: execute a single aligned word access, with no extra overhead.
• Address not aligned: split into multiple aligned byte/half-word accesses, execute multiple bus transactions, consume extra clock cycles, and the delay of the single access increases.

3.2 Inducing Hardware Faults (Hard Fault)

Under certain conditions, unaligned access is not just a performance issue but can lead to severe system-level errors.

1. Accessing Restricted Memory Areas: When a memory region is configured by the Memory Protection Unit (MPU) as “Device Memory” or “Strongly-ordered Memory”, accesses to it must be naturally aligned. On Cortex-M7, an unaligned CPU access to such a region (for example, a memcpy that fills or drains an unaligned DMA buffer located there) raises a UsageFault, which escalates to a HardFault when the UsageFault handler is not enabled, typically resulting in a system hang or reset. Note that the MPU is part of the core and checks CPU accesses only; errors on accesses issued by the DMA controller itself are reported through the controller's own error flags.
2. Undefined DMA Behavior: Some DMA controller implementations do not define all unaligned scenarios. Providing an unaligned address while configuring a wider data width may leave the controller in an undefined state or make it access a different address than intended, resulting in unpredictable data corruption or system lock-up.

3.3 Cortex-M7 Specific Cache Coherency Issues

The Cortex-M7 core introduces a data cache (D-Cache), complicating the address alignment issue. The basic unit of cache operations (such as cleaning or invalidating) is the cache line, which is typically 32 bytes on Cortex-M7.

• Management Efficiency: When the DMA buffer address is aligned with the cache line boundary (a multiple of 32 bytes), cache maintenance operations are most efficient.
• Management Complexity: Unaligned buffers complicate cache management. When the start or end of a buffer does not fall on a line boundary, the first and last cache lines are shared with unrelated data. Cache maintenance always acts on whole lines, so invalidating such a line can discard CPU writes to the neighboring data, and cleaning it can overwrite data that DMA has just written to memory. Without extra care, data staleness or loss may occur when data is shared between the CPU and DMA.
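The following sketch shows how the cache-line range covering a buffer can be computed, and how to detect a buffer that shares lines with its neighbors. It assumes the 32-byte line size stated above; `cache_range_t`, `cache_range_for_buffer` and `buffer_owns_its_lines` are hypothetical names. On real hardware the resulting range would be handed to the CMSIS cache maintenance routines (such as `SCB_CleanDCache_by_Addr` or `SCB_InvalidateDCache_by_Addr`), which are not called here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u /* Cortex-M7 D-Cache line size used in this article */

static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1u)) == 0u,
              "line size must be a power of two");

typedef struct {
    uintptr_t start;  /* first byte of the first cache line touched */
    size_t    length; /* whole number of cache lines, in bytes */
} cache_range_t;

/* Rounds the start down and the end up to cache-line boundaries. */
cache_range_t cache_range_for_buffer(uintptr_t addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end = (addr + len + CACHE_LINE_SIZE - 1u) & mask;
    cache_range_t range = { start, (size_t)(end - start) };
    return range;
}

/* True when the buffer starts and ends on line boundaries, so cache
 * maintenance on it cannot touch unrelated data. */
bool buffer_owns_its_lines(uintptr_t addr, size_t len)
{
    cache_range_t range = cache_range_for_buffer(addr, len);
    return range.start == addr && range.length == len;
}
```

A 100-byte buffer at `0x20001004` maps to the 128-byte range starting at `0x20001000`, so maintenance on it also affects 28 bytes that belong to something else. A 128-byte buffer at `0x20001000` owns all of its lines.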

4. Efficiency Trade-off: “Direct DMA” vs “Pre-copy + DMA”

When the data source is unaligned, developers have to choose: should DMA handle the unaligned addresses directly, or should the CPU first copy the data to an aligned buffer before starting DMA?

4.1 Time Cost Composition of Two Solutions

1. Solution A: Direct Unaligned DMA (T_direct)

• CPU Cost: Almost zero, only requires configuring and starting DMA.
• DMA Cost: Transfer time is extended due to continuous unaligned penalties.
• Time Formula (Conceptual): T_direct = N * (T_base_transfer + T_unalign_penalty)

• N: Total number of data units transferred.
• T_base_transfer: Base time for an aligned transfer.
• T_unalign_penalty: Additional time incurred due to unalignment for each transfer.

2. Solution B: CPU Pre-copy to Aligned Buffer + Aligned DMA (T_copy_then_dma)

• CPU Cost: Time required for the CPU to execute a memcpy that copies the data from the source address to the aligned buffer.
• DMA Cost: DMA operates at maximum efficiency with no unaligned penalties.
• Time Formula (Conceptual): T_copy_then_dma = T_memcpy + (N * T_base_transfer)

• T_memcpy: Total time for the CPU to perform memory copy.

Figure: execution flow of the two solutions.

• Solution A (Direct Unaligned DMA): start DMA, DMA transfer (paying the unaligned penalty on each transfer), transfer complete.
• Solution B (Pre-copy + Aligned DMA): CPU executes memcpy, copy complete, start DMA, DMA efficient transfer, transfer complete.

4.2 Efficiency Comparison and the Decisive Role of Data Volume

The advantages and disadvantages of the two solutions depend on the relationship between T_memcpy and N * T_unalign_penalty.

• Small Data Volume Scenarios: When N is very small (e.g., a few dozen bytes), the cumulative penalty N * T_unalign_penalty is not significant, while the fixed overhead of T_memcpy (function call, address calculations, etc.) is relatively prominent. In this case, performing unaligned DMA directly may be faster.
• Large Data Volume Scenarios: When N is very large (“a large amount of data is sent at once”, such as thousands to tens of thousands of bytes), N * T_unalign_penalty grows with every transferred unit and becomes the main bottleneck of the total time. T_memcpy also grows with N, but an optimized memcpy has a low and stable per-byte cost, so its fixed overhead is amortized. As long as the per-unit copy cost stays below the per-unit unaligned penalty, the total time of the “pre-copy + DMA” solution is shorter for large data blocks.
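The comparison can be written down as a small cost model. The sketch below implements the two conceptual formulas from section 4.1 and, as an assumption that goes beyond them, splits T_memcpy into a fixed part and a per-unit part. All names are hypothetical, and any cycle counts fed into it are placeholders to be replaced by measurements on the target device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "totals are accumulated in 64 bits");

typedef struct {
    uint32_t base_transfer;   /* T_base_transfer, cycles per unit */
    uint32_t unalign_penalty; /* T_unalign_penalty, cycles per unit */
    uint32_t memcpy_fixed;    /* assumed fixed part of T_memcpy, cycles */
    uint32_t memcpy_per_unit; /* assumed per-unit part of T_memcpy, cycles */
} dma_cost_model_t;

/* T_direct = N * (T_base_transfer + T_unalign_penalty) */
uint64_t t_direct(const dma_cost_model_t *m, uint32_t n)
{
    return (uint64_t)n * ((uint64_t)m->base_transfer + m->unalign_penalty);
}

/* T_copy_then_dma = T_memcpy + N * T_base_transfer */
uint64_t t_copy_then_dma(const dma_cost_model_t *m, uint32_t n)
{
    uint64_t t_memcpy = m->memcpy_fixed + (uint64_t)n * m->memcpy_per_unit;
    return t_memcpy + (uint64_t)n * m->base_transfer;
}

/* True when pre-copying is strictly faster than direct unaligned DMA. */
bool precopy_is_faster(const dma_cost_model_t *m, uint32_t n)
{
    return t_copy_then_dma(m, n) < t_direct(m, n);
}
```

With placeholder values of 4 cycles base, 2 cycles penalty, 40 cycles fixed memcpy overhead and 1 cycle per copied unit, direct DMA wins for N = 16 (96 vs. 120 cycles) and pre-copy wins for N = 1000 (5040 vs. 6000 cycles); the two are equal at N = 40. If the per-unit copy cost is not lower than the per-unit penalty, pre-copy never wins in this model.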

4.3 Consideration of Memory Resource Occupation

Concerns about additional memory usage: The clear disadvantage of the “pre-copy + DMA” solution is the need to allocate an additional aligned buffer of the same size as the transferred data. This increases the system’s RAM usage and reduces memory utilization. In systems with extremely tight memory resources, this may be an unacceptable trade-off.

5. Conclusion and Best Practices

1. Impact Summary: On the ARM Cortex-M4/M7 platform, DMA access to unaligned memory addresses can cause significant performance degradation and, when restricted memory areas (Device or Strongly-ordered) are involved, may trigger fatal hardware faults. For Cortex-M7, it also increases the complexity of cache coherency management.
2. Efficiency Trade-offs:

• For large data block transfers, from the perspective of total time efficiency and system stability, the “CPU pre-copy to aligned buffer + aligned DMA” is the better choice. It trades a one-time CPU overhead and additional memory usage for the highest efficiency of DMA transfer and system robustness.
• For small data blocks and scenarios with extremely high real-time requirements, if it can be ensured that hardware faults are not triggered, performing unaligned DMA directly may be feasible, but this requires careful performance evaluation and risk analysis.

3. Best Practices:

• Compile-time Alignment: During the design phase, DMA buffers should be statically defined to be aligned as much as possible through compiler directives (such as __attribute__((aligned(N)))). This is the most fundamental and efficient way to solve the problem, as it avoids any additional runtime overhead.
• Dynamic Aligned Allocation: If dynamic allocation is necessary, use memory functions that support aligned allocation (such as aligned_alloc) or custom memory pools.
• Reasonable MPU Configuration: Using the Memory Protection Unit (MPU) to manage the attributes of memory regions (for example, configuring a region as “Normal memory” to tolerate unaligned access, or as non-cacheable to avoid cache coherency issues) can eliminate many potential fault points at the hardware level.
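The first two practices can be sketched as follows. The example assumes a GCC- or Clang-compatible compiler for `__attribute__((aligned(N)))` and a C library that provides C11 `aligned_alloc`, which not every embedded toolchain does. The names `DMA_ALIGN`, `RX_BUF_SIZE`, `rx_buf`, `dma_static_buffer` and `dma_alloc` are hypothetical, and the 32-byte alignment is chosen to match the Cortex-M7 cache line discussed earlier.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN   32   /* alignment in bytes, one Cortex-M7 cache line */
#define RX_BUF_SIZE 256u /* example size, a whole number of cache lines */

static_assert(RX_BUF_SIZE % DMA_ALIGN == 0u,
              "buffer size should be a multiple of the alignment");

/* Compile-time alignment: no runtime cost, address fixed by the linker. */
static uint8_t rx_buf[RX_BUF_SIZE] __attribute__((aligned(DMA_ALIGN)));

uint8_t *dma_static_buffer(void)
{
    return rx_buf;
}

/* Dynamic aligned allocation. The size is rounded up to a multiple of the
 * alignment, as C11 aligned_alloc expects. Release the result with free(). */
void *dma_alloc(size_t size)
{
    size_t rounded = (size + DMA_ALIGN - 1u) & ~(size_t)(DMA_ALIGN - 1u);
    return aligned_alloc(DMA_ALIGN, rounded);
}
```

Rounding the size up also keeps the end of the buffer on a line boundary, so a dynamically allocated buffer does not share its last cache line with another allocation.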

Ultimately, the choice of strategy depends on the comprehensive consideration of the application’s priorities among time efficiency, memory efficiency, and system stability.