Understanding ARM Architecture and Its Processor Families

Preface

Consider this the beginning of my ARM studies. RISC-V has been praised by major media recently, and I have a positive outlook on it as well.

However, I still think it will be very difficult for RISC-V to threaten ARM and x86 in high-profit industries in the short term, due to the characteristics of the instruction set and the fragmented efforts of individual manufacturers. I hope to see RISC-V emerge strongly soon.

A friend of mine said that, due to sanctions, their company is licensed only for ARMv8. For such companies, an open-source or in-house architecture is an unavoidable path. Nevertheless, for us as engineers, the technology is transferable between architectures, so to some extent it matters more to start learning than to agonize over which one to pick; of course, choosing well still matters.

PART ONE – Implementation

1. ARM Processor Family

(1) What is Multicore

Candidates should be familiar with the available processors from ARM and know which of these may be used in multiprocessor configurations.

It is necessary to be familiar with ARM’s processors and to understand which ones are used for multiprocessor configurations.

ARM processors come in various series based on their design and application scenarios. Among them, the Cortex-A series is used in high computational demand fields, such as smartphones, tablets, automotive entertainment systems, digital TVs, etc. It can run rich operating systems and provide interactive media and graphical experiences.

Within the Cortex-A series, Cortex-A9 microarchitecture can be used for scalable multicore processors (Cortex-A9 MPCore multicore processors) as well as more traditional processors (Cortex-A9 single-core processors). Both types of processors support 16, 32, or 64KB of 4-way associative L1 cache configurations, and for optional L2 cache controllers, support up to 8MB of L2 cache configurations. This high level of flexibility allows them to be suitable for specific application areas and markets.

The Cortex-A9 core in the Cortex-A series can be used for multiprocessor configurations. It can be used for scalable multicore processors (Cortex-A9 MPCore multicore processors) as well as for more traditional processors (Cortex-A9 single-core processors).

The Cortex-A9 MPCore multicore processor combines one to four Cortex-A9 cores with a Snoop Control Unit (SCU) that keeps their L1 data caches coherent. It can run a rich operating system across all cores and is suitable for high-performance, high-throughput applications such as high-end smartphones and tablets.

Additionally, the Cortex-A15 and Cortex-A17 cores are also high-performance processors that can be used in various configurations. They offer stronger performance and efficiency and support running complex operating systems and interactive media and graphical experiences.

In summary, the Cortex-A9, Cortex-A15, and Cortex-A17 cores in the ARM Cortex-A series processors can all be used for multiprocessor configurations to meet high-performance and high-computational demand application needs.

Additionally, Cortex-M series are general-purpose microcontrollers that generally do not run operating systems or only run real-time operating systems (RTOS).

Meanwhile, Cortex-R series are real-time processors. They form a separate profile rather than an extension of Cortex-M, are designed for fast and deterministic response, and are mainly used in safety-critical and hard real-time scenarios.

(2) Coherency Hardware

Candidates should be aware of the concept of coherency hardware, what it does, why it is required, and what are the benefits of using it.

It is necessary to understand the meaning of coherency hardware, what it does, why it is needed, and what its advantages are.

Coherency hardware must be used in multicore systems. To understand how it works and how it affects software development, one needs to have knowledge of multicore systems and cache.

In multicore systems, hardware connection and communication are key considerations. To achieve effective communication and data sharing between multiple processor cores, a cache coherence protocol, such as MESI, is typically required. Such protocols ensure that when one core modifies shared data, other cores can immediately or eventually obtain the updated data.

In multicore systems, each core has its own cache (L1 or L2), but the cores communicate through shared memory areas. When one core writes to a shared memory area, any copies of that data held in other cores' caches must be invalidated or updated to ensure consistency.

To achieve this, hardware maintains a state for each cache line, such as the four MESI states (modified, exclusive, shared, invalid). When one core modifies shared data, the coherency hardware notifies the other cores so that they invalidate or update their copies of that cache line.
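The state tracking described above can be sketched as a toy model. This is a minimal illustration for a single cache line shared by several cores, not a description of any real ARM implementation; the names `Line`, `read`, `write`, and `states` are hypothetical, and write-back traffic is only noted in a comment.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Line:
    """One cache line as seen by a single core (hypothetical model)."""

    def __init__(self):
        self.state = State.INVALID


def read(caches, core):
    """Core reads the line; any other valid copy ends up SHARED."""
    if caches[core].state is not State.INVALID:
        return  # cache hit, no state change
    others = [c for i, c in enumerate(caches)
              if i != core and c.state is not State.INVALID]
    for c in others:
        # A MODIFIED copy would be written back to memory first.
        c.state = State.SHARED
    caches[core].state = State.SHARED if others else State.EXCLUSIVE


def write(caches, core):
    """Core writes the line; every other copy is invalidated."""
    for i, c in enumerate(caches):
        if i != core:
            c.state = State.INVALID
    caches[core].state = State.MODIFIED


def states(caches):
    """Compact view of all per-core states, e.g. 'MI'."""
    return "".join(c.state.value for c in caches)


caches = [Line(), Line()]
read(caches, 0)
after_first_read = states(caches)   # core 0 is the only holder
read(caches, 1)
after_second_read = states(caches)  # both cores now share the line
write(caches, 0)
after_write = states(caches)        # core 1's copy is invalidated
```

Real hardware adds write-back, snoop filtering, and protocol variants, so this only shows the basic state transitions.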

This communication mechanism is critical for ensuring data consistency and correctness in multicore systems. It also affects aspects of software development because developers need to consider how to effectively utilize multicore processor resources and ensure data consistency and sharing.

To develop multicore system software more effectively, developers need to understand the hardware connection methods and communication mechanisms to write efficient and reliable parallel code. They also need to understand the characteristics and limitations of multicore processor architectures and how to leverage hardware features to optimize software performance.

The impact of hardware connection and communication in multicore systems on software development mainly manifests in the following aspects:

  • Memory consistency: In multicore systems, each core has its own cache (L1 or L2), but they need to communicate through shared memory areas. To ensure data consistency, cache coherence protocols, such as the MESI protocol, are needed. These protocols ensure that when one core modifies shared data, other cores can immediately or eventually obtain the updated data. Software development needs to consider how to handle this memory consistency issue to avoid data inconsistency or race conditions.
  • Communication overhead: In multicore systems, there is a certain overhead for communication between cores. If many cores communicate frequently, it can affect the system’s performance. Therefore, in software development, it is necessary to avoid unnecessary inter-core communication or adopt efficient communication mechanisms to reduce communication overhead.
  • Load balancing: In multicore systems, if the load distribution is uneven, some cores may be idle while others are busy, thus affecting system performance. Software development needs to consider the load balancing issue and distribute tasks and data reasonably to ensure that each core is fully utilized.
  • Parallel algorithms and programming models: To fully utilize the parallel processing capability of multicore systems, suitable algorithms and programming models for parallel processing need to be adopted. This includes using threads, processes, data parallelism, and selecting appropriate synchronization primitives and communication mechanisms.
  • System integration and testing: In multicore systems, due to the complexity of hardware connections and communication, system integration and testing become more challenging. Software development needs to consider how to effectively integrate and test to ensure the correctness and reliability of the system.
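As a small illustration of the synchronization primitives mentioned in the list above, the sketch below uses Python threads as a stand-in for cores. Coherency hardware keeps caches consistent, but it does not make a read-modify-write sequence atomic, so shared updates still need a lock or an atomic operation. The names `worker` and `run` are hypothetical.

```python
import threading

counter = 0
lock = threading.Lock()


def worker(iterations):
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(iterations):
        with lock:
            counter += 1


def run(num_threads=4, iterations=10_000):
    """Start the workers, wait for them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place, `run(4, 1000)` always returns 4000. Every acquisition of the lock is also a point of inter-core communication, which is the overhead the list above warns about.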

In summary, the impact of hardware connection and communication in multicore systems on software development is multifaceted.

To fully leverage the advantages of multicore systems, developers need to deeply understand the characteristics and limitations of the hardware and adopt appropriate parallel processing technologies and programming models to optimize software performance.

(3) Generic Interrupt Controller (GIC)

GIC is commonly used in Cortex-A5 multicore, Cortex-A9 multicore, and Cortex-R7 multicore processors, and it can also exist as an optional component in Cortex-A7 and Cortex-A15 cores.

It provides a scalable interrupt handling mechanism that allows developers to configure and use as needed.

The Generic Interrupt Controller (GIC) is ARM's standard interrupt controller for Cortex-A series processors. It provides interrupt management functions for multicore and single-core processors and supports a scalable interrupt handling mechanism.

In multicore processors, GIC allows multiple cores to share interrupt signals and coordinates each core’s handling of shared interrupts. Through GIC, each core can independently respond to and dispatch interrupts, ensuring the parallelism and efficiency of interrupt handling.

Furthermore, GIC provides some programmable interrupt control features, such as setting interrupt priority, sub-priority, and enabling/disabling interrupts. These features allow developers to flexibly configure and control the behavior of interrupt handling.
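The effect of priorities and masking can be sketched as follows. This is a simplified model, not the GIC register interface: it assumes the GIC convention that a lower priority value means a more urgent interrupt, and it breaks ties by the lowest interrupt ID, which is an assumption made for illustration. The function name `highest_priority_pending` is hypothetical.

```python
def highest_priority_pending(pending, priority_mask=0xFF):
    """Pick the interrupt a core would be signalled with.

    pending maps interrupt ID -> priority value (lower value = more urgent).
    Only interrupts whose priority value is below the mask are eligible.
    Returns the chosen interrupt ID, or None if nothing is eligible.
    """
    eligible = {irq: prio for irq, prio in pending.items()
                if prio < priority_mask}
    if not eligible:
        return None
    # Sort by priority value first, then by interrupt ID (assumed tie-break).
    return min(eligible, key=lambda irq: (eligible[irq], irq))


pending = {34: 0xA0, 29: 0x40, 50: 0x40}
chosen = highest_priority_pending(pending)
masked = highest_priority_pending(pending, priority_mask=0x40)
```

Here `chosen` is interrupt 29, and `masked` is `None` because no pending interrupt is more urgent than the mask value 0x40.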

In summary, GIC is a key component for interrupt management in Cortex-A series processors, providing a generic and efficient interrupt handling mechanism for both multicore and single-core processors.

(4) Architecture Version Identification

Candidates must be able to identify which ARM processors conform to which Architecture versions. Architectures from ARMv4T onwards are included, with the exception of ARMv7-M and ARMv6-M architectures.

It is necessary to be able to identify which specific ARM processors belong to which architecture version. All architectures from ARMv4T onwards need to be understood, except ARMv7-M and ARMv6-M.

  • ARMv4T architecture:

    • ARM7TDMI series processors (e.g., ARM710T, ARM720T, ARM740T, etc.)
    • ARM9TDMI series processors (e.g., ARM920T, ARM940T, etc.)
    • Note: StrongARM series processors (originally DEC, later Intel products) implement the earlier ARMv4 architecture, which has no Thumb instruction set
  • ARMv5TE and ARMv5TEJ architectures:

    • ARM7EJ-S and ARM926EJ-S processors (ARMv5TEJ)
    • XScale series processors (Intel products, ARMv5TE)
  • ARMv6 architecture:

    • ARM11 series processors (e.g., ARM11MPCore, ARM1176, ARM1156, ARM1136, etc.)
  • ARMv7 architecture:

    • Cortex-A series processors (e.g., Cortex-A5, Cortex-A7, Cortex-A9, Cortex-A15, etc.)
    • Cortex-R series processors (e.g., Cortex-R4, Cortex-R7, etc.)
    • Cortex-M series processors (e.g., Cortex-M3, Cortex-M4, etc.)
  • ARMv8 architecture:

    • Defines two execution states: AArch64 (64-bit instruction set) and AArch32 (32-bit instruction set)
    • Cortex-A50 series processors, i.e., Cortex-A53 and Cortex-A57 (based on ARMv8 architecture, supporting 64-bit instruction set)
    • Cortex-A72 series processors (based on ARMv8 architecture, supporting 64-bit instruction set)
  • ARMv9 architecture:

    • Cortex-X2
    • Cortex-A710
    • Cortex-A510

These are some of the more common processor series for each ARM architecture version. The processor models within a series vary, but they all adhere to the same architectural specification.
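For revision purposes, part of the mapping above can be kept as a small lookup table. This is only a study aid covering a subset of the processors named in the list; the names `ARCHITECTURE` and `architecture_of` are hypothetical.

```python
# Processor name -> architecture version, taken from the list above.
ARCHITECTURE = {
    "ARM7TDMI": "ARMv4T",
    "ARM720T": "ARMv4T",
    "ARM920T": "ARMv4T",
    "ARM7EJ-S": "ARMv5TEJ",
    "ARM1136": "ARMv6",
    "ARM1176": "ARMv6",
    "Cortex-A5": "ARMv7",
    "Cortex-A9": "ARMv7",
    "Cortex-A15": "ARMv7",
    "Cortex-R4": "ARMv7",
    "Cortex-M3": "ARMv7",
    "Cortex-A53": "ARMv8",
    "Cortex-A57": "ARMv8",
    "Cortex-A72": "ARMv8",
    "Cortex-A510": "ARMv9",
    "Cortex-A710": "ARMv9",
    "Cortex-X2": "ARMv9",
}


def architecture_of(processor):
    """Case-insensitive lookup; returns None for processors not listed."""
    wanted = processor.strip().lower()
    for name, arch in ARCHITECTURE.items():
        if name.lower() == wanted:
            return arch
    return None
```

For example, `architecture_of("cortex-a9")` returns `"ARMv7"`, and an unlisted name returns `None`.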

(5) Performance

Candidates must be able to distinguish between applications, real-time, and microcontroller profiles, be aware of the distinguishing features between them, and understand the typical target market/applications of each.

It is necessary to be able to distinguish the differences between the ARM Cortex’s application, real-time, and microcontroller series, understand their different characteristics, and the typical target applications and markets they aim for.

The Cortex-A series programmer’s guide, as the name suggests, only introduces the Cortex-A series and does not provide much knowledge about the Cortex-R and Cortex-M series. Cortex-A series processors are mainly used in applications that require MMU memory management units and cache operations, demanding high-performance processing capabilities.

Cortex-R series processors are primarily aimed at real-time applications, targeting embedded systems that require hard real-time response. Their characteristic features follow from this: fast, deterministic interrupt response; tightly coupled memory (TCM) on the processor's local fast bus, so that critical code and data can be accessed quickly and predictably; and parity or ECC checking for error detection and correction.

Cortex-R series processors use memory protection units (MPU) instead of memory management units (MMU) in the Cortex-A series for memory management and protection.

Cortex-M series processors target low-cost and power-sensitive microcontrollers and mixed-signal systems. Typical end devices include smart meters, human-machine interaction devices, industrial control systems, large home appliances, consumer products, and medical instruments. **The key requirements in these application areas are small code size (better code density), ease of use, and energy efficiency.** For example, the Cortex-M0+ is positioned as the lowest-power ARM core, with power consumption on the order of microwatts per MHz. Cortex-M series processors support only the 16-bit and 32-bit Thumb instruction sets and usually do not contain caches.

Previously, I worked on a Zigbee smart agriculture project, and the development board I bought used an M core.

2. Instruction Cycle Timing

Candidates should be aware of the effects of pipeline and memory systems on instruction execution timing.

It is necessary to understand the impact of pipelines and memory systems on instruction execution time.

Understanding some basic knowledge related to pipeline architecture can help in understanding the impact of pipelines on execution cycle timing.

The ARM7TDMI technical reference manual describes its three-stage pipeline (fetch, decode, execute), which makes the concept of pipelines easier to understand.

For instance, on the ARM7TDMI a taken branch loses 2 cycles. The branch is resolved in the execute stage of the pipeline, at which point the instructions already in the fetch and decode stages are flushed and the pipeline is refilled from the new instruction stream. The branch therefore takes 3 cycles in total before the first instruction at the branch target executes.
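The branch cost described above can be turned into a rough cycle estimate. This is a deliberately idealized sketch: it assumes one instruction completes per cycle once the pipeline is full, and it ignores memory wait states and multi-cycle instructions. The names `BRANCH_PENALTY` and `estimate_cycles` are hypothetical.

```python
BRANCH_PENALTY = 2  # cycles lost refilling fetch and decode after a taken branch


def estimate_cycles(instructions, taken_branches, pipeline_depth=3):
    """Idealized cycle count for a three-stage pipeline.

    instructions   -- total instructions executed (including the branches)
    taken_branches -- how many of those were taken branches
    """
    fill = pipeline_depth - 1  # cycles before the first instruction executes
    return fill + instructions + taken_branches * BRANCH_PENALTY


# A 4-instruction loop body executed 10 times, branching back 9 times.
loop_cycles = estimate_cycles(instructions=40, taken_branches=9)
```

In this model the loop takes 60 cycles, of which 18 are branch penalty, which is why unrolling short loops can pay off on such a core.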

Similarly, once a simple pipeline is understood, the concept of a "load-use hazard" becomes clear. It describes the situation where an instruction loads a value from memory into a register and a following instruction uses that register.

If the next instruction needs the register's value, it must wait until the load has completed (the data may be delivered to the relevant pipeline stage through a dedicated forwarding path).

The compiler (or assembler) will usually avoid this situation; it generally attempts to separate the instructions that use the loaded data from the loading instruction.
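The check a compiler scheduler performs can be sketched as follows. Instructions are modelled as hypothetical `(opcode, destination, sources)` tuples rather than real encodings, and the function `find_load_use_hazards` is an illustrative name, not part of any toolchain.

```python
def find_load_use_hazards(program):
    """Return indexes of instructions that use a register loaded
    by the immediately preceding LDR."""
    hazards = []
    for i in range(len(program) - 1):
        op, dest, _ = program[i]
        _, _, next_sources = program[i + 1]
        if op == "LDR" and dest in next_sources:
            hazards.append(i + 1)
    return hazards


naive = [
    ("LDR", "r0", ("r4",)),       # r0 = [r4]
    ("ADD", "r1", ("r0", "r2")),  # uses r0 straight after the load
    ("SUB", "r3", ("r5", "r6")),  # independent of the load
]

# Moving the independent SUB between the load and its use hides the latency.
scheduled = [naive[0], naive[2], naive[1]]

naive_hazards = find_load_use_hazards(naive)
scheduled_hazards = find_load_use_hazards(scheduled)
```

The naive order reports a hazard at index 1, while the reordered sequence reports none and computes the same results, because `SUB` neither reads nor writes any register involved in the load or the `ADD`.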

PART TWO – Software Debugging

1. Standard Debugging Techniques

(1) Differences Between Software and Hardware Breakpoints

Candidates should be aware of the differences and limitations between Hardware and Software breakpoints. They should understand the usage of watchpoint units in configuring these two types of breakpoint.

It is necessary to clearly understand the differences between hardware and software breakpoints and their respective limitations, as well as the use of watchpoint units when configuring these two types of breakpoints.

In earlier processors (like ARM7TDMI), there was no dedicated breakpoint instruction (BKPT); instead, a watchpoint unit was programmed to match a specific bit pattern in the instruction stream. The debugger can set an unlimited number of software breakpoints in the RAM code space.

The debugger reads the opcode at the address where the breakpoint is to be placed and saves it on the host. The opcode is then replaced by the specific bit pattern (usually one that corresponds to an invalid instruction).

Typically, this requires the processor's data cache to be cleaned and the instruction cache invalidated, so that the pattern written through the processor's data side is correctly seen by the instruction fetch logic.

In newer processors, including all Cortex series processors, a BKPT instruction is written in place of the original opcode instead of the special bit pattern.

When the breakpoint is removed, the original opcode stored on the host will be written back. In contrast, hardware breakpoints can be set on any type of memory (including RAM and ROM), but their number is limited by hardware.
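The save, patch, and restore sequence for software breakpoints can be sketched as below. Target RAM is modelled as a plain dictionary, and the class `SoftwareBreakpoints` is hypothetical rather than the API of any real debugger; `0xE1200070` is the ARM-state encoding of `BKPT #0`. Cache maintenance is only mentioned in a comment.

```python
BKPT_ARM = 0xE1200070  # ARM-state encoding of BKPT #0


class SoftwareBreakpoints:
    """Toy model of a debugger patching breakpoints into target RAM."""

    def __init__(self, memory):
        self.memory = memory  # address -> 32-bit word, stands in for target RAM
        self.saved = {}       # address -> original opcode, kept on the host

    def set(self, address):
        if address in self.saved:
            return  # already set; do not save the BKPT as the "original"
        self.saved[address] = self.memory[address]
        self.memory[address] = BKPT_ARM
        # A real debugger would now clean the data cache and invalidate
        # the instruction cache for this address.

    def clear(self, address):
        # Write the original opcode back and forget the breakpoint.
        self.memory[address] = self.saved.pop(address)


ram = {0x8000: 0xE3A00001}  # MOV r0, #1
bps = SoftwareBreakpoints(ram)
bps.set(0x8000)
patched = ram[0x8000]
bps.clear(0x8000)
restored = ram[0x8000]
```

Because the sequence writes to code memory, it only works where memory is writable, which is why software breakpoints are generally limited to RAM.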

When learning this part, it is recommended to use a real debugger to understand its usage and limitations, gaining some practical application experience, which can provide more beneficial background knowledge than explanations in books.

Software breakpoints and hardware breakpoints are two different types of breakpoints that have significant differences in implementation and applicable scenarios.

Hardware breakpoints:

Require hardware support from the target CPU.

The number of breakpoints is limited by specific hardware design. For example, ARM7/9 cores support a maximum of two hardware breakpoints, while ARM11 can support up to eight hardware breakpoints.

Hardware breakpoints can be set on code at any location, including ROM and RAM.

Due to the flexibility of hardware breakpoint settings, they are the preferred breakpoint resource.

Software breakpoints:

Implemented by setting characteristic values in the code.

When a software breakpoint needs to be set at a specific address in the code, the debugger first backs up the code at that location, then writes a predetermined breakpoint characteristic value (usually a value like 0x0000 that is unlikely to be confused with code) into that address, overwriting the original code data.

When the program reaches the address where this characteristic value is located, it is recognized as a software breakpoint and execution stops under the debugger's control. When the breakpoint is removed, the previously backed-up code is automatically restored.

Since software breakpoints require modifying the corresponding address values, they can generally only be set in RAM, but their number can be unlimited. In summary, software breakpoints and hardware breakpoints each have their characteristics and usage scenarios. Hardware breakpoints, due to their flexibility in setting, are usually the preferred breakpoint resource, but when a large number of breakpoints need to be set, software breakpoints can serve as an effective supplementary resource.

Hardware breakpoints are often preferred in practical applications. The reasons include:

  • Hardware breakpoints are usually faster to set because they are handled directly by hardware and do not require code memory to be modified.
  • Hardware breakpoints are typically more reliable because they are not affected by software behavior such as memory management, code being reloaded, or process switching.
  • Hardware breakpoints work on any type of memory, including ROM and flash, whereas software breakpoints need writable memory. Their drawback is that only a small, hardware-defined number is available.

Of course, in some cases, software breakpoints may also be a better choice. For example, for some low-end devices that lack hardware debugging support, or when fine control over the location and number of breakpoints is needed, software breakpoints may be necessary.

Hardware breakpoints are more commonly used in practical applications, but the specific choice of which type of breakpoint to use depends on the specific application scenario and needs.

(2) Monitor Mode vs. Stop Mode Debugging

Candidates should be aware of the differences between debugging in each of the two modes and when one would be chosen over the other, e.g., choosing monitor mode if interrupts must continue to be serviced while debugging.

It is necessary to understand the differences between the two debugging modes and when to choose which mode.

For example, if it is necessary to continue responding to interrupts during debugging, then monitor mode should be chosen for debugging.

The main difference between monitor mode and stop mode debugging lies in their purpose and behavior.

In monitor mode debugging, a debug event such as a breakpoint or watchpoint raises a debug exception instead of stopping the core. The exception is handled by a small debug monitor program running on the target, which communicates with the debugger. Because the core keeps executing code, interrupts can still be serviced while the rest of the application is being debugged. Monitor mode is therefore suitable for systems that must keep responding to external events in real time.

In stop mode debugging (also called halt mode), a debug event stops the core completely and places it under the control of an external debugger, typically over JTAG. While the core is stopped, developers can step through the program, inspect variable and register values, and view stack traces to understand the program's state, but no interrupts are serviced. Stop mode is suitable for situations requiring an in-depth look at the program's internal state, such as handling complex errors.

When it is necessary to continue responding to interrupts during debugging, monitor mode is the appropriate choice. The trade-off is that the monitor is itself target software: it occupies memory and needs the system to be healthy enough to run it.

However, in certain cases stop mode is the better tool. It needs no software on the target, so it also works when the system is badly broken, for example during early bring-up or after a crash. Pausing execution completely also allows a more in-depth inspection and helps identify the root cause of an issue.

In summary, the choice between monitor mode and stop mode debugging depends on specific needs and scenarios. Monitor mode is more suitable when the system must keep servicing interrupts and external events, while stop mode is more suitable for in-depth examination and diagnosis of complex issues. Choosing the appropriate debugging mode based on actual conditions helps improve development efficiency and program stability.

(3) Vector Capture

Candidates should be aware of the function of this feature, why and when it is typically used.

It is necessary to understand the function of vector capture, why it is needed, and when to use it.

Vector capture is a method debuggers use to trap processor exceptions and interrupts: the core is stopped as soon as it fetches from an exception vector, before the exception handler runs. It is especially useful during early development, when handlers may not yet exist or may not work correctly.

Many ARM processors accomplish this task with vector capture hardware. In other processors like ARM7TDMI, the debugger can use the method of setting breakpoints at appropriate positions in the exception vector table for interrupt vector capture.
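The breakpoint-based approach can be sketched as a small helper that works out where to place breakpoints. The offsets are those of the classic ARM exception vector table; the names `VECTORS` and `vector_breakpoints` are hypothetical, and a real debugger would expose this as a setting rather than a function.

```python
# Offsets of the classic ARM exception vectors from the vector base.
VECTORS = {
    0x00: "Reset",
    0x04: "Undefined instruction",
    0x08: "Supervisor call (SVC)",
    0x0C: "Prefetch abort",
    0x10: "Data abort",
    0x18: "IRQ",
    0x1C: "FIQ",
}

DEFAULT_CATCH = ("Undefined instruction", "Prefetch abort", "Data abort")


def vector_breakpoints(base=0x00000000, catch=DEFAULT_CATCH):
    """Addresses at which to set breakpoints to trap the named exceptions."""
    return sorted(base + offset
                  for offset, name in VECTORS.items() if name in catch)


low = vector_breakpoints()
high = vector_breakpoints(base=0xFFFF0000)  # high-vectors configuration
```

On a core such as the ARM7TDMI, with only two hardware breakpoint resources, catching three vectors at once this way is not possible if the vector table is in ROM, which is one reason dedicated vector capture hardware is useful.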

Vector capture is a debugging technique used to capture exception information when a processor exception occurs. It is typically used in embedded system development, especially when debugging and troubleshooting exception issues.

The primary purpose of vector capture is to obtain the processor’s status information at the time of the exception, including the program counter (PC), stack pointer (SP), register contents, etc. This information can help developers quickly locate the specific location where the exception occurred and understand the processor’s behavior at the time of the exception.

Why is vector capture needed? The main reasons include:

  • Quickly locate exceptions: By capturing vectors, developers can obtain detailed information at the time of the exception, thus quickly locating the specific location where the exception occurred. This helps shorten debugging time and improve development efficiency.

  • Deeply understand exception behavior: Vector capture not only provides the processor status at the time of the exception but also helps developers understand the specific behavior of the exception. This helps determine the root cause of the exception and provides important clues for fixing the exception.

  • Optimize exception handling: By capturing vectors, developers can evaluate the performance of the exception handler and optimize it. For example, by analyzing the captured exception information, developers can check whether the exception handler correctly handled the exception and make necessary adjustments.

When is vector capture typically needed? Here are some situations where vector capture may be necessary:

  • When the program cannot start normally: If the program cannot start normally, there may be hardware or software faults. In this case, using vector capture can help developers quickly locate the problem.

  • When the program encounters unhandled exceptions: During program execution, if unhandled exceptions occur, it may lead to program crashes or data loss. By capturing vectors, these exceptions can be captured and debugged to fix the issues.

  • When a deeper understanding of program behavior is needed: Sometimes, developers may need to gain a deeper understanding of the program’s operational behavior. For example, in performance optimization or debugging complex issues, vector capture can help developers better understand the program’s behavior.

In summary, vector capture is a very useful debugging technique that can help developers quickly locate exception issues, deeply understand exception behavior, and optimize handling. In embedded system development, especially when debugging complex exception issues, using vector capture can enhance development efficiency and program stability.

(4) Identifying the Cause of Exception Triggers (e.g., DFSR, SPSR, Link Register)

Candidates should understand how to determine the actual cause of an exception by using additional resources, which might include special registers that contain fault status. For example, using the DFSR, DFAR to determine the cause of a Data Abort or using the SPSR and LR to locate and retrieve SVC comment field.

  • DFSR (Data Fault Status Register): When a data abort occurs, the fault status (the type of fault, such as a permission fault, translation fault, alignment fault, or external abort) is written into the DFSR. The companion DFAR (Data Fault Address Register) holds the address of the access that faulted.

  • **SPSR (Saved Program Status Register):** When an exception is taken, the Current Program Status Register (CPSR) is saved into the SPSR of the exception mode so that it can be restored after exception handling is completed. By checking the SPSR, one can understand the program's status at the time of the exception.

  • **Link Register:** On ARM, the link register (LR) holds the return address of a subroutine call. A branch-with-link instruction writes the return address to LR; it is not pushed onto the stack automatically. When an exception is taken, the LR of the exception mode receives an address derived from the interrupted instruction, which is used both to return from the exception and to locate the instruction that caused it.

To determine the specific cause of an exception trigger, it is necessary to analyze based on the actual scenario and the specific manifestations of the exception. This may include checking the Program Status Register (PSR), Program Counter (PC), Stack Pointer (SP), and other relevant system registers’ states, as well as checking related memory and I/O operations, etc. Additionally, context information and call stack information in the exception handler are also important clues.

It is essential to understand how to use additional resources to determine the cause of an exception, which usually includes special registers containing fault status, such as using the DFSR and DFAR to determine the cause of a data abort, and using the SPSR and link register to locate the SVC instruction and retrieve its comment field.

When debugging a piece of software, it is often necessary to understand why a specific processor exception occurs. To this end, ARM processors provide a series of registers to provide useful information.

On exception entry, **the link register (LR) of the current exception mode points close to the instruction in the main program that triggered the exception**. Typically, the address of that instruction is the value of LR minus 4 or 8, depending on the exception type and instruction set state. Similarly, the mode bits of the SPSR indicate the mode the processor was in before entering the exception mode.
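Retrieving the SVC comment field from the SPSR and LR can be sketched as follows. Target memory is modelled by two hypothetical reader callbacks, and the function names are illustrative; the sketch assumes the classic (non-M-profile) exception model, where the T bit is SPSR bit 5 and LR points just past the SVC instruction.

```python
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}
T_BIT = 1 << 5  # set in the SPSR when the caller was executing Thumb code


def previous_mode(spsr):
    """Decode SPSR mode bits M[4:0] into a mode name."""
    return MODES.get(spsr & 0x1F, "Unknown")


def svc_number(spsr, lr, read_word, read_halfword):
    """Return the SVC comment field of the instruction that raised the SVC."""
    if spsr & T_BIT:
        # Thumb SVC is 16 bits wide with an 8-bit comment field.
        return read_halfword(lr - 2) & 0xFF
    # ARM SVC is 32 bits wide with a 24-bit comment field.
    return read_word(lr - 4) & 0x00FFFFFF


words = {0x8000: 0xEF123456}  # ARM:   SVC #0x123456
halfwords = {0x9000: 0xDF2A}  # Thumb: SVC #0x2A

arm_svc = svc_number(0x10, 0x8004, words.__getitem__, halfwords.__getitem__)
thumb_svc = svc_number(0x30, 0x9002, words.__getitem__, halfwords.__getitem__)
```

An actual handler does the same thing in assembly or C on the target; the point here is only the arithmetic on LR and the role of the T bit.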

For specific exceptions, there is additional information. For prefetch aborts (instruction fetch) and data aborts, the fault status registers (FSR) and fault address registers (FAR) in CP15 are the place to look: they provide the reason for the abort (such as an external memory error or an invalid translation table entry for that address) and the address of the memory access that triggered it.
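Decoding the DFSR can be sketched as below. The bit layout assumed here is the ARMv7-A short-descriptor format (fault status in bits [10] and [3:0], domain in bits [7:4], write-not-read in bit 11); other translation formats use a different layout, and only a subset of fault codes is listed. The name `decode_dfsr` is hypothetical.

```python
# Subset of fault status codes, ARMv7-A short-descriptor format (assumed).
FAULT_STATUS = {
    0b00001: "Alignment fault",
    0b00101: "Translation fault (section)",
    0b00111: "Translation fault (page)",
    0b01001: "Domain fault (section)",
    0b01011: "Domain fault (page)",
    0b01101: "Permission fault (section)",
    0b01111: "Permission fault (page)",
    0b01000: "Synchronous external abort",
}


def decode_dfsr(dfsr):
    """Split a DFSR value into fault type, domain, and access direction."""
    fs = (((dfsr >> 10) & 1) << 4) | (dfsr & 0xF)
    return {
        "fault": FAULT_STATUS.get(fs, "Other (0b{:05b})".format(fs)),
        "domain": (dfsr >> 4) & 0xF,
        "access": "write" if dfsr & (1 << 11) else "read",
    }


write_to_unmapped = decode_dfsr(0x805)
read_without_permission = decode_dfsr(0x03D)
```

Combined with the faulting address from the DFAR, this is usually enough to tell a stray pointer (translation fault) from a protection problem (permission fault).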

**Note:** The exception handling model of ARMv6-M and ARMv7-M architectures differs from other ARM cores.

(5) Tracing

Candidates should be aware of what Trace is, what on-chip and off-chip components are required to make it work, what it is used for. Also that it is non-intrusive.

It is necessary to understand what tracing is, what on-chip and off-chip components are needed for it to function normally, what it is used for, and that it is non-intrusive.

Through tracing, developers can understand the operations and instructions executed by the processor at runtime. To implement tracing functionality, some on-chip and off-chip modules are required.

On-chip modules include tracing hardware units, such as an Embedded Trace Macrocell (ETM), and related control logic. These modules are responsible for generating tracing data and storing it in internal buffers or sending it off-chip for subsequent analysis. Tracing is non-intrusive, meaning the trace hardware observes the processor without affecting its normal operation.

Off-chip modules include tracing data capture units and external memory. The tracing data capture unit is responsible for obtaining tracing data from the on-chip tracing hardware unit and converting it into an analyzable format. External memory is used to store tracing data for subsequent analysis and debugging.
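The idea of capturing trace into a limited on-chip buffer can be sketched with a circular buffer. This assumes, for illustration, that the buffer keeps only the most recent entries once full; real trace data is a compressed packet stream rather than a list of addresses, and the class `TraceBuffer` is hypothetical.

```python
from collections import deque


class TraceBuffer:
    """Toy circular buffer that keeps the most recent program counters."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=capacity)

    def record(self, pc):
        self.entries.append(pc)

    def dump(self):
        """Return the captured history, oldest entry first."""
        return list(self.entries)


buf = TraceBuffer(capacity=4)
for pc in range(0x8000, 0x8018, 4):  # six sequential instructions
    buf.record(pc)
history = buf.dump()
```

After six instructions the buffer holds only the last four addresses, which mirrors why a small on-chip buffer shows what led up to an event, while longer captures need an off-chip capture unit.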

(6) Cross Triggering

ARM cross-triggering is a debugging technique in which a debug event in one part of the system triggers debug actions in other parts. For example, when one core hits a breakpoint, the other cores can be stopped at almost the same time, or trace capture can be started or stopped. Through cross-triggering, developers can observe the behavior of the whole system around specific events, such as exceptions, interrupts, or the execution of specific instructions.

Cross-triggering mechanisms are typically implemented through collaboration between hardware and software. The hardware part is responsible for monitoring the occurrence of events and triggering the tracing mechanism when the event occurs. The tracing mechanism records the relevant instructions and operations executed by the processor and stores the data in internal buffers or external memory. The software part is responsible for reading the tracing data and converting it into an analyzable format, enabling developers to gain insights into the processor’s behavior.

In ARM debugging, cross-triggering technology is crucial for diagnosing and resolving issues. Developers can observe the processor’s behavior during specific events through the cross-triggering mechanism, enabling them to locate the root cause of problems or errors. This technology can also be used for performance analysis, code coverage analysis, etc., helping developers optimize code and improve system performance.

ARM cross-triggering technology can be applied in many scenarios. Some common ones are:

  • Debugging exceptions and interrupts: By using cross-triggering technology, developers can observe the processor’s behavior when exceptions or interrupts occur. This helps determine the cause of the exception or interrupt and how the processor performs during exception or interrupt handling.
  • Performance optimization: Cross-triggering technology can help developers analyze the processor’s performance under different workloads. By observing the processor’s behavior under critical tasks or high-load conditions, developers can identify performance bottlenecks and optimize the code.
  • Code coverage analysis: Cross-triggering technology can be used to analyze the processor’s behavior when executing different code paths. By recording the processor’s behavior during specific events, developers can determine code coverage and identify which code paths were executed and which were not.
  • System stability analysis: Cross-triggering technology can help developers analyze the system’s stability performance under different conditions. For example, in cases of system crashes or hangs, developers can use cross-triggering technology to observe the processor’s behavior before the crash or hang to identify the root cause of the problem.
  • Security vulnerability analysis: Cross-triggering technology can be used in security vulnerability analysis to help developers locate and address security issues. For example, developers can use cross-triggering technology to observe the processor’s behavior when executing malicious code to discover and address security vulnerabilities.

In summary, ARM cross-triggering technology is a powerful debugging tool that can help developers gain deep insights into the processor’s behavior and performance and resolve potential issues or errors. It can be applied in various application scenarios, including debugging exceptions and interrupts, performance optimization, code coverage analysis, system stability analysis, and security vulnerability analysis.

(7) Physical Debug Interface

Candidates should be aware of available options for making physical connections to the target for the purpose of run-control debug (e.g., JTAG, SWD)

It is necessary to understand the options for physically connecting to the target system for run-control debugging (e.g., JTAG, SWD).

JTAG is a widely used industry standard that typically requires at least 5 signal pins to connect the host computer and the ARM target system board. Serial Wire Debug (SWD) is a 2-pin debug port that offers similar debugging functions and is usually available only on newer processors.

Additionally, capturing trace information from the processor requires significantly more bandwidth than controlling code execution, because a large amount of 32-bit address and data information can be generated every cycle. Dedicated trace ports therefore require more pins.

Alternatively, trace information can be stored in an on-chip buffer and read out later through a slower debug port with fewer physical pins.

(8) Debugging Access to Memory

Candidates should be aware of how a debugger uses the processor to access memory and be aware of how memory translation affects this. They should also understand the potential impact of cache maintenance caused by debug operations on the state of the memory system.

It is necessary to understand how the debugger uses the processor to access memory and how memory translation affects this. It is also necessary to understand the potential impact on the state of the memory system of cache maintenance caused by debug operations.

Debuggers typically need to display and modify the contents of the target memory (or peripherals), which is usually achieved by executing load or store instructions on the processor. However, some systems can directly perform write (or read) operations on memory through built-in debugging system units without the processor itself participating in memory operations.

A good debugger will minimize the impact of debugging actions on the system. For example, it will typically attempt to reduce modifications to the cache during debugging.

If the debugger displays the contents of a memory region in a debug window and must use the processor to read that memory, it will try to perform the reads as non-cacheable accesses so that content already in the cache is not evicted.

Programmers must understand that **when the memory management unit (MMU) is enabled, the debugging display uses virtual addresses instead of physical addresses, and the content in the cache may not be consistent with the content in external storage.**

If memory access is directly performed through the on-chip debugging unit (instead of the processor), then the access uses physical addresses instead of the processor’s virtual address translation, and it directly accesses memory (bypassing the processor’s cache).
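The difference between the two access paths can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `TargetModel`, its method names, the single-level page table, and the write-back cache keyed by physical address are all invented for illustration and do not describe any particular ARM core or debugger.

```python
PAGE_SIZE = 0x1000


class TargetModel:
    """Toy target: one-level page table plus a write-back cache (hypothetical)."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical page number
        self.memory = {}              # physical address -> value in external memory
        self.cache = {}               # physical address -> value held in dirty lines

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self.page_table[vpn] * PAGE_SIZE + offset

    def cpu_write(self, vaddr, value):
        # Write-back behaviour: the new value stays in the cache for now.
        self.cache[self.translate(vaddr)] = value

    def read_via_cpu(self, vaddr):
        # The debugger asks the core to do the load: MMU and cache take part.
        paddr = self.translate(vaddr)
        if paddr in self.cache:
            return self.cache[paddr]
        return self.memory.get(paddr, 0)

    def read_via_debug_port(self, paddr):
        # The on-chip debug unit reads memory directly: physical address, no cache.
        return self.memory.get(paddr, 0)


target = TargetModel(page_table={0x10: 0x80})   # virtual page 0x10 -> physical page 0x80
target.memory[0x80010] = 0x1111                 # old value in external memory
target.cpu_write(0x10010, 0x2222)               # program updates it; line is dirty

value_via_cpu = target.read_via_cpu(0x10010)
value_via_port = target.read_via_debug_port(0x80010)
print(hex(value_via_cpu), hex(value_via_port))
```

In this model the same variable shows two different values depending on the path, because the processor path sees the cached copy at a virtual address while the direct path sees whatever is currently in external memory at the physical address.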

(9) Semi-Hosting

Candidates should be aware of semi-hosting, understand what its function is and how it can be used during development. They should also be aware of the invasive nature of using this feature and how it may affect timing etc. They should also be aware of the danger of using semi-hosting when no debugger is connected and how to retarget the standard C library.

It is necessary to understand what the function of semi-hosting is and how to use it during development;

It is necessary to be aware of the invasive nature of this mechanism and its effects on timing and other aspects;

It is also necessary to understand the risks of using this mechanism when the target board is not connected to a debugger, and how to retarget the standard C library.

Semi-hosting is a mechanism that allows code on the ARM target board to use some resources provided by the debugger on the host computer.

These resources may include keyboard input, display output, disk I/O. For example, programmers can use standard C library functions provided by this mechanism, such as printf() and scanf() to use the terminal screen and keyboard on the host.

The developed hardware usually does not contain fully functional input/output devices, while semi-hosting allows the host to provide these resources.

Semi-hosting is implemented through a set of dedicated software instructions that generate exceptions.

The application issues the appropriate semi-hosting call, and the debug agent handles the resulting exception. The debug agent then provides the necessary communication with the host.

The semi-hosting interface is usually implemented with the debug agent provided by ARM acting as the intermediary. ARM tools use SVC 0x123456 (ARM state) or SVC 0xAB (Thumb state) to request semi-hosting functions.

Of course, once the system is detached from the development environment, no host running a debugger is connected to the target. A semi-hosting call then has nothing to service it, so developers need to retarget any library functions that rely on semi-hosting, such as fputc().

I previously debugged a small IoT chip where the lightweight device did not run an operating system, only a protocol stack, and all print output went through fputc().

In other words, the library must be given code that outputs characters to a designated device, instead of library code that issues semi-hosting SVC calls.
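The retargeting idea can be sketched as a model in which all output funnels through one low-level character hook that can be swapped. Everything here is hypothetical: `semihosting_putc`, `uart_putc`, `print_string`, and the `NoDebuggerAttached` exception are illustrative names, and the exception merely stands in for whatever an unserviced semi-hosting trap does on a real target. On a real system the retargeting is done in C by supplying your own low-level library function such as fputc().

```python
class NoDebuggerAttached(Exception):
    """Stands in for the unhandled trap a real target would take."""


def semihosting_putc(ch, host_console, debugger_attached):
    # Default library path: every character is handed to the debug agent.
    if not debugger_attached:
        raise NoDebuggerAttached("semi-hosting call with no debugger to service it")
    host_console.append(ch)


def uart_putc(ch, uart_tx):
    # Retargeted path: the character goes to an on-board device instead.
    uart_tx.append(ch)


def print_string(text, putc):
    # Plays the role of printf(): all output passes through one low-level hook.
    for ch in text:
        putc(ch)


host_console, uart_tx = [], []

# On the bench, with the debugger connected, semi-hosting works.
print_string("boot\n", lambda ch: semihosting_putc(ch, host_console, True))

# In the field the same hook fails, while the retargeted hook keeps working.
try:
    print_string("boot\n", lambda ch: semihosting_putc(ch, host_console, False))
    field_failed = False
except NoDebuggerAttached:
    field_failed = True

print_string("boot\n", lambda ch: uart_putc(ch, uart_tx))
```

The design point is that the higher-level printing code does not change at all; only the hook underneath it is replaced.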

2. Standard Debugging Techniques

(1) Call Stack

Candidates must understand what a call stack is and how it may be used in debugging.

It is necessary to understand what a call stack is and how it should be used in debugging.

Application code mainly uses the stack for parameter passing, saving local variables, and saving return addresses.

The data that each function pushes onto the stack is organized as a “stack frame”. When the debugger stops the processor, it can analyze the data on the stack to provide the programmer with a “call stack”, which shows the call relationship at each level from the top-level function down to the current subfunction. This lets users see the entire calling path during debugging and understand why the program reached the current position.

To reconstruct the call stack, the debugger must be able to determine which items on the stack hold return addresses. This information usually comes from the debug information (DWARF debug tables), if it was included at compile time, or from a linked list of frame pointers that the program pushes onto the stack.

The second option of course requires code that was built to use frame pointers. If neither kind of information is available, the call stack cannot be reconstructed.

In multi-threaded applications, each thread has its own stack, so the call stack information is only relevant to its corresponding thread.

Stack frame information refers to the data structure automatically allocated on the stack by the compiler during function calls to save the function call state and local variable information. In the call stack, each function call corresponds to a stack frame containing all the information required for that function to execute.

The main purpose of the stack frame is to save local variables and function parameters. When a function is called, its local variables and parameters are pushed onto the stack for access during the function’s execution. Additionally, the stack frame also contains the return address of the function call, environment information (such as register values), etc.

During debugging, by analyzing stack frame information, one can understand the hierarchical structure of function calls, the current execution position, the values of variables, etc. This helps understand the program execution flow and locate problems.

To reconstruct the call stack and obtain stack frame information, the compiler needs to generate debugging information (such as DWARF debugging table) during compilation. This debugging information contains metadata about the program structure and variables, as well as layout information for stack frames. The debugger can use this information to parse stack frames and reconstruct the call stack, providing developers with a complete view of program execution.

In summary, stack frame information is an important component of the function call process, allowing us to gain deep insights into the program’s execution state and history and utilize this information for debugging and optimization.
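As an illustration of the frame-pointer approach, here is a minimal sketch of how a debugger might walk the linked list of frames. The frame layout is an assumption made for this example (saved caller frame pointer at `[fp]`, return address at `[fp + 4]`, a frame pointer of 0 marking the outermost frame); real layouts depend on the compiler and ABI. The function names, addresses, and symbol table are all made up.

```python
def walk_frame_pointers(memory, fp, pc, max_depth=32):
    """Return code addresses from the innermost frame outwards."""
    addresses = [pc]
    depth = 0
    while fp != 0 and depth < max_depth:
        saved_fp = memory[fp]            # assumed layout: caller's frame pointer
        return_address = memory[fp + 4]  # assumed layout: return address
        addresses.append(return_address)
        fp = saved_fp
        depth += 1
    return addresses


def symbolize(addr, symbols):
    # symbols: list of (start_address, name) sorted by start address.
    name = "??"
    for start, symbol in symbols:
        if start <= addr:
            name = symbol
    return name


# Hypothetical stack contents captured when the core was halted.
stack = {
    0x2000FFE0: 0x2000FFF0,  # frame of spi_transfer: saved fp of read_sensor
    0x2000FFE4: 0x08000124,  # return address inside read_sensor
    0x2000FFF0: 0x00000000,  # frame of read_sensor: outermost frame marker
    0x2000FFF4: 0x08000058,  # return address inside main
}
symbols = [(0x08000000, "main"), (0x08000100, "read_sensor"), (0x08000200, "spi_transfer")]

addresses = walk_frame_pointers(stack, fp=0x2000FFE0, pc=0x08000210)
call_stack = [symbolize(addr, symbols) for addr in addresses]
print(" <- ".join(call_stack))
```

The `max_depth` limit is a small safety measure: if the stack has been corrupted, the chain of frame pointers may loop or run off into unrelated data.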

(2) Single Stepping

Candidates must understand what single stepping is and how it may be used in debugging. They should understand the difference between Step-In and Step-Over when single-stepping.

It is necessary to understand what single stepping is and how to apply it in debugging. It is necessary to understand the difference between Step-In and Step-Over when single-stepping.

Single stepping refers to the debugger controlling the execution of part of the code, executing one instruction at a time during debugging. The difference between Step-In and Step-Over can be understood from function calls.

When using Step-Over to debug a function, the entire function will be executed as one step, allowing the programmer to execute the entire function directly without needing to step into the function body. Step-In indicates entering the function to step through its body.
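The difference can be sketched with a toy core that only understands calls and returns. This is an illustrative model, not how any particular debugger is built: the `ToyCore` class and its opcodes are invented, and step-over is modelled as "keep single-stepping until the call depth is back where it started", whereas real debuggers often place a temporary breakpoint at the return address instead.

```python
class ToyCore:
    def __init__(self, program):
        self.program = program    # list of (opcode, operand) tuples
        self.pc = 0
        self.return_stack = []    # return addresses of active calls
        self.executed = []        # addresses executed so far

    def step_in(self):
        # Execute exactly one instruction, following calls into the callee.
        opcode, operand = self.program[self.pc]
        self.executed.append(self.pc)
        if opcode == "CALL":
            self.return_stack.append(self.pc + 1)
            self.pc = operand
        elif opcode == "RET":
            self.pc = self.return_stack.pop()
        else:
            self.pc += 1

    def step_over(self):
        # Treat a call and everything it executes as a single step.
        depth = len(self.return_stack)
        self.step_in()
        while len(self.return_stack) > depth:
            self.step_in()


program = [
    ("NOP", None),   # 0
    ("CALL", 4),     # 1: call the function at address 4
    ("NOP", None),   # 2: first instruction after the call
    ("HALT", None),  # 3
    ("NOP", None),   # 4: function body
    ("RET", None),   # 5
]

core_in = ToyCore(program)
core_in.step_in()
core_in.step_in()        # stops at the first instruction inside the function

core_over = ToyCore(program)
core_over.step_in()
core_over.step_over()    # stops at the instruction after the call
```

Both cores execute the call, but step-in stops at address 4 inside the function, while step-over runs the whole function and stops at address 2.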

(3) Start/Stop

Candidates must understand what start/stop debugging is and how it may be used.

It is necessary to understand what start/stop debugging is and how to use it.

“Start” means pressing the start button in the debugger, which causes the processor to leave debug state and resume normal execution from the current program counter. Execution continues until something stops it, typically a breakpoint, a watchpoint, a vector catch event, or an external debug request raised by the debugger or another block in the system.

Start/stop debugging stands in contrast to systems in which code execution cannot be stopped during debugging. In some embedded systems (such as automotive engine control systems), the processor cannot simply be halted while the system is being debugged.

(4) Printf

Candidates must understand how printf may be used in code instrumentation (they may also be expected to understand what other kinds of instrumentation might typically be used).

It is necessary to understand how to use printf in code instrumentation (and understand what other instrumentation functions might be available).

This is a fundamental debugging technique typically used across all processor architectures to insert instructions into the code to output some data, such as displaying the program’s instruction flow or the values of some key variables at a certain moment.

In code instrumentation, printf is a commonly used debugging tool for outputting data and information, helping developers understand the program’s execution state and variable values. It is the formatted output function of the C standard library, and most other languages provide an equivalent, such as the print() function in Python.

Using the printf function in code requires inserting printf statements into the program and specifying what content to output. This content includes various types of variables, macros, expressions, etc. By outputting this content, developers can understand the program’s execution state, variable values, and other key information during debugging.
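A small sketch of this kind of instrumentation follows, written with Python’s print() as a stand-in for printf(). The `checksum` function and the `event_ring` buffer are invented for illustration. The ring buffer shows one alternative form of instrumentation: recording compact events in memory and inspecting them later, which is cheaper than formatting text at every step.

```python
import io
from collections import deque

event_ring = deque(maxlen=4)   # keeps only the most recent events


def checksum(data, log):
    total = 0
    for index, byte in enumerate(data):
        total = (total + byte) & 0xFF
        # printf-style instrumentation: show progress and key variable values.
        print(f"checksum: index={index} byte=0x{byte:02X} total=0x{total:02X}", file=log)
        # Alternative instrumentation: store a compact record for later inspection.
        event_ring.append((index, total))
    return total


log = io.StringIO()
result = checksum([0x10, 0x20, 0xF0], log)
print(log.getvalue(), end="")
```

Writing to a `log` object rather than directly to the console keeps the instrumentation easy to redirect or switch off.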

(5) Bare Metal Code vs. Application Code

Candidates should be aware of the difference in debug options between applications running on OS-based or bare metal systems (e.g., Ethernet, GDB server, ICE).

Example: Candidates should know that it is necessary to run a server program on the target system in order to debug an OS-based application.

It is necessary to understand the differences between debugging OS-based application layer software and debugging bare metal code systems (e.g., Ethernet, GDB Server, ICE).

For example, it is necessary to know that debugging an OS-based application on the target board requires running a debugging service program.

(6) RAM/ROM Debugging

Candidates should be aware of the limitations of debugging code running in different types of memory (e.g., breakpoints and image load).

Example: Candidates should know that it is not possible to set an unlimited number of breakpoints when executing code in ROM. More advanced candidates would understand why this is the case.

It is necessary to understand the limitations of debugging code running in different types of storage (e.g., breakpoints and image load).

For example, it should be known that it is not possible to set an unlimited number of breakpoints when executing code in ROM. A deeper understanding should clarify why this is the case.

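Software breakpoints can only be set in RAM because RAM is writable storage: the debugger implements a software breakpoint by temporarily replacing the instruction at the target address with a breakpoint instruction, which requires a write. Code in ROM has to use hardware breakpoints instead, and the processor provides only a small, fixed number of hardware breakpoint comparators, which is why the number of breakpoints in ROM is limited. Another common limitation when debugging code executed in ROM is that it is difficult to patch the code to correct erroneous sections.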

Debugging code in different kinds of storage does have limitations, which mainly follow from the type and characteristics of the storage. The situations above are explained in more detail here:

  • Debugging code in ROM (Read-Only Memory):

    • Limitations: **Because ROM is read-only, the code executed in ROM cannot be modified directly.** This rules out any debugging action that needs to modify the code, such as setting software breakpoints.
    • Why: ROM is designed to store fixed data, such as program code, rather than to be modified at runtime. It lacks the write capability of RAM, so a software breakpoint cannot be placed on code executed in ROM, and only the limited set of hardware breakpoints remains available.
  • Software breakpoints vs. RAM:

    • Limitations: Software breakpoints can only be set in RAM. This is because data can be written in RAM, while ROM and similar types of storage are typically read-only.
    • Reason: Because RAM is writable, the debugger can write a breakpoint instruction into it. When the program executes that address, the breakpoint is triggered and execution pauses, allowing developers to debug. If the debugger attempts the same write on ROM or other read-only storage, the write fails or has no effect, so the breakpoint is never planted.
  • Modifying code in ROM:

    • Limitations: Due to the read-only nature of ROM, it is challenging to directly modify the code to fix errors. Typically, re-compiling and burning a new ROM image is required.
    • Reason: The original intention of ROM’s design is to store fixed, unchangeable data. This means that once ROM is programmed (i.e., data is written), its contents cannot be easily changed. If changes are needed to the code stored in ROM, it usually requires recompiling the application, generating a new ROM image, and burning it into the ROM. This is far more complex than debugging and modifying code in RAM.

In summary, understanding the limitations of debugging code in different types of storage is crucial as they affect the efficiency and flexibility of the debugging process. Debugging code in ROM is generally more challenging due to its read-only nature, which limits the ability to modify code. In contrast, debugging in RAM is more flexible because it is writable, allowing developers to set breakpoints, modify variables, and perform other debugging tasks.
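The contrast between the two kinds of breakpoint can be sketched as follows. This is a simplified model with invented names (`BreakpointManager`, `ram_ranges`, `hw_comparators`) and an assumed budget of two hardware comparators; the real number depends on the processor. The opcodes used are the Thumb encodings of BKPT #0 (0xBE00) and NOP (0xBF00).

```python
BKPT_OPCODE = 0xBE00  # Thumb encoding of BKPT #0
NOP_OPCODE = 0xBF00   # Thumb encoding of NOP


class BreakpointManager:
    def __init__(self, memory, ram_ranges, hw_comparators):
        self.memory = memory            # address -> 16-bit opcode
        self.ram_ranges = ram_ranges    # list of (start, end) writable ranges
        self.hw_free = hw_comparators   # hardware comparators still unused
        self.saved_opcodes = {}         # address -> instruction replaced by BKPT

    def in_ram(self, addr):
        return any(start <= addr < end for start, end in self.ram_ranges)

    def set_breakpoint(self, addr):
        if self.in_ram(addr):
            # Software breakpoint: patch the instruction, remember the original.
            self.saved_opcodes[addr] = self.memory[addr]
            self.memory[addr] = BKPT_OPCODE
            return "software"
        if self.hw_free == 0:
            raise RuntimeError("no hardware breakpoint comparators left")
        # Hardware breakpoint: the code is untouched, a comparator is consumed.
        self.hw_free -= 1
        return "hardware"


ROM_BASE, RAM_BASE = 0x08000000, 0x20000000
memory = {ROM_BASE + 2 * i: NOP_OPCODE for i in range(8)}
memory.update({RAM_BASE + 2 * i: NOP_OPCODE for i in range(8)})
manager = BreakpointManager(memory, [(RAM_BASE, RAM_BASE + 16)], hw_comparators=2)

ram_kinds = [manager.set_breakpoint(RAM_BASE + 2 * i) for i in range(4)]
rom_kinds = [manager.set_breakpoint(ROM_BASE + 2 * i) for i in range(2)]
try:
    manager.set_breakpoint(ROM_BASE + 4)
    third_rom_ok = True
except RuntimeError:
    third_rom_ok = False
```

In this model any number of breakpoints can be planted in RAM, while the third breakpoint in ROM is refused because both comparators are already in use.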

(7) Timing-Related Issues During Debugging

Candidates should be aware that certain debug approaches may affect the timing of the code (e.g., halt mode) and be aware of non-invasive alternatives (e.g., trace).

It is necessary to understand that some debugging methods may affect the timing of code (e.g., halt mode) and to understand other non-invasive debugging methods (e.g., tracing).

During debugging, some traditional debugging methods may impact the timing of the code, such as debugging in halt mode. These methods may alter the program’s execution flow and timing, affecting the program’s performance and results.

To address this issue, some non-invasive debugging methods have been introduced. One commonly used method is tracing.

Tracing is a technique for monitoring the program execution process by recording key events and data during the program’s execution, helping developers understand the program’s execution state and behavior. Unlike traditional debugging methods, tracing does not interfere with the program’s execution flow, thus having less impact on the timing of the code.

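Tracing can be implemented with specialized trace hardware and tools, or by inserting tracing statements into the program. Inserted statements add a small amount of execution time of their own, whereas hardware trace does not.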

Tracing statements can output key information at specific points in the program, such as variable values, function call stacks, etc. This information is recorded and analyzed after the program execution ends. By analyzing this information, developers can understand the program’s execution path, variable values, and other key information.

Watchpoints and breakpoints are also widely used, but in halt mode they are invasive, because they stop the processor when they trigger.

A watchpoint is set on a specific data address with a trigger condition, and it triggers when that address is accessed, for example when the value of a variable changes. By using watchpoints, developers can observe changes in variable values and understand the program’s behavior. A breakpoint is set at a specific point in the program and pauses execution when the program reaches that point. By using breakpoints, developers can step through the program and observe variable values and execution flow. While the processor is halted, however, the rest of the system keeps running, so timing-sensitive behavior can change.

In summary, it is important to understand that certain debugging methods may affect the timing of the code.

To minimize the impact, developers can choose non-invasive debugging methods such as tracing, and reserve breakpoints and watchpoints for situations where halting the processor is acceptable. Together these methods help developers understand the program’s execution state and behavior and discover potential issues and optimization directions.

(8) The Impact of Debugging on System State

Candidates should be aware of the implications/impact of debugging (e.g., reading memory locations).

It is necessary to understand the meaning and impact of debugging (e.g., reading memory locations).

Debugging is an essential part of the software development process that helps to discover and fix errors and issues in the code. However, using a debugger for debugging may indeed impact the system state of the software. Here are some potential impacts and considerations:

  • 1. System state changes: During debugging, developers may pause program execution, step through code, modify variable values, etc. All these operations can directly or indirectly change the system state. For example, modifying variable values may cause the program’s behavior to differ from normal conditions.

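  • 2. Memory reading and writing: While using the debugger, developers often need to read and modify variable values. These operations involve reading and writing memory. If not handled properly, they may lead to memory corruption, data loss, or other undefined behavior. Even a read can change state: reading a peripheral register that is cleared by a read, such as some status or FIFO data registers, alters what the program itself will see afterwards.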

  • 3. Performance impact: During debugging, the program’s execution may be paused, interrupted, or slowed down, which may affect the program’s performance. If debugging is performed frequently, it may lead to a decline in program performance.

  • 4. Dependency issues: During debugging, developers may need to simulate certain external events or conditions. These simulations may cause the program’s behavior to differ from actual runtime, introducing dependency issues.

  • 5. Security considerations: In some cases, debugging may expose sensitive information or allow malicious users to manipulate the program’s behavior using the debugger. Therefore, security issues need to be considered during debugging, and appropriate measures should be taken to protect the system’s security.

To minimize the impact of debugging on the software system state, developers should adopt some best practices:

  • 1. Backup important data: Before starting debugging, it is advisable to back up important system data and configurations to prevent accidental changes from causing data loss or corruption.

  • 2. Limit debugger permissions: Ensure that the debugger does not have unnecessary permissions to reduce potential security risks.

  • 3. Use assertions and logging: Adding assertions and logging in the code can help developers better understand the program’s state and behavior during debugging.

  • 4. Be cautious with memory operations: Try to avoid direct reading and writing to memory, especially when the program is in an unstable state.

  • 5. Limit the scope of debugging: Try to use the debugger only on the portions of code that need debugging, rather than running the debugger on the entire program.

  • 6. Regularly test and validate: During debugging, regularly testing and validating the correctness and performance of the program can help developers promptly discover potential issues.

  • 7. Follow good coding practices: Writing high-quality code can reduce potential errors and issues, thereby reducing the need for debugging.

In summary, while debugging is an essential part of software development, developers should be aware of the potential impact of using a debugger on the system state and take appropriate measures to minimize these impacts.

PART THREE – Architecture

1. Instruction Set

(1) LDREX/STREX

Candidates should be aware of how these instructions work and how they might be used to implement software synchronization (e.g., mutex). At this level, candidates should be able to recognize a simple mutex implementation (in assembler) and be able to explain its operation. They will not be expected to write one.

Example: Candidates should be aware of what a mutex is. More advanced candidates will be aware of the exclusive access instructions which are used to implement mutexes and similar constructs.

It is necessary to understand how these instructions work and how they are used for software synchronization (e.g., mutex);

Candidates should be able to recognize a mutex application (in assembly code) and explain its function but are not required to write such code applications.

  • **LDREX and STREX are two instructions in the ARM architecture used to implement atomic operations.** An atomic operation is one that appears indivisible: no other thread or processor can observe or interfere with it part-way through.

  • The LDREX instruction reads a value from memory and marks that memory location for exclusive access. In the form LDREX Rx, [Ry], it reads the 4-byte value at the address held in register Ry, saves it to register Rx, and marks the memory pointed to by Ry as exclusive. If the location has already been marked for exclusive access, LDREX still executes normally.

  • The STREX instruction checks, when updating memory, whether the location is still marked for exclusive access, and uses this to decide whether to perform the update. In the form STREX Rx, Ry, [Rz], if the marker is still set, the value in register Ry is written to the memory pointed to by Rz and register Rx is set to 0; a successful store then clears the exclusive access marker. If the marker has been cleared, nothing is written and Rx is set to 1. Consequently, once one STREX has succeeded, any other STREX that tries to update the same location finds the marker cleared and fails, which is how the exclusive access mechanism is achieved.

These two instructions are commonly used to implement inter-thread synchronization in concurrent programming to avoid data races and other concurrency issues.
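The behavior described above can be modelled in a few lines. This is a deliberately simplified sketch: the `ExclusiveMonitor` class, the `try_lock` helper, and the per-core reservation dictionary are invented for illustration, and real exclusive monitors track address ranges and are cleared by more events than this model shows. The comment at the top gives the general shape of a lock loop as it is commonly written in ARM assembly, for recognition purposes only.

```python
# Shape of a simple lock loop in ARM assembly (r0 = lock address, r2 = 1):
#   lock:  LDREX   r1, [r0]        ; read lock value, mark exclusive
#          CMP     r1, #0          ; is it unlocked?
#          STREXEQ r1, r2, [r0]    ; if so, try to store 1; r1 = status
#          CMPEQ   r1, #0          ; did the store succeed?
#          BNE     lock            ; locked or store failed: try again
#          DMB                     ; barrier before entering the critical section

UNLOCKED, LOCKED = 0, 1


class ExclusiveMonitor:
    def __init__(self):
        self.memory = {}
        self.reservations = {}   # core id -> address marked for exclusive access

    def ldrex(self, core, addr):
        self.reservations[core] = addr
        return self.memory.get(addr, 0)

    def strex(self, core, addr, value):
        """Return 0 on success and 1 on failure, like the STREX status register."""
        if self.reservations.get(core) != addr:
            return 1
        self.memory[addr] = value
        # A successful store clears every core's marker for this address.
        for other in [c for c, a in self.reservations.items() if a == addr]:
            del self.reservations[other]
        return 0


def try_lock(monitor, core, lock_addr):
    if monitor.ldrex(core, lock_addr) != UNLOCKED:
        return False
    return monitor.strex(core, lock_addr, LOCKED) == 0


monitor = ExclusiveMonitor()
LOCK = 0x20000000

# Two cores race: both load the lock exclusively and both see it unlocked.
seen_by_core0 = monitor.ldrex(0, LOCK)
seen_by_core1 = monitor.ldrex(1, LOCK)
status_core0 = monitor.strex(0, LOCK, LOCKED)   # first store wins
status_core1 = monitor.strex(1, LOCK, LOCKED)   # marker gone, store is refused
retry_core1 = try_lock(monitor, 1, LOCK)        # retry now reads LOCKED
```

Only one of the two racing stores succeeds, so only one core can take the lock, even though both cores initially read it as unlocked.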

In software development, synchronization is an important technique used to ensure the correct behavior of multiple threads or processes when accessing shared resources. Mutex (mutual exclusion) is a common method for implementing software synchronization, allowing multiple threads to safely access shared resources. To understand the implementation of mutex, developers need to understand how the underlying instructions work.

In assembly language, there are specific instructions used to implement mutex and software synchronization. These instructions typically involve reading and writing operations to memory, as well as altering the state of processes or threads.

For example, a simple mutex implementation might use the following types of instructions:

  • Atomic operation instructions: These instructions cannot be interrupted part-way through by other threads or processes, allowing a shared variable to be modified safely without a lock. For example, x86 provides the CMPXCHG (compare and exchange) instruction, which atomically compares a value in memory and swaps it, while ARM builds the same kind of operation from an LDREX/STREX pair.

  • Spin loops: A thread can repeatedly check a condition until it becomes true. This is how a spinlock is implemented, where the thread “spins” while waiting to acquire the lock. For example, a compare instruction (CMP) combined with a conditional branch that jumps back to the start of the loop forms the spin.

  • Blocking and waking instructions: These instructions allow a processor to wait until another one wakes it up. On ARM, for example, WFE (wait for event) suspends execution until an event arrives, and SEV (send event) signals an event to other processors.

Understanding how these underlying instructions work is crucial for understanding the implementation of synchronization mechanisms like mutex. Developers need to be able to recognize and understand these instructions in assembly code to better understand software’s synchronization behavior.

It is worth noting that while understanding these underlying details is helpful for delving into synchronization mechanisms, in actual development, higher-level abstractions and tools (such as thread synchronization primitives provided by operating systems or concurrent libraries in high-level languages) are typically used rather than directly writing assembly code to implement these mechanisms.
