A Comprehensive Guide for Programmers on Cortex-A Series

This article is reprinted with permission from the WeChat public account TrustZone. It covers the Arm architecture, some of its basic building blocks, and the Cortex-A series processors.

Introduction to Arm

Arm processors are ubiquitous: they are found in mobile phones, personal computers, televisions, and cars.

Of a total shipment of about 3 billion microprocessors, the x86 architecture accounts for only a very small (though still very profitable) share.

History of Arm

Arm processors are not a single processor, but a family of processors that share an instruction set and programmer's model, with some degree of backward compatibility.

The first Arm processor (Arm1) was designed by a team led by Sophie Wilson and Steve Furber at Acorn Computers and produced the first silicon chip in April 1985.

Arm1 was quickly replaced by Arm2 (which added multiplication hardware), and Arm2 was used in actual systems, including Acorn’s Archimedes personal computer.

In November 1990, Arm was established in Cambridge, UK, as Advanced RISC Machines Ltd., a joint venture between Apple Computer, Acorn Computers, and VLSI Technology, with the initial 12 employees mainly from the Acorn team.

One reason Arm became an independent company was that the processor had been selected by Apple for use in its Newton product.

The new company quickly decided that the best way to advance its technology was to license its intellectual property (IP) rather than design, manufacture, and sell chips itself; it would sell design rights to semiconductor companies.

Famous strategies have many interesting stories behind them.

Arm also licenses physical IP such as cell libraries (NAND gates, RAM, and so on), graphics and video accelerators, and software development products such as compilers, debuggers, and development boards.

System on Chip (SoC)

It is becoming increasingly rare for a single company to produce all components of such systems. For this reason, Arm and other semiconductor IP companies design and verify components (so-called IP modules or processors).

Semiconductor companies license these modules for use in their own designs. The modules include microprocessors, DSPs, 3D graphics and video controllers, and many other functions.

Semiconductor companies integrate these modules with other parts of a specific system on a chip, forming a system-on-chip (SoC). To form a system, the builders of these devices must choose appropriate processors, memory controllers, on-chip memory, peripherals, interconnect buses, and other logic blocks (which may include analog or RF parts).

The term ASIC (Application Specific Integrated Circuit) refers to an integrated circuit designed for a specific application. A single ASIC may contain an Arm processor, memory, and other components.

This has many similarities to devices known as system-on-chips. SoCs typically refer to a device with high integration, which includes many parts of the system on a single device, possibly including analog, mixed-signal, or RF circuits.

Of course, processors that run powerful operating systems such as Linux need a lot of memory, typically more than fits on a single silicon device. Because a single device cannot contain the entire system, the term system-on-chip is not always entirely accurate.

Silicon area aside, many parts of a system require specialized silicon manufacturing processes, which makes it impractical to place them on the same chip.

The system-in-package (SiP) extends the SoC concept to some extent by integrating several separate chips within a single physical package. Package-on-package stacking is also widely used.

In that arrangement, the package used for the SoC chip has connections on both the bottom (to the PCB) and the top (to a separate package, which may contain flash memory or a large SDRAM device).

Embedded Systems

An embedded system is commonly defined as a piece of computer hardware running software designed to perform a specific task. Examples include set-top boxes, smart cards, routers, disk drives, printers, automotive engine management systems, MP3 players, and copiers.

By contrast, a general-purpose computer runs a wide variety of software and has input and output devices such as a keyboard and some type of graphical display.

This distinction is becoming increasingly blurred. Take mobile phones: a basic model might only make calls, but a modern smartphone runs a complex operating system and can download thousands of applications.

Embedded systems may contain simple 8-bit microprocessors, such as Intel 8051 or PIC microcontrollers, or contain some complex 32-bit or 64-bit processors, such as the Arm series.

A system needs some RAM and some form of non-volatile storage to hold the programs it executes. It almost always has additional peripherals related to the actual function of the device. These typically include universal asynchronous receiver/transmitters (UARTs), interrupt controllers, timers, and GPIO controllers, and may also include fairly complex blocks such as DSPs, GPUs (graphics processors), or DMA controllers.

Software running on embedded systems is usually divided into two independent parts: operating systems (OS) and applications running on the OS.

Operating systems in widespread use range from simple kernels, through complex real-time operating systems (RTOS), to the full-featured operating systems found on desktop computers.

Because embedded systems are subject to many constraints, programming them can be more challenging than programming general-purpose PCs.

(1) Memory usage. In many systems, to minimize costs, the size of memory is limited. Programmers may be forced to consider the size of the program and how to reduce memory usage during program execution.

(2) Real-time performance. Certain systems must respond to external events within a time limit. This may be a “hard” requirement (e.g., an automotive braking system must respond within a certain time) or a “soft” requirement (e.g., audio processing must complete within a certain time frame to avoid a poor user experience, but an occasional missed deadline does not necessarily render the system worthless).

(3) Power. Many embedded systems are powered by batteries, and programmers and hardware designers must minimize the energy consumption of the system. This can be achieved, for example, by slowing down the clock, reducing supply voltage, or shutting down the processor when not in use.

(4) Cost. Cost may be the greatest constraint in system design.

(5) Time to market. In a competitive market, the time taken to develop a product is an important factor affecting its success.
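
The power constraint in item (3) can be made concrete with a small sketch. It uses the common first-order approximation for CMOS dynamic power, P = C × V² × f, which is an assumption brought in for illustration and is not taken from this article. The capacitance, voltage, and frequency values are hypothetical and do not describe any particular Arm device.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS dynamic power estimate in watts: P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points, chosen only to show the trend.
full_speed = dynamic_power(1e-9, 1.2, 1.0e9)  # 1 nF switched, 1.2 V, 1 GHz
scaled = dynamic_power(1e-9, 0.9, 0.5e9)      # same load, 0.9 V, 500 MHz

print(f"full speed: {full_speed:.3f} W")
print(f"scaled:     {scaled:.3f} W")
print(f"ratio:      {scaled / full_speed:.2f}")
```

Under this approximation, lowering the voltage helps more than lowering the clock alone, because voltage enters the estimate squared.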

Arm Architecture and Processors

Arm itself does not manufacture silicon devices. Instead, Arm creates microprocessor designs and licenses them to semiconductor companies and original equipment manufacturers (OEMs), who integrate the microprocessors into system-on-chip devices.

To ensure compatibility between implementations, Arm defines architecture specifications that state precisely how compliant products must behave. Each processor implementing the Arm architecture conforms to a specific version of the architecture. Several processors with different internal implementations and microarchitectures, and different cycle timings and clock speeds, may conform to the same version of the architecture.

(1) Architecture

The architecture defines behavior common to a set, or family, of processor designs and is specified in the Arm Architecture Reference Manual (Arm ARM). It covers the instruction set, registers, exception handling, and other features of the programmer's model. The architecture defines the behavior visible to programmers, such as which registers are available and what individual assembly language instructions do.

(2) Microarchitecture.

The microarchitecture defines how the visible behavior specified by the architecture is implemented, for example the number of pipeline stages. It can still have programmer-visible effects, such as how long a particular instruction takes to execute or how many stall cycles pass before a result is available.

(3) Processor.

A processor is a specific implementation of a microarchitecture. A processor may be licensed to multiple companies and may therefore be integrated into a wide variety of devices and systems, each with its own memory map, peripherals, and other implementation-specific features. Processors are described in Technical Reference Manuals, which can be found on the Arm website.

(4) Core.

We use this term to describe a single logical execution unit of a multi-core processor.

(5) SoC.

A system on chip contains one or more processors together with memory and peripherals. Details of these systems are generally provided in the documentation of the individual SoC or platform vendor.

Versions of the Architecture

Arm regularly releases new versions of its architecture, adding new features or updating existing functionalities. These updates are usually backward compatible, meaning that code written for older versions can still run correctly on newer versions.

Of course, code written to utilize new features may not run on older processors, as older processors typically lack those feature modules.

In all versions of the architecture, some system features and behaviors are left to specific implementations to define. For example, the architecture does not define the size of caches or the timing of instruction cycles, which is determined by the specific implementation of the processors and SoCs.

Each version of the architecture defines optional extensions. In the specific implementations of processors, these extensions may not have been implemented. For example, in the Armv7 architecture, the Advanced SIMD (NEON) technology is an optional extension.

The Armv7 architecture also has the concept of profiles: variants of the architecture aimed at different markets and uses. They are as follows.

(1) A: The Application profile defines an architecture for high-performance processors. It supports a virtual memory system using a memory management unit (MMU), so it can run complex operating systems, and it supports both the Arm and Thumb instruction sets.

(2) R: The Real-time profile defines an architecture for systems that require deterministic timing and low interrupt latency. It does not need to support virtual memory or an MMU, and instead uses a simple memory protection unit (MPU).

(3) M: The Microcontroller profile defines an architecture for lower cost and lower performance systems in which low-latency interrupt handling is very important. It uses a different exception handling model from the other profiles and supports only a variant of the Thumb instruction set.

The profiles are not elaborated on further here, since most current designs are based on Armv8.

History and Extensions of the Architecture

From the first test silicon chips in the mid-1980s to the first Arm6 and Arm7 devices in the early 1990s, changes to the Arm architecture have been relatively minor.

● Version 1 of the architecture, implemented only by Arm1, defined most of the load, store, and arithmetic operations, together with the exception model and register set.

● Version 2 added multiplication and multiply-accumulate instructions, as well as support for coprocessors, along with some further innovations. These early processors only supported a 26-bit address space.

● Version 3 of the architecture separated the program counter register and the program status register, and added some new modes to support a 32-bit address space.

● Version 4 added half-word load and store operations, as well as an additional kernel-level privilege mode.

● The Armv4T architecture introduced the Thumb (16-bit) instruction set, which has been applied in the Arm7TDMI® and Arm9TDMI® processors, with billions of products shipped.

● The Armv5TE architecture added improved DSP-type operations and saturating arithmetic, as well as Arm/Thumb interworking.

● The Armv6 architecture introduced a number of enhancements, including support for unaligned memory access, significant changes to the memory architecture, multiprocessor support, and SIMD operations on bytes or half-words within 32-bit registers. It also offered several optional extensions, notably Thumb-2 and the Security Extensions (TrustZone). Thumb-2 extends Thumb to a mixed-length instruction set (16-bit and 32-bit).

● The Armv7-A architecture made the Thumb-2 extensions mandatory and added the Advanced SIMD extension (NEON).

Over the years, Arm numbered its processors sequentially: Arm9 followed Arm8, which in turn followed Arm7.

Throughout the Arm family, various additional numbers and letters are used to indicate different variants. For example, the Arm7TDMI processor uses T to indicate Thumb, D for Debug, M for fast multiplier, and I for embedded ICE.

For the Armv7 architecture, Arm adopted the Cortex brand name, with a suffix letter indicating which profile (A, R, or M) the processor supports.

The figure shows how different versions of the architecture correspond to different processor implementations.

Note that this figure is not comprehensive and does not include all versions of the architecture or processor implementations.

Comparison of Armv7 and Armv8:

CPU cores of Armv7 and Armv8:

All of this material is available on the official Arm website; a general understanding is sufficient here.

Next, we briefly introduce some basic structural units.

Some Basic Structural Units

1. DSP Multiply-Accumulate and Saturating Arithmetic Instructions

These instructions, added in the Armv5TE architecture, improve the performance of digital signal processing and multimedia software and are denoted by the letter E.

The new instructions provide many variants of signed multiply-accumulate, saturating addition and subtraction, and count leading zeros, and they are retained in later versions of the architecture. In many cases, they make it possible to eliminate a simple standalone DSP from the system.
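
To show what saturating addition and leading-zero counting mean in practice, here is a minimal Python model of the arithmetic. It is a sketch of the behavior only, not of the instruction encodings; the function names are hypothetical, and the sticky saturation flag that the hardware also sets is not modeled.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturate32(value):
    """Clamp an integer to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))


def saturating_add(a, b):
    """Signed 32-bit add that clamps at the limits instead of wrapping."""
    return saturate32(a + b)


def saturating_sub(a, b):
    """Signed 32-bit subtract that clamps at the limits instead of wrapping."""
    return saturate32(a - b)


def wrapping_add(a, b):
    """Ordinary 32-bit two's-complement add, shown for comparison."""
    result = (a + b) & 0xFFFFFFFF
    return result - (1 << 32) if result & 0x80000000 else result


def count_leading_zeros(value):
    """Number of zero bits above the highest set bit of a 32-bit value."""
    return 32 - (value & 0xFFFFFFFF).bit_length()


print(saturating_add(INT32_MAX, 1))  # stays at the maximum
print(wrapping_add(INT32_MAX, 1))    # wraps around to the minimum
print(count_leading_zeros(1))
```

Clamping is usually what signal-processing code wants: an overflowing audio sample that saturates produces mild clipping, whereas one that wraps flips sign and produces a loud artifact.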

2. Jazelle

Jazelle DBX (Direct Bytecode Execution) was introduced in Armv5TEJ to accelerate Java code while conserving power. Since then, increased memory availability and improvements in just-in-time (JIT) compilers have reduced its value in application processors. As a result, many Armv7-A processors do not implement this hardware acceleration.

Jazelle DBX is best suited to providing high-performance Java in systems with very limited memory, such as feature phones or low-cost embedded applications. In today’s systems, it is mainly used for backward compatibility.

3. Thumb Execution Environment (ThumbEE)

Introduced and required in Armv7-A, ThumbEE is sometimes referred to as Jazelle-RCT (Runtime Compilation Target). It makes minor changes to the Thumb instruction set so that it is a better target for code generated at run time in controlled environments (for example, by managed languages such as Java, Dalvik, C#, Python, or Perl).

ThumbEE is designed for use by just-in-time (JIT) or ahead-of-time (AOT) compilers, where it can reduce the size of the compiled code. The compilation of managed code is outside the scope of this book.

4. Thumb-2

Introduced in Armv6T2 and required in Armv7, Thumb-2 technology extends the original 16-bit Thumb instruction set to include 32-bit instructions.

The combined 16-bit and 32-bit Thumb instruction set achieves code density similar to the original Thumb instruction set, with performance similar to the 32-bit Arm instruction set. The resulting Thumb instruction set provides almost all the functionality of the Arm instruction set, as well as additional features.
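
Because 16-bit and 32-bit instructions are mixed in one stream, a decoder has to work out the length of each instruction from its first halfword. The sketch below models the rule given in the Armv7 Architecture Reference Manual: a first halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit instruction, and anything else is a complete 16-bit instruction. The function names and the sample halfwords are illustrative choices, not part of the article.

```python
def thumb_instruction_length(first_halfword):
    """Return the instruction size in bytes (2 or 4) from its first halfword."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a stream of halfwords into tuples, one tuple per instruction."""
    instructions = []
    i = 0
    while i < len(halfwords):
        count = thumb_instruction_length(halfwords[i]) // 2
        instructions.append(tuple(halfwords[i:i + count]))
        i += count
    return instructions


# Sample stream: a 16-bit instruction, a 32-bit instruction, a 16-bit instruction.
stream = [0x4770, 0xE92D, 0x4FF0, 0xE7FE]
for instruction in split_instructions(stream):
    print([hex(h) for h in instruction])
```

Note that the second halfword of a 32-bit instruction is never inspected for length; only the first halfword decides.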

5. Security Extensions (TrustZone)

The optional security extension TrustZone, introduced in Armv6K, has been implemented in all Arm Cortex-A processors. TrustZone provides a separate secure world, which isolates sensitive code and data from the normal world containing the operating system and applications.

Software in the secure world is therefore designed to provide security services to the normal (non-secure) world.

6. VFP

Prior to Armv7, the VFP extension was known as the Vector Floating Point architecture and was used for vector operations. VFP is an extension that implements single-precision and, optionally, double-precision floating-point arithmetic, compliant with the ANSI/IEEE standard for floating-point arithmetic (IEEE 754).

7. Advanced SIMD (NEON)

Arm NEON technology provides an implementation of the Advanced SIMD (single instruction, multiple data) instruction set with its own register file (shared with VFP). Some implementations have a separate NEON pipeline back end. NEON supports 8-bit, 16-bit, 32-bit, and 64-bit integers, as well as single-precision (32-bit) floating-point data, which can be operated on as vectors in 64-bit and 128-bit registers.
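
The register and element widths above determine how many values one instruction processes at a time. The following Python sketch works out the lane count for each combination and models a lane-wise integer add with wraparound. It is a plain software model under simplifying assumptions (unsigned lanes, hypothetical function names), not NEON code.

```python
def lane_count(register_bits, element_bits):
    """Number of elements that fit in one SIMD register."""
    if register_bits not in (64, 128):
        raise ValueError("register width must be 64 or 128 bits")
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element width must be 8, 16, 32, or 64 bits")
    return register_bits // element_bits


def lanewise_add(a, b, element_bits):
    """Add two equal-length vectors lane by lane, wrapping within each lane."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


for register_bits in (64, 128):
    for element_bits in (8, 16, 32, 64):
        print(register_bits, element_bits, lane_count(register_bits, element_bits))

# Eight 8-bit lanes in a 64-bit register; the first lane wraps past 255.
print(lanewise_add([250, 1, 2, 3, 4, 5, 6, 7], [10] * 8, 8))
```

The key property is that a carry out of one lane never spills into its neighbor, which is what distinguishes a SIMD add from a single wide integer add.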

8. Large Physical Address Extension (LPAE)

LPAE is an optional part of the v7-A architecture and has been implemented in Cortex-A7 and Cortex-A15 processors, allowing 32-bit processors to extend their usual maximum 4GB access space to 1TB of access space, achieved by converting 32-bit virtual memory addresses into 40-bit physical memory addresses.
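
The 4GB and 1TB figures follow directly from the address widths, as this short calculation shows. The helper name is hypothetical.

```python
GIB = 1 << 30
TIB = 1 << 40


def addressable_bytes(address_bits):
    """Bytes addressable with the given number of address bits."""
    return 1 << address_bits


print(addressable_bytes(32) // GIB)                    # 4   (GB with 32-bit addresses)
print(addressable_bytes(40) // TIB)                    # 1   (TB with 40-bit addresses)
print(addressable_bytes(40) // addressable_bytes(32))  # 256 (times larger)
```

The eight extra physical address bits multiply the reachable memory by 256, while each process still sees a 32-bit virtual address space.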

9. Virtualization

The Virtualization Extensions are an optional extension to the Armv7-A architecture. They provide hardware support for a virtual machine monitor (also known as a hypervisor) to switch from one operating system to another.

Whether implemented in a single-core or a multi-core processor, the Virtualization Extensions support running multiple virtual machines on one processor.

10. big.LITTLE Configuration

The big.LITTLE configuration was introduced in the Armv7 architecture to address the current industry challenge of how to create system-on-chips (SoCs) that are both high-performance and energy-efficient to extend battery life.

big.LITTLE pairs a high-performance Cortex-A15 processor with an energy-efficient Cortex-A7 processor: the Cortex-A15 handles heavy workloads, while the Cortex-A7 takes on most other tasks on a mobile device.

Key Points of the Arm Cortex-A Series Processor Architecture

Cortex-A series devices share many key features.

● 32-bit RISC processor with 16 × 32-bit visible registers and mode-based register banking;

● Modified Harvard architecture (separate, concurrent access to instructions and data);

● Load/store architecture;

● Thumb-2 technology as standard;

● VFP and NEON options, which are expected to become standard features in general-purpose application processors;

● Backward compatibility with code from previous Arm processors;

● 4GB virtual address space and a minimum of 4GB physical address space;

● Hardware translation table walking for virtual-to-physical address translation;

● Virtual page sizes of 4KB, 64KB, 1MB, and 16MB, with cache attributes and access permissions that can be set individually for each page;

● Support for big-endian and little-endian data access;

● Support for unaligned access for basic load/store instructions;

● Symmetric multiprocessing (SMP) support on MPCore™ variants, with full data coherency at the L1 cache level;

● Automatic propagation of cache and translation lookaside buffer (TLB) maintenance operations for efficient SMP operation;

● Physically indexed, physically tagged (PIPT) data caches.
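
As an illustration of the page sizes listed above, this sketch splits a 32-bit virtual address into a page number and an offset within the page for each size. It models only the arithmetic; real translation table formats are not shown, and the names and the sample address are hypothetical.

```python
PAGE_SIZES = {
    "4KB": 4 << 10,
    "64KB": 64 << 10,
    "1MB": 1 << 20,
    "16MB": 16 << 20,
}


def split_address(virtual_address, page_size):
    """Return (page_number, offset_in_page) for a power-of-two page size."""
    offset_bits = page_size.bit_length() - 1
    return virtual_address >> offset_bits, virtual_address & (page_size - 1)


def pages_in_4gb(page_size):
    """Number of pages needed to cover a 4GB virtual address space."""
    return (1 << 32) // page_size


address = 0x12345678  # arbitrary example address
for name, size in PAGE_SIZES.items():
    page, offset = split_address(address, size)
    print(f"{name:>4}: page {page:#x}, offset {offset:#x}, pages in 4GB: {pages_in_4gb(size)}")
```

Larger pages leave fewer bits to translate, so one translation covers more memory, at the cost of coarser control over attributes and permissions.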

Processors and Pipelines

Cortex-A Series Processors

Cortex-A5 Processor

The Cortex-A5 processor supports all architectural features of Armv7-A, including TrustZone security extensions and NEON media processing engine.

It has extremely high area and power efficiency but lower peak performance than other Cortex-A series processors. The Cortex-A5 processor has both single-core and multi-core versions.

As shown in the figure, the Cortex-A5 processor can, in some circumstances, issue a branch instruction together with the preceding non-branch instruction, and it includes sophisticated branch prediction logic to reduce the cost of flushing and refilling the pipeline on branches.

Support for NEON and floating-point hardware is optional. The Cortex-A5 processor also supports Arm and Thumb instruction sets, as well as Jazelle DBX and Jazelle-RCT technologies.

Cortex-A7 Processor

The Cortex-A7 multi-core processor is a high-performance, low-power processor that is fully compatible with other Cortex-A series processors mentioned in this book. The block diagram of the single-core Cortex-A7 processor is shown in the figure.

The Cortex-A7 processor includes all the features of the high-performance Cortex-A15 processor, including virtualization, the Large Physical Address Extension (LPAE), NEON, and AMBA 4 ACE coherency. The Cortex-A7 MPCore processor has the following features:

● Improved memory management and bus interfaces;

● LPAE, addressing up to 1TB of memory;

● Coherency between multi-core processor clusters using AMBA 4 technology;

● AMBA 4 Cache Coherent Interconnect (CCI) technology, which keeps caches coherent across multiple Cortex-A7 MPCore processors.

Cortex-A8 Processor

The Cortex-A8 processor is the first processor to implement the Armv7-A architecture. Many different devices are based on it, including the Samsung S5PC100, Texas Instruments OMAP3530, and Freescale i.MX515. It is used in a wide range of applications, and some implementations run at clock frequencies above 1 GHz.

Compared to previous Arm processors, the Cortex-A8 processor has a more complex microarchitecture.

The figure is a block diagram of the single-core Cortex-A8 processor, showing its internal structure, including the pipelines.

The separate level 1 instruction and data caches are each 16KB or 32KB in size. They are supplemented by an integrated, unified level 2 cache of up to 1MB. Both the level 1 and level 2 caches have a 128-bit wide data interface to the processor.

The level 1 data cache is virtually indexed but physically tagged, while the level 2 cache uses physical addresses for both index and tag. By default, data used by NEON is not allocated to L1 (although NEON can read and write data that is already in the L1 data cache).
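
To make the idea of a cache index concrete, the sketch below computes which address bits select a cache set for the two level 1 cache sizes mentioned above. The associativity (4 ways) and line length (64 bytes) are assumptions chosen for illustration; they are not stated in this article, and the function names are hypothetical.

```python
def cache_geometry(cache_bytes, ways, line_bytes):
    """Return (number_of_sets, offset_bits, index_bits) for a set-associative cache."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def set_index(address, cache_bytes, ways, line_bytes):
    """Select the cache set that the given address maps to."""
    sets, offset_bits, _ = cache_geometry(cache_bytes, ways, line_bytes)
    return (address >> offset_bits) & (sets - 1)


WAYS = 4         # assumed associativity, for illustration only
LINE_BYTES = 64  # assumed line length, for illustration only

for cache_bytes in (16 << 10, 32 << 10):
    sets, offset_bits, index_bits = cache_geometry(cache_bytes, WAYS, LINE_BYTES)
    top_bit = offset_bits + index_bits - 1
    print(f"{cache_bytes >> 10}KB: {sets} sets, index uses address bits [{top_bit}:{offset_bits}]")

print(set_index(0x12345678, 32 << 10, WAYS, LINE_BYTES))
```

With these assumed parameters, the 16KB cache is indexed entirely by bits inside the 4KB page offset, where virtual and physical addresses agree. The 32KB cache also uses bit 12, which can differ between a virtual address and its physical translation, and that is why the choice between virtual and physical indexing matters.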

Cortex-A9 Processor

The Cortex-A9 MPCore processor and single-core Cortex-A9 processor provide higher performance than the Cortex-A5 and Cortex-A8 processors, supporting technologies such as Arm, Thumb, Thumb-2, TrustZone, Jazelle RCT, and DBX. The block diagram of the single-core Cortex-A9 processor is shown in the figure.

The level 1 cache system provides hardware support for cache coherency between one and four processors, for use by multi-core software. Arm provides an external level 2 cache controller (L2C-310, formerly known as PL310), supporting cache sizes of up to 8MB.

The processor also includes an integrated interrupt controller, implementing the Arm Generic Interrupt Controller (GIC) architecture specification, which can be configured to support up to 224 interrupt sources. Devices integrating the Cortex-A9 processor include nVidia’s dual-core Tegra-2, ST’s SPEAr1300, and TI’s OMAP4 platform.

Cortex-A15 Processor

The Cortex-A15 MPCore processor is currently the highest-performing Arm processor (as this book is nearing completion, Arm has released the updated Cortex-A57) and is compatible with the applications of other Arm processors described in this book.

The Cortex-A15 MPCore processor introduces several new features, including support for full hardware virtualization and large physical address extensions (LPAE), allowing addressing of up to 1TB of memory, as shown in the figure.

The Cortex-A15 MPCore processor has the following features:

● Out-of-order superscalar pipeline;

● Tightly coupled low-latency level 2 cache (up to 4MB in size);

● Improved floating-point and NEON media performance;

● Full hardware virtualization;

● Large physical address extension (LPAE) addressing up to 1TB of memory;

● Error correction capability for fault tolerance and soft-error recovery;

● Coherency between multi-core processor clusters using AMBA 4 technology;

● AMBA 4 Cache Coherent Interconnect (CCI), which allows full cache coherency between multiple Cortex-A15 MPCore processors.

Qualcomm’s Scorpion

Arm is not the only company that designs processors compatible with the Armv7-A instruction set architecture. In 2005, Qualcomm announced that it was creating its own implementation under license from Arm, named Scorpion.

The Scorpion processor is part of Qualcomm’s Snapdragon platform, which integrates the main functions needed by netbooks, smartphones, or other mobile internet devices.

Qualcomm has made relatively little information public, although it has mentioned that Scorpion has many similarities with the Cortex-A8 processor. It is an implementation of the Armv7-A architecture, is superscalar and dual-issue, and supports VFP and NEON (referred to as the VeNum media processing engine in Qualcomm’s press releases).

There are also many differences. Scorpion can process 128 bits of data in parallel in its NEON implementation. It has a 13-stage load/store pipeline and two integer pipelines: a 10-stage pipeline that executes only simple arithmetic instructions (such as addition and subtraction), and a 12-stage pipeline that executes all data processing operations, including multiplication.

Scorpion also has a 23-stage floating-point/SIMD pipeline, and its VFPv3 operations are pipelined.

This content is somewhat dated, since most current designs are based on the Armv8 architecture. However, the architecture is backward compatible, so much of the material still applies.