1. Computer Architecture
Before understanding computer architecture, let’s first get to know a few key figures who made significant contributions to the invention of computers.
1. Charles Babbage
Known as the father of the mechanical computer, Babbage was a British mathematician and inventor who built the first difference engine, achieving a calculation precision of 6 decimal places. He later designed a difference engine with 20-digit precision, a pinnacle of mechanical design.
Between 1985 and 1991, to mark the 200th anniversary of Babbage's birth, the Science Museum in London built a working Difference Engine No. 2 from his 1849 design, using only manufacturing techniques available in the 19th century.
Babbage is often regarded as one of the greatest minds of his century, and part of his brain is preserved at the Science Museum in London.
Ada Lovelace, sometimes called the “grandmother of programmers,” proposed concepts such as program loops and branches while working with Babbage, concepts we now take for granted in programming.
2. Alan Turing
Turing is known as the father of computer science and artificial intelligence. In 1931 he entered King’s College, Cambridge, and later earned a PhD at Princeton University in the USA. He then returned to England, and after the outbreak of World War II he worked at Bletchley Park, helping the military break the German Enigma cipher, work that contributed to the Allied victory. Turing also made major contributions to artificial intelligence, proposing a test, now known as the Turing Test, for judging whether a machine possesses intelligence; competitions based on it have been held for many years.
The COLOSSUS machine, built in 1943 at the code-breaking establishment where Turing served during the war, incorporated some of his ideas. It used 1,500 vacuum tubes and a photoelectric reader, took input from punched tape, and employed electronic bistable (flip-flop) circuits to perform counting, binary arithmetic, and Boolean logic operations. Ten COLOSSUS machines were produced in total, and they excelled at code-breaking.
I highly recommend the film “The Imitation Game,” based on Turing’s story, as a way to appreciate this genius’s extraordinary life.
3. John von Neumann – “The Computer and the Brain”
There are mainly two types of computer architecture: Harvard architecture and von Neumann architecture. Most modern computers are based on the von Neumann architecture.
I personally believe that von Neumann was the “smartest person of the last century, no exceptions.”
His achievements are too numerous to list here and are easy to look up; many of them are so advanced that even I cannot fully comprehend them, game theory being the only one I am somewhat familiar with.
This article discusses only his contributions to computing. [In fact, computers were not von Neumann’s greatest achievement, and he did not spend much of his time and energy on computer research.]
In October 1955, von Neumann was diagnosed with cancer. Near the end of his life, he wrote a series of lectures on the relationship between the human nervous system and computers. They were published in 1958 under the title “The Computer and the Brain.”
Approaching the nervous system from a mathematical perspective, mainly logic and statistics, von Neumann discussed topics such as stimulus, response, and memory, proposed that the nervous system has both digital and analog characteristics, and explored its control and logical structure.
4. Von Neumann Architecture
The core of the von Neumann architecture is “stored program, sequential execution,” which requires that a computer have the following capabilities (a toy illustration in C follows the list):
- Send the required programs and data into the computer;
- Store programs, data, intermediate results, and final results for the long term;
- Perform various arithmetic and logical operations and data transfers;
- Control the program flow as needed and coordinate the operation of the machine's components according to the instructions;
- Output the processing results to users as required.
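To make “stored program, sequential execution” concrete, here is a minimal toy machine in C. Everything about it, the opcodes, the two-word instruction encoding, and the memory layout, is invented for illustration; the point is simply that instructions and data sit in the same memory array and are fetched and executed one after another.

```c
#include <stdio.h>

/* Toy von Neumann machine: one array holds BOTH the program and
 * its data ("stored program"), and a loop fetches and executes
 * instructions one after another ("sequential execution").
 * The opcodes and two-word encoding are invented for this sketch. */

enum { HALT, LOAD, ADD, STORE };   /* invented opcodes */

int memory[16] = {
    /* program: each instruction is two words, {opcode, address} */
    LOAD,  10,   /* acc = memory[10]  */
    ADD,   11,   /* acc += memory[11] */
    STORE, 12,   /* memory[12] = acc  */
    HALT,  0,
    0, 0,        /* padding           */
    3, 4, 0,     /* data at addresses 10..12 */
    0, 0, 0
};

int main(void) {
    int pc = 0;    /* program counter */
    int acc = 0;   /* accumulator     */

    for (;;) {
        int op   = memory[pc];      /* fetch the opcode            */
        int addr = memory[pc + 1];  /* fetch the operand field     */
        pc += 2;                    /* advance to next instruction */

        if      (op == HALT)  break;
        else if (op == LOAD)  acc = memory[addr];
        else if (op == ADD)   acc += memory[addr];
        else if (op == STORE) memory[addr] = acc;
    }
    printf("result at memory[12] = %d\n", memory[12]); /* prints 7 */
    return 0;
}
```

Note that the program and its operands live in the very same array and are addressed the same way; that uniformity is the defining trait of the von Neumann model.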
5. Harvard Architecture
The von Neumann and Harvard architectures differ in how they organize storage.
A von Neumann machine keeps program and data in a single, uniformly addressed memory, while a Harvard machine stores and addresses them separately.
6. Which processors are Harvard Architecture and von Neumann Architecture?
“Harvard Architecture”
MCUs (microcontrollers) are almost all Harvard architecture: the widely used 8051 family and typical STM32 microcontrollers (based on the ARM Cortex-M series) are Harvard machines.
“Von Neumann Architecture”
PC and server chips (such as those from Intel and AMD) and ARM Cortex-A series embedded chips (for example, the Samsung Exynos 4412 with its ARM Cortex-A9 cores, and Huawei’s Kirin 970 mobile chip) are von Neumann architecture. These systems need large amounts of memory, so their working memory is DRAM, which suits the von Neumann model.
“Hybrid Architecture”
In fact, modern CPUs (more accurately called SoCs) are not purely Harvard or von Neumann architectures but rather hybrid architectures.
For example, the Samsung Exynos 4412 uses the ARM Cortex-A9 core, and a typical development board built around it carries 1024 MB of DDR SDRAM and 8 GB of eMMC.
During normal operation, all programs and data are loaded from eMMC into DDR: whether instructions or data, everything is stored in eMMC, lives in DDR during execution, and reaches the CPU through the cache and registers. This is a typical von Neumann arrangement.
However, the Exynos 4412 also contains 64KB of iROM and 64KB of iRAM, which are used to boot the SoC. After power-on, the chip first executes the code stored in its internal iROM, behaving like an MCU, with the iROM as its flash and the iRAM as its SRAM, which is a typical Harvard arrangement.
This is a deliberate hybrid design rather than a pure one; the point of the hybrid approach is simply to combine the strengths of both architectures.
No matter if it’s a white cat or a black cat, a good cat is one that solves problems.
2. Computer Composition
The computer system = hardware system + software system. Hardware is the material foundation of the computer system, while software is the soul of the computer system. Hardware and software are complementary and inseparable.
1). Input Devices
The task of input devices is to send the programs and raw data prepared by people into the computer and convert them into a form that the computer can recognize and accept. Common examples include keyboards, mice, and scanners.
2). Output Devices
The task of output devices is to send the processing results of the computer in a form that can be accepted by humans or other devices. Common examples include monitors, printers, and plotters.
3). Memory, CPU
See Section 3
4). Computer Bus Structure
The hardware system of a computer is constructed by connecting its major components in a certain way.
The system bus consists of three different functional buses: Data Bus (DB), Address Bus (AB), and Control Bus (CB).
Data Bus (DB) is used to transmit data information. The bit width of the data bus is an important indicator of the microcomputer, usually consistent with the word length of the microprocessor. For example, the Intel 8086 microprocessor has a word length of 16 bits, and its data bus width is also 16 bits.
Address Bus (AB) is used exclusively to transmit addresses. The width of the address bus determines the size of the memory space the CPU can address directly. For instance, an 8-bit microcomputer with a 16-bit address bus can address at most 2^16 = 64KB, whereas a 16-bit microcomputer with a 20-bit address bus can address 2^20 = 1MB.
Control Bus (CB) is used to transmit control signals and timing signals. Control signals include those sent from the microprocessor to memory and I/O interface circuits, such as read/write signals, chip-select signals, and interrupt-acknowledge signals; there are also signals fed back to the CPU from other components, such as interrupt requests, reset signals, bus requests, and ready signals. The specifics of the control bus depend on the CPU.
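As a quick check of the arithmetic above, the short C program below computes the addressable space for several address-bus widths (assuming byte addressing; the widths are just illustrative):

```c
#include <stdio.h>

/* Addressable space = 2^width bytes, assuming byte addressing. */
int main(void) {
    int widths[] = { 16, 20, 32 };
    for (int i = 0; i < 3; i++) {
        unsigned long long bytes = 1ULL << widths[i];
        printf("%2d-bit address bus -> %llu bytes (%llu KB)\n",
               widths[i], bytes, bytes / 1024);
    }
    return 0;
}
/* 16-bit -> 64 KB, 20-bit -> 1024 KB (1 MB), 32-bit -> 4 GB */
```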
3. CPU Working Principle
The CPU mainly consists of the arithmetic logic unit (ALU) and the control unit.
1) Memory
Memory is used to store programs and data; it is the device that makes the computer’s “stored program” control possible.
It includes cache, main memory, and auxiliary storage.
“Cache” is directly accessible by the CPU and holds the active portions of the currently executing program, so that instructions and data can be supplied to the CPU quickly (a small C demonstration of its effect closes this subsection).
“Main Memory” is directly accessible by the CPU and is used to store the currently executing program and data.
“Auxiliary Storage” sits outside the main unit and cannot be accessed directly by the CPU; it stores programs and data that are not currently running, which must be transferred to main memory when needed.
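The speed gap between cache and main memory is easy to observe from ordinary C code. The sketch below sums the same two-dimensional array twice, once row by row and once column by column; the row-major walk touches memory sequentially and hits the cache far more often, so on most machines it runs several times faster. Exact timings vary with hardware, and the 4096x4096 array size is chosen only to exceed typical cache sizes.

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static int a[N][N];   /* 64 MB of ints: large enough to defeat the cache */

int main(void) {
    long long sum = 0;
    clock_t t0;

    /* Fill the array so the compiler cannot optimize the reads away. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    /* Row-major walk: consecutive addresses, cache-friendly. */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-major:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Column-major walk: jumps N ints per access, cache-hostile. */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    printf("column-major: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return (int)(sum & 1);   /* use sum so the loops are kept */
}
```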
2) Arithmetic Logic Unit
The arithmetic unit is built around the arithmetic logic unit (ALU) proper and several registers (such as the accumulator and temporary registers).
The ALU can perform arithmetic operations (including basic operations like addition, subtraction, and multiplication) and logical operations (including shifts, logical tests, and comparisons of two values). Relative to the control unit, the arithmetic unit is subordinate: every operation it performs is directed by control signals from the control unit, making it the execution component.
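Here is a minimal sketch of that idea in C: a toy ALU that applies whichever operation a control code selects and reports a status flag back. The operation codes and the zero flag are invented for illustration, not any real CPU’s control signals.

```c
#include <stdio.h>

/* Toy ALU: the "control signal" op selects which operation is applied
 * to the two inputs. Opcodes and the zero flag are invented here. */
typedef enum { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_SHL } AluOp;

unsigned alu(AluOp op, unsigned a, unsigned b, int *zero_flag) {
    unsigned r = 0;
    switch (op) {
        case ALU_ADD: r = a + b;  break;
        case ALU_SUB: r = a - b;  break;
        case ALU_AND: r = a & b;  break;
        case ALU_OR:  r = a | b;  break;
        case ALU_SHL: r = a << b; break;
    }
    *zero_flag = (r == 0);   /* status flag fed back to the control unit */
    return r;
}

int main(void) {
    int z;
    unsigned r = alu(ALU_ADD, 3, 4, &z);
    printf("3 + 4 = %u, zero flag = %d\n", r, z);
    r = alu(ALU_SUB, 5, 5, &z);
    printf("5 - 5 = %u, zero flag = %d\n", r, z);
    return 0;
}
```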
3) Control Unit
The control unit is the command center of the entire CPU, consisting of the program counter (PC), instruction register (IR), instruction decoder (ID), and operation controller (OC), and is crucial for coordinating the orderly operation of the computer.
It retrieves instructions from memory one by one, as laid down by the user’s program, and places each in the instruction register (IR). Through instruction decoding (analysis) it determines what operation should be performed, then has the operation controller (OC) send micro-operation control signals to the relevant components according to the required timing. The operation controller (OC) mainly comprises a pulse generator, control matrix, clock pulse generator, reset circuit, and start/stop circuit, among other control logic.
4) Summary of CPU Operating Principles
Driven by the timing pulses, the control unit places the instruction address held in the program counter (an address in memory) on the address bus, after which the CPU reads the instruction at that address into the instruction register and decodes it.
For the data needed while executing the instruction, the data’s address is likewise placed on the address bus, and the CPU reads the data into its internal registers for temporary storage, finally directing the arithmetic unit to process it.
This process continues repeatedly.
5) Instruction Execution Process
The execution of an instruction typically includes the following 4 steps:
1. Fetch instruction: the CPU’s controller reads an instruction from memory and places it in the instruction register.
2. Decode instruction: the instruction in the instruction register is decoded to determine what operation it should perform (the opcode) and where the operands are (the operand addresses).
3. Execute instruction, divided into two phases: “fetch operands” and “perform the operation.”
4. Update the instruction counter, determining the address of the next instruction.
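The four steps map directly onto code. The sketch below runs one pass of the cycle over a toy 16-bit instruction word, opcode in the high byte and operand address in the low byte; this encoding is invented for illustration and is far simpler than any real instruction set.

```c
#include <stdio.h>
#include <stdint.h>

/* Toy 16-bit instruction word: opcode in the high byte,
 * operand address in the low byte (an invented encoding). */
#define OP_ADD 0x02            /* invented: acc += mem[addr] */

uint16_t mem[256];             /* unified memory for code and data */
uint16_t pc  = 0;              /* instruction counter              */
uint16_t acc = 0;              /* accumulator                      */

void step(void) {
    /* 1. Fetch: read the word at PC into the instruction register. */
    uint16_t ir = mem[pc];

    /* 2. Decode: split it into opcode and operand address. */
    uint8_t op   = ir >> 8;
    uint8_t addr = ir & 0xFF;

    /* 3. Execute: fetch the operand and perform the operation. */
    if (op == OP_ADD)
        acc += mem[addr];

    /* 4. Update the instruction counter for the next instruction. */
    pc += 1;
}

int main(void) {
    mem[0]  = (OP_ADD << 8) | 10;  /* the instruction: acc += mem[10] */
    mem[10] = 42;                  /* the operand                     */
    step();
    printf("acc = %d\n", acc);     /* prints 42 */
    return 0;
}
```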
6) ARM Technical Features
The success of ARM is attributed to its unique company operation model and, of course, the excellent performance of ARM processors. As an advanced RISC processor, ARM processors have the following characteristics:
- Small size, low power consumption, low cost, high performance.
- Supports the Thumb (16-bit) / ARM (32-bit) dual instruction sets and interoperates well with 8-bit/16-bit devices.
- Uses registers extensively, making instruction execution faster.
- Performs most data operations within registers.
- Offers flexible, simple addressing modes with high execution efficiency.
- Fixed instruction length.
The concept of a RISC microprocessor, and how it differs from CISC, deserves a word of explanation; see the ARM Instructions section below.
7) Development of ARM Architecture
Architecture is defined as the instruction set (ISA) and the programming model of processors based on that architecture. There can be multiple processors based on the same architecture, each differing in performance and application focus, but all implementations must adhere to this architecture. The ARM architecture provides high system performance for embedded system developers while maintaining excellent power consumption and efficiency.
The ARM architecture is steadily evolving to meet the general needs of ARM partners and design fields. Currently, the ARM architecture defines 8 versions, expanding the functionality of the instruction set from version 1 to version 8. Different series of ARM processors have significant performance differences and varied applications, but if they belong to the same ARM architecture, the application software based on them is compatible.
Let’s briefly introduce the V7/V8 architectures.
ARMv7 Architecture
The ARMv7 architecture was developed based on the ARMv6 architecture. This architecture adopts Thumb-2 technology, which is developed from ARM’s Thumb code compression technology while maintaining full code compatibility with existing ARM solutions. Thumb-2 technology uses 31% less memory than pure 32-bit code, reducing system overhead while providing 38% better performance than existing Thumb-based solutions. The ARMv7 architecture also incorporates NEON technology, enhancing DSP and media processing capabilities by nearly 4 times, and supports improved floating-point operations to meet the needs of next-generation 3D graphics, game physics applications, and traditional embedded control applications.
ARMv8 Architecture
The ARMv8 architecture was developed from the 32-bit ARM architecture and is aimed primarily at product areas that require extended virtual addressing and 64-bit data processing, such as enterprise applications and high-end consumer electronics. The ARMv8 architecture includes two execution states: AArch64 and AArch32. The AArch64 execution state targets 64-bit processing and introduces a new instruction set, A64, that can access a large virtual address space, while the AArch32 execution state supports the existing ARM instruction set. Most features of the ARMv7 architecture are retained or further expanded in ARMv8, such as TrustZone technology, virtualization technology, and NEON advanced SIMD technology.
8) ARM Microprocessor Architecture
ARM cores adopt the RISC architecture. The main features of the ARM architecture are as follows:
- Uses a large number of registers, all of which can serve multiple purposes.
- Adopts a Load/Store architecture.
- Allows every instruction to be conditionally executed (see the sketch after this list).
- Provides multi-register Load/Store instructions.
- Can complete a shift operation and an ALU operation within a single clock cycle.
- Extends the ARM instruction set through coprocessor instructions, adding new registers and data types to the programming model.
- If the Thumb instruction set is counted as part of the ARM architecture, instructions can also be represented in a high-density 16-bit compressed form.
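Conditional execution is worth a small illustration. In C we write an ordinary comparison; an ARM compiler can translate it into conditionally executed instructions instead of a branch. The assembly in the comment is the classic textbook translation; actual compiler output depends on the compiler, options, and target.

```c
/* if (a > b) x = a; else x = b;
 *
 * On 32-bit ARM this can compile, without any branch, to roughly:
 *     CMP   r0, r1      ; compare a with b, set the flags
 *     MOVGT r2, r0      ; x = a, executed only if a > b
 *     MOVLE r2, r1      ; x = b, executed only if a <= b
 */
int max_of(int a, int b) {
    return (a > b) ? a : b;
}
```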
9) ARM Instructions
The ARM instruction set is a RISC (Reduced Instruction Set Computing) design, which focuses on simplifying the processor’s structure and speeding up execution. RISC keeps the most frequently used simple instructions, discards complex ones, fixes the instruction length, reduces the number of instruction formats and addressing modes, and minimizes or eliminates microcode control. These characteristics make RISC very well suited to embedded processors.
RISC makes it possible to build extremely fast microprocessors from relatively few transistors. Research shows that only about 20% of instructions account for most of what programs actually execute, so minimizing the executable instruction set and optimizing the execution of those instructions can greatly improve processing speed.
Generally, RISC processors are 50%-75% faster than equivalent CISC (Complex Instruction Set Computer) processors, and RISC processors are easier to design and debug.
The general instruction format is as follows:
“Opcode:” The opcode refers to symbols in assembly language such as mov, add, jmp, etc.;
“Operand Address:” Indicates where the operands required by the instruction are located, whether in memory or in the CPU’s internal registers.
In reality, the machine instruction format is far more complex than this; the following diagram shows commonly used ARM instruction formats:
Of these machine instruction formats, we will select only a few for analysis; most readers need not spend much energy on machine encodings, and a general understanding is sufficient.
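As a first taste, here is a sketch that pulls the fields out of one 32-bit ARM data-processing instruction. The word 0xE2810001 encodes ADD R0, R1, #1; the field positions follow the published A32 data-processing format, but treat this as an illustration, not a complete decoder (other instruction classes use different layouts).

```c
#include <stdio.h>

/* Field layout of an A32 data-processing instruction:
 * cond[31:28] | 00 | I[25] | opcode[24:21] | S[20]
 *             | Rn[19:16] | Rd[15:12] | operand2[11:0]
 * 0xE2810001 encodes "ADD R0, R1, #1". */
int main(void) {
    unsigned insn = 0xE2810001u;

    unsigned cond   = (insn >> 28) & 0xF; /* 0xE = AL, execute always  */
    unsigned i_bit  = (insn >> 25) & 0x1; /* 1: operand2 is immediate  */
    unsigned opcode = (insn >> 21) & 0xF; /* 0x4 = ADD                 */
    unsigned s_bit  = (insn >> 20) & 0x1; /* 0: do not update flags    */
    unsigned rn     = (insn >> 16) & 0xF; /* first operand register R1 */
    unsigned rd     = (insn >> 12) & 0xF; /* destination register R0   */
    unsigned op2    = insn & 0xFFF;       /* the immediate value 1     */

    printf("cond=%X I=%u opcode=%X S=%u Rn=R%u Rd=R%u op2=%u\n",
           cond, i_bit, opcode, s_bit, rn, rd, op2);
    return 0;
}
```

Notice how the fixed 32-bit length and regular field positions, both RISC traits listed above, make the decode nothing more than shifts and masks.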
Understanding the CPU at a macro level is sufficient at this stage; in later stages we will gradually introduce ARM register modes, exceptions, addressing, assembly instructions, and writing assembly embedded in C code.
4. SoC
SoC: A system on a chip is an integrated circuit that integrates a computer or other electronic system onto a single chip. SoCs can process digital signals, analog signals, mixed signals, and even radio-frequency signals.
Narrowly speaking, an SoC is the core chip of an information system, integrating the system’s key components onto one chip; broadly speaking, an SoC is a miniature system. If the CPU is the brain, then the SoC is the brain together with the heart, eyes, and hands of the system.
1. ARM-based SoC
System-on-chips are often used in embedded systems. Their integration scale is large, generally reaching millions to tens of millions of gates. SoCs are relatively flexible, allowing an ARM processor core to be integrated with dedicated peripheral circuits to form a complete system.
The following diagram is a typical ARM-based SoC architecture diagram.
A typical ARM-based SoC architecture usually includes the following major components:
- ARM processor core
- Clocks and reset controller
- Interrupt controller
- ARM peripherals
- GPIO
- DMA port
- External memory interface
- On-chip RAM
- AHB and APB buses
Some ARM processors, such as the Hisi-3507 and Exynos-4412, are SoC systems; application processors in particular integrate many peripherals, providing strong support for executing more complex tasks and applications.
This architecture is the foundation for understanding assembly instructions and writing bare-metal programs.
When we receive the datasheet for a new SoC, we first consult this architecture diagram to check the SoC’s RAM size, clock frequencies, which peripheral controllers are included, how each peripheral controller works, the pin multiplexing of each peripheral, the SFR (special function register) addresses of each controller, how the interrupt controller manages the numerous interrupt sources, and so on.
2. Samsung Exynos 4412
In early 2012, Samsung officially launched its first quad-core mobile processor, the Exynos 4412.
The following is the SoC structure diagram of the Samsung Exynos 4412.
As shown in the diagram, the Exynos 4412 mainly includes the following modules:
- 4 Cortex-A9 processor cores (quad-core)
- 1MB L2 cache
- Interrupt Controller, managing all interrupt sources
- Interrupt Combiner, managing some interrupt sources within the SoC
- NEON, the ARM SIMD processing extension, which accelerates multimedia (video/audio) encoding and decoding, user interfaces, 2D/3D graphics, and games to enhance the multimedia experience
- DRAM, internal RAM, NAND flash, and SROM controllers for various storage devices
- SDIO, USB, I2C, UART, SPI buses, etc.
- RTC and watchdog timer
- Audio subsystem
- IIS (Inter-IC Sound) audio interface
- Power management
- Multimedia block
This new Exynos quad-core processor is built on a 32nm HKMG (High-K Metal Gate) process and supports dual-channel LPDDR2-1066. Samsung raised the graphics processing frequency from the previous 266MHz to 400MHz, and the press release claims an overall performance improvement of 60% over the existing dual-core model, with a 50% enhancement in image processing capability.
The international version of the Samsung Galaxy S III smartphone uses the Exynos 4412 processor.