Understanding ARM Boot Code: A Comprehensive Guide

13. Boot Code

This chapter discusses the boot code that runs on systems based on ARM processors, focusing on two different areas:

The code that runs immediately after the core is reset, operating on a so-called bare-metal system, meaning the code runs without an operating system. This scenario is typically encountered when the chip or system is powered on for the first time.
How the bootloader loads and runs the Linux kernel.

13.1 Booting a bare-metal system

When the core is reset, it will start executing at the reset vector location in the exception vector table (address 0x00000000 or 0xFFFF0000). The reset handler code must perform some or all of the following operations:

In multi-core systems, put non-master cores into sleep mode.
Initialize the exception vectors.
Initialize the memory system, including the MMU.
Initialize core mode stacks and registers.
Initialize any critical I/O devices.
Perform any necessary initialization for NEON or VFP.
Enable interrupts.
Switch core modes or states.
Handle any settings required for the secure world.
Call the main application (main()).

The first consideration is the placement of the exception vector table. You must ensure that it contains a valid set of instructions that branch to the corresponding handlers.
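To make the idea of a table of branch instructions concrete, the following host-side Python sketch computes the A32 encodings for a minimal initial vector table. It is only an illustration: the helper names and the reset handler address 0x40 are assumptions, and real boot code would write this table in assembly.

```python
# Sketch: encode ARM-state (A32) branch instructions for an initial vector table.
# The function names and the handler address used below are hypothetical.

VECTOR_NAMES = [
    "reset", "undefined", "svc", "prefetch_abort",
    "data_abort", "reserved", "irq", "fiq",
]


def encode_branch(from_addr, to_addr):
    """Return the A32 encoding of an unconditional B placed at from_addr."""
    # In ARM state the PC reads as the address of the instruction plus 8.
    offset = to_addr - (from_addr + 8)
    if offset % 4 != 0:
        raise ValueError("branch target must be word-aligned")
    if not -(1 << 25) <= offset < (1 << 25):
        raise ValueError("target is out of range for a B instruction")
    imm24 = (offset >> 2) & 0xFFFFFF
    return 0xEA000000 | imm24  # condition AL, opcode B


def build_vector_table(base, reset_handler):
    """Reset branches to the boot code; every other entry branches to itself."""
    table = []
    for index, name in enumerate(VECTOR_NAMES):
        addr = base + 4 * index
        target = reset_handler if name == "reset" else addr
        table.append(encode_branch(addr, target))
    return table


table = build_vector_table(base=0x00000000, reset_handler=0x00000040)
for name, word in zip(VECTOR_NAMES, table):
    print(f"{name:15s} 0x{word:08X}")
```

Because B is PC-relative, the same eight words work whether the table is seen at 0x00000000 or through an alias, provided the handler moves with it.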

The _start symbol in GNU assembler source tells the linker where the entry point of the code is, and it can be used to place the code in the vector table. The initial vector table is located in non-volatile memory. Apart from the reset vector, its entries can simply contain branch-to-self instructions, because no exceptions are expected at this point. Typically, the reset vector contains a branch to the boot code in ROM. The ROM can be aliased to the address of the exception vectors. The ROM code then writes to a memory-remap peripheral, which maps RAM to address 0, and the real exception vector table is copied into RAM. This means that the part of the boot code that handles remapping must be position-independent, because only PC-relative addressing can be used.

Next, you may need to initialize the stack pointers for the various modes that the application might use, for example FIQ and IRQ modes.

The next step is to set up the caches, the MMU, and the branch predictor. First disable the MMU and caches, and invalidate the caches and TLB. The sequence described here applies to the Cortex-A9 processor. Some Cortex-A processors automatically invalidate the L1 and/or L2 caches on reset, while others require manual invalidation. Check the technical reference manual (TRM) for the specific core to determine which behavior is implemented.
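The disable and enable steps described above act on individual bits of the System Control Register (SCTLR). The sketch below models that bit manipulation in Python so the masks are easy to check. The function names are illustrative assumptions. On hardware the value would be read and written through CP15 (register c1), and the cache and TLB invalidation operations are separate CP15 writes that are not modeled here.

```python
# Sketch: SCTLR bit handling for the "disable MMU and caches" step.
# Bit positions follow the ARMv7-A SCTLR layout; helper names are hypothetical.

SCTLR_M = 1 << 0    # MMU enable
SCTLR_C = 1 << 2    # data and unified cache enable
SCTLR_Z = 1 << 11   # branch prediction enable
SCTLR_I = 1 << 12   # instruction cache enable
SCTLR_V = 1 << 13   # high exception vectors (0xFFFF0000)


def disable_mmu_and_caches(sctlr):
    """Clear the M, C and I bits, leaving all other bits unchanged."""
    return sctlr & ~(SCTLR_M | SCTLR_C | SCTLR_I) & 0xFFFFFFFF


def enable_branch_prediction(sctlr):
    """Set the Z bit, leaving all other bits unchanged."""
    return sctlr | SCTLR_Z


value = SCTLR_V | SCTLR_I | SCTLR_C | SCTLR_M   # example starting value
value = disable_mmu_and_caches(value)
value = enable_branch_prediction(value)
print(f"SCTLR after boot-time setup: 0x{value:08X}")
```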

The TLB of the MMU must be invalidated. The branch target prediction hardware may not need explicit invalidation, but it must be enabled by the boot code. At this point it is safe to enable branch prediction, which improves performance.

After that, you can create some translation tables. Here, ttb_address denotes the address of the initial translation table. This must be a 16KB area of memory whose start address is aligned to a 16KB boundary; the code writes the L1 translation table into it.

If there is an L2 cache and the system runs without an operating system, it may also be necessary to invalidate and enable the L2 cache at this point. Additionally, access to NEON or VFP must be enabled. If the system uses the TrustZone Security Extensions, it may be necessary to switch to the Normal world once the Secure world is initialized.
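The 16KB figure for the L1 translation table follows from its layout: 4096 entries of 4 bytes, each mapping a 1MB section. The Python sketch below builds a flat (virtual equals physical) table to show the arithmetic. It assumes the ARMv7-A short-descriptor format with full-access permissions and domain 0; the function names and the example ttb_address value are hypothetical, and memory attributes are deliberately left out.

```python
# Sketch: a flat-mapped L1 translation table made of 1MB section entries.
# Assumes the ARMv7-A short-descriptor format; helper names are hypothetical.

SECTION_SIZE = 1 << 20               # each L1 section entry maps 1MB
NUM_ENTRIES = 4096                   # 4096 x 1MB covers the 4GB address space
TABLE_BYTES = NUM_ENTRIES * 4        # 4 bytes per entry = 16KB


def section_entry(phys_base):
    """Section descriptor: bits[1:0]=0b10, AP[1:0]=0b11 (full access), domain 0."""
    return (phys_base & 0xFFF00000) | (0b11 << 10) | 0b10


def build_flat_l1_table(ttb_address):
    """Return the 4096 entries that would be written starting at ttb_address."""
    if ttb_address % TABLE_BYTES != 0:
        raise ValueError("ttb_address must be aligned to a 16KB boundary")
    return [section_entry(index * SECTION_SIZE) for index in range(NUM_ENTRIES)]


def translate(table, virt_addr):
    """Model the section lookup: index by VA[31:20], keep the low 20 bits."""
    entry = table[virt_addr >> 20]
    return (entry & 0xFFF00000) | (virt_addr & 0x000FFFFF)


table = build_flat_l1_table(ttb_address=0x80004000)   # example address only
print(f"table size: {len(table) * 4} bytes")
print(f"entry 1:    0x{table[1]:08X}")
```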

The next steps will depend on the specifics of the system. For example, it may be necessary to zero-initialize the memory area for uninitialized C variables, copy the initial values of other variables from the ROM image to RAM, and set up the stack and heap space for the application. It may also be necessary to initialize C library functions, call top-level constructors (for C++ code), and other standard embedded C initializations.

A common approach is to let a single core in the cluster perform system initialization, while the same code running on the other cores puts them to sleep, that is, into the WFI state. After core 0 has created a simple set of L1 translation table entries, the other cores can be woken up, because these entries can be used by all cores in the system. The boot code determines which core it is running on and either branches to the initialization code, if it is core 0, or goes to sleep. Typically, the secondary cores are woken up later by the SMP operating system.
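The core-selection decision can be sketched as follows. This Python model assumes a Cortex-A9 MPCore style MPIDR, where the CPU ID sits in bits [1:0]; other cores may use a wider affinity field, and the function names are illustrative only.

```python
# Sketch: decide between "initialize" and "sleep" from the MPIDR value.
# Assumes the CPU ID is in MPIDR bits [1:0], as on Cortex-A9 MPCore.

CPU_ID_MASK = 0x3


def core_id(mpidr):
    """Extract the CPU ID from a raw MPIDR value."""
    return mpidr & CPU_ID_MASK


def boot_action(mpidr):
    """Core 0 performs system initialization; all other cores wait in WFI."""
    return "initialize" if core_id(mpidr) == 0 else "wfi"


for mpidr in (0x80000000, 0x80000001, 0x80000002, 0x80000003):
    print(f"MPIDR 0x{mpidr:08X} -> core {core_id(mpidr)}: {boot_action(mpidr)}")
```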

13.2 Configuration

Many control register bits in the core are typically set by the boot code. For best performance, code must run with the MMU, the instruction and data caches, and branch prediction enabled. Translation table entries for all memory regions that are not peripheral I/O devices should be marked as L1 Cacheable and, by default, set to a read-allocate, write-back cache policy. For multi-core systems, pages must be marked as Shareable, and broadcasting of CP15 maintenance operations must be enabled.
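As an illustration of these attribute settings, the sketch below encodes 1MB section descriptors for normal memory and for device memory. It assumes the ARMv7-A short-descriptor format with TEX remap disabled, where TEX=000, C=1, B=1 selects write-back without write-allocate (that is, read-allocate) and TEX=000, C=0, B=1 selects Shareable Device memory. The function name and the addresses are hypothetical.

```python
# Sketch: memory attributes in a short-descriptor section entry.
# Assumes TEX remap is disabled (SCTLR.TRE = 0); names are hypothetical.

SECTION = 0b10            # bits[1:0]: section descriptor
BIT_B = 1 << 2            # Bufferable
BIT_C = 1 << 3            # Cacheable
BIT_XN = 1 << 4           # Execute-never
AP_FULL_ACCESS = 0b11 << 10
BIT_S = 1 << 16           # Shareable


def section_descriptor(phys_base, device=False, shareable=False):
    """Build a section entry for normal (cacheable) or device memory."""
    desc = (phys_base & 0xFFF00000) | SECTION | AP_FULL_ACCESS
    if device:
        # TEX=000, C=0, B=1: Shareable Device; never execute from peripherals.
        desc |= BIT_B | BIT_XN
    else:
        # TEX=000, C=1, B=1: write-back, read-allocate normal memory.
        desc |= BIT_C | BIT_B
        if shareable:
            desc |= BIT_S
    return desc


ram = section_descriptor(0x80000000, shareable=True)   # multi-core RAM page
uart = section_descriptor(0x10000000, device=True)     # peripheral region
print(f"RAM section:    0x{ram:08X}")
print(f"device section: 0x{uart:08X}")
```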

In addition to the CP15 registers required by the ARM architecture, the core typically has registers that control specific implementation features. Programmers of the boot code should refer to the relevant technical reference manual to use these registers correctly.

13.3 Booting Linux

It is useful to understand what happens from the first instruction executed after core reset (at the exception base address 0x00000000, or 0xFFFF0000 if high vectors, HIVECS, are selected) until the Linux command prompt appears.

Once the kernel is in memory, the boot sequence on an ARM-based system is similar to what happens on a desktop computer. However, the bootloading process that gets the kernel there can be very different, because ARM-based phones or more deeply embedded devices may lack a hard disk or a PC-style BIOS.

Typically, when the system is powered on, hardware-specific boot code runs from flash memory or ROM. This code initializes the system, including any necessary hardware peripherals, and then starts the bootloader (e.g., U-Boot). The bootloader initializes main memory and copies the compressed Linux kernel image into it (from a flash device, on-board memory, MMC, a host PC, or elsewhere). The bootloader passes certain initialization parameters to the kernel. The Linux kernel then decompresses itself, initializes its data structures, runs user processes, and finally starts the command-line environment. Let’s take a closer look at these processes.

13.3.1 Reset handler

There is usually a small piece of system-specific boot monitor code that configures the memory controller and initializes other system peripherals. It sets up the stack in memory and typically copies itself from ROM to RAM, then changes the hardware memory map so that RAM, rather than ROM, is mapped to the exception vector address. Essentially, this code is independent of the operating system that will run on the board and performs a function similar to a PC’s BIOS. When it has finished, it calls a Linux bootloader, such as U-Boot.

13.3.2 Bootloader

Linux requires a certain amount of code to run after reset to initialize the system and carry out the basic tasks needed to boot the kernel:

Initialize the memory system and peripherals.
Load the kernel image into the appropriate location in memory (which may also include an initial RAM disk).
Generate boot parameters to be passed to the kernel (including machine type).
Set up the console for the kernel (video or serial console).
Enter the kernel.

The specific steps vary between bootloaders, so for detailed information, refer to the documentation for the bootloader being used. U-Boot is a widely used example; other bootloaders include Apex, Blob, Bootldr, and RedBoot.

When the bootloader starts, it is typically not in main memory. It must first allocate a stack and initialize the core (e.g., by invalidating caches) and install itself into main memory. It must also allocate space for global data and the malloc() function and copy the exception vector entries to the appropriate location.

13.3.3 Initialize memory system

This code is highly dependent on the specific board or system. The Linux kernel takes no responsibility for configuring the RAM in the system. It is presented with the physical memory layout by the bootloader and has no other knowledge of the memory system. In many systems, the available RAM and its location are fixed, which makes the bootloader’s task relatively simple. In other systems, code must be written to determine the amount of RAM available.
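One common way to discover the amount of RAM is to write patterns at increasing power-of-two offsets and watch for address aliasing. The Python sketch below models that idea against a toy memory. It assumes the RAM size is a power of two and that accesses beyond the real size wrap around, which is not true of every memory controller; the class and function names are hypothetical, and a real probe would also save and restore the locations it overwrites.

```python
# Sketch: probing RAM size by detecting address wrap-around (aliasing).
# AliasedRam is a toy model; names and sizes here are assumptions.


class AliasedRam:
    """Toy RAM bank whose addresses wrap around at its real size."""

    def __init__(self, real_size):
        self.real_size = real_size
        self.cells = {}

    def write(self, offset, value):
        self.cells[offset % self.real_size] = value

    def read(self, offset):
        return self.cells.get(offset % self.real_size, 0)


def probe_ram_size(ram, max_size, step=1 << 20):
    """Return the detected RAM size in bytes, up to max_size."""
    marker = 0xA5A5A5A5
    ram.write(0, marker)
    size = step
    while size < max_size:
        ram.write(size, size)          # a distinct value for each offset
        if ram.read(0) != marker:      # offset 0 was overwritten: aliased
            return size
        size <<= 1
    return max_size


detected = probe_ram_size(AliasedRam(64 << 20), max_size=1 << 30)
print(f"detected {detected >> 20} MB of RAM")
```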

13.3.4 Kernel images

The kernel image generated by the build process is typically compressed, in zImage format (the conventional name for the bootable kernel image). Its head code contains a magic number, used to verify the integrity of the decompressed image, along with the start and end addresses. The kernel code is position-independent and can be placed anywhere in memory. Conventionally, it is placed at an offset of 0x8000 from the base of physical RAM. This leaves room for the parameter block, placed at an offset of 0x100, and for translation tables and so on.
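The header fields and placement offsets mentioned above can be illustrated with a short host-side check. This sketch assumes the usual little-endian ARM zImage layout, in which the magic value 0x016F2818 is stored at byte offset 0x24 and is followed by the start and end addresses. The function names and the synthetic header used for the demonstration are hypothetical.

```python
# Sketch: reading the zImage header and computing the conventional addresses.
# Assumes a little-endian zImage with the magic word at offset 0x24.
import struct

ZIMAGE_MAGIC = 0x016F2818
ZIMAGE_MAGIC_OFFSET = 0x24
KERNEL_OFFSET = 0x8000     # conventional kernel offset from the RAM base
PARAMS_OFFSET = 0x100      # conventional parameter block offset


def parse_zimage_header(image):
    """Return (start, end) from the header, or raise if the magic is wrong."""
    if len(image) < ZIMAGE_MAGIC_OFFSET + 12:
        raise ValueError("image is too short to contain a zImage header")
    magic, start, end = struct.unpack_from("<III", image, ZIMAGE_MAGIC_OFFSET)
    if magic != ZIMAGE_MAGIC:
        raise ValueError(f"bad zImage magic 0x{magic:08X}")
    return start, end


def placement(ram_base):
    """Return (parameter block address, kernel load address)."""
    return ram_base + PARAMS_OFFSET, ram_base + KERNEL_OFFSET


# Synthetic header for demonstration only; a real image comes from the build.
fake_image = bytes(ZIMAGE_MAGIC_OFFSET) + struct.pack("<III", ZIMAGE_MAGIC, 0, 0x3000)
start, end = parse_zimage_header(fake_image)
params_addr, kernel_addr = placement(0x80000000)
print(f"image spans 0x{start:X}..0x{end:X}")
print(f"params at 0x{params_addr:08X}, kernel at 0x{kernel_addr:08X}")
```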

Many systems require an initial RAM disk (initrd), because this provides a root filesystem without requiring other drivers to be set up first. The bootloader can place the initial RAM disk image in memory and pass its location to the kernel using ATAG_INITRD2 (a tag describing the physical location of the compressed RAM disk image) and ATAG_RAMDISK.

The bootloader typically sets up a serial port in the target device so that the kernel’s serial port driver can detect that port and use it as a console. In some systems, another output device (such as a video driver) can also be used as a console. The kernel command line parameter console= can be used to pass relevant information.

13.3.5 Kernel parameters using ATAGs

Historically, parameters were passed to the kernel as a tagged list placed in physical RAM, with register R2 holding the address of the list. Each tag header contains two 32-bit unsigned integers: the first gives the size of the tag in words, and the second gives the tag value, which indicates the type of tag. For a complete list of parameters that can be passed, consult the relevant documentation. Examples include ATAG_MEM, which describes the physical memory map, and ATAG_INITRD2, which describes where the compressed RAM disk image is located. The bootloader must also provide an ARM Linux machine type number (MACH_TYPE). This value can be hard-coded, or the boot code can inspect the available hardware and assign a value accordingly.
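The tagged list layout can be sketched on a host machine with Python's struct module. The tag identifiers below are the values used by ARM Linux; the helper names, the memory addresses, and the command line are hypothetical examples. The list starts with ATAG_CORE and ends with ATAG_NONE, whose size field is zero.

```python
# Sketch: building an ATAG list as little-endian 32-bit words.
# Tag IDs follow the ARM Linux definitions; helper names are hypothetical.
import struct

ATAG_NONE = 0x00000000
ATAG_CORE = 0x54410001
ATAG_MEM = 0x54410002
ATAG_INITRD2 = 0x54420005
ATAG_CMDLINE = 0x54410009


def make_tag(tag_id, payload=b""):
    """Prefix a payload with its header: size in words (header included), tag."""
    if len(payload) % 4 != 0:
        raise ValueError("tag payload must be a whole number of words")
    size_words = 2 + len(payload) // 4
    return struct.pack("<II", size_words, tag_id) + payload


def build_atag_list(mem_start, mem_size, cmdline, initrd=None):
    """Return the bytes of a tagged list; initrd is (start, size) or None."""
    out = make_tag(ATAG_CORE, struct.pack("<III", 0, 4096, 0))  # flags, page size, rootdev
    out += make_tag(ATAG_MEM, struct.pack("<II", mem_size, mem_start))
    if initrd is not None:
        initrd_start, initrd_size = initrd
        out += make_tag(ATAG_INITRD2, struct.pack("<II", initrd_start, initrd_size))
    text = cmdline.encode("ascii") + b"\0"
    text += b"\0" * (-len(text) % 4)        # pad the string to a word boundary
    out += make_tag(ATAG_CMDLINE, text)
    out += struct.pack("<II", 0, ATAG_NONE)  # terminator: size 0, tag 0
    return out


blob = build_atag_list(0x80000000, 0x10000000, "console=ttyAMA0")
print(f"{len(blob)} bytes, first tag = 0x{struct.unpack_from('<I', blob, 4)[0]:08X}")
```

The bootloader would copy these bytes to the parameter block address and pass that address in R2.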

There is a more flexible or generic method for passing this information using Flattened Device Trees (FDT).

13.3.6 Kernel parameters using Flattened Device Trees

Flattened Device Tree (FDT) support in Linux was initially introduced in the PowerPC kernel as part of the merge of the 32-bit and 64-bit kernels, with the aim of standardizing the firmware interface for all PowerPC platforms (servers, desktops, and embedded devices) on the Open Firmware interface. It has become the configuration method used in the Linux kernel for the PowerPC, MicroBlaze, and SPARC architectures.

A device tree is a data structure that describes the hardware configuration. It contains information about processors, memory size and blocks, interrupt configurations, and peripherals. This data structure is organized in a tree form, with the root node named /. Each node, except for the root node, has a parent node. Each node has a name and can have any number of child nodes. Nodes can also contain named property values for arbitrary data expressed in key-value pairs.

The device tree data format follows the conventions defined in the IEEE standard 1275. To simplify system descriptions, the device tree source format (.dts) is used to express device tree data. Device tree nodes must conform to the following syntax:

[label:] node-name[@unit-address] {
  [properties definitions]
  [child nodes]
}

Nodes are defined using names and unit addresses. Braces mark the start and end of the node definition.

You can use the Device Tree Compiler (DTC) tool to convert device tree source files (.dts) into device tree blob (.dtb) format. The dtb (or blob) is a flattened device tree: a firmware-independent description of the system in a compact format that does not require firmware calls to retrieve its properties. The dtb is loaded into memory before the operating system starts, so that the kernel can read it during boot.
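A flattened device tree begins with a fixed header of big-endian 32-bit fields, which makes a blob easy to sanity-check before handing it to the kernel. The sketch below parses that header, assuming the version 17 layout with ten fields and the magic value 0xD00DFEED. The function name and the synthetic blob are hypothetical; a real blob would be produced by DTC.

```python
# Sketch: validating the header of a device tree blob (dtb).
# Assumes the ten-field, big-endian header of FDT version 17.
import struct

FDT_MAGIC = 0xD00DFEED
FDT_HEADER = struct.Struct(">10I")
FDT_FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings",
    "off_mem_rsvmap", "version", "last_comp_version",
    "boot_cpuid_phys", "size_dt_strings", "size_dt_struct",
)


def parse_fdt_header(blob):
    """Return the header fields as a dict, or raise if this is not a dtb."""
    if len(blob) < FDT_HEADER.size:
        raise ValueError("blob is too short to contain an FDT header")
    header = dict(zip(FDT_FIELDS, FDT_HEADER.unpack_from(blob)))
    if header["magic"] != FDT_MAGIC:
        raise ValueError(f"bad FDT magic 0x{header['magic']:08X}")
    return header


# Synthetic, otherwise empty header for demonstration only.
fake_dtb = FDT_HEADER.pack(FDT_MAGIC, 40, 40, 40, 40, 17, 16, 0, 0, 0)
header = parse_fdt_header(fake_dtb)
print(f"version {header['version']}, total size {header['totalsize']} bytes")
```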

The chosen node is a placeholder for saving any additional environment information, such as kernel boot parameters and default consoles. The properties of the chosen node are typically defined by the bootloader, but .dts files can specify defaults.

In the root node description for the ARM Versatile platform board, the model and compatible properties are assigned the platform’s name, in the form <manufacturer>,<model>. This combined string serves as a unique identifier for the machine and must be defined in the top-level node.
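To show how the node syntax, the chosen node, and the root-node properties fit together, the following Python sketch renders a small tree as device tree source text. The Node class is a hypothetical helper, and the property values (including the bootargs string) are illustrative rather than a copy of any shipped board file.

```python
# Sketch: rendering a tiny device tree as .dts text.
# Node and format_value are hypothetical helpers; values are illustrative.


def format_value(value):
    """Strings become quoted text; sequences of ints become <cell lists>."""
    if isinstance(value, str):
        return f'"{value}"'
    if isinstance(value, (list, tuple)):
        return "<" + " ".join(f"0x{cell:x}" for cell in value) + ">"
    raise TypeError(f"unsupported property value: {value!r}")


class Node:
    def __init__(self, name, properties=None, children=None, label=None):
        self.name = name
        self.properties = properties or {}
        self.children = children or []
        self.label = label

    def to_dts(self, indent=0):
        pad = "\t" * indent
        head = f"{self.label}: {self.name}" if self.label else self.name
        lines = [f"{pad}{head} {{"]
        for key, value in self.properties.items():
            lines.append(f"{pad}\t{key} = {format_value(value)};")
        for child in self.children:
            lines.append(child.to_dts(indent + 1))
        lines.append(f"{pad}}};")
        return "\n".join(lines)


root = Node("/", {
    "model": "ARM Versatile PB",
    "compatible": "arm,versatile-pb",
}, children=[
    Node("memory", {"device_type": "memory", "reg": [0x0, 0x08000000]}),
    Node("chosen", {"bootargs": "console=ttyAMA0"}),
])

dts_text = "/dts-v1/;\n\n" + root.to_dts()
print(dts_text)
```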

13.3.7 Kernel entry

The kernel execution must begin when the core is in a fixed state. The bootloader calls the kernel image by directly jumping to its first instruction, which is located in arch/arm/boot/compressed/head.S at the start label. At this time, the MMU and data caches must be disabled, the core must be in Supervisor mode, and the I and F bits in CPSR must be set (IRQ and FIQ are disabled). The R0 register must contain 0, the R1 register contains the MACH_TYPE value, and the R2 register contains the address of the parameter tag list.
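These entry conditions can be written down as a checklist. The sketch below models the register and control state as plain integers and reports every violated condition. The function name, the machine type value, and the example addresses are assumptions for illustration; a real bootloader establishes this state in assembly immediately before the jump.

```python
# Sketch: checking the state a bootloader must establish before kernel entry.
# The function name and the example values are hypothetical.

CPSR_MODE_MASK = 0x1F
CPSR_MODE_SVC = 0x13      # Supervisor mode
CPSR_F = 1 << 6           # FIQ disabled when set
CPSR_I = 1 << 7           # IRQ disabled when set
SCTLR_M = 1 << 0          # MMU enable
SCTLR_C = 1 << 2          # data cache enable


def kernel_entry_problems(r0, r1, r2, cpsr, sctlr, mach_type):
    """Return a list of violated entry conditions (empty if all are met)."""
    problems = []
    if r0 != 0:
        problems.append("R0 must be 0")
    if r1 != mach_type:
        problems.append("R1 must hold the MACH_TYPE value")
    if r2 % 4 != 0:
        problems.append("R2 must hold a word-aligned parameter list address")
    if cpsr & CPSR_MODE_MASK != CPSR_MODE_SVC:
        problems.append("core must be in Supervisor mode")
    if not cpsr & CPSR_I:
        problems.append("IRQ must be disabled (CPSR I bit set)")
    if not cpsr & CPSR_F:
        problems.append("FIQ must be disabled (CPSR F bit set)")
    if sctlr & SCTLR_M:
        problems.append("MMU must be disabled")
    if sctlr & SCTLR_C:
        problems.append("data cache must be disabled")
    return problems


good = kernel_entry_problems(r0=0, r1=387, r2=0x80000100,
                             cpsr=0xD3, sctlr=0x0, mach_type=387)
print("ready to enter kernel" if not good else good)
```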

The first step in booting the kernel is to decompress it. This is mostly architecture-independent. The parameters passed by the bootloader are saved, and the caches and MMU are enabled. Before decompress_kernel() in arch/arm/boot/compressed/misc.c is called, a check is made to see whether the decompressed image would overwrite the compressed image. After decompression, the caches are cleaned and invalidated and then disabled again. Execution then jumps to the kernel entry point in arch/arm/kernel/head.S.
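The overwrite check amounts to asking whether two address ranges intersect. A minimal sketch of that test is shown below. It is not the kernel's own code (which is assembly and relocates the decompressor when needed); the function name and the example addresses and sizes are hypothetical.

```python
# Sketch: would the decompressed kernel overwrite the compressed image?
# ranges_overlap and the example addresses are hypothetical.


def ranges_overlap(a_start, a_size, b_start, b_size):
    """True if [a_start, a_start+a_size) intersects [b_start, b_start+b_size)."""
    return a_start < b_start + b_size and b_start < a_start + a_size


compressed = (0x80800000, 0x00400000)      # where the zImage was loaded
decompressed = (0x80008000, 0x00A00000)    # where the kernel will be unpacked

if ranges_overlap(*decompressed, *compressed):
    print("overlap: the compressed image must be moved before decompressing")
else:
    print("no overlap: decompress in place")
```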

13.3.8 Platform-specific actions

A number of architecture-specific tasks are now performed. First, the __lookup_processor_type() function checks the core type and returns a code specifying which core is running. Next, the __lookup_machine_type() function, as its name suggests, looks up the machine type. A basic set of translation tables is then defined to map the kernel code. The caches and MMU are initialized, and other control registers are set. The data segment is copied to RAM, and finally start_kernel() is called.
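The processor lookup works by comparing the core's Main ID Register (MIDR) against a table of value and mask pairs. The sketch below models that matching rule in Python. The table contents and the function name are illustrative assumptions, not the kernel's actual data; the mask shown ignores the variant and revision fields so that any revision of a given part matches.

```python
# Sketch: matching a MIDR value against (value, mask) pairs.
# The table and the function name are illustrative, not kernel data.

PROC_TABLE = [
    # (cpu_val, cpu_mask, name)
    (0x410FC090, 0xFF0FFFF0, "Cortex-A9"),
    (0x410FC080, 0xFF0FFFF0, "Cortex-A8"),
]


def lookup_processor_type(midr):
    """Return the name of the first matching entry, or None if unknown."""
    for cpu_val, cpu_mask, name in PROC_TABLE:
        if (midr & cpu_mask) == cpu_val:
            return name
    return None


for midr in (0x413FC090, 0x412FC085, 0x12345678):
    print(f"MIDR 0x{midr:08X} -> {lookup_processor_type(midr)}")
```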

13.3.9 Kernel start-up code

In principle, the remainder of the boot sequence is the same across any architecture, but in practice, certain functions still depend on the hardware.

IRQ interrupts are disabled by local_irq_disable(), while lock_kernel() is used to prevent FIQ interrupts from interrupting the kernel. The kernel then initializes the tick control, the memory system, and architecture-specific subsystems, and processes the command-line options passed by the bootloader.
Set up the stack and initialize the Linux scheduler.
Set up the various memory regions and allocate pages.
Set up the interrupt and exception tables and handlers, and configure the GIC.
Set up the system timer; IRQs are enabled at this point. Additional memory system initialization follows, and a value known as BogoMIPS is used to calibrate the core clock speed.
Set up the internal components of the kernel, including the filesystem and the init process, followed by the thread daemon, which creates kernel threads.
Unlock the kernel (enabling FIQ) and start the scheduler.
Call the do_basic_setup() function to initialize drivers, sysctl, work queues, and network sockets. At this point, the switch to user mode is made.

The memory map used by Linux is shown in the figure above. ZI refers to zero-initialized data. There is a wide separation between kernel memory and user memory: kernel memory is above address 0xBF000000, while user memory is below that address. Kernel memory uses global mappings, while user memory uses non-global mappings, although code and data can be shared between processes. As mentioned before, application code starts at 0x1000, leaving the first 4KB page unused so that NULL pointer references can be trapped.
