Understanding Arm NEON: CPU Optimization Techniques and Instruction Introduction


This article is selected from the Extreme Technology column "Embedded AI" and is reprinted with authorization from "Mobile Algorithm Optimization" by the Zhihu author High-Performance Computing Institute. Previously, we covered how to quickly get started with NEON programming, Arm NEON optimization techniques, and NEON assembly and intrinsics programming. This article examines Arm NEON in detail from the perspective of CPU optimization technology.

1. SIMD

Arm NEON is a SIMD (Single Instruction Multiple Data) extension architecture suitable for Arm Cortex-A and Cortex-R series processors.

SIMD uses one controller to direct multiple processing units, performing the same operation on each element of a data set (also known as a "data vector"), thus achieving data-level parallelism.

SIMD is particularly suitable for common tasks such as audio and image processing. Most modern CPU designs include SIMD instructions to improve multimedia performance.

[Figure: Illustration of SIMD operations]

As shown in the figure, scalar operations can only perform multiplication on one pair of data at a time, while using SIMD multiplication instructions allows multiplication on four pairs of data simultaneously.

A. Instruction Stream and Data Stream

Flynn's taxonomy classifies computers by how instruction streams and data streams are handled, dividing them into four types (SIMD, described above, plus the three below):

[Figure: Illustration of the Flynn classification]

1. SISD (Single Instruction Single Data)

The hardware of the machine does not support any form of parallel computation, and all instructions are executed serially. A single core executes a single instruction stream, operating on data stored in a single memory, one operation at a time. Early computers were SISD machines, such as the von Neumann architecture, IBM PC, etc.

2. MISD (Multiple Instruction Single Data)

This uses multiple instruction streams to process a single data stream. In practice, using multiple instruction streams to process multiple data streams is a more effective method, so MISD only appears as a theoretical model and has not been applied in practice.

3. MIMD (Multiple Instruction Multiple Data)

The computer has multiple asynchronous and independently working processors. In any clock cycle, different processors can execute different instructions on different data segments, that is, multiple instruction streams are executed simultaneously, each operating on different data streams. MIMD architecture can be used in multiple application areas such as computer-aided design, computer-aided manufacturing, simulation, and communication switches.

In addition to the above models, NVIDIA has introduced the SIMT architecture:

4. SIMT (Single Instruction Multiple Threads)

Similar to multithreading on a CPU, all cores execute the same instructions on different data. In SIMT, each thread has its own processing unit, unlike SIMD, where the lanes of a vector share a single ALU.

[Figure: Illustration of SIMT]

B. SIMD Features and Development Trends

1. Advantages and Disadvantages of SIMD

[Figure: Advantages and disadvantages of SIMD]

2. Development Trends of SIMD

Taking the next generation SIMD instruction set SVE (Scalable Vector Extension) under the Arm architecture as an example, it is a completely new vector instruction set developed for high-performance computing (HPC) and machine learning.

The SVE instruction set shares many concepts with the NEON instruction set, such as vectors, lanes, and data elements.

The SVE instruction set also introduces a completely new concept: the Vector Length Agnostic (VLA) programming model, in which code does not assume a fixed vector length.

[Figure: SVE scalable model]

Traditional SIMD instruction sets use fixed-size vector registers, for example, the NEON instruction set uses fixed 64/128-bit length vector registers.

In contrast, the SVE instruction set supports the VLA programming model with variable-length vector registers, allowing chip designers to choose an appropriate vector length based on workload and cost.

The length of the vector registers in the SVE instruction set supports a minimum of 128 bits and a maximum of 2048 bits, in increments of 128 bits. The SVE design ensures that the same application can run on SVE instruction machines supporting different vector lengths without needing to recompile the code.

Arm announced SVE2 in 2019; as part of Armv9, it expands the supported operation types to fully replace NEON and adds support for matrix-related operations.

2. Arm’s SIMD Instruction Set

1. SIMD Support in Arm Processors – NEON

The Arm NEON unit is included by default in Cortex-A7 and Cortex-A15 processors but is optional in other Armv7 Cortex-A series processors; some Cortex-A implementations built on the Armv7-A or Armv7-R architecture therefore ship without a NEON unit.

The possible combinations of Armv7 compliant cores are as follows:


Therefore, it is essential to first confirm whether the processor supports NEON and VFP. This can be checked at compile and run time.

[Figure: History of NEON development]

2. Checking SIMD Support in ARM Processors

2.1 Compile Time Check

The simplest way to detect the presence of the NEON unit: in GCC/Clang or the Arm compiler toolchain (armcc version 4.0 and above), check whether the predefined macro __ARM_NEON (or the older spelling __ARM_NEON__) is defined.

The equivalent predefined macro for armasm is TARGET_FEATURE_NEON.

2.2 Runtime Check

Runtime detection of the NEON unit requires help from the operating system, because the Arm architecture intentionally does not expose processor feature information to user-mode applications. On Linux, /proc/cpuinfo reports this information in readable form, for example:

  • Tegra (dual-core Cortex-A9 processor with FPU but no NEON unit):

    $ cat /proc/cpuinfo
    Features : swp half thumb fastmult vfp edsp thumbee vfpv3 vfpv3d16

  • Arm Cortex-A9 processor with NEON unit:

    $ cat /proc/cpuinfo
    Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3

Since the /proc/cpuinfo output is human-readable text, programs usually prefer the auxiliary vector /proc/self/auxv, which contains the kernel hwcap in binary form: search the file for the AT_HWCAP record and test the HWCAP_NEON bit (value 4096).

Some Linux distributions have modified the ld.so linker script to read hwcap via glibc and add extra search paths for shared libraries that enable NEON.

3. Instruction Set Relationships

In Armv7, NEON relates to the VFP instruction set as follows:

  • Processors with the NEON unit but without the VFP unit cannot perform floating-point operations in hardware.

  • Since NEON SIMD operations perform vector computations more efficiently, the vector mode operations in the VFP unit have been deprecated since the introduction of ARMv7. Therefore, the VFP unit is sometimes referred to as the floating-point unit (FPU).

  • The VFP can provide fully IEEE-754 compatible floating-point operations, while single-precision operations in the Armv7 NEON unit do not fully conform to IEEE-754.

  • NEON cannot replace VFP. The VFP provides some dedicated instructions that do not have equivalent implementations in the NEON instruction set.

  • Half-precision instructions are only available on systems with NEON and VFP extensions.

  • In Armv8, VFP functionality has been subsumed into the NEON/floating-point unit, and the Armv7 issues (NEON not fully conforming to IEEE 754, and some instructions being supported by VFP but not by NEON) have been resolved.

3. NEON

NEON is a 128-bit SIMD extension architecture for Arm Cortex-A series processors. Each processor core has its own NEON unit, so NEON data-level acceleration can be combined with multi-threaded parallelism across cores.

1. Basic Principles of NEON

1.1 NEON Instruction Execution Flow

[Figure: NEON instruction execution flow]

The above figure shows the flowchart of the NEON unit completing accelerated calculations. Each element in the vector register performs calculations synchronously to accelerate the computation process.

1.2 NEON Computational Resources

  • Relationship between NEON and Arm Processor Resources

– The NEON unit is an extension of the Arm instruction set, using 64-bit or 128-bit registers, independent of the original Arm core registers, for SIMD processing; it operates on its own register file of 64-bit registers.

– The NEON and VFP units are fully integrated into the processor and share processor resources for integer operations, loop control, and caching. Compared to a hardware accelerator, this significantly reduces area and power costs. It also gives a simpler programming model, because the NEON unit uses the same address space as the application.

  • Relationship between NEON and VFP Resources

The NEON registers overlap with the VFP registers; Armv7 has 32 NEON D registers, as shown in the figure below.

[Figure: NEON registers]

2. NEON Instructions

2.1 Automatic Vectorization

A vectorizing compiler can take C or C++ source code and vectorize it in a way that effectively utilizes the NEON hardware. This means portable C code can be written while still achieving the performance levels of hand-written NEON instructions.

To assist vectorization, set the loop iteration count to be a multiple of the vector length. Both GCC and ARM compiler toolchains have options to enable automatic vectorization for NEON technology.

2.2 NEON Assembly

For programs with particularly high performance requirements, manually writing assembly code is a more suitable approach.

The GNU assembler (gas) and Arm Compiler toolchain assembler (armasm) both support the assembly of NEON instructions.

When writing assembly functions, it is essential to understand the Arm EABI, which defines how registers are used. The Arm Embedded Application Binary Interface (EABI) specifies which registers are used for passing parameters and returning results, and which must be preserved across calls; it covers the 32 D registers in addition to the Arm core registers. The figure below summarizes the register roles.

[Figure: Register functions]

2.3 NEON Intrinsics

NEON intrinsic functions provide a way to write NEON code that is easier to maintain than assembly code while still controlling the generated NEON instructions.

Intrinsic functions use new data types corresponding to the D and Q NEON registers. These data types support creating C variables that directly map to NEON registers.

Writing code with NEON intrinsics is similar to calling ordinary functions, with these variables as parameters or return values. The compiler handles much of the heavy lifting usually associated with writing assembly code, such as:

  • register allocation

  • code scheduling and instruction reordering

  • Disadvantages of Intrinsics

The compiler cannot be forced to emit exactly the desired instruction sequence, so there is still potential for improvement when moving to hand-written NEON assembly code.

  • NEON Instruction Simplification Types

NEON data processing instructions can be divided into normal, long, wide, narrow, and saturating instructions. Taking a long instruction's intrinsic as an example:

    int16x8_t vaddl_s8(int8x8_t __a, int8x8_t __b);

This function adds two 64-bit D-register vectors (each holding eight 8-bit values) and produces a vector of eight 16-bit results (stored in a 128-bit Q register), thus avoiding overflow of the addition result.

4. Other SIMD Technologies

1. Other Platforms’ SIMD Technologies

SIMD processing is not unique to Arm; the figure below compares NEON with x86 SSE and PowerPC AltiVec.

[Figure: SIMD comparison across architectures]

2. Comparison with Dedicated DSP

Many Arm-based SoCs also include DSP and other co-processing hardware, so they can simultaneously contain NEON units and DSPs. Compared to DSP, NEON has the following characteristics:

[Figure: NEON vs. dedicated DSP comparison]

5. Conclusion

This article introduced SIMD and the other instruction-stream/data-stream processing models, the basic principles of NEON and its instructions, and comparisons with other platforms and hardware.

Hope everyone gains something.

Series Reading

  • Arm NEON Learning (I) Quick Start Guide

  • Arm NEON Learning (II) Optimization Techniques

  • Arm NEON Learning (III) NEON Assembly and Intrinsics Programming


