1. SIMD
Arm NEON is a SIMD (Single Instruction Multiple Data) extension architecture suitable for Arm Cortex-A and Cortex-R series processors.
SIMD uses one controller to drive multiple processing units, performing the same operation on every element of a data set (also known as a “data vector”) simultaneously, thereby achieving data-level parallelism.
SIMD is particularly suitable for common tasks such as audio and image processing. Most modern CPU designs include SIMD instructions to improve multimedia performance.
SIMD operation diagram
As shown in the figure, when performing scalar operations, only one pair of data can be multiplied at a time, whereas with SIMD multiplication instructions, four pairs of data can be multiplied simultaneously.
A. Instruction Stream and Data Stream
Flynn's taxonomy classifies computers by how they handle instruction streams and data streams, dividing them into four types (SISD, SIMD, MISD, MIMD). Apart from SIMD, introduced above, the other three are:
Flynn classification diagram
1. SISD (Single Instruction Single Data)
The hardware of the machine does not support any form of parallel computing, and all instructions are executed serially. A single core executes a single instruction stream, operating on data stored in a single memory, one operation at a time. Early computers were SISD machines, such as the von Neumann architecture, IBM PC, etc.
2. MISD (Multiple Instruction Single Data)
It uses multiple instruction streams to process a single data stream. In practice, using multiple instruction streams to process multiple data streams is a more effective method, so MISD only appears as a theoretical model and has not been applied in practice.
3. MIMD (Multiple Instruction Multiple Data)
The computer has multiple asynchronous and independently working processors. In any clock cycle, different processors can execute different instructions on different data fragments, that is, multiple instruction streams are executed simultaneously, and these instruction streams operate on different data streams. MIMD architecture can be used in various application fields such as computer-aided design, computer-aided manufacturing, simulation, and communication switches.
In addition to the above models, NVIDIA has introduced the SIMT architecture:
4. SIMT (Single Instruction Multiple Threads)
Similar to multithreading on a CPU: all cores execute the same instruction, each on its own data. Every thread has its own processing unit, which differs from SIMD, where the lanes share a single ALU.
SIMT diagram
B. Features and Trends of SIMD
1. Advantages and Disadvantages of SIMD
2. Trends in SIMD Development
Taking the next-generation SIMD instruction set SVE (Scalable Vector Extension) under the Arm architecture as an example, it is a completely new vector instruction set developed for high-performance computing (HPC) and machine learning.
The SVE instruction set shares many concepts with the NEON instruction set, such as vectors, lanes, and data elements.
The SVE instruction set also introduces a new concept: the vector-length-agnostic (VLA) programming model.
SVE scalable model
Traditional SIMD instruction sets use fixed-size vector registers, for example, the NEON instruction set uses fixed 64/128-bit length vector registers.
However, the SVE instruction set, which supports the VLA programming model, allows for variable-length vector registers. This allows chip designers to choose an appropriate vector length based on load and cost.
The length of the vector register in the SVE instruction set supports a minimum of 128 bits and a maximum of 2048 bits, in increments of 128 bits. The SVE design ensures that the same application can run on SVE instruction machines that support different vector lengths without needing to recompile the code.
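The shape of a vector-length-agnostic loop can be sketched in portable C. Here `vl` stands in for the hardware vector length that an SVE machine reports at run time (real SVE code would query it, e.g. via the ACLE counting intrinsics); the point is that the loop structure does not bake in any particular width:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a vector-length-agnostic (VLA) loop in portable C.
 * `vl` models the run-time hardware vector length (in elements);
 * the same code works unchanged for vl = 4, 8, 16, ... */
static void add_vla(const int32_t *a, const int32_t *b, int32_t *out,
                    size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        /* Handle the tail with a shorter chunk, the way SVE uses
         * predication instead of a separate scalar loop. */
        size_t chunk = (n - i < vl) ? n - i : vl;
        for (size_t j = 0; j < chunk; ++j)   /* one "vector" operation */
            out[i + j] = a[i + j] + b[i + j];
    }
}
```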
Arm announced SVE2 in 2019. Built on Armv9, SVE2 expands the range of supported operation types so that it can fully replace NEON, and adds support for matrix-related operations.
2. Arm’s SIMD Instruction Set
The Arm NEON unit is included by default in Cortex-A7 and Cortex-A15 processors but is optional in other Armv7 Cortex-A series processors: an implementation of the Armv7-A or Armv7-R architecture profile may omit the NEON unit.
Possible combinations of Armv7 compliant cores are as follows:
Therefore, it is necessary to first confirm whether the processor supports NEON and VFP. This can be checked at compile and run time.
1. NEON Development History
2. Checking SIMD Support for ARM Processors
2.1 Compile-time Check
The simplest way to check for the presence of a NEON unit at compile time: with the Arm compiler toolchain (armcc v4.0 and later) or GCC, test whether the predefined macro __ARM_NEON (or the older __ARM_NEON__) is defined.
The equivalent predefined macro for armasm is TARGET_FEATURE_NEON.
2.2 Runtime Check
Detecting the NEON unit at runtime requires the help of the operating system. The ARM architecture intentionally does not expose processor features to user-mode applications. Under Linux, /proc/cpuinfo contains this information in a readable form, such as:
On Tegra (dual-core Cortex-A9 processor with FPU but no NEON unit):

$ cat /proc/cpuinfo
Features : swp half thumb fastmult vfp edsp thumbee vfpv3 vfpv3d16

On an ARM Cortex-A9 processor with a NEON unit:

$ cat /proc/cpuinfo
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3
Since the /proc/cpuinfo output is text-based, it is usually preferable to check the auxiliary vector /proc/self/auxv instead, which contains the kernel hwcap in binary form: search /proc/self/auxv for the AT_HWCAP record and test the HWCAP_NEON bit (4096).
Some Linux distributions have modified the ld.so linker scripts to read hwcap through glibc and add additional search paths for shared libraries that enable NEON.
3. Relationship of Instruction Sets
In Armv7, the NEON instruction set has the following relationship with the VFP instruction set:

- Processors with a NEON unit but no VFP unit cannot perform floating-point operations in hardware.
- Because NEON SIMD operations perform vector computations more efficiently, the VFP unit's vector mode has been deprecated since the introduction of Armv7; the VFP unit is therefore sometimes referred to simply as the Floating-Point Unit (FPU).
- VFP provides fully IEEE-754-compliant floating-point arithmetic, whereas single-precision operations in the Armv7 NEON unit are not fully IEEE-754 compliant.
- NEON cannot replace VFP: VFP provides dedicated instructions that have no equivalent in the NEON instruction set.
- Half-precision instructions are available only on NEON and VFP implementations that include the half-precision extension.

In Armv8, VFP has been superseded by NEON, and the issues above — NEON's incomplete IEEE 754 compliance, and instructions available in VFP but not in NEON — have been resolved.
3. NEON
NEON is a 128-bit SIMD extension architecture for Arm Cortex-A series processors. Each processor core has its own NEON unit, so multithreaded code can be accelerated in parallel across cores.
1. Basic Principles of NEON
1.1 NEON Instruction Execution Flow
The figure above shows the flow of an accelerated computation in the NEON unit: every element in a vector register is computed simultaneously, which is what accelerates the overall calculation.
1.2 NEON Computing Resources
- Relationship between NEON and Arm processor resources
  - The NEON unit, as an extension of the Arm instruction set, performs SIMD processing on its own 64-bit/128-bit register file, independent of the original Arm core registers.
  - The NEON and VFP units are fully integrated into the processor and share its resources for integer operations, loop control, and caching. Compared with a hardware accelerator, this significantly reduces area and power cost; it also gives a simpler programming model, since the NEON unit shares the application's address space.
- Relationship between NEON and VFP resources
  NEON registers overlap with the VFP registers; Armv7 provides 32 NEON D registers, as shown in the figure below.
NEON registers
2. NEON Instructions
2.1 Automatic Vectorization
A vectorizing compiler can take C or C++ source code and vectorize it in a way that makes effective use of the NEON hardware. This means portable C code can still achieve the performance levels offered by NEON instructions.
To assist vectorization, set the number of loop iterations to be a multiple of the vector length. Both GCC and ARM compiler toolchains have options to enable automatic vectorization for NEON technology.
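As an illustration, a loop written to be friendly to the auto-vectorizer might look like this (plain portable C; `restrict` rules out aliasing between the arrays, and keeping `n` a multiple of the vector length avoids a scalar tail loop — build with, e.g., `gcc -O3`, adding `-mfpu=neon` for Armv7 targets):

```c
#include <stddef.h>

/* A loop shaped for auto-vectorization: restrict-qualified pointers
 * tell the compiler the arrays do not overlap, the trip count is a
 * simple function of n, and the body has no branches or calls. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```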
2.2 NEON Assembly
For programs with particularly high performance requirements, manually writing assembly code is more suitable.
The GNU assembler (gas) and the Arm Compiler toolchain assembler (armasm) both support the assembly of NEON instructions.
When writing assembly functions, it is necessary to understand Arm EABI, which defines how to use registers. The ARM Embedded Application Binary Interface (EABI) specifies which registers are used to pass parameters, return results, or must be preserved, specifying the use of 32 D registers in addition to the Arm core registers. The following figure summarizes the functions of the registers.
Register functions
2.3 NEON Intrinsics
NEON intrinsic functions provide a way to write NEON code that is easier to maintain than assembly code while still controlling the generated NEON instructions.
Intrinsic functions use new data types that correspond to D and Q NEON registers. These data types support creating C variables that directly map to NEON registers.
Writing NEON intrinsics is similar to calling functions that take these variables as parameters or return values. The compiler does some of the heavy lifting typically associated with writing assembly language, such as register allocation, code scheduling, and instruction reordering.
- Disadvantages of intrinsics: you cannot guarantee that the compiler emits exactly the instructions you want, so hand-written NEON assembly may still offer room for improvement.
- NEON instruction classification: NEON data-processing instructions fall into normal, long, wide, narrow, and saturating instructions. Taking the intrinsic for a long instruction as an example:

int16x8_t vaddl_s8(int8x8_t __a, int8x8_t __b);

This function adds two 64-bit D-register vectors (each containing eight 8-bit numbers) and produces a vector of eight 16-bit numbers (stored in a 128-bit Q register), thereby avoiding overflow in the addition result.
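To make the widening semantics concrete, here is a portable C model of what vaddl_s8 computes per lane (the real intrinsic performs all eight lanes in a single NEON instruction; this scalar model is for illustration only):

```c
#include <stdint.h>

/* Scalar model of vaddl_s8: add two 8-lane int8 vectors, widening
 * each sum to int16 so that e.g. 127 + 127 = 254 cannot overflow. */
static void vaddl_s8_model(const int8_t a[8], const int8_t b[8],
                           int16_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = (int16_t)a[i] + (int16_t)b[i];
}
```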
4. Other SIMD Technologies
1. Other Platforms’ SIMD Technologies
SIMD processing is not unique to Arm; the figure below compares NEON with x86 SIMD and AltiVec.
SIMD Comparison
2. Comparison with Dedicated DSP
Many Arm-based SoCs also include DSP and other coprocessor hardware, so they can contain both NEON units and DSP. Compared to DSP, NEON’s characteristics are:
5. Conclusion
This article has introduced SIMD and the other instruction-stream/data-stream processing models, the basic principles of NEON and its instructions, and comparisons with SIMD on other platforms and with dedicated DSP hardware.
Hope everyone can gain something.
Series Reading
- Arm NEON Learning (I) Quick Start Guide
- Arm NEON Learning (II) Optimization Techniques
- Arm NEON Learning (III) NEON Assembly and Intrinsics Programming