Getting Started with Arm NEON Programming: A Quick Guide


This article is selected from the Jishu column “Infrastructure Open Source Software on Arm” and is the first article in the Arm NEON learning series. NEON is an Advanced SIMD (Single Instruction, Multiple Data) extension to the instruction set of Arm Cortex-A series processors. It can accelerate multimedia and signal processing algorithms such as video encoding/decoding, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and voice synthesis.

1. Introduction

This article introduces Arm NEON technology so that NEON beginners can quickly get started with NEON programming. It also points readers to documentation that contains more detailed information.

2. Overview of NEON

This section introduces NEON technology and some background knowledge.

2.1 What is NEON?

NEON is an Advanced SIMD (Single Instruction, Multiple Data) extension to the instruction set of Arm Cortex-A series processors. NEON technology can accelerate multimedia and signal processing algorithms such as video encoding/decoding, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and voice synthesis. NEON instructions process data in parallel:

Registers are viewed as vectors of elements of the same data type
Data types can be 8/16/32/64-bit integers, single precision floating point (Arm 32-bit platform), or single and double precision floating point (Arm 64-bit platform)
Instructions perform the same operation in all lanes
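The three points above can be illustrated with a minimal sketch. The function name add_16_bytes is hypothetical and not from this article; the scalar branch is only a fallback so that the same source also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: adds 16 unsigned bytes lane by lane (wrapping on overflow). */
void add_16_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
#if defined(__ARM_NEON)
    uint8x16_t va = vld1q_u8(a);      /* one 128-bit register viewed as 16 lanes of 8 bits */
    uint8x16_t vb = vld1q_u8(b);
    vst1q_u8(dst, vaddq_u8(va, vb));  /* the same add is applied to all 16 lanes */
#else
    /* Scalar fallback with the same result, one element at a time. */
    for (size_t i = 0; i < 16; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

Each lane is independent: an overflow in one byte wraps within that lane and does not carry into its neighbour.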

2.2 History of Arm Advanced SIMD Development

[Figure: history of Arm Advanced SIMD development (image not available)]

2.3 Why Use NEON

NEON provides:

Support for integer and floating-point operations, making it suitable for a wide range of applications, from codecs and high-performance computing to 3D graphics.
Tight integration with the Arm processor, providing a single view of the instruction stream and memory, which makes programming simpler than with an external hardware accelerator.

3. Introduction to Arm v8 Architecture

Arm v8-A is a significant architectural change that supports the 64-bit execution mode “AArch64” and introduces a new 64-bit instruction set “A64”. At the same time, to maintain compatibility with the Arm v7-A (32-bit architecture) instruction set, the concept of “AArch32” is also introduced. Most Arm v7-A code can run in Arm v8-A AArch32 execution mode.

This section will introduce some features related to NEON in the Arm v8-A architecture. In addition, this section will briefly introduce the general-purpose CPU registers and CPU instructions commonly used in NEON programming, but the focus remains on NEON technology.

3.1 Registers

Arm v8-A AArch64 has 31 64-bit general-purpose registers (X0-X30), each of which can also be accessed as a 32-bit register (W0-W30). The register view is as follows:

[Figure: AArch64 general-purpose register view (image not available)]

Arm v8-A AArch64 has 32 128-bit NEON registers (V0-V31), which can also be accessed as 32-bit Sn registers or 64-bit Dn registers. The register view is as follows:

[Figure: AArch64 NEON register view (image not available)]
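From C, the 64-bit and 128-bit register views show up as distinct intrinsic vector types. The sketch below is illustrative only: add_halves is a hypothetical name, and which physical register the compiler picks for each value is its own decision.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[0] = in[0] + in[2], out[1] = in[1] + in[3]. */
void add_halves(float out[2], const float in[4])
{
#if defined(__ARM_NEON)
    float32x4_t q  = vld1q_f32(in);     /* 128-bit value: 4 lanes of 32 bits */
    float32x2_t lo = vget_low_f32(q);   /* 64-bit value: lanes 0 and 1 */
    float32x2_t hi = vget_high_f32(q);  /* 64-bit value: lanes 2 and 3 */
    vst1_f32(out, vadd_f32(lo, hi));    /* 64-bit add on 2 lanes */
#else
    /* Scalar fallback for targets without NEON. */
    for (size_t i = 0; i < 2; i++)
        out[i] = in[i] + in[i + 2];
#endif
}
```

Types ending in x2_t (for 32-bit elements) are 64 bits wide, and types ending in x4_t are 128 bits wide.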

3.2 Instruction Set

The Arm v8-A AArch32 instruction set consists of A32 (the Arm instruction set, with fixed-length 32-bit instructions) and T32 (the Thumb instruction set with 16-bit instructions, and the Thumb-2 instruction set with mixed 16/32-bit instructions). It is a superset of the Arm v7-A instruction set, so Arm v8-A AArch32 remains backward compatible and can run earlier software. In addition, to stay consistent with the A64 instruction set, NEON division and cryptographic extension instructions have been added to the AArch32 instruction set.

Compared to the AArch32 instruction set, the AArch64 instruction set A64 (32-bit fixed length) has undergone significant changes, such as completely different instruction formats. However, functionally, the AArch64 instruction set basically implements all functions of the AArch32 instruction set, with additional support for NEON double precision floating point.

3.3 NEON Instruction Format

Most platforms are now based on Arm v8, so this section only introduces the AArch64 NEON instruction format. The general form is:

{<prefix>}<op>{<suffix>} Vd.<T>, Vn.<T>, Vm.<T>

Here:

<prefix>: Prefix, such as S/U/F/P for signed integer/unsigned integer/floating point/polynomial data types

<op>: Operation, for example ADD or AND

<suffix>: Suffix, usually one of the following:

P: Pairwise operation on adjacent elements, such as ADDP
V: Operates across all lanes, such as FMAXV
2: Operates on the high half of the data in widening/narrowing instructions, for example ADDHN2 and SADDL2

ADDHN2: Adds two 128-bit vectors, narrows the result to 64 bits, and stores it in the high 64 bits of the NEON register. SADDL2: Adds the high 64-bit halves of two NEON registers to produce a 128-bit result.

<T>: Data type, usually 8B/16B/4H/8H/2S/4S/2D. B means 8-bit data; H means 16-bit data; S means 32-bit data, either a 32-bit integer or single precision floating point; D means 64-bit data, either a 64-bit integer or double precision floating point.

The following are specific NEON instruction examples:

UADDLP V0.8H, V0.16B
FADD V0.4S, V0.4S, V0.4S

For more details, refer to Armasm_user_guide.pdf. Chapters 13-15 cover A32 and T32 instructions, and Chapters 16-20 cover A64 instructions, with Chapter 20 devoted to NEON instructions.
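To connect the format above to C code, here is a small sketch using intrinsics that compilers typically translate to the instructions named in this section: vpaddlq_u8 for UADDLP and, on AArch64 only, vmaxvq_f32 for FMAXV. The function names are hypothetical, the compiler remains free to pick other instructions, and the scalar branches are fallbacks for targets without these intrinsics.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* UADDLP-style operation: add adjacent byte pairs into 16-bit results (16B -> 8H). */
void pairwise_widen_add(uint16_t dst[8], const uint8_t src[16])
{
#if defined(__ARM_NEON)
    vst1q_u16(dst, vpaddlq_u8(vld1q_u8(src)));
#else
    for (int i = 0; i < 8; i++)
        dst[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
#endif
}

/* FMAXV-style operation: maximum across all four 32-bit lanes (4S). */
float max_across(const float v[4])
{
#if defined(__ARM_NEON) && defined(__aarch64__)
    return vmaxvq_f32(vld1q_f32(v));   /* across-lanes intrinsic, AArch64 only */
#else
    float m = v[0];
    for (int i = 1; i < 4; i++)
        if (v[i] > m)
            m = v[i];
    return m;
#endif
}
```

The widening matters in the first function: two bytes of 255 add up to 510, which only fits because the destination lanes are 16 bits wide.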

4. Basics of NEON Programming

The previous sections introduced the concepts of NEON, its hardware resources, and its instruction sets. Now we can start using NEON to accelerate our applications. There are typically four ways to use NEON:

Call NEON-optimized library functions
Use compiler auto-vectorization options
Use NEON intrinsics
Write NEON assembly

4.1 Calling Library Functions

The user only needs to call NEON-optimized library functions directly in the program, which is simple and easy to use. Currently, the following libraries are available:

Arm Compute Library

A collection of low-level functions optimized for Arm CPUs and GPUs, targeting image processing, machine learning, and computer vision. More information: https://developer.Arm.com/technologies/compute-library

Ne10 Open Source Library

Developed under Arm's leadership, it currently provides a fairly general set of math functions, some image processing functions, and FFT functions. Link: http://projectne10.github.io/Ne10/

4.2 Automatic Vectorization

GCC provides an auto-vectorization option that can compile existing code into NEON code. GNU GCC offers a range of options, some of which improve performance while others reduce the size of the generated executable. For each line of source code there are many candidate assembly instructions, and the compiler must trade off register usage, stack space, code size, compilation time, ease of debugging, and instruction execution time to generate the best image file.
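As a rough sketch of what auto-vectorization needs, the loop below is a simple candidate for GCC's vectorizer. The function name, file name, and cross-compiler names in the comment are assumptions for illustration; whether the loop is actually vectorized depends on the GCC version and target.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Example build commands (toolchain and file names are assumptions):
 *   AArch64:  aarch64-linux-gnu-gcc -O3 -fopt-info-vec -c add_i32.c
 *   Armv7-A:  arm-linux-gnueabihf-gcc -O3 -mfpu=neon -fopt-info-vec -c add_i32.c
 * -O3 enables the loop vectorizer (-ftree-vectorize), and -fopt-info-vec
 * asks GCC to report the loops it vectorized.
 */
void add_i32(int32_t *restrict dst, const int32_t *restrict a,
             const int32_t *restrict b, size_t n)
{
    /* restrict promises that the arrays do not overlap, which removes
       one common obstacle to vectorization. */
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

An integer loop is used here because, according to GCC's documentation for -mfpu, floating-point loops are not auto-vectorized with NEON on Armv7-A unless -funsafe-math-optimizations is also given.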

4.3 NEON Intrinsics

NEON intrinsics can be viewed as a C-level interface that wraps NEON instructions. When a C program calls a NEON intrinsic, the compiler generates the corresponding NEON instructions. NEON intrinsics work across Arm v7-A and Arm v8-A: the code is written once, and the compiler generates the appropriate NEON code for each target. If the code uses intrinsics that exist only on Arm v8-A AArch64, guard that part with the __aarch64__ macro, as shown in the example.

Here is an example of NEON intrinsics.

// Adds two float arrays element by element; count is assumed to be a multiple of 4

#include <arm_neon.h>

// Plain C reference version
void add_float_c(float* dst, float* src1, float* src2, int count)
{
    int i;
    for (i = 0; i < count; i++)
        dst[i] = src1[i] + src2[i];
}

// NEON intrinsics version: processes four floats per iteration
void add_float_neon1(float* dst, float* src1, float* src2, int count)
{
    int i;
    for (i = 0; i < count; i += 4)
    {
        float32x4_t in1, in2, out;
        in1 = vld1q_f32(src1);      // load four floats from src1
        src1 += 4;
        in2 = vld1q_f32(src2);      // load four floats from src2
        src2 += 4;
        out = vaddq_f32(in1, in2);  // add all four lanes at once
        vst1q_f32(dst, out);        // store four results
        dst += 4;
// The following code only shows how to guard AArch64-specific intrinsics and has no practical meaning.
#if defined (__aarch64__)
        float32_t tmp = vaddvq_f32(in1);
        (void)tmp;                  // avoid an unused-variable warning
#endif
    }
}

In the disassembly, you can see the vld1/vadd/vst1 NEON instructions on Arm v7-A and the ldr/fadd/str instructions on Arm v8-A.
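The example above assumes that count is a multiple of 4. One common way to lift that restriction, sketched here under the hypothetical name add_float_any, is to let the vector loop handle full groups of four and finish the remainder with a scalar tail loop. The NEON part is guarded so that the same source still builds on targets without NEON.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical variant of the float addition that accepts any count >= 0. */
void add_float_any(float *dst, const float *src1, const float *src2, int count)
{
    int i = 0;
#if defined(__ARM_NEON)
    /* Vector loop: runs only while a full group of four elements remains. */
    for (; i + 4 <= count; i += 4) {
        float32x4_t in1 = vld1q_f32(src1 + i);
        float32x4_t in2 = vld1q_f32(src2 + i);
        vst1q_f32(dst + i, vaddq_f32(in1, in2));
    }
#endif
    /* Scalar tail: the last 0 to 3 elements (or everything without NEON). */
    for (; i < count; i++)
        dst[i] = src1[i] + src2[i];
}
```

With count = 7, for example, the vector loop processes elements 0 to 3 and the tail loop processes elements 4 to 6.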

4.4 NEON Assembly

There are two main ways to write NEON assembly:

Independent assembly files
Inline assembly

4.4.1 Independent Assembly Files

Independent assembly files can use either “.S” or “.s” as the file extension. The difference is that .S files are run through the C/C++ preprocessor, so they can use C features such as macro definitions.

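When writing NEON assembly files, we need to pay attention to register preservation. Under the Arm procedure call standards, the callee-saved registers include r4-r11 and d8-d15 (q4-q7) on Arm v7-A/AArch32, and x19-x28 plus the low 64 bits of v8-v15 on AArch64. A function that modifies any of these must save and restore them.

[Table: registers to be saved on Arm v7-A and Arm v8-A (image not available)]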

Here is an example of Arm v7-A/v8-A NEON assembly.

// Define in the header file
void add_float_neon2(float* dst, float* src1, float* src2, int count);

Below is the handwritten assembly code, saved in a .S file

// Arm v7-A/Arm v8-A AArch32 version
    .text
    .syntax unified

    .align 4
    .global add_float_neon2
    .type add_float_neon2, %function
    .thumb
    .thumb_func

add_float_neon2:
.L_loop:
    vld1.32  {q0}, [r1]!
    vld1.32  {q1}, [r2]!
    vadd.f32 q0, q0, q1
    subs r3, r3, #4
    vst1.32  {q0}, [r0]!
    bgt .L_loop

    bx lr

// Arm v8-A AArch64 version
    .text

    .align 4
    .global add_float_neon2
    .type add_float_neon2, %function

add_float_neon2:

.L_loop:
    ld1     {v0.4s}, [x1], #16
    ld1     {v1.4s}, [x2], #16
    fadd    v0.4s, v0.4s, v1.4s
    subs    w3, w3, #4          // count is a 32-bit int, so use w3; the upper half of x3 is unspecified
    st1  {v0.4s}, [x0], #16
    bgt .L_loop

    ret

For more code, please refer to: https://github.com/projectNe10/Ne10/tree/master/modules/dsp

4.4.2 Inline Assembly

As the name suggests, inline assembly is tightly integrated with C code. Assembly can be embedded directly in C/C++ code, so NEON can be added exactly where it is needed.

Advantages:

Simple procedure call rules, no need to manually save registers.
Can use C/C++ variables and functions, making it very easy to integrate into C/C++ code

Disadvantages:

Inline assembly has complex syntax rules
NEON code embedded in C/C++ code is not easy to port to other platforms

Example:

// Arm v7-A/Arm v8-A AArch32
void add_float_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
               "1:                                   \n"
               "vld1.32  {q0}, [%[src1]]!            \n"
               "vld1.32  {q1}, [%[src2]]!            \n"
               "vadd.f32 q0, q0, q1                  \n"
               "subs     %[count],  %[count], #4     \n"
               "vst1.32  {q0}, [%[dst]]!             \n"
               "bgt      1b                          \n"
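               /* The asm modifies all four operands, so they are read-write outputs. */
               : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
               :
               /* subs changes the condition flags, hence "cc". */
               : "memory", "cc", "q0", "q1"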
          );
}

// Arm v8-A AArch64
void add_float_neon3(float* dst, float* src1, float* src2, int count)
{
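    asm volatile (
               "1:                                    \n"
               "ld1    {v0.4s}, [%[src1]], #16        \n"
               "ld1    {v1.4s}, [%[src2]], #16        \n"
               "fadd   v0.4s, v0.4s, v1.4s            \n"
               /* count is a 32-bit int, so use the W view of its register. */
               "subs   %w[count], %w[count], #4       \n"
               "st1    {v0.4s}, [%[dst]], #16         \n"
               "bgt    1b                             \n"
               /* The asm modifies all four operands, so they are read-write outputs. */
               : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
               :
               /* subs changes the condition flags, hence "cc". */
               : "memory", "cc", "v0", "v1"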
          );
}

4.5 NEON Intrinsics vs NEON Assembly

NEON intrinsics and handwritten NEON assembly are the most commonly used NEON optimization methods. Below is a simple comparison of their advantages and disadvantages.

Aspect	NEON Assembly	NEON Intrinsics
Performance	For a specific platform, assembly generally gives the best performance.	Current compilers can achieve performance comparable to handwritten assembly.
Portability	Arm v7-A and Arm v8-A have different assembly formats. Even on Arm v8-A, assembly may need to be tuned for different microarchitectures, such as Cortex-A53/A57, to achieve the best performance.	With the right compiler options, code written once can be built for different platforms and tuned for a specific microarchitecture, such as Arm v7-A Cortex-A9/A7/A15 and Arm v8-A Cortex-A53/A57.
Maintainability	Harder to write and less readable than C.	Similar to C, relatively easy to write and maintain.

This is only a simple comparison. Real NEON applications involve more complex scenarios with more special cases. The next article, “Arm NEON Optimization”, analyzes this issue further.
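The portability point in the comparison can be sketched with a reduction that uses one source for both architectures. The function name sum_float is hypothetical. The AArch64 branch uses the across-lanes intrinsic vaddvq_f32, the Arm v7-A branch builds the same result from pairwise adds, and a scalar fallback keeps the code building where NEON is unavailable.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the sum of count floats. */
float sum_float(const float *src, int count)
{
    float total = 0.0f;
    int i = 0;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);          /* four partial sums */
    for (; i + 4 <= count; i += 4)
        acc = vaddq_f32(acc, vld1q_f32(src + i));
#if defined(__aarch64__)
    total = vaddvq_f32(acc);                      /* AArch64-only across-lanes add */
#else
    /* Arm v7-A: fold 4 lanes to 2, then 2 lanes to 1. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    total = vget_lane_f32(vpadd_f32(half, half), 0);
#endif
#endif
    /* Scalar tail (or the whole array when NEON is not available). */
    for (; i < count; i++)
        total += src[i];
    return total;
}
```

Because the vector version adds elements in a different order than a plain loop, results for general floating-point data can differ slightly in the last bits.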

With this foundation in place, choose a NEON implementation method and start your NEON programming journey.

This article has an English version, which contains more content comparing Arm v7/v8. English version link: https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-programming-quick-reference
