Accelerators in Embedded Systems: Hardware Acceleration Options

1. Hardware Acceleration Options

There are many hardware options available for accelerating neural network computations in embedded systems. As mentioned in Section 2, the choice of processing solution varies with the target application and the requirements of the embedded system, such as performance, energy, power, heat dissipation, reliability, cost, and size. Hardware options include traditional processing units (such as CPUs and GPUs), Field Programmable Gate Arrays (FPGAs), and dedicated accelerators implemented as Application Specific Integrated Circuits (ASICs). Table 1 summarizes embedded processing options by hardware type. The numbers and features in the table represent general cases and do not reflect specific products from particular vendors.

CPUs used for high-performance computing can achieve fast inference of deep neural networks with small batch inputs [31]. HPC CPUs can include dozens of cores in a single chip, with each core typically equipped with simultaneous multithreading (SMT) and single instruction multiple data (SIMD) units for vector operations. For example, AMD EPYC 7002 series processors [32] contain 64 cores in one chip, supporting up to 128 hardware threads. As general-purpose processors, CPUs ease the programming and development of neural network applications, while the SMT and SIMD features of multi-core processors improve energy efficiency, providing higher throughput per unit of power [26]. However, CPUs are well known to be less energy efficient than other hardware options, so their use for neural network acceleration in embedded systems is usually limited.
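
For perspective, a back-of-the-envelope estimate of peak single-precision throughput, assuming two 256-bit FMA units per core (as in the Zen 2 cores used in the EPYC 7002 series) and the 2.25 GHz base clock of the EPYC 7742, is 64 cores × 32 FP32 FLOPs per cycle × 2.25 GHz ≈ 4.6 TFLOPS, roughly an order of magnitude below the peak throughput of the HPC GPUs discussed below.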

CPUs for mobile or edge computing typically integrate asymmetric cores, such as the ARM big.LITTLE architecture [33], into a single chip: large, high-performance cores serve compute-intensive or latency-sensitive workloads, while small cores provide energy-efficient execution. For example, the Qualcomm Snapdragon 845 mobile platform [34] is built on the ARM big.LITTLE architecture and includes four high-performance cores and four efficiency cores. In resource-constrained mobile or edge devices, the performance difference between CPUs and GPUs is minimal [35,36]. Therefore, devices without dedicated hardware accelerators for neural network computing typically use their CPUs to perform inference.

High-performance computing GPUs excel at high-throughput computation compared to other computing platforms. Since neural network computations are highly data-intensive tasks that perform numerous repetitive operations on multi-dimensional data, the single instruction multiple thread (SIMT) execution model of GPUs is particularly well suited to them. HPC GPUs integrate thousands of cores into a single processor package and have large on-chip memory structures, including registers, caches, and shared memory. For example, the NVIDIA A100 GPU [37] contains 6,912 CUDA cores, providing 19.5 TFLOPS of peak throughput for 32-bit single-precision (SP) computations. Such abundant computational and memory resources allow GPUs to execute deep neural networks in a massively parallel manner, offering higher throughput than other computing platforms such as CPUs. GPUs have therefore become the de facto hardware for neural network computing, especially for training deep neural networks. Although GPUs offer better energy efficiency than general-purpose CPUs while retaining general programmability, they still fall short of the energy efficiency required for neural network computations in embedded settings. As a result, GPUs are commonly used for large-scale training of deep neural networks, while processing solutions based on other hardware acceleration options are highly sought after for fast inference in embedded systems.
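
The quoted 19.5 TFLOPS figure is consistent with each CUDA core issuing one fused multiply-add (two FLOPs) per cycle at the A100's boost clock of roughly 1.41 GHz: 6,912 × 2 × 1.41 GHz ≈ 19.5 TFLOPS. To make the SIMT model concrete, the minimal CUDA sketch below launches one thread per element of a SAXPY operation; it is a generic illustration rather than code tied to any particular product discussed here.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each thread computes one element of y = a*x + y (SAXPY); the same
    // instruction stream is executed by thousands of threads in lockstep (SIMT).
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;                           // 1M elements
        size_t bytes = n * sizeof(float);
        float *hx = new float[n], *hy = new float[n];
        for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;                                  // explicit device buffers
        cudaMalloc(&dx, bytes);
        cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        int threads = 256;                               // 8 warps of 32 threads per block
        int blocks = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);

        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %.1f\n", hy[0]);                  // expect 4.0
        cudaFree(dx); cudaFree(dy);
        delete[] hx; delete[] hy;
        return 0;
    }

Each warp of 32 threads executes the same instruction on different data, and the grid of blocks spreads the one million elements across all available CUDA cores.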

Embedded GPUs in mobile SoCs typically contain only a few cores in the processor package, which is insufficient to provide adequate computational power for neural network computations. For example, the Samsung Exynos 9810 mobile processor [38] integrates a Mali-G72 MP18 GPU with 18 cores. Thus, if a mobile device is not equipped with a dedicated accelerator, it often uses its CPU cores to execute deep neural networks, since CPU performance is comparable to that of the embedded GPU [35,36]. In high-end embedded devices such as the NVIDIA Jetson series processors [39], GPUs can accommodate hundreds of cores to provide high-throughput computation within an embedded processor. For example, the NVIDIA Jetson AGX Xavier includes a Volta-architecture GPU with 512 CUDA cores and an eight-core ARM v8.2 64-bit CPU. By incorporating a relatively large GPU in the embedded processor, it offers higher throughput than the embedded GPUs typically found in mobile devices.

The latest trend for high-performance and embedded GPUs (such as the NVIDIA A100 and Jetson AGX Xavier) is to supplement traditional GPU microarchitectures with dedicated matrix/vector engines, such as the Tensor Cores in recent NVIDIA GPUs. Tensor Cores are combinational logic blocks specifically designed to handle matrix multiply-accumulate (MMA) operations within the processing blocks of a streaming multiprocessor (SM). Figure 9 illustrates the block diagram of a processing block, where a pair of Tensor Cores shares the processing block's scheduling resources (such as the warp scheduler and register file) with the traditional GPU pipeline elements, including the floating-point and integer ALU lanes. Each Tensor Core consists of four-element dot product (FEDP) units that collectively perform 4×4 MMA operations. A group of four threads in a warp is called a thread group, and each thread group uses the Tensor Cores to process a 4×8 tile. Two thread groups form an octet and collectively produce an 8×8 block, so the 32 threads in a warp generate a 16×16 MMA output. The Tensor Cores introduced in the NVIDIA Volta architecture have strict limitations on matrix input sizes [41], but the subsequent Turing and Ampere architectures have relaxed these restrictions, allowing a more diverse range of input dimensions. The Tensor Cores in a processing block provide higher computational density than traditional CUDA cores by densely packing multipliers and adders into the FEDPs. In particular, a processing block of the NVIDIA Tesla V100 [41] has 16 single-precision floating-point units, while its Tensor Cores produce 8 times the throughput at the same precision [42]. Therefore, recent NVIDIA GPUs with Tensor Cores provide computational advantages for neural networks in both the high-performance and embedded computing domains [43].
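
The 8× figure follows from each Volta Tensor Core performing a 4×4×4 FP16 MMA per cycle (64 FMAs, or 128 FLOPs): two Tensor Cores per processing block deliver 256 FLOPs per cycle, versus 32 FLOPs per cycle from the block's 16 FP32 FMA units. Programmatically, the warp-level 16×16 MMA described above is exposed through CUDA's WMMA API; the sketch below, a minimal example assuming FP16 inputs with FP32 accumulation, shows a single warp computing one 16×16×16 MMA on the Tensor Cores.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp (32 threads) cooperatively computes D = A x B + C, where A and B
    // are 16x16 FP16 tiles and the accumulator is FP32. Requires compute
    // capability 7.0 or later (Volta, Turing, Ampere).
    __global__ void warp_mma_16x16x16(const half *A, const half *B, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                 // C = 0
        wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // issued to Tensor Cores
        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

    // Example launch: one block of 32 threads, i.e. a single warp:
    //   warp_mma_16x16x16<<<1, 32>>>(dA, dB, dD);

In practice, libraries such as cuBLAS and cuDNN issue these operations automatically, so most neural network frameworks benefit from Tensor Cores without application-level WMMA code.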

FPGAs have quickly become a hardware alternative for accelerating neural network computations. FPGAs are inherently programmable and can accelerate the inference of specific neural networks [44]. They can implement data paths optimized for a neural network algorithm, and designs synthesized onto reconfigurable logic achieve higher power efficiency and lower latency than traditional CPUs and GPUs with instruction-based pipelined execution. Importantly, the reconfigurable logic and routing in FPGAs allow various forms of neural network accelerators to be implemented efficiently on FPGA fabrics [45-50]. Despite the hardware support for programmability, using FPGAs for neural network acceleration requires longer development times and a steeper learning curve than traditional general-purpose processing options. In addition, a single FPGA chip usually does not have enough logic and on-chip memory (such as registers and SRAM) to hold the large amounts of data in a deep neural network. Therefore, FPGAs are typically used for fast inference in embedded systems rather than for building high-throughput environments for neural network training.

ASIC-based accelerators provide efficient processing solutions for neural network computations. ASIC implementations optimize the data flow of neural network algorithms to build custom designs targeting specific workloads. Neural network accelerators implemented as ASICs are driven by factors similar to those motivating FPGA-based acceleration (i.e., customizing the accelerator for different workloads). The performance and energy efficiency of ASIC implementations far exceed those of other processing solutions such as FPGAs, GPUs, and CPUs [51-57]. However, this unparalleled efficiency comes at the cost of significantly longer development time and higher cost, and the custom designs make programming and use relatively difficult. Thus, ASICs may not be the best choice for products in highly competitive, low-margin markets where time to market (TTM) is crucial, or in environments where software and hardware require frequent updates and maintenance.

2. Commercial Options for Neural Network Acceleration

Many hardware solutions have been proposed to accelerate neural network computations across various computing domains, from mobile devices to HPC systems. In addition to the research prototypes presented in academic papers, Table 2 summarizes off-the-shelf commercial devices available for neural network acceleration in embedded systems.

• Raspberry Pi 4: The Raspberry Pi 4 Model B [58] is a compact and affordable single-board computing device. It uses a Broadcom processor with a quad-core Cortex-A72 (ARM v8) 64-bit CPU, a VideoCore VI GPU, 2-8 GB of LPDDR4-3200 SDRAM, and various on-board I/O, including Wi-Fi, Bluetooth, Gigabit Ethernet, HDMI, and USB-C. The GPU in the Raspberry Pi 4 does not support general-purpose GPU (GPGPU) computing, and the SoC contains no dedicated accelerator.

• Arduino Portenta H7: Similar to the Raspberry Pi 4, the Arduino Portenta H7 [59] provides an affordable hardware solution for embedded processing. It is built around a dual-core microcontroller combining an ARM Cortex-M7 CPU and a 32-bit Cortex-M4 MCU with a Chrom-ART graphics accelerator. The Arduino Portenta H7 likewise lacks GPGPU support and has no dedicated accelerator for neural network computing in its SoC. Both the Raspberry Pi and Arduino processors provide convenient processing solutions with similar hardware capabilities and specifications for low-end embedded or IoT devices.

• NVIDIA Jetson TX2: The NVIDIA Jetson TX2 [60] is a high-performance embedded processing platform equipped with a quad-core ARM Cortex-A57, a dual-core NVIDIA Denver 2 CPU, and a Pascal-architecture GPU with 256 CUDA cores, integrated with 8 GB of 128-bit LPDDR4 memory. The memory is hardwired to the memory controller in the processor and shared with the host CPU (i.e., unified memory) [61]. The NVIDIA Jetson TX2 does not have the dedicated hardware accelerators (such as Tensor Cores) found in its successors.

• NVIDIA Jetson AGX Xavier: The NVIDIA Jetson AGX Xavier [39] is a higher-end embedded processing platform than its predecessor, the NVIDIA Jetson TX2. It features an octa-core ARM v8.2 CPU and a Volta-architecture GPU with 512 CUDA cores and 64 Tensor Cores, integrated with 32 GB of 256-bit LPDDR4 memory. The Tensor Cores introduced in the Volta architecture provide greater throughput for neural network computations. Like the NVIDIA Jetson TX2, it supports unified memory shared between the host CPU and the GPU.
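
Because the CPU and GPU on Jetson platforms share the same physical LPDDR4, a managed allocation can be touched by both sides without explicit host-device copies. The sketch below is a minimal, generic illustration using CUDA's standard managed-memory API (cudaMallocManaged) rather than Jetson-specific code.

    #include <cuda_runtime.h>
    #include <cstdio>

    // GPU kernel that scales a buffer in place.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1024;
        float *data;
        // A single managed allocation is visible to both the host CPU and the GPU.
        // On Jetson-class devices both access the same physical LPDDR4, so no
        // explicit cudaMemcpy between host and device buffers is required.
        cudaMallocManaged(&data, n * sizeof(float));

        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes the buffer
        scale<<<(n + 255) / 256, 256>>>(data, n, 3.0f); // GPU updates it in place
        cudaDeviceSynchronize();                        // wait before the CPU reads

        printf("data[0] = %.1f\n", data[0]);            // expect 3.0
        cudaFree(data);
        return 0;
    }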

• Edge TPU: The Edge Tensor Processing Unit (TPU) [62] is a scaled-down ASIC implementation of the Google TPU [55] that provides a convenient hardware solution for neural network acceleration in low-power edge devices. Like the Raspberry Pi and Arduino processors, the Edge TPU is offered on a single-board computing device, which integrates the Edge TPU coprocessor with a quad-core ARM Cortex-A53 and Cortex-M4F CPU, GC7000 Lite graphics, 1 GB of LPDDR4 memory, and various I/O peripherals, including USB-C, Gigabit Ethernet, and HDMI. The matrix unit (MXU) of the Cloud TPU [63] used in data centers implements a 128×128 systolic array operating on INT8 data types. The Edge TPU is presumed to inherit a similar design, but the details of its TPU coprocessor (i.e., the accelerator) have not been disclosed.

• Intel Neural Compute Stick 2: The Intel Neural Compute Stick (NCS) 2 [64] is a small computing device packaged in a thumb-drive chassis. It is a plug-and-play device that can technically connect to any platform with a USB 3.0 interface. The Intel NCS 2 is built on the Intel Movidius Myriad X vision processing unit (VPU), which contains 16 streaming hybrid architecture vector engine (SHAVE) cores integrated with a dual-core SPARC V8 RISC CPU. Each SHAVE core is a programmable very long instruction word (VLIW) engine with SIMD functionality supporting mixed-precision 32-, 16-, and 8-bit data types.

• Xilinx Zynq Series FPGA: The Xilinx Zynq series FPGAs [65] implement a programmable SoC that integrates a reconfigurable fabric with a dual-core ARM Cortex-A9 or quad-core ARM Cortex-A53 CPU in a single package. The Python productivity for Zynq (PYNQ) project aims to make Xilinx Zynq FPGAs easier to use by enabling on-board programmability in Python, without requiring ASIC-style design tools to build programmable logic circuits. Specifically, the PYNQ-Z1 FPGA board has a dual-core ARM Cortex-A9 CPU and provides 13,300 logic slices (about 85K logic cells), 220 digital signal processing (DSP) slices, and 630 KB of fast block RAM (BRAM).

• Tesla FSD: The Tesla Full Self-Driving (FSD) computer [66] is an embedded processing solution developed by Tesla for its autonomous driving system. It is not a commercially available product, but it serves as a good example of embedded processing for deep neural networks. The Tesla FSD chip contains 12 Cortex-A72 CPU cores and 96×96 multiply-accumulate (MAC) arrays for neural processing.
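
For a sense of scale, a 96×96 MAC array contains 9,216 MAC units; assuming the roughly 2 GHz clock reported for the FSD neural processing units, a single array would deliver about 9,216 × 2 ops × 2 GHz ≈ 36.9 TOPS.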

• HPC processors: In addition to the small single-board computing devices listed above (such as the Raspberry Pi 4, Arduino Portenta H7, and Edge TPU) and powerful embedded processors (such as the Tesla FSD), embedded systems can also use traditional HPC CPUs and GPUs as processing solutions for neural network computations. For example, the AMD EPYC 7742 CPU [32] has 64 cores in one processor package and supports up to 128 hardware threads. The processor operates at a base clock of 2.25 GHz and boosts up to 3.4 GHz. In contrast, the NVIDIA A100 GPU [37] contains 6,912 CUDA cores and 432 Tensor Cores, and is equipped with 40 GB of on-board memory providing 1.6 TB/s of memory bandwidth.
