Autonomous driving systems are extremely complex, integrating numerous cutting-edge technologies, including perception and decision-making capabilities. Only carefully designed hardware can support these resource-intensive tasks. Moreover, autonomous driving is one of the first embedded applications that heavily rely on machine learning algorithms. Given this, a significant amount of research resources has been invested in the development of neural network accelerators to meet specific needs such as redundancy and energy efficiency.

Developing autonomous vehicles is undoubtedly one of the most challenging tasks in the current field of artificial intelligence. Autonomous driving systems must accurately perceive their surroundings and plan appropriate actions to ensure safe driving on the road. They must cope with various complex situations, such as road conditions, weather, intricate intersections, pedestrians, and other road users, all of which increase the complexity of scene understanding. Nevertheless, this task is crucial. To achieve safe and efficient driving, a comprehensive understanding of the vehicle’s surrounding environment is essential. To this end, autonomous vehicles are equipped with various sensors to collect massive amounts of data. However, raw data without processing has limited value and must undergo in-depth analysis to be useful.
Given the complexity of the task, scene understanding relies on learning algorithms, the most notable of which are neural networks. It is worth noting that the training process for such algorithms on state-of-the-art hardware platforms (such as multiple high-performance Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) [1]) can take days. Therefore, this task clearly cannot be completed during driving. Embedded hardware only needs to compute the forward propagation of data in the neural network, i.e., inference. However, the inference process is also resource-intensive, especially since a high refresh rate must be achieved to reduce perception latency. Fortunately, the forward propagation of most neural networks can be computed through dot products, which is a highly parallelizable operation. Thus, due to the limited number of cores, Central Processing Units (CPUs) are clearly not suitable. Nowadays, GPUs are widely used in both the training and inference stages of machine learning.
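To see why inference parallelizes so well, note that a convolution layer can be lowered to a single matrix multiplication via the classic im2col transformation, turning the layer into a grid of independent dot products. Below is a minimal NumPy sketch; the shapes are illustrative and not taken from any particular network.

```python
import numpy as np

# Illustrative shapes (not from any specific network):
# a 3-channel 32x32 input, 16 output channels, 3x3 kernels.
C_in, H, W = 3, 32, 32
C_out, K = 16, 3

x = np.random.randn(C_in, H, W).astype(np.float32)
w = np.random.randn(C_out, C_in, K, K).astype(np.float32)

# im2col: unfold each KxK receptive field into one column.
H_out, W_out = H - K + 1, W - K + 1
cols = np.empty((C_in * K * K, H_out * W_out), dtype=np.float32)
for i in range(H_out):
    for j in range(W_out):
        cols[:, i * W_out + j] = x[:, i:i + K, j:j + K].ravel()

# The whole layer is now one matrix multiply: every entry of `out`
# is an independent dot product, so all of them can run in parallel.
out = (w.reshape(C_out, -1) @ cols).reshape(C_out, H_out, W_out)
```

GPUs, and the NNAs discussed below, exploit exactly this structure: none of those dot products depends on another.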
However, for critical embedded systems, GPUs still face issues of inefficiency and reliability. To address this, the industry is actively developing alternatives. Neural Network Accelerators (NNAs) are hardware systems specifically designed for neural network computations. We will focus on two of them: Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs), such as Tesla’s FSD computer.
Requirements for Embedded Computers in Autonomous Vehicles

Today, the most advanced autonomous driving systems primarily rely on computer vision. The core computational task of an autonomous vehicle is to extract features from images: the system must build a deep understanding of what it perceives. Convolutional Neural Networks (CNNs) have proven exceptionally effective at this task. They typically consist of multiple layers of convolution, activation functions, pooling, and deconvolution, and data flowing through these layers yields valuable information from images or even video. However, the storage and computational costs of such algorithms are high. For example, the classic classifier ResNet152, designed for 224×224 images, requires 11.3 billion floating-point operations (11.3 GFLOP) per inference and roughly 400 MB of memory to store its parameters [2]. Moreover, current autonomous vehicles carry multiple high-resolution cameras; the Tesla Model 3, for instance, has eight cameras with a resolution of 1280×960 each, and all eight video streams must be analyzed in real time. The computational power this demands is easy to imagine. Notably, convolution typically accounts for over 98% of the operations performed during CNN inference, while ReLU and pooling, being relatively simple logical functions, make up only a small fraction of the remaining load. Since convolution is built on dot products, hardware design must make dot-product computation as efficient as possible, which ultimately translates into many parallel multiply/add operations.
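The dominance of convolution is easy to check with back-of-the-envelope operation counts. The sketch below compares one hypothetical 3×3 convolution layer with the ReLU and 2×2 max-pooling that follow it; all shapes are illustrative assumptions, not taken from ResNet or any production network.

```python
# Rough operation counts for one conv + ReLU + 2x2 max-pool stage.
# All shapes are illustrative, not from any specific network.
C_in, C_out, K = 64, 64, 3      # channels in/out, kernel size
H, W = 240, 320                 # output feature-map resolution

# Each output value takes C_in*K*K multiplies plus as many adds.
conv_ops = H * W * C_out * (2 * C_in * K * K)
relu_ops = H * W * C_out                    # one comparison per activation
pool_ops = (H // 2) * (W // 2) * C_out * 3  # 3 comparisons per 2x2 window

total = conv_ops + relu_ops + pool_ops
print(f"convolution share: {conv_ops / total:.2%}")  # ~99.85%
```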
Embedded systems for autonomous vehicles also face specific constraints related to safety, reliability, and real-time operation. The three main challenges to overcome are:
- The processing pipeline must be fast enough to handle the large volumes of sensor data efficiently; the faster the system, the more data can be analyzed in the available time. The refresh rate of the perception system is critical: it must be high enough that, even at high speed, the system can respond quickly to sudden events.
- The design must avoid single points of failure. The system needs strong fault tolerance: it must keep operating under degraded resources, recover quickly when a component fails, and detect faults in the first place. This is typically addressed with redundancy and result comparison: two independent computational pipelines run in parallel, and a mismatch between their outputs reveals an error (see the sketch after this list).
- The system must be highly energy efficient. Since autonomous vehicles run primarily on electric power, energy efficiency directly affects driving range; high power draw also burdens the cooling solution and adds weight and cost.
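As a toy illustration of the dual-pipeline comparison mentioned in the fault-tolerance item, the sketch below runs the same inference function on two independent compute units and flags any mismatch. Real systems do this in hardware, per cycle or per frame; every name here is hypothetical.

```python
import numpy as np

def run_redundant(infer_a, infer_b, frame, atol=1e-5):
    """Run two independent pipelines on the same frame and cross-check.
    A disagreement signals a hardware fault in one of the units."""
    out_a = infer_a(frame)
    out_b = infer_b(frame)
    if not np.allclose(out_a, out_b, atol=atol):
        # A real vehicle would trigger fail-over logic here;
        # raising keeps the sketch simple.
        raise RuntimeError("redundant pipelines disagree: possible fault")
    return out_a

# Usage: both "units" run the same stand-in model on one frame.
weights = np.random.randn(10, 4).astype(np.float32)
model = lambda x: weights @ x
frame = np.random.randn(4).astype(np.float32)
result = run_redundant(model, model, frame)
```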
Three Main Computing Platforms
Graphics Processing Units (GPUs)

The Nvidia Drive platform currently leads the market for GPU-based embedded autonomous driving systems. This general-purpose computing solution ships with the Drive software stack and is designed to let automakers focus on the software side of their autonomous driving solutions. The latest and most powerful iteration of the Drive PX architecture features two Tegra X2 SoCs, each containing four ARM A57 CPUs and a Pascal GPU. Both GPUs have dedicated memory and instructions optimized for DNN acceleration, and to sustain the required data rates each Tegra connects to its Pascal GPU over a PCI-E Gen2 x4 bus with an aggregate bandwidth of 4.0 GB/s. Thanks to this optimized input/output architecture and DNN acceleration, the platform reaches up to 24 TFLOP/s.
However, the system consumes up to 250 W. Therefore, even GPU experts have turned to ASICs for their new platform set to launch in 2022. Reportedly, the Nvidia Drive AGX Orin achieves a computation speed of 200 TFLOP/s by integrating six different types of processors (including CPU, GPU, Deep Learning Accelerator (DLA), Programmable Vision Accelerator (PVA), Image Signal Processor (ISP), and Stereo/Optical Flow Accelerator).
Field Programmable Gate Arrays (FPGAs)

In recent years, FPGAs have become an attractive choice for algorithm acceleration. Unlike CPUs or GPUs, an FPGA is configured specifically for the target algorithm, which greatly improves the efficiency of the task at hand. Estimating the floating-point performance of an FPGA without empirical measurement is difficult, but they readily reach several TFLOP/s at power levels of tens of watts. A standalone FPGA must be paired with a host system, typically over PCIe, that feeds it data; in our case, images and other sensor outputs. FPGAs are usually employed purely as accelerators for neural network inference: the chip is configured according to the network's structure, and the model parameters are stored in memory. Since the internal memory of an FPGA rarely exceeds a few hundred megabits, it cannot hold the parameters of most neural networks, so external memory such as DDR SDRAM is required; the bandwidth and power consumption of that external memory then become the bottleneck of a high-performance system.
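A quick estimate shows why off-chip memory becomes the bottleneck: if the weights do not fit on chip, they must be re-streamed from DDR for every frame. The figures below are illustrative assumptions, not measurements of any particular board.

```python
# Bandwidth needed to re-stream model parameters from DDR every frame.
# Illustrative assumptions, not measurements of a specific device.
params_bytes = 400e6    # e.g. a large CNN's weights, as with ResNet152 above
fps          = 30       # target perception refresh rate
ddr_bw       = 25.6e9   # peak of one DDR4-3200 channel, bytes/s

needed = params_bytes * fps
print(f"required: {needed / 1e9:.1f} GB/s "
      f"= {needed / ddr_bw:.0%} of one DDR4 channel")   # ~12 GB/s, ~47%
```

Nearly half a DDR channel would go to weights alone, before activations and sensor data are counted, and each DRAM access is also far more energy-hungry than an on-chip SRAM access.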
Nevertheless, high-end FPGAs can still deliver outstanding performance for our application. For example, Xilinx's Zynq UltraScale MPSoC, designed with autonomous driving tasks in mind, achieves more than three times the energy efficiency of an Nvidia Tesla K40 GPU on CNN inference (14 FPS/W vs. 4 FPS/W) [3], and it can track objects in a 1080p video stream in real time at 60 FPS. When building neural network accelerators on FPGAs, various hardware-level techniques are used to raise performance and efficiency, and the design of the computational units is particularly critical: the number of low-level elements (gates, flip-flops) on an FPGA is fixed, so smaller computational units mean more of them and thus higher peak performance, while a well-designed array of units also raises the achievable operating frequency. Guo et al. [3] describe three main techniques for improving performance through computational unit design:
- Low-bit-width computational units: the bit width of the input data directly affects the size of the computational units; the smaller the bit width, the smaller the unit. Most recent FPGA designs for neural networks use fixed-point units instead of 32-bit floating-point ones. 16-bit units are widespread, and units as narrow as 8 bits also give good results [5]; CNNs, and neural networks in general, are quite tolerant of reduced precision [4] (see the quantization sketch after this list).
- Fast convolution methods: several algorithms accelerate the convolution itself. The Discrete Fourier Transform or the Winograd method can improve performance significantly (up to 4x) for suitable kernel sizes [3] (a Winograd example is included in the sketch after this list).
- Frequency optimization: routing between on-chip SRAM and the Digital Signal Processing (DSP) units can limit the peak operating frequency. Using the slices adjacent to each DSP unit as local RAMs, so that the DSP units run in their own clock domain, raises the achievable frequency; Xilinx's CHaiDNN-v2 project [5] doubled its peak operating frequency with this technique.
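To make the first two techniques concrete, the sketch below shows a symmetric 8-bit fixed-point quantization round trip and the Winograd F(2,3) transform, which produces two outputs of a 3-tap filter with four multiplications instead of six. Both follow the standard textbook formulations rather than any specific FPGA design.

```python
import numpy as np

# --- Low bit width: symmetric int8 quantization of a weight vector. ---
w = np.random.randn(64).astype(np.float32)
scale = np.abs(w).max() / 127.0              # map the range to [-127, 127]
w_q = np.round(w / scale).astype(np.int8)    # what the FPGA stores/computes
w_hat = w_q.astype(np.float32) * scale       # dequantized approximation
print("max quantization error:", np.abs(w - w_hat).max())

# --- Fast convolution: Winograd F(2,3). ---
# Two outputs of a 3-tap filter from 4 multiplies (direct method: 6).
def winograd_f23(d, g):
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.randn(4)                        # four input samples
g = np.random.randn(3)                        # three filter taps
direct = np.array([d[0:3] @ g, d[1:4] @ g])   # two dot products, 6 multiplies
assert np.allclose(winograd_f23(d, g), direct)
```

In practice the filter-side transforms, such as (g0+g1+g2)/2, are computed once offline, so only the four multiplies remain on the critical path.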
The ZynqNet FPGA accelerator [6] is a fully functional proof-of-concept CNN accelerator that integrates the techniques above together with several other innovations. As the name suggests, the framework was developed for Xilinx Zynq boards and accelerates CNN inference with a nested-loop algorithm that minimizes arithmetic operations and memory accesses.
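The nested-loop view referred to above is the standard six-loop formulation of a convolution layer; an accelerator chooses which loops to unroll into parallel hardware and which to tile so the working set fits in on-chip buffers. A generic sketch of that formulation follows (not ZynqNet's actual HLS code).

```python
import numpy as np

def conv_nested(x, w):
    """Direct convolution as six nested loops: output channel (co),
    input channel (ci), output position (i, j), kernel tap (ki, kj).
    FPGA designs unroll some loops into parallel multiply/add units
    and tile the rest so the data they touch fits in on-chip RAM."""
    C_out, C_in, K, _ = w.shape
    _, H, W = x.shape
    out = np.zeros((C_out, H - K + 1, W - K + 1), dtype=x.dtype)
    for co in range(C_out):
        for ci in range(C_in):
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    for ki in range(K):
                        for kj in range(K):
                            out[co, i, j] += x[ci, i + ki, j + kj] * w[co, ci, ki, kj]
    return out
```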
Application-Specific Integrated Circuits (ASIC) – Tesla FSD Computer

Application-Specific Integrated Circuits (ASICs) give designers complete freedom over the hardware implementation, so the circuit can be tailored exactly to its task and deliver exceptional performance on it. For automakers with the resources to develop such complex systems, this is the strongest option: the resulting performance for autonomous vehicles is outstanding, but development time and cost are correspondingly high.
The Tesla Full Self-Driving (FSD) computer was first unveiled on “Autonomy Day” on April 22, 2019. Since its release, the system has been running in all Tesla models and has demonstrated excellent operational performance. According to Elon Musk, the FSD computer will eventually support Level 5 autonomous driving systems. This ASIC must meet the following requirements:
- The computer must operate below 100 W.
- It must run neural network models at no less than 50 TFLOP/s.
- Data pre- and post-processing still require some GPU support, although as the software and AI mature, these traditional algorithms may gradually be retired from the general-purpose hardware.
- Safety and security are central to the design. The failure rate must be lower than the probability of a human driver losing consciousness, which dictates full hardware redundancy: each FSD computer carries two power supplies and two independent computing units.
- Each image and data set must be processed independently (batch size of 1) to minimize latency (a worked example follows this list).
- It must offer rich connectivity to serve the vehicle's many sensors.
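To see why the batch-size-1 requirement matters, compare the worst-case latency of unbatched and batched inference at a fixed camera rate. The numbers below are illustrative assumptions, not Tesla's actual timings.

```python
# Latency cost of batching frames that arrive at a fixed camera rate.
# Illustrative numbers; not Tesla's actual timings.
frame_period = 1 / 36     # seconds between frames from a 36 fps camera
infer_time   = 0.010      # assumed time for one pass through the NNA

for batch in (1, 8):
    queue_wait = (batch - 1) * frame_period  # oldest frame waits for batch
    print(f"batch={batch}: worst-case latency ~ "
          f"{(queue_wait + infer_time) * 1e3:.0f} ms")
# batch=1 -> ~10 ms; batch=8 -> ~204 ms before the oldest frame is scored.
```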
Given the application, the FSD computer excels at image processing. Its interface provides 2.5 gigapixels per second of serial input, matching the high-resolution cameras around the vehicle, and LPDDR4 DRAM interfaces serve the other sensors, such as radar. A dedicated image signal processor handles noise reduction and tone mapping (to bring out shadow detail), while an external H.265 video encoder module handles data output. That encoder is key to Tesla's software development process: data is the cornerstone of most machine learning algorithms, and the Tesla fleet supplies vast amounts of video for training the autonomous driving models, so the extensive database built up over the years is crucial to its success. Lighter data processing is handled by a GPU supporting 32-bit and 16-bit floating point, clocked at 1 GHz and delivering 600 GFLOP/s, while twelve ARM CPUs at 2.2 GHz take care of auxiliary tasks.
However, the core of the FSD computer is its Neural Network Accelerator (NNA). To ensure safety, each computer carries two independent NNAs. Each NNA has 32 MB of SRAM for storing intermediate results and model parameters, which matters because a DRAM access consumes roughly 100 times more energy than an SRAM access. In each clock cycle, 256 bytes of activation data and 128 bytes of parameters are combined in each NNA's 96×96 (9,216-unit) multiply/add array; this data flow requires at least 1 TB/s of SRAM bandwidth per accelerator. The array accumulates results in place, performing 9,216 multiply-accumulates (over 18,000 arithmetic operations) per cycle, so at 2 GHz each NNA delivers about 36 TFLOP/s. The NNA is optimized almost exclusively for dot products, however; nonlinear operations map poorly onto such a chip, so dedicated modules for ReLU and pooling were added.
The current version of Tesla's software requires 35 GOP to analyze a single image, so the FSD computer can analyze about 1,050 frames per second in total. Running the Full Self-Driving beta, the FSD computer draws 72 W, of which 15 W go to the NNAs.
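These throughput figures follow from the numbers quoted above; the short calculation below reproduces them, treating the second NNA as the redundant copy that re-runs the same work.

```python
# Reproducing the FSD throughput figures from the text.
macs_per_cycle = 96 * 96             # 9,216 multiply-accumulate units per NNA
ops_per_cycle  = 2 * macs_per_cycle  # each MAC = one multiply + one add
clock_hz       = 2e9

nna_ops = ops_per_cycle * clock_hz   # ~36.9 TFLOP/s per NNA
gop_per_frame = 35e9                 # software cost of analyzing one image
print(f"per NNA: {nna_ops / 1e12:.1f} TFLOP/s, "
      f"{nna_ops / gop_per_frame:.0f} frames/s")
# ~36.9 TFLOP/s and ~1053 fps; with the second NNA duplicating the work
# for redundancy, this matches the ~1050 frames/s quoted above.
```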
System Comparison

Due to the complexity of embedded systems, evaluating them is difficult. The ideal way to verify a system improvement is a standard benchmark suite representative of autonomous driving workloads. Benchmarks fall into two main categories: datasets and workloads. KITTI [7] was the first benchmark dataset for autonomous driving, containing a wealth of perception sensor data such as monocular/stereo images and 3D LiDAR point clouds; the associated ground truth allows algorithms to be evaluated on typical autonomous driving tasks such as lane detection, odometry, object detection, and tracking. Such datasets can serve as stress sources to assess a system's peak performance on driving-related tasks. The second category evaluates new hardware architectures through a set of applications and vision kernels. CAVBench [8] is currently the best starting point for assessing autonomous driving computing systems: it builds virtual scenarios from datasets to evaluate real-world performance and provides six workloads (object detection, object tracking, battery diagnostics, speech recognition, edge video analysis, and SLAM) whose separation helps developers pinpoint performance bottlenecks.
Unfortunately, benchmarking or evaluation processes have not yet been widely adopted in the field of edge computing for autonomous driving. However, Guo et al. [3] have successfully compared various state-of-the-art neural network inference accelerators.

[Figure: Performance and resource utilization comparison of state-of-the-art neural network accelerator designs [3]]
The figure above compares the computational capability and efficiency of various FPGA- and GPU-based accelerators. Overall, in the 10-100 GOP/J range, FPGAs are somewhat more energy efficient than GPUs. The main obstacle to improving FPGA-based solutions is scalability. Zhang et al. [9] (entry [76] in the figure) proposed an FPGA-cluster approach to close the performance gap with GPUs: a 16-bit fixed-point design spanning six Xilinx Virtex-7 FPGAs (device XC7VX690T). The architecture offers GPU-class computational capability while retaining the advantage of low energy consumption.
However, FPGA-based NNAs remain a rapidly evolving research area. Researchers are working to optimize architectures for higher energy efficiency and computational capabilities. In fact, GPU-based solutions have reached a high level of architectural optimization, with their performance currently limited mainly by material physical limits and manufacturing processes. On the other hand, hardware-oriented solutions still have vast development prospects. Even general-purpose processor experts like AMD, Intel, or Nvidia are now focusing on hardware accelerators.
ASICs remain the highest-performing NNAs. The Tesla FSD computer, for example, reaches a total of 144 TFLOP/s, while the Nvidia Drive architecture running Tesla's autonomous driving software stack topped out at 24 TFLOP/s. FPGAs are expected to close in on ASIC performance gradually, but developing an ASIC demands a major engineering investment that goes beyond the chip itself to the entire software architecture: every ASIC NNA needs its own compiler to deploy neural networks onto it.
Conclusion
Autonomous vehicles are not ordinary systems; they must ensure the safety of users and their environment. The main challenges faced by autonomous driving computers are providing sufficient computational power, robustness, and efficiency. Since general-purpose computing processors lack the parallelization capabilities necessary for efficiently running neural networks, they are not ideal choices. In fact, the key feature of any Neural Network Accelerator (NNA) is optimizing the dot product operations used in Convolutional Neural Networks (CNNs).
The degree of optimization varies with the flexibility of the platform. GPU-based NNAs are currently the most flexible; with their thousands of cores they perform very well on neural networks, but that performance comes at the cost of poor energy efficiency. FPGAs are less flexible: configured for a specific network, they can outperform GPUs at similar power levels, but they scale poorly, and building an NNA on an FPGA is a demanding task. Ultimately, weighing robustness, energy efficiency, and computational power, static hardware built from the ground up, such as an ASIC, currently provides the best overall performance.
Despite the excellent performance of ASICs, only large companies have the capability to develop them. Similarly, FPGAs, as complex systems, require carefully designed algorithms to operate efficiently. For these reasons, GPUs are still widely used in autonomous vehicles, but customizable hardware is expected to take their place soon.
References
[1] N. P. Jouppi, C. Young, N. Patil, D. Patterson, et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." ISCA 2017. arXiv:1704.04760.
[2] K. He, X. Zhang, S. Ren, J. Sun. "Deep Residual Learning for Image Recognition." CVPR 2016. arXiv:1512.03385.
[3] K. Guo, S. Zeng, J. Yu, Y. Wang, H. Yang. "A Survey of FPGA-Based Neural Network Accelerator." arXiv:1712.08934.
[4] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan. "Deep Learning with Limited Numerical Precision." ICML 2015. arXiv:1502.02551.
[5] Xilinx CHaiDNN-v2 project. https://github.com/Xilinx/chaidnn (accessed March 21, 2020).
[6] D. Gschwend. "ZynqNet: An FPGA-Accelerated Embedded Convolutional Neural Network." arXiv:2005.06892.
[7] A. Geiger, P. Lenz, R. Urtasun. "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite." CVPR 2012.
[8] Y. Wang, S. Liu, X. Wu, W. Shi. "CAVBench: A Benchmark Suite for Connected and Autonomous Vehicles." IEEE/ACM Symposium on Edge Computing (SEC) 2018.
[9] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, J. Cong. "Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster." ISLPED 2016, pp. 326-331.
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv:2010.11929.
[11] J. Chen, Z. Xu, M. Tomizuka. "End-to-End Autonomous Driving Perception with Sequential Latent Representation Learning." arXiv:2003.12464.