FPGA Surpasses GPU as the Next Generation Deep Learning Engine

New Intelligence Compilation

Source: nextplatform.com

Author: Linda Barney

Translator: Zhang Yi


[New Intelligence Guide] Dr. Eriko Nurvitadhi of the Intel Accelerator Architecture Lab evaluated emerging DNN algorithms on two generations of Intel FPGAs against the latest GPUs. The research argues that emerging low-precision and sparse DNN algorithms offer large efficiency gains over traditional dense FP32 DNNs, but introduce irregular parallelism and custom data types that are difficult for GPUs to handle. FPGAs, by contrast, are designed for extreme customizability and handle irregular parallelism and custom data types well. This trend makes FPGAs a viable platform for running future DNN, AI, and ML applications.


The exponential growth of digital data such as images, video, and voice from social media and the Internet is driving the need for analytics that make the data understandable and usable.

Data analysis typically relies on machine learning (ML) algorithms. Among ML algorithms, deep convolutional neural networks (DNNs) provide state-of-the-art accuracy on important image classification tasks and are widely adopted.

At the recent International Symposium on Field-Programmable Gate Arrays (ISFPGA), Dr. Eriko Nurvitadhi of the Intel Accelerator Architecture Lab presented a paper titled “Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?”. The research evaluated emerging DNN algorithms on two generations of Intel FPGAs (Intel Arria 10 and Intel Stratix 10) against the latest high-performance NVIDIA Titan X Pascal graphics processing unit (GPU).

Dr. Randy Huang, an FPGA architect at Intel Programmable Solutions Group and one of the paper’s co-authors, said: “Deep learning is the most exciting field in AI because we have seen tremendous progress and a plethora of applications brought by deep learning. While AI and DNN research tend to use GPUs, we find that the application domain perfectly aligns with Intel’s next-generation FPGA architecture. We examined the upcoming technological advances of FPGAs and the rapid growth of innovative DNN algorithms, pondering whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. Our research found that FPGAs perform remarkably well in DNN research and can be used in AI, big data, or machine learning research that requires analyzing vast amounts of data. Using pruned or compressed data (as opposed to full 32-bit floating-point data (FP32)), the tested Intel Stratix 10 FPGA outperformed the GPU. In addition to performance, FPGAs are powerful due to their adaptability, allowing teams to easily implement changes by reusing existing chips, enabling progress from concept to prototype in six months (compared to building an ASIC in 18 months).”

Neural Network Machine Learning Used in Testing

Neural networks can be represented as a graph of neurons interconnected by weighted edges. Each neuron is associated with an activation value and each edge with a weight. The graph is organized as layers of neurons, as shown in Figure 1.


Figure 1: Overview of deep neural networks

Neural network computation proceeds layer by layer through the network. For a given layer, the value of each neuron is calculated by multiplying and accumulating the values of the previous layer's neurons and the weights of the connecting edges; the computation therefore relies heavily on multiply-accumulate operations. DNN computation includes a forward pass and a backward pass. The forward pass takes a sample at the input layer, propagates it through all hidden layers, and produces a prediction at the output layer. For inference, only the forward pass is needed to obtain a prediction for a given sample. For training, the prediction error from the forward pass is fed back in the backward pass to update the network weights; this is known as the backpropagation algorithm. Training iteratively performs forward and backward passes to adjust the network weights until the desired accuracy is achieved.
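
As a minimal sketch of the multiply-accumulate pattern described above (assuming NumPy; the layer sizes and names are illustrative, not taken from the paper), a single fully connected layer's forward pass looks like this:

```python
# Minimal sketch of one fully connected layer's forward pass.
import numpy as np

def forward_layer(prev_activations, weights, biases):
    # Each output neuron is a multiply-accumulate over the previous layer's
    # activations and the weights of its incoming edges, followed by a
    # nonlinearity (ReLU here).
    z = weights @ prev_activations + biases   # multiply-accumulate
    return np.maximum(z, 0.0)                 # ReLU activation

# Example: a layer mapping 4 input neurons to 3 output neurons.
x = np.array([0.5, -1.0, 2.0, 0.1])
W = np.random.randn(3, 4)
b = np.zeros(3)
y = forward_layer(x, W, b)
```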

FPGA Becomes a Viable Alternative

Hardware: Compared with high-end GPUs, FPGAs have superior energy efficiency (performance/watt) but lower peak floating-point performance. FPGA technology is advancing rapidly: the upcoming Intel Stratix 10 FPGA offers more than 5,000 hardware floating-point units (DSPs), over 28 MB of on-chip RAM (M20Ks), integrated high-bandwidth memory (up to 4×250 GB/s per stack, or 1 TB/s), and higher frequencies from the new HyperFlex technology. Intel FPGAs provide a comprehensive software ecosystem, from low-level hardware description languages to higher-level software development environments with OpenCL, C, and C++. Intel will further use the MKL-DNN library to tune FPGAs for Intel's machine learning ecosystem and traditional frameworks (such as the currently supported Caffe), as well as other frameworks that will emerge soon. Built on a 14 nm process, the Intel Stratix 10 achieves a peak FP32 throughput of 9.2 TFLOP/s; in contrast, the latest Titan X Pascal GPU offers an FP32 throughput of 11 TFLOP/s.

Emerging DNN Algorithms: Deeper networks improve accuracy but greatly increase the number of parameters and the model size, which in turn increases the demands on computation, bandwidth, and storage. More efficient DNNs have therefore become a trend. One emerging trend is the adoption of compact data types well below 32 bits: 16-bit and 8-bit types are becoming the new standard and are supported by DNN software frameworks (such as TensorFlow). Researchers have also made steady accuracy improvements on extremely low-precision 2-bit ternary and 1-bit binary DNNs, whose values are constrained to (0, +1, -1) or (+1, -1), respectively. A recent paper co-authored by Dr. Nurvitadhi was the first to demonstrate that ternary DNNs can achieve state-of-the-art accuracy on the well-known ImageNet dataset. Another emerging trend is introducing sparsity (the presence of zeros) into DNN neurons and weights through techniques such as pruning, ReLU, and ternarization, which can yield DNNs in which roughly 50% to 90% of the values are zero. Since no computation needs to be performed on these zero values, performance gains are possible if the hardware running such sparse DNNs can efficiently skip the zero computations.
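
As an illustration of how ternarization introduces both a custom 2-bit data type and sparsity, here is a simple thresholding sketch; the threshold rule is an assumption for illustration, not the exact scheme from the cited paper:

```python
# Illustrative sketch: ternarize FP32 weights to {-1, 0, +1}.
import numpy as np

def ternarize(weights, threshold=0.05):
    t = np.zeros_like(weights, dtype=np.int8)
    t[weights > threshold] = 1
    t[weights < -threshold] = -1
    return t

W = np.random.randn(256, 256).astype(np.float32) * 0.1
W_t = ternarize(W)
sparsity = np.mean(W_t == 0)   # fraction of zero weights the hardware can skip
print(f"zero weights: {sparsity:.1%}")
```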

Emerging low-precision and sparse DNN algorithms offer large efficiency gains over traditional dense FP32 DNNs, but they introduce irregular parallelism and custom data types that are difficult for GPUs to handle. In contrast, FPGAs are designed for extreme customizability when running irregular parallelism and custom data types. This trend makes FPGAs a viable platform for running DNN, AI, and ML applications in the future. Huang said: “FPGA-dedicated machine learning algorithms have more headroom. Figure 2 illustrates the extreme customizability of FPGAs (2A), which makes it possible to implement emerging DNNs efficiently (2B).”


Figure 2: The extreme customizability of FPGAs (2A) enables efficient implementation of emerging DNNs (2B)

Hardware and Methods Used in Research


GPU: Using known libraries (cuBLAS) or frameworks (Torch with cuDNN)

FPGA: Using Quartus Early Beta version and PowerPlay


Figure 3: GEMM test results. GEMM is a key operation in DNNs.

On low-precision and sparse DNNs, the Stratix 10 FPGA outperforms the Titan X GPU in performance, and its performance/power ratio is better still. Such DNNs may well become the trend in the future.

Research 1: GEMM Testing

DNNs rely heavily on GEMM (general matrix multiplication). Conventional DNNs use FP32 dense GEMM, whereas emerging lower-precision and sparse DNNs rely on low-precision and/or sparse GEMM. The Intel team evaluated these GEMMs.
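
For reference, a minimal sketch of the dense FP32 GEMM underlying a fully connected layer (the shapes here are illustrative, not from the paper):

```python
# Dense FP32 GEMM: a batch of activations times a weight matrix.
import numpy as np

batch, in_dim, out_dim = 32, 1024, 4096
A = np.random.randn(batch, in_dim).astype(np.float32)    # activations
B = np.random.randn(in_dim, out_dim).astype(np.float32)  # weights
C = A @ B   # FP32 dense GEMM producing batch x out_dim output activations
```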

FP32 Dense GEMM: Since FP32 dense GEMM is well studied, the team compared peak performance from the FPGA and GPU data sheets. The theoretical peak is 11 TFLOP/s for the Titan X Pascal and 9.2 TFLOP/s for the Stratix 10. Figure 3A shows that the Intel Stratix 10, with many more DSPs, will deliver far stronger FP32 performance than the Intel Arria 10, approaching that of the Titan X.

Low-Precision INT6 GEMM: To demonstrate the customizability advantage of FPGAs, the team studied an FPGA INT6 GEMM that packs four INT6 operands into one DSP block. Since GPUs do not natively support INT6, the comparison used the GPU's peak INT8 performance. Figure 3B shows that the Intel Stratix 10 outperformed the GPU, and the FPGAs offered a more compelling performance/power ratio than the GPUs.
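
To illustrate the general idea of packing narrow operands into a wider word (the actual mapping of INT6 operands onto Stratix 10 DSP blocks is hardware-specific and not reproduced here), a simple bit-packing sketch:

```python
# Illustrative sketch: pack four signed 6-bit values into one 24-bit word.
def pack_int6x4(vals):
    assert len(vals) == 4 and all(-32 <= v <= 31 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0x3F) << (6 * i)   # 6-bit two's-complement fields
    return word

def unpack_int6x4(word):
    out = []
    for i in range(4):
        v = (word >> (6 * i)) & 0x3F
        out.append(v - 64 if v >= 32 else v)   # sign-extend each 6-bit field
    return out

packed = pack_int6x4([-7, 3, 31, -32])
assert unpack_int6x4(packed) == [-7, 3, 31, -32]
```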

Very Low-Precision 1-Bit Binary GEMM: Recently proposed binary DNNs use a very compact 1-bit data type that lets multiplication be replaced by xnor and bit-counting operations, which suits FPGAs very well. Figure 3C shows the team's binary GEMM results, where the FPGAs generally outperformed the GPU (roughly 2x to 10x, depending on the FPGA's target frequency).
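
A sketch of the underlying trick: with {+1, -1} values encoded as bits {1, 0}, a dot product becomes an xnor followed by a bit count. This shows the general idea behind binary GEMM, not the paper's FPGA implementation:

```python
# Illustrative sketch of a binary dot product via xnor + bit count.
def binary_dot(a_bits, b_bits, n):
    # XNOR restricted to n bits, then count matching positions.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    matches = bin(xnor).count("1")
    return 2 * matches - n   # matches contribute +1, mismatches -1

# Encode a = [+1, -1, +1, +1] and b = [+1, +1, -1, +1] as LSB-first bit masks
# (the encoding order is an assumption for this sketch).
a = 0b1101
b = 0b1011
print(binary_dot(a, b, 4))   # (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1) = 0
```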

Sparse GEMM: Emerging sparse DNNs contain many zero values. The team tested sparse GEMM on a matrix with 85% zeros (based on a pruned AlexNet), exploiting the flexibility of the FPGA to skip zero computations in a fine-grained manner. The team also tested sparse GEMM on the GPU but found it performed worse than dense GEMM on the GPU (for the same matrix size). The team's sparse GEMM tests (Figure 3D) showed that the FPGA can outperform the GPU, depending on the FPGA's target frequency.
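
The zero-skipping idea can be sketched as below: multiply-accumulates are issued only for nonzero weights. How the FPGA design schedules this in a fine-grained way is not reproduced here:

```python
# Illustrative sketch of zero skipping in a sparse matrix-vector product.
import numpy as np

def sparse_matvec(weights, x):
    rows, cols = weights.shape
    y = np.zeros(rows, dtype=weights.dtype)
    for r in range(rows):
        for c in range(cols):
            w = weights[r, c]
            if w != 0.0:          # skip zero computations entirely
                y[r] += w * x[c]
    return y

W = np.random.randn(64, 64).astype(np.float32)
W[np.random.rand(64, 64) < 0.85] = 0.0   # ~85% zeros, as in the pruned AlexNet test
x = np.random.randn(64).astype(np.float32)
assert np.allclose(sparse_matvec(W, x), W @ x, atol=1e-4)
```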


Figure 4: Trends in DNN accuracy, and test results of FPGA and GPU on Ternary ResNet DNN.

Research 2: Testing Ternary ResNet DNN

Recently proposed ternary DNNs constrain neural network weights to +1, 0, or -1. This permits sparse 2-bit weights and replaces multiplication with sign operations. In this test, the team used FPGA designs customized for zero skipping and 2-bit weights, eliminating multipliers to optimize execution of the ternary ResNet DNN.
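
A sketch of the multiplier-free datapath this enables: with weights in {+1, 0, -1}, each term of a dot product is an add, a subtract, or a skip. This illustrates the idea of a zero-skipping, 2-bit-weight design, not the team's actual FPGA implementation:

```python
# Illustrative sketch of a multiplier-free ternary dot product.
import numpy as np

def ternary_dot(ternary_weights, activations):
    acc = 0.0
    for w, a in zip(ternary_weights, activations):
        if w == 1:
            acc += a          # +1 weight: add the activation
        elif w == -1:
            acc -= a          # -1 weight: subtract the activation
        # w == 0: skip entirely, no multiply-accumulate issued
    return acc

w = np.array([1, 0, -1, 1, 0], dtype=np.int8)
a = np.array([0.2, 0.7, -0.4, 1.1, 0.3], dtype=np.float32)
print(ternary_dot(w, a))   # 0.2 + 0.4 + 1.1 = 1.7
```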

Unlike many other low-precision and sparse DNNs, ternary DNNs can provide accuracy comparable to the state-of-the-art DNN (i.e., ResNet), as shown in Figure 4A. “Many existing GPU and FPGA studies only target ‘good enough’ ImageNet accuracy based on AlexNet (proposed in 2012). The state-of-the-art ResNet (proposed in 2015) delivers more than 10% higher accuracy than AlexNet. At the end of 2016, in another paper, we were the first to show that a low-precision, sparse ternary DNN algorithm applied to ResNet can achieve accuracy within ±1% of full-precision ResNet. This ternary ResNet is the target of our FPGA research. We are therefore the first to argue that FPGAs can deliver top-tier (ResNet-level) ImageNet accuracy while also outperforming GPUs,” said Nurvitadhi.

Figure 4B shows the performance and performance/power ratio of the Intel Stratix 10 FPGA and the Titan X GPU on ResNet-50. Even under a conservative estimate, the Intel Stratix 10 FPGA achieves roughly 60% better performance than the Titan X GPU; the moderate and aggressive estimates are better still (2.1x and 3.5x speedups). Interestingly, under the aggressive 750 MHz estimate, the Intel Stratix 10 would exceed the theoretical peak performance of the Titan X by 35%. In performance/power ratio, the Intel Stratix 10 is 2.3x to 4.3x better than the Titan X, from the conservative to the aggressive estimate.

How FPGAs Stacked Up in the Research Tests

The results show that the Intel Stratix 10 FPGA's performance (TOP/s) is 10% better, 50% better, and 5.4x better than the Titan X Pascal GPU on GEMM for sparse, INT6, and binary DNNs, respectively. On ternary ResNet, the Stratix 10 FPGA delivers 60% better performance than the Titan X Pascal GPU and a 2.3x better performance/power ratio. The results suggest that FPGAs may become the platform of choice for accelerating next-generation DNNs.

Future of FPGAs in Deep Neural Networks

Can FPGAs beat GPUs on next-generation DNN performance? Intel's evaluation of two generations of FPGAs (Intel Arria 10 and Intel Stratix 10) and the latest Titan X GPU on various emerging DNNs shows that current trends in DNN algorithms may favor FPGAs, and that FPGAs can even deliver superior performance. Although these conclusions come from work completed in 2016, the Intel team continues to evaluate modern DNN algorithms and optimizations (e.g., FFT/Winograd transforms, quantization, compression) on Intel FPGAs.

The team also noted that beyond DNNs, FPGAs have opportunities in other irregular applications and latency-sensitive areas (such as ADAS and industrial uses).

“Currently, machine learning using 32-bit dense matrix multiplication is an area where GPUs excel,” Huang said. “We encourage other developers and researchers to work with us to reformulate machine learning problems to take full advantage of the FPGA's ability to process smaller bit widths, since FPGAs adapt well to the shift toward low precision.”

Original article link: https://www.nextplatform.com/2017/03/21/can-fpgas-beat-gpus-accelerating-next-generation-deep-learning/
