Technical Challenges and Opportunities in Embedded Machine Learning Processors

In December 2016, Vivienne Sze, Yu-Hsin Chen, and the other authors of Eyeriss published a good review article on arXiv titled “Hardware for Machine Learning: Challenges and Opportunities.” Here, I would like to discuss my understanding of the “challenges” and “opportunities” in conjunction with that article [8] and the ISSCC 2017 papers [1-7].

First, this discussion mainly focuses on embedded machine learning processors (primarily for inference). In fact, there are also dedicated processors for training/inference on the cloud, but that is another topic.

The key metrics for evaluating an embedded machine learning system (hardware and software together) are accuracy, energy consumption, throughput/latency, and cost. We mainly discuss the hardware part here; however, hardware design is closely tied to the algorithms and software, and a good hardware design is what achieves the most reasonable balance for the system as a whole.

In addition, programmability (flexibility) is very important for machine learning processors (or accelerators). Of course, flexibility is relative: the flexibility of embedded machine learning processors lies between that of GPUs and that of fixed-function hardware accelerators (which have essentially no programmability). The specific choice is a trade-off made at design time. For example, some machine learning processors can support both CNNs and RNNs, while others support only one of them, which leads to significant differences in implementation.

Among these metrics, accuracy, throughput/latency, and programmability can be considered performance indicators, while energy consumption and cost can be collectively referred to as cost indicators. The design challenge is the conflict between the two: generally speaking, there is no free lunch, and controlling cost inevitably sacrifices performance. Of course, technological innovation or progress can sometimes achieve better results without sacrificing either side, or even improve both at once.

In the cloud, performance optimization is the main theme, whereas in embedded applications the most important thing is to meet the system's cost requirements with the smallest possible sacrifice in performance. Below, I introduce the technical opportunities, or design methods, from several angles: algorithm and hardware co-optimization, architecture (coarse-grained), microarchitecture (fine-grained), circuits, and others. For the specific sources, see [8] and my previous analysis of the ISSCC papers (send "isscc17" to this public account to view it).

1. Joint Optimization of Algorithms and Hardware

Those of us who work on SoC-based hardware/software systems know well that, for a given optimization goal (such as energy consumption), optimizations made at the application/algorithm/software level are often far more effective than those made at the hardware architecture and circuit levels. For example, in communication baseband work, if an algorithm can be adjusted to relax some accuracy requirements, reducing the quantization width by even one bit can have a significant impact on the corresponding hardware datapath (multipliers, adders, storage, and so on). The same principle applies to optimizing machine learning processors: hardware optimization should first consider whether the computational load, data movement, and storage requirements can be reduced at the algorithm (application) level. The opportunities at the algorithm level for machine learning include:

a) Reducing Precision:

Using fixed-point numbers and reducing the bit width; 8-bit widths have already become quite common in hardware. As [8] notes, with more significant changes to the network the bitwidth can be reduced down to 1 bit, for either the weights alone or for both weights and activations, at the cost of reduced accuracy; the impact of 1-bit weights on hardware is also explored there (see refs. [56]-[59] in [8]).
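To make this concrete, below is a minimal sketch (Python/NumPy, purely illustrative) of symmetric per-tensor fixed-point quantization of a weight tensor to 8 bits. The scaling and rounding choices are generic assumptions, not the exact scheme used in any of the cited chips.

```python
import numpy as np

def quantize_symmetric(weights, n_bits=8):
    """Quantize a float weight tensor to signed fixed point with n_bits.

    Returns the integer codes and the scale needed to dequantize.
    Generic per-tensor scheme for illustration only.
    """
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8 bits
    scale = np.max(np.abs(weights)) / qmax    # map the largest magnitude to qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Example: moving from 32-bit floats to 8-bit integers shrinks the MAC
# datapath and the storage footprint, at the cost of a small quantization error.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_q, s = quantize_symmetric(w, n_bits=8)
print("max abs quantization error:", np.max(np.abs(w - w_q.astype(np.float32) * s)))
```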

b) Exploiting Sparsity:

At the algorithm level, pruning can reduce the number of MACs and weights (though it does not necessarily optimize energy consumption), while pruning based on an energy model can minimize energy consumption directly. Recently I also saw a “shrink” technology (from Pilot AI Lab), but I have not seen the details, so I cannot evaluate it.
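As a small illustration of pruning at the algorithm level, here is a sketch of the classic magnitude-based heuristic; the threshold rule is a generic assumption, and the energy-aware pruning mentioned above would rank weights by an energy model instead.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights until roughly `sparsity`
    of them are zero, returning the pruned weights and the keep-mask."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep only larger weights
    return weights * mask, mask

w = np.random.randn(256, 128).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.7)
print("fraction of zero weights:", 1.0 - mask.mean())
```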

In hardware design, the ISSCC papers this year basically all used masking techniques to skip processing when the datapath or memory input (the check is sometimes also placed at the output) is all zeros or consists of extremely small values.
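A minimal software model of such zero-skipping ("masking") in a MAC loop is sketched below; the threshold parameter and the skip counter are illustrative assumptions rather than the behavior of any specific chip.

```python
import numpy as np

def masked_mac(activations, weights, zero_threshold=0.0):
    """Software model of zero-skipping in a MAC datapath.

    When an activation is zero (or below a small threshold), the multiply,
    accumulate, and weight fetch for that term are skipped, which in hardware
    saves switching energy and memory accesses."""
    acc = 0.0
    skipped = 0
    for a, w in zip(activations, weights):
        if abs(a) <= zero_threshold:   # mask: no multiply, no weight read
            skipped += 1
            continue
        acc += a * w
    return acc, skipped

act = np.maximum(np.random.randn(1024), 0.0)   # ReLU output: roughly half zeros
wts = np.random.randn(1024)
result, skipped = masked_mac(act, wts)
print(f"skipped {skipped}/{act.size} MACs")
```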

c) Compression:

Pruning at the algorithm level is essentially a form of “compression” of the neural network. In addition, the hardware can use various forms of lightweight compression to reduce data movement. Lossless compression directly reduces the data transferred on and off the chip; simple run-length coding reduces bandwidth by up to 1.9 times. Vector quantization and other lossy compression schemes can also be applied to feature vectors and weights. Generally speaking, the hardware cost of compression/decompression is on the order of a few thousand gates, with minimal energy overhead.
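Below is a minimal sketch of zero run-length coding for a ReLU activation stream; the (run, value) pair format is an assumption chosen for clarity, not the exact encoding used by any cited design.

```python
def rle_encode_zeros(values):
    """Minimal run-length coding for a zero-dominated activation stream.

    Each output element is a (zero_run_length, nonzero_value) pair, the kind
    of lossless scheme typically applied to ReLU outputs before sending them
    off-chip. Illustrative format only."""
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            encoded.append((run, v))
            run = 0
    if run:
        encoded.append((run, 0))   # trailing zeros, flagged with value 0
    return encoded

def rle_decode_zeros(encoded):
    out = []
    for run, v in encoded:
        out.extend([0] * run)
        if v != 0:
            out.append(v)
    return out

stream = [0, 0, 0, 5, 0, 7, 0, 0, 0, 0, 2, 0, 0]
packed = rle_encode_zeros(stream)
assert rle_decode_zeros(packed) == stream
print(f"{len(stream)} values -> {len(packed)} (run, value) pairs")
```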

In addition, at this year's ISSCC, KAIST's face recognition processor [6] used “a separable filter approximation for convolutional layers (SF-CONV) and a transpose-read SRAM (T-SRAM) for low-power CNN processing,” which is a very good example of algorithm and hardware co-optimization: it approximates the 2D convolutions of the CNN with two stages of 1D convolutions (the algorithm level) and then designs a T-SRAM that supports vertical (column-wise) reads to serve this algorithm.
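To illustrate the idea, the sketch below builds a rank-1 separable approximation of a 2D kernel via the SVD and applies it as two 1D convolutions. This is a generic approximation of the concept; the actual SF-CONV scheme in [6] may differ in detail.

```python
import numpy as np
from scipy.signal import convolve2d

def separable_approximation(kernel_2d):
    """Best rank-1 separable approximation of a 2D kernel (via the SVD):
    a KxK convolution becomes a Kx1 column filter followed by a 1xK row filter."""
    u, s, vt = np.linalg.svd(kernel_2d)
    col = u[:, 0:1] * np.sqrt(s[0])    # vertical 1D filter (Kx1)
    row = vt[0:1, :] * np.sqrt(s[0])   # horizontal 1D filter (1xK)
    return col, row

# A nearly separable 3x3 kernel (Sobel-like) plus a small perturbation.
kernel = np.outer([1, 2, 1], [1, 0, -1]).astype(float) + 0.05 * np.random.randn(3, 3)
col, row = separable_approximation(kernel)
image = np.random.randn(32, 32)

full = convolve2d(image, kernel, mode="same")
approx = convolve2d(convolve2d(image, col, mode="same"), row, mode="same")
print("relative error of separable approximation:",
      np.linalg.norm(full - approx) / np.linalg.norm(full))
```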

2. Architecture (Coarse-grained)

Judging from current design trends for machine learning processors, CNN acceleration mainly adopts a 2D MAC array (for convolution) plus a 1D SIMD or scalar unit (for ReLU, pooling, and ALU operations). The focus of research and design is how to feed the MAC array with data efficiently (including the buffered data). This has been discussed extensively for the Eyeriss processor [9] and for the ENVISION processor [5] presented at this conference, mainly by exploiting the inherent parallelism and data reusability of DNNs. I will discuss this issue in more detail later.
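As a rough illustration of the data reuse that a 2D MAC array and its local buffers try to exploit, the sketch below counts how many MACs each weight and each input activation participates in for a single convolutional layer. The layer dimensions are arbitrary examples and boundary effects are ignored.

```python
def conv_reuse_statistics(H, W, C_in, C_out, K, stride=1):
    """Count the data-reuse opportunities in one convolutional layer.

    Every weight is reused once per output pixel, and every input activation
    is reused across (up to) K*K filter positions and all C_out filters; these
    reuse factors are what keep off-chip traffic far below one fetch per MAC."""
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    macs = H_out * W_out * C_out * C_in * K * K
    weights = C_out * C_in * K * K
    activations = H * W * C_in
    return {
        "total MACs": macs,
        "MACs per weight (weight reuse)": macs // weights,
        "MACs per activation (activation reuse, approx.)": macs // activations,
    }

# Example: a mid-size 3x3 layer similar to those in VGG-style networks.
for name, value in conv_reuse_statistics(56, 56, 128, 128, 3).items():
    print(f"{name}: {value}")
```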

3. Microarchitecture Level

This part mainly concerns optimizing the datapath, especially the MAC (the most common operation in DNNs). Some of the more insightful techniques used at this year's ISSCC include:

a) The Dynamic-Voltage-Accuracy-Frequency Scaling (DVAFS) MAC used in the ENVISION processor. [5]

b) KAIST's DNPU uses “layer-by-layer dynamic fixed-point with online adaptation and optimized LUT-based multiplier,” i.e., dynamic fixed-point numbers (a small sketch of this idea follows after this list). [2]

c) The dynamic datapath configuration implemented in ST's paper (through a flexible DMA design) is not particularly innovative, but it is a very valuable implementation approach. [1]

d) In addition to optimizing the MAC, this ISSCC paper [7] also discussed optimization of memory access.
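As promised in item b), here is a small sketch of layer-by-layer dynamic fixed-point quantization, where each layer's fractional bit count is chosen from its observed value range. The format-selection rule is a generic assumption; the DNPU's online adaptation and LUT-based multiplier in [2] are hardware details beyond this sketch.

```python
import numpy as np

def dynamic_fixed_point(layer_output, n_bits=8):
    """Per-layer dynamic fixed point: choose the fractional length for each
    layer from its observed dynamic range, then quantize to n_bits."""
    max_abs = np.max(np.abs(layer_output)) + 1e-12
    int_bits = max(0, int(np.ceil(np.log2(max_abs))) + 1)  # sign + integer part
    frac_bits = n_bits - int_bits                          # remaining fraction bits
    scale = 2.0 ** frac_bits
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(layer_output * scale), -qmax - 1, qmax)
    return q / scale, frac_bits

# Layers with very different dynamic ranges get different Q-formats.
for name, out in [("conv1", 0.05 * np.random.randn(1000)),
                  ("fc7", 6.0 * np.random.randn(1000))]:
    deq, frac = dynamic_fixed_point(out)
    print(f"{name}: fractional bits = {frac}, "
          f"max error = {np.max(np.abs(out - deq)):.4f}")
```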

4. Circuit Level

Several papers at ISSCC mentioned circuit-level optimizations. The most interesting one is probably from Harvard University [3], which uses Razor timing-violation detection together with the inherent fault tolerance of DNNs to relax critical-path timing requirements and thereby lower the operating voltage. In the ENVISION design, the body bias is adjusted to balance dynamic and static power consumption while taking the required computational accuracy into account [5]; in addition, paper [7] discusses optimizations of the memory circuits.

5. Others

a) Mixed Signal: In [6], an “analog-digital hybrid Haar-like face detector (HHFD)” is integrated on the CIS for low-power face detection (FD). This is not far-fetched at all, and there should be plenty of room for imagination in this area: if MAC operations and memory access could be optimized through mixed-signal processing, overall performance might improve greatly. Some specific examples can be found in [8].

b) Computation Embedded in Sensors: This is relatively easy to understand: bringing computation closer to the sensor can eliminate a lot of unnecessary data transfer. This should also be a trend in the development of edge computing.

c) Computation Embedded in Memory: I personally view this direction favorably. First, it likewise brings data and computation as close together as possible; second, the development of neural networks places ever higher demands on memory, so effort invested in storage fits this trend.

d) New Storage Technologies: Given the large storage requirements of machine learning, any new storage technology, such as embedded DRAM (eDRAM) or the Hybrid Memory Cube (HMC), will be very helpful.

In summary, with the rapid development and practical application of artificial intelligence and deep learning, dedicated chips related to these fields have become a focal point of attention in both academia and industry. Greater challenges will inevitably bring about more exciting innovations.

T.S.

P.S. Send "isscc17" to this public account to view the ISSCC paper guide.

References:

[1] G. Desoli, et al. (STMicroelectronics), “A 2.9TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems”

[2] D. Shin, et al., “DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks”

[3] P. N. Whatmough, et al., “A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications”

[4] M. Price, et al., “A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating”

[5] B. Moons, et al., “ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI”

[6] K. Bong, et al., “A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector”

[7] S. Bang, et al., “A 288μW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence”

[8] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, “Hardware for Machine Learning: Challenges and Opportunities,” arXiv:1612.07625v1 [cs.CV], 22 Dec 2016

[9] Y. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks” ISSCC, pp. 262-264, 2016.

