Efficient Deep Learning Infrastructure for Embedded Computing Systems

Deep neural networks (DNNs) have recently achieved remarkable success in a wide range of real-world visual and language processing tasks, spanning image classification and downstream visual tasks such as object detection, tracking, and segmentation. However, although mature DNNs can deliver superior accuracy, their increasingly deep and wide network structures inevitably demand substantial computational resources for both training and inference. This trend further widens the computational gap between compute-intensive DNNs and resource-constrained embedded computing systems, making it ever more challenging to deploy powerful DNNs on real-world embedded computing systems towards ubiquitous embedded intelligence.
To alleviate the above computational gap and promote ubiquitous embedded intelligence, this review focuses on recent efficient deep learning infrastructures for embedded computing systems, covering aspects from training to inference, from manual to automated design, from convolutional neural networks to transformers and vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss efficient deep learning infrastructures for embedded computing systems from the following perspectives:
  1. Efficient manual network design for embedded computing systems
  2. Efficient automated network design for embedded computing systems
  3. Efficient network compression for embedded computing systems
  4. Efficient edge learning for embedded computing systems
  5. Efficient large language models for embedded computing systems
  6. Efficient deep learning software and hardware for embedded computing systems
  7. Efficient intelligent applications for embedded computing systems
Moreover, we look ahead to future directions and trends that have the potential to enable broader embedded intelligence. We believe this review offers unique value in providing insights for future research and in helping researchers enter this emerging field quickly and smoothly.

1 Introduction

With the increasing availability of large-scale datasets and advanced computing paradigms, deep neural networks (DNNs) have achieved significant success in a wide range of intelligent applications. These applications cover everything from image classification and downstream visual tasks, such as object detection, tracking, and segmentation, to natural language processing (NLP) tasks, such as automatic speech recognition, machine translation, and question answering. In subsequent years, DNNs have continuously evolved, with network structures becoming ever deeper, in order to maintain state-of-the-art accuracy on target tasks. Meanwhile, new network structures and advanced training techniques have emerged, further pushing the achievable accuracy. These powerful deep learning (DL) networks and advanced training techniques, from VGGNet to ResNet, mark the advent of the deep learning era.
The tremendous breakthroughs in DNNs have subsequently attracted widespread attention from academia and industry, driving the deployment of powerful DNNs on real-world embedded computing systems, such as mobile phones, autonomous vehicles, and healthcare devices, to foster embedded intelligent applications. In practice, this brings significant benefits. For instance, embedded computing systems can process data in real time at the edge, greatly enhancing processing efficiency and thereby the user experience. Moreover, data security and privacy are better preserved, since all data can be processed locally without being uploaded to remote servers. Despite these potential benefits, deploying powerful DNNs on real embedded computing systems still faces critical limitations. On one hand, to maintain competitive accuracy, representative networks have grown continuously deeper in recent years, with layer counts reaching into the hundreds, leading to enormous computational complexity. For example, ResNet50, one of the most representative deep networks, involves over 4 billion floating-point operations (FLOPs) and 25 million parameters, and requires over 87 MB of on-device storage to process a single input image. On the other hand, real-world embedded computing systems, such as mobile phones and autonomous vehicles, typically feature limited computational resources in order to optimize on-device power and energy consumption. In short, the continuous growth of network complexity keeps widening the computational gap between compute-intensive DNNs and resource-constrained embedded computing systems, making ubiquitous embedded intelligence increasingly challenging to realize.
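As a rough illustration of how such figures are obtained, the following is a minimal sketch, assuming PyTorch and torchvision are available, that counts ResNet50's parameters and estimates its full-precision storage footprint; the 4-bytes-per-weight calculation is our assumption for 32-bit floating-point storage.

```python
# A minimal sketch (assuming PyTorch and torchvision are installed) of how
# parameter counts and storage footprints like those quoted above are estimated.
from torchvision.models import resnet50

model = resnet50()  # random weights suffice for counting parameters

num_params = sum(p.numel() for p in model.parameters())
fp32_bytes = num_params * 4  # 4 bytes per 32-bit floating-point weight

print(f"Parameters: {num_params / 1e6:.1f}M")        # ~25.6M
print(f"FP32 storage: {fp32_bytes / 2**20:.1f} MiB") # ~97.5 MiB, i.e. over 87 MB
```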
To bridge this computational gap and promote ubiquitous embedded intelligence, a large number of model compression techniques have recently been proposed, including network pruning, network quantization, and network distillation, all of which strive for a better accuracy-efficiency trade-off under the limited computational resources of real embedded scenarios. For example, network pruning reduces network redundancy by removing redundant units, such as weights, channels, or layers, to enhance efficiency on the target hardware while minimizing the accuracy loss on the target task. Besides network compression, a parallel option is to manually design resource-efficient networks, such as SqueezeNet, MobileNets, ShuffleNets, and GhostNets, which dominated early advances in efficient network design. Although these efficient networks exhibit outstanding efficiency, they rely heavily on manual expertise, exploring new network structures through iterative experimentation, which requires substantial engineering effort and non-negligible computational resources. To address these limitations, recent network design practice has shifted from manual to automated design, also known as neural architecture search (NAS) or automated machine learning (AutoML), which automatically explores new network structures. The tremendous success of NAS has subsequently spurred many hardware-aware NAS works, such as MnasNet, ProxylessNAS, FBNet, and Once-for-All, which automatically design accurate and hardware-efficient networks with a strong accuracy-efficiency balance and are widely applied in real embedded computing systems to provide intelligent services.
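To make the pruning idea concrete, below is a minimal sketch using PyTorch's built-in torch.nn.utils.prune utilities, covering both weight-level (unstructured) and channel-level (structured) pruning; real pruning pipelines additionally interleave pruning with fine-tuning to recover accuracy, which is omitted here.

```python
# A minimal magnitude-based pruning sketch with PyTorch's pruning utilities.
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3)

# Unstructured pruning: zero the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(conv, name="weight", amount=0.3)

# Structured pruning: additionally remove 25% of output channels (dim=0) by L2 norm.
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# Fold the binary masks into the weight tensor to make the pruning permanent.
prune.remove(conv, "weight")

sparsity = (conv.weight == 0).float().mean().item()
print(f"Overall weight sparsity: {sparsity:.1%}")
```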
In addition to the above focus on improving edge inference efficiency, recent research has also turned to improving edge training efficiency. The underlying rationale is that, although previous representative networks exhibit outstanding accuracy, they often require training for hundreds of epochs, which can take days even on powerful GPUs. Worse yet, the expensive training process on remote GPUs does not allow customization for local hardware, especially in resource-constrained embedded scenarios. Notably, local customization at the edge has the potential to further enhance accuracy, particularly as local sensors continuously collect new data from users. To overcome these limitations, several efficient edge learning techniques have recently been established, such as edge continual learning, edge transfer learning, and edge federated learning, enabling powerful deep networks to be trained or fine-tuned on local hardware to further enhance performance.
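As a concrete flavor of edge transfer learning, the sketch below freezes a pretrained MobileNetV2 backbone and fine-tunes only a small classifier head on locally collected data; the 10-class task and the dummy batch are hypothetical placeholders, not from the survey.

```python
# A minimal on-device transfer-learning sketch: freeze the backbone and
# fine-tune only the classifier head on local data (dummy batch shown).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

model = mobilenet_v2()  # in practice, load weights pretrained in the cloud

for param in model.features.parameters():
    param.requires_grad = False  # freeze the feature extractor

# Replace the head for a hypothetical 10-class local task.
model.classifier[1] = nn.Linear(model.last_channel, 10)

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)
criterion = nn.CrossEntropyLoss()

# One fine-tuning step on a dummy locally collected batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```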
Recently, large language models (LLMs), such as GPT-3 and GPT-4, have achieved impressive success in various real-world language processing tasks. However, the strong learning capability of these powerful LLMs also comes with enormous computational complexity. For instance, OpenAI's GPT-3, one of the most representative LLMs, contains 175 billion parameters. Moreover, to achieve state-of-the-art performance, recent LLMs continue to evolve towards larger and more complex models, with model sizes growing steadily. These factors make it increasingly challenging to deploy recent powerful LLMs on modern embedded computing systems for intelligent language processing services. To overcome these limitations, a series of effective techniques have recently been proposed to alleviate the computational complexity of LLMs and explore computationally efficient LLMs, including efficient LLM architecture design, efficient LLM compression techniques (e.g., pruning, quantization, and knowledge distillation), and efficient LLM system design.
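To illustrate the flavor of LLM weight quantization, here is a minimal sketch of symmetric per-tensor INT8 post-training quantization; practical LLM quantizers such as GPTQ and AWQ are far more sophisticated (per-group scales, activation-aware calibration, etc.), and the 4096x4096 weight matrix is a hypothetical example.

```python
# A minimal symmetric per-tensor INT8 weight-quantization sketch.
import torch

def quantize_int8(w: torch.Tensor):
    """Map an FP32 tensor to INT8 values plus a per-tensor scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)  # a hypothetical transformer weight matrix
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().mean()

print(f"Storage: {w.numel() * 4 / 2**20:.0f} MiB -> {q.numel() / 2**20:.0f} MiB")
print(f"Mean absolute error: {err:.4f}")
```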
In parallel with the rapid rise of powerful deep networks and advanced training techniques, many representative deep learning software frameworks and hardware accelerators have emerged to support efficient deep learning solutions for embedded computing systems, such as TensorFlow, PyTorch, Google Edge TPU, Nvidia Edge GPU, and Intel Neural Compute Stick. These DL software frameworks and hardware accelerators have been widely adopted and bring two main benefits to the deep learning era. On one hand, they remove the barriers faced by software and hardware engineers, allowing them to rapidly develop intelligent embedded applications, such as edge object detection, tracking, and segmentation, without requiring excessive domain expertise. On the other hand, they often come with domain-specific optimizations that achieve an outstanding accuracy-efficiency balance with minimal engineering effort. For example, Nvidia Jetson AGX Xavier, a representative Nvidia Jetson edge GPU, supports developing intelligent embedded applications in INT8 (i.e., 8-bit integer) precision, significantly improving efficiency over full precision (i.e., 32-bit floating point) without compromising accuracy on the target task.
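As a software-side analogue of such framework-level INT8 support, the sketch below uses PyTorch's dynamic quantization, which converts a model's linear layers to INT8 weights in a single call; this is a CPU-side illustration only, since deployment on Jetson GPUs typically goes through TensorRT's own INT8 calibration flow, which is not shown here.

```python
# A minimal sketch of framework-level INT8 support via PyTorch dynamic
# quantization (a CPU-side analogue; Jetson deployments use TensorRT instead).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only the Linear layers
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the FP32 model, INT8 weights inside
```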

1.1 Organization of the Paper

This paper summarizes recent efficient deep learning infrastructures that can support current and future embedded computing systems, promoting the development of ubiquitous embedded intelligence. Existing reviews typically focus on efficient deep learning algorithms, but as deep learning infrastructures, especially large language models, evolve rapidly, these reviews may already be outdated. Unlike previous reviews, we aim to provide a more comprehensive and holistic perspective, outlining recent efficient deep learning infrastructures for embedded computing systems and covering aspects from training to inference, from manual to automated design, from convolutional neural networks to transformers and vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss efficient deep learning infrastructures for embedded computing systems from the following perspectives:
  1. Efficient manual network design for embedded computing systems
  2. Efficient automated network design for embedded computing systems
  3. Efficient network compression for embedded computing systems
  4. Efficient edge learning for embedded computing systems
  5. Efficient large language models for embedded computing systems
  6. Efficient deep learning software and hardware for embedded computing systems
  7. Efficient intelligent applications for embedded computing systems
We believe this review offers unique value in providing insights for future research and in helping researchers enter this emerging field quickly and smoothly. Finally, the organization of this paper, illustrated in Figure 1, is summarized as follows:
  • Section 2 discusses recent representative efficient manual networks.
  • Section 3 discusses recent representative efficient automated networks.
  • Section 4 discusses recent representative network compression techniques.
  • Section 5 discusses recent representative edge learning techniques.
  • Section 6 discusses recent representative efficient large language models.
  • Section 7 discusses recent representative deep learning software and hardware.
  • Section 8 discusses recent representative intelligent embedded applications.
Additionally, at the end of each section, we highlight possible future directions in the respective field, which have the potential to pave the way for future ubiquitous embedded intelligence.
