Focus on Model Compression, Low-Bit Quantization, Mobile Inference Acceleration Optimization, and Deployment
Introduction: This issue covers 18 items. 【News】 Graphcore releases the second-generation 7nm IPU M2000; Imagination launches the safety-focused XS GPU series for automotive ADAS; an analysis of the Snapdragon 4100 series designed for smartwatches; and Arm China releases "Star," a lightweight processor for IoT devices. 【Papers】 Yitu & the National University of Singapore propose MobileNeXt, a next-generation mobile model that beats MobileNetV2 on both accuracy and speed; the lightweight super-resolution network LESRCNN; and an interpretation of the EasyQuant post-training quantization algorithm. 【Open Source】 OPEN AI LAB releases the embedded all-scenario inference framework Tengine-Lite, accompanied by a blog post featuring a 212-point face keypoint Tengine-Lite demo; Baidu releases PaddleOCR, a complete solution suitable for edge deployment and inference; SenseTime & CUHK release the OpenMMLab algorithm platform; and Xiaovision Technology open-sources an RGB-image-based silent liveness detection algorithm. 【Blog】 A learning note from the ncnn author on making an inference framework support RISC-V, step by step; Tengine-Lite's 212-point face keypoint demo; two articles from Huawei, one on 1-bit inference optimization on ARM CPUs and one on TinyBERT. A guide to training BNN models with PyTorch and an article analyzing the computational load of convolutions in the spatial and channel domains are also worth a look.
First, some other warm-up news:
- Cambricon to be listed on the Science and Technology Innovation Board (STAR Market) on July 20, raising nearly 2.6 billion yuan, with Lenovo and Midea participating;
- Qualcomm may launch the Snapdragon 875G, based on a Cortex-X1 + A78 core combination, with the next flagship integrating the 5G modem rather than using an external one, followed by the mid-range 735G;
- TSMC and MediaTek have released their June revenue reports, both showing significant growth, with MediaTek up 21%, its highest monthly revenue since October 2016;
- Samsung Electronics modifies its chip process roadmap, skipping 4nm and going directly from 5nm to 3nm;
- Alibaba Cloud IoT and Tmall Genie jointly establish an AIoT innovation center, integrating Alibaba technology capabilities including those of Damo Academy;
- Android 11 adds a permanent, dedicated smart-home control widget integrated with Google Assistant;
- Sony launches Spresense, a low-power edge microcontroller board powered by Sony's CXD5602 microcontroller (Arm® Cortex®-M4F × 6 cores) clocked at 156 MHz;
- NVIDIA surpasses Intel to become the highest-valued chip company in the U.S.
Note: Some links may not open, please click the end of the article [Read the original] to jump
Industry News
- Graphcore releases second-generation IPU and IPU-M2000: three disruptive technologies defining the future of AI computing | Graphcore Summary: Graphcore releases the second-generation IPU and the large-scale system-level product IPU-Machine: M2000 (IPU-M2000). The new generation offers stronger processing power, more memory, and built-in scalability for extremely large machine intelligence workloads. The IPU-M2000 is a plug-and-play machine intelligence computing blade designed for easy deployment in scalable systems: the compact 1U blade provides 1 PetaFlop of machine intelligence compute and includes integrated networking technology optimized for AI scale-out. At the core of each IPU-M2000 is the new Graphcore Colossus™ Mk2 GC200 IPU. The chip is built on TSMC's latest 7nm process, with 59.4 billion transistors on an 823 mm² die, making it, per Graphcore, the most complex processor ever built. In addition to FP32 IEEE floating-point arithmetic, it supports FP16.32 (16-bit multiply with 32-bit accumulate) and FP16.16 (16-bit multiply-accumulate), and, uniquely, it has hardware support for stochastic rounding. The AI-Float arithmetic block also provides native support for sparse floating-point operations, with library support for sparse operations including block sparsity and dynamic sparsity. On the communication side, the dedicated AI networking fabric IPU-Fabric™ provides low latency and high bandwidth: each IPU-M2000 delivers 2.8 Tbps, and as more IPU-M2000 chassis are connected, total bandwidth grows to many petabits per second. IPU-Fabric uses a 3D ring topology for maximum efficiency, as it maps well to the three dimensions of parallelism in machine intelligence computing. Download the Chinese version of the Moor research report: https://zhuanlan.zhihu.com/p/159707631
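The hardware stochastic rounding mentioned above is worth a note: rounding up or down with probability equal to the fractional part keeps low-precision arithmetic unbiased in expectation, which matters when accumulating many FP16 values. A minimal software sketch of the idea in numpy (illustrative only, not Graphcore's implementation):

```python
import numpy as np

def stochastic_round(x, step=1.0, rng=None):
    """Round to multiples of `step`, going up with probability equal to the fractional part."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x, dtype=float) / step
    floor = np.floor(scaled)
    frac = scaled - floor
    go_up = rng.random(scaled.shape) < frac   # Bernoulli(frac) per element
    return (floor + go_up) * step
```

Averaging many stochastic roundings of 0.3 with step 1.0 recovers roughly 0.3, whereas round-to-nearest would always return 0 and the bias would accumulate.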
- Imagination releases XS GPU: a GPU series built for the automotive industry | Imagination Tech Summary: Imagination Technologies has long been known as a leading GPU supplier in the mobile sector, but it also provides GPU IP to two of the three major players in the automotive application processor market, giving it the largest market share there. This is also the industry's first functional-safety GPU product series: the XS GPUs have been independently assessed and comply with ISO 26262, the automotive functional safety standard. Designed for functional safety, the XS series can take on in-car tasks that previously did not involve GPUs. These fall into two categories: graphics tasks, such as displaying safety-related information on dashboards or panoramic surround-view systems, and computational tasks for ADAS, commonly referred to simply as "compute." The GPUs suit both kinds of work (for the latter, Imagination also offers dedicated neural network accelerators). Whatever the task, the entire processing pipeline must be functionally safe and comply with ISO 26262, which is designed to ensure the safety of all automotive electronics by limiting the impact of failures, even when failures cannot be avoided. In practical driving terms, this means that if the system experiences a non-catastrophic failure, the vehicle can still pull over safely or drive directly to a repair center. This is known as functional safety (FuSa).
- Qualcomm Snapdragon 4100 series analysis: the hope for an Android smartwatch revival | San Yi Life Summary: The last briefing carried the Snapdragon 4100 announcement; this analysis goes deeper. The Snapdragon 4100 wearable platform uses a 12nm process, a new 1.7GHz quad-core Cortex-A53 CPU, and the new-architecture Adreno 504 GPU. Qualcomm included two Hexagon QDSP6 V56 digital signal processors (DSPs), not to double DSP throughput, but to dedicate one DSP to baseband communication and GPS positioning and the other to audio processing and sensor fusion (Sensor Hub). This lowers the load on each DSP, significantly reduces power consumption, and improves the stability of continuous positioning and continuous health monitoring. The platform comes in two hardware variants, Snapdragon 4100 and Snapdragon 4100+. The plus version adds a co-processor based on the Cortex-M architecture, the QCC1110 Enhanced. With its own processor, memory, and DSP, the QCC1110 can keep driving a 65,536-color watch display and sustain continuous heart rate monitoring, sleep tracking, and exercise statistics even while the main SoC is in low-power mode. In other words, with the Snapdragon 4100+, an Android smartwatch can keep nearly all health and exercise functions running for more than a day even with only 15% battery remaining, going a long way toward solving "battery anxiety."
- Arm China "Star" processor STAR-MC1: a lightweight real-time processor for IoT devices | Electronic Products World Summary: On July 8, Arm China held a media session on the "Star" processor, where Liu Shu, Vice President of Product Development at Arm China, shared the latest progress on its first fully self-developed 32-bit embedded processor IP, "Star" (STAR-MC1). STAR-MC1 is the first CPU in Arm China's microcontroller line. From the founding of Arm China in April 2018 to the release of the first EAC version in September 2019, it took only 17 months to deliver STAR-MC1. Its technical highlights include, but are not limited to: the Armv8-M architecture, with energy efficiency of up to 4.02 CoreMark/MHz; a memory subsystem with tightly coupled memory and caches to ensure real-time performance and execution efficiency; DSP instructions and a co-processor that bring significant algorithm performance gains; TrustZone technology for IoT device security, providing system-level protection that will later link with the "Shanhai" security solution and PSA certification; and custom instruction support, which reserves encoding space following the Arm instruction format so customers can add their own instructions, with templates and technical support provided, meeting customer-specific needs while extending the limited instruction set.
Papers
- [2007.02269v2] MobileNeXt: breaking with convention, a major improvement on the inverted residual module, and a next-generation mobile model surpassing MobileNetV2 | Extreme City Platform Title: Rethinking Bottleneck Structure for Efficient Mobile Network Design Link: https://arxiv.org/abs/2007.02269v2 Code: https://github.com/zhoudaquan/rethinkingbottleneckdesign Summary: MobileNeXt is a network architecture proposed by Yitu Technology and Yan Shuicheng's team at the National University of Singapore. It deeply analyzes the problems of the inverted residual module at the core of MobileNetV2 and proposes a novel SandGlass block, which is used to build the MobileNeXt architecture. SandGlass is a generic block that can easily be embedded into existing network architectures to improve model performance; the paper is one of the few excellent mobile-side models of recent years. To better fit mobile devices, the authors also introduce a new hyperparameter, the identity tensor multiplier, to cut addition operations that contribute almost nothing to accuracy and to reduce the number of memory accesses.
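As a rough illustration of the two block shapes, here is a simplified weight count per block at equal input width (my own sketch from the block descriptions: inverted residual = 1x1 expand, depthwise, 1x1 project; SandGlass = depthwise, 1x1 reduce, 1x1 expand, depthwise; batch norm, biases, and the networks' actual per-stage widths are ignored, and note the two designs place their residuals at different tensor widths, so this is not an end-to-end comparison):

```python
def inverted_residual_weights(c, t=6, k=3):
    """MobileNetV2-style block on c input channels with expansion ratio t."""
    return c * (c * t) + k * k * (c * t) + (c * t) * c

def sandglass_weights(c, t=6, k=3):
    """SandGlass-style block on c input channels with reduction ratio t."""
    return k * k * c + c * (c // t) + (c // t) * c + k * k * c

# Example at c = 96: the expand-then-project block carries far more weights
# than the reduce-then-expand block at the same input width.
print(inverted_residual_weights(96), sandglass_weights(96))
```

The asymmetry comes from where the expensive 1x1 convolutions sit: expanding to 6c squares the large width, while reducing to c/6 keeps both pointwise layers thin.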
- [2007.04344v2] Lightweight Image Super-Resolution with Enhanced CNN Link: https://arxiv.org/abs/2007.04344v2 Summary: Super-resolution (SR) models produce impressive results, but their computational cost and runtime memory usage are high. This paper proposes a lightweight super-resolution convolutional network whose overall architecture consists of three sub-modules: an information extraction and enhancement block (IEEB), a reconstruction block (RB), and an information refinement block (IRB). The IEEB extracts low-resolution features and aggregates them to improve shallow-layer memory capacity, while also removing redundant information. The RB that follows converts low-frequency features into high-frequency features by fusing global and local features, complementing the IEEB. The IRB then uses the coarse high-frequency features from the RB to learn more precise super-resolution features and reconstruct the super-resolution image. The proposed LESRCNN achieves high-quality images and outperforms existing SOTA SISR methods.
- EasyQuant: Post-training Quantization via Scale Optimization | GiantPandaCV Link: https://arxiv.org/pdf/2006.16669.pdf Code: https://github.com/deepglint/EasyQuant Summary: The post-training quantization algorithm proposed in this paper uses similarity as the objective function: it alternately searches the quantization factors (scales) of weights and activations so as to maximize the cosine similarity between activations before and after quantization, thereby finding the optimal scales for both. In the actual inference phase on the edge, weights and activations are quantized to int7 rather than int8: because the products are smaller, the intermediate int16 accumulator can absorb more multiply-accumulates before spilling, whereas with int8 it can only accumulate twice. This makes int7 inference faster than int8 while preserving the quantized model's accuracy.
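The alternating scale search can be sketched for a single linear layer in numpy (my own toy simplification, not the official EasyQuant code: the real algorithm works layer by layer on convolution outputs and typically uses per-channel weight scales):

```python
import numpy as np

def quantize(x, scale, bits=7):
    """Symmetric linear quantize-dequantize with the given scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def search_scales(x, w, s_x, s_w, bits=7, rounds=3, n_grid=100):
    """Alternately refine weight and activation scales to maximize the cosine
    similarity between the float output and the quantized output."""
    ref = x @ w                                  # float reference output
    for _ in range(rounds):
        for which in ("w", "x"):
            base = s_w if which == "w" else s_x
            # candidate grid around the current scale, current value included
            cands = np.concatenate(([base], np.linspace(0.5 * base, 1.5 * base, n_grid)))
            best_s, best_sim = base, -2.0
            for s in cands:
                xq = quantize(x, s_x if which == "w" else s, bits)
                wq = quantize(w, s if which == "w" else s_w, bits)
                sim = cos_sim(ref, xq @ wq)
                if sim > best_sim:
                    best_s, best_sim = s, sim
            if which == "w":
                s_w = best_s
            else:
                s_x = best_s
    return s_x, s_w
```

Because the current scale is always among the candidates, each alternating step can only keep or improve the objective.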
Open Source Projects
Note: Each item is prefixed with the repository owner and repository name on GitHub; the full address is github.com/<repo_owner>/<repo_name>.
- OAID/Tengine-lite: OPEN AI LAB open-sources the embedded all-scenario inference framework Tengine-Lite | OPEN AI LAB Open Intelligence Summary: Tengine-Lite was officially open-sourced by OPEN AI LAB on July 6, with a complete set of convenient MCU AI development tools, establishing an open development ecosystem for the embedded AI software industry. Rebuilt in pure C, Tengine-Lite is simpler, more efficient, and more readable, achieving extreme lightness with no external dependencies: the Linux library is under 500KB, the Android library under 1MB, and compilation is fast. The framework and compute library are separated in a plug-in design, so other heterogeneous compute libraries can also be mounted. Compared with Tengine, the Lite version shows significantly improved stability in single-core and multi-core performance (performance scatter plot omitted here). It runs on embedded all-scenario, minimal real-time operating systems such as FreeRTOS, RT-Thread, and LiteOS, or on bare metal, and also on low-power, resource-limited IoT controllers such as MCUs and RISC-V chips, serving scenarios such as voice and vision.
- PaddlePaddle/PaddleOCR: an 8.6M ultra-lightweight Chinese and English OCR model, open-sourced with training and deployment end to end, online demo available | Quantum Bit Project Address: https://github.com/PaddlePaddle/PaddleOCR Online Demo: https://www.paddlepaddle.org.cn/hub/scene/ocr Summary: PaddleOCR releases an ultra-lightweight model consisting mainly of a 4.1M detection model and a 4.5M recognition model. The detection model is based on the DB algorithm and the recognition model on the classic CRNN algorithm. Given MobileNetV3's strong showing among edge-oriented models, both use MobileNetV3 as the backbone, cutting model size by over 90% from the start; strategies such as reducing the number of feature channels compress the models further. Where the general OCR model's accuracy does not meet a use case's needs, it can be quickly customized through training. On the deployment side, it supports not only native inference libraries for x86 CPUs, NVIDIA GPUs, and Kunlun chips, but also mobile and embedded targets via Paddle-Lite, including ARM CPU, GPU, Huawei NPU, Bitmain, RK NPU, MTK APU, Baidu Kunlun chips, and more.
- open-mmlab: the open-source algorithm platform of MMLab, the joint laboratory of The Chinese University of Hong Kong and SenseTime Project homepage: http://openmmlab.org/ Summary: In less than two years, MMLab has collected numerous SOTA computer vision algorithms. OpenMMLab is not a single GitHub project: besides the well-known MMDetection object detection library with over ten thousand stars, there are codebases and datasets in many other directions, well worth the attention of anyone doing computer vision research or development. Recently, OpenMMLab has been updated intensively, adding multiple libraries covering more than 10 research directions, opening over 100 algorithms and 600 pre-trained models, with over 17,000 stars on GitHub in total; it is a systematic and active open-source platform for CV. Most of these libraries are built on the PyTorch deep learning framework, keep up with the state of the art, are easy to use, and have relatively rich documentation, making them worth knowing for both researchers and engineers.
- minivision-ai/Silent-Face-Anti-Spoofing: 9ms silent liveness detection | I Love Computer Vision Summary: The Xiaovision Technology team has open-sourced a liveness detection model based on RGB images, aimed squarely at industrial deployment and robust across a variety of complex environments. Combining Fourier-spectrum auxiliary supervision with an attention mechanism, it achieves strong results. For the lightweight models, a self-developed pruning method reduces the FLOPs of MobileFaceNet from 0.224G to 0.081G; in addition, the pruned networks MiniFASNetV1 and MiniFASNetV2 are distilled from the larger ResNet34 to improve accuracy. On the Kirin 990 5G chip, inference takes only 9ms. The PyTorch-trained models can be flexibly converted to ONNX for deployment on any platform.
- NVIDIA Jetson power estimation tool: PowerEstimator | Jipu Xun Technology Summary: A tool. PowerEstimator is a power estimation tool for Jetson system-on-module (SOM) systems. It estimates average SOM power consumption and generates nvpmodel configuration files for the system configuration and target workload being modeled. The tool exposes a set of input knobs, such as clock frequencies, number of active cores, load levels, and device operating states, used to define the target workload and obtain a power estimate before applying the configuration to a target device.
Blog Posts
- I ported ncnn to RISC-V! | NeuralTalk Summary: RISC-V (pronounced "risk-five") is an open-source instruction set architecture (ISA) based on reduced instruction set computing (RISC) principles. Its significance lies in being designed for modern computing devices, from warehouse-scale cloud computers to high-end mobile phones and tiny embedded systems, with performance and power efficiency in mind across all of them. The ISA also has a wealth of supporting software, addressing the usual weakness of new instruction sets. This article is the ncnn author's learning note on making ncnn support RISC-V step by step, which should be helpful to many framework developers, especially those who want to add RISC-V support.
- Tengine-Lite: real-time face anonymization on Android using an open-source 212-point face keypoint implementation, GitHub address included | Zhihu Summary: With the spread of face recognition, privacy concerns around face data have drawn growing attention, and privacy-protection research has emerged along roughly three directions: tampering with images so face recognition systems misread them; using generative adversarial networks (GANs) to anonymize a person's photo or video; and directly blurring the faces found by a face detector. This article covers the third, explaining how to use mobile face keypoint algorithms to anonymize faces. The method has low device requirements and simple, readable code that can be adapted and used directly. The article also open-sources a mobile real-time 212-point face keypoint SDK that runs with very low latency on a wide range of phones.
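The blur-by-keypoints step itself is easy to sketch: take a bounding box around the detected keypoints and pixelate it. A minimal numpy version (my own illustration; the article's SDK and exact anonymization method may differ):

```python
import numpy as np

def anonymize_face(img, keypoints, block=16, margin=8):
    """Pixelate the face region of an HxWx3 uint8 image, given (x, y) keypoints."""
    xs, ys = keypoints[:, 0], keypoints[:, 1]
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin, img.shape[1])
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin, img.shape[0])
    face = img[y0:y1, x0:x1].copy()
    # mosaic: replace each block x block tile with its mean color
    for y in range(0, face.shape[0], block):
        for x in range(0, face.shape[1], block):
            tile = face[y:y + block, x:x + block]
            tile[:] = tile.mean(axis=(0, 1), keepdims=True).astype(img.dtype)
    out = img.copy()
    out[y0:y1, x0:x1] = face
    return out
```

With 212 keypoints the box tracks the face tightly frame to frame, which is why keypoint-driven blurring looks stable in video.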
- Walking the ultimate path of 1-bit inference on ARM CPUs | NeuralTalk Summary: Edge deep learning pursues lightness relentlessly; from 2016 to now, the bit widths used in inference have dropped again and again: FP32, FP16, INT8... Clearly, low bit widths save model storage and power, and with carefully designed algorithms can also win on performance. In this respect, 1-bit is the ultimate that a classical computer can reach. This article introduces Huawei's exploration of 1-bit optimization techniques on ARM CPUs.
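The reason 1-bit is so fast on CPUs: for ±1 vectors packed into machine words (encoding +1 as bit 1 and −1 as bit 0), the dot product collapses to dot(a, b) = N − 2 · popcount(a XOR b). A tiny Python sketch of the identity:

```python
def binarize(bits):
    """Pack a ±1 vector into an int, encoding +1 as bit 1 and -1 as bit 0."""
    word = 0
    for i, v in enumerate(bits):
        if v == 1:
            word |= 1 << i
    return word

def xnor_dot(wa, wb, n):
    """Dot product of two packed ±1 vectors of length n via XOR + popcount."""
    return n - 2 * bin(wa ^ wb).count("1")
```

On ARM CPUs this maps roughly to EOR plus the NEON CNT (popcount) instruction over wide registers, processing 128 multiply-accumulates per instruction pair, which is where the 1-bit speedups come from.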
- Building a trainable BNN with PyTorch | GiantPandaCV Summary: This article walks through PyTorch code for a BNN, based on the NIPS 2016 paper "Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1".
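The trick that makes BNNs trainable is the straight-through estimator (STE): the forward pass uses sign(w), while the backward pass treats sign as the identity, with the gradient cancelled where |w| > 1 (as in the paper above). A minimal numpy sketch of the two passes:

```python
import numpy as np

def sign_forward(w):
    """Forward pass: binarize to +1/-1 (0 maps to +1)."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_backward(w, grad_out):
    """Backward pass: straight-through estimator, passing the gradient only where |w| <= 1."""
    return grad_out * (np.abs(w) <= 1.0)
```

In PyTorch the same idea is usually written as a custom `autograd.Function` whose `backward` clips by the saved input's magnitude.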
- Analyzing convolutions and their variants: methods and corresponding structures | AI Algorithms and Image Processing Summary: This article surveys the building blocks used in efficient CNN models (such as MobileNet and its variants) and gives an intuitive account of their computational load in the spatial and channel domains.
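The spatial-vs-channel accounting can be made concrete with the standard multiply-accumulate count for a regular convolution versus a depthwise separable one (textbook back-of-envelope formulas; stride 1 and same padding assumed):

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def dws_conv_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1x1 convolution."""
    return h * w * c_in * k * k + h * w * c_in * c_out
```

The ratio works out to 1/c_out + 1/k²; for c_out = 128 and k = 3 that is about 0.12, i.e. roughly 8 to 9 times fewer operations, which is the core saving behind MobileNet-style blocks.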
- Exploring BERT deployment on mobile: 6ms TinyBERT, fast inference combined with hardware optimization | NeuralTalk Summary: Natural language processing tasks fall broadly into four categories: sequence labeling, classification, sentence-pair relation judgment, and generation. In late October 2018, the Google team proposed the pre-trained language model BERT, which set new records on 11 NLP tasks. BERT can serve question answering, sentiment analysis, spam filtering, named entity recognition, document clustering, and more, so using a smartphone's compute to accelerate BERT is of real significance. This article records Huawei's work on model distillation and ARM CPU optimization.
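Distillation of the kind TinyBERT uses trains the small model against the teacher's soft targets (TinyBERT additionally distills attention maps and hidden states, omitted here). A minimal numpy sketch of the temperature-scaled soft-target loss (my own illustration of the standard Hinton-style loss, not Huawei's code):

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = np.asarray(z, dtype=float) / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, t=2.0):
    """Cross-entropy between the teacher and student distributions at temperature t,
    scaled by t^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, t)
    log_q = np.log(softmax(student_logits, t) + 1e-12)
    return float(-(p * log_q).sum(axis=-1).mean() * t * t)
```

The temperature softens the teacher's distribution so the student also learns the relative probabilities of wrong classes, the "dark knowledge" that plain hard labels discard.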