Focus on Model Compression, Low-Bit Quantization, Mobile Inference Acceleration Optimization, and Deployment
Introduction: This issue contains 15 items. 【News】From Shanghai Zhangjiang, news from several GPGPU companies: Hanbo released a cloud AI inference chip whose performance exceeds the NVIDIA T4, with mass production expected in Q4 this year; Suipian released the largest AI chip in China; Birun's first 7nm GPU is expected to tape out in Q3 and launch next year; Samsung's Exynos 2200 will use an AMD-licensed GPU whose performance may exceed the Adreno 730; and an AI face-swapping mobile app has taken the free charts by storm, surpassing TikTok and Snapchat. 【Papers】Huawei's additive network AdderNet has been upgraded with a solid theoretical foundation and further optimizations; an analysis of ActNN, a 2-bit activation compression technique for PyTorch; and CSL-YOLO, which surpasses Tiny-YOLOv4 for real-time detection on the edge. 【Open Source】PyTorchVideo, a treasure trove of video models covering everything from training to mobile deployment; and Microsoft's SuperBench project, focusing on hardware, distributed communication, and model performance. 【Blog】From SenseTime's academic team, deep learning compilation and just-in-time compilation technology; the latest MLPerf reading of the state of AI training chips; OPEN AI LAB's Tengine framework plugin-based hardware backend adaptation and automatic graph-slicing solution; the technological evolution and commercial potential of the RISC-V architecture viewed against Arm's success; and finally a walkthrough of the technical steps by which Qualcomm's CPU and AI engine achieve HD neural video decoding.
Alright, let’s start with some warm-up news ヽ(✿゜▽゜)ノ:
- Qualcomm: Due to dissatisfaction with Samsung's process performance, the Snapdragon 895+ may adopt TSMC's 4nm process, launching at the end of 2022;
- Imagination: Collaborating with Realtek Semiconductor, which has licensed the IMG B-series BXE-4-32 graphics processor for integration into its latest SoC for the digital TV market;
- China Mobile's OneChip: Xincheng Technology Co., Ltd., a wholly-owned chip subsidiary of China Mobile IoT (itself under China Mobile), has officially started operations, entering the IoT chip field and planning to list on the Sci-Tech Innovation Board;
- OPPO: The first product of its chip-making initiative may be a 6nm chip fabricated by TSMC, possibly a simplified SoC or an image signal processor (ISP), with some IP potentially licensed in cooperation with Aojie Technology (ASR). OPPO's chip team currently numbers about 1,000, mainly engineers from Unisoc, Huawei HiSilicon, and MediaTek, led by former MediaTek COO Zhu Shangzu, with the goal of growing to 3,000 engineers;
- NVIDIA: To facilitate the acquisition of Arm, NVIDIA has opened the UK's fastest supercomputer, Cambridge-1, to UK academic institutions and medical enterprises, and has stated that it will use Arm IP to design chips and establish a supercomputing center in the UK;
- CINNO Research: The "China Mobile Communication Industry Data Observation Report" shows that in May, HiSilicon's smartphone chip shipments collapsed while Unisoc surged roughly 63-fold, shipping 800,000 chips for the month (up 6,346.2% year-on-year) and entering the top five; MediaTek is the biggest winner;
- Tsinghua Unigroup / Unisoc: After Tsinghua Unigroup applied for bankruptcy reorganization, new developments were reported around its core asset Unisplendour (000938.SZ). Insiders revealed that Alibaba and several state-backed enterprises are considering acquiring stakes in the cloud computing infrastructure company, with bids potentially reaching 50 billion RMB. Note: Tsinghua Unigroup is one of Unisoc's shareholders, holding a 35.23% stake. Unisoc operates independently as a corporate entity; since establishing a new management team in early 2019, there has been no overlap in management or business with Unigroup, and Unigroup does not directly participate in Unisoc's business operations or decision-making. So far, no announcements have been found that would directly affect Unisoc's current production and business activities;
- Cambricon: Designing a 200 TOPS 7nm intelligent driving chip, featuring an independent safety island and meeting automotive-grade standards, positioned as a "high-level autonomous driving chip";
- Rockchip: Launched the TB-RK3568X (4x Cortex-A55, Mali-G52, in-house NPU at 1 TOPS) and TB-RV1126D development boards, targeting smart vision, cloud terminals, network video recorders (NVR/XVR), IoT gateways, industrial control, network-attached storage (NAS), karaoke machines, and other industry applications;
- DeepGlint: Its IPO application for the Sci-Tech Innovation Board has been accepted by the Shanghai Stock Exchange, aiming to raise 1 billion RMB. Compared with the "four little AI dragons", DeepGlint is no longer in the first tier: 2B revenue is rising but cash flow is tight, and government business is hard to win. To some extent, AI companies without a primary source of profit are not strong enough to break the established rules of their industries;
- CloudWalk Technology: Announced a joint effort with Huawei Ascend to build an enterprise AI central management platform, built around Ascend-based hardware, the CANN heterogeneous computing architecture, and the MindSpore AI computing framework, to bring efficient automation and a better experience to business processes.
Note: Some links may not open; please click [Read the original] at the end of the article to jump to them.
Industry News
- Shanghai Hanbo Semiconductor: Released a cloud AI inference chip with performance exceeding the NVIDIA T4; founded by a team with an AMD background and backed by Kuaishou | Quantum Bit Summary: Hanbo Semiconductor, an AI chip company that completed a 500 million RMB A+ round of financing in April, released the SV100 series of cloud inference AI chips, as well as the VA1 AI inference acceleration card built around them. At the launch, founder and CEO Qian Jun showcased the first product of the SV100 series, the SV102 intelligent vision chip, while CTO Zhang Lei demonstrated the VA1 board. The two products primarily target the mature CV segment of the AI market, emphasizing low latency and multi-channel video processing, and offer energy-efficiency advantages over existing GPUs, potentially saving 60% of server costs. Both the CEO and CTO come from AMD, and the company was founded in February 2018. As an emerging AI chip company, it completed its first semi-custom 7nm chip tape-out last May and subsequently secured $50 million in A-round financing led by Kuaishou. The SV100 series is designed for cloud inference servers; SV102, the first chip in the series, delivers a peak INT8 compute of 200 TOPS, supports 64-channel 1080p video decoding, has a maximum power consumption of 75W, uses a PCIe Gen4 x16 interface, and is passively cooled. The chip passed testing in June this year. The VA1 card built on SV102 adopts a single-width, half-height, half-length 75W PCIe design, saving energy and space compared with typical GPU cards. Zhang Lei stated that in the ResNet-50 benchmark, VA1 achieved more than twice the throughput of the NVIDIA T4. Because VA1 saves over 50% of server TCO, a 2U server equipped with VA1 can decode 384 channels of video, with overall compute exceeding a T4-based device by more than 2.5 times and power consumption lower than a server with the same GPU, potentially cutting server costs by 60%. For video processing, VA1 supports more than 64 channels of H.264, H.265, or AVS2 1080p decoding, with resolution support up to 8K. The SV102 chip and VA1 board are expected to enter mass production in Q4 this year. Beyond hardware, Hanbo has also built its own VastStream AI software platform, which supports mainstream AI frameworks such as PyTorch and TensorFlow and is being adapted to server operating systems including CentOS, Ubuntu, Red Hat, and Galaxy Kirin.
- Shanghai Suipian: Released China's largest AI chip, claiming four domestic firsts, with benchmarks | Quantum Bit Summary: With the arrival of Suisi 2.0, Suipian Technology's other products have been upgraded accordingly. First is the T20 cloud training acceleration card, the second-generation AI training acceleration card for data centers, officially described as having wide model coverage, strong performance, and an open software ecosystem, supporting various AI training scenarios. Next is the T21 cloud training OAM module, an AI training acceleration module designed to the OCP (Open Compute Project) OAM (Open Accelerator Module) standard and compatible with the OCP OAI (Open Accelerator Infrastructure) standard. The T21's single-precision FP32 compute reaches up to 40 TFLOPS, and its TF32 compute reaches up to 160 TFLOPS. Finally, Suipian Technology has also upgraded its Yusuan TopsRider software platform: based on operator generalization technology and graph optimization strategies, it supports training a wide range of models under mainstream deep learning frameworks, and with the Horovod distributed training framework and GCU-LARE interconnect technology it provides solutions for running ultra-large-scale clusters efficiently; both the programming model and the extensible operator interface are open. Lastly, the company stated its own "beyond Moore's Law" commitment: each generation of products must exceed the previous one by more than 3x in Perf/W and more than 2x in Perf/$ (BOM) on average across workloads, while maintaining reliable backward software compatibility.
- Birun Technology: Its first 7nm GPU chip is expected to tape out in Q3 this year and officially launch next year | Chip East Summary: Birun Technology's CTO and chief architect Hong Zhou stated that the company's first 7nm chip, which supports both AI training and inference, is progressing smoothly. The first GPU chip is positioned for high-end general-purpose intelligent computing, featuring high performance, scalability, and virtualization, and supporting cloud training and inference. It is currently in the final stages and is expected to tape out this year, with performance benchmarked against NVIDIA's next-generation product, i.e. the 5nm GPU compute chip that NVIDIA is still developing. Birun Technology's strategy is to focus on a few points and wage an "asymmetric war": NVIDIA's GPU is not the optimal chip for AI training and inference but rather a multi-capability chip. For example, the A100's double precision is crucial for HPC, but for AI acceleration it is not optimal in energy efficiency or compute density. Birun Technology therefore chooses to specialize first in general AI training and inference, stripping away designs unrelated to AI acceleration such as graphics rendering and focusing on how to lay out more compute and storage units on its chips. After tape-out, Birun Technology will prioritize the software work needed to commercialize the acceleration chip. Its second chip has already begun architecture design, and the company will gradually launch GPU chips aimed at intelligent computing centers, cloud gaming, and edge computing. Birun Technology emphasizes optimizing three key features of its chips: versatility, high compute, and chiplet technology:
- Versatility: from CUDA compatibility to replacing CUDA. A new GPU must first seamlessly support the CUDA ecosystem, which matters more than higher compute or better energy efficiency; Birun Technology's ultimate goal is to provide a self-developed programming model better than CUDA.
- High compute: integrating the advantages of multiple architectures. While remaining general-purpose, Birun deepens and optimizes in specialized areas and combines the strengths of several architectures: it is not limited to the traditional vector stream-processing architecture, but incorporates dataflow processing units, near-memory computing, and other ideas, with special optimizations for key scenarios so that it can handle various data types, aiming for several times NVIDIA's compute at the same power. Raising single-chip compute is only one aspect; Birun also builds very high interconnect bandwidth into its chips, allowing scaling to hundreds or thousands of chips and thus clustered high compute.
- Samsung's new flagship Exynos 2200 performance preview: with AMD GPU assistance | Computer Enthusiast Summary: Samsung's next-generation flagship SoC, the Exynos 2200, has four sample chips prepared: two use the Cortex-X1 as the prime core, one uses the latest Cortex-X2 as the prime core, and one conservative variant uses the Cortex-A78. As consumers, we naturally hope the Exynos 2200 will use a purely ARMv9 CPU cluster built from Cortex-X2 + Cortex-A710 + Cortex-A510. Compared with its competitors, the biggest feature of the Exynos 2200 is the integration of a custom RDNA 2 architecture GPU from AMD. Recently, a leaked screenshot of an Exynos processor running the 3DMark Wild Life benchmark showed that the chip, combining an AMD Radeon GPU with Cortex-A77 CPU cores, scored 8,134 points with an average frame rate of 50 FPS. Being based on the Cortex-A77, that Exynos sample should be considered a semi-finished part, used primarily to validate the reliability and strength of the AMD Radeon GPU; paired with ARMv9 cores such as the Cortex-X2, the GPU should in theory deliver even stronger performance. By comparison, the Qualcomm Snapdragon 888 and Kirin 9000 scored 5,720 and 6,677 respectively in 3DMark Wild Life, suggesting that the AMD Radeon GPU is very strong and unlikely to be inferior to Qualcomm's next-generation Adreno 730.
- Mysterious AI face-swapping app "Voilà AI Artist" sweeps global social networks | New Intelligence Source Website: https://www.wemagine.ai/ Summary: Powered by AI algorithms, the mobile app Voilà can generate images in four filter styles from a single uploaded portrait photo: 3D cartoon (Disney style), 2D cartoon, Renaissance painting, and comic character. Voilà also has a database of celebrity photos, letting users see face-swapping effects on celebrities directly in the app. About three months after launch, its iOS version topped the free charts in multiple countries and regions, beating strong competitors such as TikTok, Instagram, and Snapchat. On Android, Voilà has surpassed 10 million downloads on Google Play and has ranked in the top 10 of the popularity charts in 26 countries and regions. The official Facebook account celebrated reaching 20 million users on June 13.
Papers

- [1912.13200] Huawei Noah's AdderNet upgraded: higher accuracy, able to approximate any function | Machine Heart Summary: The computation of deep convolutional neural networks often consumes enormous energy, making them difficult to deploy on mobile devices, so academia is exploring a variety of new approaches. This work proposes replacing the multiplications of convolution in CNNs with additions, significantly reducing the energy consumption of neural networks. Multiplication is more expensive than addition, yet the forward pass of a deep neural network involves a huge number of multiplications between weights and activations, so many papers have studied how to reduce multiplications in neural networks to speed up deep learning. In the new version of AdderNet, performance has improved significantly and now comes with a solid theoretical guarantee. First, the team proved that an AdderNet with a single hidden layer and bounded width can approximate any Lebesgue-integrable function on a compact set, a result that rivals the universal approximation results for traditional neural networks; they also give approximation bounds for single-hidden-layer AdderNets. Second, to optimize AdderNet effectively, the team designed a training scheme that transitions from L2 to L1, together with an adaptive learning-rate scaling, to ensure sufficient parameter updates and better convergence. The new AdderNet was tested on multiple image classification datasets, and experiments show its accuracy improved markedly over previous versions, reaching recognition accuracy comparable to traditional CNNs on large datasets such as ImageNet.
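To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of an "adder" layer: instead of the dot product used by an ordinary convolution, each output is the negative L1 distance between an input patch and a filter, so the layer is built from additions and subtractions. The class name and the use of `F.unfold` are illustrative choices.

```python
# Minimal sketch of the AdderNet idea, assuming a standard conv-layer interface.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdderConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.stride, self.padding, self.k = stride, padding, kernel_size

    def forward(self, x):
        b = x.shape[0]
        # Extract sliding patches: (B, in_ch*k*k, L), L = number of output positions
        patches = F.unfold(x, self.k, stride=self.stride, padding=self.padding)
        w = self.weight.view(self.weight.shape[0], -1)          # (out_ch, in_ch*k*k)
        # Negative L1 distance between every patch and every filter (adds/subs only)
        out = -(patches.unsqueeze(1) - w.unsqueeze(0).unsqueeze(-1)).abs().sum(dim=2)
        h_out = (x.shape[2] + 2 * self.padding - self.k) // self.stride + 1
        w_out = (x.shape[3] + 2 * self.padding - self.k) // self.stride + 1
        return out.view(b, -1, h_out, w_out)
```

In a practical kernel the subtract-and-accumulate would be fused rather than materializing the full patch tensor; the sketch only illustrates which arithmetic the additive network replaces.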
- [2104.14129] UC Berkeley's ActNN: a new approach to saving memory, training neural networks in PyTorch with 2-bit activation compression | Machine Heart Paper: https://arxiv.org/abs/2104.14129 Code: https://github.com/ucbrise/actnn Summary: As ultra-large-scale deep learning models become the trend in AI, training them under limited GPU memory has become a challenge. This article introduces ActNN from UC Berkeley, an activation compression training framework built on PyTorch. The paper was accepted as a Long Talk at ICML 2021, and the code is open-sourced on GitHub. Existing methods for saving training memory mainly fall into three categories: 1. re-computation (gradient checkpointing / rematerialization); 2. swapping activations out to CPU memory; 3. distributed training that stores tensors across multiple GPUs. These methods are not mutually exclusive and can be combined; most machine learning frameworks provide some support for them, and there are many related papers, but implementing these strategies efficiently and automatically is not easy. Unlike existing methods, the authors propose ActNN, a new memory-saving framework based on compression. Alongside theoretical proofs, it offers an efficient and easy-to-use PyTorch implementation: under the same memory budget, ActNN can expand the batch size by 6-14x and increase the model size or input image size by 6-10x.
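The mechanism behind activation compression training can be illustrated with a small custom autograd function: the layer output is computed in full precision, but the activation saved for the backward pass is quantized to a few bits and dequantized only when the gradient is computed. The sketch below is a conceptual simplification, not ActNN's actual API; ActNN uses per-group 2-bit quantization with bit packing and ships converted layers so existing models can be wrapped with little code change.

```python
# Conceptual sketch of compressing the saved activation (assumed, simplified per-tensor
# quantization to 4 levels instead of ActNN's per-group 2-bit scheme).
import torch

class CompressedReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x.clamp(min=0)
        # Quantize the activation saved for backward to 2 bits (4 levels).
        scale = y.max().clamp(min=1e-8) / 3.0
        q = torch.round(y / scale).to(torch.uint8)        # values in {0, 1, 2, 3}
        ctx.save_for_backward(q)
        ctx.scale = scale
        return y                                          # full-precision output

    @staticmethod
    def backward(ctx, grad_out):
        (q,) = ctx.saved_tensors
        y_hat = q.float() * ctx.scale                     # dequantized activation
        return grad_out * (y_hat > 0).float()             # ReLU gradient mask

x = torch.randn(4, 16, requires_grad=True)
CompressedReLU.apply(x).sum().backward()                  # backward uses the 2-bit copy
```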
- [2107.04829] CSL-YOLO: surpassing Tiny-YOLOv4, a new lightweight YOLO model for real-time detection on the edge | Collective Intelligence Book Summary: This paper proposes a new lightweight convolution method, the Cross-Stage Lightweight (CSL) module, which generates redundant features from cheap operations and, in the intermediate expansion stage, replaces pointwise convolutions with depthwise convolutions to generate candidate features. The CSL module significantly reduces computation, and experiments on MS-COCO show it achieves fitting capability comparable to 3x3 convolutions. Previous work has shown that generating redundant feature maps with little computation can greatly reduce FLOPs: CSPNet proposed a cross-stage approach, and GhostNet systematically verified the effectiveness of cheap operations for this problem, but the main operation that generates the valuable feature maps remains too heavy for edge computing. This paper divides the input feature map into two branches: the first branch generates half of the (redundant) feature maps through cheap operations, as in GhostNet; the second branch generates the other, necessary half through a lightweight main operation, and the two outputs are then concatenated. The CSL module thus obtains the redundant half from the skip branch; on the main branch, unlike the CSP and Ghost modules, the authors design a lightweight main operation to generate the remaining necessary feature maps. This branch uses a block similar to an inverted residual block (IRB) that takes the input feature maps together with the skip-branch outputs and generates intermediate candidate feature maps with depthwise convolutions. A key advantage is that this step needs no pointwise convolution, and depthwise convolutions have far fewer FLOPs than pointwise ones; this differs from the IRB, which uses pointwise convolutions to generate candidate feature maps. The block also makes full use of all currently available features, minimizing redundant computation, and since the skip branch already exists, the main branch only needs to produce half of the feature maps, further reducing FLOPs. Overall, the CSL module reduces FLOPs through cheap operations and the cross-stage idea while keeping the main branch lightweight. To verify its effectiveness, the convolution layers in VGG-16 were replaced, yielding models named IRB-VGG-16, Ghost-VGG-16, and CSL-VGG-16 respectively.
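For readers who want to see the two-branch structure in code, here is a rough PyTorch sketch of a CSL-style block as described above. It is an interpretation of the text, not the paper's reference implementation; the exact cheap operation, channel split, and normalization choices are assumptions.

```python
# Rough sketch of a cross-stage lightweight block (assumed structure).
import torch
import torch.nn as nn

class CSLLikeBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        # Skip branch: cheap operation producing half of the output feature maps
        self.cheap = nn.Sequential(
            nn.Conv2d(in_ch, half, 1, bias=False),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )
        # Main branch: depthwise expansion over [input, cheap output], no pointwise expansion
        self.main_dw = nn.Sequential(
            nn.Conv2d(in_ch + half, in_ch + half, 3, padding=1,
                      groups=in_ch + half, bias=False),
            nn.BatchNorm2d(in_ch + half), nn.ReLU(inplace=True),
        )
        self.main_proj = nn.Sequential(
            nn.Conv2d(in_ch + half, out_ch - half, 1, bias=False),
            nn.BatchNorm2d(out_ch - half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        cheap = self.cheap(x)                           # cheap half of the outputs
        cand = self.main_dw(torch.cat([x, cheap], 1))   # depthwise candidate features
        main = self.main_proj(cand)                     # remaining necessary half
        return torch.cat([cheap, main], dim=1)          # (B, out_ch, H, W)
```

The point to notice is that the expansion step (`main_dw`) works on the concatenation of the input and the cheap half and uses only depthwise convolutions, which is where the FLOPs saving over an IRB-style pointwise expansion comes from.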
Open Source Projects

Note: Each item is prefixed with the repository owner and repository name; the full address is github.com/<repo_owner>/<repo_name>.

- facebookresearch/pytorchvideo: Facebook AI's full-stack video library PyTorchVideo makes SOTA models run 8x faster on mobile! | New Intelligence Source Project: https://pytorchvideo.org/ Summary: PyTorchVideo is a machine learning library for video understanding that serves multiple codebases and provides open-source SOTA video models, foundational video algorithms, video data operations, popular video datasets, video augmentation, and model acceleration and quantization, a full stack of video-related content. PyTorchVideo has also open-sourced its mobile acceleration work, with step-by-step tutorials for optimizing a video model's core kernels and applying quantization. After these optimizations the models run in real time on mobile, and the team released open-source Android and iOS code so that SOTA video models can be run directly on phones. The PyTorchVideo-accelerated X3D model running on a Samsung Galaxy S10 is 8x faster, processing one second of video in about 130 milliseconds.
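As a quick taste of the deployment path, here is a minimal sketch that pulls an X3D model from the PyTorchVideo model zoo via torch.hub and converts it with the generic TorchScript mobile flow. The hub entry point and model name follow the public model zoo; the tracing and optimization steps are the standard PyTorch mobile recipe, not necessarily the exact pipeline behind the official benchmark numbers.

```python
# Minimal sketch: load X3D-S from the PyTorchVideo model zoo and export for mobile.
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_s", pretrained=True)
model.eval()

# Example clip: (batch, channels, frames, height, width); X3D-S uses 13 frames at 182x182.
example = torch.randn(1, 3, 13, 182, 182)
scripted = torch.jit.trace(model, example)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("x3d_s_mobile.ptl")   # load from the Android/iOS demo app
```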
- microsoft/SuperBench: hardware and software benchmarks for AI systems Project: https://microsoft.github.io/superbenchmark/ Summary: SuperBench is a validation and profiling tool for AI infrastructure. It provides micro-benchmarks for primitive computation and communication, as well as model benchmarks that measure domain-aware, end-to-end deep learning workloads. SuperBench supports:
- Comprehensive performance comparison between different existing hardware
- Insights for hardware and software co-design
- Distributed validation tools to validate hundreds or thousands of servers automatically
- Consideration of both raw hardware and end-to-end model performance with ML workload patterns
- Building a contract to identify hardware issues
- Infrastructure-oriented criteria as performance/quality gates for hardware and system release
- Detailed performance reports and an advanced analysis tool
- AI infrastructure validation and diagnosis
- AI workload benchmarking and profiling
Blog Posts

- Tengine Open Source Talk 2021 series, live lesson 2 | OPEN AI LAB Replay: https://live.csdn.net/room/weixin_43476455/lt1qpikr Summary: This live session from OPEN AI LAB, the second lesson of the Tengine Open Source Talk 2021 series, mainly introduces the basics of the Tengine project, such as the supported backends and model formats; explains in detail how the current framework automatically slices graphs for different backends; gives an overview of the plugin design for Tengine's backend devices; and walks through the code of Tengine's adaptation to TensorRT and TIM-VX.
- Deep Learning Compilation and Just-In-Time Compilation of Models | SenseTime Academic Summary: Balancing flexible, dynamic algorithm expression with efficient execution is a core problem every deep learning framework must solve. This article introduces a function-based just-in-time compilation technique that keeps the overall algorithm flexibly expressed while applying static optimizations to specific functions to improve execution efficiency. Combining the computational efficiency of graph execution with the ease of development and debugging of eager execution is an important research direction in deep learning framework design. Mainstream frameworks such as TensorFlow and PyTorch have introduced just-in-time (JIT) compilation, recording the intermediate representations generated while the model executes dynamically. One approach is to cache these intermediate representations keyed by the argument signature: when the model is called again with arguments of the same signature, the cached representation can be used directly, saving the time of interpreting Python code, and the cached representation can be further optimized with computational graph optimization or other static techniques. The analogy is Java: the Java Virtual Machine (JVM) accelerates Java programs through just-in-time compilation. Java source code is first compiled into platform-independent bytecode (.class files), which the JVM loads and interprets; for rarely executed code, interpretation saves the JIT compiler's time, while for frequently executed hot code, JIT compilation significantly improves execution speed once triggered. Deep learning frameworks that adopt eager execution can apply similar ideas.
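The signature-keyed caching described above is, for example, what TensorFlow's tf.function does; the tiny sketch below (illustrative, not taken from the article) shows the behavior: the Python body is traced into a graph IR on the first call with a given input signature and the cached graph is reused afterwards.

```python
# Illustration of signature-based tracing and caching with tf.function.
import tensorflow as tf

@tf.function
def dense_relu(x, w, b):
    print("tracing...")           # runs only while a new graph is being traced
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([4, 8])
w = tf.random.normal([8, 16])
b = tf.zeros([16])

dense_relu(x, w, b)               # first call: traces a graph for this signature
dense_relu(x, w, b)               # same signature: cached graph, no retracing
dense_relu(tf.random.normal([2, 8]), w, b)  # a new input shape may trigger retracing
```

PyTorch's torch.jit.trace/script plays a similar role, producing a TorchScript IR that can be optimized and executed without re-interpreting the Python code on every call.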
- "Grand Banquet": the latest status of AI training chips and systems | StarryHeavensAbove Summary: MLPerf has been running for several years, with the Training benchmark being the earliest and still most watched effort. Unsurprisingly, every vendor "claims" to have won some championship. After the v0.7 results were released, NVIDIA published a comparison chart attempting a normalized comparison, but it is still hard to call it an entirely reasonable one; the analogy used before was "Guan Gong fighting Qin Qiong" (comparing things that never meet), which is one of the difficulties MLPerf benchmarking faces. So if the results are hard to compare, what is the point of organizing and entering such "competitions"? First, it serves the needs of the "AI training arms race": as long as "brute force produces miracles" remains the rule in AI, model scale will keep growing. Second, it helps chip vendors build comprehensive capabilities: as introduced before, MLPerf tests the combined capability of hardware and software systems; looking at Huawei's ResNet numbers, version 0.7 used the TensorFlow framework, while this round used MindSpore. Third, designing and discussing the benchmarks themselves also pushes related technology forward; designing a good benchmark is itself a highly technical task that requires a deep understanding of algorithmic trends and of system hardware and software.
- Changing Times: the technological evolution and commercial imagination of the RISC-V architecture | Chip Open Community Summary: The rapid development of RISC-V in recent years has raised its profile in both academia and industry. With its simplicity, modularity, and extensibility, RISC-V is expected to become a dominant architecture of the new era. Starting from this issue, a series of RISC-V knowledge-graph articles will guide readers through RISC-V from the perspectives of application development, business, and technological innovation. This opening article introduces the origins of the RISC-V architecture from technical and commercial angles, analyzes the reasons for its rapid growth, and dissects the success of the x86 and Arm architectures. The main reason for Arm's success is that it was adopted as a foundational technology by other chip manufacturers, building an ecosystem of partners around itself. Arm's success also rode historical trends: the emergence of feature phones and then the rise and spread of smartphones meant its business-model revolution met the tide of history, letting Arm advance step by step from the periphery to center stage (Arm has now entered x86's traditional strongholds, including Apple's M1 PC chip and Ampere's and Amazon's server chips).
- Using the Snapdragon 888's CPU and AI engine to achieve 30+ fps neural video decoding | Quantum Bit Summary: Qualcomm AI Research achieved efficient neural video decoding on mobile devices through decoder-architecture optimization, parallel entropy decoding (PEC), and AIMET quantization-aware training, in three key steps (a rough sketch of the quantization step follows the list):
- Starting from a SOTA frame-to-frame compression network, the decoder architecture was optimized through channel pruning and network-operation optimization, relying on the Snapdragon 888's built-in AI engine for acceleration and reducing computational complexity;
- A fast parallel entropy decoding algorithm was created that exploits data-level and thread-level parallelism to achieve higher entropy-coding throughput; in Qualcomm's solution, the Snapdragon 888's CPU performs the parallel entropy decoding;
- The model's weights and activations were quantized to 8 bits, and the rate-distortion loss caused by quantization was recovered through quantization-aware training, using the AI Model Efficiency Toolkit (AIMET) open-sourced in May 2020, which provides a library of advanced quantization and compression techniques for neural network training. Through these three steps, Qualcomm AI Research built an efficient 8-bit model with high decoding performance (a rough sketch of this quantization-aware workflow is shown below). In the demo, a video with a resolution of 1280x704 (close to 720p HD) was chosen; the compressed bitstream was generated offline and then processed by the parallel entropy decoder and the decoder network running on a Snapdragon 888 device (a commercial smartphone), with parallel entropy decoding on the CPU and the decoder network accelerated by the sixth-generation Qualcomm AI engine. In the end, the neural decoding pipeline achieved more than 30 frames per second for 1280x704 video.
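Below is a rough sketch of 8-bit quantization simulation and export with AIMET's QuantizationSimModel workflow (aimet_torch). The tiny stand-in "decoder", random calibration clips, quantization scheme, and exact arguments are placeholders and assumptions; Qualcomm's actual decoder network and training pipeline will differ.

```python
# Rough sketch of AIMET-style 8-bit quantization (placeholder model and data).
import torch
import torch.nn as nn
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

decoder = nn.Sequential(                       # stand-in for the pruned decoder network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
dummy_input = torch.randn(1, 3, 704, 1280)     # illustrative 1280x704 frame

def calibrate(model, _):
    # Run a few unlabeled batches so AIMET can compute quantization encodings.
    model.eval()
    with torch.no_grad():
        for _ in range(4):
            model(torch.randn(1, 3, 704, 1280))

sim = QuantizationSimModel(decoder, dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           default_param_bw=8, default_output_bw=8)
sim.compute_encodings(forward_pass_callback=calibrate, forward_pass_callback_args=None)

# Quantization-aware training: fine-tune sim.model with the usual training loop to
# recover the rate-distortion loss, then export the int8 model plus encodings.
sim.export(path=".", filename_prefix="decoder_int8", dummy_input=dummy_input)
```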
Click【Read the original】 to see previous articles