Embedded AI
1. What High-End Domestic AI Chips Are Available? | Frontline
Original:
https://baijiahao.baidu.com/s?id=1742745257605745466&wfr=spider&for=pc
On September 1, news about restrictions on the sale of some overseas high-performance AI chips once again drew market attention to domestic AI chips.
As industries such as data centers and autonomous driving accelerate their development, scenarios demanding high computing power and performance urgently need larger and more abundant computing resources to handle complex computing tasks. High-performance AI chips have therefore become key to supporting the mature development of these fields.
In fact, since the outbreak of the artificial intelligence boom, a number of high-performance AI chip startups have gradually emerged in our country, further promoting the autonomy of the domestic chip industry and representing a new force in China’s technological strength.
The following are representative domestic high-performance AI chip startups (sorted by establishment date):
01. Horizon
Founded in 2015, Horizon is currently the only company in China to achieve mass production of automotive-grade AI chips. Through its self-developed AI-specific computing architecture BPU (Brain Processing Unit), Horizon has built two major product lines: the Journey series chips for the autonomous driving field and the Rising Sun series chips for the AIoT field. Among them, the company released the Journey 5, an all-scenario intelligent central computing chip for vehicles, in July 2021, with a single-chip AI computing power of 128 TOPS.
02. Tian Shu Zhi Xin
Founded in 2015, Tian Shu Zhi Xin officially launched the design of 7-nanometer general-purpose parallel cloud computing chips in 2018. It is a GPGPU high-end chip and supercomputing system provider, targeting the data-driven technology market represented by cloud computing, artificial intelligence, and digital transformation.
03. Cambricon
Founded in 2016, Cambricon primarily develops cloud-edge-terminal integrated, training-inference fused intelligent chip products and platform-level system software. Its products are widely adopted by server manufacturers and industry customers, providing computing power for complex AI application scenarios in fields such as the internet, finance, transportation, energy, power, and manufacturing.
Currently, the company has launched a series of intelligent chips and accelerator card products covering the terminal side (1A, 1H, and 1M processors), the edge (Siyuan 220 chip), and the cloud (Siyuan 100, 270, 290, and 370 chips), maintaining a pace of launching 1-2 core products each year.
04. Black Sesame Intelligence
Founded in 2016, Black Sesame Intelligence develops automotive-grade autonomous driving computing chips and platforms, providing complete autonomous driving and vehicle-road collaboration solutions built around perception computing chips and platforms that feature automotive-grade design, learning-based image processing, and low-power, high-precision perception.
Black Sesame Intelligence has launched four autonomous driving chip products: Huashan A500, Huashan No. 2 A1000, A1000 L, and A1000 Pro, based on two self-developed core IPs. Among them, the Huashan No. 2 A1000 chip has completed all automotive-grade certifications and began mass production in April this year, with continuous delivery to industry customers, expected to achieve mass production on vehicles within 2022.
05. MoXin Artificial Intelligence
Founded in 2018, MoXin is a chip design company providing cloud and terminal AI chip acceleration solutions. By optimizing computing models, the company supports the development of fully sparse neural networks, offering ultra-high performance and ultra-low power consumption general AI computing platforms.
06. Suiyuan Technology
Founded in March 2018, Suiyuan Technology focuses mainly on the AI cloud computing field, providing fully self-developed general-purpose AI training and inference products with fully independent intellectual property, which can be widely used in cloud data centers, supercomputing centers, the internet, traditional industries, and smart cities.
To date, Suiyuan Technology has launched the cloud AI inference accelerator cards Yunsui i10 and i20, the cloud AI training accelerator cards Yunsui T10 and T20, as well as the computing and programming platform “Yusuan” and the inference acceleration engine Jiansuan (TopsInference), among other products.
07. Birun Technology
Founded in 2019, Birun Technology focuses on developing original general-purpose computing systems, building efficient software and hardware platforms, and providing integrated solutions for the intelligent computing field. In terms of development path, Birun Technology first focuses on cloud-based general intelligent computing, gradually catching up with existing solutions in areas such as AI training and inference and graphics rendering, and achieving breakthroughs in domestic high-end general-purpose intelligent computing chips.
This March, Birun Technology successfully brought up (“lit up”) the largest general-purpose GPU chip made in China, and in August released its first general-purpose GPU chip, the BR100, with 16-bit floating-point performance exceeding 1,000 TFLOPS and 8-bit fixed-point performance exceeding 2,000 TOPS, reaching peak computing power at the PFLOPS level.
2. Horizon’s L4 AI Chip Challenges NVIDIA, 3-Year-Old Birun Sets Global Computing Record. Are Domestic Chips On a Roll?
Original:
https://www.jfdaily.com/wx/detail.do?id=523435
To compete for the SAIL award, known as the “Oscar” of the artificial intelligence industry, leading enterprises at home and abroad often enter two projects at once to increase their chances of winning. Because competition is so fierce, the conference added a paper award in 2020 to balance the field, raising the number of awards from 4 to 5; starting in 2021, 6 SAIL Star awards were added on top of the 5 major awards.
The SAIL award will be announced today. This year, over 800 projects from around the world participated in the application, and among the top 30 already decided, there are many leading companies and well-known academic institutions such as Meituan, Tencent, Baidu, Qualcomm, Amazon, iFlytek, JD.com, and the Institute of Automation of the Chinese Academy of Sciences.
No matter who the final winner is, these 30 hardcore projects are already a barometer, enough to represent the cutting-edge technology of the current global artificial intelligence industry and reflect the hottest applications.
Chips feature prominently in this year’s top 30 list, concentrated in cloud inference chips, training chips, and automotive chips. This is a clear sign that the artificial intelligence industry is accelerating toward real-world deployment.
For example, autonomous driving is one of the most commercially valuable and popular segments of the artificial intelligence industry, and automotive chips are at its core. In the era of software-defined vehicles, high-performance chips take center stage. The Journey 5, Horizon’s all-scenario intelligent central computing chip for vehicles, which made this year’s top 30, is a representative high-performance chip.
The Journey 5 is Horizon’s third-generation automotive-grade AI chip. Built on TSMC’s 16-nanometer process, it delivers a maximum single-chip AI computing power of 128 TOPS (tera operations per second) at a power consumption of 30 W, supports perception computing for 16 camera channels, covers L4-level autonomous driving needs, and is the country’s first mass-producible AI chip with over 100 TOPS of computing power. The Journey 5 has already secured mass-production cooperation with several mainstream automakers, including BYD and SAIC Motor.
People often compare the Journey 5 with NVIDIA’s 7-nanometer autonomous driving chip Orin released in 2019. Orin has a single-chip computing power of 254 TOPS and a power consumption of 45W. It can be seen that the gap between the Journey 5 and international giants like NVIDIA is narrowing. However, NVIDIA released a new generation of autonomous driving chip Atlan in April last year. Atlan uses a 5-nanometer process, with a single-chip computing power reaching 1000 TOPS, and samples will be provided to developers in 2023. It is clear that time is running out for Horizon, and its Journey series must accelerate iteration.
In the chip competition, speed is the only principle. With the support of capital, domestic chips are expected to pick up the pace. This time, Birun Technology’s general-purpose GPU chip BR100 also made the SAIL award top 30. The BR100 uses TSMC’s 7-nanometer process and reaches peak computing power at the level of a quadrillion floating-point operations per second (PFLOPS), breaking the global record for general-purpose GPU computing power.
3. Breaking! US Officials Request NVIDIA and AMD to Stop Selling AI Chips to China
Original:
https://baijiahao.baidu.com/s?id=1742730578138892685&wfr=spider&for=pc
Recently, multiple foreign media outlets reported that chip designer NVIDIA disclosed in a filing with the SEC that US officials have asked it to stop exporting to China (including Hong Kong) two of its top computing chips used for artificial intelligence work. The move could undermine Chinese companies’ ability to perform advanced tasks such as image recognition, and could hurt NVIDIA business that was expected to bring in $400 million this quarter.
NVIDIA stated that US officials told the company that the new regulations would address the risks of the products potentially being used or transferred to “military end uses” or “military end users”.
This statement marks a significant escalation in the US’s crackdown on China’s technological capabilities.
Reuters noted that without chips from American companies like NVIDIA and AMD, Chinese institutions would be unable to economically and effectively carry out advanced computing tasks for image and voice recognition.
As an editor, I want to say: all reactionaries are paper tigers. We hope that domestic products can become self-reliant and one day no longer have a chokehold placed on them. We must keep working hard.
4. The World’s Strongest Intelligent Computing Has Arrived: 12000000000000000000 (Don’t Count, 18 Zeros) FLOPS!
Original:
https://mp.weixin.qq.com/s/ROyNjTP-R72Mqu6JPvrn9A
The throne of the “World’s Strongest Intelligent Computing” has just changed hands.
A “Chinese player” from Zhangbei County, Hebei Province, has defeated Google. The computing power it relies on has reached a staggering 12 EFLOPS (1 EFLOPS = 10^18 floating-point operations per second).
In contrast, Google’s peak cluster computing power is 9 EFLOPS, and Tesla only has 1.9 EFLOPS.
So how fast is this “speed” from Zhangbei County? As an example, training an autonomous driving model used to take about 7 days; with the support of this “world’s strongest computing power,” that time has been cut to under 1 hour, a speedup of nearly 170 times. Intelligent computing refers to computing power provided specifically for artificial intelligence workloads, and the “world’s strongest intelligent computing” in question is the Alibaba Cloud (Aliyun) Feitian intelligent computing platform, powered by the Zhangbei Intelligent Computing Center.
Moreover, this intelligent computing center has not only achieved first place in AI computing power but also unlocked the following capabilities:
- Parallel efficiency at the scale of thousands of accelerator cards reaches over 90%, with a threefold increase in computing resource utilization.
- Storage IO performance can be improved by up to 10 times, and system latency significantly reduced by 90%.
- AI training efficiency can be improved by 11 times and inference efficiency by 6 times.
- PUE drops to as low as 1.09, while saving 90% of construction area.
These AI computing powers are bringing about a more intelligent daily life.
AI Hot Topics
5. Poly-YOLO: Faster and More Accurate Detection (Mainly Solving Two Major Problems of YOLOv3, Source Code Attached)
Original:
https://mp.weixin.qq.com/s/zgfWHf3CUfJ1BwQQdrpvkQ
Paper Address:
https://arxiv.org/pdf/2005.13243.pdf
Source Code:
https://gitlab.com/irafm-ai/poly-yolo
The improved version of YOLOv3 is here! Compared to YOLOv3, Poly-YOLO has only 60% of the trainable parameters, yet its mAP improves by a relative 40%! A lighter version, Poly-YOLO Lite, is also proposed.
Object detection models can be divided into two groups: two-stage and one-stage detectors. A two-stage detector splits the process in two: in the first stage it proposes regions of interest (RoI), and in the second stage it performs bounding-box regression and classification within these candidate regions. A one-stage detector predicts the bounding boxes and their classes in a single pass. Two-stage detectors are usually more precise in localization and classification but slower than one-stage detectors. Both types include a backbone network for feature extraction and a head network for classification and regression. The backbone is typically a SOTA network such as ResNet or ResNeXt, pretrained on ImageNet or OpenImages, although some methods also train from scratch.
The framework shared today proposes a new version of YOLOv3 with better performance and extends it to a model called Poly-YOLO. Poly-YOLO builds on the original idea of YOLOv3 and eliminates its two weaknesses: label rewriting and anchor allocation imbalance.
Poly-YOLO aggregates features from a lightweight SE-Darknet-53 backbone using stairstep upsampling and the hypercolumn technique, producing a single-scale, high-resolution output that mitigates the label-rewriting and anchor-imbalance problems. Compared to YOLOv3, Poly-YOLO has only 60% of the trainable parameters but improves mAP by a relative 40%. The lighter Poly-YOLO Lite matches YOLOv3’s accuracy while being three times smaller and twice as fast, making it better suited to embedded devices.
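The following is a minimal sketch (not the authors' code) of the hypercolumn-style aggregation idea: feature maps from several backbone stages are projected, upsampled to a common resolution, and summed into a single high-resolution output. The channel counts, feature-map sizes, and nearest-neighbor upsampling below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypercolumnNeck(nn.Module):
    """Aggregate multi-scale backbone features into one high-resolution map,
    roughly in the spirit of Poly-YOLO's single-scale neck."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convs project every backbone stage to a common channel width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, features):
        # `features` are ordered from the highest to the lowest resolution.
        target_size = features[0].shape[-2:]
        out = 0
        for feat, lateral in zip(features, self.lateral):
            x = lateral(feat)
            # Upsample coarser maps to the finest resolution before summing.
            if x.shape[-2:] != target_size:
                x = F.interpolate(x, size=target_size, mode="nearest")
            out = out + x
        return out

# Toy usage with made-up stride-8/16/32 feature maps of a 416x416 input.
feats = [torch.randn(1, 256, 52, 52),
         torch.randn(1, 512, 26, 26),
         torch.randn(1, 1024, 13, 13)]
neck = HypercolumnNeck()
print(neck(feats).shape)  # torch.Size([1, 256, 52, 52])
```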
6. ECCV 2022 | CMU Proposes the First Fast Knowledge Distillation Visual Framework: ResNet50 80.1% Accuracy, 30% Training Acceleration
Original:
https://mp.weixin.qq.com/s/dFTBUtVSsUC7vgddM0NHqw
Paper and Project URL:
http://zhiqiangshen.com/projects/FKD/index.html
Code:
https://github.com/szq0214/FKD
Today, we introduce a paper on fast knowledge distillation from Carnegie Mellon University and collaborators, presented at ECCV 2022. With a basic training configuration, it trains ResNet-50 from scratch on ImageNet-1K to 80.1% accuracy (without data augmentation such as mixup or cutmix), saves over 16% of training time (especially helpful on clusters where data reading is slow), and is over 30% faster than previous SOTA algorithms, making it one of the best knowledge distillation strategies for both accuracy and speed. The code and models are fully open-sourced!
Knowledge distillation (KD) has had a huge impact on model compression, visual classification, and detection since it was proposed by Geoffrey Hinton et al. in 2015, spawning countless variants and extensions. These can broadly be divided into several categories: vanilla KD, online KD, teacher-free KD, and so on. Recent studies have shown that a simple, straightforward knowledge distillation strategy can achieve significant performance improvements, often exceeding many complex KD algorithms. However, vanilla KD has an unavoidable drawback: at every iteration, the training samples must be fed through the teacher model to generate soft labels. A large share of the computation is therefore spent on the teacher, which is usually much larger than the student and whose weights are frozen during training, so the learning efficiency of the whole knowledge distillation framework is low.
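For reference, here is a minimal sketch of the vanilla KD objective described above: a KL divergence between temperature-softened teacher and student outputs combined with the usual hard-label loss. The temperature and weighting values are illustrative assumptions, not the paper's settings.

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-label distillation loss plus hard-label cross-entropy.

    `teacher_logits` must come from a forward pass of the (frozen) teacher on
    the same augmented batch -- exactly the per-iteration overhead discussed above.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between softened distributions, scaled by T^2 as usual.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```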
To address this issue, the paper first analyzes why a single soft-label vector cannot simply be generated once per image and then reused across iterations. The fundamental reason lies in data augmentation, especially the random-resize-crop strategy: samples produced in different iterations, even when they come from the same image, are cropped from different regions, so a single soft-label vector cannot match them all. Building on this analysis, the paper proposes a fast knowledge distillation design: the crop parameters are encoded so that the corresponding soft labels can be stored and reused, and a sampling strategy allocates region coordinates when training the target network. This makes the whole training process explicitly teacher-free, and it is both fast (over 16%/30% training acceleration, especially friendly to clusters with slow data reading) and effective (80.1% accuracy on ImageNet-1K with ResNet-50, without additional data augmentation).
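A rough sketch of the soft-label caching idea just described (not the released FKD code): a one-off pass stores, for each image, several crop coordinates together with the teacher's (truncated) soft labels for those crops, and the student later replays a stored crop instead of querying the teacher. The crop sampling, top-k truncation, and return format are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_soft_labels(teacher, image, n_crops=4, topk=10, size=224):
    """One-off teacher pass: store crop coordinates + truncated soft labels."""
    teacher.eval()
    entries = []
    _, h, w = image.shape
    for _ in range(n_crops):
        # Illustrative random-resize-crop parameters (x, y, width, height).
        ch, cw = random.randint(h // 2, h), random.randint(w // 2, w)
        y, x = random.randint(0, h - ch), random.randint(0, w - cw)
        crop = F.interpolate(image[None, :, y:y + ch, x:x + cw],
                             size=(size, size), mode="bilinear",
                             align_corners=False)
        probs = F.softmax(teacher(crop), dim=1)[0]
        vals, idxs = probs.topk(topk)  # keep only top-k entries to save storage
        entries.append({"coords": (x, y, cw, ch), "probs": vals, "index": idxs})
    return entries  # persisted to disk in practice, e.g. one record per image

def sample_cached_label(entries):
    """During student training: replay a stored crop and its soft label,
    so no teacher forward pass is needed."""
    e = random.choice(entries)
    return e["coords"], e["index"], e["probs"]
```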
7. DenseNet, MobileNet, DPN… Have You Mastered Them All? A Summary of Classic Models Essential for Image Classification (Part II)
Original:
https://mp.weixin.qq.com/s/8Svli4dmU3Hyq-932_SXgQ
This article will be serialized in 3 parts, introducing a total of 15 classic models that have achieved SOTA in image classification tasks.
- Part 1: AlexNet, VGG, GoogLeNet, ResNet, ResNeXt
- Part 2: DenseNet, MobileNet, SENet, DPN, IGC V1
- Part 3: Residual Attention Network, ShuffleNet, MnasNet, EfficientNet, NFNet
Image classification is one of the most classic tasks in the field of computer vision, aiming to map the input image to predefined semantic categories, i.e., labeling it with category tags. Traditional image classification methods consist of steps such as low-level feature learning, feature encoding, spatial constraints, classifier design, and model fusion.
First, features are extracted from the images. Classic feature extraction methods include HOG (Histogram of Oriented Gradients), LBP (Local Binary Patterns), and SIFT (Scale-Invariant Feature Transform), and multiple features can be fused to retain more useful information. The features are then encoded to remove redundancy and noise; classic encoding methods include sparse coding, locality-constrained linear coding, and Fisher vector coding. Next, spatial constraints are applied to aggregate the features, as in classic spatial pyramid matching. Finally, a classifier performs the classification; classic classifiers include SVMs and random forests.
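A minimal sketch of this traditional pipeline on a toy dataset, assuming scikit-image and scikit-learn are available; the dataset, HOG parameters, and SVM settings are only illustrative.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from skimage.feature import hog

# Classic pipeline: hand-crafted features + shallow classifier.
digits = datasets.load_digits()                       # 8x8 grayscale digit images
features = [hog(img, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
            for img in digits.images]                 # step 1: feature extraction
X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.2, random_state=0)

clf = svm.SVC(kernel="rbf", C=10.0)                   # final step: the classifier
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```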
Alex Krizhevsky’s CNN model proposed at ILSVRC 2012 was the first to apply deep learning to large-scale image classification tasks, achieving far superior results compared to traditional image classification methods and winning the championship at ILSVRC 2012, marking the beginning of deep learning model applications in image classification. This model is the famous AlexNet. Since then, image classification has shifted its research and application path from focusing on extracting effective features and improving classifier effectiveness to studying different deep learning model architectures.
Deep convolutional neural networks have brought a series of breakthroughs to image classification. Deep neural networks integrate low, medium, and high-level features, which can be trained end-to-end, and the richness of features can be enhanced through the depth of the network.
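By contrast, here is a minimal end-to-end sketch with a torchvision ResNet, where feature extraction and classification are learned jointly from pixels; the random tensors stand in for a real data loader and the 10-class head is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

# End-to-end deep model: low/mid/high-level features and the classifier are trained jointly.
model = models.resnet50(weights=None)            # or load pretrained ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 10)   # illustrative 10-class head

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)             # stand-in for a real data loader
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```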
8. From Transformers to Diffusion Models: Understanding Sequence Modeling-Based Reinforcement Learning Methods
Original:
https://mp.weixin.qq.com/s/_6Ub-hhrWmAw-bpKBEs_Nw
Large-scale generative models have brought tremendous breakthroughs to natural language processing and even computer vision over the past two years. This trend has also reached reinforcement learning, especially offline reinforcement learning (offline RL): methods such as Decision Transformer (DT), Trajectory Transformer (TT), Gato, and Diffuser treat reinforcement learning data (states, actions, rewards, and return-to-go) as a stream of unstructured sequence data and make modeling these sequences the core learning task. Such models can be trained with supervised or self-supervised learning, avoiding the unstable gradient signals common in traditional reinforcement learning, and even without complex policy improvement or value estimation methods they have demonstrated excellent performance in offline reinforcement learning.
This article will briefly discuss these sequence modeling-based reinforcement learning methods, and in the next article, I will introduce our newly proposed Trajectory Autoencoding Planner (TAP), a method that uses Vector Quantized Variational AutoEncoder (VQ-VAE) for sequence modeling and efficient planning in latent action space.
Transformers and Reinforcement Learning
Since the Transformer architecture was proposed in 2017, it has gradually sparked a revolution in natural language processing. Subsequent models such as BERT and GPT-3 pushed the combination of self-supervision and Transformers to new heights, with properties such as few-shot learning continually emerging in natural language processing, and the architecture has also expanded into areas such as computer vision.
For reinforcement learning, however, this process was not particularly evident until 2021. Multi-head attention mechanisms were introduced into reinforcement learning in 2018, mainly in attempts to solve generalization problems in sub-symbolic domains, but such attempts remained lukewarm afterward. In my personal experience, Transformers have not shown a stable, overwhelming advantage in reinforcement learning and are difficult to train. In a 2020 work where we used a Relational GCN for reinforcement learning, we also privately tried Transformers, but they were significantly worse than traditional structures (such as CNNs), and it was hard to train a usable policy. Why Transformers are a poor fit for traditional online reinforcement learning remains an open question; Melo, for example, argued that the standard parameter initialization of Transformers is unsuitable for reinforcement learning, but I won’t elaborate further here.
In mid-2021, the publication of Decision Transformer (DT) and Trajectory Transformer (TT) sparked a new wave of Transformer applications in reinforcement learning. The idea behind these two works is quite straightforward: if Transformers and online reinforcement learning algorithms don’t fit well together, why not treat reinforcement learning as a self-supervised learning task? Riding the wave of offline reinforcement learning, both works focused their main task on modeling offline datasets and then applied the resulting sequence model to control and decision-making.
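A conceptual sketch (not the official DT/TT code) of this sequence-modeling framing: offline trajectories are flattened into (return-to-go, state, action) token triples and a causal Transformer is trained, in a purely supervised way, to predict the logged actions. The dimensions, the simple MSE loss, and the use of a vanilla Transformer encoder with a causal mask are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    """Offline RL as sequence modeling: predict actions from (RTG, state, action) tokens."""

    def __init__(self, state_dim, act_dim, d_model=128, n_layer=3, n_head=4, max_len=100):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_time = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layer)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        pos = self.embed_time(timesteps)                      # (B, T, d_model)
        tokens = torch.stack([self.embed_rtg(rtg) + pos,
                              self.embed_state(states) + pos,
                              self.embed_action(actions) + pos],
                             dim=2).reshape(B, 3 * T, -1)     # ..., R_t, s_t, a_t, ...
        causal = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.transformer(tokens, mask=causal)
        return self.predict_action(h[:, 1::3])                # predict a_t from the s_t token

# Purely supervised target: the logged actions from the offline dataset.
model = TinyDecisionTransformer(state_dim=17, act_dim=6)
rtg, s, a = torch.randn(2, 10, 1), torch.randn(2, 10, 17), torch.randn(2, 10, 6)
t = torch.arange(10).repeat(2, 1)
loss = ((model(rtg, s, a, t) - a) ** 2).mean()                # behavior-cloning-style loss
```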
9. Tips | A Series of Operations on Sample Imbalance
Original:
https://mp.weixin.qq.com/s/4_zz0xXmHKLZ4X1II4_-Gg
The problem of sample imbalance is very common: one category has far more instances than the others; for example, the number of exposures that lead to a conversion is far lower than the number that do not. Severe imbalance hurts the model’s performance and even distorts our judgment of its quality, because the model is accurate on the high-proportion categories while its estimates for the low-proportion categories are heavily biased. Since the high-proportion categories dominate the loss/metric, we may believe we have obtained a fairly good model. In anomaly detection, for instance, simply predicting “no anomaly” for every sample already yields a high accuracy rate.
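A tiny illustration of that last point, assuming scikit-learn: on a dataset with 1% anomalies, a classifier that always predicts “normal” scores 99% accuracy while recalling none of the anomalies.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1% anomalies: a do-nothing classifier looks great on accuracy alone.
y = np.array([1] * 10 + [0] * 990)         # 1 = anomaly, 0 = normal
X = np.zeros((1000, 3))                    # features are irrelevant here

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print(accuracy_score(y, pred))             # 0.99
print(recall_score(y, pred, pos_label=1))  # 0.0 -- every anomaly is missed
```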
Resampling
This is currently the most frequently used approach: either downsample the “majority” class or oversample the “minority” class (see the figure in the original article).
The drawbacks of resampling are also quite evident: oversampling can lead to overfitting on the minority samples, while downsampling discards a lot of information.
There are many resampling schemes. The simplest is random oversampling/downsampling so that each category ends up with roughly the same number of samples. There are also more elaborate methods, such as first clustering the samples and then downsampling each cluster of the class that needs to be reduced, which loses less information. For oversampling, instead of simply copying samples, some “noise” can be added to generate new ones.
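A minimal sketch of random over/under sampling with the imbalanced-learn library (assumed installed); SMOTE is included as one way to generate new minority samples instead of plainly copying them. The dataset and class weights are illustrative.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Randomly duplicate minority samples / randomly drop majority samples.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# SMOTE synthesizes new minority points by interpolating between neighbors.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_over), "undersampled:", Counter(y_under))
```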
Tomek Links
A Tomek link is a pair of samples that are each other’s nearest neighbors in feature space but belong to different classes. Removing the majority-class sample of each such pair helps the classifier learn a cleaner decision boundary (see the figure in the original article).
NearMiss
NearMiss is a downsampling method that removes less useful majority-class points based on distance calculations.
- NearMiss-1: selects the majority-class samples whose average distance to their three nearest minority-class samples is smallest.
- NearMiss-2: selects the majority-class samples whose average distance to the three farthest minority-class samples is smallest.
- NearMiss-3: for each minority-class sample, keeps a given number of the closest majority-class samples.
NearMiss-1 considers the average distance to the three nearest minority-class samples and is therefore local; NearMiss-2 considers the average distance to the three farthest minority-class samples and is therefore global. Because NearMiss-1 is local and based on average distance, the majority-class samples it keeps are themselves “imbalanced”: it tends to keep more majority samples near dense clusters of minority samples and fewer near isolated (outlier) minority samples. NearMiss-3 guarantees that every minority-class sample is surrounded by enough majority-class samples, which tends to give the model high precision but low recall.
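A minimal sketch of both cleaning strategies with imbalanced-learn (assumed installed); the dataset and parameters are illustrative.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, NearMiss

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Remove the majority-class member of every Tomek link
# (a cross-class pair of mutual nearest neighbors).
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# NearMiss-1: keep the majority samples closest, on average,
# to their 3 nearest minority samples.
X_nm, y_nm = NearMiss(version=1, n_neighbors=3).fit_resample(X, y)

print("original:", Counter(y))
print("after Tomek links:", Counter(y_tl))
print("after NearMiss-1:", Counter(y_nm))
```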
Beyond resampling, there are many other tactics, such as choosing better evaluation metrics, adding class-weighted penalty terms, using algorithms suited to imbalance, and K-fold or repeated-sampling training schemes. Interested readers can follow the original link to learn more!
END