This article is based on a speech by Zhang Xianyi and has not been reviewed by the speaker.
Embedded AI combines hardware, software, and algorithms to deliver cost-effective, high-performance, low-power solutions for scenarios such as facial recognition, with better interactivity as the goal.
What is embedded artificial intelligence, or edge intelligence? It does not mean training deep learning models on the device: models are still trained on servers. The difference lies in deployment, which can happen either in the cloud or on edge devices, and the two differ significantly. Cloud deployment is common: many APIs upload an image over the internet, run the computation in the cloud, and return the result. With embedded deployment, by contrast, the trained model is shipped to a mobile phone, enabling on-device image recognition and processing; it can likewise be deployed to drones or smart cameras.
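To make the contrast concrete, here is a minimal Python sketch of the two deployment styles. The API endpoint and model file names are illustrative assumptions, not systems mentioned in the talk.

```python
import requests
import numpy as np
import tflite_runtime.interpreter as tflite

def recognize_in_cloud(image_bytes: bytes) -> dict:
    # Cloud deployment: upload the image over the network and let the
    # server run inference and return the result. (Hypothetical endpoint.)
    resp = requests.post("https://api.example.com/v1/recognize",
                         files={"image": image_bytes}, timeout=5)
    return resp.json()

def recognize_on_device(image: np.ndarray) -> np.ndarray:
    # Edge deployment: the already-trained model ships with the app and
    # inference runs locally, with no network round trip.
    interpreter = tflite.Interpreter(model_path="face_model.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], image.astype(np.float32))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])
```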
Take Douyin as an example: it offers beauty-filter live streaming. As a face moves, the filter follows it, along with many special effects; this is a typical embedded-AI application. Why not record the video and upload it to the cloud for processing? Because cloud processing would make the interaction very poor: users could not see the effects in real time, and it would also cost more. All effects are therefore computed on the phone, which gives much better interactivity. Three algorithms are chiefly involved: first, face detection, to locate the face; second, facial landmark detection, to find key points such as the nose and eyes; third, rendering overlays such as glasses or cat stickers that stay attached to the face however it moves or rotates.
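The three stages chain together roughly as follows; `detector`, `landmarker`, and `sticker` are hypothetical interfaces standing in for the actual models, which the talk does not name.

```python
import numpy as np

def apply_sticker(frame: np.ndarray, detector, landmarker, sticker) -> np.ndarray:
    # Stage 1: face detection -- find a bounding box for each face.
    for box in detector.detect(frame):
        # Stage 2: landmark detection -- locate key points (eyes, nose, ...)
        # within the detected box.
        points = landmarker.predict(frame, box)
        # Stage 3: overlay -- warp the sticker onto the face using the
        # landmarks, so it follows the face as it moves and rotates.
        frame = sticker.render(frame, points)
    return frame
```

Running this per camera frame keeps everything on the device, which is what makes the real-time interaction possible.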
However, actually implementing such embedded systems raises several challenges, the foremost being inference speed. Some models are too large to run on a mobile device, often taking several seconds to process a single image, or they consume so much power that they drain the phone's battery quickly; these are challenges we have encountered ourselves. Tackling them generally requires effort across hardware, software, and algorithms before embedded AI can be deployed successfully.
Let's first introduce a cost-effective ARM SoC AI solution. Taking facial recognition as an example, a cost-effective facial recognition device supports a local library of 20,000 faces, and in network snapshot mode it can support a library of 50,000 faces, with recognition time under 0.2 seconds. As we know, frameworks such as Caffe and TensorFlow were built by large companies for training models on servers. Installing a trained model directly onto a mobile device, however, is resource-intensive and cumbersome.
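As one concrete example of such a toolchain (the talk names TensorFlow for training but no specific mobile converter, so this choice is illustrative), TensorFlow's TFLite converter turns a server-trained model into a compact file an app can ship:

```python
import tensorflow as tf

# Convert a server-trained SavedModel into a compact flat-buffer format
# suitable for bundling inside a mobile app. The directory and output
# filename are placeholders.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable size/latency optimizations
tflite_model = converter.convert()

with open("face_model.tflite", "wb") as f:
    f.write(tflite_model)
```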
Currently there are forward inference frameworks designed specifically for embedded platforms, and they serve five main functions. First, device management: embedded systems are generally heterogeneous, containing not just CPUs but possibly GPUs, NPUs, or DSPs; mobile chips often include DSPs that can be used for deep learning computation. Second, beyond managing heterogeneous devices, model management. Third, memory management and storage formats, which must trade off memory usage against performance: mobile phones typically have ample memory, but smaller embedded or IoT-class devices have very little, so this requires careful consideration. Fourth, layer fusion, a performance optimization that merges adjacent layers to increase speed (sketched below). Fifth, selecting the implementation method. Convolution, for instance, is the crucial operation in AI inference, and there are three or four common ways to implement it (direct convolution, im2col plus GEMM, Winograd, and FFT-based approaches); low-level optimizations must be coordinated with the high-level framework and adjusted per model to achieve the best results. Performance evaluations can then show the outcome, and many platforms offer assessment tools.
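As a sketch of the layer-fusion idea, here is the standard folding of a BatchNorm layer into the preceding convolution, assuming weights laid out as (out_channels, in_channels, kH, kW); after fusion, the two layers execute as a single convolution at inference time.

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    # Per-output-channel scale derived from the BatchNorm statistics.
    scale = gamma / np.sqrt(var + eps)           # shape: (out_channels,)
    W_fused = W * scale[:, None, None, None]     # rescale each filter
    b_fused = (b - mean) * scale + beta          # fold the BN shift into the bias
    return W_fused, b_fused
```

The fused conv produces the same outputs as conv followed by BatchNorm, but with one fewer layer to schedule and one fewer pass over memory.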
Next, let's look at a high-performance FPGA AI solution. If the model is large but must be processed quickly, how can it be deployed? A typical approach is an FPGA-based AI implementation. We support two common SoC architectures and the FPGA development ecosystem. The hardware, combined with software tools, can quantize and compress trained models for deployment onto an AI acceleration architecture that executes them efficiently. The acceleration architecture is divided mainly into the PS (processing system) and PL (programmable logic) sides: in essence, a state machine controls the model while PEs (processing elements) execute the actual convolution operations, yielding better performance.
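To illustrate the division of labor, here is a software model, in Python, of a PS-side scheduler driving a PE array that computes convolution tiles; the tile size and data layout are assumptions for illustration, not the actual accelerator design from the talk.

```python
import numpy as np

TILE = 8  # hypothetical tile edge computed per PE-array invocation

def pe_array_conv_tile(inp, kernel, oy, ox, h, w):
    # Each (ty, tx) position models one PE performing multiply-accumulates
    # over a kernel-sized window of the input feature map.
    kh, kw = kernel.shape
    tile = np.zeros((h, w), dtype=np.int32)
    for ty in range(h):
        for tx in range(w):
            window = inp[oy + ty:oy + ty + kh, ox + tx:ox + tx + kw]
            tile[ty, tx] = int(np.sum(window.astype(np.int32) * kernel))
    return tile

def conv2d_tiled(inp, kernel):
    # PS-side "state machine": walk the output in tiles and invoke the
    # PE array once per tile, as the PL-side accelerator would.
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.int32)
    for oy in range(0, oh, TILE):
        for ox in range(0, ow, TILE):
            h, w = min(TILE, oh - oy), min(TILE, ow - ox)
            out[oy:oy + h, ox:ox + w] = pe_array_conv_tile(inp, kernel, oy, ox, h, w)
    return out
```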
This FPGA solution is fast: the platform can be clocked at up to 300 MHz, and on the 7100 chip it runs at 160 MHz, processing 60 frames per second with 95% utilization of the DSP units. The technology can be applied in drones and may lead to the development of customized chips in the future.
Beyond combining hardware and software for low-level optimization, there is much to do at the algorithm and model level. First, deep learning models must be optimized specifically for embedded AI. Second, we need model compression, the most common method being distillation: a complex model is trained on a server, and a smaller model is then trained against it, like a teacher instructing a student, so the small model approaches the large model's accuracy while using far less computation. Next comes quantization, for example converting 32-bit floating-point weights to 8-bit integers to shrink the model and speed up arithmetic. Finally, we keep developing new network structures to reduce the computational load further.
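A minimal numpy sketch of the distillation objective described above: the student is trained to match the teacher's temperature-softened output distribution. The temperature T is a standard hyperparameter, and the value below is illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Soften both output distributions with temperature T, then measure
    # how far the student is from the teacher with a cross-entropy term.
    p_teacher = softmax(teacher_logits / T)
    log_p_student = np.log(softmax(student_logits / T) + 1e-12)
    # Scaled by T^2, as in Hinton et al., so gradient magnitudes stay
    # comparable to the ordinary hard-label loss.
    return -np.sum(p_teacher * log_p_student, axis=-1).mean() * T * T
```

In practice this term is combined with the usual hard-label loss, and the small model inherits much of the teacher's accuracy at a fraction of the compute.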
In summary, implementing embedded AI spans hardware, software frameworks, and models. ARM SoC hardware offers high integration and good cost-performance, while FPGA platforms are well suited to high-performance industrial applications. Most importantly, customized models must be developed for each specific scenario to achieve the best results.