Complete Guide to Embedded AI Framework Tengine: Architecture, Operator Customization, and Engine Inference


Produced by | Smart Things Open Class

Instructor | Wang Haitao, Co-founder of OPEN AI LAB and Chief Architect of Tengine


Introduction:

On the evening of April 8, Smart Things Open Class held a special session on embedded AI frameworks presented by OPEN AI LAB, taught by co-founder Wang Haitao under the theme "Tengine – Challenges and Practices of Embedded AI Framework".

In this session, Mr. Wang Haitao analyzed the challenges faced by embedded AI and the solutions Tengine provides, and walked through the Tengine architecture and the composition of its inference API. Finally, he showed how to customize operators in Tengine and how to run inference on CPU/GPU/NPU/DLA.

This article is a compilation of the main lecture segment of this special session.

Hello everyone, I am Wang Haitao, and the theme I am sharing today is “Tengine – Challenges and Practices of Embedded AI Framework”, which is divided into the following parts:

1. Challenges faced by embedded AI and solutions provided by Tengine

2. Analysis of Tengine architecture

3. Introduction to Tengine API

4. Practice 1: Tengine extension, customization, and addition of operators

5. Practice 2: Inference of Tengine on CPU/GPU/NPU/DLA

Tengine is an embedded AI computing framework and a core product of our company. On the computing-power side it has done a great deal of work: we have established deep partnerships with many domestic chip manufacturers and adopted a variety of technical approaches to fully exploit the performance of the hardware. We are therefore committed to building an AI computing-power ecosystem platform, so that applications can conveniently tap the computing power of the underlying chips through Tengine. In addition, Tengine provides a series of products and toolkits that address the problems encountered when deploying trained algorithms at the edge, offering a standard, fast path that accelerates the landing of the whole AI industry. That is one of our goals, so Tengine will evolve into a development and deployment platform for AIoT rather than just an inference framework.

Challenges faced by embedded AI and solutions provided by Tengine


What problems does embedded AI face today? First, AI is penetrating daily life ever more deeply; it is fair to say that AI will become as ubiquitous as electricity. This trend has two main drivers: the growth of edge computing power, that is, the evolution of CPUs and the emergence of various NPUs; and the progress of the algorithms themselves, with networks becoming steadily lighter from the early VGG to Inception v1, v2, v3, and on to EfficientNet, which lets AI applications that once could only run on servers now run at the edge. Another factor is that people are increasingly concerned about data security and privacy, so more of them want AI computation to run locally and to avoid the cloud as much as possible. For edge AI, 2016 was the inaugural year, and the field is still in a phase of rapid growth today.

The above mainly covers demand on the application side. What about the industry chain? At present the situation is very unfriendly. The first manifestation is hardware diversity: the AIoT market is inherently diverse in hardware, and the emergence of all kinds of AI acceleration IP has made it even more so. Moreover, as AI spreads further, hardware that was once considered unlikely to run AI, such as MCUs, can now run some AI algorithms, so the diversity of hardware platforms keeps increasing.

The second manifestation is software diversity. There are many training frameworks, and the models they produce are sent to embedded platforms, where inference frameworks are likewise blooming: there are many of them, some native and some developed by third parties. For application developers, therefore, landing an algorithm on a platform is still a very long process.

That is the situation for application developers; for algorithm developers the problem is just as serious. A trained model has to be adapted to a computation-limited embedded platform, which may require adjusting the model, reducing its scale, and a great deal of work such as quantization before it can be deployed. If one wants to use the acceleration chip on the embedded platform, further adjustments may be needed for that chip: some operators may be unsupported and have to be replaced, or their definitions may differ. The whole ecosystem is therefore very unfriendly.

In such a fragmented environment, the working efficiency of the whole AI industry chain is very low. Chip companies excel at improving computing power, but they find that if they deliver only chips and simple drivers, many AI developers cannot use them, so they have to invest heavily in building upper-level development platforms and environments. Huawei in China, for example, has built an entire chain from chips to IDEs to deployment environments.

Algorithm and application companies likewise find that unless they adapt to the underlying hardware themselves, a model that performs very well during training may run very slowly or lose considerable accuracy once deployed on the platform. They therefore have to adapt and optimize the model personally before the algorithm can really be applied and landed. We feel that the division of labor across the industry chain is unclear and inefficient, and this is exactly what we are trying to improve.


How does Tengine empower the industry chain? First, it connects algorithms and chips: no matter which framework a model was trained with, Tengine can accept it and drive the underlying computing platform, while making sure the model runs well on that platform. Second, it provides toolkits and standard processes that help migrate models onto SoC platforms. To accomplish this, we think along three directions:

1) Openness: this covers two points. The first is open source: we hope more people will use Tengine, so we have open-sourced it. The second is modularity: because AI software and hardware are both still developing rapidly, we need a good way to support newly emerging algorithms, models, and hardware, so the whole software stack is built in a modular way.

2) Efficiency: computing power on SoC platforms is always a bottleneck, so making algorithms run fast and well on the platform is essential. We optimize at two levels: at the graph level, and with a high-performance computing library built entirely on our own intellectual property.

3) Connection: first, connect algorithms and chips, so that whichever side changes, the cost of bridging them stays as low as possible. We also have to support multiple computing devices and connect the scheduling and usage among them. These are the three things Tengine does as an inference framework to empower the industry.


How does Tengine assist AIoT application development? The figure above shows a standard algorithm-deployment process. On the far left is algorithm training, followed by the standard server-side workflow through an inference server. When the model needs to run on an SoC, the first step is to optimize the model and then quantize it. At this point accuracy and speed may drop, and the original model may need to be modified; this loop may have to be repeated many times. Once the model has been adjusted, it can be placed on an inference engine and run on the chip.

That covers only the algorithm-model part; a real AI application also includes data capture and result output. The figure also shows preprocessing and post-processing stages. Whether the data comes from a camera or another sensor, it must be preprocessed before being fed to the inference engine, and the engine's results need post-processing before the application can analyze them further. Quantization may introduce some accuracy loss; in scenarios that demand high accuracy, such as face recognition or payment, this loss must be minimized, which requires quantization retraining.

To accelerate the landing of AIoT, Tengine works on the areas marked by the blue boxes in the figure above: for the common mainstream training frameworks we provide a quantization retraining toolkit, and we also provide dedicated algorithm libraries to accelerate preprocessing and post-processing.

Analysis of Tengine Architecture

Next, I will briefly introduce the Tengine inference engine itself, in the following parts: a quick look at the overall architecture, then the support for training frameworks, which is crucial for deployment, and finally the execution of the computation graph, the high-performance computing library, and the supporting tools.


– Tengine Product Architecture

The Tengine product architecture is shown in the figure above. At the top are the API interfaces. We put careful thought into the APIs, because we believe API stability is the first priority in building the Tengine AI application ecosystem; it ensures the ecosystem can develop sustainably.

Below that is the model conversion layer, which converts models from the various mainstream training frameworks into Tengine models for running at the edge. Next come the supporting tools, including graph compilation, compression, tuning, and some simulation tools. We also provide commonly used algorithm libraries, covering preprocessing, post-processing, and domain-specific algorithms.

Next is the actual execution layer: graph execution and NNIR (Neural Network Intermediate Representation), Tengine's own graph representation, where memory optimization, scheduling, and encryption are implemented. Below that is the operating-system adaptation; Tengine currently supports RTOS and bare-metal scenarios, specifically to support very low-end CPUs. At the heterogeneous computing layer we already support CPU, GPU, MCU, NPU, and so on, and today we will also show usage on GPU and DLA.

– Training Framework and OS Adaptation

For training frameworks, we currently support TensorFlow, PyTorch, MXNet, and others; on the operating-system side essentially the whole Linux family is supported, as well as RTOS, with iOS and Windows planned for the future.

– Model Conversion Scheme

Model conversion essentially turns a model from another framework into a Tengine model. There are two ways to do it. The first is to use the conversion tool provided by Tengine for automatic conversion. The second requires the user to write some code: extract all parameters and data from the original model with a Python script, construct a Tengine NNIR graph through the Tengine interfaces, and finally save it as a Tengine model. Our conversion scheme therefore differs from others: the first step parses every model into Tengine's NNIR format, and the second step saves it as a Tengine model. In theory, Tengine NNIR can also be executed directly without any problem, and this approach enables more than conversion into Tengine's own format; for example, Tengine NNIR can be saved as a Caffe model, making Tengine a converter between model formats. Building a good model conversion tool is not easy: while defining Tengine NNIR we ran into many compatibility issues between frameworks, and even though there are various conversion tools on the market, they all have pitfalls.
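As a rough illustration of this flow, below is a minimal C API sketch that parses a Caffe model into Tengine NNIR and saves it back out as a Tengine model. It assumes the create_graph and save_graph interfaces accept the "caffe" and "tengine" format strings as in the open-source Tengine headers; verify the exact signatures against the tengine_c_api.h of your version before use.

```c
/* Sketch: loading a Caffe model into Tengine NNIR and saving it as a
 * Tengine model.  The "caffe"/"tengine" format strings and the variadic
 * create_graph/save_graph signatures should be checked against your
 * Tengine version. */
#include <stdio.h>
#include "tengine_c_api.h"

int convert_caffe_to_tengine(const char* proto, const char* caffemodel,
                             const char* out_file)
{
    if (init_tengine() != 0)
        return -1;

    /* Parse the Caffe model directly into a Tengine NNIR graph. */
    graph_t graph = create_graph(NULL, "caffe", proto, caffemodel);
    if (graph == NULL)
    {
        fprintf(stderr, "failed to load the caffe model\n");
        release_tengine();
        return -1;
    }

    /* Serialize the NNIR graph as a Tengine model file. */
    int ret = save_graph(graph, "tengine", out_file);
    if (ret != 0)
        fprintf(stderr, "failed to save the tengine model\n");

    destroy_graph(graph);
    release_tengine();
    return ret;
}
```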

– Tengine Model Visualization Tool

After a model has been converted with Tengine, we need to check whether the result is correct and meets expectations. We have worked with Netron, the open-source community's model visualization tool, to support Tengine models; if you download the latest version of Netron, it already supports Tengine well.


The example above shows the first layers of SqueezeNet, with the attributes of the first Convolution. As shown on the right of the figure, its input is a data tensor, plus a filter and a bias. Because Tengine optimizes the graph, the attributes contain an extra activation field; an activation value of 0 indicates a ReLU activation, meaning the Convolution and ReLU have been fused together. This tool clearly displays the result of a Tengine model conversion, which makes it very useful.

– Computation Graph Execution

For execution of the computation graph, as mentioned earlier, every loaded model, whether it comes from another framework or is already a Tengine model, is converted into Tengine NNIR, which has to mask the differences between frameworks and ensure compatibility. Once we have the Tengine NNIR, we need to decide which nodes will execute on which devices. If the system has only one device, the problem is easy. With multiple computing devices it becomes harder. First, the devices may have different characteristics: one may be faster but support fewer operators, another slower but with better operator coverage. The strategy is then either to split the work across devices or to concentrate it on one device. That is the static case; the dynamic case is even more complex.

Suppose device A appears idle, but by the time I assign it a task it has become busy; so far there is no particularly good solution to this. Here we stay close to the OpenVX approach, in which every computing device is given a priority according to demand. Given a computation graph, each node is offered to the devices in priority order to ask whether they can support it; once one device answers yes, the node is not offered to the others. Our solution is similar, except that we let all computing devices examine the computation graph instead of stopping at the first.
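To make the allocation policy concrete, here is a purely conceptual sketch of the priority-based assignment just described. The types and the device_supports query are hypothetical; this is not Tengine's internal implementation.

```c
/* Conceptual illustration of priority-based node allocation.
 * The types and device_supports() are hypothetical placeholders. */
typedef struct { const char* name; int priority; } device_t;
typedef struct { const char* op_name; const device_t* assigned; } node_desc_t;

/* hypothetical capability query: can this device execute this operator? */
extern int device_supports(const device_t* dev, const node_desc_t* node);

static void assign_nodes(node_desc_t* nodes, int node_num,
                         const device_t* devs, int dev_num)
{
    for (int i = 0; i < node_num; i++)
    {
        /* Devices are assumed sorted by descending priority; in the
         * OpenVX-style scheme the first device that claims a node gets it,
         * while Tengine lets every device examine the graph before the
         * final split is decided. */
        for (int j = 0; j < dev_num; j++)
        {
            if (device_supports(&devs[j], &nodes[i]))
            {
                nodes[i].assigned = &devs[j];
                break;
            }
        }
    }
}
```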


In this way we determine which nodes of the graph run on which devices. Once that is decided, the graph can be divided into several parts. The diagram above shows an example split into four parts with certain dependencies among them: some subgraphs can execute in parallel and others cannot. Assuming the yellow and blue parts can run in parallel, they form an execution graph, which is then scheduled onto the corresponding devices. That is roughly the whole computation process.

– Maximizing Hardware Performance in Computation

When running on a device, we need to squeeze out as much computing power as possible. We have optimized the computing library extensively, focusing on the most time-consuming operators in a neural network: convolution, fully connected layers, and pooling. Depending on the parameters, convolution is implemented in several modes, such as GEMM-based, direct, and Winograd. In general, the library's dual-thread speedup reaches about 175% and four-thread performance about 300%, with some networks doing even better, approaching 400% on four threads.

The optimization targets the CPU microarchitecture. When running a complete network, the MAC utilization of convolution can reach 70%, essentially exhausting the hardware. The library not only supports FP32 well but is also well optimized for INT8, fully exploiting the hardware's computing power: INT8 achieves a 50-90% speedup over FP32. INT8 can, however, introduce some accuracy loss, so the library also supports mixed-precision computation, where layers with significant accuracy loss run in FP32 while the rest run in INT8, balancing accuracy and speed. The FP32 performance of the open-source version is also quite good, with SqueezeNet taking less than 60 milliseconds.

– Quantization Retraining Tool

As mentioned earlier, INT8 is now practically a must-have for edge inference, but its accuracy has long been a concern. Among the mainstream training frameworks, apart from TensorFlow, which has its own quantization retraining support, none currently offers a complete quantization retraining solution. To address this we developed a quantization retraining tool, with improvements aimed at NPUs as well: NPU chip manufacturers are not particularly familiar with training, and we hope to help them improve quantization accuracy on their NPUs so that more applications can run there.


As shown on the right side of the figure, the results of quantization retraining show that the overall effect is very good. Especially for MxNet_MobileNet, the accuracy after retraining can increase by two percentage points.

Besides the toolchain and quantization retraining, Tengine also does a lot of work on preprocessing and post-processing. There is a module called HCL.Vision, a computing library for CV applications. It differs from other CV libraries in that, if the hardware platform has image-processing acceleration, it calls the platform's underlying interface; if not, it falls back to an optimized CPU implementation, while exposing a single unified interface. This is very convenient for application developers: on a platform with hardware acceleration they get that acceleration without changing any interface calls.

Introduction to Tengine API

Next, we will enter the Tengine API section, first introducing our thoughts on API design and then introducing some current applications of the API.

– Overview

When we started developing Tengine we attached great importance to the API, and we studied and drew on excellent software solutions and frameworks in the industry, mainly the design principles and ideas behind the Android NN, OpenVX, TensorRT, TensorFlow, and MXNet interfaces, ultimately leaning towards OpenVX.

We have two principles. The first is that the definition of the interface is completely decoupled from its implementation. This is because AI software and hardware are still developing rapidly; if the interface definition were bound to particular implementations, refactoring the code or supporting new features would carry a heavy burden, so we separate the two completely.

The second is to emphasize both the stability and the flexibility of the interface. Stability is crucial for any API. Why emphasize flexibility? As above, we cannot predict the future, so we should leave as many hooks as possible to support future features and give applications more control over the lower layers, which benefits the stability and long-term development of the Tengine application ecosystem.

We have also kept thinking about how Tengine should support future needs. Doing some model training at the edge is a visible trend, so the interface needs to support model training. Essentially, Tengine connects algorithms and chips, providing a standard interface that offers chip computing power to applications, which is also valuable for our conversion and training solutions. Finally, the result: we use C as the core API and wrap friendlier APIs on top of it, such as the C++ API and the Python API. These two have been implemented, and JS and Java APIs are planned, all built on the C API. That is an overview of the whole API system.


– Multi-chip and Software Backend Support

Entering the AIoT market, we know there will be more and more AI acceleration chips and AI IP. How can the workload of migrating an application between different chips or AI acceleration IPs be minimized? In our solution, applications call all of this computing power through one consistent interface. As shown at the far left of the figure, inference can run on an RK3288 or a Raspberry Pi under Linux by loading the Tengine model. The first line on the right targets an MCU platform, which may require switching to a new, tiny model rather than the Tengine model. For an NPU, only the format of the intermediate model needs to change; the rest basically stays the same. This reduces application maintenance complexity and learning cost.

Developers building on our API find that it greatly reduces complexity. For example, an application that needs to run on both Android and Linux can simply call the Tengine interface and still make good use of the underlying hardware's computing power. Android is more complicated because Google has introduced the Android NN API, which did not exist before. Suppose an application wants to call that interface, but the platform does not ship Android NN; what then?

In fact, an intermediate solution is possible: keep calling the Android NN API, but re-implement the Android NN runtime on top of Tengine so that it drives the underlying computing power. Your program then never has to worry about switching interfaces across platforms. From these scenarios, the greatest value accrues to the underlying chip companies: they only need to adapt to Tengine, and their computing power becomes available to applications across all of these scenarios.

– C++ API and Python API

Next, the C++ API and the Python API. They are designed for quick and easy use, so many functions of the C API have been stripped out. A typical usage has four steps: load the model, set the input data, run, and retrieve the output. The whole flow is very intuitive and easy to understand, with almost no learning cost.

The Python API mirrors the C++ API, just expressed in Python, so the same things can be called; I will not elaborate further.


By contrast, the C API, as the core API, is relatively complex. It falls into several major categories. The first covers Tengine's initialization and destruction, followed by the interfaces corresponding to Tengine's NNIR. Tengine has the concepts of Graph, Node, and Tensor, where Graph and Node are similar to their counterparts in TensorFlow. The graph interfaces include creating a Graph and creating Nodes, so a complete graph can be built through the Tengine interface alone; this is also what supports the model conversion and training functions. Next come the execution interfaces for the graph, including prerun and postrun, and the rest are miscellaneous interfaces, such as setting the log output and binding an execution device to a graph or node. More details can be found in the linked documentation, where each API has fairly clear annotations.

Next, a typical inference flow with the C API, similar to what we saw earlier. The only slightly troublesome part is that you must explicitly call the prerun interface, which binds the computation graph to a device and allocates the resources the device needs for computation. Then you set the tensor's buffer and run.
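Putting the steps together, the following is a minimal sketch of that flow with the C API. The model file name and input shape are placeholders, and the exact signatures should be checked against the tengine_c_api.h header of your release.

```c
/* Minimal inference sketch with the Tengine C API.  The model path and
 * input shape are placeholders; verify signatures against tengine_c_api.h. */
#include <stdio.h>
#include <stdlib.h>
#include "tengine_c_api.h"

int main(void)
{
    init_tengine();

    /* 1. Load a Tengine model. */
    graph_t graph = create_graph(NULL, "tengine", "model.tmfile");

    /* 2. Bind an input buffer to the input tensor. */
    int dims[4] = {1, 3, 224, 224};              /* NCHW, placeholder shape */
    int buf_size = 1 * 3 * 224 * 224 * (int)sizeof(float);
    float* input = (float*)malloc(buf_size);
    /* ... fill input with preprocessed image data ... */

    tensor_t in_tensor = get_graph_input_tensor(graph, 0, 0);
    set_tensor_shape(in_tensor, dims, 4);
    set_tensor_buffer(in_tensor, input, buf_size);

    /* 3. prerun binds the graph to a device and allocates its resources,
     *    then run the graph (1 = blocking call). */
    prerun_graph(graph);
    run_graph(graph, 1);

    /* 4. Read back the output. */
    tensor_t out_tensor = get_graph_output_tensor(graph, 0, 0);
    float* out_data = (float*)get_tensor_buffer(out_tensor);
    int out_num = get_tensor_buffer_size(out_tensor) / (int)sizeof(float);
    printf("output[0] = %f (of %d values)\n", out_data[0], out_num);

    /* Clean up. */
    postrun_graph(graph);
    destroy_graph(graph);
    free(input);
    release_tengine();
    return 0;
}
```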

– Creating Graphs and Nodes with C API

First, create an empty graph, which represents nothing and comes from no framework. Next, create an input node, which involves creating two things: a Node and a Tensor, with the Tensor set as the node's output tensor. This idea is fully consistent with TensorFlow, so it will look familiar. Then create a Convolution node: as with the input node, first create the node and specify that its operator is Convolution, then create its output tensor. After that, set its input tensors 0, 1, and 2, which are the input tensor, the weight tensor, and the bias tensor respectively. Convolution has many parameters, such as kernel size, stride, and padding; we provide interfaces to set node attributes directly from the program, as sketched below.
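The sequence described above might look roughly like the following sketch. The operator names (InputOp, Const, Convolution) and attribute names (kernel_h, stride_h, pad_h, and so on) follow the conventions of the open-source Tengine operator definitions but may differ between versions, so treat this as an outline rather than a drop-in program.

```c
/* Sketch: building an input node and a Convolution node through the C API.
 * Operator and attribute names follow open-source Tengine conventions and
 * may differ between versions. */
#include "tengine_c_api.h"

graph_t build_small_graph(void)
{
    /* An empty graph that belongs to no source framework. */
    graph_t graph = create_graph(NULL, NULL, NULL);

    /* Input node: create a node plus a tensor, and set the tensor as the
     * node's output tensor. */
    node_t input_node = create_graph_node(graph, "data", "InputOp");
    tensor_t input_tensor = create_graph_tensor(graph, "data", TENGINE_DT_FP32);
    set_node_output_tensor(input_node, 0, input_tensor, TENSOR_TYPE_INPUT);

    int in_dims[4] = {1, 3, 224, 224};
    set_tensor_shape(input_tensor, in_dims, 4);

    /* Weight and bias live on Const nodes whose output tensors feed the
     * convolution; their buffers would be attached with set_tensor_buffer. */
    node_t w_node = create_graph_node(graph, "conv1/weight", "Const");
    tensor_t w_tensor = create_graph_tensor(graph, "conv1/weight", TENGINE_DT_FP32);
    set_node_output_tensor(w_node, 0, w_tensor, TENSOR_TYPE_CONST);

    node_t b_node = create_graph_node(graph, "conv1/bias", "Const");
    tensor_t b_tensor = create_graph_tensor(graph, "conv1/bias", TENGINE_DT_FP32);
    set_node_output_tensor(b_node, 0, b_tensor, TENSOR_TYPE_CONST);

    /* Convolution node: inputs 0/1/2 are the data, weight and bias tensors. */
    node_t conv_node = create_graph_node(graph, "conv1", "Convolution");
    tensor_t conv_out = create_graph_tensor(graph, "conv1", TENGINE_DT_FP32);
    set_node_output_tensor(conv_node, 0, conv_out, TENSOR_TYPE_VAR);

    set_node_input_tensor(conv_node, 0, input_tensor);
    set_node_input_tensor(conv_node, 1, w_tensor);
    set_node_input_tensor(conv_node, 2, b_tensor);

    /* Convolution attributes set directly on the node. */
    int kernel = 3, stride = 1, pad = 1;
    set_node_attr_int(conv_node, "kernel_h", &kernel);
    set_node_attr_int(conv_node, "kernel_w", &kernel);
    set_node_attr_int(conv_node, "stride_h", &stride);
    set_node_attr_int(conv_node, "stride_w", &stride);
    set_node_attr_int(conv_node, "pad_h", &pad);
    set_node_attr_int(conv_node, "pad_w", &pad);

    /* Declare the graph's input and output nodes. */
    const char* inputs[] = {"data"};
    const char* outputs[] = {"conv1"};
    set_graph_input_node(graph, inputs, 1);
    set_graph_output_node(graph, outputs, 1);

    return graph;
}
```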

Practice 1: Tengine Extension, Customization, and Addition of Operators

The interfaces have been basically introduced; now let’s look at how we extend and customize operators with specific code.

– Custom Kernel for Tengine Operators

First, customizing an operator kernel. If an operator is already implemented in Tengine but behaves differently from the framework you trained with, or you have a better implementation, or your platform has hardware that can accelerate this operator and you do not want to use the implementation we provide, there are two ways to handle it. The first is to use the custom-kernel interface of the C API to replace the kernel, specifying which node to replace it on. This path has a lower learning cost because it does not require understanding how to add a new operator to Tengine; you only supply the computation.

The second is to implement it as a plugin: re-implement the operator in an external module and, during registration, give that implementation a higher priority. Once that is done, every instance of that operator type will use your implementation.

Next, we will first introduce how to use the Tengine C API for replacement, and the way to add a new operator using Plugin will be explained in a later example.


The Tengine C API provides a custom-kernel interface whose key data structure is custom_kernel_ops, which defines what a custom kernel must implement; its layout is shown in the figure above. First comes op, indicating which operator it implements, such as convolution or pooling. Next are kernel_param and kernel_param_size, which carry parameters for our own kernel. Then come three key functions: prerun, run, and postrun, whose first parameter is the ops structure itself, followed by the input and output tensors; any other parameters have to be fetched from kernel_param.

After setting this up, we call set_custom_kernel to replace the node's implementation on the device. Let's look at the code.
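A sketch of what that replacement can look like is shown below. The real definitions of struct custom_kernel_ops and custom_kernel_tensor, and the exact callback signatures, are in tengine_c_api.h; the fields and signatures here follow the description above and should be verified against that header. The device name "cpu" and the pooling parameters are placeholders.

```c
/* Sketch: replacing one node's kernel through the custom-kernel interface.
 * Field layout and callback signatures must be checked against
 * tengine_c_api.h; device name and parameters are placeholders. */
#include <string.h>
#include "tengine_c_api.h"

struct my_pool_param            /* private parameters handed to the kernel */
{
    int kernel_size;
    int stride;
};

static int my_prerun(struct custom_kernel_ops* ops,
                     struct custom_kernel_tensor* inputs[], int input_num,
                     struct custom_kernel_tensor* outputs[], int output_num)
{
    /* allocate scratch memory, query shapes, warm up the accelerator, ... */
    return 0;
}

static int my_run(struct custom_kernel_ops* ops,
                  struct custom_kernel_tensor* inputs[], int input_num,
                  struct custom_kernel_tensor* outputs[], int output_num)
{
    struct my_pool_param* p = (struct my_pool_param*)ops->kernel_param;
    /* ... perform the actual pooling here, e.g. by calling the platform's
     *     hardware-accelerated pooling routine ... */
    (void)p;
    return 0;
}

static int my_postrun(struct custom_kernel_ops* ops,
                      struct custom_kernel_tensor* inputs[], int input_num,
                      struct custom_kernel_tensor* outputs[], int output_num)
{
    /* release whatever prerun allocated */
    return 0;
}

int attach_custom_pooling(graph_t graph, const char* node_name)
{
    static struct my_pool_param param = {2, 2};
    static struct custom_kernel_ops ops;

    memset(&ops, 0, sizeof(ops));
    ops.op = "Pooling";                /* which operator this kernel implements */
    ops.kernel_param = &param;
    ops.kernel_param_size = sizeof(param);
    ops.prerun = my_prerun;
    ops.run = my_run;
    ops.postrun = my_postrun;

    node_t node = get_graph_node(graph, node_name);
    /* Bind the custom kernel to this node on a given device. */
    return set_custom_kernel(node, "cpu", &ops);
}
```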

– Adding New Operators to Tengine

Adding new operators is also very important for an inference framework, especially for TensorFlow, whose operators are both numerous and complex. When you encounter an unsupported operator, we recommend first implementing it in plugin mode outside the Tengine repo; once it has been tested and is stable, you can submit a PR to merge it into the repo.

Below is an example of adding a new TensorFlow operator, Ceil. There are three steps: first, register the new operator definition in Tengine NNIR; second, register the operator's model loading in the Tengine TensorFlow serializer; finally, implement the operator in the Tengine executor. The code is shown below.
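The registration steps themselves follow the patterns of the existing operators in the Tengine repo, so they are not reproduced here; the computational core that the executor implementation ultimately needs is as simple as an elementwise loop, for example:

```c
/* The computational core of a Ceil executor implementation: an elementwise
 * ceiling over the input tensor (input and output have the same shape).
 * The surrounding registration code (NNIR operator definition, TensorFlow
 * serializer hook, executor node-ops registration) follows the pattern of
 * the existing operators in the Tengine repo and is omitted here. */
#include <math.h>

static void ceil_kernel_fp32(const float* input, float* output, int elem_num)
{
    for (int i = 0; i < elem_num; i++)
        output[i] = ceilf(input[i]);
}
```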

Practice 2: Inference of Tengine on CPU/GPU/NPU/DLA

Next, let's see how inference is done. We chose MobileNet-SSD, with three examples: the first runs purely on the CPU, the second combines CPU and GPU, and the third is a multi-device scenario. First we look at running on CPU and GPU; then, assuming there is an NPU or a DLA, how do we make use of it? Let's start with a demonstration.

As the demonstration shows, calling the NPU is very easy, which greatly lowers the learning cost and speeds up application development.
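For reference, a sketch of how the same graph can be steered onto a particular backend is shown below. The device name strings accepted by set_graph_device depend on which backend drivers are compiled into the Tengine build (for example an ACL GPU driver or an NPU driver), so the names here are placeholders.

```c
/* Sketch: binding an MSSD graph to a specific backend device.  The device
 * name string is a placeholder that depends on the drivers in your build. */
#include <stddef.h>
#include "tengine_c_api.h"

graph_t load_mssd_on_device(const char* model_file, const char* dev_name)
{
    graph_t graph = create_graph(NULL, "tengine", model_file);
    if (graph == NULL)
        return NULL;

    /* Bind the graph to the requested device; nodes the device cannot
     * handle are split off to other devices by the scheduler described
     * earlier. */
    if (dev_name != NULL)
        set_graph_device(graph, dev_name);

    prerun_graph(graph);
    return graph;
}

/* usage: load_mssd_on_device("mssd.tmfile", "acl_opencl") for a GPU path,
 * or pass NULL to stay on the default CPU device. */
```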
