The Impact of AI on Embedded Design


In the past two years, we have participated in numerous electronics competitions for college students. Sponsors of these competitions increasingly favor providing development boards with AI acceleration units, and organizers such as universities encourage students to integrate AI technology into their entries as much as possible; in some cases it has even become a hard requirement in the competition rules. For example, in the national college student electronic design contest we took part in at the end of last year, the sponsor Renesas provided boards that included an AI image-processing acceleration unit, and this feature was highlighted during the competition.

This trend is closely related to the larger trend of AIoT or edge AI. The integration of AI technology, particularly computer vision technology, in student competitions indicates that the barriers to adopting AI technology are not very high, suggesting that AI technologies represented by computer vision applications are maturing.

For embedded applications, DIY enthusiasts, and researchers, it has become increasingly easy to integrate AI technology into products or projects. After all, applications such as face detection, object presence detection, head counting, and other computer-vision applications are popular and becoming commonplace, while off-the-shelf sensors, sufficient processing power, and even the software and libraries that implement the algorithms are all maturing.

Even NVIDIA, which primarily dominates the cloud AI market, launched a $99 Jetson Nano developer kit at the GTC conference in 2019, which doesn’t quite fit our traditional understanding of NVIDIA’s style. Such products are valuable for AI-related algorithm research and testing AI prototype products.


Moreover, in October last year, NVIDIA CEO Jensen Huang launched a Jetson Nano developer kit with 2GB of RAM priced at $59, further lowering the barrier to entry for AI (and CUDA). NVIDIA itself has used the slogans “AI for Everyone” (4GB version) and “Discover AI” (2GB version) in its promotion of the Jetson Nano products.

This product line is particularly representative in the fields of edge AI and AIoT, allowing us to discuss the general trend of embedded design integrating AI technology, as well as the two components of this trend: performance and ecosystem.

When AI Technology Invades Embedded Design

Among similar development boards aimed at researchers, developers, tech enthusiasts, and students, the best known on the market is naturally the Raspberry Pi; comparable products include the Banana Pi, NanoPi, and others. However, relying solely on the Raspberry Pi’s hardware resources for AI development and research is still not very practical: general-purpose processors remain too inefficient for AI computation, especially for real-time ML applications. We have mentioned before that, from a larger-trend perspective, AI computation has become an increasingly important third capability beyond general-purpose computing and graphics computing, whether in the embedded field or in various edge devices.

Thus, products like NVIDIA’s Jetson Nano developer kit have emerged. Many chips and development board products aimed at specific fields have also integrated AI dedicated cores. For example, the aforementioned Renesas RZ/A2M development board features a DRP (Dynamically Reconfigurable Processor) acceleration unit—although its target market may be more specific—such products are common in the industrial sector.


More general-purpose development boards, even those without dedicated AI cores, have improved computational efficiency through support for Tengine, a lightweight modular neural-network inference engine; the Rockchip RK3399 board (ROCK960) is one example. The higher-end RK3399Pro adds a dedicated NPU, and boards such as the Amlogic A311D development board also integrate one. Further on, we will also encounter Google’s Edge TPU and Intel’s Myriad X…

What mainly distinguishes the Jetson Nano from these products is that it continues NVIDIA’s tradition of using the GPU for AI computation (and these boards are not cheap at all). Still, that the Jetson Nano, including the 2GB version, supports CUDA in such a compact form factor is quite remarkable (the current price on Taobao appears to be around 430 yuan).

Discussing AI Performance in Edge Computing

The main configuration of the Jetson Nano 2GB development board includes a CPU with four Arm Cortex-A57 cores at 1.43GHz and a GPU with 128 CUDA cores based on NVIDIA’s Maxwell architecture (the generation before the Pascal architecture of the GeForce 10 series), delivering half-precision floating-point performance of 472 GFLOPS. Compared with more specialized parts such as the Myriad X VPU and the NPUs mentioned above, this is still inferior; compared with CPUs, however, it is vastly superior, and NVIDIA’s hard-to-match AI development ecosystem largely offsets competitors’ advantages in raw hardware performance, as we will see when discussing the ecosystem below.
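As a sanity check on that 472 GFLOPS figure, the arithmetic works out if one assumes the commonly cited ~921.6 MHz GPU boost clock (not stated in this article) and one FP16 FMA pair per CUDA core per cycle:

```python
# Back-of-the-envelope check of the Jetson Nano's quoted 472 GFLOPS (FP16).
# Assumptions (not from the article): GPU boost clock ~921.6 MHz, each CUDA
# core retires one FMA per cycle (2 FLOPs), and FP16 doubles throughput on
# Maxwell via 2-wide half-precision packing.
cuda_cores = 128
gpu_clock_ghz = 0.9216   # assumed boost clock
flops_per_fma = 2        # multiply + add
fp16_packing = 2         # two FP16 operations per FP32 lane

gflops_fp16 = cuda_cores * gpu_clock_ghz * flops_per_fma * fp16_packing
print(f"{gflops_fp16:.0f} GFLOPS")  # ~472
```

The result, about 471.9, rounds to the advertised 472 GFLOPS.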

From this configuration, it is easy to see that Nano is more focused on low power consumption and low-cost IoT within NVIDIA’s Jetson family product line, primarily aimed at inference, although NVIDIA’s introduction suggests it can also be used for AI training.

Other specifications of this board include 2GB of 64-bit LPDDR4 with 25.6GB/s of bandwidth; video encoding up to 4K30, or four 1080p30 streams; video decoding up to 4K60, or eight 1080p30 streams; one MIPI CSI-2 camera interface (supporting Raspberry Pi camera modules, Intel RealSense, and many other camera modules, including those based on the IMX219 CIS); plus USB, HDMI, Gigabit Ethernet, a wireless adapter, and a microSD slot (for local storage). Other I/O details will not be elaborated here. The $99 4GB version of the Jetson Nano development kit, aside from having more memory, also offers richer I/O resources (including an additional M.2 slot for a wireless network card).

[Table: real-time inference performance of Jetson Nano, Raspberry Pi 3, Raspberry Pi 3 + Intel NCS2, and Google Edge TPU]

NVIDIA has a dedicated table that compares the real-time inference performance differences among several products under different models of video stream applications: Jetson Nano, Raspberry Pi 3, Raspberry Pi 3 + Intel Neural Compute Stick 2 (which integrates Myriad X), and Google’s Edge TPU development board (the comparison is for the 4GB version of Jetson Nano).

In fact, the crushing advantage over the Raspberry Pi and the slightly weaker showing against the Edge TPU are both expected: this is essentially the difference between general-purpose and specialized hardware (the lead over the Myriad X may reflect the strength of the ecosystem). Notably, many entries in this table are marked “DNR” (did not run), meaning the model could not run on that platform at all, whether due to limited memory capacity, unsupported network layers, or other hardware and software limitations. What NVIDIA wants to convey here is that the GPU’s generality is significantly better.

The performance of the 2GB version of Jetson Nano under mainstream models is shown in the following figure:

[Figure: Jetson Nano 2GB inference performance on mainstream models]

When AI Starts to Talk Ecosystem

When discussing generality, we must note that NVIDIA has almost no rival in ecosystem building in the AI field. NVIDIA has repeatedly pointed out that AI computing science is developing rapidly, researchers are quickly inventing new neural-network architectures, and AI researchers and practitioners use a wide variety of AI models in their projects. “Therefore, for learning and building AI projects, an ideal platform should be able to run various AI models, be sufficiently flexible, and provide enough performance to build meaningful interactive AI experiences.”

There is no need to elaborate on the mainstream ML frameworks, including TensorFlow, PyTorch, Caffe, Keras, and so on. The table above lists support for popular DNN models, and the Jetson Nano supports them all. Comprehensive framework support, memory capacity, a unified memory subsystem, and broad software support are the foundations of this generality. Moreover, beyond DNN inferencing, the CUDA architecture of the Jetson Nano can be used for a much wider range of computer vision, DSP, and other algorithms and operations.


It is particularly noteworthy that the models executed on the Jetson Nano used TensorRT. Those familiar with NVIDIA’s AI ecosystem will know that TensorRT is middleware promoted by NVIDIA: it takes a trained model as input and generates an optimized runtime engine for CUDA GPUs, and it is an important component of NVIDIA’s AI performance optimization.

As software, TensorRT is key to the higher frame rates and greater computational efficiency seen in the examples above. Researchers have tried compiling a Caffe model directly on the Jetson Nano 2GB without TensorRT optimization; it runs without much trouble, but the TensorRT-accelerated version is significantly more efficient.

TensorRT illustrates, on the one hand, NVIDIA’s claim that it is not (or not just) a chip company, and on the other, offers a glimpse of the ecosystem resources backing the Jetson Nano.

In simple terms, all hardware platforms in the Jetson series are compatible with the same software tools and SDKs. The software portion of the Jetson Nano development kit is the JetPack SDK, which includes the Ubuntu operating system along with the libraries needed to build end-to-end AI applications: OpenCV and VisionWorks for computer vision and image processing, and CUDA, cuDNN, and TensorRT for accelerating AI inferencing, among others. NVIDIA previously referred to these acceleration libraries collectively as the CUDA-X software stack (though the term seems to have been mentioned less this year).


Typical examples closely related to the Jetson Nano include NVIDIA DeepStream for intelligent video-stream analysis, Clara for medical imaging, genomics, and patient monitoring, and Isaac for robotics. There is also a wealth of pre-trained models from NVIDIA that, after customization through transfer learning, can be deployed on the Jetson Nano for inference; these too are an important part of the ecosystem.

We have introduced some of these in previous articles discussing NVIDIA’s ecosystem. It seems that when it comes to the “AI ecosystem” aspect, other boards fall short, and more efficient ASIC-based AI dedicated processors seem to be less significant at this moment.

Let’s Talk About a Few Examples

We browsed the Jetson Nano developer guide written by NVIDIA, which covers environment setup, basic usage of the bundled libraries, and so on. The main applications of the Jetson Nano appear to be object detection, image classification, semantic segmentation, and voice processing. More concretely, these may include network video recorders (NVRs), self-driving vehicles, smart speakers, access-control systems, SLAM robots, and various embedded AIoT applications related to smart transportation and smart cities.


To illustrate how such applications are developed, we will give a few examples. Similar boards follow similar processes for many features, but the Jetson Nano has a broader ecosystem behind it. The environment-setup preparation is straightforward and not covered here: write an image to a microSD card, insert it into the Jetson Nano (4GB/2GB) slot, connect power, mouse, keyboard, and display, and you can start using it. The operating system is NVIDIA’s L4T, based on Ubuntu 18.04, and the 2GB version ships with the lightweight LXDE desktop to reduce system overhead. Learning and developing applications with Python, OpenCV, AI deep learning, ROS-based control, and so on are all feasible.

A camera can be attached to the Jetson Nano through either the CSI interface mentioned above or USB. JetPack ships with an OpenCV development environment that, together with a CSI camera, can take camera input and implement some basic machine-vision applications, such as resizing and rotating the input image, as well as more advanced ones such as tracking objects of a specific color, edge detection, and face/eye tracking.

[Figure source: NVIDIA Enterprise Solutions]

Taking edge detection as an example, the process generally involves first converting the image to grayscale; then applying a Gaussian blur to the grayscale image to denoise it; and finally finding the edge lines in the image. The whole pipeline takes just a few function calls, as shown in the figure above. The execution process and results are approximately as follows:

[Figure: edge-detection execution and results]

This is a common machine-vision application. Now let’s look at an actual AI application: using the OpenCV library and a Python 3 development environment to implement face recognition, as needed for company attendance and roll-call systems. The face_recognition Python library, built on the open-source dlib machine-learning library, can accomplish this with simple function calls.

The library’s face_locations method finds the locations of faces in an image, and OpenCV can then draw boxes on the original image and display the result. The code is shown in the figure below; this example locates faces in a still image, and wrapping it in a while loop extends it to video.

[Figure: face-location example code and output]

After locating the face, it must be compared against a face-feature database for identity recognition, which will not be elaborated here. Interested readers can follow the Jetson Nano 2G updates on NVIDIA’s official WeChat account. Overall, getting started with AI development here is quite light and convenient.

It is worth mentioning that, to show off both performance and ecosystem, once JetPack is installed you can find a set of CUDA samples. These include an ocean simulation, a smoke-particle simulation (256×256 smoke particles with light and shadow changes), an n-body particle collision simulation, and more.

By comparing the execution process using just the CPU or leveraging the parallel capabilities of the GPU for acceleration, one can feel the significant difference in performance. These CUDA examples should be the most representative of the value of Jetson Nano in terms of performance and ecosystem.


It is particularly noteworthy that NVIDIA has created a Hello AI World experience for the Jetson product line, which can be considered part of the AI ecosystem. They claim that developers can experience various deep learning inference demos within a few hours, running real-time image classification and object detection functions on Jetson Nano with JetPack SDK, TensorRT, etc., using pre-trained models. (Additionally, NVIDIA’s developer blog has listed using Jetson Nano to run a complete training framework, retraining models with Transfer Learning, which also seems to be a use case, though it is likely to take quite some time…)

Hello AI World can be considered a tutorial, mainly related to applications in computer vision and cameras, including image classification, object detection, semantic segmentation, etc., as well as Deep Learning Nodes for ROS, which integrates recognition and detection features with ROS (Robot Operating System) to achieve open-source projects for robotic systems and platforms. In fact, Hello AI World itself can illustrate NVIDIA’s comprehensive ecosystem layout.

[Figure: NVIDIA’s 10-line Python object-detection example]

Finally, we can look at a concrete manifestation of NVIDIA’s software and AI-ecosystem strength. NVIDIA’s official WeChat account once presented 10 lines of Python code, as shown in the figure, “to achieve deep learning object detection and recognition for 90 categories.” With the hardware resources of the Jetson Nano 2GB, even strong algorithms such as YOLOv4 or SSD-Mobilenet would seemingly manage only 4-6 FPS.

However, executing this Python code within the JetPack ecosystem generates a corresponding TensorRT-accelerated engine for the model. The first line of the code imports the utility module, which then establishes the input and output objects; the fourth line imports the deep-learning inference module, and detectNet() is used to build the net object that handles the subsequent object-detection inference.

In the while loop, the seventh line reads a frame of the image, and the eighth line detects objects in the frame that meet the confidence threshold. Thanks to TensorRT, this one line of code delivers a significant performance boost while sparing beginners from dealing with TensorRT invocation directly. The ninth line overlays boxes, category names, confidence levels, and other data onto the image for each detected object. With NVIDIA’s underlying implementation, the original 4-6 FPS can be improved to over 10 FPS. This example feels quite representative.
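The flow just described matches the detectnet example from NVIDIA’s Hello AI World materials. A sketch along those lines is below; it runs only on a Jetson board with the jetson-inference/jetson-utils packages installed (hence no output shown), and the camera and display URIs are assumptions:

```python
import jetson.inference
import jetson.utils

# Load a pre-trained SSD-Mobilenet-v2 detection network; on first run,
# TensorRT builds an optimized engine for it behind this call.
net = jetson.inference.detectNet("ssd-mobilenet-v2", threshold=0.5)
camera = jetson.utils.videoSource("csi://0")       # CSI camera (assumed URI)
display = jetson.utils.videoOutput("display://0")  # on-screen window

while display.IsStreaming():
    img = camera.Capture()        # read one frame from the camera
    detections = net.Detect(img)  # detect objects above the threshold;
                                  # boxes/labels/confidence are overlaid
    display.Render(img)           # show the annotated frame
    display.SetStatus("FPS: {:.0f}".format(net.GetNetworkFPS()))
```

Note how the TensorRT engine generation and the overlay drawing are both hidden behind `detectNet()` and `Detect()`, which is exactly the ecosystem convenience the article is pointing at.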

This article uses Jetson Nano 2G as an example to briefly discuss the importance of performance and ecosystem for the development friendliness of embedded development boards with AI capabilities. Hardware performance is a fundamental guarantee—the trend is that more embedded boards are beginning to incorporate AI computing power; while the existing development ecosystem, represented by NVIDIA, has greatly reduced the difficulty of development, at least for beginners, and achieved significant optimization in performance efficiency. (This also indirectly confirms that NVIDIA might be a software company…)
