Source: Wai Rui Lao Ge (ID: verilog-2001)
FPGA is known as a universal chip, capable of achieving almost any function through logical programming.
So the question arises: how difficult is it to use FPGA to accelerate artificial intelligence and implement deep learning algorithms?
It is quite difficult!
Starting from the AI algorithm, writing, debugging, and optimizing the design in Verilog, and finally downloading it to the FPGA to run takes a great deal of manpower and resources; without at least a year and a half of effort, it is hard to show results.
However, if there is an FPGA app library containing a variety of AI applications, you can simply download the corresponding program to the FPGA, as shown in the figure below:
There are intelligent cameras for face recognition, defect detection, ReID (pedestrian detection/tracking technology), voice recognition, multi-target face recognition, intelligent driving assistance, etc.
These applications can be compiled and downloaded to the FPGA directly, which greatly simplifies development and gives an out-of-the-box, "ready to use" result.
Productive capability can be built up quickly.
Even with customized requirements, the feasibility of the route can be quickly assessed, allowing for secondary development based on this.
This is faster and more efficient than starting directly from Verilog coding.
So how does this work?
2: How to Do It
How many steps are needed to build a face recognition system with FPGA?
If you use the KV260, which is built around a Zynq UltraScale+ MPSoC, it takes three steps in total:
Step 1: Assemble the Components
The KV260 is a system based on the Zynq UltraScale+ MPSoC, which integrates CPU, FPGA fabric, GPU, and other resources on one chip.
The red heat sink covers the Zynq UltraScale+ chip, which carries a quad-core Cortex-A53 processor plus 256K logic cells of FPGA fabric, a "two swords joined" (CPU + FPGA) architecture.
The components include:
1: Power supply: powers the system.
2: SD (TF/microSD) card: for flashing the operating system image.
3: Camera: video input.
4: USB-to-serial cable: serial console output.
5: Ethernet cable: for connecting to the network and downloading the smart applications.
6: HDMI cable: video output.
Then assemble these parts together.
Step 2: Download Applications
1: Flash the Linux image to the TF (microSD) card, insert the card, and power on to boot Linux.
2: Connect the Ethernet cable and check connectivity by pinging the gateway.
3: Once online, select and install the corresponding application; here that is the smart camera application that performs face recognition.
The image below shows the application installation process. After issuing the download command, make sure to run it as root: sudo dnf install packagegroup-kv260-smartcam.
The installation takes some time, so let it run for a while.
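To put the commands for this step in one place, here is a minimal sketch (the gateway address 192.168.1.1 is only an example, and xmutil getpkgs assumes the stock Kria Linux image):
# confirm the board can reach the network (substitute your own gateway address)
ping -c 3 192.168.1.1
# list the application package groups published in the Xilinx feeds
sudo xmutil getpkgs
# install the smart-camera application as root
sudo dnf install packagegroup-kv260-smartcam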
Step 3: Start Testing
Start the application and begin testing. Load the application firmware in the Linux system, then launch the AI application:
sudo smartcam --mipi -W 1920 -H 1080 -r 30 --target dp
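For reference, the load-then-launch sequence on the stock Kria image looks roughly like this (a sketch based on the documented smartcam flow; the application/firmware name kv260-smartcam may differ between releases):
# see which accelerated applications are installed
sudo xmutil listapps
# unload whatever firmware is currently loaded, then load the smart-camera firmware
sudo xmutil unloadapp
sudo xmutil loadapp kv260-smartcam
# then run the smartcam command shown above (MIPI camera in, 1080p at 30 fps, output to the monitor)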
This is akin to figuring out how many steps it takes to fit an elephant in a refrigerator!
Then, how does it perform?
3: How Does It Perform?
This is how it looks.
I mounted this FPGA system on a live-streaming stand and paired it with a monitor, which made it look like a small live-streaming studio built around an industrial camera.
With this “live studio”, I noticed that there was no beauty effect.
So, I skipped the live stream.
How does this AI perform?
I showed the KV260 a video of the UEFA European Championship award ceremony.
Why the UEFA award ceremony? Mainly because there are many people appearing in this video simultaneously, and the scene is constantly moving. I wanted to test the processing capability of this system.
1: The number of faces recognized.
2: Processing latency.
Videos with only two or three people cannot test the capabilities of this AI chip.
Moreover, a still image places little demand on latency; face recognition at a residential-community entrance, for example, can show a noticeable delay without causing any problem.
So, let's see how many faces in the video the KV260 can actually recognize.
The clip is the UEFA award ceremony, and you can see that the KV260 immediately recognized all the faces of the winning players.
This efficiency is quite high.
I counted, and at most, 22 faces were recognized and captured simultaneously. Both main and substitute players were captured.
Being able to capture 22 faces simultaneously demonstrates the face recognition capability of this chip.
Moreover, the scene is continuously switching, and the latency control is quite good, although it is difficult to analyze quantitatively.
4: System and Principles
Having seen the performance of the FPGA in this system, let’s get to some solid content.
What components are necessary for an artificial intelligence chip system?
CPU system: Responsible for OS operation, system initialization, data flow control, signal transmission, peripheral management, etc.
DPU (Deep Learning Processing Unit) system: responsible for accelerating deep-learning algorithms, acting as the intelligent acceleration engine.
In simple terms: two parts, control and computation.
This is the classic CPU + FPGA solution commonly used in embedded and industrial fields, usually in the form of ARM + FPGA: the ARM handles control and peripheral management, while the FPGA is responsible for AI acceleration.
The Zynq UltraScale+ integrates both into a single chip, as shown in the diagram below.
Its main resources include a quad-core Cortex-A53 application processor, a Mali GPU, and a dual-core Cortex-R5F real-time processor.
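On a running board the two halves can also be seen from the Linux side. Here is a minimal sketch, assuming the Vitis AI runtime utilities are present on the image (xdputil may not be preinstalled everywhere):
# CPU side: the four Cortex-A53 cores visible to Linux
lscpu | grep -E 'Model name|^CPU\(s\)'
# DPU side: query the deep-learning accelerator loaded into the programmable logic
sudo xdputil query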
The term DPU brings to mind Xilinx's acquisition of DeePhi Tech, which made quite a splash in the industry at the time.
DeePhi's core technology was an FPGA-based DPU: it claimed to compress neural networks by tens of times with little loss of accuracy, and to keep the deep-learning models in on-chip storage, reducing external memory reads and significantly lowering power consumption.
Looking at it now, it is hard to say how much of that technology was carried over into the Zynq UltraScale+ DPU, but the performance is still strong.
The main workflow and performance indicators are shown in the diagram below:
It supports 11 MIPI cameras, which should be sufficient for the number of cameras required for a vehicle.
As far as I know, few chips can support this many cameras.
Most importantly, the chip has abundant FPGA programmable resources: 256K logic cells (a logic cell consists of a LUT and a register). Running at 333 MHz, the deep-learning processing capability of the programmable logic reaches 1.3 TOPS.
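As a rough back-of-the-envelope check of my own (not a vendor figure): 1.3 TOPS at 333 MHz works out to about 1.3×10¹² / 3.33×10⁸ ≈ 3,900 operations per clock cycle executed in parallel by the programmable logic, which gives a feel for how wide the DPU datapath must be.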
Aside from smart cameras, there are other applications available for different choices.
For instance, intelligent driving, voice recognition, security detection, etc.
Here I want to highlight ADAS (Advanced Driver Assistance Systems). ADAS uses various sensors installed in the vehicle to collect environmental data inside and outside the car in real time, and performs recognition, detection, and tracking of static and dynamic objects, so that the driver perceives potential dangers as early as possible and safety improves. This includes driver-fatigue warning and driver identity recognition; face recognition is also needed for personalized cockpits in passenger vehicles, and for future business models, some prototypes of which are already in development.
To summarize:
1: Processing Capability: The KV260 uses the Zynq UltraScale+ MPSoC chip, which includes a quad-core application processor, GPU, real-time processor, and 256K logic cells, providing sufficient hardware resources to handle AI applications.
2: Peripheral Interfaces: The KV260 supports 11 camera inputs, which can accommodate applications like automotive ADAS.
3: Software Support: In addition to hardware, there are many software suites available, such as face recognition, intrusion detection, autonomous driving, etc., facilitating secondary development.
FPGAs can run deep-learning applications at millisecond-level latency with high throughput, which is likely their key advantage.
The earlier face-recognition test, in which 22 faces were recognized simultaneously, also indirectly shows that the system can handle a large number of faces with low latency.
Why 22 faces? Because that was the number of faces in the video.
Of course, this test only provides a visually perceptible perspective. Actual applications can be tested based on requirements, and evaluations can be made from the test results to accelerate the design-in process.
Editor: Jin Shui Lou Tai