
RK3588 YOLOv5s
Model Deployment and Evaluation

Model Name: YOLOv5s
Model Type: Object Detection Model
Official Repository: GitHub – ultralytics/yolov5 (YOLOv5 in PyTorch > ONNX > CoreML > TFLite), tag v7.0
Model Parameters (PARAMS): 7,225,885
Model Computation (FLOPs): 16.4 GFLOPs
Deployment Device: RK3588
Deployment Environment: Ubuntu 20.04 / rknn_toolkit2 v1.6.0 / OpenCV 4.5.1
The YOLOv5 model outputs three detection heads, with sizes [1, 255, 80, 80], [1, 255, 40, 40], and [1, 255, 20, 20]. Taking the first detection head as an example, [1, 255, 80, 80] can be viewed as [1, 3, 85, 80, 80], meaning:
[batch_size, n_anchors, {x, y, w, h, obj_confidence, class0_confidence, …, class79_confidence}, grid_w, grid_h]
where n_anchors is the number of anchors (the YOLOv5 series assigns three anchor sizes to each head); x, y, w, h are the offsets and dimensions of a candidate box relative to its grid cell; obj_confidence is the probability that an object exists in the candidate box; and class0_confidence, …, class79_confidence are the probabilities of the candidate box belonging to each of the 80 categories in the COCO dataset. grid_w and grid_h are the grid dimensions, calculated as image size / stride; the YOLOv5 series uses three strides of 8/16/32 to detect targets of different sizes. With a stride of 8 and a 640×640 input, grid_w and grid_h are 640/8 = 80.
Each detection head undergoes Detect processing, and the three results are finally concatenated into an output of size [1, 25200, 85].

Detect processing is a crucial step in the YOLOv5 series models. According to the official repository and related papers, and taking the first detection head as an example, the Detect inference process is as follows:
1. Reshape + Transpose dimension conversion, [1, 255, 80, 80] -> [1, 3, 80, 80, 85], then slice along the last dimension to obtain the data for each candidate box: {x, y, w, h, obj_confidence, class0_confidence, …, class79_confidence}
2. Decode each candidate box according to the following formulas, where sigmoid is applied to every channel, (c_x, c_y) is the index of the grid cell containing the box, (p_w, p_h) is the anchor size assigned to the box, and s is the stride of the current head:
b_x = (2 * sigmoid(x) - 0.5 + c_x) * s
b_y = (2 * sigmoid(y) - 0.5 + c_y) * s
b_w = p_w * (2 * sigmoid(w))^2
b_h = p_h * (2 * sigmoid(h))^2
At this point, the data for each candidate box becomes:
[b_x, b_y, b_w, b_h, obj_confidence, class0_confidence, …, class79_confidence]
3. Reshape + Concat the data from the three detection heads into the [1, 25200, 85] output. However, during model deployment, because the NPU cannot handle the Detect module efficiently, we often choose to remove it. The post-processing program must then reproduce the above steps manually, which is described in detail later; a reference sketch follows below.
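For reference, the following is a minimal NumPy sketch of the decode step described above, applied to one raw head output; the helper name decode_head is ours, and the anchors are the official YOLOv5s stride-8 values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_head(raw, anchors, stride):
    """Decode one raw YOLOv5 head output [1, 3*85, H, W] into [3*H*W, 85] boxes."""
    _, _, h, w = raw.shape
    n = len(anchors)
    # Reshape + Transpose: [1, 255, H, W] -> [1, 3, H, W, 85]
    pred = raw.reshape(1, n, 85, h, w).transpose(0, 1, 3, 4, 2)
    pred = sigmoid(pred)  # Detect applies sigmoid to every channel
    # Grid cell indices (c_x, c_y)
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack((gx, gy), axis=-1)  # [H, W, 2]
    out = pred.copy()
    out[..., 0:2] = (2 * pred[..., 0:2] - 0.5 + grid) * stride  # b_x, b_y
    anchor_wh = np.array(anchors, dtype=np.float32).reshape(1, n, 1, 1, 2)
    out[..., 2:4] = (2 * pred[..., 2:4]) ** 2 * anchor_wh       # b_w, b_h
    return out.reshape(-1, 85)

# First head: stride 8; the three decoded heads concatenate to [25200, 85]
boxes = decode_head(np.random.rand(1, 255, 80, 80).astype(np.float32),
                    anchors=[(10, 13), (16, 30), (33, 23)], stride=8)
print(boxes.shape)  # (19200, 85)
```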
Model Structure Optimization
1. Replacing SiLU with ReLU
The original YOLOv5s network uses SiLU as its activation function, which has certain advantages over ReLU. During actual deployment, however, the precision loss from using ReLU is entirely acceptable, and the NPU handles the fused Conv + ReLU layer more efficiently; this is also the method recommended by Rockchip. We therefore replace the SiLU activations in the network with ReLU, specifically in the Conv class in models/common.py (around line 71), as sketched below.
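A minimal sketch of the change, assuming the v7.0 layout of models/common.py where Conv defines a class-level default activation; only the default_act line differs from upstream:

```python
import torch.nn as nn

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

class Conv(nn.Module):
    # Standard convolution (models/common.py, around line 71 in the v7.0 tag)
    # default_act = nn.SiLU()  # original default activation
    default_act = nn.ReLU()    # replacement: Conv + ReLU fuses efficiently on the RK3588 NPU

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```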
After replacing the activation function, we retrain the model on COCO; the accuracy on val2017 after 200 training epochs is as follows:
For comparison, the official data for YOLOv5s using SiLU is:
2. Removing the Detect structure during the forward process
The Detect head in the forward pass is a key structure of YOLOv5. However, if it is included in the deployed model, it requires repeated data exchange and memory rearrangement between the NPU and CPU, which consumes considerable time. Therefore, when exporting the ONNX model, we remove the Detect structure from the forward pass and reimplement this part in C++ in the subsequent project. The specific change is to add self.training = self.training | self.export to the forward method of the Detect class in models/yolo.py, so that the decoded detection output is not exported; see the sketch below.
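A sketch of the modified Detect.forward, assuming the v7.0 source; only the first marked line is added, and the original decode path is elided:

```python
# models/yolo.py, class Detect
def forward(self, x):
    self.training = self.training | self.export  # added: treat export like training
    z = []  # decoded inference output (unused during export)
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # raw head output, e.g. [1, 255, 80, 80]
        if not self.training:
            ...  # original decode path (sigmoid, grid offsets, anchors) stays as-is
    # With self.training forced True during export, the three raw head tensors
    # are returned directly, so the ONNX graph ends at the output convolutions.
    return x if self.training else (torch.cat(z, 1), x)
```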
Additionally, export.py must be modified by deleting the output-shape descriptions on lines 805 and 807.
Alternatively, the trailing Detect structure can be removed by editing the ONNX model after export, or by performing the removal during the model conversion step.
Model Conversion and Quantization
The DO_QUANTIZE option in the conversion script controls whether the model is quantized. To quantize the model, a calibration dataset must first be prepared. Here we use the COCO128 dataset, which contains 128 images in total. A Python script can write the image paths from the dataset to dataset.txt, which serves as the dataset index file during quantization, as shown in the sketch below.
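A minimal version of such a script; the COCO128 image directory below is an assumed local path and should be adjusted to your layout:

```python
# make_dataset_txt.py — write the quantization calibration list
from pathlib import Path

IMG_DIR = Path("coco128/images/train2017")  # assumed local COCO128 path

with open("dataset.txt", "w") as f:
    for img in sorted(IMG_DIR.glob("*.jpg")):
        f.write(f"{img}\n")  # one image path per line
```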
Analysis of Model Layer Operation Status
To evaluate the model’s operational efficiency before and after quantization, we use the online testing interface provided by rknn_toolkit2 on the PC to measure the model’s runtime behavior on the device. The CPU frequency on the device is fixed at 900 MHz and the NPU frequency at 1 GHz. Execute the following command on the device:
sudo rknn_server
Connect the device’s Type-C OTG port to the PC’s USB port, and execute the command on the PC:
adb devices
This queries the adb devices connected to the PC and returns the device ID. If adb is reported as not installed, install it with the following command:
sudo apt install adb
Add time-consumption evaluation and error-analysis code to the model conversion Python program from section 3.2.
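A minimal sketch of such a combined script, using the standard rknn-toolkit2 1.6.0 API (eval_perf reports per-layer timing, accuracy_analysis reports layer-wise quantization error); the file names are placeholders:

```python
# convert_and_eval.py — convert, quantize, and profile the model on-device
from rknn.api import RKNN

DO_QUANTIZE = True

rknn = RKNN(verbose=True)

# Preprocessing baked into the RKNN model: scale pixels to [0, 1]
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]],
            target_platform="rk3588")

rknn.load_onnx(model="yolov5s.onnx")
rknn.build(do_quantization=DO_QUANTIZE, dataset="dataset.txt")
rknn.export_rknn("yolov5s.rknn")

# Online testing through rknn_server/adb; perf_debug enables per-layer timing
rknn.init_runtime(target="rk3588", perf_debug=True)

# Time-consumption evaluation: per-layer execution time on the NPU
rknn.eval_perf()

# Error analysis: layer-by-layer comparison of simulator and device outputs
rknn.accuracy_analysis(inputs=["bus.jpg"], target="rk3588")

rknn.release()
```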
Data Preprocessing
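The preprocessing follows the usual YOLOv5 letterbox scheme; a minimal sketch for a 640×640 input (function and file names are ours):

```python
import cv2
import numpy as np

def letterbox(img, new_size=640, color=(114, 114, 114)):
    """Resize with unchanged aspect ratio and pad the borders (YOLOv5-style)."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)  # scale ratio
    nh, nw = round(h * r), round(w * r)
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    out = np.full((new_size, new_size, 3), color, dtype=np.uint8)
    out[top:top + nh, left:left + nw] = resized
    return out, r, (left, top)  # keep ratio/padding to map boxes back later

img = cv2.imread("bus.jpg")                 # placeholder input image
img, ratio, pad = letterbox(img)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # model expects RGB input here
```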
Data Postprocessing
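After the decode step sketched in the Detect section, post-processing reduces to confidence filtering plus non-maximum suppression. The deployed implementation is in C++; the logic is sketched here in Python for reference (thresholds are typical defaults, not values from this article):

```python
import numpy as np

def postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    """pred: [N, 85] decoded boxes (cx, cy, w, h, obj_conf, 80 class confidences)."""
    scores = pred[:, 4:5] * pred[:, 5:]  # obj_confidence * class_confidence
    cls, conf = scores.argmax(1), scores.max(1)
    m = conf > conf_thres                # confidence filtering
    boxes, conf, cls = pred[m, :4], conf[m], cls[m]
    # cxcywh -> xyxy corners
    xyxy = np.column_stack((boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
                            boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2))
    # Greedy per-class NMS
    order, kept = conf.argsort()[::-1], []
    while order.size:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        xx1 = np.maximum(xyxy[i, 0], xyxy[rest, 0])
        yy1 = np.maximum(xyxy[i, 1], xyxy[rest, 1])
        xx2 = np.minimum(xyxy[i, 2], xyxy[rest, 2])
        yy2 = np.minimum(xyxy[i, 3], xyxy[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (xyxy[i, 2] - xyxy[i, 0]) * (xyxy[i, 3] - xyxy[i, 1])
        area_r = (xyxy[rest, 2] - xyxy[rest, 0]) * (xyxy[rest, 3] - xyxy[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # suppress only overlapping boxes of the same class
        order = rest[(iou < iou_thres) | (cls[rest] != cls[i])]
    return xyxy[kept], conf[kept], cls[kept]
```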
Experimental Equipment: RK3588_Eval_Board
Experimental Environment: Ubuntu 20.04 LTS/rknn-toolkit2 1.6.0/OpenCV 4.5.1
Experimental Parameters: CPU Frequency @1.8GHz/NPU Frequency @1.0GHz/Iterations @1000
Optimized Model Inference Experiment
1. Optimized FP16 Model Inference Experiment
• Running Time
• NPU Average Load
• NPU Load Curve
• CPU Average Load
• CPU Load Curve
2. Optimized INT8 Model Inference Experiment
• Running Time
• NPU Average Load
• NPU Load Curve
• CPU Average Load
• CPU Load Curve
Model Inference Experiment Based on MegaFlow Runtime SDK
1. Single-threaded Model Inference Process Experiment Based on MegaFlow Runtime SDK
In the MegaFlow platform, create a new process; add the Image Read, YOLOv5s, and Null operators; configure the operator parameters; and deploy the model to NPU0. Click the “Debug” button, set the test type to “By Count” and the test count to 1000, start debugging, and wait for it to complete to view the latest performance report.
• Operator Runtime Sequence Diagram
By analyzing the operator runtime diagram in the report, it can be seen that the first model inference occurs at 65 ms (data sequence number 0) and the last inference output at 24820 ms (data sequence number 1002). A total of 1003 inferences were performed, with an average time per inference of 24.65 ms and an average FPS of 40.51.
• NPU Load Curve
By analyzing the system performance analysis diagram in the report, the NPU load curve during the test process can be obtained. It can be seen that only NPU0 is working during single-threaded inference.
• CPU Load Curve
By analyzing the system performance analysis diagram in the report, the CPU load curve during the test process can be obtained. It can be seen that the overall CPU load is low, around 20%.
2. Multi-threaded Model Inference Process Experiment Based on MegaFlow Runtime SDK
In the MegaFlow platform, create a new process, build the multi-threaded inference pipeline, and configure the operator parameters. Deploy the models of the three inference threads to NPU0, NPU1, and NPU2. Click the “Debug” button, set the test type to “By Count” and the test count to 1000, start debugging, and wait for it to complete to view the latest performance report.
• Operator Runtime Sequence Diagram
By analyzing the operator runtime diagram in the report, it can be seen that the first model inference occurs at 60 ms (data sequence number 0) and the last inference output at 10095 ms (data sequence number 1028). A total of 1029 inferences were performed, with an average time per inference of 9.75 ms and an average FPS of 102.57. Compared to the single-threaded process, the multi-threaded inference process improves performance by 153.2% while preserving ordered data output.
• NPU Load Curve
By analyzing the system performance analysis diagram in the report, the NPU load curve during the test process can be obtained. It can be seen that the utilization of all three NPUs reached over 70%.
• CPU Load Curve
By analyzing the system performance analysis diagram in the report, the CPU load curve during the test process can be obtained. It can be seen that the overall CPU load is high, reaching over 70%.
This article introduces the YOLOv5s model and its deployment and optimization process on the RK3588 device. By replacing the activation function with ReLU, removing the Detect structure, applying INT8 quantization, and using the MegaFlow Runtime SDK for multi-threaded inference, we improved the model’s inference performance. The inference performance of the various methods is summarized in the table below:
