
RK3588 YOLOv5s
Model Deployment and Evaluation

Model Name: YOLOv5s
Model Type: Object Detection Model
Official Repository: GitHub – ultralytics/yolov5 (YOLOv5 in PyTorch > ONNX > CoreML > TFLite), tag v7.0
Model Parameters (PARAMS): 7,225,885
Model Computation (FLOPs): 16.4 GFLOPs
Deployment Device: RK3588
Deployment Environment: Ubuntu 20.04 / rknn_toolkit2 v1.6.0 / OpenCV 4.5.1
The YOLOv5 model outputs three detection heads, with sizes [1, 255, 80, 80], [1, 255, 40, 40], and [1, 255, 20, 20]. Taking the first detection head as an example, [1, 255, 80, 80] can be viewed as [1, 3, 85, 80, 80], meaning:
[batch_size, n_anchors, {x, y, w, h, obj_confidence, class0_confidence, …, class79_confidence}, grid_w, grid_h]
where n_anchors is the number of anchors (the YOLOv5 series assigns three anchor sizes to each head); x, y, w, h are the offsets and dimensions of a candidate box relative to its grid cell; obj_confidence is the probability that an object exists in the candidate box; and class0_confidence, …, class79_confidence are the probabilities of the candidate box belonging to each of the 80 categories in the COCO dataset. grid_w and grid_h are the grid dimensions, calculated as image size / stride; the YOLOv5 series uses three strides of 8/16/32 to detect targets of different sizes. With a stride of 8 and a 640×640 input, grid_w and grid_h are 640/8 = 80.
Each detection head undergoes Detect processing, and the three results are finally concatenated into an output of size [1, 25200, 85].

Detect processing is a crucial step in the YOLOv5 series models. According to the official repository and related papers, and taking the first detection head as an example, the Detect inference process is as follows:
1. Reshape + Transpose dimension conversion, [1, 255, 80, 80] -> [1, 3, 80, 80, 85], then slice along the last dimension to obtain the data for each candidate box: {x, y, w, h, obj_confidence, class0_confidence, …, class79_confidence}
2. Decode each candidate box according to the following formulas, where sigmoid is applied to every channel, (c_x, c_y) is the index of the grid cell containing the box, (p_w, p_h) is the anchor size assigned to the box, and s is the stride of the current head:
b_x = (2 * sigmoid(x) - 0.5 + c_x) * s
b_y = (2 * sigmoid(y) - 0.5 + c_y) * s
b_w = p_w * (2 * sigmoid(w))^2
b_h = p_h * (2 * sigmoid(h))^2
At this point, the data for each candidate box becomes:
[b_x, b_y, b_w, b_h, obj_confidence, class0_confidence, …, class79_confidence]
3. Reshape + Concat the data from the three detection heads into the [1, 25200, 85] output. However, during model deployment, because the NPU cannot handle the Detect module efficiently, we often choose to remove it. The post-processing program must then reproduce the above steps manually, which is described in detail later; a reference sketch follows below.
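For reference, the following is a minimal NumPy sketch of the decode step described above, applied to one raw head output; the helper name decode_head is ours, and the anchors are the official YOLOv5s stride-8 values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_head(raw, anchors, stride):
    """Decode one raw YOLOv5 head output [1, 3*85, H, W] into [3*H*W, 85] boxes."""
    _, _, h, w = raw.shape
    n = len(anchors)
    # Reshape + Transpose: [1, 255, H, W] -> [1, 3, H, W, 85]
    pred = raw.reshape(1, n, 85, h, w).transpose(0, 1, 3, 4, 2)
    pred = sigmoid(pred)  # Detect applies sigmoid to every channel
    # Grid cell indices (c_x, c_y)
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack((gx, gy), axis=-1)  # [H, W, 2]
    out = pred.copy()
    out[..., 0:2] = (2 * pred[..., 0:2] - 0.5 + grid) * stride  # b_x, b_y
    anchor_wh = np.array(anchors, dtype=np.float32).reshape(1, n, 1, 1, 2)
    out[..., 2:4] = (2 * pred[..., 2:4]) ** 2 * anchor_wh       # b_w, b_h
    return out.reshape(-1, 85)

# First head: stride 8; the three decoded heads concatenate to [25200, 85]
boxes = decode_head(np.random.rand(1, 255, 80, 80).astype(np.float32),
                    anchors=[(10, 13), (16, 30), (33, 23)], stride=8)
print(boxes.shape)  # (19200, 85)
```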
Model Structure Optimization
1. Replacing SiLU with ReLU
The original YOLOv5s network uses SiLU as its activation function, which has certain advantages over ReLU. During actual deployment, however, the precision loss from using ReLU is entirely acceptable, and the NPU handles the fused Conv + ReLU layer more efficiently; this is also the method recommended by Rockchip. We therefore replace the SiLU activations in the network with ReLU, specifically in the Conv class in models/common.py (around line 71), as sketched below.
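A minimal sketch of the change, assuming the v7.0 layout of models/common.py where Conv defines a class-level default activation; only the default_act line differs from upstream:

```python
import torch.nn as nn

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

class Conv(nn.Module):
    # Standard convolution (models/common.py, around line 71 in the v7.0 tag)
    # default_act = nn.SiLU()  # original default activation
    default_act = nn.ReLU()    # replacement: Conv + ReLU fuses efficiently on the RK3588 NPU

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```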
After replacing the activation function, we retrain the model on COCO; the accuracy on val2017 after 200 training epochs is as follows:
For comparison, the official data for YOLOv5s using SiLU is:
2. Removing the Detect structure during the forward process
The Detect head in the forward pass is a key structure of YOLOv5. However, if it is included in the deployed model, it requires repeated data exchange and memory rearrangement between the NPU and CPU, which consumes considerable time. Therefore, when exporting the ONNX model, we remove the Detect structure from the forward pass and reimplement this part in C++ in the subsequent project. The specific change is to add self.training = self.training | self.export to the forward method of the Detect class in models/yolo.py, so that the decoded detection output is not exported; see the sketch below.
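A sketch of the modified Detect.forward, assuming the v7.0 source; only the first marked line is added, and the original decode path is elided:

```python
# models/yolo.py, class Detect
def forward(self, x):
    self.training = self.training | self.export  # added: treat export like training
    z = []  # decoded inference output (unused during export)
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # raw head output, e.g. [1, 255, 80, 80]
        if not self.training:
            ...  # original decode path (sigmoid, grid offsets, anchors) stays as-is
    # With self.training forced True during export, the three raw head tensors
    # are returned directly, so the ONNX graph ends at the output convolutions.
    return x if self.training else (torch.cat(z, 1), x)
```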
Additionally, export.py must be modified by deleting the output-shape descriptions on lines 805 and 807.
Alternatively, the trailing Detect structure can be removed by editing the ONNX model after export, or by performing the removal during the model conversion step.
Model Conversion and Quantization
The DO_QUANTIZE option in the conversion script controls whether the model is quantized. To quantize the model, a calibration dataset must first be prepared. Here we use the COCO128 dataset, which contains 128 images in total. A Python script can write the image paths from the dataset to dataset.txt, which serves as the dataset index file during quantization, as shown in the sketch below.
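A minimal version of such a script; the COCO128 image directory below is an assumed local path and should be adjusted to your layout:

```python
# make_dataset_txt.py — write the quantization calibration list
from pathlib import Path

IMG_DIR = Path("coco128/images/train2017")  # assumed local COCO128 path

with open("dataset.txt", "w") as f:
    for img in sorted(IMG_DIR.glob("*.jpg")):
        f.write(f"{img}\n")  # one image path per line
```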
Analysis of Model Layer Operation Status
To evaluate the model’s operational efficiency before and after quantization, we use the online testing interface provided by rknn_toolkit2 on the PC to measure the model’s runtime behavior on the device. The CPU frequency on the device is fixed at 900 MHz and the NPU frequency at 1 GHz. Execute the following command on the device:
sudo rknn_server
Connect the device’s Type-C OTG port to the PC’s USB port, and execute the command on the PC:
adb devices
This queries the adb devices connected to the PC and returns the device ID. If adb is reported as not installed, install it with the following command:
sudo apt install adb
Add time-consumption evaluation and error-analysis code to the model conversion Python program from section 3.2.
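A minimal sketch of such a combined script, using the standard rknn-toolkit2 1.6.0 API (eval_perf reports per-layer timing, accuracy_analysis reports layer-wise quantization error); the file names are placeholders:

```python
# convert_and_eval.py — convert, quantize, and profile the model on-device
from rknn.api import RKNN

DO_QUANTIZE = True

rknn = RKNN(verbose=True)

# Preprocessing baked into the RKNN model: scale pixels to [0, 1]
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]],
            target_platform="rk3588")

rknn.load_onnx(model="yolov5s.onnx")
rknn.build(do_quantization=DO_QUANTIZE, dataset="dataset.txt")
rknn.export_rknn("yolov5s.rknn")

# Online testing through rknn_server/adb; perf_debug enables per-layer timing
rknn.init_runtime(target="rk3588", perf_debug=True)

# Time-consumption evaluation: per-layer execution time on the NPU
rknn.eval_perf()

# Error analysis: layer-by-layer comparison of simulator and device outputs
rknn.accuracy_analysis(inputs=["bus.jpg"], target="rk3588")

rknn.release()
```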
Data Preprocessing
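The preprocessing follows the usual YOLOv5 letterbox scheme; a minimal sketch for a 640×640 input (function and file names are ours):

```python
import cv2
import numpy as np

def letterbox(img, new_size=640, color=(114, 114, 114)):
    """Resize with unchanged aspect ratio and pad the borders (YOLOv5-style)."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)  # scale ratio
    nh, nw = round(h * r), round(w * r)
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    out = np.full((new_size, new_size, 3), color, dtype=np.uint8)
    out[top:top + nh, left:left + nw] = resized
    return out, r, (left, top)  # keep ratio/padding to map boxes back later

img = cv2.imread("bus.jpg")                 # placeholder input image
img, ratio, pad = letterbox(img)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # model expects RGB input here
```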
Data Postprocessing
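After the decode step sketched in the Detect section, post-processing reduces to confidence filtering plus non-maximum suppression. The deployed implementation is in C++; the logic is sketched here in Python for reference (thresholds are typical defaults, not values from this article):

```python
import numpy as np

def postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    """pred: [N, 85] decoded boxes (cx, cy, w, h, obj_conf, 80 class confidences)."""
    scores = pred[:, 4:5] * pred[:, 5:]  # obj_confidence * class_confidence
    cls, conf = scores.argmax(1), scores.max(1)
    m = conf > conf_thres                # confidence filtering
    boxes, conf, cls = pred[m, :4], conf[m], cls[m]
    # cxcywh -> xyxy corners
    xyxy = np.column_stack((boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
                            boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2))
    # Greedy per-class NMS
    order, kept = conf.argsort()[::-1], []
    while order.size:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        xx1 = np.maximum(xyxy[i, 0], xyxy[rest, 0])
        yy1 = np.maximum(xyxy[i, 1], xyxy[rest, 1])
        xx2 = np.minimum(xyxy[i, 2], xyxy[rest, 2])
        yy2 = np.minimum(xyxy[i, 3], xyxy[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (xyxy[i, 2] - xyxy[i, 0]) * (xyxy[i, 3] - xyxy[i, 1])
        area_r = (xyxy[rest, 2] - xyxy[rest, 0]) * (xyxy[rest, 3] - xyxy[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # suppress only overlapping boxes of the same class
        order = rest[(iou < iou_thres) | (cls[rest] != cls[i])]
    return xyxy[kept], conf[kept], cls[kept]
```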
Experimental Equipment: RK3588_Eval_Board
Experimental Environment: Ubuntu 20.04 LTS/rknn-toolkit2 1.6.0/OpenCV 4.5.1
Experimental Parameters: CPU Frequency @1.8GHz/NPU Frequency @1.0GHz/Iterations @1000
Optimized Model Inference Experiment
1. Optimized FP16 Model Inference Experiment
• Running Time
• NPU Average Load
• NPU Load Curve
• CPU Average Load
• CPU Load Curve
2. Optimized INT8 Model Inference Experiment
• Running Time
• NPU Average Load
• NPU Load Curve
• CPU Average Load
• CPU Load Curve
Model Inference Experiment Based on MegaFlow Runtime SDK
1. Single-threaded Model Inference Process Experiment Based on MegaFlow Runtime SDK
In the MegaFlow platform, create a new process; add the Image Read, YOLOv5s, and Null operators; configure the operator parameters; and deploy the model to NPU0. Click the “Debug” button, set the test type to “By Count” and the test count to 1000, start debugging, and wait for it to complete to view the latest performance report.
• Operator Runtime Sequence Diagram
By analyzing the operator runtime diagram in the report, it can be seen that the first model inference occurs at 65 ms (data sequence number 0) and the last inference output at 24820 ms (data sequence number 1002). A total of 1003 inferences were performed, with an average time per inference of 24.65 ms and an average FPS of 40.51.
• NPU Load Curve
By analyzing the system performance analysis diagram in the report, the NPU load curve during the test process can be obtained. It can be seen that only NPU0 is working during single-threaded inference.
• CPU Load Curve
By analyzing the system performance analysis diagram in the report, the CPU load curve during the test process can be obtained. It can be seen that the overall CPU load is low, around 20%.
2. Multi-threaded Model Inference Process Experiment Based on MegaFlow Runtime SDK
In the MegaFlow platform, create a new process, build the multi-threaded inference pipeline, and configure the operator parameters. Deploy the models of the three inference threads to NPU0, NPU1, and NPU2. Click the “Debug” button, set the test type to “By Count” and the test count to 1000, start debugging, and wait for it to complete to view the latest performance report.
• Operator Runtime Sequence Diagram
By analyzing the operator runtime diagram in the report, it can be seen that the first model inference occurs at 60 ms (data sequence number 0) and the last inference output at 10095 ms (data sequence number 1028). A total of 1029 inferences were performed, with an average time per inference of 9.75 ms and an average FPS of 102.57. Compared to the single-threaded process, the multi-threaded inference process improves performance by 153.2% while preserving ordered data output.
• NPU Load Curve
By analyzing the system performance analysis diagram in the report, the NPU load curve during the test process can be obtained. It can be seen that the utilization of all three NPUs reached over 70%.
• CPU Load Curve
By analyzing the system performance analysis diagram in the report, the CPU load curve during the test process can be obtained. It can be seen that the overall CPU load is high, reaching over 70%.
This article introduces the YOLOv5s model and its deployment and optimization process on the RK3588 device. By replacing the activation function with ReLU, removing the Detect structure, applying INT8 quantization, and using the MegaFlow Runtime SDK for multi-threaded inference, we improved the model’s inference performance. The inference performance of the various methods is summarized in the table below:
