Using the RK3588 Chip NPU: Running and Interpreting the Official rknn_yolov5_android_apk_demo

1. Objective of This Article

This article will accomplish two tasks:

  • Run the official real-time target detection example using the camera on the RK3588 development board.
  • Walk through the source code to deepen understanding of RKNN development.

2. Development Environment Description

  • Host System: Windows 11
  • Target Device: Android development board equipped with RK3588 chip
  • Core Tools: Android Studio Koala | 2024.1.1 Patch 2, NDK 27.0

3. Compilation and Execution

0. Source Code Location

The official path is rknn-toolkit2\rknpu2\examples\rknn_yolov5_android_apk_demo. If you cannot find it, feel free to contact me.

1. Directory Structure

The project's directory structure is shown in the figure below.

2. Compilation and Execution

After opening the project, the dependencies download automatically. Click Run and the app deploys to the RK3588 development board without any obstacles. The image below shows the running effect, recognizing the three figures in the photo displayed on a phone screen.


4. Code Interpretation

1. Core Issues

This case involves real-time target detection during camera preview, so there are several core issues to address:

  1. Camera Usage: The demo uses the legacy android.hardware.Camera API. It is deprecated and no longer recommended, but it still works well here.
  2. Concurrent Operation: The camera keeps previewing while YOLOv5 runs detection, so a dedicated inference thread is needed (a simplified sketch of this pattern follows the list).
  3. Image Format Conversion: The preview frames from the Camera API arrive in NV21 format, while the Rockchip NPU pipeline expects RGB888.
  4. Frame Buffering: A queue is needed to cache preview frames handed from the camera callback to the inference thread.
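
Issues 2 and 4 together form a classic producer/consumer pattern: the camera callback produces frames and a dedicated thread consumes them. As a simplified illustration only (the demo itself uses its own ImageBufferQueue with NPU-backed buffers rather than a standard BlockingQueue), the pattern looks roughly like this:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified sketch of the camera-to-inference hand-off; not the demo's actual code.
public class PreviewPipelineSketch {
    private final BlockingQueue<byte[]> frameQueue = new ArrayBlockingQueue<>(3);
    private volatile boolean running = true;

    // Producer: called from the camera preview callback with an NV21 frame.
    public void onFrame(byte[] nv21Frame) {
        frameQueue.offer(nv21Frame);   // drop the frame if the queue is already full
    }

    // Consumer: the dedicated inference thread.
    public void startInferenceThread() {
        new Thread(() -> {
            while (running) {
                try {
                    byte[] frame = frameQueue.take();   // block until a frame is available
                    // convert NV21 -> RGB888 and run YOLOv5 inference here
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "inference").start();
    }

    public void stop() {
        running = false;
    }
}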

2. Main Process of the Program

It is recommended to start analyzing from CameraPreviewActivity, where the logic gradually unfolds.

Initialization Phase:

  • onCreate():
    • Initialize UI components (FPS display, result canvas).
    • Load the preset YOLO model files into the cache directory based on the device platform (e.g., RK3588/RK356x).
    • Initialize the inference engine mInferenceWrapper and the target detection result processor mInferenceResult.
  • createPreviewView():
    • Create a SurfaceView for camera preview and bind the SurfaceHolder callback.
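
Put together, the initialization phase boils down to roughly the following sketch. Field names such as mInferenceWrapper and mInferenceResult follow the description above; the layout/view IDs, the isRk3588() check, and the copyAssetToCache() helper are assumptions used for illustration, not the demo's exact code:

// Condensed sketch of onCreate(); error handling and platform detection are simplified.
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_camera_preview);   // assumed layout name

    mFpsText = findViewById(R.id.fps_text);             // FPS display (assumed ID)
    mTrackResultView = findViewById(R.id.result_view);  // result canvas (assumed ID)

    // Pick the model matching the SoC and copy it from assets to the cache directory.
    String modelName = isRk3588() ? "yolov5s_rk3588.rknn" : "yolov5s_rk356x.rknn";
    String modelPath = copyAssetToCache(modelName);      // hypothetical helper

    mInferenceWrapper = new InferenceWrapper();
    mInferenceResult = new InferenceResult();
    try {
        // height, width, channels, model path (see InferenceWrapper.initModel in 3.1)
        mInferenceWrapper.initModel(CAMERA_PREVIEW_HEIGHT, CAMERA_PREVIEW_WIDTH, 3, modelPath);
    } catch (Exception e) {
        Log.e(TAG, "model init failed", e);
    }

    createPreviewView();   // SurfaceView + SurfaceHolder callback
}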

Camera and Preview Control:

Camera and preview control is summarized in the flowchart below.

TSurfaceHolderCallback

  • surfaceCreated(): Calls startCamera() to open the camera and start the preview.
  • surfaceDestroyed(): Stops the preview and releases camera resources.

startCamera()

  1. Select the camera ID based on the number of cameras (prefer ID=2).
  2. Configure the camera parameters (resolution, preview format) and set the preview buffer mPreviewData0.
  3. Register the preview data callback with setPreviewCallbackWithBuffer().
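
A condensed sketch of what startCamera() does with the legacy Camera API is shown below; the exact camera-ID selection logic, error handling, and the mSurfaceHolder field are simplified assumptions:

private void startCamera() throws IOException {
    int numCameras = Camera.getNumberOfCameras();
    int cameraId = (numCameras > 2) ? 2 : 0;            // the demo prefers ID 2 when available
    mCamera0 = Camera.open(cameraId);

    Camera.Parameters params = mCamera0.getParameters();
    params.setPreviewSize(CAMERA_PREVIEW_WIDTH, CAMERA_PREVIEW_HEIGHT);
    params.setPreviewFormat(ImageFormat.NV21);           // preview frames arrive as NV21
    mCamera0.setParameters(params);

    // Reusable preview buffer: NV21 needs 1.5 bytes per pixel.
    mPreviewData0 = new byte[CAMERA_PREVIEW_WIDTH * CAMERA_PREVIEW_HEIGHT * 3 / 2];
    mCamera0.addCallbackBuffer(mPreviewData0);
    mCamera0.setPreviewCallbackWithBuffer(this);          // delivers frames to onPreviewFrame()

    mCamera0.setPreviewDisplay(mSurfaceHolder);
    mCamera0.startPreview();
}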

onPreviewFrame()

  • Receives preview frames in NV21 format.
  • Uses RGA hardware acceleration to convert to RGB format.
  • Pushes the converted frame into the image queue mImageBufferQueue.
    @Override
    public void onPreviewFrame(byte[] data, Camera camera) {
        mCamera0.addCallbackBuffer(data);
        ImageBufferQueue.ImageBuffer imageBuffer = mImageBufferQueue.getFreeBuffer();

        if (imageBuffer != null) {
            // RK_FORMAT_YCrCb_420_SP -> RK_FORMAT_RGB_888 (NV21 -> RGB888)
            // flip for CAMERA_FACING_FRONT
            RGA.colorConvertAndFlip(data, RK_FORMAT_YCrCb_420_SP, imageBuffer.mImage_handle,  
                    RK_FORMAT_RGB_888,
                    CAMERA_PREVIEW_WIDTH, CAMERA_PREVIEW_HEIGHT, this.flip);

            mImageBufferQueue.postBuffer(imageBuffer);
        }
    }

Inference Thread

The inference thread is summarized in the diagram below.

startTrack()

  • Initializes the image queue and starts the inference thread mInferenceRunnable.

Inference Loop

  1. Gets RGB images from the queue.
  2. Calls mInferenceWrapper.run() to run the YOLO model inference.
  3. Calculates FPS and updates the UI via the Handler.
  4. Detection results are stored in mInferenceResult.
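
A hedged sketch of what the inference loop looks like is shown below. The message ID, the mInferenceRunning flag, and the FPS bookkeeping names are illustrative; the calls to the queue, the wrapper, and the result holder follow the classes described in section 3:

// Sketch of the inference thread body; not a verbatim copy of the demo.
private final Runnable mInferenceRunnable = new Runnable() {
    @Override
    public void run() {
        while (mInferenceRunning) {
            ImageBufferQueue.ImageBuffer buffer = mImageBufferQueue.getReadyBuffer();
            if (buffer == null) {
                try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                continue;
            }

            long start = System.currentTimeMillis();
            InferenceResult.OutputBuffer outputs = mInferenceWrapper.run(
                    buffer.mImage_handle, CAMERA_PREVIEW_WIDTH, CAMERA_PREVIEW_HEIGHT);
            mInferenceResult.setResult(outputs);       // cache the raw grids for post-processing
            mImageBufferQueue.releaseBuffer(buffer);   // return the buffer to the camera side

            mFps = 1000.f / Math.max(1, System.currentTimeMillis() - start);
            mHandler.sendEmptyMessage(MSG_UPDATE_UI);  // ask the main thread to refresh the UI
        }
    }
};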

UI Rendering

UI rendering is summarized in the diagram below.

showTrackSelectResults()

  • Draws detection boxes and category labels on a Canvas.
  • Displays the result via an ImageView.
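
A minimal sketch of the drawing step is shown below, assuming the normalized [0, 1] coordinates produced by ObjectTracker.tracker() (see 3.3) and an ImageView field named mTrackResultView; the getTrackId() getter is assumed to exist alongside the setter shown in 3.2:

// Simplified drawing sketch; showTrackSelectResults() in the demo is more elaborate.
private void drawResults(ArrayList<InferenceResult.Recognition> recognitions) {
    Bitmap bitmap = Bitmap.createBitmap(mTrackResultView.getWidth(),
            mTrackResultView.getHeight(), Bitmap.Config.ARGB_8888);
    Canvas canvas = new Canvas(bitmap);

    Paint boxPaint = new Paint();
    boxPaint.setStyle(Paint.Style.STROKE);
    boxPaint.setStrokeWidth(4);
    boxPaint.setColor(Color.RED);

    Paint textPaint = new Paint();
    textPaint.setColor(Color.YELLOW);
    textPaint.setTextSize(32);

    for (InferenceResult.Recognition rec : recognitions) {
        RectF r = rec.getLocation();                       // normalized coordinates
        RectF scaled = new RectF(r.left * bitmap.getWidth(), r.top * bitmap.getHeight(),
                r.right * bitmap.getWidth(), r.bottom * bitmap.getHeight());
        canvas.drawRect(scaled, boxPaint);
        canvas.drawText("id:" + rec.getTrackId() + " cls:" + rec.getId(),
                scaled.left, scaled.top - 8, textPaint);
    }
    mTrackResultView.setImageBitmap(bitmap);
}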

Handler Message Mechanism

  • The main thread updates the FPS value and the bitmap of detection results.
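
A sketch of such a Handler is shown below. The message ID and view fields are illustrative, and the getResult() call corresponds to the lazy post-processing sketched in section 3.2:

// Main-thread Handler sketch; message IDs and field names are assumptions.
private final Handler mHandler = new Handler(Looper.getMainLooper()) {
    @Override
    public void handleMessage(Message msg) {
        if (msg.what == MSG_UPDATE_UI) {
            mFpsText.setText(String.format("FPS: %.1f", mFps));
            ArrayList<InferenceResult.Recognition> recognitions =
                    mInferenceResult.getResult(mInferenceWrapper);   // post-process + track
            drawResults(recognitions);                               // see the drawing sketch above
        }
    }
};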

3. Java Layer Related Classes

3.1 Core Wrapper Class for YOLO Target Detection Model: InferenceWrapper

This class is responsible for interacting with the underlying C++ implementation of the RKNN inference engine, completing the entire process of model initialization, inference execution, and post-processing.

Member Variables:

OutputBuffer mOutputs; // Stores the raw output of the model
DetectResultGroup mDetectResults; // Container for native post-processing results
int OBJ_NUMB_MAX_SIZE = 64; // Maximum limit for the number of detected objects

These two data structures (OutputBuffer and DetectResultGroup) are introduced later, in the InferenceResult class.

Model Initialization:

    public int initModel(int im_height, int im_width, int im_channel, String modelPath) throws Exception {
        mOutputs = new InferenceResult.OutputBuffer();
        // Pre-allocate memory for YOLO's three-layer output:
        mOutputs.mGrid0Out = new byte[255 * 80 * 80];  // 80x80 feature map
        mOutputs.mGrid1Out = new byte[255 * 40 * 40];  // 40x40 feature map 
        mOutputs.mGrid2Out = new byte[255 * 20 * 20];  // 20x20 feature map
        if (navite_init(im_height, im_width, im_channel, modelPath) != 0) {
            throw new IOException("rknn init fail!");
        }
        return 0; 
    }

255 = 3 × (5 + 80): YOLOv5 uses 3 anchors per grid cell, and each anchor predicts 5 parameters (4 box coordinates + 1 objectness score) plus 80 class scores.

Inference Execution:

public OutputBuffer run(long img_buf_handle, int camera_width, int camera_height) {
    native_run(img_buf_handle, camera_width, camera_height, 
              mOutputs.mGrid0Out, mOutputs.mGrid1Out, mOutputs.mGrid2Out);
    return mOutputs;
}

// JNI interface
private native int native_run(long img_buf_handle, int cam_width, int cam_height, byte[] grid0Out, byte[] grid1Out, byte[] grid2Out);

The parameter img_buf_handle is the native handle of the image buffer. The results are directly written into the pre-allocated Java-layer buffers via JNI.

Post-Processing Flow:

public ArrayList<Recognition> postProcess(OutputBuffer outputs) {
    // 1. Initialize result container
    mDetectResults = new DetectResultGroup();
    mDetectResults.ids = new int[OBJ_NUMB_MAX_SIZE];
    mDetectResults.scores = new float[OBJ_NUMB_MAX_SIZE]; 
    mDetectResults.boxes = new float[4 * OBJ_NUMB_MAX_SIZE];

    // 2. Call native post-processing
    int count = native_post_process(outputs.mGrid0Out, outputs.mGrid1Out, outputs.mGrid2Out,
                                  mDetectResults.ids, mDetectResults.scores, mDetectResults.boxes);

    // 3. Convert to Recognition objects
    ArrayList<Recognition> recognitions = new ArrayList<>();
    for (int i = 0; i < count; ++i) {
        RectF rect = new RectF(
            mDetectResults.boxes[i*4+0], // left
            mDetectResults.boxes[i*4+1], // top
            mDetectResults.boxes[i*4+2], // right
            mDetectResults.boxes[i*4+3]  // bottom
        );
        recognitions.add(new Recognition(
            mDetectResults.ids[i],
            mDetectResults.scores[i], 
            rect
        ));
    }
    return recognitions;
}

Native Methods:

  • navite_init(): Initializes the RKNN model, loading the specified model file from the given path.
  • native_deinit(): Releases model resources.
  • native_run(): Executes inference, writing results into the Java-layer buffers.
  • native_post_process(): Executes NMS and other post-processing, returning the detection results.

The timing diagram below shows the interaction between InferenceWrapper, the camera thread, and the RKNN JNI layer:


Relationship with InferenceResult:

  • Provides raw output (OutputBuffer) to InferenceResult for storage
  • Executes postProcess() to generate the final recognition results.

In Cooperation with ObjectTracker:

  • The generated list of Recognition objects is passed to the tracker for ID association.

3.2 InferenceResult Class

A utility class for processing target detection inference results, mainly responsible for storing inference outputs, processing detection results, and tracking detected objects.

Main Member Variables:

OutputBuffer mOutputBuffer; // Stores the raw data of inference output
ArrayList<Recognition> recognitions = null; // Stores processed recognition results
private boolean mIsVaild = false; // Marks whether results need to be recalculated
PostProcess mPostProcess = new PostProcess(); // Post-processor
private ObjectTracker mSSDObjectTracker; // Object tracker

Important Internal Classes (Data Structures):

  • OutputBuffer: Stores the three-layer output grids of the YOLO model
public static class OutputBuffer {
    public byte[] mGrid0Out;
    public byte[] mGrid1Out;
    public byte[] mGrid2Out;
}
  • Recognition: Represents a single detection result
public static class Recognition {
    private int trackId = 0; // Tracking ID
    private final int id; // Category ID
    private final Float confidence; // Confidence
    private RectF location; // Location rectangle
    
    // Methods include getter/setter and toString()
}
  • DetectResultGroup: Used to store native post-processing results
public static class DetectResultGroup {
    public int count = 0; // Number of detections
    public int[] ids; // Array of category IDs
    public float[] scores; // Array of scores
    public float[] boxes; // Array of bounding boxes
}

Main Member Methods:

  • Reset Method: Clears results, resets flags, and resets ObjectTracker object
public void reset() {
    if (recognitions != null) {
        recognitions.clear();
        mIsVaild = true;
    }
    mSSDObjectTracker = new ObjectTracker(CAMERA_PREVIEW_WIDTH, CAMERA_PREVIEW_HEIGHT, 3);
}
  • Set Inference Result:
public synchronized void setResult(OutputBuffer outputs) {
    // Clone or copy inference output data
    if (mOutputBuffer.mGrid0Out == null) {
        mOutputBuffer.mGrid0Out = outputs.mGrid0Out.clone();
        mOutputBuffer.mGrid1Out = outputs.mGrid1Out.clone();
        mOutputBuffer.mGrid2Out = outputs.mGrid2Out.clone();
    } else {
        System.arraycopy(outputs.mGrid0Out, 0, mOutputBuffer.mGrid0Out, 0, outputs.mGrid0Out.length);
        System.arraycopy(outputs.mGrid1Out, 0, mOutputBuffer.mGrid1Out, 0, outputs.mGrid1Out.length);
        System.arraycopy(outputs.mGrid2Out, 0, mOutputBuffer.mGrid2Out, 0, outputs.mGrid2Out.length);
    }
    mIsVaild = false; // Mark as needing reprocessing
}
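
The mIsVaild flag enables lazy evaluation: raw output is cached cheaply on the inference thread, and the heavier post-processing plus tracking only run when the UI actually asks for results. A hedged sketch of such a getter (the demo's exact method name and signature may differ) looks like this:

// Recompute recognitions only when new raw output is pending, then track for stable IDs.
public synchronized ArrayList<Recognition> getResult(InferenceWrapper wrapper) {
    if (!mIsVaild) {
        recognitions = wrapper.postProcess(mOutputBuffer);        // decode grids + NMS
        recognitions = mSSDObjectTracker.tracker(recognitions);   // assign tracking IDs
        mIsVaild = true;                                          // results are now up to date
    }
    return recognitions;
}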

3.3 ObjectTracker

The core class that implements target tracking. Based on JNI calls into the underlying tracking algorithm, it assigns a unique ID to each detected object and maintains tracking across frames.

Core Member Variables:

  • mHandle (long): Native handle of the underlying tracker.
  • mMaxTrackLifetime (int): Maximum number of frames a target is retained after it is lost (default 3).
  • mWidth / mHeight (int): Image resolution.
  • MAX_TRACKED_NUM (static int): Maximum number of tracked targets (64).
  • track_input_* (static arrays): Buffers for input data.
  • track_output_* (static arrays): Buffers for output data.

Core Tracking Method:

public ArrayList<InferenceResult.Recognition> tracker(
    ArrayList<InferenceResult.Recognition> recognitions) {
    
    // 1. Prepare input data
    for (int i = 0; i < recognitions.size(); ++i) {
        RectF loc = recognitions.get(i).getLocation();
        track_input_location[4*i] = loc.left;
        track_input_location[4*i+1] = loc.top;
        track_input_location[4*i+2] = loc.right;
        track_input_location[4*i+3] = loc.bottom;
        track_input_class[i] = recognitions.get(i).getId();
        track_input_score[i] = recognitions.get(i).getConfidence();
    }

    // 2. Call Native Tracking
    int[] track_output_num = new int[1];
    native_track(mHandle, mMaxTrackLifetime,
        recognitions.size(), track_input_location, 
        track_input_class, track_input_score,
        track_output_num, track_output_location,
        track_output_class, track_output_score,
        track_output_id, mWidth, mHeight);

    // 3. Convert output results
    ArrayList<InferenceResult.Recognition> results = new ArrayList<>();
    for (int i = 0; i < track_output_num[0]; ++i) {
        RectF rect = new RectF(
            track_output_location[i*4]/mWidth,
            track_output_location[i*4+1]/mHeight,
            track_output_location[i*4+2]/mWidth,
            track_output_location[i*4+3]/mHeight);
        
        InferenceResult.Recognition recog = new InferenceResult.Recognition(
            track_output_class[i],
            track_output_score[i] == -10000 ? 0 : track_output_score[i],
            rect);
        recog.setTrackId(track_output_id[i]); // Set tracking ID
        results.add(recog);
    }
    return results;
}

Target tracking is an important feature of this demo; the specific algorithm is implemented in the native layer and will be introduced later.

native_track() executes cross-frame target tracking. Parameter description:

  • handle: tracker instance handle
  • maxTrackLifetime: maximum number of lost frames to tolerate
  • track_input_*: input data (location / class / score)
  • track_output_*: output data (location / class / score / ID)
  • width / height: image resolution

3.4 ImageBufferQueue

Implementation of a circular buffer queue, mainly used to manage the production-consumption process of image buffers, optimized specifically for Rockchip NPU hardware acceleration.

Buffer State Machine:

public class ImageBuffer {
    static public final int STATUS_INVAILD = 0;  // Invalid/Idle state
    static public final int STATUS_READY = 1;    // Data ready for processing
    static public final int STATUS_USING = 2;    // Currently in use
}

Member Variables

  • mQueueBuffer (ImageBuffer[]): Buffer array.
  • mQueueBufferSize (int): Total number of buffers.
  • mCurrentFreeBufferIndex (int): Index of the current free buffer.
  • mCurrentUsingBufferIndex (int): Index of the buffer currently in use.
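
A sketch of how such a queue is typically constructed is shown below: a fixed number of NPU-backed image buffers are pre-allocated up front. The constructor signature and initial index values are assumptions for illustration:

public ImageBufferQueue(int bufferSize, int width, int height) {
    mQueueBufferSize = bufferSize;
    mQueueBuffer = new ImageBuffer[bufferSize];
    for (int i = 0; i < bufferSize; ++i) {
        mQueueBuffer[i] = new ImageBuffer(width, height);     // allocates an NPU RGB888 buffer
        mQueueBuffer[i].mStatus = ImageBuffer.STATUS_INVAILD; // starts out idle
    }
    mCurrentFreeBufferIndex = -1;   // getFreeBuffer()/getReadyBuffer() advance these indices
    mCurrentUsingBufferIndex = -1;
}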

Timing Diagram:


3.4.1 Buffer Acquisition and Release

Get Free Buffer (Producer):

    public synchronized ImageBuffer getFreeBuffer() {
        // Circularly find the first buffer that is not in the USING state
        int index = mCurrentFreeBufferIndex;

        for (int i=0; i<mQueueBufferSize; ++i) {
            ++index;

            if (index >= mQueueBufferSize) {
                index = 0;
            }

            if (mQueueBuffer[index].mStatus != ImageBuffer.STATUS_USING) {
                break;
            }
        }

        mCurrentFreeBufferIndex = index;

        mQueueBuffer[index].mStatus = ImageBuffer.STATUS_INVAILD;
        return mQueueBuffer[index];
    }

Submit Ready Buffer:

    public synchronized void postBuffer(ImageBuffer buffer) {
        buffer.mStatus = ImageBuffer.STATUS_READY;
    }

Get Ready Buffer (Consumer):

    public synchronized ImageBuffer getReadyBuffer() {
        // Circularly find a buffer in the READY state
        int index = mCurrentUsingBufferIndex;

        for (int i=0; i<mQueueBufferSize; ++i) {
            ++index;

            if (index >= mQueueBufferSize) {
                index = 0;
            }

            if (mQueueBuffer[index].mStatus == ImageBuffer.STATUS_READY) {
                break;
            }
        }

        if ((index != mCurrentUsingBufferIndex) && (mQueueBuffer[index].mStatus == ImageBuffer.STATUS_READY)) {
            mCurrentUsingBufferIndex = index;
            mQueueBuffer[index].mStatus = ImageBuffer.STATUS_USING;

            return mQueueBuffer[index];
        }

        return null;
    }

Release Used Buffer:

public synchronized void releaseBuffer(ImageBuffer buffer) {
    buffer.mStatus = ImageBuffer.STATUS_INVAILD; // Reset to invalid state
}

3.4.2 NPU Memory Management

public class ImageBuffer {
    public long mImage_handle; // NPU memory handle
    
    public ImageBuffer(int width, int height) {
        // Create NPU image buffer (RGB888 format)
        mImage_handle = create_npu_img_buffer(RK_FORMAT_RGB_888);
    }
    
    protected void finalize() {
        // Release NPU memory
        release_npu_img_buffer(mImage_handle);
    }
}

4. Native Layer

Following the flow of the Java layer, the calls eventually reach the native-layer implementation. All JNI interfaces are in the native-lib.cc source file. Below are some key points:

4.1 Image Format Conversion

In the Java layer, onPreviewFrame() calls the native color_convert_and_flip interface through RGA.java:

    public static int colorConvertAndFlip(byte[] src, int srcFmt, long npu_buf_handle, int dstFmt,
                                   int width, int height, int flip) {

        if (src == null) {
            Log.w("rkyolo.RGA", "src or dst is null");
            return -1;
        }

        return color_convert_and_flip(src, srcFmt, npu_buf_handle, dstFmt, width, height, flip);
    }

    private static native int color_convert_and_flip(byte[] src, int srcFmt, long npu_buf_handle, int dstFmt,
                                            int width, int height, int flip);

JNI Bridge Implementation: See native-lib.cc

extern "C"
JNIEXPORT jint JNICALL
Java_com_rockchip_gpadc_demo_rga_RGA_color_1convert_1and_1flip(
    JNIEnv* env, jclass clazz,
    jbyteArray src, jint src_fmt,
    jlong npu_buf_handle,
    jint dst_fmt,
    jint width, jint height, jint flip) 
{
    // Get the Native pointer of the Java array
    jbyte* src_buf = env->GetByteArrayElements(src, nullptr);
    
    // Call the core processing function
    jint ret = colorConvertAndFlip(src_buf, src_fmt, npu_buf_handle, 
                                 dst_fmt, width, height, flip);
    
    // Release Java array resources
    env->ReleaseByteArrayElements(src, src_buf, 0);
    
    return ret;
}

Core Logic Implementation: See yolo-image.cc

// Convert NPU buffer handle
auto p_dst_buf = reinterpret_cast<img_npu_buffer*>(npu_buf_handle);

// Create RGA source buffer
rga_buffer_handle_t rga_src_handle = 
    importbuffer_virtualaddr(src, width*height*4);
rga_buffer_t rga_src = 
    wrapbuffer_handle(rga_src_handle, width, height, srcFmt);
// Image processing
if (DO_NOT_FLIP == flip) {
    // Only color space conversion
    ret = imcvtcolor(rga_src, p_dst_buf->rga_buf, 
                    rga_src.format, p_dst_buf->rga_buf.format);
} else {
    // Color conversion + flip processing
    ret = imflip(rga_src, p_dst_buf->rga_buf, flip);
}
// Resource release
if (rga_src_handle) {
    releasebuffer_handle(rga_src_handle);
}

The core methods imcvtcolor and imflip are implemented in the RGA library (librga).

4.2 Target Tracking

The target tracking code is located in cpp/object_tracker, with JNI bridging in native-lib.cc

4.2.1 Starting with JNI Bridging Methods

Bridge Method Signature:

JNIEXPORT void JNICALL Java_com_rockchip_gpadc_demo_tracker_ObjectTracker_native_1track
    (JNIEnv * env, jobject obj,
     jlong handle,
     jint maxTrackLifetime,
     jint track_input_num,
     jfloatArray track_input_locations,
     jintArray track_input_class,
     jfloatArray track_input_score,
     jintArray track_output_num,
     jfloatArray track_output_locations,
     jintArray track_output_class,
     jfloatArray track_output_score,
     jintArray track_output_id,
     jint width,
     jint height)

Parameter Description:

  • handle (jlong): Native tracker instance handle.
  • maxTrackLifetime (jint): Maximum number of frames to retain a target after it is lost.
  • track_input_num (jint): Number of input targets.
  • track_input_locations (jfloatArray): Input target location array (4 × num).
  • track_input_class (jintArray): Input target category array.
  • track_input_score (jfloatArray): Input target confidence array.
  • track_output_num (jintArray): Output target count.
  • track_output_locations (jfloatArray): Output target location array (4 × num).
  • track_output_class (jintArray): Output target category array.
  • track_output_score (jfloatArray): Output target confidence array.
  • track_output_id (jintArray): Output target tracking ID array.
  • width (jint): Image width.
  • height (jint): Image height.
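
For reference, the matching Java-side declaration in ObjectTracker can be reconstructed from this JNI signature and the call site shown in section 3.3 (a reconstruction with adapted parameter names, not copied from the demo):

// Java-side native declaration corresponding to the JNI bridge above.
private native void native_track(long handle, int maxTrackLifetime,
        int trackInputNum, float[] trackInputLocations, int[] trackInputClass,
        float[] trackInputScore, int[] trackOutputNum, float[] trackOutputLocations,
        int[] trackOutputClass, float[] trackOutputScore, int[] trackOutputId,
        int width, int height);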

4.2.2 Core Data Structures

  • Basic Structure:
typedef struct {
    int x, y;       // Top-left coordinates
    int width, height; // Width and height
} Rect_T;

typedef struct {
    Rect_T r;       // Bounding box
    int obj_class;  // Category ID
    float score;    // Confidence
    int id;         // Tracking ID
    int reserve[1]; // Reserved field
} object_T;
  • Extended Structure for Tracking Objects:
struct ExtObject {
    int id;                 // Tracking ID
    Rect_T location;        // Current location
    Rect_T predict_loc_when_miss; // Predicted location when lost
    Rect_T smooth_rect;      // Smoothed position
    ObjectStatus status;     // Status
    int detectedNum;         // Number of detections
    int noDetectedNum;       // Number of non-detections
    int obj_class;          // Category
    float score;            // Confidence
    int miss;               // Loss flag
};

4.2.3 track Method

The bridge method directly calls the track method in track_link.cpp, which is responsible for associating detection results with existing tracked targets and outputting tracking results with stable IDs.

Parameter Description:

  • handle (long): Native handle of the tracker instance.
  • maxTrackLifetime (int): Maximum number of frames to retain a target after it is lost.
  • track_input_num (int): Number of input targets.
  • c_track_input_locations (float*): Input target location array (4 floats per target: x1, y1, x2, y2).
  • c_track_input_class (int*): Input target category array.
  • c_track_input_score (float*): Input target confidence array.
  • c_track_output_num (int*): Output target count.
  • c_track_output_locations (float*): Output target location array.
  • c_track_output_class (int*): Output target category array.
  • c_track_output_score (float*): Output target confidence array.
  • c_track_output_id (int*): Output target tracking ID array.
  • width (int): Image width.
  • height (int): Image height.

Input Data Conversion: Converts the bounding box format passed from the Java layer to the internal processing format

for (int i = 0; i < track_input_num; i++) {
    // Convert [x1,y1,x2,y2] to [x,y,width,height] format
    object_input[i].r.x = (int)c_track_input_locations[i*4 +0];
    object_input[i].r.y = (int)c_track_input_locations[i*4 +1];
    object_input[i].r.width = (int)(c_track_input_locations[i*4 +2] - c_track_input_locations[i*4 +0]);
    object_input[i].r.height = (int)(c_track_input_locations[i*4 +3] - c_track_input_locations[i*4 +1]);
    
    // Copy category and confidence
    object_input[i].score = c_track_input_score[i];
    object_input[i].obj_class = c_track_input_class[i];
}

Execute Tracking Update: Calls the tracking algorithm to update the target state

object->update(maxTrackLifetime, track_input_num, object_input, 
              c_track_output_num, object_output, width, height);

Output Data Conversion: Converts the internal format to the output format expected by the Java layer

for (int i = 0; i < *c_track_output_num; i++) {
    // Convert [x,y,width,height] back to [x1,y1,x2,y2] format
    c_track_output_locations[i*4 +0] = (float)object_output[i].r.x;
    c_track_output_locations[i*4 +1] = (float)object_output[i].r.y;
    c_track_output_locations[i*4 +2] = (float)(object_output[i].r.x + object_output[i].r.width);
    c_track_output_locations[i*4 +3] = (float)(object_output[i].r.y + object_output[i].r.height);
    
    // Copy category, confidence, and tracking ID
    c_track_output_class[i] = object_output[i].obj_class;
    c_track_output_score[i] = object_output[i].score;
    c_track_output_id[i] = object_output[i].id;
}

4.2.4 OdtDetector::update Method

This method is responsible for processing newly detected targets and associating them with existing tracking targets, outputting tracking results with stable IDs.

Parameter Description:

  • maxTrackLifetime (int): Maximum number of frames to retain a target after it is lost.
  • track_num_input (int): Number of input targets.
  • object_input (object_T*): Input target array (includes location, category, score).
  • track_num_output (int*): Output target count (returned via pointer).
  • object_output (object_T*): Output target array.
  • width (int): Image width.
  • height (int): Image height.

Main Process

  • Data Preparation Phase: Converts input data to vector format for subsequent processing
std::vector<Rect_T> objects_Rect_T;
std::vector<int> objects_class;
std::vector<float> object_score;

for (int i = 0; i < track_num_input; i++) {
    objects_Rect_T.push_back(object_input[i].r);
    objects_class.push_back(object_input[i].obj_class);
    object_score.push_back(object_input[i].score);
}
  • Core Tracking Update: Calls the underlying tracker to associate targets and update their states
m_objects_tracker.updateTrackedObjects(objects_Rect_T, objects_class, 
                                     object_score, maxTrackLifetime, 
                                     m_width, m_height);
  • Get Tracking Results: Retrieves extended target information from the tracker
std::vector<ObjectsTracker::ExtObject> extObjects;
m_objects_tracker.getObjects(extObjects);
  • Result Output Processing: Selects output location information based on target status and fills complete output data
int nobjects = (int)extObjects.size();
track_num_output[0] = nobjects;

for (int i = 0; i < nobjects && i < 100; i++) {
    if(extObjects[i].miss == 0) {
        object_output[i].r = extObjects[i].smooth_rect; // Use smoothed position
    } else {
        object_output[i].r = extObjects[i].predict_loc_when_miss; // Use predicted position
    }
    object_output[i].obj_class = extObjects[i].obj_class;
    object_output[i].score = extObjects[i].score;
    object_output[i].id = extObjects[i].id;
}

4.2.5 ObjectsTracker::updateTrackedObjects Method

This is the core algorithm of the multi-target tracking system. It uses an IOU (Intersection over Union)-based matching strategy, combined with motion prediction and state management, to achieve stable association between detection boxes and tracked targets.

TrackedObject Tracking Target Structure:

struct TrackedObject {
    std::vector<Rect_T> lastPositions;  // Historical position records
    Rect_T predict_loc_when_miss;      // Predicted location when lost
    Rect_T smooth_Positionn;           // Smoothed position
    float vx, vy;                      // Motion speed
    int numDetectedFrames;             // Total number of detection frames
    int numFramesDetected;             // Continuous detection frames
    int numFramesNotDetected;          // Continuous loss frames
    int id;                            // Unique ID
    int obj_class;                     // Target category
    float score;                       // Confidence
    int miss;                          // Loss flag
};

Main Process:

  • Initialization
int N1 = trackedObjects.size();  // Number of existing tracking targets
int N2 = detectedObjects.size(); // Number of new detected targets

// Initialize matching relationship table
std::vector<int> correspondence(N2, NEW_RECTANGLE); 
std::vector<float> correspondenceScore(N2, 0);
  • Motion Prediction: Calculates motion speed based on historical positions to predict current location
for(int i=0; i < N1; i++) {
    if(numpositions>1) {
        predict_loctation(curObject, image_width, image_height, pre_x, pre_y, vx, vy);
        curObject.vx = vx;
        curObject.vy = vy;
        // Update predicted location
        prevRect.x = (int)pre_x;
        prevRect.y = (int)pre_y;
        curObject.predict_loc_when_miss = prevRect;
    }
}
  • Data Association

Primary Matching: Calculates matching scores based on IOU and the number of lost frames
for(int j=0; j < N2; j++) {
    float iou = CalculateIOU(prevRect, detectedObjects[j]);
    if (iou > 0.1f) {
        float trackScore = iou / (curObject.numFramesNotDetected + 1);
        if (iou > bestArea && correspondenceScore[j] < trackScore) {
            bestIndex = j;
            bestArea = iou;
            correspondenceScore[j] = trackScore;
        }
    }
}
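
For reference, CalculateIOU is the standard intersection-over-union of two rectangles. A minimal Java version over the [x, y, width, height] boxes used here would look like the following (the native implementation may differ in detail):

// Standard IOU for two boxes given as x, y, width, height.
static float calculateIou(int ax, int ay, int aw, int ah,
                          int bx, int by, int bw, int bh) {
    int interLeft   = Math.max(ax, bx);
    int interTop    = Math.max(ay, by);
    int interRight  = Math.min(ax + aw, bx + bw);
    int interBottom = Math.min(ay + ah, by + bh);

    float inter = Math.max(0, interRight - interLeft) * (float) Math.max(0, interBottom - interTop);
    float union = (float) aw * ah + (float) bw * bh - inter;
    return union <= 0 ? 0.f : inter / union;
}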

Conflict Resolution: Handles conflicts of overlapping detection boxes

for(int j=0; j < N2; j++) {
    if(correspondence[j] >= 0) continue;
    float iou = CalculateIOU(detectedObjects[j], detectedObjects[bestIndex]);
    if(iou > 0.45f) {
        correspondence[j] = INTERSECTED_RECTANGLE; // Mark as conflict box
    }
}
  • State Update

Update Existing Targets: Uses weighted average for position smoothing, combined with motion prediction
trackedObjects[i].lastPositions.push_back(detectedObjects[j]);
// Maintain the length of the historical position queue
while(trackedObjects[i].lastPositions.size() > parameters.numLastPositionsToTrack) {
    trackedObjects[i].lastPositions.erase(trackedObjects[i].lastPositions.begin());
}

// Smoothing processing
trackedObjects[i].smooth_Positionn.x = 
    ((smooth.x + smooth.width*0.5f)*0.5f + (detect.x + detect.width*0.5f)*1.f)/1.5f
    + vx*0.5f/1.5f - smooth.width*0.5f;

Create New Targets: Creates new tracking targets for unmatched detection boxes

trackedObjects.push_back(detectedObjects[j]);
trackedObjects.back().obj_class = objects_class[j];
trackedObjects.back().score = objects_score[j];

Conclusion

This article took many days to write (and was quite exhausting). Some of the C++ code still needs further study. The target tracking, the RGB conversion, and the image buffer queue are all worth understanding well, and they can be reused in other projects. Writing is not easy, so feedback and support are appreciated.
