Performance Optimization Methods for C++ Deployment

01 Use Structures to Store Common Variables in Advance

When writing preprocessing and postprocessing functions, certain values, such as the shape of the model input tensors and the input/output counts, are used many times. If they are recalculated inside every processing function, the computational load during deployment increases. In such cases, consider defining a structure together with an initialization function: compute the required values once in advance, and whenever they are needed, simply pass the structure by reference (&).

// Define structure
struct ModelInfo {
    hbDNNPackedHandle_t packed_handle;
    hbDNNHandle_t       model_handle;
    const char *        model_path;
    const char **       model_name_list;
    int model_count;
    int input_count;
    int output_count;
};
// Function declaration
int init_model(ModelInfo &model_info);
int other_function(ModelInfo &model_info, ...);
// Main function
int main() {
    // Initialization
    ModelInfo prefill_model = {0};
    prefill_model.model_path = drobotics_model_path_prefill.c_str();
    init_model(prefill_model);
    // Use reference passing for related parameters in other functions
    other_function(prefill_model, ...);
    return 0;
}
// Complete definition of the initialization function
int init_model(ModelInfo &model_info) {
    HB_CHECK_SUCCESS(hbDNNInitializeFromFiles(&model_info.packed_handle, &model_info.model_path, 1),
                     "hbDNNInitializeFromFiles failed");
    HB_CHECK_SUCCESS(hbDNNGetModelNameList(&model_info.model_name_list, &model_info.model_count, model_info.packed_handle),
                     "hbDNNGetModelNameList failed");
    HB_CHECK_SUCCESS(hbDNNGetModelHandle(&model_info.model_handle, model_info.packed_handle, model_info.model_name_list[0]),
                     "hbDNNGetModelHandle failed");
    HB_CHECK_SUCCESS(hbDNNGetInputCount(&model_info.input_count, model_info.model_handle), "hbDNNGetInputCount failed");
    HB_CHECK_SUCCESS(hbDNNGetOutputCount(&model_info.output_count, model_info.model_handle), "hbDNNGetOutputCount failed");
    return 0;
}
// Use reference passing in other function parameters
int other_function(ModelInfo &model_info, ...) {
    ...
}

02 Use References Instead of Value Passing in Functions

Given how C++ passes arguments, it is recommended to pass function parameters by reference (&) rather than by value. This has two significant advantages (a short sketch follows the list):

  1. Only the reference to the original object is passed to the function, avoiding unnecessary copies and reducing computation time.
  2. Since data is not copied, using references can avoid the overhead of memory duplication compared to value passing, thus reducing memory usage.
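
As a minimal sketch (the FrameData structure and both function names are hypothetical, used only for illustration), passing by const reference avoids copying a potentially large object on every call while still protecting it from modification:

#include <vector>

// Hypothetical type: a large buffer wrapper used only for illustration
struct FrameData {
    std::vector<float> pixels;  // may hold millions of elements
};

// Pass by value: the entire pixels vector is copied on every call
float mean_by_value(FrameData frame);

// Pass by const reference: no copy is made, and the caller's data cannot be modified
float mean_by_ref(const FrameData &frame) {
    if (frame.pixels.empty()) return 0.0f;
    float sum = 0.0f;
    for (float v : frame.pixels) {
        sum += v;
    }
    return sum / frame.pixels.size();
}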

However, it is important to note that a reference allows the function to modify the original data; if the original data must not be modified, either avoid the reference or pass it as a const reference.

03 Quantization/Dequantization Fusion

3.1 Fusion in the Preprocessing and Postprocessing Loop

Preprocessing and postprocessing usually traverse the data, and quantization/dequantization involves another traversal. Merging the two calculations therefore reduces traversal time. This is the most common form of quantization/dequantization fusion, and numerous source code examples can be found in the AI benchmarks; a simple sketch is also given below.

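As a sketch of this pattern (the function, scale variable, and threshold below are assumptions rather than code from a specific benchmark), the dequantization can be folded into the postprocessing loop that already scans the output, instead of dequantizing the whole tensor in a separate pass:

#include <cstdint>
#include <vector>

// Hypothetical postprocessing step: collect the indices whose dequantized score
// exceeds a threshold. The dequantization (value * scale) is fused into the single
// filtering loop, so the output tensor is traversed only once.
std::vector<int> filter_scores(const int16_t *output_data, int element_count,
                               float scale, float score_threshold) {
    std::vector<int> kept_indices;
    for (int i = 0; i < element_count; i++) {
        float score = output_data[i] * scale;  // dequantize only while traversing
        if (score > score_threshold) {
            kept_indices.push_back(i);
        }
    }
    return kept_indices;
}

A further refinement is to convert the threshold into the quantized domain once before the loop, so the comparison itself stays in integer arithmetic.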

3.2 Fusion When Storing Data into Tensors

If no fusion opportunity is found in preprocessing, the quantization calculation can also be performed while copying data into the input tensor.

// kv_count tracks the read position in the source buffer kv_decode
int64_t kv_count = 0;
int8_t* input_ptr = reinterpret_cast<int8_t*>(model_info.input_tensors[i].sysMem.virAddr);
// Quantize each element while copying it into the input tensor,
// instead of quantizing the whole buffer in a separate pass
for (int n = 0; n < total_count; n++) {
    input_ptr[n] = quantize_int8(kv_decode[kv_count++], cur_scale, cur_zero_point);
}

3.3 Precompute Quantized Values When Filling Initial Values

Sometimes we want to prepare a specific input for the model, for example generating a zero array and filling a particular region of it with a fixed floating-point value. Generating the complete floating-point array first and then traversing the whole array to quantize it costs an unnecessary extra traversal. A common optimization is to precompute the quantized result of the fill value and write the fixed-point value directly, avoiding the redundant quantization pass.

std::vector<int16_t> prepare_decode_attention_mask(ModelInfo &model_info,
    DecodeInfo &decode_info, PrefillInfo &prefill_info, int decode_infer_num) {
    // Initialize a zero array
    std::vector<int16_t> decode_attention_mask_int(decode_info.kv_cache_len, 0);
    // Precompute the quantized results of the fill values
    hbDNNQuantiScale scale = model_info.input_tensors[1].properties.scale;
    auto cur_scale = scale.scaleData[0];
    auto cur_zero_point = scale.zeroPointData[0];
    int16_t pad_value_int = quantize_s16(-2048.0, cur_scale, cur_zero_point);
    // Fill the quantized values into specific areas of the array
    for(int i = 0; i < decode_info.kv_cache_len - prefill_info.tokens_len - decode_infer_num - 1; i++) {
        decode_attention_mask_int[i] = pad_value_int;
    }
    // Return the quantized array
    return decode_attention_mask_int;
}

3.4 Skip Dequantization Based on the Actual Role of Postprocessing

In some cases, for example when postprocessing only performs an argmax, dequantization is unnecessary: the argmax can be computed directly on the integer data, because with a single positive scale, dequantization is a monotonically increasing mapping and does not change which index holds the maximum. Users need to decide whether this optimization applies based on what the postprocessing actually computes.

// Perform the argmax directly on the int16_t data output by the model (no dequantization)
int logits_argmax(std::vector<hbDNNTensor> &output_tensor) {
    auto data_tensor = reinterpret_cast<int16_t*>(output_tensor[0].sysMem.virAddr);
    // 151936 is the number of output logits (the model's vocabulary size)
    int maxIndex = 0;
    int16_t maxValue = data_tensor[0];
    for (int i = 1; i < 151936; ++i) {
        if (data_tensor[i] > maxValue) {
            maxValue = data_tensor[i];
            maxIndex = i;
        }
    }
    return maxIndex;
}

04 Store Output Data Directly into Input Tensor When Inferring the Same Model in a Loop

In some cases, we want the C++ program to repeatedly infer the same model, where the output of the previous frame serves as the input of the next frame. With the conventional approach, we would copy the output tensor's contents into an intermediate array and then copy that array into the input tensor, which costs two data copies and extra memory.

In fact, we can point the model's output tensor directly at the input tensor's memory, so the inference result of the first frame is written straight into the input tensor. The second frame can then use this data without any separate input preparation, saving a significant amount of time.

To use this method, the shape/stride information of the corresponding input and output nodes must be identical. Additionally, if the quantization/dequantization operators have been removed from the model and the corresponding scales are identical, the reused portion of the tensors does not need to be flushed (no CPU access is involved), saving further time.

Here is a detailed example.

Assume a model with 59 input nodes (0-58) and 57 output nodes (0-56), with the quantization/dequantization operators removed, where the last 56 input nodes and the last 56 output nodes share the same scale/shape/stride information.

After the first frame's inference completes, the values of output nodes 1-56 need to be passed to input nodes 3-58. Therefore, when allocating the model's input and output tensors, only output node 0 needs a separately allocated output tensor; while allocating the input tensors, the tensors for input nodes 3-58 can be pushed into the output tensor vector at the same time. Concretely, it can be written as follows:

int prepare_tensor(std::vector<hbDNNTensor> &input_tensor, std::vector<hbDNNTensor> &output_tensor,
                   hbDNNHandle_t dnn_handle) {
    int input_count  = 0;
    int output_count = 0;
    hbDNNGetInputCount(&input_count, dnn_handle);
    hbDNNGetOutputCount(&output_count, dnn_handle);
    // Only output node 0 needs its own allocation; the remaining output nodes reuse input memory below
    for (int i = 0; i < 1; i++) {
        hbDNNTensor output;
        HB_CHECK_SUCCESS(hbDNNGetOutputTensorProperties(&output.properties, dnn_handle, i),
                         "hbDNNGetOutputTensorProperties failed");
        int output_memSize = output.properties.alignedByteSize;
        HB_CHECK_SUCCESS(hbUCPMallocCached(&output.sysMem, output_memSize, 0), "hbUCPMallocCached failed");
        output_tensor.push_back(output);
    }
    for (int i = 0; i < input_count; i++) {
        hbDNNTensor input;
        HB_CHECK_SUCCESS(hbDNNGetInputTensorProperties(&input.properties, dnn_handle, i),
                         "hbDNNGetInputTensorProperties failed");
        int input_memSize = input.properties.alignedByteSize;
        HB_CHECK_SUCCESS(hbUCPMallocCached(&input.sysMem, input_memSize, 0), "hbUCPMallocCached failed");
        input_tensor.push_back(input);
        if (i > 2) {
            // Input nodes 3-58 share memory with output nodes 1-56
            output_tensor.push_back(input);
        }
    }
    return 0;
}

During model inference, the reused portion of the tensor does not need to be flushed, so only the output_tensor at index 0 and the input_tensor at indices 0/1/2 need to be flushed (these tensors interact with the CPU).

while(1) {
    hbUCPTaskHandle_t task_handle_decode{nullptr};
    hbDNNTensor *output_decode = decode_model.output_tensors.data();
    HB_CHECK_SUCCESS(hbDNNInferV2(&task_handle_decode, output_decode,
         decode_model.input_tensors.data(), decode_model.model_handle), "hbDNNInferV2 failed");
    hbUCPSchedParam ctrl_param_decode;
    HB_UCP_INITIALIZE_SCHED_PARAM(&ctrl_param_decode);
    ctrl_param_decode.backend = HB_UCP_BPU_CORE_ANY;
    HB_CHECK_SUCCESS(hbUCPSubmitTask(task_handle_decode, &ctrl_param_decode), "hbUCPSubmitTask failed");
    HB_CHECK_SUCCESS(hbUCPWaitTaskDone(task_handle_decode, 0), "hbUCPWaitTaskDone failed");
    // Only refresh part of the output memory (output_tensor 0)
    hbUCPMemFlush(&decode_model.output_tensors[0].sysMem, HB_SYS_MEM_CACHE_INVALIDATE);
    HB_CHECK_SUCCESS(hbUCPReleaseTask(task_handle_decode), "hbUCPReleaseTask failed");
    // Postprocessing (only for output_tensor 0)
    decode_argmax_id = logits_argmax(decode_model.output_tensors);
    // Prepare input data for the next frame inference for input_tensor 0/1/2
    prepare_input_tensor(...);
    // Only refresh part of the input memory (input_tensor 0/1/2)
    for (int i = 0; i < 3; i++) {
         hbUCPMemFlush(&decode_model.input_tensors[i].sysMem, HB_SYS_MEM_CACHE_CLEAN);
    }
}

Additionally, when this optimization is used, take care not to free the same memory block more than once when releasing memory after inference. In this case, release all input tensors first, and then release only output_tensor[0].

for (int i = 0; i < decode_model.input_count; i++) {
    HB_CHECK_SUCCESS(hbUCPFree(&decode_model.input_tensors[i].sysMem), "hbUCPFree decode_model.input_tensors failed");
}
for (int i = 0; i < 1; i++) {
    HB_CHECK_SUCCESS(hbUCPFree(&decode_model.output_tensors[i].sysMem), "hbUCPFree decode_model.output_tensors failed");
}

05 Multithreaded Postprocessing

Models such as YOLOv5 produce three output heads. Consider postprocessing the three heads in three parallel threads, which can significantly improve performance. A minimal sketch using std::thread follows.

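In this sketch, process_head() and the Detection type are hypothetical placeholders; the real per-head decoding logic depends on the model:

#include <functional>
#include <thread>
#include <vector>

// Hypothetical result type, for illustration only
struct Detection {
    float x, y, w, h, score;
    int class_id;
};

// Hypothetical per-head postprocessing, implemented elsewhere (model-specific):
// it decodes one output head into results
void process_head(hbDNNTensor &head_tensor, std::vector<Detection> &results);

void postprocess_three_heads(std::vector<hbDNNTensor> &output_tensors,
                             std::vector<std::vector<Detection>> &results) {
    results.resize(3);
    std::vector<std::thread> workers;
    // Launch one thread per output head
    for (int i = 0; i < 3; i++) {
        workers.emplace_back(process_head, std::ref(output_tensors[i]), std::ref(results[i]));
    }
    // Wait for all three heads to finish before the final merge/NMS step
    for (auto &t : workers) {
        t.join();
    }
}

Each thread writes only its own results[i] slot, so no locking is needed; any cross-head merging or NMS runs after join().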
