The preceding performance analysis may show that the model performs below expectations. This section covers Horizon's recommendations and measures for improving model performance, including:
Checking performance-affecting YAML configuration parameters.
CPU OP processing.
Note that checking YAML configuration parameters and CPU OP processing apply only to scenarios where the model is compiled using hb_compile.
Some parameters in the model conversion configuration file can affect the model's final performance, so check that they are specified as you intended. For the definitions and functions of all parameters, refer to the Specific Parameter Information section.
The debug_mode parameter is used for accuracy debugging analysis.
If dump_all_layers_output is configured, a dequantize output node is added to every convolution and matmul operator during model conversion so that intermediate results can be dumped. This significantly reduces the model's on-board performance, so remember to remove dump_all_layers_output from debug_mode before evaluating performance.
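As a sketch of what this looks like in the conversion YAML (the group name `compiler_parameters` is assumed here from the usual hb_compile configuration layout; verify it against your toolchain version), a performance-evaluation configuration should leave dump_all_layers_output out of debug_mode:

```yaml
# Fragment of the model conversion YAML (group name assumed).
compiler_parameters:
  # Accuracy-debugging setting: inserts a dequantize output node after
  # every conv/matmul so intermediate results can be dumped. Remove it
  # before any performance evaluation.
  # debug_mode: "dump_all_layers_output"
  debug_mode: ""
```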
The compile_mode parameter selects whether the compiler optimizes for bandwidth or latency when compiling the model.
If you are concerned about performance, set it to latency.
The optimize_level parameter selects the compiler's optimization level. O0: no optimization; fastest compilation speed and lowest optimization level.
O1 to O2: as the optimization level increases, the compiled model is expected to execute faster, but compilation also takes longer.
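For a latency-focused deployment, these two parameters are typically set together. A minimal sketch follows (the group name `compiler_parameters` is an assumption based on the usual hb_compile configuration layout; check your toolchain's parameter reference):

```yaml
# Fragment of the model conversion YAML (group name assumed).
compiler_parameters:
  # Optimize for single-frame latency rather than bandwidth.
  compile_mode: latency
  # Highest optimization level: faster model execution,
  # at the cost of longer compilation time.
  optimize_level: O2
```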
The max_time_per_fc parameter controls the execution time of a single function call in the compiled model's instruction stream, which implements model-priority preemption.
Setting this parameter changes the function-call execution time of the preempted model and therefore affects its on-board performance.
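A sketch of how this parameter might be set (both the group name and the value shown are illustrative assumptions; consult your toolchain's parameter reference for the valid range and unit):

```yaml
# Fragment of the model conversion YAML (group name and value assumed).
compiler_parameters:
  # Caps the execution time of each function call so that a
  # higher-priority model can preempt sooner. Smaller values give
  # finer-grained preemption but add scheduling overhead to the
  # preempted model, reducing its on-board performance.
  max_time_per_fc: 1000
```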
If evaluation with hrt_model_exec perf confirms that the performance bottleneck is caused by operators running on the CPU, first check whether those operators can be supported by the BPU, as described in the Toolchain Operator Support Constraint List section.
If the operator parameters used fall outside the constraints supported by the BPU, we suggest adjusting the corresponding computation parameters of the original floating-point model back into the supported range. To quickly locate the out-of-range parameter(s), run the model check procedure described in the Check the Model section; the tool prints the out-of-range parameters to the console.
Note that you will need to handle any effect on model performance caused by modifying the original floating-point model's parameters.
A classic example is a Convolution whose input_channel or output_channel exceeds the restrictions: reducing the number of channels quickly makes the OP supportable by the BPU, but it can also affect model accuracy.
If an operator is not supported by the BPU at all, optimize it according to the following cases:
CPU operator in the middle of the model
For a CPU operator in the middle of the model, we recommend first trying parameter adjustment, operator replacement, or model modification.
CPU operator at the beginning or end of the model
For a CPU operator at the beginning or end of the model, refer to the following example, which uses quantize/dequantize nodes as an illustration.
For nodes connected to the model input and output, you can add the remove_node_type parameter to the model_parameters configuration group of the YAML file and recompile the model.
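A sketch of such a configuration follows. The node-type names in the value are illustrative assumptions (the exact names accepted, and the separator syntax, depend on your toolchain version):

```yaml
# Fragment of the model conversion YAML.
model_parameters:
  # Hypothetical value: strip quantize/dequantize nodes attached to the
  # model's input/output so they no longer run on the CPU at runtime.
  # Check your toolchain's parameter reference for the accepted
  # node-type names.
  remove_node_type: "Quantize;Dequantize"
```

After recompiling with this setting, the removed pre/post-processing work (if still needed) must be handled by your application code outside the model.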
The academic community continuously optimizes the computational efficiency of algorithm models (for the same accuracy, less theoretical computation means higher efficiency) and their parameter efficiency (for the same accuracy, fewer parameters mean higher efficiency). Representative models such as EfficientNet and ResNeXt use Depthwise Convolution and Group Convolution respectively. However, because GPUs and TPUs support such high-efficiency models poorly and cannot fully exploit their advantages, the community was pushed to optimize EfficientNetV2 and NFNet for GPU/TPU. These optimizations mainly reduce the use of Depthwise Convolution and significantly enlarge the group size in Group Convolution; as a result, they lower the computation and parameter efficiency of the original models.