The Horizon PTQ floating-point conversion toolchain supports both Caffe1.0 and ONNX (10 ≤ opset_version ≤ 19, ir_version ≤ 9) model formats. The ONNX export methods are as follows:
| Training Framework | ONNX Export Methods |
|---|---|
| Caffe1.0 | Horizon native support, no need to export ONNX |
| Other frameworks | ONNX Tutorials |
When converting models, the toolchain implements model parsing and forwarding through an interface wrapped around the public ONNXRuntime. Therefore, before using the toolchain, check the validity of the original floating-point ONNX model itself (i.e., whether it can run inference properly) and whether accuracy bias was introduced when exporting ONNX from the training framework. Specific tests can be found in The HBRuntime Inference Library section.
The S100 PTQ tool no longer supports modifying the input layout format of the model during the conversion process; PTQ keeps your model layout unchanged. If you need to modify the layout, change it in the model before exporting ONNX.
If modifications are required at the model level, it is recommended that they be implemented in the following way:
Add a transpose in the DL framework so that its input layout is NHWC, then re-export the ONNX model;
Use the onnx library to modify the model directly, as referenced below:
In the yaml configuration file for model conversion, the compilation parameter group provides the optimize_level parameter to select the optimization level of model compilation; the available range is O0~O2:
O0: no optimization; fastest compilation speed; suitable for verifying model conversion functionality and debugging different calibration methods.
O1~O2: the higher the optimization level, the larger the search space during compilation and optimization.
The compiler's optimization strategy is not at the operator granularity level, but is a global optimization for the whole model.
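As a reference, the optimization level might be configured in the conversion yaml like this (a hedged sketch: the group name compiler_parameters follows common Horizon PTQ yaml conventions and may differ in your version; other required parameters are omitted):

```yaml
compiler_parameters:
  # O0 = no optimization, fastest compilation; O1~O2 = progressively larger search space
  optimize_level: 'O2'
```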
According to the original model type, we will discuss this issue by dividing it into dynamic input models and non-dynamic input models.
The input_batch parameter can only be used for models whose input_shape has a first dimension of 1 (if the model has multiple inputs, the first dimension of input_shape must be 1 for all of them), and it only takes effect when the original ONNX model itself supports multi-batch inference. This parameter accepts a single value, which applies to all inputs when the model has multiple inputs.
The shape of each calibration data sample should be the same as input_shape.
Dynamic input model: the original model has a dynamic input, for example ?x3x224x224 (dynamic input models must use the input_shape parameter to specify the model input information).
When input_shape is configured as 1x3x224x224 and you want to compile a multi-batch model, you can use the input_batch parameter; each calibration data sample then has shape 1x3x224x224.
When the first dimension of input_shape is an integer greater than 1, the original model itself is treated as a multi-batch model, the input_batch parameter cannot be used, and you need to pay attention to the shape of each calibration data sample. For example, if input_shape is 4x3x224x224, each calibration data sample must have shape 4x3x224x224.
Non-dynamic input model:
When shape[0] of the input is 1, the input_batch parameter can be used.
Each calibration data sample has the same shape as the original model input.
When shape[0] of the input is not 1, the input_batch parameter is not supported.
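For example, compiling a 4-batch model from a dynamic-input original might look like this in the conversion yaml (a hedged sketch: the input_parameters group name and layout are illustrative; only the parameters discussed here are shown):

```yaml
input_parameters:
  input_shape: '1x3x224x224'   # fixes the dynamic '?' batch dimension to 1
  input_batch: 4               # compiled model is 4-batch; calibration data stays 1x3x224x224
```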
This is normal. It is possible for the model input order to change during the conversion of a multi-input model. The possible cases are shown in the following example.
original floating-point model input order: input1, input2, input3.
original.onnx model input order: input1, input2, input3.
quanti.bc model input order: input2, input1, input3.
hbm model input order: input3, input2, input1.
When you do accuracy consistency alignment, please make sure the input order is correct, otherwise it may affect the accuracy result.
If you want to check the hbm model input order, you can use the hb_model_info command to check it.
The input order listed in the input_parameters info group is the hbm model input order.
This model is only output when the PTQ model conversion fails; it has no special meaning. You can provide this model and the model conversion log file to Horizon for debugging analysis.
The model conversion log prints the name and shape of the input and output nodes of the model, but sometimes the shape is 0, as shown in the following figure.
This occurs mainly in the batch dimension: the output shapes of some models are dynamic (stored using dim_param), and the tool's forward shape inference (shape_inference) uses a placeholder ('?') for them, which the log prints as 0.
This is expected; it does not affect model conversion and is no cause for concern.
To improve the on-board performance of the model, the model conversion includes an Optimizer module that performs graph optimizations on the model. The Optimizer module mainly provides functions such as Constant Folding, Operator Replacement, Operator Property/Input Update, Operator Fusion, and Operator Move.
The Optimizer module may change the operators of your model; the optimizations it performs are equivalent graph transformations.
Case 1: The second input to the where operator in the model is -Inf
To ensure that the where operator can be quantized, it is split into a combination of several quantizable operators during the optimization phase of the conversion process. When the second input of where is -Inf, NaN can appear during conversion. The reason is as follows:
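A numpy sketch of the mechanism: if Where(cond, x, y) is rewritten as cond*x + (1-cond)*y (a common quantizable decomposition; the toolchain's exact rewrite may differ), then the branch that should be masked out computes 0 * (-inf), which is NaN by IEEE-754 rules.

```python
import numpy as np

cond = np.array([1.0, 0.0], dtype=np.float32)
x = np.array([5.0, 5.0], dtype=np.float32)
y = np.array([-np.inf, -np.inf], dtype=np.float32)

# The original Where selects per element and never evaluates 0 * (-inf).
direct = np.where(cond.astype(bool), x, y)        # [5., -inf] — fine

# The arithmetic decomposition does: (1 - 1) * (-inf) = 0 * (-inf) = NaN.
decomposed = cond * x + (1.0 - cond) * y          # [nan, -inf]
```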
Case 2: All 0 tensor in the model
1) concat produces an all-0 tensor.

Taking the above structure as an example, both inputs to concat are all-0 constants, so after the subsequent slice there may be a branch in which all the data flowing through is 0, leading to illegal thresholds (i.e., thresholds of 0) in HzCalibration.
2) equal produces an all-0 tensor.
Taking the above structure as an example, the model input onnx::Not_3 is a bool tensor; after cast+equal, the tensor contains all-false data, and after cast(bool -> float32) it becomes an all-0 tensor.
In this case, test with multiple sets of data to rule out the possibility that the all-0 tensors are caused by the calibration data.
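A minimal helper for that check (not part of the toolchain; the function name and tolerance are illustrative): scan each calibration sample and flag the all-zero ones, which would otherwise produce illegal 0 thresholds during calibration.

```python
import numpy as np

def all_zero(tensor: np.ndarray, atol: float = 0.0) -> bool:
    """True if every element is (numerically) zero."""
    return not np.any(np.abs(tensor) > atol)

# In practice these would come from np.load() over your calibration .npy files.
samples = [
    np.zeros((1, 3, 8, 8), dtype=np.float32),
    np.random.default_rng(0).normal(size=(1, 3, 8, 8)).astype(np.float32),
]
flags = [all_zero(s) for s in samples]
print(flags)  # [True, False] — the first sample would be suspicious
```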
BPU acceleration: the operator can be quantized and accelerated by the BPU hardware during board-side inference. Most operators (such as conv) are supported directly by the hardware; some are replaced with other operators to achieve acceleration (e.g., gemm is replaced with conv); and others are quantized passively depending on their context (e.g., Reshape and Transpose require the operators before and after them to be BPU operators).
CPU computation: operators that cannot be directly or indirectly accelerated by the BPU hardware are placed on the CPU by the toolchain, and the runtime prediction library automatically handles the heterogeneous scheduling between the two during model inference.
BPU supports conversion between common model input formats (e.g. nv12 -> rgb/bgr) and data normalization, which can be configured via yaml file.
See the description of input_type_train, input_type_rt, mean_value, scale_value and std_value parameters.
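A hedged yaml sketch of such a configuration (the parameter names are those listed above; the input_parameters group name and the concrete values are illustrative):

```yaml
input_parameters:
  input_type_train: 'bgr'   # color format the model was trained with
  input_type_rt: 'nv12'     # format fed to the model on the board; BPU converts nv12 -> bgr
  mean_value: 123.675 116.28 103.53   # illustrative per-channel means for normalization
  scale_value: 0.0171 0.0175 0.0174   # illustrative per-channel scales
```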
The optimize_level parameter in the yaml file configures the compiler optimization level from O0 to O2; the higher the level, the larger the search space and, usually, the longer the compilation time.
The optimization level does not apply specific optimization strategies at the operator granularity; most operator-level optimizations are independent of the optimization level (and are not time-consuming).
The optimization level is mainly for global optimization, which is the analysis and optimization of the whole model.
By default, the PTQ toolchain inserts a quantization operator at the beginning of a featuremap-input model to map the input data from float32 to lower-bit computation, and inserts dequantization operators at the end of all models to map the output data from low-bit computation (or int32 by default if the BPU part ends with conv) back to float32. These quantization/dequantization operators are not efficient on the CPU, especially when the data shape is large.
Therefore, we prefer to integrate quantization/dequantization operations into pre- and post-processing, which is the most efficient way.
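For intuition, a sketch of the mapping such a quantize/dequantize pair performs, assuming symmetric int8 with a per-tensor scale (the toolchain's exact quantization scheme may differ). Folding these steps into pre/post-processing avoids a per-element traversal on the CPU:

```python
import numpy as np

def quantize_int8(x: np.ndarray, scale: np.float32) -> np.ndarray:
    # float32 -> int8: scale, round, saturate to [-128, 127]
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    # int8 -> float32: rescale back
    return q.astype(np.float32) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale = np.float32(1.0 / 127)
q = quantize_int8(x, scale)       # [-127, 0, 64, 127]
x_hat = dequantize(q, scale)      # approximately the original values
```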
For an operator in the model to run on the BPU, besides satisfying the BPU support conditions, its quantization threshold must also be obtainable during calibration.
The quantization thresholds of some non-computation-intensive operators (such as concat, reshape, etc.) depend on the featuremap tensors of the upstream and downstream operators.
Therefore, if these operators are at the beginning and end of the model they will run on the CPU by default.
In this case, for better performance, a unitconv can be inserted before/after such an operator to introduce a new quantization threshold statistic, which in turn allows it to be quantized on the BPU.
However, it should be noted that this approach may introduce some quantization loss.
Take a model whose output goes through a conv+reshape+concat structure as an example: by default, the toolchain outputs the conv with high-accuracy int32, dequantizes it to float32, and then sends it to reshape and concat on the CPU.
If unitconv is inserted after concat, the whole structure will run on BPU with low accuracy of int8.
Although the final unitconv can still output with high-accuracy int32, compressing the accuracy of the preceding conv output has already introduced a certain amount of quantization loss.
Therefore, please consider whether to insert unitconv to optimize the performance.
When CPU operators that cannot be accelerated sit between BPU operators in the model, switching computation between the BPU and the CPU introduces performance loss in two ways:
CPU operator performance is much lower than that of the BPU operator.
The heterogeneous scheduling between CPU and BPU also introduces quantization and dequantization operators (running on the CPU); because their internal computation traverses the data, their time consumption is proportional to the tensor size.
The time spent in the above CPU operators and quantization/dequantization operators can be measured by passing the profile_path parameter to the board-side tool hrt_model_exec. Horizon Robotics recommends building the model from BPU operators as much as possible for better performance.
Most operators use int8 computation by default, some support int16 and fp16 computation, and the range of supported operators continues to expand; see the Toolchain Operator Support Constraints List. In addition:
If the BPU part of the model ends in Conv, the operator defaults to int32 high accuracy output;
The DSP hardware also supports int8/int16/float32 computation.
For the preparation of PTQ model calibration data, refer to the Data Preparation - Model Calibration Set Preparation section. Also, for featuremap input models, please do your own preprocessing of the data and save it as an npy file of the desired input type via the numpy.save interface.
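For featuremap inputs, dumping a preprocessed sample with numpy.save might look like this (a sketch: the shape, dtype, filename, and the random stand-in for real preprocessing are all illustrative):

```python
import os
import tempfile
import numpy as np

# Hypothetical preprocessed featuremap sample — shape and dtype must match
# the model's input; random data stands in for your real preprocessing.
sample = np.random.default_rng(0).random((1, 3, 224, 224)).astype(np.float32)

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "calib_000.npy")  # illustrative filename
np.save(path, sample)

loaded = np.load(path)
print(loaded.shape, loaded.dtype)  # (1, 3, 224, 224) float32
```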
During the model conversion phase, if the debug_mode parameter in the yaml file is configured with dump_all_layers_output, a dequantization output node is added for each convolution and matmul operator, which significantly degrades the performance of the model on the board.
Among them, the output_nodes parameter can specify any node in the model as an output node, which is more convenient for us to debug and tune.
In addition, the hb_verifier tool can be used to compare the consistency of the fixed-point model quantized_model.bc with the hbm model on the board.
During the board deployment phase, the hrt_model_exec tool also supports saving node outputs (including nodes specified with the output_nodes parameter) in bin or txt format, as described in The hrt_model_exec Tool Introduction .
First, we need to understand the following two concepts:
Currently, only a Conv operator at the tail of the model supports int32 high-precision output.
Normally, the model conversion fuses Conv with its subsequent BN and ReLU/ReLU6 during the optimization stage. However, due to limitations of the BPU hardware itself, a Conv that outputs int32 high precision at the end of the model does not support operator fusion.
Therefore, if the model ends with Conv+ReLU/ReLU6, then to ensure the overall accuracy of the quantized model, the Conv uses int32 output by default while the ReLU/ReLU6 runs on the CPU. Similarly, other tail operators run on the CPU for the same reason: the Conv needs higher-accuracy output. However, Horizon supports running these operators on the BPU by configuring quant_config in the yaml file for better performance, at the cost of some accuracy loss.