Based on the accuracy evaluation in the previous section, you may find that the accuracy is lower than expected. This section describes how to use the accuracy tuning tools and functions to reduce quantization accuracy loss during PTQ model conversion, or to help you locate the cause of the loss.
All of the accuracy tuning described below applies to the calibrated_model.onnx generated during the quantization process described in the previous section.
You can tune the model accuracy by adjusting the quantization method or the computation accuracy as follows:
You can adjust the model quantization method by configuring different calibration methods, quantization parameter search methods, or the independent calibration functions:
Configure the calibration method
You can try adjusting the model calibration method, such as kl, max, and other calibration methods; the configuration method can be found in section The quant_config Introduction.
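As an illustrative sketch only, a calibration-method entry might look like the fragment below. The key names (`model_config`, `calibration_type`) are assumptions for illustration; refer to section The quant_config Introduction for the authoritative schema.

```python
# Hypothetical quant_config fragment; key names are illustrative
# assumptions, not the authoritative schema.
quant_config = {
    "model_config": {
        # Try different calibration methods, e.g. "kl" or "max",
        # and compare the resulting quantization accuracy.
        "calibration_type": "kl",
    }
}
```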
Configure the quantization parameter search methods
Two calibration parameter search methods of different granularity are supported:
modelwise_search: searches for quantization parameters at the model level. This method allows multiple calibration methods to be configured at once, and finds the calibration method with minimal quantization loss by comparing a (configurable) quantization loss metric of the model output before and after quantization.
layerwise_search: searches for quantization parameters at the node level. This method computes the (configurable) quantization loss metric from the model output before and after quantization for each node, and assigns the calibration method with minimal quantization loss to that node.
The configuration method can be found in section The quant_config Introduction.
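To make the two granularities concrete, a sketch of what each search configuration might look like follows. All key names (`calibration_type`, `search_method`, `quant_loss_metric`) and the metric value are illustrative assumptions; see section The quant_config Introduction for the actual fields.

```python
# Hypothetical quant_config fragments for the two search granularities;
# key names and values are illustrative assumptions.

# modelwise_search: list several calibration methods at once; the tool
# keeps the one whose model-level quantization loss metric is smallest.
modelwise_config = {
    "model_config": {
        "calibration_type": ["kl", "max"],
        "search_method": "modelwise_search",
        "quant_loss_metric": "cosine-similarity",  # metric is configurable
    }
}

# layerwise_search: the loss metric is computed per node, and each node
# is assigned the calibration method that minimizes its own loss.
layerwise_config = {
    "model_config": {
        "calibration_type": ["kl", "max"],
        "search_method": "layerwise_search",
    }
}
```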
Independent Quantization Function Configuration
Enabling the independent quantization mode can reduce computational resource consumption. You can try configuring the parameters per_channel, asymmetric, and bias_correction; the configuration method can be found in section The quant_config Introduction.
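A minimal sketch of enabling these three functions is shown below. The nesting and key names are assumptions based only on the parameter names mentioned above; consult section The quant_config Introduction for the real structure.

```python
# Hypothetical quant_config fragment enabling the independent
# quantization functions; the nesting is an illustrative assumption.
quant_config = {
    "model_config": {
        "per_channel": True,       # per-channel quantization scales
        "asymmetric": True,        # asymmetric (zero-point) quantization
        "bias_correction": True,   # compensate quantization-induced bias
    }
}
```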
In addition to configuring the quantization method, you can try configuring the computation accuracy (dtype) of the model's operators for accuracy tuning. Currently, the computation accuracy of operators can be configured at three levels: model, op_type, and op_name, and the supported types are int8, int16, float16, and float32. The configuration method can be found in section The quant_config Introduction.
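The three configuration levels can be sketched as follows. The section names (`model_config`, `op_config`, `node_config`), the `node_type` key, and the node names are all hypothetical; only the three levels and the four dtypes come from this document.

```python
# Hypothetical quant_config fragment showing the three levels at which
# operator computation accuracy (dtype) can be set; all key and node
# names are illustrative assumptions.
quant_config = {
    # Model level: default dtype for every node in the model.
    "model_config": {
        "all_node_type": "int16",
    },
    # op_type level: override for every node of a given operator type.
    "op_config": {
        "Softmax": {"node_type": "float16"},
    },
    # op_name level: override for one specific node.
    "node_config": {
        "/backbone/conv1/Conv": {"node_type": "float32"},
    },
}
```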
If you want to locate the exact operators that caused the quantization accuracy loss, we also provide an accuracy debug tool to assist you; see section Accuracy Debug Tool.
Based on our experience with typical model accuracy tuning, we provide below an accuracy tuning process that balances ease of use and practicality:
The tuning flowchart is described in detail below:
| Tuning Area | Milestone | Detailed Description | Auxiliary Function |
| --- | --- | --- | --- |
| Default int8 quantization model accuracy | Verify whether the int8 quantization accuracy meets your expectations. | Without any configuration of the quantization parameters, perform the model conversion using the default int8 quantization, test the model's quantization accuracy, and evaluate whether it meets the standard. | |
| Mixed precision quantization tuning | Verify that the upper limit of model accuracy meets your expectations, and determine which type of mixed precision to use for subsequent tuning. | Configure all_node_type=int16 via quant_config to quantize as many nodes as possible with int16, then obtain the upper limit of the int16 quantized model's accuracy. If the accuracy meets the standard, int8+int16 mixed precision tuning can be performed later. | |
| | | Full int16 quantization accuracy can be improved through error compensation. Modify the weights of the Conv and ConvTranspose operators, as well as the inputs of MatMul, GridSample, and Resize, to int16 in the calibrated model, then obtain the upper limit of model accuracy. If the accuracy meets the standard, full int16 accuracy optimization will be performed later. | |
| | | Configure all_node_type=float16 via quant_config, then obtain the upper limit of the float16 quantized model's accuracy. If the accuracy meets the standard, int16+float16 mixed precision tuning can be performed later. | |
| | | If the accuracy still cannot meet the standard even with all_node_type configured as float16, float16+float32 mixed precision tuning can be performed later. | |
| | By explicitly specifying the quantization precision of operators, complete the mixed precision tuning. | int8+int16 mixed precision tuning: use the accuracy debug tool to analyze nodes with high quantization loss in the int8 calibrated model and configure them to int16 quantization via quant_config to complete the accuracy fine-tuning. The calibration method for int8+int16 mixed precision can reuse the method automatically selected by the system when all_node_type=int16 is configured. | |
| | | Full int16 accuracy optimization: after configuring all_node_type to int16, hardware constraints and inference time may still leave int8-quantized nodes in the calibrated model, including Conv and ConvTranspose weights, Resize, GridSample, and the second input of MatMul. First, modify the full int16 calibrated model by changing these calibration nodes' qtype from int8 to int16. If the accuracy then meets the standard, configure the node input as ec via quant_config to compensate for the int8 quantization loss. Note that the quantization accuracy of ec will be slightly worse than that of int16. | |
| | | int16+float16 mixed precision tuning: use the accuracy debug tool to analyze nodes or structures with high quantization loss in the int16 calibrated model, identify the smallest set of nodes to raise to float16, and configure them to float16 via quant_config to complete the accuracy fine-tuning. | |
| | | float16+float32 mixed precision tuning: use the accuracy debug tool to analyze nodes or structures with high quantization loss in the float16 calibrated model, identify the smallest set of nodes to raise to float32, and configure them to float32 via quant_config. If you have inference-time requirements, please try QAT. | |
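The mixed precision fine-tuning step described above can be sketched as follows. The node names and config keys (`node_config`, `node_type`) are hypothetical placeholders; the actual fields are documented in section The quant_config Introduction.

```python
# Hypothetical sketch of int8+int16 mixed precision tuning: nodes that
# the accuracy debug tool reported as having high quantization loss are
# raised to int16, while the rest of the model stays at the int8 default.
sensitive_nodes = [
    "/head/reg_conv/Conv",   # hypothetical node names, as would be
    "/head/cls_conv/Conv",   # reported by the accuracy debug tool
]

quant_config = {
    "node_config": {name: {"node_type": "int16"} for name in sensitive_nodes},
}
```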
If setting the sensitive nodes identified by the debug tool to higher precision does not effectively improve model accuracy, you can analyze the model structure and try to raise the precision of typical operators or sub-structures that carry a higher risk of quantization loss: