When converting a model, you can use quant_config to configure the model's quantization parameters at four levels:
model_config: Configures the overall quantization parameters for the model. The key is a custom name.
op_config: Configures the quantization parameters for all nodes of a given operator type. The key is the operator type.
subgraph_config: Configures the quantization parameters for a subgraph. The key is the subgraph name.
node_config: Configures the quantization parameters for a specific node. The key is the node name.
The four levels have a priority order: the finer the configuration granularity, the higher the priority, i.e., model_config < op_config < subgraph_config < node_config. When a node is configured at more than one level, the setting from the highest-priority level takes effect.
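The priority rule can be illustrated with a small sketch. Only the ordering model_config < op_config < subgraph_config < node_config comes from the text; the function `resolve_qtype`, the `subgraph_members` argument, and the node names are hypothetical and are not part of the conversion tool.

```python
# Illustrative only: shows how a finer-grained level overrides a coarser one.
# resolve_qtype and subgraph_members are hypothetical helpers, not tool APIs.

def resolve_qtype(quant_config, node_name, op_type,
                  subgraph_members=None, default="int8"):
    """Return the qtype that wins for one node under the priority order."""
    subgraph_members = subgraph_members or {}

    # Lowest priority: model-wide setting.
    qtype = quant_config.get("model_config", {}).get("all_node_type", default)

    # Next: setting for the node's operator type.
    op_cfg = quant_config.get("op_config", {}).get(op_type, {})
    qtype = op_cfg.get("qtype", qtype)

    # Next: any subgraph that contains the node (membership assumed known).
    for name, sub_cfg in quant_config.get("subgraph_config", {}).items():
        if node_name in subgraph_members.get(name, set()):
            qtype = sub_cfg.get("qtype", qtype)

    # Highest priority: setting for this specific node.
    node_cfg = quant_config.get("node_config", {}).get(node_name, {})
    return node_cfg.get("qtype", qtype)


cfg = {
    "model_config": {"all_node_type": "int16"},
    "op_config": {"Conv": {"qtype": "int8"}},
    "node_config": {"conv_3": {"qtype": "float16"}},
}
print(resolve_qtype(cfg, "conv_3", "Conv"))  # node_config overrides op_config
print(resolve_qtype(cfg, "conv_1", "Conv"))  # op_config overrides model_config
print(resolve_qtype(cfg, "relu_1", "Relu"))  # only model_config applies
```

In this sketch each level simply overwrites the result of the coarser level before it, which is one straightforward way to express "the highest-priority level takes effect".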
Because the quantization and compilation process splits and fuses some operators internally, the names of the resulting operators may not exactly match the operator names in the original model. When specifying node_config for such operators, use the operator names found in the optimized_float_model.onnx model generated during the process.
The quant_config supports three computation precision data types: int8, int16, and float16. They are described as follows:
int8: The default quantization type for most operators. You generally do not need to configure it explicitly.
int16: See the section int16 Configuration.
float16: When an operator is configured as float16, the tool sets only that operator to float16 computation precision.
For the float16 data type:
The float16 setting is not propagated to the operators before or after the configured operator.
Different computing platforms support different sets of float16 operators. For the specific support scope, please refer to the Toolchain Operator Support Constraint List.
If you configure float16 for an operator that does not support it, the tool falls back to float32 for that operator.
| Primary Parameter | Secondary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| model_config | all_node_type | String | Optional | Sets the inputs of all nodes in the model to the specified type at once. Valid values: int16, float16. |
| | model_output_type | String | Optional | Sets the output tensor of the model to the specified type. Valid values: int8, int16. |

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- |
| op_config | NodeKind | qtype | String | Optional | Input data type for all nodes of the given operator type. Valid values: int8, int16, float16, float32. |

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- |
| subgraph_config | SubgraphName | inputs | List | Required | Input node names of the enclosed subgraph. Subgraph extraction aborts if the subgraph is not enclosed because input nodes are missing. |
| | | outputs | List | Required | Output node names of the enclosed subgraph. |
| | | qtype | String | Optional | Input data type for all nodes within the subgraph. Valid values: int8, int16, float16, float32. |

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- |
| node_config | NodeName | qtype | String | Optional | Input data type of the node with the specified name. Valid values: int8, int16, float16, float32. |
| | | input0 | String | Optional | Data type of the 0th input of the node with the specified name. Valid values: int8, int16, float16, float32, ec. |
| | | input1 | String | Optional | Data type of the 1st input of the node with the specified name. Valid values: int8, int16, float16, float32, ec. |
The input data type of a node can be specified by qtype, input0, or input1. qtype generally specifies the data type for all inputs of a node, while input0 and input1 specify the data type of the 0th and 1st inputs, respectively. input0 and input1 take priority over qtype.
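As a minimal sketch of this override rule, the helper below computes the data type of the two inputs of one node_config entry. The function name `input_types` and the int8 default for an unconfigured node are assumptions made for illustration.

```python
# Illustrative only: input0/input1 override qtype for their own input.
# input_types is a hypothetical helper, not part of the conversion tool.

def input_types(node_cfg, default="int8"):
    """Return [type of input 0, type of input 1] for one node_config entry."""
    base = node_cfg.get("qtype", default)  # applies to every input
    return [node_cfg.get(key, base) for key in ("input0", "input1")]


print(input_types({"qtype": "int16"}))                   # both inputs int16
print(input_types({"qtype": "int16", "input1": "int8"})) # input1 overridden
print(input_types({"input0": "float16"}))                # input1 keeps default
```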
ec stands for error compensation. It compensates for the precision loss of int8 quantization on specific operators by creating a duplicate of the operator. Currently it supports only the following operator inputs; configuring it on any other operator has no effect.
The weight input of Conv and ConvTranspose operators (specified through the third-level parameter input1 in node_config).
Any input of MatMul operator (specified through the third-level parameter input0 or input1 in node_config).
The 0th input of GridSample and Resize operators (specified through the third-level parameter input0 in node_config).
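The supported combinations listed above can be captured in a small lookup table, which is handy for checking a node_config before conversion. This is a sketch: `EC_SUPPORTED_INPUTS`, `ec_takes_effect`, and the node names `conv_0` and `matmul_0` are hypothetical; only the operator/input pairs come from the list above.

```python
# Illustrative only: which (operator type, input index) pairs accept "ec".
EC_SUPPORTED_INPUTS = {
    "Conv": {1},           # weight input
    "ConvTranspose": {1},  # weight input
    "MatMul": {0, 1},      # either input
    "GridSample": {0},     # 0th input
    "Resize": {0},         # 0th input
}


def ec_takes_effect(op_type, input_index):
    """Return True if configuring ec on this input is expected to take effect."""
    return input_index in EC_SUPPORTED_INPUTS.get(op_type, set())


# A node_config fragment with placeholder node names.
node_config = {
    "conv_0": {"input1": "ec"},
    "matmul_0": {"input0": "ec", "input1": "ec"},
}

print(ec_takes_effect("Conv", 1))  # weight input of Conv
print(ec_takes_effect("Conv", 0))  # not in the supported list
```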
The quant_config supports configuring multiple calibration algorithms, such as kl and max. For each calibration algorithm, you can also control its hyperparameters (if a hyperparameter is not configured, its default value is used). In addition, some independent calibration functions, such as per_channel, asymmetric, and bias_correction, can also be configured.
If quant_config is not configured, multiple preset calibration algorithms are tried by default, and the one with the minimum quantization loss is selected.
| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- |
| model_config | activation | calibration_type | String/List[String] | Optional | Calibration algorithm for activations. Valid values: kl, max. A List of multiple calibration algorithms is also supported. |
| | | num_bin | Int/List[Int] | Optional | Parameter of the kl calibration algorithm. Requires num_bin > 128; the default value is 1024. A List of multiple num_bin values is also supported. |
| | | max_num_bin | Int | Optional | Parameter of the kl calibration algorithm. Requires max_num_bin >= num_bin; the default value is 16384. |
| | | max_percentile | Float/List[Float] | Optional | Parameter of the max calibration algorithm. Valid range: [0.5, 1.0]; the default value is 1.0. A List of multiple max_percentile values is also supported. |
| | | per_channel | Bool/List[Bool] | Optional | Whether per-channel quantization is enabled. Valid values: false, true; the default value is false. A List containing both false and true is also supported. |
| | | asymmetric | Bool/List[Bool] | Optional | Whether asymmetric quantization is enabled. Valid values: false, true; the default value is false. A List containing both false and true is also supported. |

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Quaternary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- | --- |
| model_config | weight | bias_correction | num_sample | Int | Optional | The number of samples for bias correction. Requires num_sample >= 1; the default value is 1. |
| | | | metric | String | Optional | The model error metric for bias correction. Valid values: cosine-similarity, mse, mae, mre, sqnr, chebyshev; the default value is cosine-similarity. |
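The constraints stated in the activation table can be checked before running a conversion. The helper below is a minimal sketch: `check_activation_config` is a hypothetical function, and it only encodes the ranges and defaults given in the table (num_bin > 128, max_num_bin >= num_bin, max_percentile in [0.5, 1.0]).

```python
# Illustrative only: validate an "activation" block against the table above.
# check_activation_config is a hypothetical helper, not a tool API.

def _as_list(value):
    """Accept either a single value or a list of values."""
    return value if isinstance(value, list) else [value]


def check_activation_config(activation):
    """Return a list of violated constraints (empty if none are violated)."""
    errors = []

    for algo in _as_list(activation.get("calibration_type", [])):
        if algo not in ("kl", "max"):
            errors.append(f"unknown calibration_type: {algo}")

    max_num_bin = activation.get("max_num_bin", 16384)  # documented default
    for num_bin in _as_list(activation.get("num_bin", 1024)):
        if num_bin <= 128:
            errors.append(f"num_bin must be > 128, got {num_bin}")
        if max_num_bin < num_bin:
            errors.append(f"max_num_bin {max_num_bin} must be >= num_bin {num_bin}")

    for pct in _as_list(activation.get("max_percentile", 1.0)):
        if not 0.5 <= pct <= 1.0:
            errors.append(f"max_percentile must be in [0.5, 1.0], got {pct}")

    return errors


activation = {
    "calibration_type": ["kl", "max"],
    "num_bin": [1024, 2048],
    "max_percentile": 0.99999,
}
print(check_activation_config(activation))       # no violations expected
print(check_activation_config({"num_bin": 64}))  # num_bin too small
```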
The quant_config supports two search methods with different granularities:
modelwise_search: Searches for quantization parameters at the model level. This method allows multiple calibration algorithms to be configured at once. The quantization loss of each algorithm is computed with the metric you configure, by comparing the model output before and after quantization, and the calibration algorithm with the minimum quantization loss is selected.
layerwise_search: Searches for quantization parameters at the node level. For each node, this method computes the quantization loss with the metric you configure, based on the model output before and after that node is quantized, and assigns the calibration algorithm with the minimum quantization loss to the node.
If multiple calibration algorithms are configured, modelwise search is enabled by default to find the optimal algorithm for the current model. If the layerwise search parameters are configured, a layer-by-layer search for the optimal algorithm is performed.
| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- |
| model_config | modelwise_search | metric | String | Optional | The model error metric for modelwise search. Valid values: cosine-similarity, mse, mae, mre, sqnr, chebyshev; the default value is cosine-similarity. |

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- |
| model_config | layerwise_search | metric | String | Optional | The model error metric for layerwise search. Valid values: cosine-similarity, mse, mae, mre, sqnr, chebyshev; the default value is cosine-similarity. |
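The idea behind modelwise search can be sketched in a few lines: compare the float model output with the output obtained under each candidate calibration algorithm, and keep the candidate with the smallest loss. The code below is a toy illustration using cosine similarity (higher similarity means lower loss); the function names and the sample numbers are made up, and the tool's actual implementation is not described in this text.

```python
import math

# Illustrative only: pick the calibration candidate closest to the float output.

def cosine_similarity(a, b):
    """Cosine similarity between two equally sized lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def modelwise_search(float_output, quantized_outputs):
    """Return the candidate name whose quantized output best matches float_output."""
    return max(
        quantized_outputs,
        key=lambda name: cosine_similarity(float_output, quantized_outputs[name]),
    )


float_output = [1.0, 2.0, 3.0]       # output of the float model
quantized_outputs = {                # output per candidate algorithm
    "kl": [1.0, 2.1, 2.9],
    "max": [1.5, 1.5, 3.5],
}
best = modelwise_search(float_output, quantized_outputs)
print(best)
```

Layerwise search follows the same pattern, except that the comparison is repeated per node and each node receives its own best candidate.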
The following is an example JSON template for quant_config with all the configurable options. You can refer to this template when writing your own configuration.
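As a rough sketch of what such a configuration can look like, the snippet below assembles a quant_config from the parameters listed in the tables above and prints it as JSON. The nesting is inferred from the primary/secondary/tertiary parameter columns; the operator type, subgraph name, node names, and all values are placeholders chosen for illustration and should not be read as the official template.

```python
import json

# Illustrative only: keys follow the parameter tables, values are placeholders.
quant_config = {
    "model_config": {
        "all_node_type": "int16",
        "model_output_type": "int16",
        "activation": {
            "calibration_type": ["kl", "max"],
            "num_bin": [1024, 2048],
            "max_num_bin": 16384,
            "max_percentile": 1.0,
            "per_channel": True,
            "asymmetric": True,
        },
        "weight": {
            "bias_correction": {"num_sample": 1, "metric": "cosine-similarity"},
        },
        "modelwise_search": {"metric": "cosine-similarity"},
        "layerwise_search": {"metric": "cosine-similarity"},
    },
    # Key is an operator type.
    "op_config": {"Conv": {"qtype": "int16"}},
    # Key is a subgraph name; node names here are placeholders.
    "subgraph_config": {
        "subgraph_0": {"inputs": ["node_a"], "outputs": ["node_b"], "qtype": "int16"},
    },
    # Key is a node name; input0/input1 override qtype for that input.
    "node_config": {
        "node_c": {"qtype": "int16", "input0": "int8", "input1": "int16"},
    },
}

print(json.dumps(quant_config, indent=2))
```

In practice you would keep only the entries you need, since every parameter except the subgraph inputs and outputs is optional.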
During model conversion, most operators in the model are quantized to int8 for computation. By configuring the quant_config parameter, you can specify that the input or output of a particular op is computed in int16 (for the range of operators that support int16 configuration, refer to the Toolchain Operator Support Constraint List). The basic principle is as follows.
After you set the input/output data type of an op to int16, model conversion automatically updates and checks the int16 configuration of the ops surrounding it. For example, configuring the input/output data type of op_1 as int16 implicitly requires the previous/next op of op_1 to support int16 computation as well. For unsupported scenarios, the model conversion tool prints a log indicating that the int16 configuration combination is temporarily unsupported and falls back to int8 computation.
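The context check can be pictured with a toy model of a linear chain of ops. The sketch below is a simplification made for illustration: it treats the model as a straight sequence, assumes the set of int16-capable ops is known in advance, and uses hypothetical names throughout. The real tool works on the full graph and reports fallbacks in its log, neither of which is reproduced here.

```python
# Illustrative only: int16 on an op also needs its neighbours to support int16.
# effective_types is a hypothetical helper, not part of the conversion tool.

def effective_types(chain, int16_capable, requested):
    """Return {op: "int16" or "int8"} for each op that was requested as int16.

    chain: op names in execution order.
    int16_capable: set of ops able to compute in int16.
    requested: set of ops configured as int16.
    """
    result = {}
    for i, op in enumerate(chain):
        if op not in requested:
            continue
        # Previous and next op in the chain, where they exist.
        neighbours = chain[max(i - 1, 0):i] + chain[i + 1:i + 2]
        supported = op in int16_capable and all(n in int16_capable for n in neighbours)
        # Unsupported combinations fall back to int8.
        result[op] = "int16" if supported else "int8"
    return result


chain = ["op_0", "op_1", "op_2"]
print(effective_types(chain, {"op_0", "op_1", "op_2"}, {"op_1"}))  # stays int16
print(effective_types(chain, {"op_0", "op_1"}, {"op_1"}))          # falls back
```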