Introduction to quant_config

When converting a model, you can use quant_config to configure the model's quantization parameters at four levels: model_config, op_config, subgraph_config, and node_config:

  • model_config: Configures the overall quantization parameters for the model; the key is a custom name.

  • op_config: Configures the quantization parameters for all nodes of a certain type; the key is the operator type.

  • subgraph_config: Configures the quantization parameters for a subgraph; the key is the subgraph name.

  • node_config: Configures the quantization parameters for a specific node; the key is the name of the node.

There is a priority relationship among the four levels: the finer the configuration granularity, the higher the priority, i.e., model_config < op_config < subgraph_config < node_config. When a node is configured at more than one level at the same time, the level with the highest priority takes effect.
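As a sketch of how the priority rule plays out, the following snippet (operator and node names are illustrative) sets all Conv nodes to int16 at the op level, while one specific Conv node is overridden to int8 at the node level, which takes precedence:

```json
{
  "op_config": {
    // All Conv-type nodes compute in int16...
    "Conv": {"qtype": "int16"}
  },
  "node_config": {
    // ...except Conv_0, whose node-level setting has higher priority
    "Conv_0": {"qtype": "int8"}
  }
}
```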

Note

Since the model quantization and compilation process internally splits and fuses some operators, the names of some split operators may not exactly match those of the original model's operators. For such operators, when specifying node_config, use the corresponding operator names from the optimized_float_model.onnx model generated during the conversion process.

Configure computation accuracy

The quant_config supports three computation accuracy data types: int8, int16, and float16. They are described as follows:

  • int8: The default quantization type for most operators; it generally does not need to be configured explicitly.

  • int16: Refer to the section int16 Configuration.

  • float16: When an operator is configured as float16, the tool sets only that operator itself to the float16 computation accuracy type.

Attention

For the float16 data type:

  • The float16 setting is not propagated to the context operators surrounding the configured operator.

  • Different computing platforms support different ranges of float16 operators. For the specific support scope, please refer to the Toolchain Operator Support Constraint List.

  • If you configure an operator that does not support the float16 type to perform float16 calculation, the tool will fall back to float32 for that operator.
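As a minimal sketch, configuring a single node to run in float16 (the node name is an assumption for illustration) looks like this; recall that the setting applies only to the named operator and is not propagated to its context operators:

```json
{
  "node_config": {
    // Only this node is set to float16; neighboring operators are unaffected
    "Softmax_0": {"qtype": "float16"}
  }
}
```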

Configure the computation accuracy for the entire model

| Primary Parameter | Secondary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| model_config | all_node_type | String | Optional | Set the inputs of all nodes in the model to the specified type at once; optional values: int16, float16. |
| model_config | model_output_type | String | Optional | Set the output tensors of the model to the specified type; optional values: int8, int16. |
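A minimal example of the two model-level parameters above:

```json
{
  "model_config": {
    // Set the inputs of all nodes in the model to int16 at once
    "all_node_type": "int16",
    // Keep the model output tensors in int8
    "model_output_type": "int8"
  }
}
```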

Configure the computation accuracy for nodes with certain type

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| op_config | NodeKind | qtype | String | Optional | Configure the input data type of nodes of a certain type; optional values: int8, int16, float16, float32. |
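For example, the op_config level can set all nodes of given operator types at once (the operator types chosen here are illustrative):

```json
{
  "op_config": {
    // All Conv-type nodes compute in int16
    "Conv": {"qtype": "int16"},
    // All Softmax-type nodes compute in float16
    "Softmax": {"qtype": "float16"}
  }
}
```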

Configure the computation accuracy for a subgraph

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| subgraph_config | SubgraphName | inputs | List | Required | Configure the input node names of the subgraph. The inputs and outputs must enclose a complete subgraph; if they do not (e.g. an input node is missing), the subgraph extraction operation aborts. |
| subgraph_config | SubgraphName | outputs | List | Required | Configure the output node names of the subgraph. |
| subgraph_config | SubgraphName | qtype | String | Optional | Configure the input data type for all nodes within the subgraph; optional values: int8, int16, float16, float32. |
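A sketch of a subgraph-level entry (the subgraph name and node names are assumptions for illustration):

```json
{
  "subgraph_config": {
    // "backbone_tail" is a custom subgraph name
    "backbone_tail": {
      "inputs": ["Conv_10"],
      "outputs": ["Add_12"],
      // All nodes inside the enclosed subgraph compute in int16
      "qtype": "int16"
    }
  }
}
```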

Configure the computation accuracy for a specific node

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| node_config | NodeName | qtype | String | Optional | Configure the input data type of the named node; optional values: int8, int16, float16, float32. |
| node_config | NodeName | input0 | String | Optional | Configure the 0th input data type of the named node; optional values: int8, int16, float16, float32, ec. |
| node_config | NodeName | input1 | String | Optional | Configure the 1st input data type of the named node; optional values: int8, int16, float16, float32, ec. |
Attention

The input data type of a node can be specified by qtype, input0, or input1. qtype specifies the data type for all inputs of a node, while input0 and input1 specify the data type for the 0th and 1st inputs, respectively. input0 and input1 have higher priority than qtype.

ec stands for error compensation. It is a solution that compensates for the precision loss of int8 quantization on specific operators by creating a duplicate of the same operator. Currently, only the following operators are supported; configuring it on other operators has no effect.

  • The weight input of Conv and ConvTranspose operators (specified through the third-level parameter input1 in node_config).

  • Any input of MatMul operator (specified through the third-level parameter input0 or input1 in node_config).

  • The 0th input of GridSample and Resize operators (specified through the third-level parameter input0 in node_config).
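As an illustration of the operators listed above, a node_config enabling error compensation might look like this (the node names are assumptions):

```json
{
  "node_config": {
    // Compensate the weight input (input1) of a Conv node
    "Conv_0": {"input1": "ec"},
    // Compensate the 0th input of a Resize node
    "Resize_0": {"input0": "ec"}
  }
}
```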

Configuring calibration parameters

The quant_config supports configuring multiple calibration algorithms such as kl and max. For each calibration algorithm, you can also flexibly control its specific hyperparameters (if not configured, default values are used). In addition, some independent calibration features such as per_channel, asymmetric, and bias_correction can be configured.

Attention

If quant_config is not configured, multiple preset calibration algorithms are tried by default, and the calibration algorithm with the minimum quantization loss is selected.

Configuring calibration parameters for activation

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| model_config | activation | calibration_type | String/List[String] | Optional | Calibration algorithms for activation; optional values: kl, max. A List with multiple calibration algorithms is supported. |
| model_config | activation | num_bin | Int/List[Int] | Optional | Parameter of the kl calibration algorithm; requires num_bin > 128; default 1024. A List with multiple num_bin values is supported. |
| model_config | activation | max_num_bin | Int | Optional | Parameter of the kl calibration algorithm; requires max_num_bin >= num_bin; default 16384. |
| model_config | activation | max_percentile | Float/List[Float] | Optional | Parameter of the max calibration algorithm; range [0.5, 1.0]; default 1.0. A List with multiple max_percentile values is supported. |
| model_config | activation | per_channel | Bool/List[Bool] | Optional | Whether per-channel quantization is enabled; optional values: false, true; default false. A List containing both values is supported. |
| model_config | activation | asymmetric | Bool/List[Bool] | Optional | Whether asymmetric quantization is enabled; optional values: false, true; default false. A List containing both values is supported. |
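Putting the activation parameters together, a configuration that tries several calibration candidates at once might look like this (the specific candidate values are illustrative):

```json
{
  "model_config": {
    "activation": {
      // Try both the kl and max calibration algorithms
      "calibration_type": ["kl", "max"],
      // Two num_bin candidates for the kl algorithm
      "num_bin": [1024, 2048],
      // Two max_percentile candidates for the max algorithm
      "max_percentile": [0.99995, 1.0],
      // Try per-channel quantization both enabled and disabled
      "per_channel": [true, false]
    }
  }
}
```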

Configuring calibration parameters for weight

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Quaternary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- | --- | --- |
| model_config | weight | bias_correction | num_sample | Int | Optional | The number of samples for bias correction; requires num_sample >= 1; default 1. |
| model_config | weight | bias_correction | metric | String | Optional | The model error metric for bias correction; optional values: cosine-similarity, mse, mae, mre, sqnr, chebyshev; default cosine-similarity. |
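A minimal bias-correction configuration following the table above (the values shown are illustrative, not recommendations):

```json
{
  "model_config": {
    "weight": {
      "bias_correction": {
        // Use 8 calibration samples for bias correction
        "num_sample": 8,
        // Measure model error with mean squared error
        "metric": "mse"
      }
    }
  }
}
```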

Configuring search methods for calibration parameters

The quant_config supports two search methods with different granularities:

  • modelwise_search: Searches for quantization parameters at the model level. This method allows multiple calibration algorithms to be configured at once; by comparing the quantization loss of the model output before and after quantization using the metric you configure, the calibration algorithm with the minimum quantization loss is selected.

  • layerwise_search: Searches for quantization parameters at the node level. This method calculates the quantization loss of the model output before and after the quantization of each node using the metric you configure, and assigns to each node the calibration algorithm with the minimum quantization loss.

Attention

If multiple calibration algorithms are configured, modelwise search is enabled by default to find the optimal algorithm for the current model; if layerwise search parameters are configured, a layer-by-layer search for the optimal algorithm is initiated.

Configuring the modelwise search method

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| model_config | modelwise_search | metric | String | Optional | The model error metric for modelwise search; optional values: cosine-similarity, mse, mae, mre, sqnr, chebyshev; default cosine-similarity. |
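For example, configuring multiple calibration algorithms together with a modelwise search metric (the metric choice is illustrative) selects the algorithm with the smallest model-level quantization loss:

```json
{
  "model_config": {
    "activation": {
      // Multiple algorithms trigger modelwise search by default
      "calibration_type": ["kl", "max"]
    },
    "modelwise_search": {
      // Compare pre- and post-quantization model outputs using mse
      "metric": "mse"
    }
  }
}
```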

Configuring the layerwise search method

| Primary Parameter | Secondary Parameter | Tertiary Parameter | Parameter Type | Required or Not | Description |
| --- | --- | --- | --- | --- |
| model_config | layerwise_search | metric | String | Optional | The model error metric for layerwise search; optional values: cosine-similarity, mse, mae, mre, sqnr, chebyshev; default cosine-similarity. |
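A minimal layerwise search configuration; per the note above, specifying this parameter is what initiates the layer-by-layer search:

```json
{
  "model_config": {
    "layerwise_search": {
      // Search per node, ranking algorithms by cosine similarity of model outputs
      "metric": "cosine-similarity"
    }
  }
}
```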

Configuration example of the json template

The following is an example JSON template of the quant_config with all the configurable options; you can refer to this template when writing your own configuration.

{
  // Configure model-level parameters
  "model_config": {
    // Configure input data types for all nodes at once
    "all_node_type": "int16"/"float16",
    // Configure the data type of the model output
    "model_output_type": "int8"/"int16",
    // Configure calibration parameters for activation
    "activation": {
      // Configure calibration algorithm for activation, with parameter type: String or List[String]
      "calibration_type": "max"/"kl"/["max", "kl"],
      // Configure the parameter for the kl calibration algorithm, with parameter type: Int or List[Int]
      "num_bin": 1024/2048/[1024, 2048],
      // Configure the parameter for the kl calibration algorithm, with parameter type: Int
      "max_num_bin": 16384,
      // Configure the parameter for the max calibration algorithm, with parameter type: Float or List[Float]
      "max_percentile": 0.99995/1.0/[0.99995, 1.0],
      // Configure whether per-channel quantization is enabled or not, with parameter type: Bool or List[Bool]
      "per_channel": true/false/[true, false],
      // Configure whether asymmetric quantization is enabled or not, with parameter type: Bool or List[Bool]
      "asymmetric": true/false/[true, false]
    },
    // Configure calibration parameters for weight
    "weight": {
      // Configure bias correction
      "bias_correction": {
        // Configure the number of samples for bias correction, with parameter type: Int
        "num_sample": 1,
        // Configure the model error metric for bias correction, with parameter type: String
        "metric": "cosine-similarity"/"mse"/"mae"/"mre"/"sqnr"/"chebyshev"
      }
    },
    // Configure modelwise search; if multiple calibration algorithms are configured, modelwise search
    // will be enabled to select the best one with minimum quantization loss
    "modelwise_search": {
      // Configure the model error metric for modelwise search, with parameter type: String
      "metric": "cosine-similarity"/"mse"/"mae"/"mre"/"sqnr"/"chebyshev"
    },
    // Configure layerwise search; if layerwise search parameters are configured, a layer-by-layer
    // search for the optimal algorithm will be initiated
    "layerwise_search": {
      // Configure the model error metric for layerwise search, with parameter type: String
      "metric": "cosine-similarity"/"mse"/"mae"/"mre"/"sqnr"/"chebyshev"
    }
  },
  // Configure the parameters of a node type; change op_name to the node type name, e.g. "Conv", "Add", "Softmax"...
  "op_config": {
    // Configure the input data type for a certain type of node
    "op_name1": {"qtype": "int8"/"int16"/"float16"},
    "op_name2": {"qtype": "int8"/"int16"/"float16"}
  },
  // Configure parameters for a subgraph; change subgraph_name to your custom subgraph name,
  // and update input_node/output_node with the actual input/output node names of the subgraph, e.g. "Conv_0", "Add_1"...
  "subgraph_config": {
    // Configure the input data type for all nodes within the subgraph
    "subgraph_name1": {
      "inputs": ["input_node1", "input_node2"],
      "outputs": ["output_node1"],
      "qtype": "int8"/"int16"/"float16"
    }
  },
  // Configure the parameters of a node; change node_name to the name of the node, e.g. "Conv_0", "Add_1"...
  "node_config": {
    // Configure the input data type of a node
    "node_name1": {"qtype": "int8"/"int16"/"float16"},
    "node_name2": {"qtype": "int8"/"int16"/"float16"}
  }
}

int16 Configuration

During model conversion, most of the operators in the model are quantized to int8 for computation. By configuring the quant_config parameter, you can specify in detail that the input or output of an op be computed in int16 (for the range of operators that support the int16 configuration, refer to the Toolchain Operator Support Constraint List). The basic principle is as follows.

After you configure an op's input/output data type to int16, the model conversion tool internally updates and checks the int16 configuration of the op's context. For example, configuring the input/output data type of op_1 as int16 implicitly requires that the previous/next op of op_1 also supports computation in int16. For unsupported scenarios, the model conversion tool will print a log indicating that the int16 configuration combination is temporarily unsupported and fall back to int8 computation.
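As a sketch, requesting int16 for a single node (the node name is an assumption) is enough to trigger the context update and check described above; if a neighboring op cannot compute in int16, the tool logs the unsupported combination and that node falls back to int8:

```json
{
  "node_config": {
    // Request int16 here; the producer/consumer ops of this node must also support int16
    "Conv_5": {"qtype": "int16"}
  }
}
```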