QConfig in Detail

Definition and Principle

Definition of QConfig

The qconfig stands for quantization configuration, a key set of parameters in the quantization process of deep learning models. The quantization mode of the model is determined by qconfig, which must be set on the model before preparing the QAT / calibration model.

Attention

Due to historical reasons, there are different definitions and usages of qconfig in the Plugin. Earlier versions of qconfig will be deprecated in the near future; we recommend using only the qconfig usage described in this document.

A qconfig object can set three keywords: input, weight, and output, representing the quantization configuration of the operator's input, weight, and output respectively. When preparing the model, these configurations determine whether FakeQuantize or FakeCast nodes are inserted at the corresponding positions. None means no node is inserted.

import torch

from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.quantization.fake_cast import FakeCast
from horizon_plugin_pytorch.quantization.observer_v2 import MinMaxObserver
from horizon_plugin_pytorch.dtype import qint8

qconfig = QConfig(
    input=None,
    weight=FakeQuantize.with_args(
        observer=MinMaxObserver,
        dtype=qint8,
        qscheme=torch.per_channel_symmetric,
        ch_axis=0,
    ),
    output=FakeCast.with_args(dtype=torch.float16),
    # activation=xxx  # Earlier usage, same as the output keyword.
    # Still compatible, but the output keyword is recommended.
)

Definition of FakeQuantize

FakeQuantize is a fake quantization node that performs quantization and dequantization operations on its input. Inserting fake quantization simulates, in the forward pass of a floating-point model, the errors that quantization introduces. The horizon_plugin_pytorch supports three types of fake quantization: FakeQuantize, PACTFakeQuantize, and _LearnableFakeQuantize. We recommend the statistic-based FakeQuantize. This document does not cover PACTFakeQuantize and _LearnableFakeQuantize; if you need them, read the corresponding papers first.

# statistic-based FakeQuantize
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize

# https://arxiv.org/pdf/1805.06085
from horizon_plugin_pytorch.quantization.pact_fake_quantize import PACTFakeQuantize

# https://arxiv.org/pdf/1902.08153
from horizon_plugin_pytorch.quantization._learnable_fake_quantize import _LearnableFakeQuantize
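Conceptually, what FakeQuantize simulates is a quantize-dequantize round trip. A minimal sketch of the symmetric int8 case using only stock PyTorch (torch.fake_quantize_per_tensor_affine is a standard PyTorch operator, not a Plugin API):

```python
import torch

x = torch.tensor([0.05, 0.5, 1.5])
scale = 1.0 / 128  # symmetric int8 with a threshold of 1.0

# Quantize to int8 and immediately dequantize back to float,
# exposing rounding and clipping errors in the float forward pass.
fq = torch.fake_quantize_per_tensor_affine(x, scale, 0, -128, 127)
# Each value is rounded to the nearest multiple of scale and
# clipped to the range [-128 * scale, 127 * scale].
```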

You can call the with_args method of FakeQuantize to get a constructor and use it to construct qconfig as shown in the previous section. The parameters of with_args include the parameters supported by FakeQuantize and the observer, theoretically allowing configuration of every parameter declared in the __init__ method of the FakeQuantize and observer classes. However, to avoid unnecessary detail, we recommend configuring only the observer-related parameters.

Different observers have different parameters. Below are examples of constructing FakeQuantize with commonly used observers. For the specific usage of other observers, see the calibration section.

import torch

from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.quantization.observer_v2 import (
    MinMaxObserver,
    FixedScaleObserver,
    MSEObserver,
)
from horizon_plugin_pytorch.dtype import qint8

# The __init__ method of MinMaxObserver includes many parameters, all of which
# can be controlled through with_args. We only recommend setting a few of them,
# as in the fq_constructor_1 example.
# def __init__(
#     self,
#     averaging_constant: float = 0.01,
#     ch_axis: int = -1,
#     dtype: Union[torch.dtype, QuantDType] = qint8,
#     qscheme: torch.qscheme = torch.per_tensor_symmetric,
#     quant_min: int = None,
#     quant_max: int = None,
#     is_sync_quantize: bool = False,
#     factory_kwargs: Dict = None,
# ) -> None:
fq_constructor_1 = FakeQuantize.with_args(
    observer=MinMaxObserver,  # Suitable for input/output/weight in qat and weight in calibration.
    averaging_constant=0.01,  # When performing qat after calibration, the averaging_constant of input/output can be set to 0 to fix the scale.
    dtype=qint8,  # Quantization type, set based on the support of the operator.
    qscheme=torch.per_channel_symmetric,  # Only weight supports per-channel quantization.
    ch_axis=0,  # Specify the channel for per-channel quantization.
)

# Similarly, you can check the __init__ method of FixedScaleObserver and
# MSEObserver to learn the configurable parameters.
fq_constructor_2 = FakeQuantize.with_args(
    observer=FixedScaleObserver,  # Fixed scale, will not change under any conditions.
    dtype=qint8,  # Quantization type, set based on the support of the operator.
    scale=INPUT_ABS_MAX / 128,  # Scale value: maximum absolute value divided by the maximum quantized value.
)

fq_constructor_3 = FakeQuantize.with_args(
    observer=MSEObserver,  # Suitable for input/output in calibration.
    dtype=qint8,  # Quantization type, set based on the support of the operator.
)

qconfig = QConfig(
    weight=fq_constructor_x,
    ...
)

Definition of FakeCast

FakeCast is a fake conversion node that converts its input to the float32 data type. If the configured data type is float16, it also simulates the truncation error caused by converting values to float16. This node is mainly used to mark operators that require floating-point computation.
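The truncation error that FakeCast models can be reproduced with a plain round trip through float16 in stock PyTorch (this illustrates the effect only; it is not the Plugin's implementation):

```python
import torch

x = torch.tensor([0.1234567], dtype=torch.float32)

# Casting down to float16 and back keeps the float32 dtype of the
# forward pass but bakes in the float16 rounding error.
simulated = x.to(torch.float16).to(torch.float32)
error = (simulated - x).abs().item()  # nonzero: float16 has only ~10 mantissa bits
```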

The method of using FakeCast to construct qconfig is similar to that of FakeQuantize, but FakeCast takes only one parameter.

import torch

from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_cast import FakeCast

qconfig = QConfig(
    input=FakeCast.with_args(dtype=torch.float16),  # Set based on the support of the operator.
    ...
)

Construct QConfig

There are two methods for you to choose from when constructing QConfig:

  • Construct the QConfig object directly, as introduced above. This method is flexible, allowing the configuration of any configurable parameter, but requires a deep understanding of QConfig.

  • Use the get_qconfig interface. This interface is simpler and easier to use than constructing QConfig objects directly, but it is less flexible and cannot meet advanced requirements.

import torch

from horizon_plugin_pytorch.quantization import get_qconfig
from horizon_plugin_pytorch.quantization.observer_v2 import MinMaxObserver
from horizon_plugin_pytorch.quantization.qconfig import QConfig
from horizon_plugin_pytorch.quantization.fake_quantize import FakeQuantize
from horizon_plugin_pytorch.dtype import qint8

# qconfig_1 / qconfig_2 / qconfig_3 / qconfig_4 are equivalent.
qconfig_1 = QConfig(
    weight=FakeQuantize.with_args(
        observer=MinMaxObserver,
        averaging_constant=0.01,
        dtype=qint8,
        qscheme=torch.per_channel_symmetric,
        ch_axis=0,
    ),
    output=FakeQuantize.with_args(
        observer=MinMaxObserver,
        averaging_constant=0,
        dtype=qint8,
        qscheme=torch.per_tensor_symmetric,
        ch_axis=-1,
    ),
)

qconfig_2 = QConfig(
    weight=FakeQuantize.with_args(
        observer=MinMaxObserver,
        qscheme=torch.per_channel_symmetric,
        ch_axis=0,
    ),
    output=FakeQuantize.with_args(
        observer=MinMaxObserver,
        averaging_constant=0,
    ),
)

qconfig_3 = get_qconfig(
    observer=MinMaxObserver,  # Input and output observer type. Only MinMaxObserver and MSEObserver in horizon_plugin_pytorch.quantization.observer_v2 are supported; default is MinMaxObserver.
    in_dtype=None,  # Input data type, set based on the support of the operator. None means the input keyword of QConfig is None; default is None.
    weight_dtype=qint8,  # Weight data type, set based on the support of the operator. None means the weight keyword of QConfig is None; default is qint8.
    out_dtype=qint8,  # Output data type, set based on the support of the operator. None means the output keyword of QConfig is None; default is qint8.
    fix_scale=True,  # Whether to fix the input and output scales.
)

qconfig_4 = get_qconfig(fix_scale=True)

Set qconfig via QconfigSetter

QconfigSetter automatically sets qconfig according to the specified rules based on the model's computation graph, and it is the method we most recommend for setting qconfig. QconfigSetter depends on the graph mode of the prepare process; usage is as follows:

from horizon_plugin_pytorch.quantization import prepare, PrepareMethod, get_qconfig
from horizon_plugin_pytorch.quantization.qconfig_setter import *

qat_model = prepare(
    model,
    example_inputs=example_inputs,  # The graph mode requires model inputs to obtain the computation graph.
    qconfig_setter=QconfigSetter(
        reference_qconfig=get_qconfig(),  # qconfig used to provide the observer.
        templates=[<Templates>],  # User-configured templates.
        enable_optimize=True,  # Enable all default optimizations.
        save_dir="./qconfig_setting",  # Save path for the qconfig configuration results (qconfig.pt) and changelog.
    ),
    method=PrepareMethod.JIT_STRIP,  # QconfigSetter depends on the computation graph.
)

Template Description

The templates you can configure are as follows:

  1. ModuleNameTemplate (required, needs to cover all quantized operators): Specify dtype configuration or quantization threshold through module name.

  2. ConvDtypeTemplate (required): Specify the input and weight dtype of Conv-type operators.

  3. MatmulDtypeTemplate (required): Specify the input dtype of Matmul operators.

  4. SensitivityTemplate (optional): Configure the top-n operators to high precision according to sensitivity.

  5. LoadFromFileTemplate: Load the qconfig.pt file to reproduce a previous quantization configuration. In this case, enable_optimize must be False; otherwise the correctness of the configuration results cannot be guaranteed, and CPU operators may appear during deployment.

These templates take effect in the order of configuration, so a later template can overwrite the configuration of an earlier one.
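The override order can be pictured as sequential updates to one configuration table, where later templates win. The dict below is purely illustrative, not the Plugin's internal data structure:

```python
# Illustrative only: templates apply in configuration order, so a later
# template overrides an earlier one's setting for the same operator.
resolved = {}
for template_cfg in [
    {"backbone": "qint8", "head": "qint8"},  # earlier template
    {"head": "qint16"},                      # later template wins for "head"
]:
    resolved.update(template_cfg)
# resolved is {"backbone": "qint8", "head": "qint16"}
```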

Detailed Explanation of ModuleNameTemplate

  1. The module name can be an operator name or a prefix. When different module names in a ModuleNameTemplate have an overriding relationship, the longer name has higher priority. For example:

    ModuleNameTemplate(
        {
            "": qint8,  # Global qint8.
            "head": qint16,  # Higher priority than the global configuration; head is finally configured as int16.
            "head.conv0": torch.float16,  # Higher priority than the configuration of head; head.conv0 is finally configured to output float16.
        }
    )
  2. The threshold of the operator can be specified (provided the operator has a corresponding fake quantization node). In this case, the quantization scale is computed as scale = threshold / -qdtype.min. For example:

    ModuleNameTemplate(
        {
            "quant": {"dtype": qint8, "threshold": 1.0},  # The quantization scale of quant is 1.0 / 128.
        }
    )
  3. By default, dtype and threshold are configured on the output of the operator. You can configure the input or weight by specifying the key. When the operator has multiple inputs, None can be used as a placeholder. For example:

    ModuleNameTemplate(
        {
            "conv0": {"input": qint8, "weight": qint16},
            "conv1": {"dtype": {"input": qint16}, "threshold": {"weight": 1.0}},
            # Configure the first input of matmul0 as int16 with a fixed scale of
            # 1.0 / 32768, and leave the second input unconfigured.
            "matmul0": {"dtype": {"input": [qint16, None]}, "threshold": {"input": [1.0, None]}},
        }
    )
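The longest-name-wins rule from point 1 amounts to longest-prefix matching over module names. A hypothetical sketch (the resolve function and its matching rule are illustrative, not the Plugin's code):

```python
def resolve(module_name, configs):
    """Pick the configured name that matches module_name with the longest prefix."""
    matches = [
        name
        for name in configs
        if name == "" or module_name == name or module_name.startswith(name + ".")
    ]
    return configs[max(matches, key=len)] if matches else None

configs = {"": "qint8", "head": "qint16", "head.conv0": "float16"}
# resolve("backbone.conv1", configs) -> "qint8"   (only the global entry matches)
# resolve("head.conv1", configs)     -> "qint16"  ("head" beats "")
# resolve("head.conv0", configs)     -> "float16" (the exact name beats "head")
```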

Scenario Examples

  1. All int8:

    QconfigSetter(
        reference_qconfig=get_qconfig(),
        templates=[
            ModuleNameTemplate({"": qint8}),  # All operators are configured to output int8.
            ConvDtypeTemplate(input_dtype=qint8, weight_dtype=qint8),  # The input and weight of conv are configured as int8.
            MatmulDtypeTemplate(input_dtypes=qint8),  # Both inputs of matmul are configured as int8.
        ],
    )
  2. Feature int16, weight int8:

    QconfigSetter(
        reference_qconfig=get_qconfig(),
        templates=[
            ModuleNameTemplate({"": qint16}),  # All operators are configured to output int16.
            ConvDtypeTemplate(input_dtype=qint16, weight_dtype=qint8),  # The input of conv is configured as int16, the weight as int8.
            MatmulDtypeTemplate(input_dtypes=qint16),  # Both inputs of matmul are configured as int16.
        ],
    )
  3. Gemm operators with double int8, other operators with fp16:

    QconfigSetter(
        reference_qconfig=get_qconfig(),
        templates=[
            ModuleNameTemplate({"": torch.float16}),  # All operators are configured to output fp16.
            ConvDtypeTemplate(input_dtype=qint8, weight_dtype=qint8),  # The input and weight of conv are configured as int8.
            MatmulDtypeTemplate(input_dtypes=qint8),  # Both inputs of matmul are configured as int8.
        ],
    )
  4. Gemm operators with double int8, other operators with int16, and high-sensitivity gemm configured as int16:

    QconfigSetter(
        reference_qconfig=get_qconfig(),
        templates=[
            ModuleNameTemplate({"": qint16}),  # All operators are configured to output int16.
            ConvDtypeTemplate(input_dtype=qint8, weight_dtype=qint8),  # The input and weight of conv are configured as int8.
            MatmulDtypeTemplate(input_dtypes=qint8),  # Both inputs of matmul are configured as int8.
            SensitivityTemplate(  # If a highly sensitive feature or weight is configured as int8, change it to int16.
                sensitive_table=...,
                topk_or_ratio=...,
            ),
        ],
    )

Description of Default Optimization Passes

In addition to the templates you can configure, QconfigSetter also integrates a series of optimization and legalization templates, which are explained in this section.

  1. CanonicalizeTemplate: Legalize dtype configuration according to operator types. The current default rules are:

    • Gemm-type operators do not support float inputs (including weight).

    • Interpolation-type operators: restrictions differ across march versions.

    • Special operators such as DPP and RPP only support int8.

    • General rule for other operators: the input and output dtypes of an operator cannot mix qint and float types.

  2. EqualizeInOutScaleTemplate: For relu, concat, and stack operators, the scale should be based on statistics collected after the operator; otherwise precision or performance may suffer. To this end:

    • Configure the output dtype of the previous operator as float32.

    • When exporting hbir, insert fake quantization at the input of relu, concat, and stack operators, reusing the output scale.

  3. FuseConvAddTemplate: The hardware supports the fusion of conv + add. To this end:

    • Configure the output dtype of conv as float32.

    • Configure the corresponding input dtype of add as float32.

  4. GridHighPrecisionTemplate: According to experience, the grid calculation process of grid sample with qint8 is not precise enough, so the relevant operators are automatically configured to high precision.

  5. InternalQuantsTemplate: In the scenario of segmented model deployment, QuantStub will be inserted at the segmentation points to record the dtype and scale here. The dtype configuration of such QuantStub must be consistent with the input.

  6. OutputHighPrecisionTemplate: When a Gemm-type operator is used as the model output, configure it to output with high precision.

  7. PropagateTemplate: For operators implemented as subgraphs, empirical configurations apply; for example, the small internal operators of LayerNorm and Softmax should use high precision.

  8. SimpleIntPassTemplate: For performance optimization, in computation graphs such as op0->op1->op2, modify the output type of op1 to int if all of the following conditions are met:

    • op2 requires int input.

    • op0 can output int.

    • op1 currently outputs float16 and belongs to the following types:

      • cat, stack.

      • mul_scalar.

      • Lookup table operators without precision risks (that is, operators that use lookup table implementation by default on fp16).

  9. SimplifyTemplate: Delete redundant quantization node configurations (modify the corresponding dtype to None).
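The conditions of SimpleIntPassTemplate (item 8 above) can be summarized as a single predicate. Everything below, including the function and set names, is an illustrative sketch, not a Plugin API:

```python
# Operator types whose float16 output may be switched to int without
# precision risk, per the rules above (the lookup-table set is abbreviated).
SWITCHABLE_OPS = {"cat", "stack", "mul_scalar"}

def should_switch_op1_to_int(
    op0_can_output_int, op1_type, op1_output_dtype, op2_requires_int_input
):
    """For a chain op0 -> op1 -> op2, decide whether op1's output becomes int."""
    return (
        op2_requires_int_input          # op2 requires int input
        and op0_can_output_int          # op0 can output int
        and op1_output_dtype == "float16"  # op1 currently outputs float16
        and op1_type in SWITCHABLE_OPS  # op1 belongs to a safe type
    )
```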