Quantization is an inference acceleration technique that performs computations and stores tensors at bit widths lower than floating-point precision. Quantized computations support forward propagation only.
Quantized models use integers instead of floating-point values for some or all tensor operations.
Compared with a typical FP32 model, a quantized model saves computational resources. Taking INT8 quantization as an example, model size and memory bandwidth requirements are each reduced by a factor of 4, and hardware INT8 computation is typically 2 to 4 times faster than FP32 computation.
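To make the INT8 trade-off concrete, the following is a minimal sketch of generic affine (scale/zero-point) quantization. It is illustrative only and not part of the horizon_plugin_pytorch API; the function names and the scale value are assumptions chosen for the example.

```python
def quantize_int8(x, scale, zero_point=0):
    """Map a float value to a signed 8-bit integer (affine quantization)."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the INT8 range

def dequantize_int8(q, scale, zero_point=0):
    """Map the integer back to an approximate float value."""
    return (q - zero_point) * scale

scale = 0.1          # assumed calibration result for this example
x = 3.14
q = quantize_int8(x, scale)        # 31: one byte instead of four
x_hat = dequantize_int8(q, scale)  # approx. 3.1; the gap is quantization error
```

Each value is stored in 1 byte instead of 4, which is where the 4x reduction in model size and memory bandwidth comes from; the residual `x - x_hat` is the quantization error the training tool must keep small.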
horizon_plugin_pytorch provides BPU-adapted quantization operations and supports quantization-aware training (QAT). QAT uses fake-quantization modules to model quantization errors during forward computation and backpropagation. Note that QAT computations are still performed in floating point. After QAT, horizon_plugin_pytorch provides conversion functions to convert the trained model into a fixed-point model, which uses a more compact representation and high-performance vectorized operations on the BPU.
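The core idea of a fake-quantization module can be sketched as follows. This is a simplified illustration of the general QAT technique, not the horizon_plugin_pytorch implementation; the function name and quantization parameters are assumptions for the example.

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Quantize then immediately dequantize.

    The output stays in floating point (so normal training works), but it
    carries the rounding and clamping error that a real INT8 kernel would
    introduce, letting the model learn to tolerate that error.
    """
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

# 0.123 rounds to the integer grid point 1, i.e. 0.1 after dequantization;
# during backpropagation, QAT frameworks typically pass gradients through
# this op unchanged (the straight-through estimator).
y = fake_quantize(0.123, scale=0.1)
```

Because the fake-quantized values remain floats, the whole QAT forward and backward pass runs on ordinary floating-point hardware, which matches the note above that QAT computation is still floating point.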
This section gives a detailed introduction to horizon_plugin_pytorch, a quantization training tool developed on the basis of PyTorch.
To reduce your learning cost, horizon_plugin_pytorch follows PyTorch's design for quantization-aware training. This document does not repeat content that the PyTorch documentation already covers; to understand the details of the tool, we recommend reading the official PyTorch source code or the Python source code of this tool. To ensure a smooth experience, we recommend first reading the PyTorch documentation and familiarizing yourself with the quantization-aware training and deployment tools provided by PyTorch.
For brevity, code in this documentation uses the following aliases by default:
QAT requires some modification of the floating-point model. Please make sure you have a deep understanding of BPU deployment and are familiar with the structure of the floating-point model and its code. We recommend that the algorithm engineers who developed the floating-point model participate in QAT.