This section introduces concepts that appear frequently in the following chapters, along with some commonly used background knowledge.
Original floating-point model
A model obtained from training with a DL framework such as TensorFlow or PyTorch. Such models are computed with float32 precision.
Board-side deployable model (HBM model)
A model format suitable for running on the Horizon computing platform. It supports model execution on both the ARM CPU and the BPU. Since operators run much faster on the BPU than on the CPU, operators are computed on the BPU whenever possible; operators not yet supported on the BPU are computed on the CPU.
Operator
Deep learning algorithms are composed of computational units, which we call operators (also known as ops). An operator is a mapping from one function space to another. An operator's name is unique within a model, but multiple operators of the same type can exist. For example, Conv1 and Conv2 are two different operators of the same operator type.
Model conversion
The process of converting the original floating-point model, or a standard-compliant ONNX model, into a model deployable on the Horizon board side.
Model quantization
Currently one of the most effective model optimization methods in industry. Quantization establishes a mapping between fixed-point and floating-point data to gain inference performance with little accuracy loss. It can be understood simply as representing FP32 (or other) values with "low-bit" numbers; for example, FP32 → INT8 yields 4x parameter compression, faster computation, and reduced memory usage.
The Quantize node is used to quantize the model's input data from float to int8, using the following formula:

qx = clamp(round(x / scale) + zero_point, -128, 127)

where:

- round(x) rounds the floating-point number to the nearest integer.
- clamp(x) clamps the value to an integer between -128 and 127.
- scale is the quantization scale factor.
- zero_point is the asymmetric quantization zero-point offset; in symmetric quantization, zero_point = 0.

The C++ reference implementation is as follows:
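Since the original reference code is not reproduced here, the following is a minimal C++ sketch consistent with the formula above. The function name and per-tensor parameter scope are illustrative assumptions, not the toolchain's actual API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize float data to int8:
//   q = clamp(round(x / scale) + zero_point, -128, 127)
// Assumes a single per-tensor scale and zero_point.
std::vector<int8_t> quantize(const std::vector<float>& input,
                             float scale, int zero_point) {
    std::vector<int8_t> output(input.size());
    for (size_t i = 0; i < input.size(); ++i) {
        int q = static_cast<int>(std::round(input[i] / scale)) + zero_point;
        output[i] = static_cast<int8_t>(std::clamp(q, -128, 127));
    }
    return output;
}
```

With symmetric quantization (zero_point = 0), values outside the representable range saturate at -128 or 127 rather than wrapping around.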
The Dequantize node is used to dequantize the model's output data from the int8 or int32 type back to the float or double type, using the following formula:

x = (qx - zero_point) * scale
The C++ reference implementation is as follows:
PTQ
The PTQ (Post-Training Quantization) conversion scheme: a quantization method that first trains a floating-point model, then uses calibration images to compute quantization parameters and convert the floating-point model into a quantized model. For more details, refer to the PTQ and QAT Introduction section.
QAT
The QAT (Quantization-Aware Training) scheme intervenes in the floating-point model structure during floating-point training so that the model perceives the loss introduced by quantization, reducing the accuracy loss from quantization. For more details, refer to the PTQ and QAT Introduction section.
Tensor
A tensor is a multidimensional array with a uniform data type. As the container for the data computed by operators, it holds their input and output data. A tensor also carries the specific information of that data: its name, shape, data layout, data type, etc.
Data layout
In deep learning, multidimensional data is stored in multidimensional arrays (tensors), and generic neural-network feature maps are usually stored in a four-dimensional (4D) format, i.e., with the following four dimensions:

- N: batch size
- C: number of channels
- H: height
- W: width

However, data can only be stored linearly, so the four dimensions must follow some order, and different layout formats affect computational performance. The common data storage formats are NCHW and NHWC:
As shown below:
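To make the difference concrete, here is a small illustrative sketch (not from the toolchain) of how the same element (n, c, h, w) maps to a linear memory offset under each layout:

```cpp
#include <cstddef>

// Linear offset of element (n, c, h, w) in a 4D tensor.
// N = batch, C = channels, H = height, W = width.

// NCHW: all values of one channel are stored contiguously.
size_t offset_nchw(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// NHWC: all channels of one pixel are stored contiguously.
size_t offset_nhwc(size_t n, size_t c, size_t h, size_t w,
                   size_t C, size_t H, size_t W) {
    return ((n * H + h) * W + w) * C + c;
}
```

For example, in NHWC the channels of a single pixel sit next to each other in memory, which suits per-pixel operations, while NCHW keeps each whole channel plane contiguous, which suits per-channel operations.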
Data type
Commonly used image data types in the following include rgb, bgr, gray, yuv444, nv12, and featuremap.
Batch, Batch Size
In the model training process, a set of training samples used in each iteration is called a batch. The batch size refers to the number of samples that the model processes in each iteration.
Cosine similarity
One of the accuracy comparison algorithms. The result lies in the range [-1, 1]: the closer the result is to 1, the more similar the two sets of values; the closer to -1, the more opposite they are.
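The metric itself is the standard cosine similarity, dot(a, b) / (|a| * |b|); a minimal sketch (illustrative, not the toolchain's implementation) looks like this:

```cpp
#include <cmath>
#include <vector>

// Cosine similarity of two equal-length vectors:
//   dot(a, b) / (|a| * |b|)
// Result lies in [-1, 1]; closer to 1 means more similar.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

In accuracy comparison, the two vectors would be the flattened outputs of, e.g., the floating-point model and the quantized model for the same input.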
Stride
Stride is the actual size of the space each row of an image occupies when stored in memory. Most processors work with 32-bit or 64-bit words and read data most efficiently when row sizes are multiples of 4 or 8 bytes; other sizes require extra handling, which reduces efficiency. To let the computer process images efficiently, extra bytes are therefore commonly appended to each row to achieve 4-byte or 8-byte alignment. This alignment operation is also called padding; the actual alignment rules depend on the specific hardware and software system.

Suppose we have an 8-bit grayscale image with a height of 20 pixels and a width of 30 pixels. Each row then holds 30 bytes of valid data; if the computer's alignment rule is 8 bytes, the stride after alignment is 32 bytes, and each row needs 2 bytes of padding.
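The worked example above amounts to rounding the row size up to the alignment; a minimal sketch (function name is illustrative):

```cpp
#include <cstddef>

// Stride of one image row: the row's byte size rounded up
// to the given alignment.
//   stride = ceil(row_bytes / align) * align
size_t row_stride(size_t width, size_t bytes_per_pixel, size_t align) {
    size_t row_bytes = width * bytes_per_pixel;
    return (row_bytes + align - 1) / align * align;
}
```

For the example above (width 30, 1 byte per pixel, 8-byte alignment), this yields a stride of 32 bytes, i.e., 2 bytes of padding per row.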
Calibration dataset
The dataset used for forward inference in the PTQ scenario. Its distribution should represent the distribution of the full dataset, so the calibration set must be chosen to be representative. If the dataset does not match the model's data, or is not representative enough, the quantization factors computed from it will perform poorly on the full dataset, yielding high quantization loss and low post-quantization accuracy.
BPU Architecture to Computing Platform Mapping
| Computing Platform | S100 | S100P |
|---|---|---|
| BPU Architecture | nash-e | nash-m |
For more information about abbreviations in the documents, please refer to the section Common Abbreviations.