Efficient Model Design Advice

Overview

To meet the stringent performance requirements of highly complex smart-driving scenarios, we have launched S100, an efficient development platform designed for quantized model deployment. As an ASIC, the BPU contains a variety of specialized hardware computing units, each of which trades some flexibility for higher performance. To fully exploit the BPU's computational advantages, you therefore need to understand its characteristics and optimize your model accordingly, so that it better fits the hardware. The following sections describe recommendations for designing efficient models on the S100 computing platform.

Warning

The optimization ideas below are guidelines and suggestions only; please experiment according to your actual situation.

General Recommendations

Use BPU Ops to Build a Model

Operators run much faster on the BPU than on the CPU, but heterogeneous scheduling between the CPU and BPU introduces quantization and dequantization nodes. The computation of these nodes takes time proportional to the tensor's shape, because the data must be traversed.

  • For quantization and dequantization operators: we recommend manually extracting and merging the relevant calculations into the pre- and post-processing code, which saves a redundant, time-consuming traversal of the data. In model post-processing, you can also consider performing screening and filtering first, and dequantizing only the remaining data, which further reduces latency.

  • For CPU operators in the model: check the supported public operators against the operator constraints list, and replace CPU operators with supported ones, or compose equivalent operators from them, so that the model runs entirely on the BPU as much as possible.
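The filter-before-dequantize idea above can be sketched in plain Python. This is a minimal illustration, not toolchain code: the scale value, the threshold, and the per-tensor quantization scheme are all assumptions for the example.

```python
# Sketch: dequantize only the outputs that survive score filtering,
# instead of dequantizing the full output tensor first.
# SCALE and SCORE_THRESHOLD are hypothetical values.

SCALE = 0.05           # assumed per-tensor dequantization scale
SCORE_THRESHOLD = 0.5  # filtering threshold in the float domain

def postprocess(quantized_scores):
    """quantized_scores: raw integer model outputs (one score per box)."""
    # Move the threshold into the integer domain so filtering needs
    # no dequantization at all.
    int_threshold = SCORE_THRESHOLD / SCALE
    kept = [(i, q) for i, q in enumerate(quantized_scores) if q > int_threshold]
    # Dequantize only the survivors.
    return [(i, q * SCALE) for i, q in kept]

print(postprocess([2, 15, 7, 30]))  # keeps indices 1 and 3 only
```

Because filtering typically discards most candidates, dequantizing after filtering touches far fewer elements than dequantizing the whole tensor up front.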

Use BPU-Efficient Backbones

Based on the hardware characteristics of S100, we provide the efficient model HENet (Hybrid Efficient Network). Balancing performance against accuracy, we provide two variants, TinyE and TinyM. On the ImageNet dataset, HENet's accuracy is no lower than that of Resnet50, while its throughput is more than doubled.

Below is the performance of several models including HENet on the classification task:

Note

The data comes from OE package v3.0.22; the models are from OE/samples/ai_toolchain/horizon_model_train_sample. There may be slight differences between versions; the data is for reference only.

| Model | Input Size | S100 Frame Rate (FPS) | Floating-Point Accuracy | Quantized Accuracy | Dataset |
| --- | --- | --- | --- | --- | --- |
| Resnet18 | 1x3x224x224 | 2447 | 71.69 | 71.64 | ImageNet |
| Resnet18 | 1x3x704x1280 | 275 | / | / | ImageNet |
| Resnet50 | 1x3x224x224 | 1101 | 77.03 | 76.78 | ImageNet |
| Resnet50 | 1x3x704x1280 | 118 | / | / | ImageNet |
| Efficientnet-b0 | 1x3x224x224 | 4270 | 74.31 | 73.85 | ImageNet |
| Efficientnet-b0 | 1x3x704x1280 | 501 | / | / | ImageNet |
| Mobilenetv2 | 1x3x224x224 | 4652 | 72.17 | 71.47 | ImageNet |
| Mobilenetv2 | 1x3x704x1280 | 732 | / | / | ImageNet |
| Mixvargenet | 1x3x224x224 | 4947 | 70.75 | 70.63 | ImageNet |
| Mixvargenet | 1x3x704x1280 | 436 | / | / | ImageNet |
| Vargnetv2 | 1x3x224x224 | 3822 | 73.42 | 73.17 | ImageNet |
| Vargnetv2 | 1x3x704x1280 | 667 | / | / | ImageNet |
| HENet_TinyE | 1x3x224x224 | 2886 | 77.67 | 76.92 | ImageNet |
| HENet_TinyE | 1x3x704x1280 | 518 | / | / | ImageNet |
| HENet_TinyM | 1x3x224x224 | 2638 | 78.38 | 77.62 | ImageNet |
| HENet_TinyM | 1x3x704x1280 | 412 | / | / | ImageNet |
Tips

For scenarios where the input and output resolutions are similar, or when the network serves as the main backbone, we recommend using the native TinyM / TinyE structures directly. For other scenarios, we suggest referring to TinyM's basic block structure and building the model flexibly from it.

Fusion of Fusible Operators

In the quantization phase, operator fusion (fuse) is performed on certain layer patterns of the model structure. The BPU hardware is optimized for the basic structures of common models: operator combinations such as Conv -> Add -> ReLU are treated as a whole and merged into a single Module. (The operator combinations that support fusion are detailed in the Quantization-Aware Training (QAT) - Advanced Tutorial - Operator Fusion section.)

Operator fusion not only preserves the high-accuracy state of the intermediate results, but also eliminates the process of converting the intermediate results to a low-accuracy representation. Typically, all fusible parts should be fused.

We recommend using jit mode, which performs operator fusion automatically (pay attention to the impact of shared operators on fusion). If quantization accuracy is a concern, check the quantization support for the individual operators and the model's performance after splitting them apart. Fuse as much as possible while maintaining quantization accuracy; leaving many operators unfused can noticeably degrade performance.

| Model | S100 Performance, Unfused (FPS) | S100 Performance, Fused (FPS) |
| --- | --- | --- |
| fcos_efficientnetb0_mscoco | 852.26 | 1206.89 |
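Why fusion also preserves accuracy can be illustrated with a toy example. This is a conceptual sketch in plain Python, not BPU behavior: in an unfused quantized graph, the intermediate result is snapped to the low-precision grid between every operator, while a fused kernel keeps the intermediate in high precision and quantizes only once at the end. The scale value is hypothetical.

```python
# Toy illustration: fused Conv -> Add -> ReLU rounds once; unfused rounds
# after every operator, accumulating quantization error.

SCALE = 0.1  # hypothetical quantization scale

def quantize(x, scale=SCALE):
    # Snap a float to the low-precision grid.
    return round(x / scale) * scale

def unfused(conv_out, residual):
    x = quantize(conv_out)        # requantize after Conv
    x = quantize(x + residual)    # requantize after Add
    return quantize(max(x, 0.0))  # requantize after ReLU

def fused(conv_out, residual):
    # One kernel, one rounding step at the very end.
    return quantize(max(conv_out + residual, 0.0))

print(unfused(0.14, 0.04))  # 0.14 snaps to 0.1 early; Add result snaps to 0.1
print(fused(0.14, 0.04))    # 0.18 computed in high precision, snaps to 0.2
```

Besides the accuracy effect, fusion removes the intermediate write-out and read-back of each operator's result, which is where the FPS gain in the table above comes from.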

Follow the Hardware Alignment Rules

Computation on the S100 platform has a minimum alignment unit; if a tensor does not satisfy the rule, it is automatically padded, wasting computing power on invalid data. Alignment rules vary between operators; the following suggestions apply to the general case:

  • General tensors: align each dimension's shape to a power of 2.
  • Tensors used by Conv-like operators: align H to 8, W to 16, and C to 32.

When designing the model, try to place the large extents in the dimensions with larger alignment units, and avoid shapes where the C dimension is very large while H and W are very small: H and W will be padded, multiplying the amount of data.
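The padding overhead is easy to estimate. The sketch below assumes the Conv-like alignment units stated above (H to 8, W to 16, C to 32) and computes how much larger the padded tensor is than the logical one:

```python
# Sketch: estimate tensor inflation under the assumed Conv alignment
# rules on S100 (H aligned to 8, W aligned to 16, C aligned to 32).

def align(x, unit):
    # Round x up to the next multiple of unit.
    return ((x + unit - 1) // unit) * unit

def padded_ratio(h, w, c):
    # Padded element count divided by logical element count.
    padded = align(h, 8) * align(w, 16) * align(c, 32)
    return padded / (h * w * c)

# A well-shaped feature map pads nothing:
print(padded_ratio(32, 64, 64))  # 1.0
# Large C with tiny H/W is heavily padded:
print(padded_ratio(1, 1, 4096))  # 128.0 (8 * 16)
```

The second case is exactly the shape the text warns against: C needs no padding, but padding H to 8 and W to 16 multiplies the data by 128x.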

Improve the Computation/Visit Ratio of the Model

An operator's runtime consists of memory-access time and computation time. To improve the computation-to-access ratio, the access time must be reduced, and the most common cause of increased access time is data-movement operations. Here are some optimization suggestions for reducing data-movement cost and improving the model's computation-to-access ratio:

  1. Relieve bandwidth pressure with GroupConv. When the model's channel count is small, ordinary dense convolution can exploit the BPU's powerful compute. As the model deepens, the channel count gradually grows; when it is large, we recommend using GroupConv to relieve bandwidth pressure.

  2. Keep feature maps between blocks small, which reduces data movement between DDR and SRAM.

  3. Avoid an excessive number of weight parameters. An overly large kernel_size or channel dimension increases the number of weights, causing more frequent data movement. For high channel dimensions, group convolution can be used to reduce the parameter count.

  4. Shortcuts should not be too long. A shortcut that is too long keeps the feature map resident in SRAM for too long, and may even cause the data to be spilled to DDR.

  5. Batch small inputs. Because of the S100's large computing power, it performs better on large models (720P and above). For small models (resolution <= 256), we suggest batched inference, which reduces the number of weight loads and thus achieves a better computation-to-access balance.

  6. Reduce data-layout transform operations. The BPU is a tensor processor whose data layout is a multidimensional representation, whereas the semantics of Reshape and Transpose are based on the linear memory layout of CPU/GPU. Layout transforms on the BPU therefore introduce some overhead. When reshaping, minimize the number of dimensions that change; the more dimensions change, the less efficient the computation.
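The weight savings behind suggestions 1 and 3 follow from the standard grouped-convolution parameter formula. The sketch below uses illustrative channel counts (512 in/out, 3x3 kernel, 8 groups); it is a back-of-the-envelope calculation, not BPU-specific code:

```python
# Sketch: weight count of a dense Conv2d vs. a grouped Conv2d with the
# same in/out channels and kernel size. Grouping divides the weights
# (and thus the weight-loading traffic) by the group count.

def conv_params(c_in, c_out, k, groups=1):
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel sees only c_in / groups input channels.
    return c_out * (c_in // groups) * k * k

dense = conv_params(512, 512, 3)              # 2359296 weights
grouped = conv_params(512, 512, 3, groups=8)  # 294912 weights
print(dense, grouped, dense // grouped)       # grouped is 8x smaller
```

For small channel counts the savings are not worth the reduced cross-channel mixing, which matches the advice above: dense convolution while channels are small, GroupConv once they grow large.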