When deploying a model, discrepancies may arise between the accuracy of QAT models and HBM models. This chapter explains how to avoid such issues under the QAT strategy, and how to diagnose and resolve them when they occur.
Bit-level consistency between training and deployment is not achievable with fake-quantization training; a small number of cases will inevitably fail to align perfectly. Use dataset accuracy as the criterion for determining whether a consistency issue exists.
Training-deployment consistency issues fall into two categories:
- User-side issues, e.g. mismatched pre/post-processing or model versions.
- Tool-side issues, i.e. inconsistencies introduced during the export / convert / compile processes.
Regardless of the issue type, you will need to use the consistency debugging tools. User-side issues should be resolved on your own; for tool-side issues, provide the debugging artifacts to Horizon's technical support team for analysis and resolution.
During the troubleshooting of consistency issues, the following models are involved:
| Model | Description | How to Obtain |
|---|---|---|
| qat.pt | Torch QAT model. | Use the prepare API on a float model. |
| qat.export.pt | Torch QAT export model. Applies non-equivalent replacements to qat.pt. The computation logic of qat.export.pt is completely consistent with that of qat.bc. | Use the pre_export API on qat.pt. |
| qat.bc | HBIR model generated during export. | Use the export API on qat.pt. |
| quantized.bc | HBIR model generated during conversion. | Use the convert API on qat.bc. |
| hbm | Deployment model generated during compilation. | Use the compile API on quantized.bc. |
pre_export usage:
The high-consistency strategy is encapsulated in horizon_plugin_pytorch.qat_mode.ConsistencyStrategy. You can use set_consistency_level to configure it.
There are five strategy levels (0–4). Higher levels yield better consistency but may slightly impact QAT accuracy. Level 2 is recommended, as it usually improves accuracy and consistency with negligible performance loss.
If a QAT model was trained without a high-consistency strategy and retraining is not an option, set the level to 0 before prepare (levels 1–4 require retraining).
High-consistency QAT requires:
- hbdk >= 4.4.2
- plugin >= 2.7.1
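A hedged sketch of what the configuration might look like. The import paths and the `set_consistency_level` signature are assumptions, not taken from this chapter; check the API reference of your plugin version:

```python
# Hypothetical sketch -- import paths and signatures are assumed,
# not documented in this chapter.
from horizon_plugin_pytorch import set_consistency_level  # assumed location
from horizon_plugin_pytorch.quantization import prepare   # assumed location

# Configure the high-consistency strategy BEFORE calling prepare.
# Level 2 is the recommended trade-off; level 0 is the only level
# usable without retraining an existing QAT model.
set_consistency_level(2)

qat_model = prepare(float_model)  # float_model: your trained float model
```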

The consistency issue debugging process is as follows:
1. Build a large dataset. The dataset should include at least 1000 frames, show stable evaluation accuracy, and reproduce the inconsistency between the HBM/quantized.bc models and the Torch QAT model.
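Whether the dataset actually reproduces the inconsistency can be checked by comparing dataset-level accuracies. A minimal sketch with stub predictors; all names here (`accuracy`, `reproduces_inconsistency`, the predictors) are hypothetical, not part of the Horizon toolchain:

```python
# Sketch of a dataset-level reproduction check (hypothetical helpers).

def accuracy(predict, dataset):
    """Fraction of frames whose prediction matches the label."""
    hits = sum(1 for frame, label in dataset if predict(frame) == label)
    return hits / len(dataset)

def reproduces_inconsistency(acc_qat, acc_deploy, tol=0.01):
    """The dataset is usable for consistency debugging only if the
    accuracy gap between the Torch QAT model and the deployed model
    exceeds the noise level of a stable evaluation (assumed 1% here)."""
    return abs(acc_qat - acc_deploy) > tol

# Stub predictors standing in for qat.pt / HBM inference:
dataset = [(i, i % 10) for i in range(1000)]
qat_predict = lambda x: x % 10                         # always correct
hbm_predict = lambda x: -1 if x % 100 < 3 else x % 10  # wrong on 3% of frames

print(reproduces_inconsistency(accuracy(qat_predict, dataset),
                               accuracy(hbm_predict, dataset)))  # True
```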
2. Develop inference code for the bc model.
3. User-side issue troubleshooting. Get the accuracies of qat.export.pt and qat.bc on the small dataset on CPU.
3.1 If the accuracy is consistent, proceed to tool-side issue investigation.
3.2 If the accuracy is inconsistent, disable fake quantization and re-verify the accuracy of qat.export.pt and qat.bc on the small dataset.
3.2.1 If the accuracy is now consistent, check whether the fake quantization and observer states before disabling were aligned with the model's validation state.
3.2.2 If accuracy is inconsistent, run statistical analysis separately on qat.export.pt and qat.bc, and perform per-layer comparison. (Note: You need to ensure that layer names are aligned and the same input is used for both models.)
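The per-layer comparison in 3.2.2 can be sketched as follows. This is a generic illustration, not the toolchain's built-in comparison tool; it assumes you have already captured each layer's output for both models on the same input (e.g. via forward hooks on the torch side) into dicts keyed by aligned layer names:

```python
import numpy as np

def per_layer_compare(outputs_a, outputs_b):
    """Compare per-layer outputs of two models run on the SAME input.
    outputs_a / outputs_b: dict mapping (aligned) layer name -> np.ndarray.
    Returns {layer: (max_abs_diff, cosine_similarity)}, worst layer first."""
    assert outputs_a.keys() == outputs_b.keys(), "layer names must be aligned"
    report = {}
    for name in outputs_a:
        a = outputs_a[name].astype(np.float64).ravel()
        b = outputs_b[name].astype(np.float64).ravel()
        max_diff = float(np.max(np.abs(a - b)))
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cos = float(a @ b / denom) if denom else 1.0
        report[name] = (max_diff, cos)
    # Layers with the largest deviation are the first suspects.
    return dict(sorted(report.items(), key=lambda kv: -kv[1][0]))

# Usage (hypothetical): report = per_layer_compare(qat_export_outs, qat_bc_outs)
```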
4. Tool-side issue troubleshooting.
4.1 Get the accuracy of quantized.bc on the large dataset.
4.2 If there is an accuracy inconsistency between quantized.bc and qat.pt, then evaluate qat.export.pt on the large dataset.
4.2.1 If qat.export.pt also shows accuracy inconsistency compared to qat.pt, the issue lies in the export stage. Perform sensitivity analysis and per-layer comparison between qat.pt and qat.export.pt. If the issue cannot be located using conventional methods, use the pre_export interface to localize it.
4.2.2 If qat.export.pt and qat.pt show consistent accuracy, then the issue lies in the convert stage. Perform bad case detection and per-layer comparison between qat.bc and quantized.bc, and run sensitivity analysis on qat.export.pt (reusing the bad cases identified between qat.bc and quantized.bc).
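The bad-case detection in 4.2.2 can be sketched as ranking frames by how much a per-frame metric diverges between qat.bc and quantized.bc; the top-ranked frames are then reused for the sensitivity analysis on qat.export.pt. A generic illustration (the helper name and input format are assumptions):

```python
def find_bad_cases(metrics_qat_bc, metrics_quantized_bc, top_k=10):
    """Rank frames by the divergence of a per-frame metric (e.g. per-frame
    loss or detection score) between qat.bc and quantized.bc.
    Inputs: dict mapping frame id -> metric value, same keys in both.
    Returns the top_k frame ids with the largest divergence."""
    assert metrics_qat_bc.keys() == metrics_quantized_bc.keys()
    diffs = {f: abs(metrics_qat_bc[f] - metrics_quantized_bc[f])
             for f in metrics_qat_bc}
    return sorted(diffs, key=diffs.get, reverse=True)[:top_k]
```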
When conventional methods fail to locate the issue, use the BC Editor tool to partially convert quantized.bc back to CPU for debugging.
Compare quantized.bc before and after editing to identify which ops, when reverted to CPU, improve consistency.
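The search over which ops to revert can be organized as a bisection rather than reverting ops one by one. A generic sketch (the BC Editor API itself is not shown here; `is_consistent` stands for "edit the model, rerun the evaluation, and compare"):

```python
def first_consistent_prefix(num_ops, is_consistent):
    """Binary search for the problematic op.
    is_consistent(k): True if the model becomes consistent when the first
    k ops of quantized.bc are reverted to CPU (assumed monotone in k, and
    True when ALL ops are reverted). Returns the smallest such k; op k-1
    is then the prime suspect."""
    lo, hi = 0, num_ops
    while lo < hi:
        mid = (lo + hi) // 2
        if is_consistent(mid):
            hi = mid  # a smaller prefix may already suffice
        else:
            lo = mid + 1  # need to revert more ops
    return lo
```

Each probe costs one edit-and-evaluate cycle, so the bisection needs roughly log2(num_ops) evaluations instead of num_ops.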