Training-Deployment Consistency

When deploying a model, discrepancies may arise between the accuracy of the QAT model and the deployed HBM model. This chapter explains how to avoid such issues using the high-consistency QAT strategy, and how to diagnose and resolve them when they occur.

Attention

Bit-level consistency between training and deployment is not achievable with fake-quantization training: a small number of cases will inevitably fail to align perfectly. Please use accuracy on a dataset, rather than per-output comparison, as the basis for determining whether a consistency issue exists.
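This guidance can be expressed as a simple decision rule. The sketch below is illustrative and not part of the toolchain: `has_consistency_issue` and the `rel_tol` threshold are hypothetical names, and the right tolerance depends on your task and metric.

```python
def has_consistency_issue(qat_accuracy, deploy_accuracy, rel_tol=0.01):
    """Flag a consistency issue only when the deployed model's dataset
    accuracy drops noticeably relative to the QAT model, rather than
    comparing outputs bit by bit.

    rel_tol is an illustrative threshold; pick one appropriate for
    your task and metric.
    """
    if qat_accuracy == 0:
        return deploy_accuracy != 0
    return (qat_accuracy - deploy_accuracy) / qat_accuracy > rel_tol

# Example: a 0.05-point drop on a 0.75 mAP model (~6.7% relative) is flagged,
# while a 0.002-point fluctuation is treated as normal fake-quant noise.
print(has_consistency_issue(0.75, 0.70))   # True
print(has_consistency_issue(0.75, 0.748))  # False
```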

Training-deployment consistency issues fall into two categories:

  1. User-side issues, e.g., mismatched pre/post-processing or model versions.

  2. Tool-side issues, e.g., inconsistencies introduced during the export/convert/compile processes.

Regardless of the issue type, you will need the consistency debugging tools. User-side issues should be resolved on your own; for tool-side issues, provide the debugging artifacts to Horizon's technical support team for analysis and resolution.

During the troubleshooting of consistency issues, the following models are involved:

| Model | Description | How to Obtain |
| --- | --- | --- |
| qat.pt | Torch QAT model. | Use the `prepare` API on a float model. |
| qat.export.pt | Torch QAT export model. Applies non-equivalent replacements to qat.pt; its computation logic is completely consistent with that of qat.bc. | Use the `pre_export` API on qat.pt. |
| qat.bc | HBIR model generated during export. | Use the `export` API on qat.pt. |
| quantized.bc | HBIR model generated during conversion. | Use the `convert` API on qat.bc. |
| hbm | Deployment model generated during compilation. | Use the `compile` API on quantized.bc. |

pre_export usage:

```python
from horizon_plugin_pytorch.quantization.hbdk4 import pre_export

qat_export_pt = pre_export(qat_pt)
```

High-Consistency QAT Strategy (Beta Feature)

The high-consistency strategy is encapsulated in horizon_plugin_pytorch.qat_mode.ConsistencyStrategy. You can use set_consistency_level to configure it.

There are five strategy levels (0–4). Higher levels yield better consistency but may slightly impact QAT accuracy. Level 2 is recommended, as it usually improves accuracy and consistency with negligible performance loss.

If a QAT model was trained without a high-consistency strategy and retraining is not an option, set the level to 0 before prepare (levels 1–4 require retraining).

```python
from horizon_plugin_pytorch.qat_mode import ConsistencyStrategy

# Must be set before prepare.
ConsistencyStrategy.set_consistency_level(2)
...
qat_pt = prepare(float_model)
...
qat_bc = export(qat_pt, example_inputs)

# If using level 0, verify with:
print(qat_bc._high_precision_qpp)  # Should be true
print(qat_bc._fuse_requantize)  # Should be false

quantized_bc = convert(qat_bc, march)
```
Attention

High-consistency QAT requires:

- hbdk >= 4.4.2

- plugin >= 2.7.1

Consistency Debugging Workflow

The consistency issue debugging process is as follows:

  1. Build a large dataset. The dataset should include at least 1000 frames, show stable evaluation accuracy, and reproduce the accuracy inconsistency between the HBM/quantized.bc and the Torch QAT model.
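A quick way to check the "stable evaluation accuracy" requirement is to verify that the metric barely changes across random halves of the candidate dataset. The sketch below is a hypothetical helper, not a toolchain API; `evaluate` stands in for your own evaluation pipeline and `tol` for your own tolerance.

```python
import random

def accuracy_is_stable(frames, evaluate, model, n_splits=5, tol=0.005, seed=0):
    """Return True if accuracy computed on random halves of the dataset
    agrees within `tol`, suggesting the dataset is large enough for
    consistency debugging. `evaluate(model, frames)` is a placeholder
    for your own evaluation function."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        half = rng.sample(frames, len(frames) // 2)
        scores.append(evaluate(model, half))
    return max(scores) - min(scores) <= tol
```

If the metric swings more than your tolerance between halves, grow the dataset before drawing conclusions about training-deployment consistency.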

  2. Develop inference code for the bc model.

  3. User-side issue troubleshooting. Get the accuracies of qat.export.pt and qat.bc on a small dataset using the CPU.

3.1 If the accuracies are consistent, proceed to tool-side issue investigation.

3.2 If the accuracies are inconsistent, disable fake quantization and re-verify the accuracy of qat.export.pt and qat.bc on the small dataset.

```python
from horizon_plugin_pytorch.quantization import FakeQuantState, set_fake_quantize
from horizon_plugin_pytorch.quantization.hbdk4 import export, pre_export

qat_pt.eval()
set_fake_quantize(qat_pt, FakeQuantState._FLOAT)
qat_bc = export(qat_pt, example_inputs)
qat_export_pt = pre_export(qat_pt)
```

3.2.1 If the accuracy is now consistent, check whether the fake quantization and observer states before disabling were aligned with the model's validation state.

```python
# fake_quant_enabled should be True, observer_enabled should be False.
print(qat_pt)
```

```text
GraphModuleImpl(
  (quant): QuantStub(
    (activation_post_process): FakeQuantize(
      dtype=qint8, fake_quant_enabled=True, observer_enabled=False,
      qscheme=torch.per_tensor_symmetric, ch_axis=-1,
      scale=tensor([0.1250]), zero_point=tensor([0])
      (activation_post_process): MinMaxObserver(min_val=-10.0, max_val=9.999998092651367, averaging_constant=1)
    )
  )
  ...
)
```

3.2.2 If accuracy is inconsistent, run statistical analysis separately on qat.export.pt and qat.bc, and perform per-layer comparison. (Note: You need to ensure that layer names are aligned and the same input is used for both models.)

```python
from horizon_plugin_profiler import QuantAnalysis

qa = QuantAnalysis(qat_export_pt, qat_bc, "export")

# If Torch and BC models accept the same input format, run analysis together.
qa.set_bad_case(badcase)
qa.run()

# If Torch and BC models require different input formats, run separately.
# Ensure pt_badcase and bc_badcase are identical in content except for format.
qa.set_bad_case(pt_badcase)
qa.run(run_baseline_model=True, run_analysis_model=False)
qa.set_bad_case(bc_badcase)
qa.run(run_baseline_model=False, run_analysis_model=True)

# Per-layer comparison
qa.compare_per_layer()
```
  4. Tool-side issue troubleshooting.

4.1 Get the accuracy of quantized.bc on the large dataset.

4.2 If there is an accuracy inconsistency between quantized.bc and qat.pt, then evaluate qat.export.pt on the large dataset.

4.2.1 If qat.export.pt also shows accuracy inconsistency compared to qat.pt, the issue lies in the export stage. Perform sensitivity analysis and per-layer comparison between qat.pt and qat.export.pt. If the issue cannot be located using conventional methods, use the pre_export interface to localize it.

```python
from horizon_plugin_profiler import QuantAnalysis
from horizon_plugin_pytorch.quantization.hbdk4 import pre_export

# Run sensitivity analysis and per-layer comparison between qat.pt and qat.export.pt.
qa = QuantAnalysis(qat_pt, qat_export_pt, "pre_export")
qa.auto_find_bad_case(dataloader)
qa.run()
qa.compare_per_layer()
qa.sensitivity()

# Use pre_export to localize the issue to a submodule.
qat_pt.module_a = pre_export(qat_pt.module_a)
```

4.2.2 If qat.export.pt and qat.pt show consistent accuracy, then the issue lies in the convert stage. Perform bad case detection and per-layer comparison between qat.bc and quantized.bc, and run sensitivity analysis on qat.export.pt (reusing the bad cases identified between qat.bc and quantized.bc).

```python
from horizon_plugin_profiler import QuantAnalysis

# Run bad case search and per-layer comparison between qat.bc and quantized.bc.
qa = QuantAnalysis(qat_bc, quantized_bc, "convert")
qa.auto_find_bad_case(dataloader)
qa.run()
qa.compare_per_layer()

# Run sensitivity analysis using qat.export.pt, reusing the bad cases found above.
qa = QuantAnalysis(qat_export_pt, quantized_bc, "convert")
qa.load_bad_case()
qa.sensitivity()
```

When conventional methods fail to locate the issue, use the BC Editor tool to partially convert quantized.bc back to CPU for debugging.

```python
# View the qat.bc model in text form.
# The number after "%" is the HBIR operator ID.
print(qat_bc.module.get_asm(enable_debug_info=True))
```

```text
module attributes {hbdk.legacy_round = true} {
  func.func @bev_gkt_mixvargenet_multitask_nuscenes(%arg0: tensor<6x3x512x960...
    %0 = "qnt.const_fake_quant"(%arg0) <{bits = 8 : i64, illegal = true, max...
    %1 = "hbir.constant"() <{values = dense<"0xC27B5D3DFF6DE33C1822093DDA9642...
    %2 = "qnt.const_fake_quant"(%1) <{axis = 0 : i64, bits = 8 : i64, illegal...
    ......
```

config.json:

```json
{
    "remove_fake_quant": [[1, 100], 102]  // Remove fake quant ops with IDs from 1 to 100, plus ID 102
}
```

```shell
# After editing, you get qat_modified.bc. Running convert on it produces a
# quantized.bc with the corresponding ops reverted to CPU.
python3 bc_editor.py --bc_path qat.bc --config_path config.json --new_bc_path qat_modified.bc
```

Compare the accuracy of quantized.bc before and after editing to identify which ops, when reverted to CPU, restore consistency.
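When removing fake quant over a broad ID range (e.g. `[[1, 100]]`) restores consistency, you can narrow down the culprit operator by bisecting the range, editing and re-converting the model at each step. The sketch below is illustrative: `bisect_bad_op` and `is_consistent` are hypothetical names, and `is_consistent` is assumed to wrap the bc_editor.py / convert / evaluate loop and to enclose exactly one culprit op.

```python
def bisect_bad_op(op_ids, is_consistent):
    """Bisect to find the single operator whose fake quant removal
    restores consistency.

    `is_consistent(ids)` is a placeholder: it should rebuild the model
    with fake quant removed for `ids` (e.g. via bc_editor.py), re-run
    convert, evaluate accuracy, and return True if it matches QAT.
    Assumes exactly one culprit op lies inside `op_ids`.
    """
    lo, hi = 0, len(op_ids)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # If removing fake quant for the first half restores consistency,
        # the culprit lies in that half; otherwise it is in the second half.
        if is_consistent(op_ids[lo:mid]):
            hi = mid
        else:
            lo = mid
    return op_ids[lo]
```

Each probe costs one edit/convert/evaluate cycle, so locating one op among 100 takes about seven cycles instead of a hundred.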