Common Failures

Import Errors

Error I: Cannot find the extension library (_C.so)

Solution:

  • Make sure that the horizon_plugin_pytorch version matches the CUDA version.
  • In Python 3, locate the installation directory of horizon_plugin_pytorch and check the .so files there. Multiple versions of horizon_plugin_pytorch may be installed at the same time; uninstall them and keep only the one you need.
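The second step can be sketched with the standard library alone. The helper below is illustrative (not part of horizon_plugin_pytorch); it locates a package and lists any compiled .so files beside it:

```python
import importlib.util
import pathlib

def find_extension_libs(package_name):
    """Locate an installed package and list the compiled .so files in its directory."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None, []
    pkg_dir = pathlib.Path(spec.origin).parent
    return pkg_dir, sorted(p.name for p in pkg_dir.glob("*.so"))

# A stdlib package is used here so the snippet runs anywhere;
# substitute "horizon_plugin_pytorch" in your environment:
pkg_dir, libs = find_extension_libs("json")
print(pkg_dir, libs)
```

If the returned list is empty or contains .so files built for a different CUDA version, reinstall the matching wheel.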

Error II: RuntimeError: Cannot load custom ops. Please rebuild the horizon_plugin_pytorch

Solution: check that the local CUDA environment is configured correctly, e.g., that the path and version are as expected.
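A quick way to collect the usual suspects is sketched below. It only inspects environment variables and PATH; the variable names (CUDA_HOME, LD_LIBRARY_PATH) are the conventional ones, which your setup may or may not use:

```python
import os
import shutil

def cuda_env_report():
    """Gather basic facts about the local CUDA setup (paths only, no torch needed)."""
    return {
        "CUDA_HOME": os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH"),
        "nvcc": shutil.which("nvcc"),          # None means nvcc is not on PATH
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }

print(cuda_env_report())
```

Compare the reported CUDA location and version against the one the installed horizon_plugin_pytorch wheel was built for.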

Quantized Accuracy Anomaly

The QAT/quantized accuracy is below expectations, NaN values appear, or the initial QAT loss is clearly anomalous compared with the float model.

Solution: please refer to the section Accuracy Tuning Tool Guide.

Module doesn't support deepcopy

Some frameworks (e.g., PyTorch Lightning) provide wrappers around the native torch module that do not support deepcopy.

Solution: Implement the __deepcopy__ method for the model.

class Model(Module):
    ...

    def __deepcopy__(self, memo):
        new_model = Model()
        new_model.xxx = self.xxx
        return new_model
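A minimal, framework-free sketch of the same pattern; the Wrapper class and its weight attribute are made up for illustration and stand in for a wrapper whose default deepcopy fails:

```python
import copy

class Wrapper:
    """Stand-in for a framework wrapper that deepcopy cannot handle by default."""

    def __init__(self, weight):
        self.weight = weight

    def __deepcopy__(self, memo):
        # Rebuild the object by hand instead of letting deepcopy recurse
        # into attributes it cannot handle.
        new_obj = Wrapper(copy.deepcopy(self.weight, memo))
        memo[id(self)] = new_obj
        return new_obj

original = Wrapper([1.0, 2.0])
clone = copy.deepcopy(original)
print(clone is original, clone.weight == original.weight)  # False True
```

Registering the new object in memo prevents infinite recursion if the object graph contains cycles.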


RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

Solution: this error mainly occurs during the prepare phase and is usually caused by non-leaf tensors in the model; set the inplace argument of prepare to True.

torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

Solution: this is probably caused by a previous multi-process Python program that was not completely killed; terminate any leftover processes before rerunning.

AttributeError: 'NoneType' object has no attribute 'numel'

Solution: this error occurs mainly while inserting fake-quantize nodes and is caused by an operator whose input scale is None. A typical cause is a conv whose output passes through dequant and is then fed to another op (a conv + dequant + conv structure), or a conv configured for high-precision output whose result is consumed by other operators. Check that the dequant operator and the high-precision output configuration are used correctly.

symbolically traced variables cannot be used as inputs to control flow

Solution: this error is caused by dynamic control flow (if, loops, etc.) in fx mode. fx mode currently supports only static control flow, so avoid data-dependent statements such as if, for, and assert in forward.

NotImplementedError: function <method 'neg' of 'torch._C._TensorBase' objects> is not implemented for QTensor.

Solution: this error may occur in the calibration phase in fx mode because fx mode does not support computations of the form (-x); please change (-x) to (-1)*(x).
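The rewrite preserves the value; plain floats stand in for tensors in this quick check:

```python
def neg_direct(x):
    return -x            # lowers to Tensor.neg, which QTensor does not implement

def neg_rewritten(x):
    return (-1) * x      # multiplication is supported

for v in (0.0, 1.5, -2.25):
    assert neg_direct(v) == neg_rewritten(v)
print("equivalent")
```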

NotImplementedError: function <function Tensor.rsub at 0x7f5a7cd1ee50> is not implemented for QTensor.

Solution: this error may occur in the calibration phase in fx mode. The operator-substitution logic in fx mode does not perform substitution automatically when the minuend of a subtraction is a constant, so rewrite the subtraction in terms of addition, e.g., change (1-x) to (x+(-1))*(-1).
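The two forms are algebraically identical, since 1-x = -(x-1) = (x+(-1))*(-1); plain floats stand in for tensors in this quick check:

```python
def rsub_form(x):
    return 1 - x                 # constant minuend triggers Tensor.rsub

def rewritten_form(x):
    return (x + (-1)) * (-1)     # same value, built from supported add and mul

for v in (0.0, 0.5, 2.0, -3.0):
    assert rsub_form(v) == rewritten_form(v)
print("equivalent")
```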