Transmission Optimization

Reducing the amount of input and output data transferred between the X86 side and the board side can improve the tool's performance. The tool provides transmission optimization support for the following three usage scenarios.

  1. The model input is fixed or updated periodically. If the input tensor of a multi-frame inference model remains unchanged, it does not need to be transmitted repeatedly; it only needs to be stored once on the board side, and subsequent inferences can directly reuse the tensor on the board.

  2. Model output filtering: Unused model outputs do not need to be transmitted back to the X86 side.

  3. In multi-model scenarios, the output of a preceding model can be directly used as the input for a subsequent model. In this case, the output tensor of the preceding model can be stored on the board side without being transmitted back, and the corresponding input for the subsequent model also does not need to be transmitted, as it can use the tensor stored on the board.

Classes, Interfaces, and Parameters

  1. Class HTensor

    HTensor is used as the input and output tensor in transmission optimization scenarios. It is a wrapper class for tensors on the X86 side or the board side, designed to provide a unified data interface and restrict modifications to certain attributes to ensure data consistency.

    1). HTensor Member Method: __init__

    def __init__(
        self,
        data: Union[np.ndarray, torch.Tensor, None],
        device: Union[str, List[str], None],
        key: Optional[str] = None,
    ) -> None:

    Initialize an HTensor object.

    • Parameters
    Parameter    Description
    data         Tensor data to be encapsulated. Supported types: numpy.ndarray, torch.Tensor, or None.
    device       Storage device of the tensor: None (unavailable), "cpu" (X86 side), "bpu" (board side), or ["cpu", "bpu"] (both).
    key          A key used on the board side to uniquely identify the tensor.

    2). data Attribute of HTensor

    Retrieves or assigns the tensor data. When assigning new data, if the existing data is not None, the data type must match that of the original.

    3). device Attribute of HTensor

    Obtains the storage device information of the tensor. This attribute is immutable once the object is constructed.

    4). key Attribute of HTensor

    Obtains the tensor's unique identification key on the board side. This attribute is immutable once the object is constructed.

    5). shape Attribute of HTensor

    Obtains the shape of the tensor. Manual modifications are prohibited, as it is maintained automatically by the tool.
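    The attribute constraints above can be illustrated with a minimal sketch. This is not the library's actual implementation; HTensorSketch is a hypothetical class that only mimics the documented behavior: device and key are fixed at construction time, data may be reassigned only with a matching dtype, and shape is derived automatically from the wrapped data.

```python
import numpy as np

# Illustrative sketch only -- not the real hbm_infer.HTensor implementation.
class HTensorSketch:
    def __init__(self, data, device, key=None):
        self._data = data
        self._device = device  # immutable after construction (no setter below)
        self._key = key        # immutable after construction (no setter below)

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, new_data):
        # When the existing data is not None, new data must keep the same dtype
        if self._data is not None and new_data.dtype != self._data.dtype:
            raise TypeError("new data must match the original dtype")
        self._data = new_data

    @property
    def device(self):
        return self._device

    @property
    def key(self):
        return self._key

    @property
    def shape(self):
        # Maintained automatically from the wrapped data; never set manually
        return None if self._data is None else self._data.shape


t = HTensorSketch(np.ones((2, 3), dtype=np.int8),
                  device=["cpu", "bpu"], key="model.input_0")
print(t.shape)                            # (2, 3)
t.data = np.zeros((2, 3), dtype=np.int8)  # same dtype: allowed
```

    Because device, key, and shape are exposed as getter-only properties, any attempt to assign them raises an AttributeError, matching the "immutable once constructed" rule in the documentation.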

  2. output_config Parameter of HbmRpcSession.__call__

    This parameter configures the transmission behavior of the output tensors after the current inference frame ends. Its type is Dict[str, Dict[str, Any]] , where the first-level keys are the model output names; each second-level dict must contain "device" and may optionally contain "key" . The meanings and constraints of the "device" and "key" values are the same as those of the device and key parameters in the HTensor constructor.

    When a certain output name of the model is correctly configured in output_config , the corresponding output tensor in the inference results returned for the current frame will be of type HTensor. Outputs that are not configured will be returned as regular types ( numpy.ndarray or torch.Tensor ).
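    The structural rules above can be sketched as a small validator. This is a hypothetical helper for illustration only (the real session performs its own checks); it encodes the documented constraints: "device" is required, its value follows the HTensor constructor's allowed forms, and a "key" is needed whenever "bpu" appears in the device.

```python
# Illustrative sketch only: check_output_config is a hypothetical helper,
# not part of the hbm_infer API.
def check_output_config(cfg):
    """Validate an output_config of shape {output_name: {"device": ..., "key": ...}}."""
    allowed_devices = (None, "cpu", "bpu", ["cpu", "bpu"], ["bpu", "cpu"])
    for name, entry in cfg.items():
        if "device" not in entry:
            raise ValueError(f"{name}: 'device' is required")
        device = entry["device"]
        if device not in allowed_devices:
            raise ValueError(f"{name}: unsupported device {device!r}")
        # Tensors stored on the board must be addressable by a unique key
        if device is not None and "bpu" in device and "key" not in entry:
            raise ValueError(f"{name}: a 'key' is required when device includes 'bpu'")


# Discard output_2; keep output_0 on the board side under a key
check_output_config({
    "output_2": {"device": None},
    "output_0": {"device": "bpu", "key": "model0.output_0"},
})
```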

Examples of Transmission Optimization Application

  1. Periodic Input Update

    In this example, it is assumed that the model has an input named img , which is updated every 10 frames during inference. On the first frame after img is updated, the tool transfers it to the board side; in the remaining frames, no input data is transferred.

    import torch
    import logging

    from hbm_infer.hbm_rpc_session import HbmRpcSession, HTensor, logger


    def periodic_input_update(epoch: int = 50):
        # Create session
        session = HbmRpcSession(host=<available_ip>, local_hbm_path=<local_hbm_path>)

        # img is updated periodically, so it needs to be wrapped using HTensor
        fixed_img = HTensor(
            # Set initial data
            data=torch.ones((1, 3, 224, 224), dtype=torch.int8),
            # For fixed input or periodically updated input scenarios,
            # device can only be set to ["cpu", "bpu"]:
            # "cpu" means the tensor is stored on the X86 side,
            # "bpu" means a copy of this tensor will be created and stored on the board side
            device=["cpu", "bpu"],
            # Since "bpu" is included in device,
            # a key needs to be set to uniquely identify this tensor on the board side
            key="model.input_0",
        )

        # Create model input dict
        input = {"img": fixed_img}

        for e in range(epoch):
            print(f"Epoch: {e}")
            # Update img input when a certain condition is met
            if e % 10 == 0:
                # When HTensor.data is updated, the updated tensor data
                # will be transmitted to the board side in subsequent inference calls;
                # if not updated, no data transmission occurs
                fixed_img.data = torch.ones((1, 3, 224, 224), dtype=torch.int8)

            # Inference
            output = session(data=input)
            for k, v in output.items():
                print(k, v.shape)

        # Close session
        session.close_server()


    if __name__ == "__main__":
        logger.setLevel(logging.DEBUG)
        periodic_input_update(epoch=50)

  2. Output Filtering

    In this example, the model has three outputs: output_0 , output_1 , and output_2 . Among them, output_2 is unused and filtered out, so only output_0 and output_1 are returned to the X86 side.

    import torch
    import logging

    from hbm_infer.hbm_rpc_session import HbmRpcSession, HTensor, logger


    def output_discard():
        session = HbmRpcSession(host=<available_ip>, local_hbm_path=<local_hbm_path>)

        # Create input dict
        input = {
            "input_0": torch.ones((4, 1024, 1024), dtype=torch.float32),
            "input_1": torch.ones((4, 1024, 1024), dtype=torch.float32),
        }

        # Configure output_config
        output_config = {
            # output_0 is returned normally, so no configuration needed
            # output_1 is returned normally, so no configuration needed
            # output_2 is filtered out, i.e., it is not transmitted back to the X86 side
            # nor stored on the board side; set device to None.
            # Since device does not include "bpu", key does not need to be set.
            "output_2": {"device": None}
        }

        # Run inference on the model
        output = session(data=input, output_config=output_config)

        # In the inference results, output_0 and output_1 are returned normally
        # as torch.Tensor; output_2 is of type HTensor
        print(type(output["output_0"]))  # <class 'torch.Tensor'>
        print(type(output["output_1"]))  # <class 'torch.Tensor'>
        print(type(output["output_2"]))  # <class 'hbm_infer.utils.HTensor'>
        # Its data attribute is None, meaning data is not transmitted back
        print(output["output_2"].data)  # None
        # Regardless of whether the output is discarded,
        # its shape information is always returned
        print(output["output_2"].shape)  # (4, 1024, 1024)


    if __name__ == "__main__":
        logger.setLevel(logging.DEBUG)
        output_discard()

  3. Model Chaining

    This example assumes that the input HBM file contains two models: model0 and model1. The output of model0 named output_0 will be directly used as the input named input_0 for model1. In this process, the output of model0 does not need to be transmitted back to the X86 side, and the input of model1 does not need to be transmitted to the board side.

    import torch
    import logging

    from hbm_infer.hbm_rpc_session import HbmRpcSession, HTensor, logger


    def model_chaining():
        # Create session
        session = HbmRpcSession(host=<available_ip>, local_hbm_path=<local_hbm_path>)

        # [model0, model1]
        # Note: Model names are not equivalent to the HBM file names
        print(session.get_model_names())

        # Create input dict for model0
        model0_input = {"input_0": torch.ones((4, 1024, 1024), dtype=torch.float32)}

        # Configure output_config for model0
        model0_output_config = {
            "output_0": {
                # output_0 does not need to be transmitted back and is stored
                # directly on the board, so set device to "bpu" and assign a key
                "device": "bpu",
                "key": "model0.output_0",
            }
        }

        # Run inference for model0
        model0_output = session(
            data=model0_input, output_config=model0_output_config, model_name="model0"
        )

        # model0.output_0 is of type HTensor
        print(type(model0_output["output_0"]))  # <class 'hbm_infer.utils.HTensor'>
        # Its data attribute is None, meaning data is not transmitted back
        print(model0_output["output_0"].data)  # None
        # Regardless of whether the output tensor is transmitted back,
        # its shape information is always returned
        print(model0_output["output_0"].shape)  # (4, 1024, 1024)

        # Create input dict for model1
        model1_input = {
            # Directly use model0's output_0 inference result as model1's input_0.
            # In subsequent inferences, model1's input_0 does not need to be transmitted,
            # but reuses the tensor stored on the board based on the key value
            "input_0": model0_output["output_0"]
        }

        # Run inference for model1; its output is returned normally,
        # so there is no need to configure output_config
        model1_output = session(data=model1_input, model_name="model1")
        print(type(model1_output["output_0"]))  # <class 'torch.Tensor'>

        # Close session
        session.close_server()


    if __name__ == "__main__":
        logger.setLevel(logging.DEBUG)
        model_chaining()

  4. Comprehensive Application

    The following inference pipeline covers the three scenarios of periodic input update, output filtering, and model chaining. The flowchart is as follows:

    (Flowchart: hbm_infer_pipeline_sample)

    The reference code is as follows:

    import torch
    import logging

    from hbm_infer.hbm_rpc_session import HbmRpcSession, HTensor, logger


    def test_pipeline(epoch=50):
        # Create session
        session = HbmRpcSession(host=<available_ip>, local_hbm_path=<pipeline_hbm_path>)

        # [model0, model1, model2]
        # Note: Model names are not equivalent to HBM file names
        print(f"Model list: {session.get_model_names()}")

        # model1's input_1 is updated periodically and needs to be wrapped with HTensor
        model1_fixed_input1 = HTensor(
            # Set initial data
            data=torch.ones((4, 1024, 1024), dtype=torch.float32) * 2,
            # For fixed or periodically updated input scenarios,
            # device can only be set to ["cpu", "bpu"]:
            # "cpu" means the tensor is stored on the X86 side,
            # "bpu" means a copy of this tensor will be created and stored on the board side
            device=["cpu", "bpu"],
            # Since device includes "bpu",
            # a key must be set to uniquely identify this tensor on the board side
            key="model1.input_1",
        )

        for e in range(epoch):
            print(f"Epoch: {e}")

            # model0's input_0 is a normal input
            model0_input = {
                "input_0": torch.ones((4, 1024, 1024), dtype=torch.float32) * -1
            }

            # Save model0's output_0 on the board side; the actual tensor data
            # is not transmitted back to the X86 side. The inference result is
            # returned as an HTensor and can be directly used as input for
            # subsequent models. The data attribute of this HTensor will be None,
            # but you can still get its shape attribute. When used as input for
            # subsequent models, the tool will automatically locate the
            # corresponding tensor on the board by its key.
            model0_output_config = {"output_0": {"device": "bpu", "key": "model0.output_0"}}

            # Inference for model0
            model0_output = session(
                data=model0_input,
                # Set output_config
                output_config=model0_output_config,
                # Specify the model name for inference
                model_name="model0",
            )

            # Simulate periodic update of model1's input_1
            if e % 10 == 0:
                model1_fixed_input1.data = (
                    torch.ones((4, 1024, 1024), dtype=torch.float32) * 2
                )

            model1_input = {
                # Directly use model0's output_0 as model1's input_0
                "input_0": model0_output["output_0"],
                # Fixed input1
                "input_1": model1_fixed_input1,
            }

            model1_output_config = {
                # Discard output_0 by setting device to None
                "output_0": {
                    "device": None,
                },
                # output_1 is returned normally, so no need to configure in output_config
                # Store output_2 on the board side and simultaneously
                # transmit it back to the X86 side
                "output_2": {"device": ["bpu", "cpu"], "key": "model1.output2"},
            }

            # Inference for model1
            model1_output = session(
                data=model1_input,
                output_config=model1_output_config,
                model_name="model1",
            )

            # Use model1's output_2 as model2's input_0
            model2_input = {"input_0": model1_output["output_2"]}

            # Inference for model2
            model2_output = session(data=model2_input, model_name="model2")

            # Check the result
            print(f"{torch.all(model2_output['output_0'] == 1)}")

        # Close session
        session.close_server()


    if __name__ == "__main__":
        logger.setLevel(logging.DEBUG)
        test_pipeline(epoch=100)