FAQ

How are Latency and FPS data calculated?

Latency refers to the average time the model spends on inference in a single process: the average time it takes to infer one frame when resources are sufficient. On the board, this corresponds to single-core, single-threaded execution. The pseudo code of the measurement method is as follows:

```cpp
// Load model and prepare input and output tensors
...
// Loop inference and measure latency
{
  int32_t const loop_num{1000};
  start = std::chrono::steady_clock::now();
  for (int32_t i = 0; i < loop_num; i++) {
    hbUCPSchedParam sched_param{};
    HB_UCP_INITIALIZE_SCHED_PARAM(&sched_param);
    // create task
    hbDNNInferV2(&task_handle, output_tensor, input_tensor, dnn_handle);
    // submit task
    hbUCPSubmitTask(task_handle, &sched_param);
    // wait for task to finish
    hbUCPWaitTaskDone(task_handle, 0);
    // release task handle
    hbUCPReleaseTask(task_handle);
    task_handle = nullptr;
  }
  end = std::chrono::steady_clock::now();
  latency = (end - start) / loop_num;
}
// Release tensors and model
...
```

FPS refers to the average number of frames inferred per second when multiple processes run inference at the same time; it measures the model's throughput when resources are fully utilized.

On the board, this corresponds to single-core, multi-threaded execution. The measurement method is to launch multiple threads that run inference simultaneously and count the average number of frames completed per second.

Why is the FPS estimated by Latency inconsistent with the FPS measured by the tool?

Latency and FPS are measured in different scenarios: Latency uses single-process (single-core, single-thread) inference, while FPS uses multi-process (single-core, multi-thread) inference, so the results differ. If the number of processes (threads) is set to 1 when measuring FPS, the FPS estimated from Latency is consistent with the measured value.

How to deal with out of memory during Perf?

When an out-of-memory error is reported, you can resolve it in the following ways:

  • Reduce the value of thread_num to lower parallelism and thus memory usage.

  • Optimize the model to reduce its memory usage.