3. Running Inference¶
TVM inference requires the three artifact files produced by compilation:
deploy_lib.so, deploy_graph.json, and deploy_param.params.
The runtime behaviour is determined at compile time: with c7x_codegen=1
the Arm core dispatches the entire graph to the C7x NPU via OpenVX; with
c7x_codegen=0 the Arm core executes non-TIDL layers directly alongside
TIDL subgraph dispatch.
The same inference code works for both cases.
3.1. Running Inference with Python¶
Load the compiled module and run inference using TVM’s graph_executor:
import tvm
from tvm.contrib import graph_executor
import numpy as np
artifacts_folder = "./artifacts"
# Load compiled artifacts
with open(f"{artifacts_folder}/deploy_graph.json") as f:
    graph = f.read()
lib = tvm.runtime.load_module(f"{artifacts_folder}/deploy_lib.so")
with open(f"{artifacts_folder}/deploy_param.params", "rb") as f:
    params = bytearray(f.read())
# Create runtime session
sess = graph_executor.create(graph, lib, tvm.cpu())
sess.load_params(params)
# Run inference ("input_data" is a numpy array matching your model's input shape)
sess.set_input("input", input_data)  # replace "input" with your model's input name
sess.run()
output = sess.get_output(0).asnumpy()
Note
The first inference call includes TIDL initialisation overhead. Run a warm-up inference before measuring performance.
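One way to separate the warm-up cost from steady-state latency is a small timing helper; this is a generic sketch (the `run_once` callable and iteration counts are illustrative, not part of the TVM API):

```python
import time

def timed_inference(run_once, warmup=1, iterations=10):
    """Run `run_once` (e.g. a lambda wrapping sess.run()) and return
    the average wall-clock time in milliseconds, excluding warm-up."""
    # Warm-up runs absorb the one-time TIDL initialisation overhead
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / iterations
```

For the session above, `run_once` could simply be `sess.run`.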
The tvmrt_wrapper.py in edgeai-tidl-tools wraps this pattern and
automatically selects the correct artifact suffix (.pc or .evm)
based on the host architecture. See tvmrt_wrapper.py for a complete
reference implementation.
3.2. Performance Profiling¶
After running inference, retrieve performance metrics via
get_TI_benchmark_data():
benchmark = sess.get_TI_benchmark_data()
The tvmrt_wrapper.py get_performance() method processes this data
into the following metrics:
| Metric | Unit | Description |
|---|---|---|
|  | ms | Total inference time from run start to run end |
|  | ms | Total time excluding I/O copy overhead |
|  | ms | Time spent executing TIDL subgraphs on C7x NPU |
|  | bytes | DDR read bytes during inference (not available on x86) |
|  | bytes | DDR write bytes during inference (not available on x86) |
|  | — | Number of TIDL subgraphs detected in the compiled model |
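The raw benchmark data is a dictionary of timestamps. As a shape for the computation only, the sketch below assumes nanosecond timestamps under hypothetical `ts:run_start` / `ts:run_end` keys; check the actual key names returned by your runtime version:

```python
def total_time_ms(benchmark):
    """Compute total inference time in milliseconds from a benchmark
    dict of nanosecond timestamps. The 'ts:run_start'/'ts:run_end'
    key names are illustrative assumptions, not a documented API."""
    return (benchmark["ts:run_end"] - benchmark["ts:run_start"]) / 1e6

# Example with synthetic timestamps 5 ms apart
bm = {"ts:run_start": 1_000_000_000, "ts:run_end": 1_005_000_000}
# total_time_ms(bm) -> 5.0
```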
For trace-based performance profiling using TVM_RT_DEBUG, see
Debugging Inference.
3.3. Running edgeai-tidl-tools Examples¶
The edgeai-tidl-tools repository provides ready-to-run examples for both Python and C++.
3.3.1. On the host (x86)¶
Ensure environment variables are set by sourcing setup_env.sh before
running any examples.
cd ./runtimes/examples/python/basic_example/
# Compile and run inference with TVM runtime
python3 basic_example.py --config ./config.yaml -r tvmrt --compile
python3 basic_example.py --config ./config.yaml -r tvmrt --infer
cd -
Model artifacts are saved to ./runtimes/examples/model-artifacts/.
Inference outputs are saved to
./runtimes/examples/python/basic_example/outputs/{model_name}/offload/frame_{frame_num}/.
C++ inference examples are also provided (compilation not supported from C++):
cd ./runtimes/examples/cpp/basic_example
../bin/Release/basic_example --config ./config.yaml
cd -
3.3.2. On the EVM¶
Model compilation must be performed on the host first. Transfer artifacts to the EVM (or use NFS):
scp -r ./runtimes/examples/model-artifacts root@<evm-ip>:~/
Then run inference on the EVM:
export SOC=<SOC>
cd ./runtimes/examples/python/basic_example/
python3 basic_example.py --config ./config.yaml -r tvmrt --infer
cd -
Note
Processor SDK 11.2 is missing two TVM runtime dependencies. Install them before running TVM examples on the EVM:
pip3 install psutil typing_extensions
See Installation Instructions for details.
3.4. Debugging Inference¶
TIDL uses the TIDL_RT_DEBUG environment variable to control debug output
during inference.
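Because TIDL reads the variable from the environment, it can be exported in the shell before launching inference, or set from Python; a minimal sketch, assuming the environment is read when the runtime initialises:

```python
import os

# Must be set before the TVM module is loaded / the session is created,
# since TIDL typically reads the environment at initialisation time
os.environ["TIDL_RT_DEBUG"] = "1"
```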
3.4.1. Vision apps “printf” terminal¶
Open a terminal on the EVM and run the following command to see debug output from the C7x NPU core:
# if using EdgeAI Start Kit SDK
root@tda4vm-sk:~# /opt/vx_app_arm_remote_log.out
# if using J721E SDK
root@j7-evm:~# cd /opt/vision_apps/
root@j7-evm:/opt/vision_apps# source ./vision_apps_init.sh
Then run inference in a separate terminal.
3.4.2. Debugging TIDL subgraphs¶
3.4.2.1. TIDL_RT_DEBUG=1¶
When set to 1, TIDL subgraph performance information is printed during inference, either on the “printf” terminal or on the terminal where inference is running. For example:
[C7x_1 ] 1851913.287814 s: Layer, Layer Cycles,kernelOnlyCycles, coreLoopCycles,LayerSetupCycles,dmaPipeupCycles, dmaPipeDownCycles, PrefetchCycles,copyKerCoeffCycles,LayerDeinitCycles,LastBlockCycles, paddingTrigger, paddingWait,LayerWithoutPad,LayerHandleCopy, BackupCycles, RestoreCycles,
[C7x_1 ] 1851913.287889 s: 0, 201247, 171496, 173177, 1021, 9371, 20, 0, 0, 375, 41487, 6392, 51, 191300, 0, 0, 0,
[C7x_1 ] 1851913.287956 s: 1, 44208, 17603, 18221, 4170, 2679, 18, 0, 0, 662, 17603, 7720, 454, 33530, 2096, 0, 0,
... ... ...
3.4.2.2. TIDL_RT_DEBUG=2, 3¶
When set to 2 or 3, detailed TIDL subgraph import and layer execution information is provided in addition to the level 1 output. For example:
[C7x_1 ] 1852081.872134 s: Alg Alloc for Layer # - 0
[C7x_1 ] 1852081.872160 s: Alg Alloc for Layer # - 1
... ... ...
[C7x_1 ] 1852081.873059 s: TIDL Memory requirement
[C7x_1 ] 1852081.873087 s: MemRecNum , Space , Attribute , SizeinBytes
[C7x_1 ] 1852081.873117 s: 0 , DDR , Persistent, 15208
[C7x_1 ] 1852081.873145 s: 1 , DDR , Persistent, 136
... ... ...
[C7x_1 ] 1852081.874400 s: Alg Init for Layer # - 2 out of 32
[C7x_1 ] 1852081.874470 s: Alg Init for Layer # - 3 out of 32
... ... ...
[C7x_1 ] 1852081.911120 s: Starting Layer # - 1
[C7x_1 ] 1852081.911145 s: Processing Layer # - 1
[C7x_1 ] 1852081.911375 s: End of Layer # - 1 with outPtrs[0] = 7002001e
[C7x_1 ] 1852081.911400 s: Starting Layer # - 2
[C7x_1 ] 1852081.911422 s: Processing Layer # - 2
[C7x_1 ] 1852081.911493 s: End of Layer # - 2 with outPtrs[0] = 7004550e
... ...
3.4.2.3. TIDL_RT_DEBUG=4, 5¶
Supported only when TIDL-unsupported layers run on Arm (i.e. the model was
compiled with c7x_codegen=0). When set to 4 or 5, the tensor output of each
layer in the TIDL subgraph is dumped to the Arm Linux filesystem, with
filenames
tidl_trace_subgraph_<subgraph_id>_<layer_id>_<tensor_shape>.y for raw
data and
tidl_trace_subgraph_<subgraph_id>_<layer_id>_<tensor_shape>_float.bin
for converted float data.
3.4.3. Debugging TVM nodes¶
The TVM_RT_DEBUG, TVM_RT_TRACE_NODE, and TVM_RT_TRACE_SIZE
environment variables control TVM node debugging.
3.4.3.1. TVM_RT_DEBUG=1¶
When set to 1, TVM runtime on Arm collects performance statistics for each
node in the graph and saves them to the Arm Linux filesystem as
tvm_arm.trace. Use the dump_tvm_trace.py script to read the file:
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 633 version: 0x20220728 device: J7 core: Arm
node 1: tidl_8 1520.9 microseconds
node 2: tvmgen_default_fused_multiply 436.81 microseconds
node 3: tidl_7 993.62 microseconds
node 4: tvmgen_default_fused_multiply_1 149.11 microseconds
node 5: tidl_6 1162.075 microseconds
node 6: tvmgen_default_fused_multiply_11 176.68 microseconds
node 7: tidl_5 2233.29 microseconds
node 8: tvmgen_default_fused_multiply_2 120.095 microseconds
node 9: tidl_4 1503.91 microseconds
node 10: tvmgen_default_fused_multiply_3 163.83 microseconds
node 11: tidl_3 1381.615 microseconds
node 12: tvmgen_default_fused_multiply_4 61.94 microseconds
node 13: tidl_2 1195.17 microseconds
node 14: tvmgen_default_fused_multiply_5 70.53 microseconds
node 15: tidl_1 1311.9 microseconds
node 16: tvmgen_default_fused_multiply_51 71.32 microseconds
node 17: tidl_0 1279.285 microseconds
node 4294967295: Graph 13856.93 microseconds
When TIDL unsupported layers are running on the C7x NPU, tvm_arm.trace
shows only a single node representing the whole graph:
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 69 version: 0x20220728 device: J7 core: Arm
node 1: tidl_tvm_0 9126.275 microseconds
node 4294967295: Graph 9129.92 microseconds
3.4.3.2. TVM_RT_DEBUG=2¶
When set to 2, TVM runtime on the C7x NPU also collects per-node performance
statistics and saves them to tvm_c7x.trace in addition to the level 1
output. For example:
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 631 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8 909.002 microseconds
node 2: tvmgen_default_fused_multiply 62.201 microseconds
node 3: tidl_7 378.24 microseconds
node 4: tvmgen_default_fused_multiply_1 83.055 microseconds
node 5: tidl_6 445.359 microseconds
node 6: tvmgen_default_fused_multiply_1 83.433 microseconds
node 7: tidl_5 1590.525 microseconds
node 8: tvmgen_default_fused_multiply_2 82.979 microseconds
node 9: tidl_4 809.138 microseconds
node 10: tvmgen_default_fused_multiply_3 110.202 microseconds
node 11: tidl_3 802.037 microseconds
node 12: tvmgen_default_fused_multiply_4 46.95 microseconds
node 13: tidl_2 848.143 microseconds
node 14: tvmgen_default_fused_multiply_5 55.662 microseconds
node 15: tidl_1 908.129 microseconds
node 16: tvmgen_default_fused_multiply_5 54.628 microseconds
node 17: tidl_0 990.953 microseconds
node 4294967295: Graph 8283.967 microseconds
tvm_c7x.trace is only available when the model was compiled with
c7x_codegen=1.
3.4.3.3. TVM_RT_DEBUG=3¶
When set to 3, TVM runtime on Arm and C7x NPU also collects output tensor statistics (min, max, sum, and sum of the first half) for each layer and saves them to the trace files.
Note
Performance numbers collected at level 3 include instrumentation overhead and should not be used for benchmarking.
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 1833 version: 0x20220728 device: J7 core: Arm
node 1: tidl_8 1527.845 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
node 2: tvmgen_default_fused_multiply 416.19 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
... ... ...
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 1831 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8 934.738 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
node 2: tvmgen_default_fused_multiply 64.704 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
... ... ...
3.4.3.4. TVM_RT_DEBUG=4¶
When set to 4, TVM runtime on Arm and C7x NPU prints information about each node as it executes, either on the inference terminal or the “printf” terminal. Useful for tracing execution order and identifying hangs.
3.4.3.5. TVM_RT_TRACE_NODE=<node_id> TVM_RT_DEBUG=3, 4¶
When set, TVM runtime saves the tensor outputs of the specified node to the
trace file. The dump_tvm_trace.py script then saves them as numpy files
named n<node_id>_o<output_id>.npy. For example:
# TVM_RT_TRACE_NODE=1 TVM_RT_DEBUG=4 python3 ./infer_model.py ...
...
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 227911 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8 912.196 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
tensor values saved in n1_o0.npy
output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
tensor values saved in n1_o1.npy
node 2: tvmgen_default_fused_multiply 65.955 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
node 3: tidl_7 388.9 microseconds
3.4.3.6. TVM_RT_TRACE_SIZE=<new_size> TVM_RT_TRACE_NODE=<node_id> TVM_RT_DEBUG=3, 4¶
If you see a Not enough trace memory for dumping output <node_id> message,
use TVM_RT_TRACE_SIZE to increase the trace buffer size. The default is
2 MB (2*1024*1024 bytes).
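For example, to quadruple the default buffer, the variable can be exported before launching inference, or set from Python before the runtime is created (value in bytes):

```python
import os

# Default trace buffer is 2 MB; bump to 8 MB for large tensor dumps
os.environ["TVM_RT_TRACE_SIZE"] = str(8 * 1024 * 1024)
```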