3. Running Inference

TVM inference requires the three artifact files produced by compilation: deploy_lib.so, deploy_graph.json, and deploy_param.params. The runtime behaviour is determined at compile time: with c7x_codegen=1 the Arm core dispatches the entire graph to the C7x NPU via OpenVX; with c7x_codegen=0 the Arm core executes non-TIDL layers directly alongside TIDL subgraph dispatch.

The same inference code works for both cases.

3.1. Running Inference with Python

Load the compiled module and run inference using TVM’s graph_executor:

import tvm
from tvm.contrib import graph_executor
import numpy as np

artifacts_folder = "./artifacts"

# Load compiled artifacts
graph  = open(f"{artifacts_folder}/deploy_graph.json").read()
lib    = tvm.runtime.load_module(f"{artifacts_folder}/deploy_lib.so")
params = bytearray(open(f"{artifacts_folder}/deploy_param.params", "rb").read())

# Create runtime session
sess = graph_executor.create(graph, lib, tvm.cpu())
sess.load_params(params)

# Run inference
sess.set_input("input", input_data)   # replace "input" with your input name
sess.run()

output = sess.get_output(0).asnumpy()

Note

The first inference call includes TIDL initialisation overhead. Run a warm-up inference before measuring performance.
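The warm-up pattern can be sketched as a small helper; this is an illustrative sketch (the helper name and iteration counts are not part of edgeai-tidl-tools), used with the session created above:

```python
import time

def measure_latency(run_fn, num_warmup=1, num_runs=10):
    """Run warm-up iterations, then return average latency in ms."""
    for _ in range(num_warmup):
        run_fn()  # absorbs the one-time TIDL initialisation overhead
    start = time.perf_counter()
    for _ in range(num_runs):
        run_fn()
    return (time.perf_counter() - start) / num_runs * 1000.0

# Usage with the graph_executor session created above (inputs already set):
# avg_ms = measure_latency(sess.run)
```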

The tvmrt_wrapper.py script in edgeai-tidl-tools wraps this pattern and automatically selects the correct artifact suffix (.pc or .evm) based on the host architecture. See tvmrt_wrapper.py for a complete reference implementation.
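The architecture-based suffix selection can be sketched as below. This is a hypothetical helper, not the wrapper's actual code; tvmrt_wrapper.py remains the authoritative implementation, including how the suffix is composed into the artifact filenames.

```python
import platform

def artifact_suffix():
    """Pick the artifact suffix from the host architecture:
    x86 hosts load the .pc build, Arm targets (EVM) load the .evm build.
    How the suffix attaches to the filenames is defined by tvmrt_wrapper.py."""
    return ".pc" if platform.machine() in ("x86_64", "AMD64") else ".evm"
```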

3.2. Performance Profiling

After running inference, retrieve performance metrics via get_TI_benchmark_data():

benchmark = sess.get_TI_benchmark_data()

The get_performance() method in tvmrt_wrapper.py processes this data into the following metrics:

Metric          Unit    Description
total_time      ms      Total inference time from run start to run end
core_time       ms      Total time excluding I/O copy overhead
subgraph_time   ms      Time spent executing TIDL subgraphs on the C7x NPU
read_total      bytes   DDR read bytes during inference (not available on x86)
write_total     bytes   DDR write bytes during inference (not available on x86)
num_subgraphs   count   Number of TIDL subgraphs detected in the compiled model

For trace-based performance profiling using TVM_RT_DEBUG, see Debugging Inference.
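For orientation, the shape of this post-processing can be sketched as below. The key names and units used here are assumptions (nanosecond timestamps under keys such as "ts:run_start"/"ts:run_end", as in other TIDL OSRT runtimes); consult get_performance() in tvmrt_wrapper.py for the authoritative names.

```python
def total_time_ms(benchmark):
    """Derive total inference time in ms from start/end timestamps.
    Assumes nanosecond timestamps under hypothetical keys
    'ts:run_start' / 'ts:run_end' -- check tvmrt_wrapper.py for real names."""
    return (benchmark["ts:run_end"] - benchmark["ts:run_start"]) / 1e6

# Example with mock data: 5 ms between run start and run end
mock = {"ts:run_start": 0, "ts:run_end": 5_000_000}
print(total_time_ms(mock))  # 5.0
```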

3.3. Running edgeai-tidl-tools Examples

The edgeai-tidl-tools repository provides ready-to-run examples for both Python and C++.

3.3.1. On the host (x86)

Ensure environment variables are set by sourcing setup_env.sh before running any examples.

cd ./runtimes/examples/python/basic_example/

# Compile and run inference with TVM runtime
python3 basic_example.py --config ./config.yaml -r tvmrt --compile
python3 basic_example.py --config ./config.yaml -r tvmrt --infer

cd -

Model artifacts are saved to ./runtimes/examples/model-artifacts/. Inference outputs are saved to ./runtimes/examples/python/basic_example/outputs/{model_name}/offload/frame_{frame_num}/.

C++ inference examples are also provided (model compilation is not supported from C++):

cd ./runtimes/examples/cpp/basic_example
../bin/Release/basic_example --config ./config.yaml
cd -

3.3.2. On the EVM

Model compilation must be performed on the host first. Transfer artifacts to the EVM (or use NFS):

scp -r ./runtimes/examples/model-artifacts root@<evm-ip>:~/

Then run inference on the EVM:

export SOC=<SOC>
cd ./runtimes/examples/python/basic_example/
python3 basic_example.py --config ./config.yaml -r tvmrt --infer
cd -

Note

Processor SDK 11.2 is missing two TVM runtime dependencies. Install them before running TVM examples on the EVM:

pip3 install psutil typing_extensions

See Installation Instructions for details.

3.4. Debugging Inference

TIDL uses the TIDL_RT_DEBUG environment variable to control debug output during inference.

3.4.1. Vision apps “printf” terminal

Open a terminal on the EVM and run one of the following commands, depending on your SDK, to see debug output from the C7x NPU core:

# if using EdgeAI Start Kit SDK
root@tda4vm-sk:~# /opt/vx_app_arm_remote_log.out

# if using J721E SDK
root@j7-evm:~# cd /opt/vision_apps/
root@j7-evm:/opt/vision_apps# source ./vision_apps_init.sh

Then run inference in a separate terminal.

3.4.2. Debugging TIDL subgraphs

3.4.2.1. TIDL_RT_DEBUG=1

When set to 1, TIDL subgraph performance information is printed during inference, either on the “printf” terminal or on the terminal where inference is running. For example:

[C7x_1 ] 1851913.287814 s:  Layer,   Layer Cycles,kernelOnlyCycles, coreLoopCycles,LayerSetupCycles,dmaPipeupCycles, dmaPipeDownCycles, PrefetchCycles,copyKerCoeffCycles,LayerDeinitCycles,LastBlockCycles, paddingTrigger,    paddingWait,LayerWithoutPad,LayerHandleCopy,   BackupCycles,  RestoreCycles,
[C7x_1 ] 1851913.287889 s:      0,         201247,         171496,         173177,           1021,           9371,                20,              0,                 0,            375,          41487,           6392,             51,         191300,              0,              0,              0,
[C7x_1 ] 1851913.287956 s:      1,          44208,          17603,          18221,           4170,           2679,                18,              0,                 0,            662,          17603,           7720,            454,          33530,           2096,              0,              0,
... ... ...

3.4.2.2. TIDL_RT_DEBUG=2, 3

When set to 2 or 3, detailed TIDL subgraph import and layer execution information is provided in addition to the level 1 output. For example:

[C7x_1 ] 1852081.872134 s: Alg Alloc for Layer # -    0
[C7x_1 ] 1852081.872160 s: Alg Alloc for Layer # -    1
... ... ...
[C7x_1 ] 1852081.873059 s: TIDL Memory requirement
[C7x_1 ] 1852081.873087 s: MemRecNum , Space     , Attribute ,    SizeinBytes
[C7x_1 ] 1852081.873117 s:  0         , DDR       , Persistent,    15208
[C7x_1 ] 1852081.873145 s:  1         , DDR       , Persistent,    136
... ... ...
[C7x_1 ] 1852081.874400 s: Alg Init for Layer # -    2 out of   32
[C7x_1 ] 1852081.874470 s: Alg Init for Layer # -    3 out of   32
... ... ...
[C7x_1 ] 1852081.911120 s: Starting Layer # -    1
[C7x_1 ] 1852081.911145 s: Processing Layer # -    1
[C7x_1 ] 1852081.911375 s: End of Layer # -    1 with outPtrs[0] = 7002001e
[C7x_1 ] 1852081.911400 s: Starting Layer # -    2
[C7x_1 ] 1852081.911422 s: Processing Layer # -    2
[C7x_1 ] 1852081.911493 s: End of Layer # -    2 with outPtrs[0] = 7004550e
... ...

3.4.2.3. TIDL_RT_DEBUG=4, 5

Supported only when TIDL-unsupported layers run on Arm. When set to 4 or 5, the output tensor of each layer in a TIDL subgraph is dumped to the Arm Linux filesystem: tidl_trace_subgraph_<subgraph_id>_<layer_id>_<tensor_shape>.y holds the raw data and tidl_trace_subgraph_<subgraph_id>_<layer_id>_<tensor_shape>_float.bin holds the data converted to float.
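A dumped float trace can be inspected with numpy. This is a minimal sketch assuming the _float.bin files hold raw float32 values (verify the element type for your model); the helper name and example filename are illustrative.

```python
import numpy as np

def load_float_trace(path):
    """Load a tidl_trace_subgraph_*_float.bin dump as a flat float32 array.
    Assumes raw float32 data; the tensor shape is encoded in the filename."""
    return np.fromfile(path, dtype=np.float32)

# e.g. t = load_float_trace("tidl_trace_subgraph_0_1_1x64x112x112_float.bin")
# print(t.min(), t.max(), t.size)
```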

3.4.3. Debugging TVM nodes

The TVM_RT_DEBUG, TVM_RT_TRACE_NODE, and TVM_RT_TRACE_SIZE environment variables control TVM node debugging.

3.4.3.1. TVM_RT_DEBUG=1

When set to 1, TVM runtime on Arm collects performance statistics for each node in the graph and saves them to the Arm Linux filesystem as tvm_arm.trace. Use the dump_tvm_trace.py script to read the file:

# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 633 version: 0x20220728 device: J7 core: Arm
node 1: tidl_8  1520.9 microseconds
node 2: tvmgen_default_fused_multiply  436.81 microseconds
node 3: tidl_7  993.62 microseconds
node 4: tvmgen_default_fused_multiply_1  149.11 microseconds
node 5: tidl_6  1162.075 microseconds
node 6: tvmgen_default_fused_multiply_11  176.68 microseconds
node 7: tidl_5  2233.29 microseconds
node 8: tvmgen_default_fused_multiply_2  120.095 microseconds
node 9: tidl_4  1503.91 microseconds
node 10: tvmgen_default_fused_multiply_3  163.83 microseconds
node 11: tidl_3  1381.615 microseconds
node 12: tvmgen_default_fused_multiply_4  61.94 microseconds
node 13: tidl_2  1195.17 microseconds
node 14: tvmgen_default_fused_multiply_5  70.53 microseconds
node 15: tidl_1  1311.9 microseconds
node 16: tvmgen_default_fused_multiply_51  71.32 microseconds
node 17: tidl_0  1279.285 microseconds
node 4294967295: Graph  13856.93 microseconds

When TIDL-unsupported layers are running on the C7x NPU, tvm_arm.trace shows only a single node representing the whole graph:

# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 69 version: 0x20220728 device: J7 core: Arm
node 1: tidl_tvm_0  9126.275 microseconds
node 4294967295: Graph  9129.92 microseconds

3.4.3.2. TVM_RT_DEBUG=2

When set to 2, the TVM runtime on the C7x NPU also collects per-node performance statistics and saves them to tvm_c7x.trace, in addition to the level 1 output. For example:

# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 631 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8  909.002 microseconds
node 2: tvmgen_default_fused_multiply  62.201 microseconds
node 3: tidl_7  378.24 microseconds
node 4: tvmgen_default_fused_multiply_1  83.055 microseconds
node 5: tidl_6  445.359 microseconds
node 6: tvmgen_default_fused_multiply_1  83.433 microseconds
node 7: tidl_5  1590.525 microseconds
node 8: tvmgen_default_fused_multiply_2  82.979 microseconds
node 9: tidl_4  809.138 microseconds
node 10: tvmgen_default_fused_multiply_3  110.202 microseconds
node 11: tidl_3  802.037 microseconds
node 12: tvmgen_default_fused_multiply_4  46.95 microseconds
node 13: tidl_2  848.143 microseconds
node 14: tvmgen_default_fused_multiply_5  55.662 microseconds
node 15: tidl_1  908.129 microseconds
node 16: tvmgen_default_fused_multiply_5  54.628 microseconds
node 17: tidl_0  990.953 microseconds
node 4294967295: Graph  8283.967 microseconds

tvm_c7x.trace is only available when the model was compiled with c7x_codegen=1.

3.4.3.3. TVM_RT_DEBUG=3

When set to 3, the TVM runtime on Arm and on the C7x NPU also collects output tensor statistics (min, max, sum, and sum of the first half of the elements) for each layer and saves them to the trace files.

Note

Performance numbers collected at level 3 include instrumentation overhead and should not be used for benchmarking.

# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 1833 version: 0x20220728 device: J7 core: Arm
node 1: tidl_8  1527.845 microseconds
  output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
           min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
  output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
           min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
node 2: tvmgen_default_fused_multiply  416.19 microseconds
  output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
           min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
... ... ...
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 1831 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8  934.738 microseconds
  output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
           min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
  output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
           min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
node 2: tvmgen_default_fused_multiply  64.704 microseconds
  output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
           min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
... ... ...
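These statistics can be reproduced from a tensor for cross-checking a trace. This is a sketch under the assumption that fh_sum is the sum over the first half of the flattened elements, matching the "sum of the first half" description above; the helper name is illustrative.

```python
import numpy as np

def tensor_stats(t):
    """Compute the statistics printed at TVM_RT_DEBUG=3:
    min, max, sum, and the sum of the first half of the flattened tensor."""
    flat = np.asarray(t, dtype=np.float64).ravel()
    return {
        "min": float(flat.min()),
        "max": float(flat.max()),
        "sum": float(flat.sum()),
        "fh_sum": float(flat[: flat.size // 2].sum()),
    }

print(tensor_stats([1.0, 2.0, 3.0, 4.0]))
# {'min': 1.0, 'max': 4.0, 'sum': 10.0, 'fh_sum': 3.0}
```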

3.4.3.4. TVM_RT_DEBUG=4

When set to 4, the TVM runtime on Arm and on the C7x NPU prints information about each node as it executes, either on the inference terminal or on the “printf” terminal. This is useful for tracing execution order and identifying hangs.

3.4.3.5. TVM_RT_TRACE_NODE=<node_id> TVM_RT_DEBUG=3, 4

When set, TVM runtime saves the tensor outputs of the specified node to the trace file. The dump_tvm_trace.py script then saves them as numpy files named n<node_id>_o<output_id>.npy. For example:

# TVM_RT_TRACE_NODE=1 TVM_RT_DEBUG=4 python3 ./infer_model.py ...
...
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 227911 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8  912.196 microseconds
  output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
           min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
           tensor values saved in n1_o0.npy
  output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
           min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
           tensor values saved in n1_o1.npy
node 2: tvmgen_default_fused_multiply  65.955 microseconds
  output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
           min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
node 3: tidl_7  388.9 microseconds
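The saved dumps can be compared against a reference run with numpy. A minimal sketch (the helper name, node/output ids, and tolerances are illustrative, not part of the tooling):

```python
import numpy as np

def compare_node_output(node_id, output_id, reference, rtol=1e-5, atol=1e-6):
    """Load n<node_id>_o<output_id>.npy produced by dump_tvm_trace.py
    and compare it element-wise against a reference tensor."""
    dumped = np.load(f"n{node_id}_o{output_id}.npy")
    return np.allclose(dumped, reference, rtol=rtol, atol=atol)

# e.g. ok = compare_node_output(1, 0, golden_output)
```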

3.4.3.6. TVM_RT_TRACE_SIZE=<new_size> TVM_RT_TRACE_NODE=<node_id> TVM_RT_DEBUG=3, 4

If you see a Not enough trace memory for dumping output <node_id> message, use TVM_RT_TRACE_SIZE to increase the trace buffer size. The default is 2 MB (2*1024*1024 bytes).
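For example, to raise the trace buffer to 8 MB before dumping a node's output (the node id and script name here are illustrative):

```shell
# Raise the trace buffer to 8 MB (default is 2*1024*1024 bytes = 2 MB)
export TVM_RT_TRACE_SIZE=$((8*1024*1024))   # 8388608 bytes
export TVM_RT_TRACE_NODE=1
export TVM_RT_DEBUG=4
# python3 ./infer_model.py ...   (then run your inference script as usual)
```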