3. Inference Explained¶
This section explains more details about running and debugging TVM inference. An introduction to this topic is provided in Running Inference.
There are two inference scenarios:
Running TIDL unsupported layers on Arm
Running TIDL unsupported layers on C7x
These
correspond to compiling the model with c7x_codegen=0
or c7x_codegen=1
. The same
inference script or application can be used for both scenarios.
3.1. Running Unsupported Layers on Arm¶
When running unsupported layers on Arm, TI TVM graph runtime on Arm looks at each node in the graph (.json).
If it is a TIDL subgraph, it is dispatched to C7x for execution (via OpenVX).
If it is a non-TIDL node, it is executed with TVM-generated Arm code.
3.2. Running Unsupported Layers on C7x¶
When running unsupported layers on C7x, TI TVM graph runtime on Arm looks at the single node in the wrapper graph (.json), ‘tidl_tvm_0’, and dispatches the whole graph to C7x (via OpenVX).
TI TVM graph runtime on C7x looks at each node in the graph.
If it is a TIDL subgraph, it is executed with the TIDL library.
If it is a non-TIDL node, it is executed with TVM-generated C7x code.
3.3. Debugging Inference¶
TIDL uses the TIDL_RT_DEBUG environment variable to help debug the TVM+TIDL inference flow.
3.3.1. Vision apps “printf” terminal¶
First, you need to open a “printf” terminal, where debug “printf” output from C7x core is shown.
Open a terminal on your EVM, and run the following commands.
# if using EdgeAI Start Kit SDK
root@tda4vm-sk:~# /opt/vx_app_arm_remote_log.out
# if using J721E SDK
root@j7-evm:~# cd /opt/vision_apps/
root@j7-evm:/opt/vision_apps# source ./vision_apps_init.sh
Then, run the inference in a different terminal.
3.3.2. Debugging TIDL subgraphs¶
The available settings for the TIDL_RT_DEBUG environment variable are as follows.
3.3.2.1. TIDL_RT_DEBUG=1¶
When set to 1, TIDL subgraph performance information is printed out during inference, either on the “printf” terminal or on the terminal where inference is running. For example:
[C7x_1 ] 1851913.287814 s: Layer, Layer Cycles,kernelOnlyCycles, coreLoopCycles,LayerSetupCycles,dmaPipeupCycles, dmaPipeDownCycles, PrefetchCycles,copyKerCoeffCycles,LayerDeinitCycles,LastBlockCycles, paddingTrigger, paddingWait,LayerWithoutPad,LayerHandleCopy, BackupCycles, RestoreCycles,
[C7x_1 ] 1851913.287889 s: 0, 201247, 171496, 173177, 1021, 9371, 20, 0, 0, 375, 41487, 6392, 51, 191300, 0, 0, 0,
[C7x_1 ] 1851913.287956 s: 1, 44208, 17603, 18221, 4170, 2679, 18, 0, 0, 662, 17603, 7720, 454, 33530, 2096, 0, 0,
... ... ...
3.3.2.2. TIDL_RT_DEBUG=2, 3¶
When set to 2 or 3, more verbose information about TIDL subgraphs is provided in addition to the information provided when the setting is 1. For example:
[C7x_1 ] 1852081.872134 s: Alg Alloc for Layer # - 0
[C7x_1 ] 1852081.872160 s: Alg Alloc for Layer # - 1
... ... ...
[C7x_1 ] 1852081.873059 s: TIDL Memory requirement
[C7x_1 ] 1852081.873087 s: MemRecNum , Space , Attribute , SizeinBytes
[C7x_1 ] 1852081.873117 s: 0 , DDR , Persistent, 15208
[C7x_1 ] 1852081.873145 s: 1 , DDR , Persistent, 136
... ... ...
[C7x_1 ] 1852081.874400 s: Alg Init for Layer # - 2 out of 32
[C7x_1 ] 1852081.874470 s: Alg Init for Layer # - 3 out of 32
... ... ...
[C7x_1 ] 1852081.911120 s: Starting Layer # - 1
[C7x_1 ] 1852081.911145 s: Processing Layer # - 1
[C7x_1 ] 1852081.911375 s: End of Layer # - 1 with outPtrs[0] = 7002001e
[C7x_1 ] 1852081.911400 s: Starting Layer # - 2
[C7x_1 ] 1852081.911422 s: Processing Layer # - 2
[C7x_1 ] 1852081.911493 s: End of Layer # - 2 with outPtrs[0] = 7004550e
... ...
3.3.2.3. TIDL_RT_DEBUG=4, 5¶
Setting this environment variable to 4 or 5 is supported only when running TIDL unsupported layers on Arm. When set to 4 or 5, tensor output
from each layer in the TIDL subgraph is dumped into the Arm Linux file system, with filenames
tidl_trace_subgraph_<subgraph_id>_<layer_id>_<tensor_shape>.y
for raw data and
tidl_trace_subgraph_<subgraph_id>_<layer_id>_<tensor_shape>_float.bin
for converted float data.
3.3.3. Debugging TVM nodes¶
The TVM_RT_DEBUG, TVM_RT_TRACE_NODE, TVM_RT_TRACE_SIZE, and TVM_TRACE_NODE environment variables can be used as follows to aid in debugging TVM nodes.
3.3.3.1. TVM_RT_DEBUG=1¶
When set to 1, TVM runtime on Arm collects performance statistics for each node
in the graph and saves into the Arm Linux file system, with the filename tvm_arm.trace
. You can use the
python/tvm/contrib/tidl/dump_tvm_trace.py
script to dump the details.
When TIDL unsupported layers are run on Arm, the tvm_arm.trace
file includes the execution time for
TIDL nodes and layers running on Arm. For example:
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 633 version: 0x20220728 device: J7 core: Arm
node 1: tidl_8 1520.9 microseconds
node 2: tvmgen_default_fused_multiply 436.81 microseconds
node 3: tidl_7 993.62 microseconds
node 4: tvmgen_default_fused_multiply_1 149.11 microseconds
node 5: tidl_6 1162.075 microseconds
node 6: tvmgen_default_fused_multiply_11 176.68 microseconds
node 7: tidl_5 2233.29 microseconds
node 8: tvmgen_default_fused_multiply_2 120.095 microseconds
node 9: tidl_4 1503.91 microseconds
node 10: tvmgen_default_fused_multiply_3 163.83 microseconds
node 11: tidl_3 1381.615 microseconds
node 12: tvmgen_default_fused_multiply_4 61.94 microseconds
node 13: tidl_2 1195.17 microseconds
node 14: tvmgen_default_fused_multiply_5 70.53 microseconds
node 15: tidl_1 1311.9 microseconds
node 16: tvmgen_default_fused_multiply_51 71.32 microseconds
node 17: tidl_0 1279.285 microseconds
node 4294967295: Graph 13856.93 microseconds
When TIDL unsupported layers are running on C7x, tvm_arm.trace
includes only the execution
time for a single node representing the whole graph. For example:
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 69 version: 0x20220728 device: J7 core: Arm
node 1: tidl_tvm_0 9126.275 microseconds
node 4294967295: Graph 9129.92 microseconds
3.3.3.2. TVM_RT_DEBUG=2¶
When set to 2, in addition to the information provided when the TVM_RT_DEBUG setting is 1, TVM runtime on C7x also collects performance
statistics for each node in the graph on the C7x and saves them to the Arm Linux file system,
with the filename tvm_c7x.trace
. You can use the python/tvm/contrib/tidl/dump_tvm_trace.py
script
to dump the details. For example:
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 631 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8 909.002 microseconds
node 2: tvmgen_default_fused_multiply 62.201 microseconds
node 3: tidl_7 378.24 microseconds
node 4: tvmgen_default_fused_multiply_1 83.055 microseconds
node 5: tidl_6 445.359 microseconds
node 6: tvmgen_default_fused_multiply_1 83.433 microseconds
node 7: tidl_5 1590.525 microseconds
node 8: tvmgen_default_fused_multiply_2 82.979 microseconds
node 9: tidl_4 809.138 microseconds
node 10: tvmgen_default_fused_multiply_3 110.202 microseconds
node 11: tidl_3 802.037 microseconds
node 12: tvmgen_default_fused_multiply_4 46.95 microseconds
node 13: tidl_2 848.143 microseconds
node 14: tvmgen_default_fused_multiply_5 55.662 microseconds
node 15: tidl_1 908.129 microseconds
node 16: tvmgen_default_fused_multiply_5 54.628 microseconds
node 17: tidl_0 990.953 microseconds
node 4294967295: Graph 8283.967 microseconds
The tvm_c7x.trace
output is available only if the model was compiled with c7x_codegen=1
.
3.3.3.3. TVM_RT_DEBUG=3¶
When set to 3, in addition to the information provided when the setting is 1 or 2, TVM runtime on Arm and C7x collects output tensor statistics for each layer and saves it to the trace files. Collected statistics include minimum, maximum, sum, and the sum of the first half of the tensor.
Please ignore performance numbers obtained in this mode; there is overhead when collecting tensor statistics.
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_arm.trace
Trace size: 1833 version: 0x20220728 device: J7 core: Arm
node 1: tidl_8 1527.845 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
node 2: tvmgen_default_fused_multiply 416.19 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
... ... ...
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 1831 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8 934.738 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
node 2: tvmgen_default_fused_multiply 64.704 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
... ... ...
3.3.3.4. TVM_RT_DEBUG=4¶
When set to 4, TVM runtime on Arm and C7x prints out information about each node when each node is being executed, either on the terminal where inference is run or the “printf” terminal. This mode can be helpful for debugging the model’s execution.
3.3.3.5. TVM_RT_TRACE_NODE=<node_id> TVM_RT_DEBUG=3, 4¶
When these settings are used, TVM runtime on Arm and C7x also saves the tensor outputs of the specified node into
the trace file. The dump_trace.py
script saves the tensor outputs into the Arm Linux file system
as numpy files, with the filename n<node_id>_o<output_id>.npy
. For example, see the tensor values saved in
messages in the following example.
Tensor outputs are saved as float values in the trace.
# TVM_RT_TRACE_NODE=1 TVM_RT_DEBUG=4 python3 ./infer_model.py ... ...
... ...
# python3 $TVM_HOME/python/tvm/contrib/tidl/dump_tvm_trace.py tvm_c7x.trace
Trace size: 227911 version: 0x20220728 device: J7 core: C7x
node 1: tidl_8 912.196 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.715360641479492 sum=234286.71875 fh_sum=110520.9296875
tensor values saved in n1_o0.npy
output 1: ndim=4 type_code=2 elem_bytes=4 num_elements=72
min=0.73150634765625 max=0.999969482421875 sum=68.35482788085938 fh_sum=34.87158203125
tensor values saved in n1_o1.npy
node 2: tvmgen_default_fused_multiply 65.955 microseconds
output 0: ndim=4 type_code=2 elem_bytes=4 num_elements=56448
min=0.0 max=21.439014434814453 sum=226814.515625 fh_sum=106811.2890625
node 3: tidl_7 388.9 microseconds
3.3.3.6. TVM_RT_TRACE_SIZE=<new_size> TVM_TRACE_NODE=<node_id> TVM_RT_DEBUG=3, 4¶
If you see a Not enough trace memory for dumping output <node_id>
message in the verbose
output for a node, you can use TVM_RT_TRACE_SIZE
to increase the size of the trace buffer that stores
the trace. The default is 2*1024*1024 (2MB) bytes.