TI Deep Learning Library User Guide
TVM/Neo-AI-DLR + TIDL Heterogeneous Execution

Introduction

While you can use TIDL and its API directly, the Processor SDK also implements TIDL offload support using the TVM runtime and Neo-AI-DLR runtime. This heterogeneous execution enables:

  1. TVM/Neo-AI-DLR as the top level inference API for user applications
  2. Offloading subgraphs to C7x/MMA for accelerated execution with TIDL
  3. Generating code and running on the ARM core for layers that are not supported by TIDL

Neo-AI-DLR is an open source common runtime for machine learning models compiled by AWS SageMaker Neo, TVM, or Treelite. For the Processor SDK, we focus on models compiled by TVM. For these models, the Neo-AI-DLR runtime can be considered as a wrapper around the TVM runtime.

The following sections describe the details for compiling and deploying machine learning models for TVM/Neo-AI-DLR + TIDL heterogeneous execution.

Building TVM with TIDL offload support

The Processor SDK does not package TVM by default. You will need to build TVM on an x86_64 Linux machine running Ubuntu 18.04. We assume that the TIDL package is already built or available from the Processor SDK installation.

# Starting point: an x86_64 Linux environment running Ubuntu 18.04

# Install pre-requisites
cd ${HOME}
sudo apt install cmake python3-pip libtinfo-dev libedit-dev libxml2-dev graphviz
sudo apt install lib32ncurses5 lib32z1
pip3 install matplotlib decorator pytest antlr4-python3-runtime typed_ast
pip3 install onnx gluoncv mxnet tflite torch torchvision
pip3 uninstall tensorflow
pip3 install tensorflow==1.14

# Set TIDL_PATH
export TIDL_PATH=/path/to/your/TIDL/package

# Set ARM_GCC_PATH/ARM64_GCC_PATH to your installation
# Download 64-bit gcc-arm from: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-a/downloads/9-2-2019-12
wget https://developer.arm.com/-/media/Files/downloads/gnu-a/9.2-2019.12/binrel/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu.tar.xz
tar xf gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu.tar.xz
export ARM_GCC_PATH=${HOME}/gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu/bin
export ARM64_GCC_PATH=${ARM_GCC_PATH}

# Download llvm 10.0
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-10.0.0/clang+llvm-10.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz
tar xf clang+llvm-10.0.0-x86_64-linux-gnu-ubuntu-18.04.tar.xz

# Make a j7 directory
mkdir -p ${HOME}/tvm-j7

# Clone TVM and build with TIDL backends
cd ${HOME}/tvm-j7
git clone --single-branch -b tidl-j7 https://github.com/TexasInstruments/tvm.git
cd tvm
# Check out tag in the format of REL.TIDL.J7.XX.YY.ZZ.WW, please use
# the corresponding TIDL package version in Processor SDK for XX.YY.ZZ.WW
# e.g. git checkout REL.TIDL.J7.01.03.00.11 for Processor SDK 7.1
git checkout REL.TIDL.J7.<TIDL_package_version>
git submodule init
git submodule update --init --recursive
mkdir build; cd build
cmake -DUSE_SORT=ON -DUSE_LLVM=${HOME}/clang+llvm-10.0.0-x86_64-linux-gnu-ubuntu-18.04/bin/llvm-config -DUSE_TIDL=ON -DUSE_TIDL_RT_PATH=${TIDL_PATH}/ti_dl/rt ..
make -j12
export TVM_HOME=${HOME}/tvm-j7/tvm
export PYTHONPATH=${TVM_HOME}/python:${TVM_HOME}/topi/python:${TVM_HOME}/nnvm/python:${PYTHONPATH}
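
After the build completes and PYTHONPATH is set, a quick sanity check (not part of the SDK build steps, just a convenience) is to import TVM from Python and confirm that the TIDL codegen was compiled in. The "relay.ext.tidl" global function name is an assumption based on TVM's naming convention for external codegens:

import tvm

print("TVM version:", tvm.__version__)
# External codegens register a global function named "relay.ext.<name>";
# a None result here means the TIDL backend was not built in ("relay.ext.tidl" is assumed).
print("TIDL codegen registered:",
      tvm.get_global_func("relay.ext.tidl", allow_missing=True) is not None)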

Compilation with TVM compiler

We provide an example compilation test script called "test_tidl_j7.py" to illustrate TVM compilation with TIDL offload.

Running TVM+TIDL compilation tests

# Set TIDL tools directory, copy or link files from your TIDL package
cd ${HOME}/tvm-j7
mkdir -p tidl_tools; cd tidl_tools
ln -s $TIDL_PATH/ti_dl/test/testvecs/config/import/device_config.cfg .
ln -s $TIDL_PATH/ti_dl/test/PC_dsp_test_dl_algo.out .
ln -s $TIDL_PATH/ti_dl/utils/perfsim/ti_cnnperfsim.out .
ln -s $TIDL_PATH/ti_dl/utils/tidlModelImport/out/tidl_model_import_relay.so .
export TIDL_TOOLS_PATH=${HOME}/tvm-j7/tidl_tools

# Run compilation tests. With the --target option, TVM compiles for ARM,
# for running inference on the EVM. With the --host option, TVM compiles for x86,
# for running inference in host emulation mode.
# Run with the "-h" option for help messages. "--target" is the default.
cd ${HOME}/tvm-j7/tvm/tests/python/relay/ti_tests
python3 ./test_tidl_j7.py -h
python3 ./test_tidl_j7.py --target
python3 ./test_tidl_j7.py --host

TIDL-specific lines in the compilation script

Only four statements in "test_tidl_j7.py" are specific to TIDL offload. The rest of the script is no different from a regular TVM compilation script without TIDL offload.

tidl_compiler = tidl.TIDLCompiler(tidl_platform, tidl_version,
                                  num_tidl_subgraphs=num_tidl_subgraphs,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=get_tidl_tools_path(),
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations':10},
                                  tidl_denylist=args.denylist)

We first instantiate a TIDLCompiler object. The parameters are explained in the following table.

Name/Position              Value
tidl_platform              "J7"
tidl_version               (7, 1)
num_tidl_subgraphs         offload up to the given number of TIDL subgraphs
artifacts_folder           where to store the deployable module
tidl_tools_path            set to the environment variable TIDL_TOOLS_PATH
tidl_tensor_bits           8 or 16; bit width used when importing TIDL tensors and weights
tidl_calibration_options   optional, a dictionary to override the default calibration options
tidl_denylist              optional, deny a TVM Relay op for TIDL offloading

Advanced calibration can help improve 8-bit quantization. Please see TIDL Quantization for details. The default calibration options are specified in the TVM source file python/tvm/relay/backend/contrib/tidl.py; grep for "default_calib_options".
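
As an illustration only (not taken from the compilation script), a TIDLCompiler instantiation for a 16-bit import with more calibration iterations and a denied Relay op might look like the sketch below; the literal values, the denied op name, and the artifacts folder are assumptions, and the import path follows the source file location noted above:

import os
from tvm.relay.backend.contrib import tidl   # module path per the source file noted above

tidl_compiler = tidl.TIDLCompiler("J7", (7, 1),
                                  num_tidl_subgraphs=1,
                                  artifacts_folder="artifacts_example",          # assumed folder name
                                  tidl_tools_path=os.environ["TIDL_TOOLS_PATH"],
                                  tidl_tensor_bits=16,                           # 16-bit import
                                  tidl_calibration_options={'iterations': 50},   # more calibration iterations
                                  tidl_denylist=['nn.softmax'])                  # assumed Relay op name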

mod, status = tidl_compiler.enable(mod_orig, params, model_input_list)

In this step, the original machine learning model/network represented in TVM Relay IR, "mod_orig", goes through the following transformations:

  1. Allowlisting: each Relay Op is examined to see if it can be offloaded to TIDL
  2. Partitioning: Ops that TIDL supports are partitioned into TIDL subgraphs
  3. TIDL importing: TIDL subgraphs are imported from Relay IR into TIDL format
  4. TIDL postprocessing: sample inputs in "model_input_list" are used to calibrate quantization in the TIDL subgraphs.

with tidl.build_config(tidl_compiler=tidl_compiler):
    graph, lib, params = relay.build_module.build(mod, target=target, params=params)

In this step, TVM code generation takes place. Inside the TVM codegen, there is a TIDL codegen backend. "tidl.build_config" creates a context and tells the TIDL codegen backend where the artifacts from TIDL importing are. The backend then embeds the artifacts into the "lib".

tidl.remove_tidl_params(params)

This optional step removes the weights in TIDL subgraphs that have already been imported into the artifacts. Removing them results in a smaller deployable module.
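
For orientation, the "graph", "lib", and "params" returned by relay.build_module.build correspond to the three files of the deployable module described in the next section. A minimal, hedged sketch of saving them is shown below; the authoritative save code is in "test_tidl_j7.py", and the cross-compiler usage and folder name here are assumptions:

import os
from tvm import relay

artifacts_folder = "artifacts_MobileNetV1_target"   # example folder name used in this guide
with open(os.path.join(artifacts_folder, "deploy_graph.json"), "w") as f:
    f.write(graph)                                  # graph JSON from relay.build_module.build
# For the ARM target, the shared library must be built with the aarch64 cross compiler;
# passing it via the cc= keyword of export_library is an assumption based on common TVM usage.
lib.export_library(os.path.join(artifacts_folder, "deploy_lib.so"),
                   cc=os.path.join(os.environ["ARM64_GCC_PATH"], "aarch64-none-linux-gnu-g++"))
with open(os.path.join(artifacts_folder, "deploy_param.params"), "wb") as f:
    f.write(relay.save_param_dict(params))          # serialized constant weights/parameters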

Deployable module

The result of compilation is called a "deployable module". It consists of three files:

  1. deploy_graph.json: graph description of the compiled network for execution. In the graph, TIDL subgraphs are nodes with names "tidl_0", "tidl_1", etc.
  2. deploy_lib.so: executable code that runs the nodes in the graph. Code for offloading TIDL subgraphs and imported TIDL artifacts is also embedded in this file.
  3. deploy_param.params: constant weights/parameters for nodes in the graph.

Taking the output of "test_tidl_j7.py" for the TensorFlow MobileNetV1 model as an example, the deployable module for the J7 target is located in "artifacts_MobileNetV1_target/". You can copy this deployable module to the target EVM for execution. Please see the "Inference" sections below for details.

artifacts_MobileNetV1_target
|-- deploy_graph.json
|-- deploy_lib.so
|-- deploy_param.params

Other compilation artifacts

All other compilation artifacts are stored in the "tempDir" directory under the specified "artifacts_folder". Interested users can look into this directory for TIDL importing details. This directory is for information only, and is not needed for inference/deployment.

One useful file is "relay.gv.svg". It gives a graphical view of the whole network and where the TIDL subgraphs are. You can view it using a browser or other viewer, for example:

firefox artifacts_MobileNetV1_target/tempDir/relay.gv.svg

Debugging

You can set the environment variable TIDL_RELAY_IMPORT_DEBUG to 0, 1, 2, 3, or 4 to get detailed internal debug information and progress during TVM compilation. For example, the compiler will dump the graph represented in TVM Relay IR, details of the Relay IR to TIDL import, and so on.

Comparing TIDL per layer output with TVM per layer output

When TIDL_RELAY_IMPORT_DEBUG is set to 4, the TIDL import will generate the output of each TIDL layer in the imported TIDL subgraph, using the calibration inputs. The compilation will also generate the corresponding output from running the original model in floating-point mode, by compiling and running it on the host with TVM. We call the tensors from the quantized TIDL calibration run "tidl_tensor" and the corresponding tensors from the TVM floating-point run "tvm_tensor". A simple script, "compare_tensors.py", is provided to compare the two.

TIDL_RELAY_IMPORT_DEBUG=4 python3 ./test_tidl_j7.py --target
# python3 ./compare_tensors.py <artifacts_folder> <subgraph_id> <layer_id>
python3 ./compare_tensors.py artifacts_MobileNetV1_target 0 1
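
For reference, the per-layer comparison reported by "compare_tensors.py" amounts to something like the sketch below, assuming the TIDL and TVM dumps for a layer have already been loaded as NumPy arrays (the script itself handles locating and loading the dump files):

import numpy as np

def compare(tidl_tensor, tvm_tensor):
    # Compare a quantized TIDL layer output against the TVM floating-point reference
    diff = np.abs(tidl_tensor.astype(np.float32) - tvm_tensor.astype(np.float32))
    print("max abs diff :", float(diff.max()))
    print("mean abs diff:", float(diff.mean()))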

Inference with Neo-AI-DLR runtime

The Neo-AI-DLR runtime is pre-built and packaged in the target filesystem.

We provide an example inference script, "test_tidl_j7_deploy.py" to illustrate running a TVM-compiled deployable module with the Neo-AI-DLR runtime. Note that this script does not have anything TIDL specific: with or without TIDL offload, this script remains the same. Run the script with "-h" option for help messages. "--target --dlr --cv" are the default options.

# on target J7 EVM
cd /PATH/TO/tvm/tests/python/relay/ti_tests
python3 ./test_tidl_j7_deploy.py -h
python3 ./test_tidl_j7_deploy.py --target --dlr --cv <input_image>
python3 ./test_tidl_j7_deploy.py <input_image>

Please refer to the example inference script for details.

from dlr import DLRModel

module = DLRModel(artifacts_dir)                     # load the deployable module
results = module.run({input_tensor : input_data})    # inputs: {input name: numpy array}
tvm_output = results[0]                              # outputs: a list of numpy arrays

If the model was compiled using the TIDLCompiler augmentation as shown above, TIDL-supported subgraphs will be offloaded and accelerated.
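
For orientation, a self-contained sketch of such an inference flow is shown below; the artifacts folder, input tensor name, input shape, and random input data are placeholder assumptions, and "test_tidl_j7_deploy.py" remains the authoritative example:

import numpy as np
from dlr import DLRModel

artifacts_dir = "artifacts_MobileNetV1_target"    # deployable module copied to the EVM
input_tensor = "input"                            # assumed input name; use your model's actual input
input_data = np.random.uniform(-1.0, 1.0, (1, 224, 224, 3)).astype("float32")   # placeholder input

module = DLRModel(artifacts_dir)                  # TIDL subgraphs are offloaded automatically
results = module.run({input_tensor: input_data})
print("top-1 class index:", int(np.argmax(results[0])))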

Should you want debug information during inference, you can set the environment variable TIDL_RT_DEBUG to 1, 2, 3, or 4 to see more internal details of the inference progress.

TIDL_RT_DEBUG=1 python3 ./test_tidl_j7_deploy.py <input_image>

Inference with TVM runtime

If you choose to run the deployable module with the TVM runtime (rather than Neo-AI-DLR) on the target EVM, you will need to build TVM on the target. The steps are similar to building TVM on an x86_64 host, except that the target toolchain is for 64-bit ARM.

Then, run the inference script with the --tvm option:

# on target J7 EVM
cd /PATH/TO/tvm/tests/python/relay/ti_tests
python3 ./test_tidl_j7_deploy.py --tvm

Please refer to the example inference script for details.

# Imports assumed for this snippet (see the example inference script for the exact imports used)
import tvm
from tvm.contrib import graph_runtime as runtime

loaded_json = open(artifacts_dir + "deploy_graph.json").read()
loaded_lib = tvm.runtime.load_module(artifacts_dir + "deploy_lib.so")
loaded_params = bytearray(open(artifacts_dir + "deploy_param.params", "rb").read())
# create a runtime executor module
module = runtime.create(loaded_json, loaded_lib, tvm.cpu())
# load params into the module
module.load_params(loaded_params)
# feed input data
module.set_input(input_tensor, tvm.nd.array(input_data))
# run
module.run()
# get output
tvm_output = module.get_output(0).asnumpy()

Known Issues/Limitations

  • There is currently a performance gap between the Neo-AI-DLR runtime and the standalone TIDL runtime, even when the entire network is fully offloaded to TIDL. This gap is expected to narrow in future releases.
  • mxnet_mobilenetv3_large: squeeze-and-excitation (SE) blocks are currently not supported in TIDL, which results in the creation of many subgraphs. The resulting graph-switching overhead is quite high and ultimately nullifies the acceleration benefit of offloading subgraphs to C7x/MMA. The same can apply to other networks that are partitioned into a large number of subgraphs.