TI Deep Learning Library User Guide
TVM/Neo-AI-DLR + TIDL Heterogeneous Execution

Introduction

The Processor SDK implements TIDL offload support using the TVM runtime and Neo-AI-DLR runtime. This heterogeneous execution enables:

  1. TVM/Neo-AI-DLR as the top level inference API for user applications
  2. Offloading subgraphs to C7x/MMA for accelerated execution with TIDL
  3. Generating code and running on the ARM core for layers that are not supported by TIDL

Neo-AI-DLR is an open source common runtime for machine learning models compiled by AWS SageMaker Neo, TVM, or Treelite. For the Processor SDK, we focus on models compiled by TVM. For these models, the Neo-AI-DLR runtime can be considered as a wrapper around the TVM runtime.

The following sections describe the details for compiling and deploying machine learning models for TVM/Neo-AI-DLR + TIDL heterogeneous execution.

TVM/NEO-AI-DLR based user work flow

The picture below shows the TVM/NEO-AI-DLR based work flow. The user runs the model compilation (subgraph creation and quantization) on a PC, and the generated artifacts can then be used for running inference on the device.

tvmrt_work_flow.png
TVM/NEO-AI-DLR based user work flow

Model Compilation on PC

osrt_compile_steps.png
OSRT Compile Steps

The Processor SDK package includes all the required python packages for runtime support.

Pre-requisite : PSDK RA should be installed on the host Ubuntu 18.04 machine, and the user should be able to run the pre-built demos on the EVM.

The following steps need to be followed. (Note - All the scripts below are to be run from the ${PSDKRA_PATH}/tidl_xx_xx_xx_xx/ti_dl/test/tvm-dlr/ folder.)

  1. Prepare the Environment for the Model compilation
    source prepare_model_compliation_env.sh
    This script needs to be executed only once each time the user opens a new terminal. It performs the operations below. The user can also perform these steps manually by following the script.
    • Downloads and installs all the dependent Python packages such as TVM, DLR, NumPy and the Python Imaging Library (Pillow). If the user has conflicting package versions because of other installations, we recommend creating a conda environment with Python 3.6.x and running these scripts there.
    • Downloads the models used by the OOB scripts if they are not available in the file system
    • Sets the environment variables required by the scripts, e.g. the path to tools and shared libraries
    • Checks that all the tools required by TIDL are available in the tools path

      Note
      This script downloads the TVM and DLR Python packages from the latest PSDK release on ti.com. If you are using a different version of the SDK (for example an RC version), please update the links for these two Python whl files in download_models.py, e.g. 'dlr-1.4.0-py3-none-any.whl' : {'url':'http://swubn03.india.englab.ti.com/webgen/publish/PROCESSOR-SDK-LINUX-J721E/07_02_00_05/exports//dlr-1.4.0-py3-none-any.whl', 'dir':'./'}. If you observe any issue with pip, run the command below to update it:

      python -m pip install --upgrade pip
  2. Run the model compilation – This step generates the artifacts needed for inference in the artifacts folder. Each subgraph is identified in the artifacts using the tensor index of its output in the model.
    #Model Compilation for EVM
    python3 tvm-compilation-tflite-example.py
    python3 tvm-compilation-onnx-example.py
    #Model Compilation for PC
    python3 tvm-compilation-tflite-example.py --pc-inference
    python3 tvm-compilation-onnx-example.py --pc-inference
  3. Run inference on PC - Optionally, the user can test the inference in PC emulation mode and check the output in the console (a minimal sketch of this flow is shown after this list).
    python3 dlr-inference-example.py
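
For reference, the PC emulation inference in step 3 boils down to a few Neo-AI-DLR API calls. The sketch below only illustrates that flow; the artifacts folder name, input tensor name, and input shape are assumptions, and dlr-inference-example.py remains the reference script.

import numpy as np
from dlr import DLRModel

# Assumed name of the artifacts folder produced by the PC compilation
# ("--pc-inference"); adjust it to match your compiled model.
artifacts_folder = './artifacts_MobileNetV1_pc'

model = DLRModel(artifacts_folder, dev_type='cpu')

# Dummy input: one 224x224 RGB image; the name 'input' and the NCHW layout are assumptions.
data = np.random.uniform(0, 1, (1, 3, 224, 224)).astype('float32')
outputs = model.run({'input': data})

print('top-5 class indices:', np.argsort(outputs[0].squeeze())[-5:][::-1])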

Run Model on EVM

osrt_run_steps.png
OSRT Run Steps
  1. Copy the “${PSDKR_PATH}/tidl_xx_xx_xx_xx/ti_dl/test/” folder to the file system where the EVM is running Linux (SD card or NFS mount). This folder has all the OOB scripts and artifacts.
  2. Run “LD_LIBRARY_PATH=/usr/lib python3 dlr-inference-example.py” to run the inference on the EVM and check the results, performance, etc.

Note : These scripts are only for basic functionality testing and performance checks. For accuracy benchmarking, we will be releasing more tutorials in upcoming releases.

TIDL specific lines in compilation script

There are only 4 lines that are specific to TIDL offload in "test_tidl_j7.py". The rest of the script is no different from a regular TVM compilation script without TIDL offload.

tidl_compiler = tidl.TIDLCompiler(tidl_platform, tidl_version,
                                  num_tidl_subgraphs=num_tidl_subgraphs,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=get_tidl_tools_path(),
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations':10},
                                  tidl_denylist=args.denylist)

We first instantiate a TIDLCompiler object. The parameters are explained in the following table.

Name/Position              Value
tidl_platform              "J7"
tidl_version               (7,1)
num_tidl_subgraphs         offload up to <num> TIDL subgraphs
artifacts_folder           where to store the deployable module
tidl_tools_path            set to the environment variable TIDL_TOOLS_PATH
tidl_tensor_bits           8 or 16: bit width used when importing TIDL tensors and weights
tidl_calibration_options   optional, a dictionary to override the default calibration options
tidl_denylist              optional, deny a TVM Relay op for TIDL offloading

Advanced calibration can help improve 8-bit quantization. Please see TIDL Quantization for details. The default calibration options are specified in the TVM source file python/tvm/relay/backend/contrib/tidl.py; please grep for "default_calib_options".
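
For illustration, the sketch below shows how these optional parameters could be overridden, reusing the variables from the snippet above. The 'iterations' key is the one used in the example above; any additional calibration keys must match those listed under "default_calib_options", and the denied op name is only a hypothetical example.

# Sketch: more calibration iterations and an explicit denylist entry.
# Reuses num_tidl_subgraphs, tidl_artifacts_folder and get_tidl_tools_path()
# from the snippet above. tidl_tensor_bits=16 could be used instead of 8
# for a higher-precision import.
tidl_compiler = tidl.TIDLCompiler('J7', (7, 1),
                                  num_tidl_subgraphs=num_tidl_subgraphs,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=get_tidl_tools_path(),
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations': 50},
                                  # hypothetical example: keep this Relay op on the ARM core
                                  tidl_denylist=['nn.batch_flatten'])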

mod, status = tidl_compiler.enable(mod_orig, params, model_input_list)

In this step, the original machine learning model/network represented in TVM Relay IR, "mod_orig", goes through the following transformations:

  1. Allowlisting: each Relay Op is examined to see if it can be offloaded to TIDL
  2. Partitioning: Ops that TIDL supports are partitioned into TIDL subgraphs
  3. TIDL importing: TIDL subgraphs are imported from Relay IR into TIDL format
  4. TIDL postprocessing: sample inputs in "model_input_list" are used to calibrate quantization in the TIDL subgraphs.
with tidl.build_config(tidl_compiler=tidl_compiler):
    graph, lib, params = relay.build_module.build(mod, target=target, params=params)

In this step, TVM code generation takes place. Inside the TVM codegen, there is a TIDL codegen backend. "tidl.build_config" creates a context and tells the TIDL codegen backend where the artifacts from TIDL importing are. The backend then embeds the artifacts into the "lib".

tidl.remove_tidl_params(params)

This optional step removes the weights in TIDL subgraphs that have already been imported into the artifacts. Removing them results in a smaller deployable module.
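
Putting these pieces together, a complete compilation script might look like the sketch below. It is only illustrative: the model file, input name and shape, target string, cross-compiler name, and the exact structure of model_input_list are assumptions; test_tidl_j7.py in the SDK remains the authoritative reference.

import os
import onnx
import numpy as np
from tvm import relay
from tvm.relay.backend.contrib import tidl

# Load a model into TVM Relay IR (ONNX is just one example front end;
# the file name, input name and shape are assumptions).
onnx_model = onnx.load('mobilenet_v1.onnx')
input_name, input_shape = 'data', (1, 3, 224, 224)
mod_orig, params = relay.frontend.from_onnx(onnx_model, shape={input_name: input_shape})

# Calibration inputs for TIDL quantization. Random data is used here for
# illustration only; real calibration should use representative images, and
# the exact list/dict structure should follow test_tidl_j7.py.
model_input_list = [{input_name: np.random.uniform(0, 1, input_shape).astype('float32')}]

tidl_artifacts_folder = './artifacts'
tidl_compiler = tidl.TIDLCompiler('J7', (7, 1),
                                  num_tidl_subgraphs=1,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=os.environ['TIDL_TOOLS_PATH'],
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations': 10})

# Partition the network, import the TIDL subgraphs and calibrate them.
mod, status = tidl_compiler.enable(mod_orig, params, model_input_list)

# TVM code generation for the A72; the TIDL artifacts are embedded in the library.
target = 'llvm -device=arm_cpu -mtriple=aarch64-linux-gnu'   # assumed EVM target string
with tidl.build_config(tidl_compiler=tidl_compiler):
    graph, lib, params = relay.build_module.build(mod, target=target, params=params)
tidl.remove_tidl_params(params)

# Save the three files of the deployable module. The 'cc' argument selects a
# cross compiler for the A72; the toolchain name is an assumption.
lib.export_library(tidl_artifacts_folder + '/deploy_lib.so', cc='aarch64-none-linux-gnu-gcc')
with open(tidl_artifacts_folder + '/deploy_graph.json', 'w') as f:
    f.write(graph)
with open(tidl_artifacts_folder + '/deploy_param.params', 'wb') as f:
    f.write(relay.save_param_dict(params))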

C API based demo

NEO-AI-DLR on the EVM/target supports both the Python API and the C API. This section describes the usage of the C API.

TVM/NEO-AI-DLR + TIDL Heterogeneous Execution + Display demo

Introduction

This demo uses previously created TVM artifacts from NN models. For details on how to compile NN models in TVM, using TI's J7 platform as a target, please refer to "Compilation with TVM compiler" in the "TVM/Neo-AI-DLR + TIDL Heterogeneous Execution" section.

  • Required HW for running this demo:
    • J721e EVM
    • Display monitor, and display port cable
    • SD card. At least 16GB SD card recommended
  • This demo assumes:
    • SD card created with PSDKLA + PSDKRA image
    • A set of bmp images.
    • Images are in SD card "rootfs" location: /home/root/test_data/app_tidl.
    • A names.txt file (at the same location) containing the list of images to use.

Supported platforms

Platform   Linux x86_64   Linux+RTOS mode   QNX+RTOS mode   SoC
Support    NO             YES               NO              J721e

How it works

This demo showcases parallel execution of heterogeneous TVM/NEO-AI-DLR inference with a display, using pthreads and OpenVX graphs.

  • Thread1:
    • Runs on A72, and dispatches to C7x TIDL subgraphs
      • Creates and runs one OpenVX graph with one TIDL node
    • On A72, this thread does image pre-processing and calls the DLR APIs for inference.
  • Thread2:
    • Runs on A72, and dispatches to R5F-0 DSS
      • Creates and runs one OpenVX graph with one Display node
    • On A72, this thread passes the input image and classification output (top_n) information to the display.

Procedure to build DLR classification demo

  1. Modify makefile paths
    • Export PSDK_INSTALL_PATH (or change path in makerules/config.mk)
  2. Additional TIOVX and vision_apps libraries
    • If you haven't already done so, you'll need to build some libraries inside vision_apps. For example: app_utils_draw2d. An easy way of taking care of this is by building "vx_app_tidl".
      • ${PSDK_INSTALL_PATH}/vision_apps$ make vx_app_tidl
  3. cd to ${PSDK_INSTALL_PATH}/tidl_xx_xx_xx_xx
  4. Do "make TARGET_BUILD=${PROFILE} demos".

How to run DLR classification demo on target

  1. Connect the display monitor to the J7's EVM Display0 port.
  2. Make a folder in the SD card's "rootfs" directory. Example: mkdir /home/root/dlr_demo
  3. Copy the demo binary tidl_dlr_classification.out from "c7x-mma-tidl/ti_dl/demos/out/J7/A72/LINUX/{PROFILE}" into the folder you just created in the SD card "rootfs" partition (/home/root/dlr_demo).
  4. Copy do_dlr_classification.sh and labels_mobilenet_quant_v1_224.txt into the same folder.
  5. Make a folder in the SD card "rootfs" directory for your input images. Example: mkdir /home/root/test_data/app_tidl
  6. Copy the test *.bmp images and the names.txt file into your previously created test_data folder. Example: from "{PSDK_PATH}/tiovx/conformance_tests/test_data/psdkra/app_tidl" to "/home/root/test_data/app_tidl"
  7. Copy your previously built TVM artifacts from the compilation step. Example artifacts folder location: /home/root/artifacts_MobileNetV1_target
  8. Insert the SD card, turn on the EVM, and log into the board.
  9. Create a .local/dlr directory in /home/root: mkdir -p /home/root/.local/dlr/
  10. Copy the shared libdlr.so library into the folder you just created: cp /usr/lib/python3.8/site-packages/dlr/libdlr.so /home/root/.local/dlr/
  11. Define shared libraries path: export LD_LIBRARY_PATH=/usr/lib:/home/root/.local/dlr:$LD_LIBRARY_PATH
  12. Go to the demo folder: cd /home/root/dlr_demo
  13. Run the demo: ./do_dlr_classification.sh ../artifacts_MobileNetV1_target/
  14. To stop the demo press "Enter".

Options for Advanced Users

Deployable module

The result of compilation is called a "deployable module". It consists of three files:

  1. deploy_graph.json: graph description of the compiled network for execution. In the graph, TIDL subgraphs are nodes with names "tidl_0", "tidl_1", etc.
  2. deploy_lib.so: executable code that runs the nodes in the graph. Code for offloading TIDL subgraphs and imported TIDL artifacts is also embedded in this file.
  3. deploy_param.params: constant weights/parameters for nodes in graph.

Taking the output of "test_tidl_j7.py" for TensorFlow MobilenetV1 for example, the deployable module for J7 target is located in "artifacts_MobileNetV1_target/". You can copy this deployable module to the target EVM for execution. Please see the "Inference" sections below for details.

artifacts_MobileNetV1_target
|-- deploy_graph.json
|-- deploy_lib.so
|-- deploy_param.params
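
The deployable module can also be loaded file by file with the TVM runtime (the Neo-AI-DLR runtime does the equivalent internally when pointed at the artifacts folder). The sketch below assumes a module compiled for PC emulation with --pc-inference; the folder name and the input tensor name/shape are assumptions.

import numpy as np
import tvm
from tvm.contrib import graph_runtime

folder = 'artifacts_MobileNetV1_pc'   # assumed PC-emulation artifacts folder

lib = tvm.runtime.load_module(folder + '/deploy_lib.so')
graph = open(folder + '/deploy_graph.json').read()
params = bytearray(open(folder + '/deploy_param.params', 'rb').read())

module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.load_params(params)

# 'input' and the NCHW shape below are assumptions; match them to your model.
module.set_input('input', np.random.uniform(0, 1, (1, 3, 224, 224)).astype('float32'))
module.run()
print('output shape:', module.get_output(0).asnumpy().shape)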

Other compilation artifacts

All other compilation artifacts are stored in the "tempDir" directory under the specified "artifacts_folder". Interested users can look into this directory for TIDL importing details. This directory is for information only, and is not needed for inference/deployment.

One useful file is "relay.gv.svg". It gives a graphical view of the whole network and where the TIDL subgraphs are. You can view it using a browser or other viewer, for example:

firefox artifacts_MobileNetV1_target/tempDir/relay.gv.svg

Debugging

You can set the environment variable TIDL_RELAY_IMPORT_DEBUG to 0, 1, 2, 3, or 4 for detailed internal debug information and progress during TVM compilation. For example, the compiler will dump the graph represented in TVM Relay IR, the Relay IR to TIDL import steps, etc.

Comparing TIDL per layer output with TVM per layer output

When TIDL_RELAY_IMPORT_DEBUG is set to 4, TIDL import will generate the output for each TIDL layer in the imported TIDL subgraph, using calibration inputs. The compilation will also generate corresponding output from running the original model in floating point mode, by compiling and running on the host using TVM. We name the tensors from TIDL quantized calibration execution "tidl_tensor"; we name the corresponding tensors from TVM floating point execution "tvm_tensor". A simple script, "compare_tensors.py", is provided to compare these two tensors.

TIDL_RELAY_IMPORT_DEBUG=4 python3 ./test_tidl_j7.py --target
# python3 ./compare_tensors.py <artifacts_folder> <subgraph_id> <layer_id>
python3 ./compare_tensors.py artifacts_MobileNetV1_target 0 1

Known Issues/Limitations

  • We are observing a performance gap between the Neo-AI-DLR runtime and the standalone TIDL runtime, even when the entire network is fully offloaded. This gap is expected to be reduced in future releases.
  • mxnet_mobilenetv3_large - The squeeze-and-excitation (SE) block is currently not supported in TIDL, which results in the creation of many subgraphs. Because of this, the graph switching overhead is quite high and ultimately nullifies the acceleration benefit of offloading the subgraphs to C7x-MMA. This may apply to other networks as well when the number of subgraphs is high.