TI Deep Learning Library User Guide
TVM/Neo-AI-DLR + TIDL Heterogeneous Execution

Introduction

The Processor SDK implements TIDL offload support using the TVM runtime and Neo-AI-DLR runtime. This heterogeneous execution enables:

  1. TVM/Neo-AI-DLR as the top level inference API for user applications
  2. Offloading subgraphs to C7x/MMA for accelerated execution with TIDL
  3. Generating code and running on the ARM core for layers that are not supported by TIDL

Neo-AI-DLR is an open source common runtime for machine learning models compiled by AWS SageMaker Neo, TVM, or Treelite. For the Processor SDK, we focus on models compiled by TVM. For these models, the Neo-AI-DLR runtime can be considered as a wrapper around the TVM runtime.

The following sections describe the details for compiling and deploying machine learning models for TVM/Neo-AI-DLR + TIDL heterogeneous execution.

TVM/NEO-AI-DLR based user work flow

The picture below shows the TVM/NEO-AI-DLR based work flow. The user runs the model compilation (subgraph creation and quantization) on a PC, and the generated artifacts can then be used for running inference on the device.

tvmrt_work_flow.png
TVM/NEO-AI-DLR based user work flow

Model Compilation on PC

osrt_compile_steps.png
OSRT Compile Steps

The Processor SDK package includes all the required python packages for runtime support.

Pre-requisite : PSDK RA should be installed on the host Ubuntu 18.04 machine, and the user should be able to run the pre-built demos on the EVM.

The following steps need to be followed. (Note - All the scripts below are to be run from the ${PSDKRA_PATH}/tidl_xx_xx_xx_xx/ti_dl/test/tvm-dlr/ folder.)

  1. Prepare the Environment for the Model compilation
    source prepare_model_compliation_env.sh
    This script needs to be executed only once each time the user opens a new terminal. It performs the operations below. The user can also perform these steps manually by following the script.
    • Downloads and installs all the dependent Python packages such as TVM, DLR, NumPy and the Python Imaging Library (Pillow). If the user has conflicting package versions because of other installations, we recommend creating a conda environment with Python 3.6.x and running these scripts there.
    • Downloads the models used by the OOB scripts if they are not available in the file system
    • Sets the environment variables required by the scripts, e.g. the path to tools and shared libraries
    • Checks that all the tools required by TIDL are available in the tools path

      Note
      This script downloads the TVM and DLR Python packages from the latest PSDK release on ti.com. If you are using a different version of the SDK (for example an RC version), please update the links for these two Python whl files in download_models.py, e.g. 'dlr-1.4.0-py3-none-any.whl' : {'url':'http://swubn03.india.englab.ti.com/webgen/publish/PROCESSOR-SDK-LINUX-J721E/07_02_00_05/exports//dlr-1.4.0-py3-none-any.whl', 'dir':'./'}. If you observe any issue with pip, run the command below to update it:

      python -m pip install --upgrade pip
  2. Run the model compilation – This step generates the artifacts needed for inference in the artifacts folder. Each subgraph is identified in the artifacts using the tensor index of its output in the model.
    #Model Compilation for EVM
    python3 tvm-compilation-tflite-example.py
    python3 tvm-compilation-onnx-example.py
    #Model Compilation for PC
    python3 tvm-compilation-tflite-example.py --pc-inference
    python3 tvm-compilation-onnx-example.py --pc-inference
  3. Run inference on PC - Optionally, the user can test the inference in PC emulation mode and check the output in the console (a minimal sketch of this flow is shown after this list).
    python3 dlr-inference-example.py
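
For reference, the PC emulation inference in step 3 boils down to a few Neo-AI-DLR API calls. The sketch below only illustrates that flow; the artifacts folder name, input tensor name, and input shape are assumptions, and dlr-inference-example.py remains the reference script.

import numpy as np
from dlr import DLRModel

# Assumed name of the artifacts folder produced by the PC compilation
# ("--pc-inference"); adjust it to match your compiled model.
artifacts_folder = './artifacts_MobileNetV1_pc'

model = DLRModel(artifacts_folder, dev_type='cpu')

# Dummy input: one 224x224 RGB image; the name 'input' and the NCHW layout are assumptions.
data = np.random.uniform(0, 1, (1, 3, 224, 224)).astype('float32')
outputs = model.run({'input': data})

print('top-5 class indices:', np.argsort(outputs[0].squeeze())[-5:][::-1])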

Run Model on EVM

osrt_run_steps.png
OSRT Run Steps
  1. Copy the “${PSDKR_PATH}/tidl_xx_xx_xx_xx/ti_dl/test/” folder to the file system where the EVM is running Linux (SD card or NFS mount). This folder has all the OOB scripts and artifacts.
  2. Run “LD_LIBRARY_PATH=/usr/lib python3 dlr-inference-example.py” to run the inference on the EVM and check the results, performance, etc.

Note : These scripts are only for basic functionality testing and performance checks. For accuracy benchmarking, we will be releasing more tutorials in upcoming releases.

TIDL specific lines in compilation script

There are only 4 lines that are specific to TIDL offload in "test_tidl_j7.py". The rest of the script is no different from a regular TVM compilation script without TIDL offload.

tidl_compiler = tidl.TIDLCompiler(tidl_platform, tidl_version,
                                  num_tidl_subgraphs=num_tidl_subgraphs,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=get_tidl_tools_path(),
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations':10},
                                  tidl_denylist=args.denylist)

We first instantiate a TIDLCompiler object. The parameters are explained in the following table.

Name/Position              Value
tidl_platform              "J7"
tidl_version               (7,1)
num_tidl_subgraphs         offload up to <num> TIDL subgraphs
artifacts_folder           where to store the deployable module
tidl_tools_path            set to the environment variable TIDL_TOOLS_PATH
tidl_tensor_bits           8 or 16: bit width used when importing TIDL tensors and weights
tidl_calibration_options   optional, a dictionary to override the default calibration options
tidl_denylist              optional, deny a TVM Relay op for TIDL offloading

Advanced calibration can help improve 8-bit quantization. Please see TIDL Quantization for details. The default calibration options are specified in the TVM source file python/tvm/relay/backend/contrib/tidl.py; please grep for "default_calib_options".
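
For illustration, the sketch below shows how these optional parameters could be overridden, reusing the variables from the snippet above. The 'iterations' key is the one used in the example above; any additional calibration keys must match those listed under "default_calib_options", and the denied op name is only a hypothetical example.

# Sketch: more calibration iterations and an explicit denylist entry.
# Reuses num_tidl_subgraphs, tidl_artifacts_folder and get_tidl_tools_path()
# from the snippet above. tidl_tensor_bits=16 could be used instead of 8
# for a higher-precision import.
tidl_compiler = tidl.TIDLCompiler('J7', (7, 1),
                                  num_tidl_subgraphs=num_tidl_subgraphs,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=get_tidl_tools_path(),
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations': 50},
                                  # hypothetical example: keep this Relay op on the ARM core
                                  tidl_denylist=['nn.batch_flatten'])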

mod, status = tidl_compiler.enable(mod_orig, params, model_input_list)

In this step, the original machine learning model/network represented in TVM Relay IR, "mod_orig", goes through the following transformations:

  1. Allowlisting: each Relay Op is examined to see if it can be offloaded to TIDL
  2. Partitioning: Ops that TIDL supports are partitioned into TIDL subgraphs
  3. TIDL importing: TIDL subgraphs are imported from Relay IR into TIDL format
  4. TIDL postprocessing: sample inputs in "model_input_list" are used to calibrate quantization in the TIDL subgraphs.
with tidl.build_config(tidl_compiler=tidl_compiler):
    graph, lib, params = relay.build_module.build(mod, target=target, params=params)

In this step, TVM code generation takes place. Inside the TVM codegen, there is a TIDL codegen backend. "tidl.build_config" creates a context and tells the TIDL codegen backend where the artifacts from TIDL importing are. The backend then embeds the artifacts into the "lib".

tidl.remove_tidl_params(params)

This optional step removes the weights in TIDL subgraphs that have already been imported into the artifacts. Removing them results in a smaller deployable module.
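
Putting these pieces together, a complete compilation script might look like the sketch below. It is only illustrative: the model file, input name and shape, target string, cross-compiler name, and the exact structure of model_input_list are assumptions; test_tidl_j7.py in the SDK remains the authoritative reference.

import os
import onnx
import numpy as np
from tvm import relay
from tvm.relay.backend.contrib import tidl

# Load a model into TVM Relay IR (ONNX is just one example front end;
# the file name, input name and shape are assumptions).
onnx_model = onnx.load('mobilenet_v1.onnx')
input_name, input_shape = 'data', (1, 3, 224, 224)
mod_orig, params = relay.frontend.from_onnx(onnx_model, shape={input_name: input_shape})

# Calibration inputs for TIDL quantization. Random data is used here for
# illustration only; real calibration should use representative images, and
# the exact list/dict structure should follow test_tidl_j7.py.
model_input_list = [{input_name: np.random.uniform(0, 1, input_shape).astype('float32')}]

tidl_artifacts_folder = './artifacts'
tidl_compiler = tidl.TIDLCompiler('J7', (7, 1),
                                  num_tidl_subgraphs=1,
                                  artifacts_folder=tidl_artifacts_folder,
                                  tidl_tools_path=os.environ['TIDL_TOOLS_PATH'],
                                  tidl_tensor_bits=8,
                                  tidl_calibration_options={'iterations': 10})

# Partition the network, import the TIDL subgraphs and calibrate them.
mod, status = tidl_compiler.enable(mod_orig, params, model_input_list)

# TVM code generation for the A72; the TIDL artifacts are embedded in the library.
target = 'llvm -device=arm_cpu -mtriple=aarch64-linux-gnu'   # assumed EVM target string
with tidl.build_config(tidl_compiler=tidl_compiler):
    graph, lib, params = relay.build_module.build(mod, target=target, params=params)
tidl.remove_tidl_params(params)

# Save the three files of the deployable module. The 'cc' argument selects a
# cross compiler for the A72; the toolchain name is an assumption.
lib.export_library(tidl_artifacts_folder + '/deploy_lib.so', cc='aarch64-none-linux-gnu-gcc')
with open(tidl_artifacts_folder + '/deploy_graph.json', 'w') as f:
    f.write(graph)
with open(tidl_artifacts_folder + '/deploy_param.params', 'wb') as f:
    f.write(relay.save_param_dict(params))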

C API based demo

NEO-AI-DLR on the EVM/target supports both the Python API and the C API. This section describes the usage of the C API.

TVM/NEO-AI-DLR + TIDL Heterogeneous Execution + Display demo

Introduction

This demo uses previously created TVM artifacts from NN models. For details on how to compile NN models in TVM, using TI's J7 platform as a target, please refer to "Compilation with TVM compiler" in the "TVM/Neo-AI-DLR + TIDL Heterogeneous Execution" section.

  • Required HW for running this demo:
    • J721e EVM
    • Display monitor, and display port cable
    • SD card. At least 16GB SD card recommended
  • This demo assumes:
    • SD card created with PSDKLA + PSDKRA image
    • A set of bmp images.
    • Images are in SD card "rootfs" location: /home/root/test_data/app_tidl.
    • A names.txt file (at the same location) containing the list of images to use.

Supported platforms

Platform   Linux x86_64   Linux+RTOS mode   QNX+RTOS mode   SoC
Support    NO             YES               NO              J721e

How it works

This demo showcases parallel execution of heterogeneous TVM/NEO-AI-DLR inference with a display, using pthreads and OpenVX graphs.

  • Thread1:
    • Runs on A72, and dispatches to C7x TIDL subgraphs
      • Creates and runs one OpenVX graph with one TIDL node
    • On A72, this thread does image pre-processing and calls the DLR APIs for inference.
  • Thread2:
    • Runs on A72, and dispatches to R5F-0 DSS
      • Creates and runs one OpenVX graph with one Display node
    • On A72, this thread passes the input image and classification output (top_n) information to the display.

Procedure to build DLR classification demo

  1. Modify makefile paths
    • Export PSDK_INSTALL_PATH (or change path in makerules/config.mk)
  2. Additional TIOVX and vision_apps libraries
    • If you haven't already done so, you'll need to build some libraries inside vision_apps. For example: app_utils_draw2d. An easy way of taking care of this is by building "vx_app_tidl".
      • ${PSDK_INSTALL_PATH}/vision_apps$ make vx_app_tidl
  3. cd to ${PSDK_INSTALL_PATH}/tidl_xx_xx_xx_xx
  4. Do "make TARGET_BUILD=${PROFILE} demos".

How to run DLR classification demo on target

  1. Connect the display monitor to the J7's EVM Display0 port.
  2. Make a folder in the SD card's "rootfs" directory. Example: mkdir /home/root/dlr_demo
  3. Copy the demo binary tidl_dlr_classification.out from "c7x-mma-tidl/ti_dl/demos/out/J7/A72/LINUX/{PROFILE}" into the folder you just created in the SD card "rootfs" partition (/home/root/dlr_demo).
  4. Copy do_dlr_classification.sh and labels_mobilenet_quant_v1_224.txt into the same folder.
  5. Make a folder in the SD card "rootfs" directory for your input images. Example: mkdir /home/root/test_data/app_tidl
  6. Copy the test *.bmp images and the names.txt file into your previously created test_data folder. Example: from "{PSDK_PATH}/tiovx/conformance_tests/test_data/psdkra/app_tidl" to "/home/root/test_data/app_tidl"
  7. Copy your previously built TVM artifacts from the compilation step. Example artifacts folder location: /home/root/artifacts_MobileNetV1_target
  8. Insert the SD card, turn on the EVM, and log into the board.
  9. Create a .local/dlr directory in /home/root: mkdir -p /home/root/.local/dlr/
  10. Copy the shared libdlr.so library into the folder you just created: cp /usr/lib/python3.8/site-packages/dlr/libdlr.so /home/root/.local/dlr/
  11. Define shared libraries path: export LD_LIBRARY_PATH=/usr/lib:/home/root/.local/dlr:$LD_LIBRARY_PATH
  12. Go to the demo folder: cd /home/root/dlr_demo
  13. Run the demo: ./do_dlr_classification.sh ../artifacts_MobileNetV1_target/
  14. To stop the demo press "Enter".

Options for Advanced Users

Deployable module

The result of compilation is called a "deployable module". It consists of three files:

  1. deploy_graph.json: graph description of the compiled network for execution. In the graph, TIDL subgraphs are nodes with names "tidl_0", "tidl_1", etc.
  2. deploy_lib.so: executable code that runs the nodes in the graph. Code for offloading TIDL subgraphs and imported TIDL artifacts is also embedded in this file.
  3. deploy_param.params: constant weights/parameters for nodes in graph.

Taking the output of "test_tidl_j7.py" for TensorFlow MobilenetV1 for example, the deployable module for J7 target is located in "artifacts_MobileNetV1_target/". You can copy this deployable module to the target EVM for execution. Please see the "Inference" sections below for details.

artifacts_MobileNetV1_target
|-- deploy_graph.json
|-- deploy_lib.so
|-- deploy_param.params
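
The deployable module can also be loaded file by file with the TVM runtime (the Neo-AI-DLR runtime does the equivalent internally when pointed at the artifacts folder). The sketch below assumes a module compiled for PC emulation with --pc-inference; the folder name and the input tensor name/shape are assumptions.

import numpy as np
import tvm
from tvm.contrib import graph_runtime

folder = 'artifacts_MobileNetV1_pc'   # assumed PC-emulation artifacts folder

lib = tvm.runtime.load_module(folder + '/deploy_lib.so')
graph = open(folder + '/deploy_graph.json').read()
params = bytearray(open(folder + '/deploy_param.params', 'rb').read())

module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.load_params(params)

# 'input' and the NCHW shape below are assumptions; match them to your model.
module.set_input('input', np.random.uniform(0, 1, (1, 3, 224, 224)).astype('float32'))
module.run()
print('output shape:', module.get_output(0).asnumpy().shape)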

Other compilation artifacts

All other compilation artifacts are stored in the "tempDir" directory under the specified "artifacts_folder". Interested users can look into this directory for TIDL importing details. This directory is for information only, and is not needed for inference/deployment.

One useful file is "relay.gv.svg". It gives a graphical view of the whole network and where the TIDL subgraphs are. You can view it using a browser or other viewer, for example:

firefox artifacts_MobileNetV1_target/tempDir/relay.gv.svg

Debugging

You can set the environment variable TIDL_RELAY_IMPORT_DEBUG to 0, 1, 2, 3, or 4 for detailed internal debug information and progress during TVM compilation. For example, the compiler will dump the graph represented in TVM Relay IR, the Relay IR to TIDL import steps, etc.

Comparing TIDL per layer output with TVM per layer output

When TIDL_RELAY_IMPORT_DEBUG is set to 4, TIDL import will generate the output for each TIDL layer in the imported TIDL subgraph, using calibration inputs. The compilation will also generate corresponding output from running the original model in floating point mode, by compiling and running on the host using TVM. We name the tensors from TIDL quantized calibration execution "tidl_tensor"; we name the corresponding tensors from TVM floating point execution "tvm_tensor". A simple script, "compare_tensors.py", is provided to compare these two tensors.

TIDL_RELAY_IMPORT_DEBUG=4 python3 ./test_tidl_j7.py --target
# python3 ./compare_tensors.py <artifacts_folder> <subgraph_id> <layer_id>
python3 ./compare_tensors.py artifacts_MobileNetV1_target 0 1

Known Issues/Limitations

  • We are observing a performance gap between the Neo-AI-DLR runtime and the standalone TIDL runtime, even when the entire network is fully offloaded. This gap is expected to be reduced in future releases.
  • mxnet_mobilenetv3_large - The squeeze-and-excitation (SE) block is currently not supported in TIDL, which results in the creation of many subgraphs. Because of this, the graph switching overhead is quite high and ultimately nullifies the acceleration benefit of offloading the subgraphs to C7x-MMA. This may apply to other networks as well when the number of subgraphs is high.