3.15.2. Neo-AI Deep Learning Runtime

3.15.2.1. Introduction

Neo-AI-DLR is an open-source common runtime for deep learning models and decision tree models compiled by TVM, AWS SageMaker Neo, or Treelite. DLR stands for Deep Learning Runtime. Processor SDK Linux has integrated Neo-AI-DLR, so models compiled by AWS SageMaker Neo or TVM can run on the Arm cores of all Sitara devices (AM3/AM4/AM5/AM6).

On AM5729 and AM5749, which have deep learning accelerators, models compiled by Neo can be accelerated either fully or partially on the accelerator (EVE/DSP) cores. The graph of any model supported by the Neo compiler can be split into sub-graphs: TIDL-supported sub-graphs run on the EVE/DSP cores, while sub-graphs with unsupported layers run on the Arm cores. The sections below explain how to run heterogeneous inference on AM5729/49.

3.15.2.2. Patch Installation

Please download the Processor-SDK Linux AM57x patch.

  • Patch installation on the target

Untar the patch to a temporary folder in the AM57x EVM file system and run the commands below.

./install_patches.sh
reboot (or power cycle the EVM)
  • Patch installation on the host (needed only when compiling models on local host)

Untar the patch and copy the following files to the folder containing the TIDL tools used for compilation.

tidl_relayImport.so
eve_test_dl_algo_ref.out

3.15.2.3. Examples

Examples of running inference with Neo-AI-DLR are available in /usr/share/dlr/demos of the target filesystem. The patch release installed above contains a pre-compiled MobileNetV1 model. To run the examples with MobileNetV1, use the following commands:

cd /usr/share/dlr/demos

  • Run the Python API example:              ./do_tidl4.sh ./tf_mnet1_batch4
  • Run the C API example with a video clip: ./do_mobilenet4_CLIP.sh ./tf_mnet1_batch4
  • Run the C API example with a live camera: ./do_mobilenet4_CAM.sh ./tf_mnet1_batch4

For more information about running Neo-AI-DLR examples with TIDL, please refer to the neo-ai-dlr Texas Instruments fork on GitHub.

3.15.2.4. Build Your Own Applications with DLR

To deploy a deep learning model to AM57x devices, you will first need to compile the model to generate runtime artifacts according to Compiling Network Models to Run with DLR. You will then need to build your application on top of DLR, using either the Python API or the C API (a minimal Python sketch is given after the list below). The following examples in /usr/share/dlr/demos can be used as references:

  • /usr/share/dlr/demos/tidl_dlr4.py (using Python API)
  • /usr/share/dlr/demos/run_mobilenet_cv_mt.cc (using C API)
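The typical Python API flow is short. The sketch below is minimal and illustrative only: the artifact path, the input tensor name "input", and the input shape are placeholders for your own model; the complete TIDL demo flow is in tidl_dlr4.py above.

# Minimal sketch: load compiled artifacts and run one inference with the DLR Python API.
# The path, input name, and shape are placeholders for your own model.
import numpy as np
from dlr import DLRModel

model = DLRModel('/path/to/artifacts_folder', 'cpu')   # artifacts produced by Neo/TVM compilation
batch = np.zeros((4, 224, 224, 3), dtype=np.float32)   # example NHWC input, batch size 4
outputs = model.run({'input': batch})                  # key must match the model's input tensor name
print(outputs[0].shape)                                # first output tensor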

Once a model is compiled, copy the generated artifacts from the host to the AM57x device and run inference according to the following instructions:

Copy the artifacts to <artifacts_folder>
Run the user-defined inference program from within <artifacts_folder>
If the inference program runs from a different location, do the following (a sketch of setting these variables from within a Python program follows this list):
   - Edit <artifacts_folder>/subgraph*.cfg to provide the path for .bin files:
     netBinFile    = <absolute path of artifacts_folder>/tidl_subgraph0_net.bin
     paramsBinFile = <absolute path of artifacts_folder>/tidl_subgraph0_params.bin
   - export TIDL_SUBGRAPH_DIR=<absolute path of artifacts_folder>
   - export TIDL_SUBGRAPH_DYNAMIC_OUTSCALE=1
   - export TIDL_SUBGRAPH_DYNAMIC_INSCALE=1
   - Run inference program from anywhere
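The same environment variables can also be set from inside a Python program before DLR loads the model (the subgraph*.cfg edits above are still required). A minimal sketch, where the artifacts path is an example placeholder:

# Sketch: configure the TIDL subgraph environment from Python before loading the model.
# The path below is an example placeholder; substitute your own <artifacts_folder>.
import os

artifacts = '/home/root/artifacts'                      # hypothetical absolute path
os.environ['TIDL_SUBGRAPH_DIR'] = artifacts
os.environ['TIDL_SUBGRAPH_DYNAMIC_OUTSCALE'] = '1'
os.environ['TIDL_SUBGRAPH_DYNAMIC_INSCALE'] = '1'

from dlr import DLRModel                                # import after the environment is set
model = DLRModel(artifacts, 'cpu')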

3.15.2.5. Compiling Network Models to Run with DLR

Deep learning models can be compiled by the SageMaker Neo service to run with DLR on AM57x devices.

An alternative is to build the TVM compiler from source and compile models on a local host. The compiler source code supporting AM57x devices is hosted at https://github.com/TexasInstruments/tvm on the branch "tidl-j6". Follow the instructions below to build TVM from source (these instructions have been validated only on Ubuntu 18.04).

  • Install required tools
sudo apt-get install -y python3 python3-dev python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev
  • Build TVM from source
git clone --recursive https://github.com/TexasInstruments/tvm.git neo-ai-tvm --branch tidl-j6
cd neo-ai-tvm
mkdir build && cd build
cp ../cmake/config.cmake .
modify config.cmake:
  - change "set(USE_LLVM OFF)" to "set(USE_LLVM ON)"
  - change "set(USE_TIDL OFF)" to "set(USE_TIDL ON)"
cmake ..
make -j8
  • Install Python Package
export TVM_HOME=/path/to/neo-ai-tvm
export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}
  • Set up TIDL tools

Make sure the two binaries are copied to <TIDL_TOOLS_folder> as described in Patch Installation for the host. Then set the following environment variables:

export ARM_GCC_PATH=<Processor-SDK installation folder>/linux-devkit/sysroots/x86_64-arago-linux/usr/bin
export TIDL_TOOLS_PATH=<TIDL_TOOLS_folder>
  • Compile Neural Network Models

Refer to https://github.com/neo-ai/tvm/blob/tidl-j6/tests/python/relay/test_tidl.py for examples of compiling deep learning models.
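For orientation, the block below sketches a generic Relay compile-and-export flow for an Arm CPU target, using a toy workload from tvm.relay.testing instead of a real imported network. It deliberately omits the TIDL partitioning steps (those are in test_tidl.py), and the target string, cross-compiler name, and output file names are illustrative assumptions; the exact build and export API also varies between TVM versions, so treat this as a sketch rather than the definitive flow for this branch.

# Sketch only: generic Relay build for an Arm CPU target, without TIDL offload.
# The TIDL partitioning flow and the exact target/toolchain settings used on the
# tidl-j6 branch are shown in tests/python/relay/test_tidl.py.
import os
import tvm
from tvm import relay
from tvm.relay import testing
from tvm.contrib import cc

# Toy MobileNet workload standing in for a model imported with a Relay frontend
# (e.g. relay.frontend.from_tensorflow).
mod, params = testing.mobilenet.get_workload(batch_size=1)

# Illustrative target string for the AM57x Cortex-A15.
target = 'llvm -device=arm_cpu -mtriple=armv7l-linux-gnueabihf -mattr=+neon'

with relay.build_config(opt_level=3):
    graph_json, lib, params = relay.build(mod, target=target, params=params)

# Cross-compile the deployable library with the SDK toolchain pointed to by
# ARM_GCC_PATH (the compiler name is an assumption; older TVM versions may need
# a callable instead of a compiler name for cc.cross_compiler).
toolchain = os.path.join(os.environ['ARM_GCC_PATH'], 'arm-linux-gnueabihf-g++')
lib.export_library('deploy_lib.so', cc.cross_compiler(toolchain))

# Save the graph and parameters alongside the library for the runtime.
with open('deploy_graph.json', 'w') as f:
    f.write(graph_json)
with open('deploy_param.params', 'wb') as f:
    f.write(relay.save_param_dict(params))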

3.15.2.6. Benchmarking

Performance with and without TIDL offload is shown below for TensorFlow MobileNetV1 and MobileNetV2. Performance depends significantly on the batch size: with a batch size of 1, only one EVE core is active and performance is very poor.

Batch Size | TIDL MobileNetV1 (fps) | ARM MobileNetV1 (fps) | TIDL MobileNetV2 (fps) | ARM MobileNetV2 (fps)
-----------|------------------------|-----------------------|------------------------|----------------------
4          | 30.1260                | 2.2018                | 30.5178                | 3.6843
16         | 34.8465                | 2.2210                | 36.0127                | 3.6775
32         | 35.5279                | -                     | 37.5482                | -

Note

  • This release only supports batch sizes up to 32.
  • The ARM results use default TVM scheduling (no auto-tuning) and run on a single A15 core.

3.15.2.7. Rebuilding DLR from Source

DLR for AM57x devices is included in the Processor SDK Linux target file system. Source code is hosted at https://github.com/TexasInstruments/neo-ai-dlr. Users may rebuild the latest source code ahead of an official Processor SDK release by following the steps below:

  • Clone the git repo on an x86 host into the target NFS folder (git cloning may not work on the EVM):
git clone --recursive https://github.com/TexasInstruments/neo-ai-dlr.git --branch tidl-j6
  • Build and install DLR on the AM57x EVM:
cd neo-ai-dlr
mkdir build && cd build
cmake .. -DUSE_TIDL=ON
make -j2
cd ../python
python3 setup.py install --user
  • Rebuild TIDL demo with DLR:
cd neo-ai-dlr/examples/tidl
make clean
make

3.15.2.8. Troubleshooting

When deploying a deep learning model on AM5729/49, the most common problem is running out of memory, especially if the model is compiled for a batch size larger than 1. When this happens, the following error message is displayed:

inc/executor.h:172: T* tidl::malloc_ddr(size_t) [with T = char; size_t = unsigned int]: Assertion `val != nullptr' failed.

This problem can be solved with the following steps:

  • Clean up heap:
ti-mct-heap-check -c
  • Find out the memory requirement by running inference with environment variables for debugging:
TIDL_SUBGRAPH_TRACE=1 TIDL_SUBGRAPH_NUM_EVES=1 TIDL_NETWORK_HEAP_SIZE_EVE=200000000 TIDL_PARAM_HEAP_SIZE_EVE=180000000 \
 <command to run inference program, e.g. python3 run_tidl_infer.py> | egrep "PARAM|NETWORK"

This command will generate results similar to the example given below:

[eve 0]         TIDL Device Trace: NETWORK heap: Size 200000000, Free 76721664, Total requested 123278336
[eve 0]         TIDL Device Trace: PARAM heap: Size 180000000, Free 152464932, Total requested 27535068

From the generated results, find the total requested sizes for the NETWORK heap and the PARAM heap. Then run inference with environment variables that configure the heaps to meet the requested sizes (a small parsing sketch is given after these steps). If the model is compiled for a batch size larger than 1, run inference for multiple trials, each with a different number of EVE cores. For example:

TIDL_SUBGRAPH_NUM_EVES=1 TIDL_NETWORK_HEAP_SIZE_EVE=123300000 TIDL_PARAM_HEAP_SIZE_EVE=28000000 <command to run inference program>
TIDL_SUBGRAPH_NUM_EVES=2 TIDL_NETWORK_HEAP_SIZE_EVE=123300000 TIDL_PARAM_HEAP_SIZE_EVE=28000000 <command to run inference program>
TIDL_SUBGRAPH_NUM_EVES=4 TIDL_NETWORK_HEAP_SIZE_EVE=123300000 TIDL_PARAM_HEAP_SIZE_EVE=28000000 <command to run inference program>

If the batch size is larger than 1, choose the largest number of EVE cores that executes successfully. If the batch size is 1, use a single EVE core.
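The requested heap sizes can be read off the trace output by hand, or extracted with a small script. The helper below is a hypothetical sketch (the file name, the 5% safety margin, and reading from stdin are arbitrary choices, not part of the SDK); pipe the egrep output from the debug run above into it to get suggested settings.

# suggest_heap_sizes.py - hypothetical helper, not part of the SDK.
# Reads TIDL device trace lines from stdin, finds the largest "Total requested"
# values for the NETWORK and PARAM heaps, and prints heap settings with 5% headroom.
import re
import sys

network, param = 0, 0
for line in sys.stdin:
    m = re.search(r'(NETWORK|PARAM) heap: .* Total requested (\d+)', line)
    if m:
        size = int(m.group(2))
        if m.group(1) == 'NETWORK':
            network = max(network, size)
        else:
            param = max(param, size)

margin = 1.05  # 5% headroom over the largest requested size
print('TIDL_NETWORK_HEAP_SIZE_EVE=%d' % int(network * margin))
print('TIDL_PARAM_HEAP_SIZE_EVE=%d' % int(param * margin))

For example: <debug command from above> | egrep "PARAM|NETWORK" | python3 suggest_heap_sizes.py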

3.15.2.9. Known Issues

ResNet50-based models may produce runtime errors with the following message:

[eve 0]         TIDL Device Error: TIDL_process() failed, -1120

This is a known TIDL problem and will be addressed in a future release.