3.8.1. TensorFlow Lite (LiteRT)
3.8.1.1. Introduction
LiteRT, formerly known as TensorFlow Lite, is an open-source library for running machine learning models on mobile and embedded devices. Processor SDK integrates open-source TensorFlow Lite for deep learning inference at the edge. TensorFlow Lite runs on the Arm cores of Sitara devices (AM3/AM4/AM5/AM6).
It supports on-device inference with low latency and a compact binary size. You can find more information at TensorFlow Lite.
3.8.1.2. Features
TensorFlow Lite v2.18.0 via Yocto - meta-arago-extras/recipes-framework/tensorflow-lite/tensorflow-lite_2.18.0.bb
Multithreaded computation with acceleration using Arm Neon SIMD instructions on Cortex-A cores
C++ library and Python interpreter (Python 3 supported)
TensorFlow Lite Model Benchmark Tool (i.e., benchmark_model)
3.8.1.3. Inference backends and delegates
An inference backend is a compute engine designed for the efficient execution of machine learning models on edge devices. TensorFlow Lite provides options to enable various backends using the delegate mechanism.
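As an illustration of the delegate mechanism from Python (assuming the SDK image provides the tflite_runtime package), the following sketch shows both paths: constructing an interpreter with no delegate, and plugging in an external delegate library. The delegate library name used here is a placeholder, not a component shipped with the SDK:

# Illustrative sketch only: "libsome_delegate.so" is a placeholder delegate
# library, not part of the Processor SDK. Assumes tflite_runtime is installed.
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL = "/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite"

# 1) No delegate: the model runs on TensorFlow Lite's built-in CPU kernels.
cpu_interpreter = Interpreter(model_path=MODEL, num_threads=4)

# 2) External delegate: load a shared library implementing the TFLite delegate
#    API and hand it to the interpreter, which offloads supported operators to it.
delegate = load_delegate("libsome_delegate.so")
accel_interpreter = Interpreter(model_path=MODEL,
                                experimental_delegates=[delegate])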

3.8.1.3.1. Built-in kernels / CPU Delegate
The default inference backend for TensorFlow Lite is the CPU, using the built-in reference kernels from the TensorFlow Lite implementation. These built-in kernels fully support the TensorFlow Lite operator set.
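For reference, a minimal Python sketch of inference on the built-in CPU kernels is shown below; it assumes the tflite_runtime package and the sample model under /usr/share/oob-demo-assets/models/ are present on the target:

# A minimal sketch of inference on the built-in kernels, assuming tflite_runtime
# is installed. Note that, depending on how the interpreter was built, the
# default delegate (XNNPACK) may still be applied automatically.
import numpy as np
from tflite_runtime.interpreter import Interpreter

MODEL = "/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite"

interpreter = Interpreter(model_path=MODEL, num_threads=4)
interpreter.allocate_tensors()

# Feed random data of the expected shape and dtype, as benchmark_model does.
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"],
                       np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"]))
interpreter.invoke()

for out in interpreter.get_output_details():
    print(out["name"], interpreter.get_tensor(out["index"]).shape)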
3.8.1.3.2. XNNPACK Delegate
The XNNPACK library is a highly optimized collection of neural network inference operators for floating-point and quantized models. It can be accessed through the XNNPACK delegate in TensorFlow Lite, with computations performed on the CPU. The library offers optimized implementations for a subset of TensorFlow Lite operators.
Note
The XNNPACK delegate is not supported on ARMv7-based platforms such as AM335x and AM437x. Refer to XNNPACK supported architectures for more details.
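To observe the effect of XNNPACK from Python, the hedged sketch below times one inference with and without the default delegate. It assumes the interpreter build on the image exposes the experimental_op_resolver_type option; if it does not, the same comparison can be made with benchmark_model's --use_xnnpack flag, as shown in the following sections:

# A hedged sketch: assumes this interpreter build exposes OpResolverType and the
# experimental_op_resolver_type argument; otherwise use benchmark_model --use_xnnpack.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter, OpResolverType

MODEL = "/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite"

def time_one_inference(resolver):
    interpreter = Interpreter(model_path=MODEL, num_threads=4,
                              experimental_op_resolver_type=resolver)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"],
                           np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"]))
    start = time.perf_counter()
    interpreter.invoke()
    return time.perf_counter() - start

# BUILTIN applies the default delegates (XNNPACK for supported float operators);
# BUILTIN_WITHOUT_DEFAULT_DELEGATES runs only the built-in CPU kernels.
print("with XNNPACK   : %.3f s" % time_one_inference(OpResolverType.BUILTIN))
print("without XNNPACK: %.3f s" % time_one_inference(OpResolverType.BUILTIN_WITHOUT_DEFAULT_DELEGATES))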
3.8.1.4. Benchmark Tool for TFLite Model
By default, the tisdk-default-image wic image from the AM64x SDK download page contains a pre-installed benchmarking application named benchmark_model. It is a C++ binary designed to benchmark a TFLite model and its individual operators: it takes a TFLite model, generates random inputs, and repeatedly runs the model for a specified number of runs, then reports aggregate latency statistics. The benchmark_model binary is located at /opt/tensorflow-lite/tools/.
Refer to the TFLite Model Benchmark Tool - README for more details.
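As a rough illustration of what benchmark_model measures, the sketch below loads the same model from Python, runs a warm-up inference, then times repeated runs on random inputs and prints simple aggregate statistics. It assumes the tflite_runtime package is available on the target; for real measurements, prefer the benchmark_model binary itself:

# A minimal sketch mirroring benchmark_model's approach: random inputs, repeated
# runs, aggregate latency statistics. Assumes tflite_runtime is installed.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

MODEL = "/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite"
RUNS = 50

t0 = time.perf_counter()
interpreter = Interpreter(model_path=MODEL, num_threads=4)
interpreter.allocate_tensors()
init_ms = (time.perf_counter() - t0) * 1e3

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"],
                       np.random.random_sample(tuple(inp["shape"])).astype(inp["dtype"]))
interpreter.invoke()  # warm-up ("first inference")

latencies_us = []
for _ in range(RUNS):
    start = time.perf_counter()
    interpreter.invoke()
    latencies_us.append((time.perf_counter() - start) * 1e6)

print(f"Init: {init_ms:.1f} ms")
print(f"Inference avg/min/max (us): "
      f"{np.mean(latencies_us):.0f}/{np.min(latencies_us):.0f}/{np.max(latencies_us):.0f}")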
3.8.1.4.1. How to run benchmark using CPU delegate
To execute the benchmark using the CPU for computation, use the following command:
root@<machine>:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=false
The output of the benchmarking application should be similar to:
root@am62xx-evm:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=false
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [4]
INFO: Graph: [/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite]
INFO: Signature to run: []
INFO: #threads used for CPU inference: [4]
INFO: Use xnnpack: [0]
INFO: Loaded model /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
INFO: The input model file size (MB): 67.3128
INFO: Initialized session in 6.418ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=1 curr=1041765
INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=977738 curr=964908 min=911877 max=1112273 avg=971535 std=39112
INFO: Inference timings in us: Init: 6418, First inference: 1041765, Warmup (avg): 1.04176e+06, Inference (avg): 971535
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=6.14844 overall=109.848
Where:
/opt/tensorflow-lite/tools/benchmark_model
: The path to the benchmark_model binary, which is used to benchmark TensorFlow Lite models.
--graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
: Specifies the path to the TFLite model file to be benchmarked. In this case, it is an SSD MobileNet V2 model trained on the COCO dataset.
--use_xnnpack=false
: Disables the XNNPACK delegate for optimized CPU inference. The model will run without XNNPACK optimizations.
--num_threads=4
: Sets the number of threads to use for inference. In this case, 4 threads are used.
3.8.1.4.2. How to run benchmark using XNNPACK delegate
To execute the benchmark with the XNNPACK delegate, use the following command:
root@<machine>:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=true
The output of the benchmarking application should be similar to:
root@am62xx-evm:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=true
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [4]
INFO: Graph: [/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite]
INFO: Signature to run: []
INFO: #threads used for CPU inference: [4]
INFO: Use xnnpack: [1]
INFO: Loaded model /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: XNNPACK delegate created.
INFO: Explicitly applied XNNPACK delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
INFO: The input model file size (MB): 67.3128
INFO: Initialized session in 592.232ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=1 curr=633430
INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=605745 curr=618849 min=568228 max=722188 avg=602943 std=27690
INFO: Inference timings in us: Init: 592232, First inference: 633430, Warmup (avg): 633430, Inference (avg): 602943
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=133.086 overall=149.531
Where:
--use_xnnpack=true
: Enables the use of the XNNPACK delegate for optimized CPU inference. The model will run with XNNPACK optimizations.
3.8.1.5. Performance Numbers of Benchmark Tool
The following performance numbers were captured with benchmark_model on different SoCs, using the /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite model and setting --num_threads to the maximum value (i.e., the number of Cortex-A cores).
| SoC    | Delegate | Inference Time (sec) | Initialization Time (ms) | Overall Memory Footprint (MB) |
|--------|----------|----------------------|--------------------------|-------------------------------|
| AM62X  | CPU only | 0.977168             | 6.129                    | 110.07                        |
| AM62X  | XNNPACK  | 0.613474             | 593.558                  | 149.699                       |
| AM62PX | CPU only | 0.419261             | 4.79                     | 108.707                       |
| AM62PX | XNNPACK  | 0.274756             | 1208.04                  | 149.395                       |
| AM64X  | CPU only | 1.10675              | 144.535                  | 109.562                       |
| AM64X  | XNNPACK  | 0.702809             | 601.33                   | 149.602                       |
| AM62L  | CPU only | 1.04867              | 6.088                    | 110.129                       |
| AM62L  | XNNPACK  | 0.661133             | 466.216                  | 149.703                       |
Based on the above data, using the XNNPACK delegate significantly improves inference times across all SoCs, though it generally increases initialization time and overall memory footprint.
Note
The performance numbers mentioned above were recorded after stopping the out-of-box (OOB) demos included in the TI SDK.
3.8.1.6. Example Applications
Processor SDK integrates open-source components such as NNStreamer, which can be used for neural network inference with the sample TFLite models under /usr/share/oob-demo-assets/models/.
Check out the Object Detection use case under TI Apps Launcher - User Guide.
Alternatively, if a display is connected, an Object Detection pipeline can be run directly with NNStreamer from the command line; an illustrative sketch is shown below.
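The SDK's actual demo pipeline is packaged with the TI Apps Launcher; purely as an illustration, the sketch below builds a simple NNStreamer pipeline from Python (via the GStreamer bindings) that feeds video frames through tensor_filter with the sample SSD MobileNet V2 model. The videotestsrc source, the 300x300 input size, the normalization constants, and the framework name are assumptions for this sketch, not the SDK's shipped command:

# Illustrative NNStreamer pipeline sketch; NOT the SDK's shipped Object Detection
# command. videotestsrc, input size, and normalization constants are assumptions.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

MODEL = "/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite"

pipeline = Gst.parse_launch(
    "videotestsrc ! videoconvert ! videoscale ! "
    "video/x-raw,width=300,height=300,format=RGB ! "
    "tensor_converter ! "
    # Cast to float32 and normalize; adjust to match how the model was exported.
    "tensor_transform mode=arithmetic option=typecast:float32,add:-127.5,div:127.5 ! "
    # The framework name may differ (e.g. tensorflow2-lite) depending on the NNStreamer build.
    f"tensor_filter framework=tensorflow-lite model={MODEL} ! "
    # A real demo would decode the output tensors into bounding boxes (tensor_decoder)
    # and composite them onto the video for the display; here they are discarded.
    "fakesink"
)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()

For a live camera, videotestsrc would be replaced with v4l2src, and the output tensors would typically be decoded and overlaid on the video before being sent to a display sink.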
Attention
The Example Applications section is not applicable for AM64x