3.8.1. TensorFlow Lite (LiteRT)

3.8.1.1. Introduction

LiteRT, formerly known as TensorFlow Lite, is an open-source library designed for running machine learning models on mobile and embedded devices. Processor SDK has integrated open-source TensorFlow Lite for deep learning inference at the edge. TensorFlow Lite runs on the Arm cores of Sitara devices (AM3/AM4/AM5/AM6).

It supports on-device inference with low latency and a compact binary size. You can find more information at TensorFlow Lite.

3.8.1.2. Features

3.8.1.3. Inference backends and delegates

An inference backend is a compute engine designed for the efficient execution of machine learning models on edge devices. TensorFlow Lite provides options to enable various backends using the delegate mechanism.

[Figure: TensorFlow Lite Arm-only (Armv8) software stack]
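
As a rough sketch of how a delegate is attached, the Python snippet below loads an external delegate shared library and passes it to the interpreter through the tflite_runtime API. The library name libexample_delegate.so is a hypothetical placeholder (not something shipped in the SDK), and the tflite_runtime Python package is assumed to be installed on the target:

# Minimal sketch of the delegate mechanism via the Python tflite_runtime API.
# "libexample_delegate.so" is a hypothetical placeholder for an external delegate library.
from tflite_runtime.interpreter import Interpreter, load_delegate

delegate = load_delegate('libexample_delegate.so')    # load the delegate plug-in
interpreter = Interpreter(
    model_path='/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite',
    experimental_delegates=[delegate])                # supported ops run on the delegate
interpreter.allocate_tensors()                        # unsupported ops fall back to built-in kernels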

3.8.1.3.1. Built-in kernels / CPU Delegate

The default inference backend for TensorFlow Lite is the CPU, using its built-in reference kernels. These built-in kernels fully support the TensorFlow Lite operator set.
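
As a minimal illustration of running a model on the default CPU backend, assuming the tflite_runtime Python package is available on the target, the sketch below performs a single inference with random input:

# Minimal sketch: a single inference on the built-in CPU kernels with random input.
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(
    model_path='/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite',
    num_threads=4)                                    # CPU threads used for inference
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
data = (np.random.rand(*inp['shape']) * 255).astype(inp['dtype'])   # random input of the right shape/dtype
interpreter.set_tensor(inp['index'], data)
interpreter.invoke()                                  # run the model once on the CPU

for out in interpreter.get_output_details():
    print(out['name'], interpreter.get_tensor(out['index']).shape)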

3.8.1.3.2. XNNPACK Delegate

The XNNPACK library is a highly optimized collection of floating-point and quantized neural network inference operators. It can be accessed through the XNNPACK delegate in TensorFlow Lite, with computations performed on the CPU. The library offers optimized implementations for a subset of TensorFlow Lite operators.

Note

The XNNPACK delegate is not supported on ARMv7-based platforms like AM335x and AM437x. Refer to XNNPACK supported architectures for more details.

3.8.1.4. Benchmark Tool for TFLite Model

The tisdk-default-image wic image from the AM64x-SDK-Download-page contains a pre-installed benchmarking application named benchmark_model. It is a C++ binary designed to benchmark a TFLite model and its individual operators. It takes a TFLite model, generates random inputs, and repeatedly runs the model for a specified number of runs. After running the benchmark, it reports aggregate latency statistics.

The benchmark_model binary is located in /opt/tensorflow-lite/tools/. Refer to the TFLite Model Benchmark Tool - README for more details.
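
For reference, the sketch below mimics in Python what benchmark_model does conceptually: random input, warm-up runs, then repeated timed runs with simple latency statistics. It is only an illustration, assuming the tflite_runtime package is installed on the target, not a replacement for the tool:

# Rough Python equivalent of what benchmark_model measures: random input,
# warm-up runs, then average/min/max latency over repeated timed runs.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(
    model_path='/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite',
    num_threads=4)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp['index'],
                       (np.random.rand(*inp['shape']) * 255).astype(inp['dtype']))

for _ in range(3):                                    # warm-up runs (not measured)
    interpreter.invoke()

timings_us = []
for _ in range(50):                                   # measured runs
    start = time.perf_counter()
    interpreter.invoke()
    timings_us.append((time.perf_counter() - start) * 1e6)

print('count=%d min=%.0f max=%.0f avg=%.0f (us)' %
      (len(timings_us), min(timings_us), max(timings_us),
       sum(timings_us) / len(timings_us)))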

3.8.1.4.1. How to run benchmark using CPU delegate

To execute the benchmark using the CPU for computation, use the following command:

root@<machine>:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --use_xnnpack=false

The output of the benchmarking application should be similar to:

root@am62xx-evm:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=false
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [4]
INFO: Graph: [/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite]
INFO: Signature to run: []
INFO: #threads used for CPU inference: [4]
INFO: Use xnnpack: [0]
INFO: Loaded model /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
INFO: The input model file size (MB): 67.3128
INFO: Initialized session in 6.418ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=1 curr=1041765

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=977738 curr=964908 min=911877 max=1112273 avg=971535 std=39112

INFO: Inference timings in us: Init: 6418, First inference: 1041765, Warmup (avg): 1.04176e+06, Inference (avg): 971535
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=6.14844 overall=109.848

Where,

  • /opt/tensorflow-lite/tools/benchmark_model: This is the path to the benchmark_model binary, which is used to benchmark TensorFlow Lite models.

  • --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite: Specifies the path to the TFLite model file to be benchmarked. In this case, it’s an SSD MobileNet V2 model trained on the COCO dataset.

  • --use_xnnpack=false: Disables the use of the XNNPACK delegate for optimized CPU inference. The model will run without XNNPACK optimizations.

  • --num_threads=4: Sets the number of threads to use for inference. In this case, it uses 4 threads.

3.8.1.4.2. How to run benchmark using XNNPACK delegate

To execute the benchmark with the XNNPACK delegate, use the following command:

root@<machine>:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --use_xnnpack=true

The output of the benchmarking application should be similar to:

root@am62xx-evm:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=true
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Num threads: [4]
INFO: Graph: [/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite]
INFO: Signature to run: []
INFO: #threads used for CPU inference: [4]
INFO: Use xnnpack: [1]
INFO: Loaded model /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: XNNPACK delegate created.
INFO: Explicitly applied XNNPACK delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
INFO: The input model file size (MB): 67.3128
INFO: Initialized session in 592.232ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=1 curr=633430

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=605745 curr=618849 min=568228 max=722188 avg=602943 std=27690

INFO: Inference timings in us: Init: 592232, First inference: 633430, Warmup (avg): 633430, Inference (avg): 602943
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=133.086 overall=149.531

Where,

  • --use_xnnpack=true: Enables the use of the XNNPACK delegate for optimized CPU inference. The model will run with XNNPACK optimizations.

3.8.1.5. Performance Numbers of Benchmark Tool

The following performance numbers were captured with benchmark_model on different SoCs, using the /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite model and setting --num_threads to the maximum value (i.e. the number of Cortex-A cores).

Table 3.2 Performance Benchmarks of TFLite on Different SoCs

SOC       Delegate   Inference Time (sec)   Initialization Time (ms)   Overall Memory Footprint (MB)
AM62X     CPU only   0.977168               6.129                      110.07
AM62X     XNNPACK    0.613474               593.558                    149.699
AM62PX    CPU only   0.419261               4.79                       108.707
AM62PX    XNNPACK    0.274756               1208.04                    149.395
AM64X     CPU only   1.10675                144.535                    109.562
AM64X     XNNPACK    0.702809               601.33                     149.602
AM62L     CPU only   1.04867                6.088                      110.129
AM62L     XNNPACK    0.661133               466.216                    149.703

Based on the above data, using the XNNPACK delegate significantly improves inference times across all SoCs, though it generally increases initialization time and overall memory footprint.

Note

The performance numbers mentioned above were recorded after stopping the out-of-box (OOB) demos included in the TI SDK.
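
To put the improvement into concrete terms, the ratios below are computed directly from the inference times in the table above (a simple illustrative calculation, not part of the SDK tooling):

# Speedup of XNNPACK over CPU-only inference, computed from the table above.
times = {                     # SoC: (CPU-only, XNNPACK) inference time in seconds
    'AM62X':  (0.977168, 0.613474),
    'AM62PX': (0.419261, 0.274756),
    'AM64X':  (1.10675,  0.702809),
    'AM62L':  (1.04867,  0.661133),
}
for soc, (cpu, xnn) in times.items():
    print('%-7s %.2fx faster with XNNPACK' % (soc, cpu / xnn))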

3.8.1.6. Example Applications

Processor SDK has integrated open-source components such as NNStreamer, which can be used for neural network inferencing with the sample TFLite models under /usr/share/oob-demo-assets/models/. Check out the Object Detection use case in the TI Apps Launcher - User Guide.

Alternatively, if a display is connected, you can run the Object Detection pipeline using this command:

Attention

The Example Applications section is not applicable for AM64x