3.12.1. TensorFlow Lite (LiteRT)

3.12.1.1. Introduction

LiteRT, formerly known as TensorFlow Lite, is an open-source library designed for running machine learning models on mobile and embedded devices. Processor SDK Linux AM62x has integrated open-source TensorFlow Lite for deep learning inference at the edge. TensorFlow Lite runs on Arm for Sitara devices (AM3/AM4/AM5/AM6).

It supports on-device inference with low latency and a compact binary size. You can find more information at TensorFlow Lite

3.12.1.2. Features

3.12.1.3. Inference backends and delegates

An inference backend is a compute engine designed for the efficient execution of machine learning models on edge devices. TensorFlow Lite provides options to enable various backends using the delegate mechanism.

../../../_images/TFLite-arm-only-armv8-sw-stack.png

3.12.1.3.1. Built-in kernels / CPU Delegate

The default inference backend for TensorFlow Lite is the CPU, utilizing reference kernels from its implementation. These built-in kernels fully support the TensorFlow Lite operator set.

3.12.1.3.2. XNNPACK Delegate

The XNNPACK library is a highly optimized collection of floating-point quantized neural network inference operators. It can be accessed through the XNNPACK delegate in TensorFlow Lite, with computations performed on the CPU. This library offers optimized implementations for a subset of TensorFlow Lite operators.

Note

The XNNPACK delegate is not supported for ARMv7-based platforms like AM335x and AM437x. Refer XNNPACK supported architectures for more details.

3.12.1.4. Benchmark Tool for TFLite Model

The tisdk-default-image wic image from AM62x-SDK-Download-page by default contains a pre-installed benchmarking application named benchmark_model. It’s a C++ binary designed to benchmark a TFLite model and its individual operators. It takes a TFLite model, generates random inputs, and repeatedly runs the model for a specified number of runs. After running the benchmark, it reports aggregate latency statistics.

The benchmark_model binary is located at /opt/tensorflow-lite/tools/. Refer TFLite Model Benchmark Tool - README for more details.

3.12.1.4.1. How to run benchmark using CPU delegate

To execute the benchmark using CPU for computation, use the following command:

root@<machine>:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --use_xnnpack=false

The output of the benchmarking application should be similar to:

root@am62xx-evm:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=false
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Graph: [/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite]
INFO: Signature to run: []
INFO: Use xnnpack: [0]
INFO: Loaded model /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
INFO: The input model file size (MB): 67.3128
INFO: Initialized session in 5.579ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=1 curr=1357602 p5=1357602 median=1357602 p95=1357602

INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=1249964 curr=1240143 min=1238588 max=1252566 avg=1.24027e+06 std=2565 p5=1238753 median=1239807 p95=1247415

INFO: Inference timings in us: Init: 5579, First inference: 1357602, Warmup (avg): 1.3576e+06, Inference (avg): 1.24027e+06
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=6.36328 overall=109.832

Where,

  • /opt/tensorflow-lite/tools/benchmark_model: This is the path to the benchmark_model binary, which is used to benchmark TensorFlow Lite models.

  • --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite: Specifies the path to the TFLite model file to be benchmarked. In this case, it’s an SSD MobileNet V2 model trained on the COCO dataset.

  • --use_xnnpack=false: Disables the use of the XNNPACK delegate for optimized CPU inference. The model will run without XNNPACK optimizations.

  • --num_threads=4: Sets the number of threads to use for inference. In this case, it uses 4 threads.

3.12.1.4.2. How to run benchmark using XNNPACK delegate

To execute the benchmark with the XNNPACK delegate, use the following command:

root@<machine>:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --use_xnnpack=true

The output of the benchmarking application should be similar to,

root@am62xx-evm:~# /opt/tensorflow-lite/tools/benchmark_model --graph=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite --num_threads=4 --use_xnnpack=true
INFO: STARTING!
INFO: Log parameter values verbosely: [0]
INFO: Graph: [/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite]
INFO: Signature to run: []
INFO: Use xnnpack: [1]
INFO: Loaded model /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: XNNPACK delegate created.
INFO: Explicitly applied XNNPACK delegate, and the model graph will be partially executed by the delegate w/ 1 delegate kernels.
INFO: The input model file size (MB): 67.3128
INFO: Initialized session in 614.333ms.
INFO: Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
INFO: count=1 curr=905463 p5=905463 median=905463 p95=905463
INFO: Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
INFO: count=50 first=900416 curr=898333 min=898007 max=906121 avg=899641 std=1549 p5=898333 median=899281 p95=904305
INFO: Inference timings in us: Init: 614333, First inference: 905463, Warmup (avg): 905463, Inference (avg): 899641
INFO: Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
INFO: Memory footprint delta from the start of the tool (MB): init=146.363 overall=150.141

Where,

  • --use_xnnpack=true: Enables the use of the XNNPACK delegate for optimized CPU inference. The model will run with XNNPACK optimizations.

3.12.1.5. Performance Numbers of Benchmark Tool

The following performance numbers are captured with benchmark_model on different SoCs using /usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite model & setting --num_threads to max value (i.e. number of Cortex-A core)

Table 3.23 Performance Benchmarks of TFLite on Different SoCs

SOC

Delegates

Inference Time (sec)

Initialization Time (ms)

Overall Memory Footprint (MB)

AM62X

CPU only

1.24027

5.579

109.832

XNNPACK

0.899641

614.333

150.141

AM62PX

CPU only

1.23341

252.390

111.121

XNNPACK

0.875280

597.639

150.52

AM64X

CPU only

1.26429

135.579

110.188

XNNPACK

0.740743

885.636

150.484

AM62L

CPU only

1.3708

807.076

111.152

XNNPACK

0.930577

769.145

150.496

Based on the above data, using the XNNPACK delegate significantly improves inference times across all SoCs, though it generally increases initialization time and overall memory footprint.

Note

The performance numbers mentioned above were recorded after stopping the out-of-box (OOB) demos included in the TI SDK.

3.12.1.6. Example Applications

Processor SDK Linux AM62x has integrated opensource components like NNStreamer which can be used for neural network inferencing using the sample tflite models under /usr/share/oob-demo-assets/models/ Checkout the Object Detection usecase under TI Apps Launcher - User Guide

Alternatively, if a display is connected, you can run the Object Detection pipeline using this command,

gst-launch-1.0 multifilesrc location=/usr/share/oob-demo-assets/videos/oob-gui-video-objects.h264 loop=true ! \
h264parse ! avdec_h264 ! \
tee name=tee_split0 \
tee_split0. ! \
    queue ! \
    videoconvertscale ! video/x-raw,width=300,height=300,format=RGB ! \
    tensor_converter ! \
    tensor_transform mode=arithmetic option=typecast:float32,add:-127.5,div:127.5 ! \
    tensor_filter framework=tensorflow2-lite model=/usr/share/oob-demo-assets/models/ssd_mobilenet_v2_coco.tflite custom=Delegate:XNNPACK,NumThreads:4 latency=1 ! \
    tensor_decoder \
    mode=bounding_boxes \
        option1=mobilenet-ssd \
        option2=/usr/share/oob-demo-assets/labels/coco_labels.txt \
        option3=/usr/share/oob-demo-assets/labels/box_priors.txt \
        option4=1280:720 \
        option5=300:300 ! \
    mix.sink_0 \
tee_split0. ! \
    queue ! \
    mix.sink_1 \
compositor name=mix sink_0::zorder=2 sink_1::zorder=1 ! \
kmssink name=sink

The above GStreamer pipeline reads an H.264 video file, decodes it, and processes it for object detection using a TensorFlow Lite model, displaying bounding boxes around detected objects. The processed video is then composited and rendered on the screen using the kmssink element.

Attention

The Example Applications section is not applicable for AM64x