3. Inference Explained¶
This section explains how to invoke the inference function from the compiled library.
There are two inference scenarios, depending on the ti-npu type= option specified during compilation:
Running the model on a host processor with an optimized software implementation.
Running the model on a dedicated hardware accelerator, for example, the Neural Network Processing Unit (NPU).
3.1. Inference Function and Input/Output¶
The header file generated by the compiler (tvmgen_default.h), stored in the artifacts directory, provides information about the input/output data shapes and types and defines the input/output data structures and the inference function. For example:
/* The generated model library expects the following inputs/outputs:
* Inputs:
* Tensor[(1, 3, 128, 1), float32]
* Outputs:
* Tensor[(1, 6), float32]
* Tensor[(1, 60), float32]
*/
/*!
* \brief Input tensor pointers for TVM module "default"
*/
struct tvmgen_default_inputs {
  void* onnx__Add_0;
};
/*!
* \brief Output tensor pointers for TVM module "default"
*/
struct tvmgen_default_outputs {
  void* output0;
  void* output1;
};
/*!
* \brief entrypoint function for TVM module "default"
* \param inputs Input tensors for the module
* \param outputs Output tensors for the module
*/
int32_t tvmgen_default_run(
  struct tvmgen_default_inputs* inputs,
  struct tvmgen_default_outputs* outputs
);
Note
When a model name is specified using the compiler option (see Multiple Model Compilation & Inference), the default part of the header file name, the data structure names, and the model run function name is replaced with the specified model name. For example, if --module-name=a is specified, the header file name becomes tvmgen_a.h, the data structure names become tvmgen_a_inputs and tvmgen_a_outputs, and the model run function name becomes tvmgen_a_run.
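For instance, a minimal sketch of application code for a library compiled with --module-name=a (reusing the tensor shapes from the default example below; the buffer names are illustrative) might look like the following:
#include "tvmgen_a.h"

float float_input[1*3*128*1];
float float_output0[1*6];
float float_output1[1*60];

struct tvmgen_a_inputs inputs = { (void*) &float_input[0] };
struct tvmgen_a_outputs outputs = { (void*) &float_output0[0], (void*) &float_output1[0] };

/* ... inside application code, after the input buffer has been populated ... */
tvmgen_a_run(&inputs, &outputs);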
In the tvmgen_default.h example above, the neural network takes one tensor input and produces two tensor outputs. The expected input is a 1x3x128x1 tensor of 32-bit floating point values; the two outputs are 1x6 and 1x60 tensors of the same type. Define the input/output buffers in your application code and initialize the struct members with pointers to your input/output data. For example:
#include "tvmgen_default.h"
float float_input[1*3*128*1];
float float_output0[1*6];
float float_output1[1*60];
struct tvmgen_default_inputs inputs = { (void*) &float_input[0] };
struct tvmgen_default_outputs outputs = { (void*) &float_output0[0], (void*) &float_output1[0] };
/* ... code that computes float_input array before model inference ... */
When the skip_normalize=true and output_int=true compiler options are specified (see Performance Options), the compiler tries to skip the float-to-integer input normalization sequence at the beginning of the model and the integer-to-float type cast at its end, and to generate model inference code with integer inputs and outputs. If the compiler succeeds, the generated model library expects integer inputs and produces integer outputs. Refer to the tvmgen_default.h file for details about the input/output data types and the input feature normalization parameters.
3.1.1. NPU Quantized Model Input Normalization Parameters¶
/* The generated model library expects the following inputs/outputs:
* Inputs:
* Tensor[(1, 3, 128, 1), int8]
* Outputs:
* Tensor[(1, 6), int8]
* Tensor[(1, 60), uint8]
*/
/* Input feature normalization parameters:
* input_int = clip(((int32_t)((input_float + bias) * scale)) >> shift, min, max)
* where (min, max) = (-128, 127) if int8 type, (0, 255) if uint8 type
*/
#define TVMGEN_DEFAULT_BIAS_LEN 3
#define TVMGEN_DEFAULT_SCALE_LEN 1
#define TVMGEN_DEFAULT_SHIFT_LEN 1
extern const float tvmgen_default_bias_data[] __attribute__((weak)) = {12., 12., 18.};
extern const int32_t tvmgen_default_scale_data[] __attribute__((weak)) = {171};
extern const int32_t tvmgen_default_shift_data[] __attribute__((weak)) = {5};
3.1.2. CPU Quantized Model Input Normalization and Output Dequantization Parameters¶
/* The generated model library expects the following inputs/outputs:
* Inputs:
* Tensor[(1, 1, 28, 28), uint8]
* Outputs:
* Tensor[(1, 10), uint8]
*/
/* Input feature normalization parameters:
* input_int = clip(round(input_float * reciprocal_scale) + zero_point, min, max)
* where (min, max) = (-128, 127) if int8 type, (0, 255) if uint8 type
*/
extern const float tvmgen_default_input_reciprocal_scale_data[] __attribute__((weak)) = {127.1333};
extern const int32_t tvmgen_default_input_zero_point_data[] __attribute__((weak)) = {0};
/* Output dequantization parameters:
* output_float = scale * (output_int - zero_point)
*/
extern const float tvmgen_default_output_scale_data[] __attribute__((weak)) = {0.03781982};
extern const int32_t tvmgen_default_output_zero_point_data[] __attribute__((weak)) = {95};
3.2. Converting Input Data from Float to Int with Input Normalization¶
When the compiler successfully applies skip_normalize=true
optimization, you
must convert model input values from floating point to integer before invoking
the model inference function.
3.2.1. NPU Quantized Models¶
NPU QAT trained models can use
the tvmgen_default_bias_data
, TVMGEN_DEFAULT_BIAS_LEN
,
tvmgen_default_scale_data
, TVMGEN_DEFAULT_SCALE_LEN
, tvmgen_default_shift_data
,
and TVMGEN_DEFAULT_SHIFT_LEN
definitions in tvmgen_default.h
to convert model input
from float to int using the example code below.
#include <math.h>
#include "tvmgen_default.h"
float float_input[1*3*128*1];
int8_t int_input[1*3*128*1];
int8_t int_output0[1*6];
int8_t int_output1[1*60];
struct tvmgen_default_inputs inputs = { (void*) &int_input[0] };
struct tvmgen_default_outputs outputs = { (void*) &int_output0[0], (void*) &int_output1[0] };
/* ... code that computes the float_input array ... */
/* The following code does float to int normalization in user app before model inference */
for (int c = 0; c < 3; c++) {
  for (int h = 0; h < 128; h++) {
    float float_val = float_input[c * 128 + h];
    int32_t scaled_val = (int32_t) floorf((float_val
        + tvmgen_default_bias_data[c % TVMGEN_DEFAULT_BIAS_LEN])
        * tvmgen_default_scale_data[c % TVMGEN_DEFAULT_SCALE_LEN]);
    int32_t shifted_val = scaled_val >> tvmgen_default_shift_data[c % TVMGEN_DEFAULT_SHIFT_LEN];
    // clip to 8-bit range, use [-128, 127] for int8_t, [0, 255] for uint8_t
    if (shifted_val < -128) shifted_val = -128;
    if (shifted_val > 127) shifted_val = 127;
    int_input[c * 128 + h] = (int8_t) shifted_val;
  }
}
3.2.2. CPU Quantized Models¶
QDQ models can use the tvmgen_default_input_reciprocal_scale_data and tvmgen_default_input_zero_point_data definitions in tvmgen_default.h to convert the model input from float to int using the example code below.
#include <math.h>
#include "tvmgen_default.h"
float float_input[1*1*28*28];
uint8_t int_input[1*1*28*28];
uint8_t int_output[1*10];
struct tvmgen_default_inputs inputs = { (void*) &int_input[0] };
struct tvmgen_default_outputs outputs = { (void*) &int_output[0] };
/* ... code that computes the float input array ... */
/* The following code does float to int normalization in user app before model inference */
for (int h = 0; h < 28; h++) {
  for (int w = 0; w < 28; w++) {
    float float_val = float_input[h * 28 + w];
    int32_t int_val = (int32_t) roundf(float_val * tvmgen_default_input_reciprocal_scale_data[0])
        + tvmgen_default_input_zero_point_data[0];
    // clip to 8-bit range, use [-128, 127] for int8_t, [0, 255] for uint8_t
    if (int_val < 0) int_val = 0;
    if (int_val > 255) int_val = 255;
    int_input[h * 28 + w] = (uint8_t) int_val;
  }
}
3.3. Output Dequantization for CPU Quantized Models¶
If the user desires the QDQ model output in float with the
output_int=true
option specified, then they should perform the output dequantization within the user application.
3.3.1. Example¶
CPU quantized QDQ models can use the tvmgen_default_output_scale_data and tvmgen_default_output_zero_point_data definitions in tvmgen_default.h to convert the model output from integer to float using the example code below.
#include <math.h>
#include "tvmgen_default.h"
float float_input[1*1*28*28];
uint8_t int_input[1*1*28*28];
uint8_t int_output[1*10];
float float_output[1*10];
struct tvmgen_default_inputs inputs = { (void*) &int_input[0] };
struct tvmgen_default_outputs outputs = { (void*) &int_output[0] };
/* perform input feature normalization float_input -> int_input */
/* call the model inference function */
tvmgen_default_run(inputs, outputs);
/* The following code dequantizes the model output */
for (int j = 0; j < 10; j++) {
  float_output[j] = tvmgen_default_output_scale_data[0]
      * (int_output[j] - tvmgen_default_output_zero_point_data[0]);
}
Note
On C2000 devices, since there are no 8-bit integer data types, int8_t is aliased to int16_t, and uint8_t is aliased to uint16_t. This type aliasing is consistent with the TI C2000Ware SDK. Please use int16_t/uint16_t to declare your input/output tensor data accordingly. The input feature normalization sequence should still clip the values to the range [-128, 127] or [0, 255], respectively.
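As a rough sketch (reusing the 1x1x28x28 input and 1x10 output shapes from the CPU QDQ example above), the corresponding buffer declarations on a C2000 device might look like this:
#include "tvmgen_default.h"

/* On C2000, uint8_t is a 16-bit wide alias, so each element below occupies
 * 16 bits; the input normalization must still clip values to [0, 255]. */
uint16_t int_input[1*1*28*28];
uint16_t int_output[1*10];

struct tvmgen_default_inputs inputs = { (void*) &int_input[0] };
struct tvmgen_default_outputs outputs = { (void*) &int_output[0] };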
3.4. Running Model on Host Processor (ti-npu type=soft)¶
When running the model on the host processor, the symbol TVMGEN_DEFAULT_TI_NPU_SOFT is defined in the generated header file (tvmgen_default.h).
/* Symbol defined when running model on the host processor */
#define TVMGEN_DEFAULT_TI_NPU_SOFT
#ifdef TVMGEN_DEFAULT_TI_NPU
#error Conflicting definition for where model should run.
#endif
After the input/output data structures have been set up, running inference is as simple as invoking the inference function.
#include "tvmgen_default.h"
tvmgen_default_run(&inputs, &outputs);
When the inference function returns, the inference results are stored in outputs.
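For example, a minimal sketch that also checks the return value (assuming the common TVM convention that the generated run function returns 0 on success and a nonzero error code otherwise) and then reads the results might look like this:
#include "tvmgen_default.h"

/* ... inputs/outputs set up as shown earlier ... */
int32_t status = tvmgen_default_run(&inputs, &outputs);
if (status == 0) {
  /* Inference succeeded: the buffers referenced by outputs
   * (float_output0 and float_output1 in the earlier example) now hold the results. */
  /* ... application-specific post-processing ... */
} else {
  /* ... handle the inference error ... */
}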
3.5. Running Model on Hardware NPU Accelerator (ti-npu)¶
Because the NPU accelerator is a separate core that runs asynchronously to the host processor, the accelerator must be initialized by the application before it can be used. Perform this initialization by calling the TI_NPU_init() function.
As explained in Integrating Compilation Artifacts into a CCS Project, the hardware NPU requires interrupts to work; please ensure that interrupts are enabled in the CCS project.
After invoking the inference function, the application needs to check a volatile flag, as shown below, to see whether the inference has completed.
When running the model on the hardware NPU accelerator, the symbol TVMGEN_DEFAULT_TI_NPU is defined in the generated header file (tvmgen_default.h). Both the NPU initialization function and the flag for checking model completion are also declared in that header file.
/* Symbol defined when running model on TI NPU hardware accelerator */
#define TVMGEN_DEFAULT_TI_NPU
#ifdef TVMGEN_DEFAULT_TI_NPU_SOFT
#error Conflicting definition for where model should run.
#endif
/* TI NPU hardware accelerator initialization */
extern void TI_NPU_init();
/* Flag for model execution completion on TI NPU hardware accelerator */
extern volatile int32_t tvmgen_default_finished;
Example code for running inference on the hardware NPU accelerator is as follows:
#include "tvmgen_default.h"
#ifdef TVMGEN_DEFAULT_TI_NPU
TI_NPU_init(); /* one time initialization */
#endif
/* ... other code can go here ... */
tvmgen_default_run(&inputs, &outputs);
/* ... other code can go here ... */
#ifdef TVMGEN_DEFAULT_TI_NPU
while (!tvmgen_default_finished) ; /* busy-wait until the NPU signals completion */
#endif