6.2. NPU Guidelines

This guide covers the constraints and best practices for designing models that run on TI’s Neural Processing Unit (TINPU).

6.2.1. NPU-Enabled Devices

F28P55 - C2000 family
AM13E2 - AM13 family
MSPM0G5187 - MSPM0 family

6.2.2. Layer Constraints

Models running on the NPU must follow these constraints:

First Convolution Layer (FCONV)

Input channels	Must be 1
Output channels	Must be multiple of 4
Kernel width	Maximum 8

Generic Convolution Layer (GCONV)

Input channels	Must be multiple of 4
Output channels	Must be multiple of 4
Kernel height	Maximum 7 (critical constraint)
Stride	Supported: (1,1), (2,1), (2,2)

Depthwise Convolution (DWCONV)

Groups	Must equal input channels (true depthwise)
Kernel width	Maximum 7

Pointwise Convolution (PWCONV)

Kernel size	Must be (1, 1)
Channels	Must be multiples of 4

Pooling Layers

MaxPool kernel	Maximum 4x4
AvgPool (global)	Input size must satisfy (H × W) > 2

Fully Connected (FC) Layer

Input features (8-bit)	Minimum 16
Input features (4-bit)	Minimum 8

6.2.3. Using NPU-Compatible Models

Use model names ending in _NPU:

training:
  model_name: 'CLS_1k_NPU'    # NPU-compatible
  # not: model_name: 'CLS_1k'  # Non-NPU version

Available NPU models:

Classification: CLS_100_NPU through CLS_55k_NPU
Regression: REGR_500_NPU through REGR_20k_NPU
Anomaly Detection: AD_500_NPU through AD_20k_NPU
Forecasting: FCST_500_NPU through FCST_20k_NPU

6.2.4. Channel Multiples of 4

All intermediate channels must be multiples of 4:

Correct:

Input: 1 channel
Conv1: 1 → 4 channels
Conv2: 4 → 8 channels
Conv3: 8 → 16 channels
FC: 16 → num_classes

Incorrect:

Input: 1 channel
Conv1: 1 → 3 channels    # NOT multiple of 4
Conv2: 3 → 6 channels    # NOT multiple of 4

6.2.5. Kernel Size Restrictions

The most common issue is kernel height exceeding 7:

Correct:

# Kernel (5, 1) - height 5 is OK
# Kernel (7, 1) - height 7 is OK (maximum)

Incorrect:

# Kernel (8, 1) - height 8 exceeds limit
# Kernel (9, 1) - NOT supported

6.2.6. Compilation Preset

For NPU devices, use the appropriate compilation preset:

compilation:
  enable: True
  preset_name: 'compress_npu_layer_data'  # For NPU devices

The compress_npu_layer_data preset optimizes memory layout for NPU.

6.2.7. Custom NPU-Compatible Models

When creating custom models for NPU, follow this template:

class MY_NPU_MODEL(GenericModelWithSpec):
    def __init__(self, config, input_features=128, variables=1, num_classes=3):
        super().__init__(config, input_features=input_features,
                        variables=variables, num_classes=num_classes)
        self.model_spec = self.gen_model_spec()
        self._init_model_from_spec(...)

    def gen_model_spec(self):
        layers = py_utils.DictPlus()

        # First conv: in_channels=1 (variables), out_channels=4 (multiple of 4)
        layers += {'0': dict(type='ConvBNReLULayer',
                            in_channels=self.variables,  # Must be 1 for FCONV
                            out_channels=4,              # Multiple of 4
                            kernel_size=(5, 1),          # Height ≤ 7
                            stride=(1, 1))}

        # Subsequent convs: all channels multiple of 4
        layers += {'1': dict(type='ConvBNReLULayer',
                            in_channels=4,
                            out_channels=8,
                            kernel_size=(5, 1),
                            stride=(1, 1))}

        # MaxPool: kernel ≤ 4
        layers += {'2': dict(type='MaxPoolLayer',
                            kernel_size=(2, 1),
                            stride=(2, 1))}

        # FC: input features ≥ 16
        layers += {'3': dict(type='ReshapeLayer', ndim=2)}
        layers += {'4': dict(type='LinearLayer',
                            in_features=...,  # ≥ 16
                            out_features=self.num_classes)}

        return dict(model_spec=layers)

6.2.8. Troubleshooting NPU Compilation

“Channel count not multiple of 4”

Adjust your model architecture to use channels that are multiples of 4.

“Kernel size exceeds limit”

Reduce kernel height to 7 or less. Use multiple smaller kernels instead.

“Unsupported layer type”

Check that all layers are in the supported list (Conv, Pool, FC, BN, ReLU).

“FC input features too small”

Ensure the FC layer receives at least 16 input features.

6.2.9. Performance Comparison

Example inference times (approximate):

Model	CPU (F28P55)	NPU (F28P55)	Speedup
CLS_1k	2000 µs	150 µs	~13x
CLS_4k	5000 µs	300 µs	~17x
CLS_13k	15000 µs	600 µs	~25x

Actual performance varies by model architecture and input size.