9.2. NPU Device Deployment
This guide covers deployment to TI devices with Neural Processing Unit (NPU) hardware acceleration.
9.2.1. NPU-Enabled Devices
The following devices include TI’s TINPU:
| Device | Family | NPU Features |
|---|---|---|
| F28P55 | C2000 | 8-bit/4-bit inference, up to 60k params |
| AM13E2 | AM13 | 8-bit inference, Cortex-M33 + NPU |
| MSPM0G5187 | MSPM0 | 8-bit inference, ultra-low power |
9.2.2. NPU Compilation
To compile for NPU, use the correct preset:
common:
  target_device: 'F28P55'                   # NPU device
training:
  model_name: 'CLS_4k_NPU'                  # NPU-compatible model
compilation:
  enable: True
  preset_name: 'compress_npu_layer_data'    # NPU optimization
The compress_npu_layer_data preset:
- Optimizes memory layout for the NPU
- Compresses weight data
- Generates NPU-specific code
9.2.3. NPU Model Requirements
Models must follow NPU constraints (see NPU Guidelines):
- Use model names ending in _NPU
- Channel counts must be multiples of 4
- Kernel heights ≤ 7
- Must use INT8 or INT4 quantization
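As an illustration, the layer-shape constraints above can be checked programmatically before committing to a training run. The helper below is a sketch, not a TI API; it simply encodes the two numeric limits from the list (channel counts divisible by 4, kernel height at most 7):

```c
#include <stdbool.h>

/* Hypothetical pre-flight check for one conv layer against the NPU
 * constraints listed above. */
bool npu_layer_is_valid(int in_channels, int out_channels, int kernel_h) {
    if (in_channels % 4 != 0 || out_channels % 4 != 0) {
        return false;   /* channel counts must be multiples of 4 */
    }
    if (kernel_h > 7) {
        return false;   /* kernel height limit */
    }
    return true;
}
```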
9.2.4. NPU Compilation Artifacts
After compilation:
.../compilation/artifacts/
├── mod.a # Compiled library (includes NPU code)
├── mod.h # Model interface
├── model_config.h # NPU configuration
├── npu_layer_data.bin # NPU weight data
├── feature_extraction.c # Feature extraction
└── inference_example.c # Example code
9.2.5. NPU Initialization
NPU requires initialization before inference:
#include "mod.h"
#include "npu.h"

void main(void) {
    // Initialize system
    System_Init();

    // Initialize NPU hardware
    NPU_Init();

    // Initialize model (loads weights to NPU)
    mod_init();

    // Now ready for inference
    while (1) {
        if (data_ready) {
            run_npu_inference();
        }
    }
}
9.2.6. NPU Inference Code
#include "mod.h"
#include "feature_extraction.h"

// Buffers
float input_buffer[INPUT_SIZE];
float feature_buffer[FEATURE_SIZE];
float output_buffer[NUM_CLASSES];

void run_npu_inference(void) {
    // 1. Collect sensor data
    collect_sensor_data(input_buffer);

    // 2. Extract features (runs on CPU)
    extract_features(input_buffer, feature_buffer);

    // 3. Run NPU inference
    //    NPU handles quantization internally
    mod_inference(feature_buffer, output_buffer);

    // 4. Get prediction
    int prediction = argmax(output_buffer, NUM_CLASSES);

    // 5. Act on result
    handle_prediction(prediction);
}
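The argmax helper called above is not one of the generated artifacts, so you may need to supply it yourself. A minimal implementation could look like this:

```c
#include <stddef.h>

// Return the index of the largest element, i.e. the predicted class.
int argmax(const float *values, size_t count) {
    size_t best = 0;
    for (size_t i = 1; i < count; i++) {
        if (values[i] > values[best]) {
            best = i;
        }
    }
    return (int)best;
}
```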
9.2.7. NPU Memory Management
NPU requires specific memory regions:
Weight Memory:
NPU weights are stored in dedicated memory:
// Linker command file
MEMORY
{
    NPU_WEIGHTS : origin = 0x00080000, length = 0x00010000
}

SECTIONS
{
    .npu_weights : > NPU_WEIGHTS
}
Activation Memory:
NPU uses scratch memory for intermediate results:
// Allocate NPU scratch buffer
#pragma DATA_SECTION(npu_scratch, ".npu_scratch")
uint8_t npu_scratch[NPU_SCRATCH_SIZE];
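NPU scratch and I/O buffers commonly carry alignment requirements as well. The exact requirement is device-specific and not stated here, so the 32-byte figure below is a placeholder assumption; the pattern of declaring an aligned buffer and verifying it at runtime is what this sketch illustrates:

```c
#include <stdint.h>
#include <stdalign.h>
#include <stdbool.h>

/* Assumed 32-byte alignment requirement -- check your device's NPU
 * documentation for the real value. */
#define NPU_ALIGN 32u

static alignas(NPU_ALIGN) uint8_t npu_scratch[1024];

// Returns true when the pointer meets the assumed alignment.
bool is_npu_aligned(const void *p) {
    return ((uintptr_t)p % NPU_ALIGN) == 0;
}
```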
9.2.8. NPU Performance
Typical NPU performance on F28P55:
| Model | CPU Time | NPU Time | Speedup |
|---|---|---|---|
| CLS_1k_NPU | 2000 µs | 150 µs | 13x |
| CLS_4k_NPU | 5000 µs | 300 µs | 17x |
| CLS_13k_NPU | 15000 µs | 600 µs | 25x |
Note: Actual performance depends on model architecture and input size.
9.2.9. NPU Power Considerations
NPU can be power-managed:
// Disable NPU when not in use
void enter_low_power(void) {
    NPU_Disable();   // Saves power
}

// Re-enable before inference
void prepare_inference(void) {
    NPU_Enable();
    // May need a small delay for the NPU to stabilize
    delay_us(10);
}
```
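The two routines above can be combined into a duty-cycled wrapper that powers the NPU only around each inference call. This is a sketch of the control flow only: the driver calls are stubbed out here so the example is self-contained, and the stubs are not the real npu.h functions.

```c
#include <stdint.h>

/* Stubs standing in for the real driver calls from npu.h. */
static int npu_powered = 0;
static void NPU_Enable(void)      { npu_powered = 1; }
static void NPU_Disable(void)     { npu_powered = 0; }
static void delay_us(uint32_t us) { (void)us; }

/* Power the NPU up only for the duration of one inference call. */
void duty_cycled_inference(void (*run_inference)(void)) {
    NPU_Enable();
    delay_us(10);        /* settle time before first use */
    run_inference();
    NPU_Disable();       /* back to low power between inferences */
}
```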
9.2.10. NPU Debugging
Verify NPU Initialization:
if (NPU_GetStatus() != NPU_STATUS_READY) {
    // NPU initialization failed
    handle_error();
}
Check Inference Results:
Compare NPU results with expected values from training:
// Known test input
float test_input[] = {...};
float expected_output[] = {...};

mod_inference(test_input, output_buffer);

// Compare
float max_error = 0;
for (int i = 0; i < NUM_CLASSES; i++) {
    float error = fabs(output_buffer[i] - expected_output[i]);
    if (error > max_error) max_error = error;
}

// Quantization error should be small
if (max_error > 0.1) {
    // Unexpected deviation
    debug_print("Max error: %f\n", max_error);
}
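The comparison loop above can be factored into a small helper so the same check is reusable across multiple golden vectors. This is a convenience sketch, not part of the generated code:

```c
#include <math.h>
#include <stddef.h>

// Largest absolute elementwise difference between two vectors.
float max_abs_error(const float *actual, const float *expected, size_t n) {
    float max_err = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float err = fabsf(actual[i] - expected[i]);
        if (err > max_err) {
            max_err = err;
        }
    }
    return max_err;
}
```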
9.2.11. NPU Error Handling
Handle NPU errors gracefully:
int run_safe_inference(float* features, float* output) {
    // Check NPU status
    if (NPU_GetStatus() != NPU_STATUS_READY) {
        NPU_Reset();
        if (NPU_GetStatus() != NPU_STATUS_READY) {
            return -1;   // NPU unavailable
        }
    }

    // Run inference
    int result = mod_inference(features, output);
    if (result != 0) {
        // Inference error
        NPU_Reset();
        return -2;
    }

    return 0;   // Success
}
9.2.12. CCS Project Setup for NPU
1. Include NPU Support Files:
From your device SDK, add:
- NPU driver files
- NPU header files
- NPU configuration files
2. Configure Linker:
Ensure linker command file includes NPU memory regions.
3. Add Compiler Defines:
Project Properties → Build → Compiler → Predefined Symbols
Add: NPU_ENABLED=1
9.2.13. Example: Arc Fault on F28P55 NPU
Complete deployment example:
Configuration:
common:
  task_type: 'generic_timeseries_classification'
  target_device: 'F28P55'
dataset:
  dataset_name: 'dc_arc_fault_example_dsk'
training:
  model_name: 'CLS_4k_NPU'
  quantization: 2
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
compilation:
  enable: True
  preset_name: 'compress_npu_layer_data'
Main Application:
#include "device.h"
#include "mod.h"
#include "feature_extraction.h"
#include "npu.h"

#define SAMPLE_SIZE  1024
#define FEATURE_SIZE 256
#define NUM_CLASSES  2      // Normal, Arc

float adc_buffer[SAMPLE_SIZE];
float feature_buffer[FEATURE_SIZE];
float output_buffer[NUM_CLASSES];
volatile uint8_t inference_flag = 0;

void main(void) {
    // System initialization
    Device_init();
    Device_initGPIO();

    // Initialize ADC for current sensing
    ADC_Init();

    // Initialize NPU
    NPU_Init();

    // Initialize model
    mod_init();

    // Enable interrupts
    EINT;

    while (1) {
        if (inference_flag) {
            // Extract features
            extract_features(adc_buffer, feature_buffer);

            // Run NPU inference
            mod_inference(feature_buffer, output_buffer);

            // Check for arc fault
            if (output_buffer[1] > output_buffer[0]) {
                // Arc detected!
                GPIO_writePin(ALERT_PIN, 1);
                trigger_protection();
            }

            inference_flag = 0;
        }
    }
}

__interrupt void ADC_ISR(void) {
    static uint16_t sample_idx = 0;

    adc_buffer[sample_idx++] = ADC_readResult();

    if (sample_idx >= SAMPLE_SIZE) {
        sample_idx = 0;
        inference_flag = 1;
    }

    ADC_clearInterruptStatus();
}
9.2.14. Troubleshooting NPU Issues
NPU Initialization Fails:
- Check that the device is NPU-enabled
- Verify the NPU clock is enabled
- Ensure NPU memory regions are defined
Incorrect Results:
- Verify the model is NPU-compatible
- Check that quantization settings match
- Compare with the float model on the same input
NPU Hangs:
- Check for memory conflicts
- Verify buffer alignments
- Reset the NPU and retry
9.2.15. CCS Studio Walkthrough: F28P55x
This section provides a complete step-by-step walkthrough for deploying an arc fault classification model to the LAUNCHXL-F28P55X board using Code Composer Studio. The F28P55x device includes TI’s TINPU, so this example exercises the full NPU inference path.
Requirements
| Item | Requirement |
|---|---|
| LaunchPad | LAUNCHXL-F28P55X |
| SDK | C2000Ware 6.00 |
| IDE | CCS Studio 20.2.0 or later |
9.2.15.1. Step 1 – Load the Example from Resource Explorer
The F28P55x arc fault example is available directly through the CCS Resource Explorer.
Open Code Composer Studio.
Navigate to View → Resource Explorer.
Opening Resource Explorer from the View menu.
In the Resource Explorer, set the Board or Device filter to LAUNCHXL-F28P55X.
In the search bar, type arc_fault_dataset_validation_f28p55x.
Resource Explorer with board and keyword filters filled in.
Select the folder arc_fault_dataset_validation_f28p55x and click Import.
Importing the arc fault project from Resource Explorer.
Download and install any required dependencies when prompted.
Downloading required SDK dependencies.
After all packages are installed the final import dialog appears. Click Finish to import the project into your workspace.
Final import dialog after dependencies are resolved.
9.2.15.2. Step 2 – Build the Project
Go to Project → Build Project(s) (or press Ctrl+B).
Building the project from the Project menu.
Verify that the build completes without errors in the Console view.
9.2.15.3. Step 3 – Set Target Configuration
Switch the active target configuration from TMS320F28P550SJ9.ccxml to TMS320F28P550SJ9_LaunchPad.ccxml. Right-click the .ccxml file in Project Explorer and select Set as Active Target Configuration.
Selecting the LaunchPad target configuration.
9.2.15.4. Step 4 – Flash the Device
Connect the LAUNCHXL-F28P55X LaunchPad to your PC via USB.
Go to Run → Flash Project.
Flashing the built project to the device.
(Optional) If a firmware update prompt appears, click Update.
Firmware update dialog – click Update if it appears.
9.2.15.5. Step 5 – Debug and Verify
After flashing, the Debug perspective opens. Click the Debug icon to start a debug session.
CCS Debug perspective after flashing.
Place a breakpoint on the line that follows the inference call in application_main.c.
Setting a breakpoint after the inference call.
Click Resume (F8) to run the program. When the breakpoint is hit, add the variable test_result to the Watch window.
Adding test_result to the Watch window.
Inspect the value:
- test_result == 1 – model inference passed (output matches golden vector).
- test_result == 0 – model inference failed.
Verifying the test_result value in the Watch window.
9.2.16. Required Files from ModelMaker
The CCS example arc_fault_dataset_validation_f28p55x requires four files
generated by a ModelMaker run. After ModelMaker finishes, copy each file
from its ModelMaker output path to the corresponding CCS project path.
| File | Purpose | ModelMaker Source Path | CCS Project Destination |
|---|---|---|---|
|  | Compiled model library |  |  |
|  | Model inference API header |  |  |
|  | Golden-vector test data |  |  |
|  | Feature extraction config |  |  |
The ... prefix in the source paths expands to your ModelMaker data
directory, for example:
tinyml-modelmaker/data/projects/dc_arc_fault_example_dsk/run/<run_name>/
After copying the four files, rebuild the CCS project, flash, and verify
test_result in the debugger as described above.
9.2.17. Model Performance Profiling
Understanding inference performance is critical when deploying models to resource-constrained MCUs. This section describes how to measure inference cycle counts on the F28P55x device so you can evaluate the tradeoff between model accuracy and inference latency. Developers typically want to minimize cycles (faster inference), but reducing computation can also reduce model accuracy.
By profiling different model and input-size combinations, you can select the configuration that meets your latency budget while maintaining acceptable accuracy.
Requirements
Device: LAUNCHXL-F28P55X
C2000Ware 6.00.00.00
Code Composer Studio 20.2.0
Importing the Profiling Project
The example project ex_model_performance_f28p55x is not available
through CCS Resource Explorer. You must manually copy it into the C2000Ware AI
examples directory and import it:
Copy the ex_model_performance_f28p55x project folder.
Paste it into the C2000Ware AI examples path:
C2000Ware_6_00_00_00/libraries/ai/feature_extract/c28/examples/
In CCS, go to File -> Import Project(s).
Browse to the ex_model_performance_f28p55x folder and click Select Folder.
Click Finish to import the project.
Running the Profiling Example
Build the project: Project -> Build Project(s).
Set the active target configuration to TMS320F28P550SJ9_LaunchPad.ccxml (matching your LAUNCHXL-F28P55X).
Connect the LaunchPad to your system.
Flash the project: Run -> Flash Project.
After flashing, click the Debug icon to enter debug mode.
Click Continue in the Debug Window to let the example run.
Read the inference cycle count from the GEL Output window.
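The example reports its cycle count through the GEL Output window, but the underlying measurement pattern can also be wired up manually around any free-running counter (for example a CPU timer clocked at SYSCLK). The sketch below shows only the pattern: read_cycles() is a stand-in you would map to the real timer register, and the software counter exists solely to keep the code self-contained.

```c
#include <stdint.h>

/* Stand-in for a free-running hardware cycle counter; on a real device
 * this would read a CPU timer register. */
static uint32_t sim_counter = 0;
static uint32_t read_cycles(void) { return sim_counter; }

/* Workload placeholder -- in the real project this would be
 * mod_inference(). Advances the simulated counter to mimic elapsed time. */
static void run_model(void) { sim_counter += 1000; }

/* Measure elapsed cycles around one inference call. Unsigned
 * subtraction stays correct across a single counter wrap. */
uint32_t profile_inference_cycles(void) {
    uint32_t start = read_cycles();
    run_model();
    return read_cycles() - start;
}
```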
Required Files from ModelMaker
The CCS project ex_model_performance_f28p55x requires three files generated
by ModelMaker after a training and compilation run:
| File | Description | Destination in CCS Project |
|---|---|---|
|  | Compiled model library |  |
|  | Header for model inference APIs |  |
|  | Model input/output size configuration |  |
After each ModelMaker run, copy these three files from the ModelMaker output into the corresponding CCS project locations, rebuild the project, flash, and debug to obtain the cycle count.
Configuring Different Models and Frame Sizes
To sweep across different configurations, edit the ModelMaker YAML
configuration. The two key parameters are frame_size (which controls the
input tensor dimension) and model_name (which selects the neural network
architecture):
data_processing_feature_extraction:
  feature_extraction_name: 'Custom_Default'
  data_proc_transforms: ['SimpleWindow']
  frame_size: 128                          # Input to the model (N,C,H,W) -> (1,1,128,1)
training:
  enable: True
  model_name: 'TimeSeries_Generic_1k_t'    # Select the model
By sweeping frame_size over values such as 128, 256, 512, and 1024 and
choosing among TimeSeries_Generic_1k_t, TimeSeries_Generic_4k_t,
TimeSeries_Generic_6k_t, and TimeSeries_Generic_13k_t, you can
characterize the full accuracy-vs-cycles landscape for your application.
Profiling Results
The table below shows measured inference cycle counts and corresponding
accuracies on the arc fault dataset. Input size is
(N,C,H,W) = (1, 1, frame_size, 1) and the output size is 2 (binary
classification).
| Model | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| TimeSeries_Generic_1k_t | 103,882 cycles (80.32%) | 188,397 cycles (85.83%) | 372,242 cycles (87.33%) | 692,220 cycles (94.43%) |
| TimeSeries_Generic_4k_t | 71,595 cycles (86.45%) | 109,534 cycles (89.29%) | 184,981 cycles (91.79%) | 326,782 cycles (94.85%) |
| TimeSeries_Generic_6k_t | 107,982 cycles (89.97%) | 164,676 cycles (90.05%) | 261,792 cycles (92.62%) | 462,680 cycles (95.68%) |
| TimeSeries_Generic_13k_t | 199,691 cycles (91.83%) | 312,406 cycles (92.60%) | 535,437 cycles (93.21%) | 985,529 cycles (95.54%) |
Key Takeaways
- Larger models (higher parameter count) generally deliver better accuracy but consume more inference cycles.
- Larger frame sizes improve accuracy at the cost of proportionally more cycles.
- TimeSeries_Generic_4k_t offers the best cycles-per-accuracy ratio for smaller input sizes, making it a strong default choice when latency is constrained.
- Use this profiling workflow to select the model and frame size combination that fits within your application's latency budget while meeting accuracy requirements.
9.2.18. Next Steps
- Review NPU Guidelines for model constraints
- See CCS Integration Guide for general CCS setup
- Check Common Errors for issues