9.3. Non-NPU Deployment
This guide covers deployment to TI devices without NPU hardware acceleration. These devices run inference entirely on the CPU.
9.3.1. Non-NPU Devices
Devices without NPU include:
C2000 Family (F28x):
F28003x, F28004x, F28P65x
F280013x, F280015x, F2837x
C2000 Family (F29x):
F29H85x, F29P58x, F29P32x
MSPM0 Family:
MSPM0G3507, MSPM0G3519
MSPM33 Family:
MSPM33C321Ax
Sitara MCU Family (AM26x):
AM263, AM263P, AM261
Connectivity:
CC2755, CC1352, CC1354, CC35X1
9.3.2. Configuration
For non-NPU devices, use standard models:
common:
  target_device: 'F28P65'  # Non-NPU device
training:
  model_name: 'CLS_4k'  # Standard model (no _NPU suffix)
compilation:
  enable: True
  preset_name: 'default_preset'  # Standard compilation
9.3.3. Model Selection
Without NPU acceleration, choose smaller models:
| Device Class | Recommended Size | Model Examples |
|---|---|---|
| Entry-level (M0+) | 100-500 params | CLS_100, CLS_500 |
| Mid-range | 500-2k params | CLS_1k, CLS_2k |
| High-performance | 2k-6k params | CLS_4k, CLS_6k |
| AM26x (Cortex-R5) | Up to 13k params | CLS_6k, CLS_13k |
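To relate parameter counts to storage, the sketch below estimates how many bytes the weights alone occupy. It assumes that a name such as CLS_4k corresponds to roughly 4000 parameters, as the table suggests, and it ignores code, activations, and any per-layer metadata. The function name `weight_storage_bytes` is hypothetical and not part of the ModelMaker output.

```c
#include <stddef.h>

/* Approximate bytes needed to store model weights only.
 * num_params:      parameter count (about 4000 for a "4k" model, assumed).
 * bits_per_weight: 32 for float weights, 8 for INT8-quantized weights. */
static size_t weight_storage_bytes(size_t num_params, unsigned bits_per_weight)
{
    /* Round up to a whole number of 8-bit bytes. */
    return (num_params * bits_per_weight + 7u) / 8u;
}
```

Under these assumptions, a 4000-parameter model needs about 16000 bytes with 32-bit float weights and about 4000 bytes with 8-bit weights.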
9.3.4. CPU Inference Performance
Typical inference times (CPU-only):
| Model | F28P65 | MSPM0G3507 | AM263 | CC2755 |
|---|---|---|---|---|
| CLS_500 | 500 µs | 800 µs | 200 µs | 600 µs |
| CLS_1k | 1000 µs | 1500 µs | 400 µs | 1200 µs |
| CLS_4k | 4000 µs | 6000 µs | 1500 µs | 5000 µs |
Note: Times are approximate and depend on clock frequency.
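One way to use these figures is to estimate how much CPU time inference consumes at a given inference rate. The helper below is a minimal sketch; `cpu_load_percent` is a hypothetical name, and the result covers the model call only, not data collection or feature extraction.

```c
#include <stdint.h>

/* Percentage of CPU time spent on inference when one inference of
 * inference_us microseconds runs every period_us microseconds.
 * A result above 100 means the period is too short for the model. */
static uint32_t cpu_load_percent(uint32_t inference_us, uint32_t period_us)
{
    if (period_us == 0u) {
        return UINT32_MAX; /* invalid period: report as overloaded */
    }
    return (uint32_t)(((uint64_t)inference_us * 100u) / period_us);
}
```

For example, using the approximate 6000 µs figure for CLS_4k on MSPM0G3507, running one inference every 100 ms corresponds to about 6% CPU load.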
9.3.5. Compilation Artifacts
Non-NPU compilation produces:
.../compilation/artifacts/
├── mod.a # Model library (CPU code)
├── mod.h # Model interface
├── model_config.h # Configuration
├── feature_extraction.c # Feature extraction
└── inference_example.c # Example code
9.3.6. CCS Project Setup
Import the Project:
Importing a project into CCS for non-NPU devices
Build the Project:
Building the project for non-NPU deployment
Flash and Debug:
Flashing the application to a non-NPU device
CCS Debug perspective for non-NPU deployment
9.3.7. Basic Integration
#include "mod.h"
#include "feature_extraction.h"

float input_buffer[INPUT_SIZE];
float feature_buffer[FEATURE_SIZE];
float output_buffer[NUM_CLASSES];

void run_inference(void) {
    // Collect data
    collect_sensor_data(input_buffer);

    // Extract features
    extract_features(input_buffer, feature_buffer);

    // Run CPU inference
    mod_inference(feature_buffer, output_buffer);

    // Get result
    int prediction = argmax(output_buffer, NUM_CLASSES);
    handle_result(prediction);
}
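The integration code calls `argmax`, which is not defined in the listing. A minimal sketch of such a helper follows, assuming the class scores are floats and that ties should resolve to the lowest index. The generated project may ship its own equivalent.

```c
#include <stddef.h>

/* Return the index of the largest score. Ties resolve to the lowest index.
 * Returns -1 for a NULL or empty array. */
static int argmax(const float *scores, size_t count)
{
    if (scores == NULL || count == 0u) {
        return -1;
    }
    size_t best = 0u;
    for (size_t i = 1u; i < count; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return (int)best;
}
```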
9.3.8. Optimizing CPU Inference
1. Enable Compiler Optimizations:
Project Properties → Build → Compiler → Optimization
Level: 4 (Highest)
Speed vs Size: Speed
2. Use Fixed-Point When Possible:
If your model supports fixed-point:
training:
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
INT8 operations are faster than floating-point operations on many MCUs, especially those without an FPU, such as the Cortex-M0+ based MSPM0.
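To illustrate what 8-bit quantization does to a value, the sketch below implements generic symmetric INT8 quantization with a single scale factor. This is a textbook scheme shown for intuition only. The scheme that ModelMaker's QAT flow actually applies is not described here and may differ, and the function names are hypothetical.

```c
#include <stdint.h>

/* Symmetric INT8 quantization: q = round(x / scale), clamped to [-127, 127].
 * scale must be positive. */
static int8_t quantize_int8(float x, float scale)
{
    float scaled = x / scale;
    /* Clamp before converting so the cast never overflows. */
    if (scaled > 127.0f) scaled = 127.0f;
    if (scaled < -127.0f) scaled = -127.0f;
    /* Round half away from zero without needing <math.h>. */
    return (int8_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Map a quantized value back to its approximate real value. */
static float dequantize_int8(int8_t q, float scale)
{
    return (float)q * scale;
}
```

With a scale of 0.25, the value 0.3 quantizes to 1 and dequantizes to 0.25, which shows the rounding error that quantization-aware training is meant to compensate for.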
3. Place Critical Code in Fast Memory:
#pragma CODE_SECTION(mod_inference, ".TI.ramfunc")
4. Optimize Feature Extraction:
Use simpler feature extraction if possible:
data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_RAW_256Feature_1Frame'
9.3.9. Memory Optimization
Non-NPU devices may have limited RAM:
Minimize Buffer Sizes:
data_processing_feature_extraction:
  # Fewer features (32 per frame x 4 frames) need smaller buffers
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'
Use Static Allocation:
// Static allocation - size known at compile time
static float feature_buffer[FEATURE_SIZE];
static float output_buffer[NUM_CLASSES];
Memory Map Check:
Verify model fits in available memory:
After building, check .map file:
.text (code): XX KB
.const (weights): XX KB
.bss (buffers): XX KB
Compare with device memory:
Flash: XXX KB
RAM: XX KB
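The comparison above can be written down as a small host-side or bring-up check. The sketch below assumes you copy the section sizes from the .map file by hand. The struct and function names are hypothetical, and the check ignores stack, heap, and any code relocated to RAM (for example via .TI.ramfunc), all of which also consume RAM.

```c
#include <stdbool.h>
#include <stddef.h>

/* Section sizes read manually from the linker .map file, in bytes. */
typedef struct {
    size_t text_bytes;  /* .text  (code)    */
    size_t const_bytes; /* .const (weights) */
    size_t bss_bytes;   /* .bss   (buffers) */
} map_sizes_t;

/* True if code plus weights fit in flash and buffers fit in RAM. */
static bool model_fits(const map_sizes_t *m, size_t flash_bytes, size_t ram_bytes)
{
    return (m->text_bytes + m->const_bytes) <= flash_bytes
        && m->bss_bytes <= ram_bytes;
}
```

Leave headroom beyond a bare pass: a build that fits with only a few bytes to spare is likely to fail once the application grows.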
9.3.10. Power Optimization
For battery-powered devices:
1. Duty Cycle Inference:
void main(void) {
    while (1) {
        // Wake up
        wake_from_sleep();

        // Run inference
        run_inference();

        // Sleep
        enter_low_power_mode();
    }
}
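The benefit of duty cycling can be estimated from the active and sleep currents. The sketch below computes the time-weighted average current for a periodic wake, infer, sleep loop. The function name is hypothetical, and the current values you pass in must come from the device datasheet or from measurement.

```c
/* Average current in microamps for a loop that is active for active_us
 * out of every period_us and sleeps for the remainder. */
static float average_current_ua(float active_ua, float sleep_ua,
                                float active_us, float period_us)
{
    /* No time left to sleep: the device is effectively always active. */
    if (period_us <= 0.0f || active_us >= period_us) {
        return active_ua;
    }
    float duty = active_us / period_us;
    return active_ua * duty + sleep_ua * (1.0f - duty);
}
```

Because the sleep current is usually orders of magnitude below the active current, the average is dominated by the duty cycle, so shortening inference time or lengthening the period both reduce it almost proportionally.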
2. Reduce Clock During Inference:
Some devices allow dynamic clocking:
// Run at lower clock for power savings
// (trades off latency for power)
set_clock_speed(CLOCK_40MHZ);
run_inference();
3. Use Smallest Sufficient Model:
training:
  model_name: 'CLS_500'  # Smaller = less energy
9.3.11. Real-Time Considerations
For real-time applications:
Worst-Case Execution Time (WCET):
Measure inference time to ensure deadlines are met:
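// Measure the worst observed execution time over 1000 runs
uint32_t max_time = 0;
for (int i = 0; i < 1000; i++) {
    uint32_t start = get_timer();
    run_inference();
    uint32_t elapsed = get_timer() - start;
    if (elapsed > max_time) max_time = elapsed;
}
// max_time is the largest observed time. It is an estimate of the WCET,
// not a guaranteed bound, so add margin when checking deadlines.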
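A measured maximum is only a lower bound on the true worst case, so it is common to pad it before comparing against a deadline. The helpers below are a sketch under two assumptions: the timer is a free-running 32-bit up-counter, and a fixed percentage margin is acceptable for your application. Both function names are hypothetical.

```c
#include <stdint.h>

/* Elapsed ticks for a free-running 32-bit up-counter. Unsigned subtraction
 * gives the right answer even if the counter wrapped once between reads. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Returns 1 if the observed maximum, padded by margin_percent,
 * still fits within the deadline. Otherwise returns 0. */
static int meets_deadline(uint32_t observed_max, uint32_t deadline,
                          uint32_t margin_percent)
{
    uint64_t padded = (uint64_t)observed_max * (100u + margin_percent) / 100u;
    return padded <= deadline;
}
```

For a down-counting timer, swap the operands in `elapsed_ticks` so that the later (smaller) reading is subtracted from the earlier one.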
Interrupt Latency:
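A long inference call can delay other work. If it runs inside an ISR or with interrupts disabled, it also delays interrupt handling: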
// Option 1: Run inference at low priority
void low_priority_task(void) {
    run_inference();
}

// Option 2: Split inference into chunks
void inference_chunk(int chunk_id) {
    mod_inference_partial(chunk_id, feature_buffer, output_buffer);
}
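If inference is split into chunks, something has to track which chunk runs next. The sketch below shows one possible cooperative stepper that runs a single chunk per call, so the main loop can service other work between chunks. All names here are hypothetical, and `count_chunk` is only a stand-in for a real per-chunk inference call.

```c
#include <stdbool.h>

/* Callback that executes one chunk of work. */
typedef void (*chunk_fn_t)(int chunk_id);

typedef struct {
    chunk_fn_t run_chunk; /* per-chunk work, e.g. a partial inference call */
    int num_chunks;       /* total chunks in one full inference */
    int next_chunk;       /* index of the chunk to run on the next step */
} chunked_job_t;

/* Run at most one chunk. Returns true once the final chunk has completed,
 * and resets the job so the next call starts a new inference. */
static bool chunked_job_step(chunked_job_t *job)
{
    if (job->next_chunk < job->num_chunks) {
        job->run_chunk(job->next_chunk);
        job->next_chunk++;
    }
    if (job->next_chunk >= job->num_chunks) {
        job->next_chunk = 0;
        return true;
    }
    return false;
}

/* Stand-in chunk function that only counts invocations. */
static int chunks_run = 0;
static void count_chunk(int chunk_id)
{
    (void)chunk_id;
    chunks_run++;
}
```

Calling `chunked_job_step` once per main-loop iteration bounds the time spent in inference per iteration to the duration of the longest chunk.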
9.3.12. Device-Specific Notes
C2000 (F28P65, F2837, etc.):
Strong floating-point unit
Good for signal processing
Use FPU-optimized libraries
// Enable FPU
FPU_enableModule();
MSPM0 (Cortex-M0+):
No FPU (software float)
Prefer INT8 quantization
Keep models small (<1k params)
training:
  model_name: 'CLS_500'
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
AM26x (Cortex-R5):
High performance
FPU available
Can handle larger models
training:
  model_name: 'CLS_6k'  # or larger
CC27xx/CC13xx/CC35xx (Connectivity):
Supported devices: CC2755, CC1352, CC1354, CC35X1
Balance model vs wireless stack memory
Consider inference frequency vs RF activity
9.3.13. Example: Vibration Monitoring on MSPM0G3507
common:
  task_type: 'generic_timeseries_anomalydetection'
  target_device: 'MSPM0G3507'
dataset:
  dataset_name: 'vibration_dataset'
data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'
  variables: 1
training:
  model_name: 'AD_500'  # Small model for M0+
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
compilation:
  enable: True
Application Code:
#include "ti_msp_dl_config.h"
#include "mod.h"
#include "feature_extraction.h"

#define SAMPLE_SIZE 256
#define FEATURE_SIZE 128
#define THRESHOLD 0.5f

float adc_buffer[SAMPLE_SIZE];
float feature_buffer[FEATURE_SIZE];
float output;  // Reconstruction error

int main(void) {
    SYSCFG_DL_init();

    // Initialize model
    mod_init();

    while (1) {
        // Collect vibration data
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            DL_ADC12_startConversion(ADC0);
            while (!DL_ADC12_isConversionComplete(ADC0));
            adc_buffer[i] = DL_ADC12_getMemResult(ADC0, 0);
        }

        // Extract features
        extract_features(adc_buffer, feature_buffer);

        // Run anomaly detection
        mod_inference(feature_buffer, &output);

        // Check threshold
        if (output > THRESHOLD) {
            // Anomaly detected
            DL_GPIO_setPins(ALERT_PORT, ALERT_PIN);
        } else {
            DL_GPIO_clearPins(ALERT_PORT, ALERT_PIN);
        }

        // Enter low power until next sample period
        __WFI();
    }
}
9.3.14. CCS Studio Walkthrough: F28004x
This section provides a complete step-by-step walkthrough for deploying an arc fault classification model to the LAUNCHXL-F28004X board using Code Composer Studio. The F28004x does not have an NPU, so inference runs entirely on the CPU.
Important
Unlike the F28P55x example, the F28004x arc fault project is not available in the CCS Resource Explorer. You must import it manually using File → Import Project(s).
Requirements
| LaunchPad | LAUNCHXL-F28004X |
| SDK | C2000Ware 6.00 |
| IDE | CCS Studio 20.2.0 or later |
9.3.14.1. Step 1 – Import the Project Manually
Because this example is not listed in Resource Explorer, use the manual import flow.
Open Code Composer Studio.
Go to File → Import Project(s).
Selecting Import Projects from the File menu.
In the import dialog, click Browse and navigate to the folder
ex_arc_fault_dataset_validation_f28004x. Click Select Folder.
Browsing to the ex_arc_fault_dataset_validation_f28004x folder.
Click Finish to import the project into your workspace.
9.3.14.2. Step 2 – Build the Project
Go to Project → Build Project(s) (or press Ctrl+B).
Building the project from the Project menu.
Verify that the build completes without errors in the Console view.
9.3.14.3. Step 3 – Set Target Configuration
Switch the active target configuration from TMS320F280049C.ccxml to TMS320F280049C_LaunchPad.ccxml. Right-click the .ccxml file in Project Explorer and select Set as Active Target Configuration.
Selecting the LaunchPad target configuration.
9.3.14.4. Step 4 – Flash the Device
Connect the LAUNCHXL-F28004X LaunchPad to your PC via USB.
Go to Run → Flash Project.
Flashing the built project to the device.
(Optional) If a firmware update prompt appears, click Update.
Firmware update dialog – click Update if it appears.
9.3.14.5. Step 5 – Debug and Verify
After flashing, the Debug perspective opens. Click the Debug icon to start a debug session.
CCS Debug perspective after flashing.
Place a breakpoint on the line that follows the inference call in
application_main.c.
Setting a breakpoint after the inference call.
Click Resume (F8) to run the program. When the breakpoint is hit, add the variable test_result to the Watch window.
Adding test_result to the Watch window.
Inspect the value:
test_result == 1 – model inference passed (output matches golden vector).
test_result == 0 – model inference failed.
Verifying the test_result value in the Watch window.
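For intuition about what a golden-vector check does, the sketch below compares model outputs against expected values within a tolerance and returns 1 or 0, in the same spirit as test_result. It is not the code from the example project, whose actual comparison logic is not shown in this guide. The function name and the use of float outputs are assumptions.

```c
#include <stddef.h>

/* Returns 1 if every output is within tolerance of its golden value,
 * otherwise 0. A NaN in either array counts as a mismatch. */
static int outputs_match_golden(const float *out, const float *golden,
                                size_t count, float tolerance)
{
    for (size_t i = 0; i < count; i++) {
        float diff = out[i] - golden[i];
        if (diff < 0.0f) {
            diff = -diff;
        }
        /* Written as a negated <= so that NaN fails the check. */
        if (!(diff <= tolerance)) {
            return 0;
        }
    }
    return 1;
}
```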
9.3.15. Required Files from ModelMaker
The CCS example ex_arc_fault_dataset_validation_f28004x requires four
files generated by a ModelMaker run. After ModelMaker finishes, copy each
file from its ModelMaker output path to the corresponding CCS project path.
| File | Purpose | ModelMaker Source Path | CCS Project Destination |
|---|---|---|---|
|  | Compiled model library |  |  |
|  | Model inference API header |  |  |
|  | Golden-vector test data |  |  |
|  | Feature extraction config |  |  |
The ... prefix in the source paths expands to your ModelMaker data
directory, for example:
tinyml-modelmaker/data/projects/dc_arc_fault_example_dsk/run/<run_name>/
After copying the four files, rebuild the CCS project, flash, and verify
test_result in the debugger as described above.
9.3.16. Comparison: NPU vs Non-NPU
| Aspect | NPU Devices | Non-NPU Devices |
|---|---|---|
| Inference speed | 10-25x faster | Baseline |
| Model size | Up to 60k params | Typically <6k params |
| Power | Lower per inference | Higher per inference |
| Model constraints | NPU-specific rules | More flexible |
| Cost | Higher BOM | Lower BOM |
9.3.17. Next Steps
See CCS Integration Guide for detailed CCS setup
Review Device Overview for device selection
Check Common Errors for issues