9.3. Non-NPU Deployment
This guide covers deployment to TI devices without NPU hardware acceleration. These devices run inference entirely on the CPU.
9.3.1. Non-NPU Devices
Devices without an NPU include:

- C2000 Family: F28P65, F29H85, F29P58, F29P32, F2837, F28004, F28003, F280013, F280015
- MSPM0 Family: MSPM0G3507, MSPM0G3519
- MSPM33C Family: MSPM33C32, MSPM33C34
- AM26x Family: AM263, AM263P, AM261
- Connectivity: CC2755, CC1352
9.3.2. Configuration
For non-NPU devices, use standard models:
```yaml
common:
  target_device: 'F28P65'          # Non-NPU device
training:
  model_name: 'CLS_4k'             # Standard model (no _NPU suffix)
compilation:
  enable: True
  preset_name: 'default_preset'    # Standard compilation
```
9.3.3. Model Selection
Without NPU acceleration, choose smaller models:
| Device Class | Recommended Size | Model Examples |
|---|---|---|
| Entry-level (M0+) | 100-500 params | CLS_100, CLS_500 |
| Mid-range | 500-2k params | CLS_1k, CLS_2k |
| High-performance | 2k-6k params | CLS_4k, CLS_6k |
| AM26x (Cortex-R5) | Up to 13k params | CLS_6k, CLS_13k |
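As a quick sanity check when picking a size class, weight storage scales with parameter count and datatype; a back-of-envelope sketch (the counts below are illustrative, not toolchain output):

```c
// Back-of-envelope weight storage for an N-parameter model.
// A 4k-param model needs ~4 KB of weights as INT8, ~16 KB as float32.
#define MODEL_PARAMS     4000u
#define BYTES_PER_PARAM  1u    /* 1 for INT8 quantized, 4 for float32 */
#define WEIGHT_BYTES     (MODEL_PARAMS * BYTES_PER_PARAM)   /* ~4 KB */
```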
9.3.4. CPU Inference Performance
Typical inference times (CPU-only):
| Model | F28P65 | MSPM0G3507 | AM263 | CC2755 |
|---|---|---|---|---|
| CLS_500 | 500 µs | 800 µs | 200 µs | 600 µs |
| CLS_1k | 1000 µs | 1500 µs | 400 µs | 1200 µs |
| CLS_4k | 4000 µs | 6000 µs | 1500 µs | 5000 µs |
Note: Times are approximate and depend on clock frequency.
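Since CPU-only inference is compute-bound, latency scales roughly inversely with clock frequency, so the table entries can be converted to cycle counts and rescaled for your clock. A rough estimate that ignores flash wait-state effects (the clock values below are illustrative):

```c
// Rescale a published latency to a different clock frequency.
// Example: 500 us at 120 MHz is ~60,000 cycles, or ~750 us at 80 MHz.
#define MEASURED_US       500u
#define MEASURED_CLK_MHZ  120u
#define TARGET_CLK_MHZ    80u
#define EST_CYCLES        (MEASURED_US * MEASURED_CLK_MHZ)   /* 60,000 */
#define EST_US_AT_TARGET  (EST_CYCLES / TARGET_CLK_MHZ)      /* ~750 us */
```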
9.3.5. Compilation Artifacts
Non-NPU compilation produces:
```text
.../compilation/artifacts/
├── mod.a                  # Model library (CPU code)
├── mod.h                  # Model interface
├── model_config.h         # Configuration
├── feature_extraction.c   # Feature extraction
└── inference_example.c    # Example code
```
9.3.6. CCS Project Setup
Import the Project:

(screenshot: importing a project into CCS for non-NPU devices)

Build the Project:

(screenshot: building the project for non-NPU deployment)

Flash and Debug:

(screenshots: flashing the application to a non-NPU device; CCS Debug perspective for non-NPU deployment)
9.3.7. Basic Integration
#include "mod.h"
#include "feature_extraction.h"
float input_buffer[INPUT_SIZE];
float feature_buffer[FEATURE_SIZE];
float output_buffer[NUM_CLASSES];
void run_inference(void) {
// Collect data
collect_sensor_data(input_buffer);
// Extract features
extract_features(input_buffer, feature_buffer);
// Run CPU inference
mod_inference(feature_buffer, output_buffer);
// Get result
int prediction = argmax(output_buffer, NUM_CLASSES);
handle_result(prediction);
}
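argmax() is not shown above; if your generated artifacts do not already provide one, a minimal implementation is:

```c
// Return the index of the largest score (assumes n >= 1).
static int argmax(const float *scores, int n) {
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return best;
}
```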
9.3.8. Optimizing CPU Inference
1. Enable Compiler Optimizations:
In CCS: Project Properties → Build → Compiler → Optimization
- Level: 4 (Highest)
- Speed vs Size: Speed
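For command-line or CI builds, the equivalent settings can be passed as compiler flags; a sketch for common TI toolchains (exact spellings vary by compiler version, so verify against your compiler guide):

```text
# TI C2000 / classic TI Arm compiler:
--opt_level=4 --opt_for_speed=5

# tiarmclang (Clang-based Arm toolchain):
-O3
```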
2. Use Fixed-Point When Possible:
If your model supports fixed-point:
```yaml
training:
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
INT8 operations are faster than float on many MCUs.
3. Place Critical Code in Fast Memory:
```c
#pragma CODE_SECTION(mod_inference, ".TI.ramfunc")
```
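Note that #pragma CODE_SECTION is classic TI-compiler syntax and applies to functions defined in the same translation unit. With the Clang-based tiarmclang toolchain, a section attribute achieves the same placement; a sketch (the section must still be mapped to RAM in your linker command file):

```c
// Clang-style equivalent: place the hot function in the ".TI.ramfunc"
// output section, which the linker command file maps to RAM.
__attribute__((section(".TI.ramfunc")))
void run_inference(void) {
    // ... body as in the Basic Integration example ...
}
```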
4. Optimize Feature Extraction:
Use simpler feature extraction if possible:
```yaml
data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_RAW_256Feature_1Frame'
```
9.3.9. Memory Optimization
Non-NPU devices may have limited RAM:
Minimize Buffer Sizes:
```yaml
data_processing_feature_extraction:
  # FFT binning shrinks the feature buffer (32 features x 4 frames
  # instead of 256 raw features)
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'
```
Use Static Allocation:
```c
// Static allocation - size known at compile time
static float feature_buffer[FEATURE_SIZE];
static float output_buffer[NUM_CLASSES];
```
Memory Map Check:

Verify the model fits in available memory. After building, check the .map file:

```text
.text  (code):    XX KB
.const (weights): XX KB
.bss   (buffers): XX KB
```

Compare these against the device's memory:

```text
Flash: XXX KB
RAM:   XX KB
```
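The Flash and RAM ranges to compare against live in the device's linker command file; a minimal sketch of its MEMORY directive (origins and lengths below are placeholders, not values for any specific device):

```text
/* Linker .cmd sketch - substitute your device's actual ranges */
MEMORY
{
    FLASH (RX)  : origin = 0x00000000, length = 0x00020000   /* e.g. 128 KB */
    SRAM  (RWX) : origin = 0x20000000, length = 0x00008000   /* e.g. 32 KB  */
}
```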
9.3.10. Power Optimization
For battery-powered devices:
1. Duty Cycle Inference:
```c
void main(void) {
    while (1) {
        // Wake up
        wake_from_sleep();

        // Run inference
        run_inference();

        // Sleep
        enter_low_power_mode();
    }
}
```
2. Reduce Clock During Inference:
Some devices allow dynamic clocking:
```c
// Run at lower clock for power savings
// (trades off latency for power)
set_clock_speed(CLOCK_40MHZ);
run_inference();
```
3. Use Smallest Sufficient Model:
```yaml
training:
  model_name: 'CLS_500'  # Smaller = less energy
```
9.3.11. Real-Time Considerations
For real-time applications:
Worst-Case Execution Time (WCET):
Measure inference time to ensure deadlines are met:
```c
// Measure WCET over repeated runs
uint32_t max_time = 0;
for (int i = 0; i < 1000; i++) {
    uint32_t start = get_timer();
    run_inference();
    uint32_t elapsed = get_timer() - start;
    if (elapsed > max_time) max_time = elapsed;
}
// max_time is a WCET estimate
```
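get_timer() above is application-supplied. One common implementation reads a free-running 32-bit up-counter; a sketch, with TIMER_COUNT_REG as a hypothetical placeholder for your device's counter register:

```c
#include <stdint.h>

// Read a free-running 32-bit up-counting hardware timer.
// TIMER_COUNT_REG is a hypothetical register address; configure the
// timer to run continuously before taking measurements. Unsigned
// subtraction in the caller tolerates a single counter wraparound.
#define TIMER_COUNT_REG 0x40000010u   /* placeholder address */

static inline uint32_t get_timer(void) {
    return *(volatile uint32_t *)TIMER_COUNT_REG;
}
```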
Interrupt Latency:
A long inference call can delay interrupt handling and other time-critical work; two common mitigations:
```c
// Option 1: Run inference in a low-priority task
void low_priority_task(void) {
    run_inference();
}

// Option 2: Split inference into chunks
void inference_chunk(int chunk_id) {
    mod_inference_partial(chunk_id, feature_buffer, output_buffer);
}
```
9.3.12. Device-Specific Notes
C2000 (F28P65, F2837, etc.):
- Strong floating-point unit
- Good for signal processing
- Use FPU-optimized libraries

```c
// Enable FPU
FPU_enableModule();
```
MSPM0 (Cortex-M0+):
- No FPU (software float)
- Prefer INT8 quantization
- Keep models small (<1k params)

```yaml
training:
  model_name: 'CLS_500'
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
```
AM26x (Cortex-R5):
- High performance
- FPU available
- Can handle larger models

```yaml
training:
  model_name: 'CLS_6k'  # or larger
```
CC27xx/CC13xx (Connectivity):
- Balance model memory against the wireless stack's memory footprint
- Schedule inference around RF activity (see the sketch below)
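A minimal scheduling sketch for the RF trade-off, assuming a hypothetical rf_is_idle() hook from your wireless stack integration (not an actual SimpleLink API):

```c
// Run inference only while the radio is idle so that CPU-heavy
// inference does not compete with time-critical RF processing.
// rf_is_idle() is a hypothetical hook, not a real stack API.
void app_idle_loop(void) {
    if (rf_is_idle()) {
        run_inference();
    }
}
```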
9.3.13. Example: Vibration Monitoring on MSPM0G3507
```yaml
common:
  task_type: 'generic_timeseries_anomalydetection'
  target_device: 'MSPM0G3507'
dataset:
  dataset_name: 'vibration_dataset'
data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'
  variables: 1
training:
  model_name: 'AD_500'  # Small model for M0+
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8
compilation:
  enable: True
```
Application Code:
#include "ti_msp_dl_config.h"
#include "mod.h"
#include "feature_extraction.h"
#define SAMPLE_SIZE 256
#define FEATURE_SIZE 128
#define THRESHOLD 0.5f
float adc_buffer[SAMPLE_SIZE];
float feature_buffer[FEATURE_SIZE];
float output; // Reconstruction error
int main(void) {
SYSCFG_DL_init();
// Initialize model
mod_init();
while (1) {
// Collect vibration data
for (int i = 0; i < SAMPLE_SIZE; i++) {
DL_ADC12_startConversion(ADC0);
while (!DL_ADC12_isConversionComplete(ADC0));
adc_buffer[i] = DL_ADC12_getMemResult(ADC0, 0);
}
// Extract features
extract_features(adc_buffer, feature_buffer);
// Run anomaly detection
mod_inference(feature_buffer, &output);
// Check threshold
if (output > THRESHOLD) {
// Anomaly detected
DL_GPIO_setPins(ALERT_PORT, ALERT_PIN);
} else {
DL_GPIO_clearPins(ALERT_PORT, ALERT_PIN);
}
// Enter low power until next sample period
__WFI();
}
}
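The fixed THRESHOLD above is a placeholder; in practice the anomaly threshold is usually calibrated from reconstruction errors collected on known-normal data. A hedged sketch using a mean-plus-k-sigma rule (one common heuristic, not something the toolchain prescribes):

```c
#include <math.h>

// Derive an anomaly threshold as mean + k*stddev of reconstruction
// errors measured while the machine is known to be healthy.
float calibrate_threshold(const float *errors, int n, float k) {
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < n; i++) mean += errors[i];
    mean /= (float)n;
    for (int i = 0; i < n; i++) {
        float d = errors[i] - mean;
        var += d * d;
    }
    var /= (float)n;
    return mean + k * sqrtf(var);
}
```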
9.3.14. Comparison: NPU vs Non-NPU
| Aspect | NPU Devices | Non-NPU Devices |
|---|---|---|
| Inference speed | 10-25x faster | Baseline |
| Model size | Up to 60k params | Typically <6k params |
| Power | Lower per inference | Higher per inference |
| Model constraints | NPU-specific rules | More flexible |
| Cost | Higher BOM | Lower BOM |
9.3.15. Next Steps
- See the CCS Integration Guide for detailed CCS setup
- Review the Device Overview for device selection
- Check Common Errors for troubleshooting