9.3. Non-NPU Deployment

This guide covers deployment to TI devices without NPU hardware acceleration. These devices run inference entirely on the CPU.

9.3.1. Non-NPU Devices

Devices without NPU include:

C2000 Family:

  • F28P65, F29H85, F29P58, F29P32

  • F2837, F28004, F28003

  • F280013, F280015

MSPM0 Family:

  • MSPM0G3507, MSPM0G3519

MSPM33C Family:

  • MSPM33C32, MSPM33C34

AM26x Family:

  • AM263, AM263P, AM261

Connectivity:

  • CC2755, CC1352

9.3.2. Configuration

For non-NPU devices, use standard models:

common:
  target_device: 'F28P65'  # Non-NPU device

training:
  model_name: 'CLS_4k'  # Standard model (no _NPU suffix)

compilation:
  enable: True
  preset_name: 'default_preset'  # Standard compilation

9.3.3. Model Selection

Without NPU acceleration, choose smaller models:

Device Class          Recommended Size    Model Examples
Entry-level (M0+)     100-500 params      CLS_100, CLS_500
Mid-range             500-2k params       CLS_1k, CLS_2k
High-performance      2k-6k params        CLS_4k, CLS_6k
AM26x (Cortex-R5)     Up to 13k params    CLS_6k, CLS_13k

9.3.4. CPU Inference Performance

Typical inference times (CPU-only):

Model      F28P65     MSPM0G3507    AM263      CC2755
CLS_500    500 µs     800 µs        200 µs     600 µs
CLS_1k     1000 µs    1500 µs       400 µs     1200 µs
CLS_4k     4000 µs    6000 µs       1500 µs    5000 µs

Note: Times are approximate and depend on clock frequency.
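
A rough throughput check follows directly from these numbers (illustrative, using the table values above):

Example: CLS_4k on F28P65 at ~4000 µs per inference
  maximum rate ≈ 1 / 4 ms ≈ 250 inferences/s (at 100% CPU load)
  at 10 inferences/s, inference consumes roughly 4% of the CPU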

9.3.5. Compilation Artifacts

Non-NPU compilation produces:

.../compilation/artifacts/
├── mod.a                    # Model library (CPU code)
├── mod.h                    # Model interface
├── model_config.h           # Configuration
├── feature_extraction.c     # Feature extraction
└── inference_example.c      # Example code

9.3.6. CCS Project Setup

Import the Project:

[Figure: Importing a project into CCS for non-NPU devices]

Build the Project:

[Figure: Building the project for non-NPU deployment]

Flash and Debug:

[Figure: Flashing the application to a non-NPU device]

[Figure: CCS Debug perspective for non-NPU deployment]

9.3.7. Basic Integration

#include "mod.h"
#include "feature_extraction.h"

float input_buffer[INPUT_SIZE];
float feature_buffer[FEATURE_SIZE];
float output_buffer[NUM_CLASSES];

void run_inference(void) {
    // Collect data
    collect_sensor_data(input_buffer);

    // Extract features
    extract_features(input_buffer, feature_buffer);

    // Run CPU inference
    mod_inference(feature_buffer, output_buffer);

    // Get result
    int prediction = argmax(output_buffer, NUM_CLASSES);
    handle_result(prediction);
}
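
The snippet above assumes mod_init() has already been called once at startup (as in the MSPM0G3507 example later in this section). The argmax() helper is not necessarily part of the generated artifacts; a minimal sketch is shown below for illustration:

// Minimal argmax helper (illustrative; not part of the generated API):
// returns the index of the largest value in an array of scores.
static int argmax(const float *values, int count)
{
    int best = 0;
    for (int i = 1; i < count; i++) {
        if (values[i] > values[best]) {
            best = i;
        }
    }
    return best;
}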

9.3.8. Optimizing CPU Inference

1. Enable Compiler Optimizations:

Project Properties → Build → Compiler → Optimization
Level: 4 (Highest)
Speed vs Size: Speed

2. Use Fixed-Point When Possible:

Enable quantization during training so the deployed model runs in fixed point:

training:
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

INT8 arithmetic is significantly faster than floating point on MCUs without an FPU, and it also shrinks the weight storage.

3. Place Critical Code in Fast Memory:

#pragma CODE_SECTION(mod_inference, ".TI.ramfunc")
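
The #pragma above is TI C2000 compiler syntax, and a section pragma or attribute only takes effect in the file where the function is defined; code inside the prebuilt mod.a is placed through the linker command file instead. For the Arm-based devices (MSPM0, AM26x) the usual equivalent is a GCC-style section attribute on your own time-critical function, as in the sketch below (an assumption for illustration; the section name must match one that your project's linker command file maps to RAM, with .TI.ramfunc being the common TI convention):

#include "mod.h"

// Sketch: place a time-critical wrapper in RAM on Arm-based devices.
// run_inference_from_ram is a hypothetical name for illustration.
__attribute__((section(".TI.ramfunc")))
void run_inference_from_ram(float *features, float *scores)
{
    mod_inference(features, scores);
}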

4. Optimize Feature Extraction:

Use simpler feature extraction if possible:

data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_RAW_256Feature_1Frame'

9.3.9. Memory Optimization

Non-NPU devices may have limited RAM:

Minimize Buffer Sizes:

data_processing_feature_extraction:
  # Smaller input reduces buffers
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'

Use Static Allocation:

// Static allocation - size known at compile time
static float feature_buffer[FEATURE_SIZE];
static float output_buffer[NUM_CLASSES];

Memory Map Check:

Verify model fits in available memory:

After building, check .map file:
.text (code):    XX KB
.const (weights): XX KB
.bss (buffers):   XX KB

Compare with device memory:
Flash: XXX KB
RAM:   XX KB
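
A first-order estimate of the weight footprint follows from the parameter count (illustrative; actual sizes also include layer metadata and code overhead):

Example: CLS_4k (~4000 parameters)
  8-bit quantized:  ~4000 × 1 byte  ≈ 4 KB of weights in .const
  32-bit float:     ~4000 × 4 bytes ≈ 16 KB of weights in .const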

9.3.10. Power Optimization

For battery-powered devices:

1. Duty Cycle Inference:

void main(void) {
    while (1) {
        // Wake up
        wake_from_sleep();

        // Run inference
        run_inference();

        // Sleep
        enter_low_power_mode();
    }
}
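
A back-of-the-envelope energy estimate shows why duty cycling matters (illustrative numbers, not measured values for any specific device):

Example: 5 mA active current at 3.3 V, 1.5 ms per inference
  energy per inference ≈ 5 mA × 3.3 V × 1.5 ms ≈ 25 µJ
  at 1 inference/s, inference adds ~7.5 µA to the average current
  (the rest of the budget is sleep current and data acquisition)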

2. Reduce Clock During Inference:

Some devices allow dynamic clocking:

// Run at lower clock for power savings
// (trades off latency for power)
set_clock_speed(CLOCK_40MHZ);
run_inference();

3. Use Smallest Sufficient Model:

training:
  model_name: 'CLS_500'  # Smaller = less energy

9.3.11. Real-Time Considerations

For real-time applications:

Worst-Case Execution Time (WCET):

Measure inference time to ensure deadlines are met:

// Measure WCET
uint32_t max_time = 0;
for (int i = 0; i < 1000; i++) {
    uint32_t start = get_timer();
    run_inference();
    uint32_t elapsed = get_timer() - start;
    if (elapsed > max_time) max_time = elapsed;
}
// max_time is WCET estimate

Interrupt Latency:

A long-running inference call occupies the CPU and can delay other time-critical processing. Two common mitigations:

// Option 1: Run inference at low priority
void low_priority_task(void) {
    run_inference();
}

// Option 2: Split inference into chunks
void inference_chunk(int chunk_id) {
    mod_inference_partial(chunk_id, feature_buffer, output_buffer);
}

9.3.12. Device-Specific Notes

C2000 (F28P65, F2837, etc.):

  • Strong floating-point unit

  • Good for signal processing

  • Use FPU-optimized libraries

// Enable FPU
FPU_enableModule();
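
The compiler must also be told to generate hardware floating-point code; otherwise float math falls back to software routines regardless of the hardware. For the TI C2000 compiler this is the --float_support option (e.g., --float_support=fpu32), typically exposed in CCS under the compiler's Processor Options (exact menu wording may vary by CCS version):

Project Properties → Build → C2000 Compiler → Processor Options
Floating point support: fpu32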

MSPM0 (Cortex-M0+):

  • No FPU (software float)

  • Prefer INT8 quantization

  • Keep models small (<1k params)

training:
  model_name: 'CLS_500'
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

AM26x (Cortex-R5):

  • High performance

  • FPU available

  • Can handle larger models

training:
  model_name: 'CLS_6k'  # or larger

CC27xx/CC13xx (Connectivity):

  • Balance model vs wireless stack memory

  • Consider inference frequency vs RF activity

9.3.13. Example: Vibration Monitoring on MSPM0G3507

common:
  task_type: 'generic_timeseries_anomalydetection'
  target_device: 'MSPM0G3507'

dataset:
  dataset_name: 'vibration_dataset'

data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'
  variables: 1

training:
  model_name: 'AD_500'  # Small model for M0+
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

compilation:
  enable: True

Application Code:

#include "ti_msp_dl_config.h"
#include "mod.h"
#include "feature_extraction.h"

#define SAMPLE_SIZE 256
#define FEATURE_SIZE 128
#define THRESHOLD 0.5f

float adc_buffer[SAMPLE_SIZE];
float feature_buffer[FEATURE_SIZE];
float output;  // Reconstruction error

int main(void) {
    SYSCFG_DL_init();

    // Initialize model
    mod_init();

    while (1) {
        // Collect vibration data
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            DL_ADC12_startConversion(ADC0);
            while (!DL_ADC12_isConversionComplete(ADC0));
            adc_buffer[i] = DL_ADC12_getMemResult(ADC0, 0);
        }

        // Extract features
        extract_features(adc_buffer, feature_buffer);

        // Run anomaly detection
        mod_inference(feature_buffer, &output);

        // Check threshold
        if (output > THRESHOLD) {
            // Anomaly detected
            DL_GPIO_setPins(ALERT_PORT, ALERT_PIN);
        } else {
            DL_GPIO_clearPins(ALERT_PORT, ALERT_PIN);
        }

        // Enter low power until next sample period
        __WFI();
    }
}
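
One point to watch in this example: the raw 12-bit ADC codes are passed straight to feature extraction. If the model was trained on scaled physical values, convert the samples first. A minimal sketch, assuming a 12-bit result and a 3.3 V reference (adjust to match how the training data was captured):

#include <stdint.h>

// Sketch (assumption): convert a raw 12-bit ADC code to volts before
// feature extraction. Only needed if the training data was captured
// as scaled physical values rather than raw ADC codes.
static inline float adc_code_to_volts(uint16_t code)
{
    return ((float)code / 4095.0f) * 3.3f;  /* 3.3 V reference assumed */
}

// Usage in the acquisition loop:
//   adc_buffer[i] = adc_code_to_volts(DL_ADC12_getMemResult(ADC0, 0));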

9.3.14. Comparison: NPU vs Non-NPU

Aspect              NPU Devices           Non-NPU Devices
Inference speed     10-25x faster         Baseline
Model size          Up to 60k params      Typically <6k params
Power               Lower per inference   Higher per inference
Model constraints   NPU-specific rules    More flexible
Cost                Higher BOM            Lower BOM

9.3.15. Next Steps