9.3. Non-NPU Deployment

This guide covers deployment to TI devices without NPU hardware acceleration. These devices run inference entirely on the CPU.

9.3.1. Non-NPU Devices

Devices without NPU include:

C2000 Family (F28x):

  • F28003x, F28004x, F28P65x

  • F280013x, F280015x, F2837x

C2000 Family (F29x):

  • F29H85x, F29P58x, F29P32x

MSPM0 Family:

  • MSPM0G3507, MSPM0G3519

MSPM33 Family:

  • MSPM33C321Ax

Sitara MCU Family (AM26x):

  • AM263, AM263P, AM261

Connectivity:

  • CC2755, CC1352, CC1354, CC35X1

9.3.2. Configuration

For non-NPU devices, use standard models:

common:
  target_device: 'F28P65'  # Non-NPU device

training:
  model_name: 'CLS_4k'  # Standard model (no _NPU suffix)

compilation:
  enable: True
  preset_name: 'default_preset'  # Standard compilation

9.3.3. Model Selection

Without NPU acceleration, choose smaller models:

Device Class          Recommended Size    Model Examples
Entry-level (M0+)     100-500 params      CLS_100, CLS_500
Mid-range             500-2k params       CLS_1k, CLS_2k
High-performance      2k-6k params        CLS_4k, CLS_6k
AM26x (Cortex-R5)     Up to 13k params    CLS_6k, CLS_13k

9.3.4. CPU Inference Performance

Typical inference times (CPU-only):

Model      F28P65     MSPM0G3507   AM263      CC2755
CLS_500    500 µs     800 µs       200 µs     600 µs
CLS_1k     1000 µs    1500 µs      400 µs     1200 µs
CLS_4k     4000 µs    6000 µs      1500 µs    5000 µs

Note: Times are approximate and depend on clock frequency.
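
Because the figures above scale roughly inversely with CPU clock frequency, a measurement taken at one clock can be rescaled to estimate another. A minimal sketch (the function name and the assumption of purely linear scaling are illustrative; flash wait states and memory timing make real scaling less exact):

```c
#include <stdint.h>

/* Rescale a latency measured at f_meas_mhz to an estimate at f_tgt_mhz,
 * assuming inference time scales inversely with clock frequency. */
static inline uint32_t scale_latency_us(uint32_t t_us,
                                        uint32_t f_meas_mhz,
                                        uint32_t f_tgt_mhz)
{
    return (uint32_t)(((uint64_t)t_us * f_meas_mhz) / f_tgt_mhz);
}
```

For example, a model measured at 500 µs at 100 MHz would be estimated at roughly 1000 µs when the core is clocked at 50 MHz.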

9.3.5. Compilation Artifacts

Non-NPU compilation produces:

.../compilation/artifacts/
├── mod.a                    # Model library (CPU code)
├── mod.h                    # Model interface
├── model_config.h           # Configuration
├── feature_extraction.c     # Feature extraction
└── inference_example.c      # Example code

9.3.6. CCS Project Setup

Import the Project:

Import Project for Non-NPU

Importing a project into CCS for non-NPU devices

Build the Project:

Build Project

Building the project for non-NPU deployment

Flash and Debug:

Flash Application

Flashing the application to a non-NPU device

Debug Screen

CCS Debug perspective for non-NPU deployment

9.3.7. Basic Integration

#include "mod.h"
#include "feature_extraction.h"

float input_buffer[INPUT_SIZE];
float feature_buffer[FEATURE_SIZE];
float output_buffer[NUM_CLASSES];

void run_inference(void) {
    // Collect data
    collect_sensor_data(input_buffer);

    // Extract features
    extract_features(input_buffer, feature_buffer);

    // Run CPU inference
    mod_inference(feature_buffer, output_buffer);

    // Get result
    int prediction = argmax(output_buffer, NUM_CLASSES);
    handle_result(prediction);
}
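
The argmax helper used above is not part of the generated artifacts; a minimal implementation might look like this (the signature is an assumption):

```c
#include <stddef.h>

/* Return the index of the largest element in a score vector. */
static int argmax(const float *scores, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (scores[i] > scores[best]) {
            best = i;
        }
    }
    return (int)best;
}
```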

9.3.8. Optimizing CPU Inference

1. Enable Compiler Optimizations:

Project Properties → Build → Compiler → Optimization
Level: 4 (Highest)
Speed vs Size: Speed

2. Use Fixed-Point When Possible:

If your model supports fixed-point:

training:
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

INT8 operations are faster than float on many MCUs.
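
To illustrate why: a quantized layer can accumulate entirely in integer arithmetic and apply a single float rescale at the end. A simplified sketch (real quantized kernels also handle zero points and saturation, omitted here):

```c
#include <stdint.h>

/* 8-bit dot product with a 32-bit accumulator and one float rescale.
 * Only the final multiply touches floating point, which matters on
 * cores without an FPU. */
static float q8_dot(const int8_t *w, const int8_t *x, int n, float scale)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)w[i] * (int32_t)x[i];
    }
    return (float)acc * scale;
}
```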

3. Place Critical Code in Fast Memory:

#pragma CODE_SECTION(mod_inference, ".TI.ramfunc")

4. Optimize Feature Extraction:

Use simpler feature extraction if possible:

data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_RAW_256Feature_1Frame'

9.3.9. Memory Optimization

Non-NPU devices may have limited RAM:

Minimize Buffer Sizes:

data_processing_feature_extraction:
  # Smaller input reduces buffers
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'

Use Static Allocation:

// Static allocation - size known at compile time
static float feature_buffer[FEATURE_SIZE];
static float output_buffer[NUM_CLASSES];
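
Static allocation also lets the buffer budget be checked at build time rather than discovered at run time. A sketch using C11 `_Static_assert` (FEATURE_SIZE, NUM_CLASSES, and the 4 KB budget are illustrative values, not generated constants):

```c
#include <stddef.h>

#define FEATURE_SIZE 128
#define NUM_CLASSES  4

static float feature_buffer[FEATURE_SIZE];
static float output_buffer[NUM_CLASSES];

/* Fail the build, not the device, if the buffers outgrow the budget. */
_Static_assert(sizeof(feature_buffer) + sizeof(output_buffer) <= 4096,
               "inference buffers exceed the 4 KB RAM budget");

/* Helper so the footprint can also be logged or checked at run time. */
static size_t inference_buffer_bytes(void)
{
    return sizeof(feature_buffer) + sizeof(output_buffer);
}
```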

Memory Map Check:

Verify model fits in available memory:

After building, check .map file:
.text (code):    XX KB
.const (weights): XX KB
.bss (buffers):   XX KB

Compare with device memory:
Flash: XXX KB
RAM:   XX KB

9.3.10. Power Optimization

For battery-powered devices:

1. Duty Cycle Inference:

void main(void) {
    while (1) {
        // Wake up
        wake_from_sleep();

        // Run inference
        run_inference();

        // Sleep
        enter_low_power_mode();
    }
}

2. Reduce Clock During Inference:

Some devices allow dynamic clocking:

// Run at lower clock for power savings
// (trades off latency for power)
set_clock_speed(CLOCK_40MHZ);
run_inference();

3. Use Smallest Sufficient Model:

training:
  model_name: 'CLS_500'  # Smaller = less energy

9.3.11. Real-Time Considerations

For real-time applications:

Worst-Case Execution Time (WCET):

Measure inference time to ensure deadlines are met:

// Measure WCET
uint32_t max_time = 0;
for (int i = 0; i < 1000; i++) {
    uint32_t start = get_timer();
    run_inference();
    uint32_t elapsed = get_timer() - start;
    if (elapsed > max_time) max_time = elapsed;
}
// max_time is WCET estimate
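
A 1000-iteration sample can still miss rare worst-case paths, so measurement-based estimates are usually padded with a safety margin before being compared against the deadline. A sketch (the percentage-based margin is one common convention, not a requirement):

```c
#include <stdint.h>

/* Pad a measured maximum with a percentage safety margin to get a
 * deadline-checking WCET budget. */
static inline uint32_t wcet_with_margin(uint32_t measured_max,
                                        uint32_t margin_pct)
{
    return measured_max + (measured_max * margin_pct) / 100u;
}
```

For example, `wcet_with_margin(max_time, 20)` budgets 20% headroom over the observed maximum.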

Interrupt Latency:

Inference may block interrupts:

// Option 1: Run inference at low priority
void low_priority_task(void) {
    run_inference();
}

// Option 2: Split inference into chunks
void inference_chunk(int chunk_id) {
    mod_inference_partial(chunk_id, feature_buffer, output_buffer);
}
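
Note that `mod_inference_partial` is a hypothetical API: the generated library exposes a single `mod_inference` call, so chunking requires splitting the work yourself. The scheduling idea can be sketched independently of the model (all names here are illustrative):

```c
#define NUM_CHUNKS 4  /* illustrative: how many slices the work is split into */

static int next_chunk = 0;

/* Run one bounded slice of inference per call; interrupts are serviced
 * between calls. Returns 1 when a full inference pass has completed. */
static int inference_step(void)
{
    /* one slice of the network would run here, indexed by next_chunk */
    next_chunk++;
    if (next_chunk == NUM_CHUNKS) {
        next_chunk = 0;
        return 1; /* results are now valid */
    }
    return 0;
}
```

The main loop calls `inference_step()` repeatedly and only reads the outputs once it returns 1, keeping each blocking region short.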

9.3.12. Device-Specific Notes

C2000 (F28P65, F2837, etc.):

  • Strong floating-point unit

  • Good for signal processing

  • Use FPU-optimized libraries

// Enable FPU
FPU_enableModule();

MSPM0 (Cortex-M0+):

  • No FPU (software float)

  • Prefer INT8 quantization

  • Keep models small (<1k params)

training:
  model_name: 'CLS_500'
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

AM26x (Cortex-R5):

  • High performance

  • FPU available

  • Can handle larger models

training:
  model_name: 'CLS_6k'  # or larger

CC27xx/CC13xx/CC35xx (Connectivity):

  • Supported devices: CC2755, CC1352, CC1354, CC35X1

  • Balance model vs wireless stack memory

  • Consider inference frequency vs RF activity

9.3.13. Example: Vibration Monitoring on MSPM0G3507

common:
  task_type: 'generic_timeseries_anomalydetection'
  target_device: 'MSPM0G3507'

dataset:
  dataset_name: 'vibration_dataset'

data_processing_feature_extraction:
  feature_extraction_name: 'Generic_256Input_FFTBIN_32Feature_4Frame'
  variables: 1

training:
  model_name: 'AD_500'  # Small model for M0+
  quantization: 1
  quantization_method: 'QAT'
  quantization_weight_bitwidth: 8
  quantization_activation_bitwidth: 8

compilation:
  enable: True

Application Code:

#include "ti_msp_dl_config.h"
#include "mod.h"
#include "feature_extraction.h"

#define SAMPLE_SIZE 256
#define FEATURE_SIZE 128
#define THRESHOLD 0.5f

float adc_buffer[SAMPLE_SIZE];
float feature_buffer[FEATURE_SIZE];
float output;  // Reconstruction error

int main(void) {
    SYSCFG_DL_init();

    // Initialize model
    mod_init();

    while (1) {
        // Collect vibration data
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            DL_ADC12_startConversion(ADC0);
            while (!DL_ADC12_isConversionComplete(ADC0));
            adc_buffer[i] = DL_ADC12_getMemResult(ADC0, 0);
        }

        // Extract features
        extract_features(adc_buffer, feature_buffer);

        // Run anomaly detection
        mod_inference(feature_buffer, &output);

        // Check threshold
        if (output > THRESHOLD) {
            // Anomaly detected
            DL_GPIO_setPins(ALERT_PORT, ALERT_PIN);
        } else {
            DL_GPIO_clearPins(ALERT_PORT, ALERT_PIN);
        }

        // Enter low power until next sample period
        __WFI();
    }
}

9.3.14. CCS Studio Walkthrough: F28004x

This section provides a complete step-by-step walkthrough for deploying an arc fault classification model to the LAUNCHXL-F28004X board using Code Composer Studio. The F28004x does not have an NPU, so inference runs entirely on the CPU.

Important

Unlike the F28P55x example, the F28004x arc fault project is not available in the CCS Resource Explorer. You must import it manually using File → Import Project(s).

Requirements

LaunchPad:   LAUNCHXL-F28004X
SDK:         C2000Ware 6.00
IDE:         CCS Studio 20.2.0 or later

9.3.14.1. Step 1 – Import the Project Manually

Because this example is not listed in Resource Explorer, use the manual import flow.

  1. Open Code Composer Studio.

  2. Go to File → Import Project(s).

File Import Projects menu

Selecting Import Projects from the File menu.

  3. In the import dialog, click Browse and navigate to the folder ex_arc_fault_dataset_validation_f28004x. Click Select Folder.

Browse to project folder

Browsing to the ex_arc_fault_dataset_validation_f28004x folder.

  4. Click Finish to import the project into your workspace.

9.3.14.2. Step 2 – Build the Project

  1. Go to Project → Build Project(s) (or press Ctrl+B).

Build Project menu

Building the project from the Project menu.

Verify that the build completes without errors in the Console view.

9.3.14.3. Step 3 – Set Target Configuration

  1. Switch the active target configuration from TMS320F280049C.ccxml to TMS320F280049C_LaunchPad.ccxml. Right-click the .ccxml file in Project Explorer and select Set as Active Target Configuration.

Active Target Configuration

Selecting the LaunchPad target configuration.

9.3.14.4. Step 4 – Flash the Device

  1. Connect the LAUNCHXL-F28004X LaunchPad to your PC via USB.

  2. Go to Run → Flash Project.

Flash Project

Flashing the built project to the device.

  3. (Optional) If a firmware update prompt appears, click Update.

Firmware update dialog

Firmware update dialog – click Update if it appears.

9.3.14.5. Step 5 – Debug and Verify

  1. After flashing, the Debug perspective opens. Click the Debug icon to start a debug session.

Debug screen

CCS Debug perspective after flashing.

  2. Place a breakpoint on the line that follows the inference call in application_main.c.

Setting breakpoint

Setting a breakpoint after the inference call.

  3. Click Resume (F8) to run the program. When the breakpoint is hit, add the variable test_result to the Watch window.

Watch variable test_result

Adding test_result to the Watch window.

  4. Inspect the value:

    • test_result == 1 – model inference passed (output matches golden vector).

    • test_result == 0 – model inference failed.

test_result value

Verifying the test_result value in the Watch window.
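
The pass/fail flag comes from comparing the model output against the golden vector captured during quantization. A sketch of the kind of comparison involved (the function name and tolerance are illustrative, not the example project's actual code):

```c
#include <math.h>

/* Return 1 if every output element is within tol of its golden value. */
static int matches_golden(const float *out, const float *golden,
                          int n, float tol)
{
    for (int i = 0; i < n; i++) {
        if (fabsf(out[i] - golden[i]) > tol) {
            return 0;
        }
    }
    return 1;
}
```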

9.3.15. Required Files from ModelMaker

The CCS example ex_arc_fault_dataset_validation_f28004x requires four files generated by a ModelMaker run. After ModelMaker finishes, copy each file from its ModelMaker output path to the corresponding CCS project path.

  • mod.a (compiled model library)
      Source:      .../compilation/artifacts/mod.a
      Destination: ex_arc_fault_dataset_validation_f28004x/artifacts/mod.a

  • tvmgen_default.h (model inference API header)
      Source:      .../compilation/artifacts/tvmgen_default.h
      Destination: ex_arc_fault_dataset_validation_f28004x/artifacts/tvmgen_default.h

  • test_vector.c (golden-vector test data)
      Source:      .../training/quantization/golden_vectors/test_vector.c
      Destination: ex_arc_fault_dataset_validation_f28004x/test_vector.c

  • user_input_config.h (feature extraction config)
      Source:      .../training/quantization/golden_vectors/user_input_config.h
      Destination: ex_arc_fault_dataset_validation_f28004x/user_input_config.h

The ... prefix in the source paths expands to your ModelMaker data directory, for example:

tinyml-modelmaker/data/projects/dc_arc_fault_example_dsk/run/<run_name>/

After copying the four files, rebuild the CCS project, flash, and verify test_result in the debugger as described above.
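
The copy step can be scripted. A sketch, assuming the directory layout shown above (pass your actual ModelMaker run directory, the `...` prefix, and your CCS project directory):

```shell
# Copy the four ModelMaker artifacts into the CCS example project.
# $1: ModelMaker run directory (the "..." prefix from the table above)
# $2: path to the ex_arc_fault_dataset_validation_f28004x project
copy_modelmaker_artifacts() {
    RUN_DIR="$1"
    CCS_DIR="$2"
    mkdir -p "$CCS_DIR/artifacts"
    cp "$RUN_DIR/compilation/artifacts/mod.a"            "$CCS_DIR/artifacts/"
    cp "$RUN_DIR/compilation/artifacts/tvmgen_default.h" "$CCS_DIR/artifacts/"
    cp "$RUN_DIR/training/quantization/golden_vectors/test_vector.c"       "$CCS_DIR/"
    cp "$RUN_DIR/training/quantization/golden_vectors/user_input_config.h" "$CCS_DIR/"
}
```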

9.3.16. Comparison: NPU vs Non-NPU

Aspect              NPU Devices           Non-NPU Devices
Inference speed     10-25x faster         Baseline
Model size          Up to 60k params      Typically <6k params
Power               Lower per inference   Higher per inference
Model constraints   NPU-specific rules    More flexible
Cost                Higher BOM            Lower BOM


9.3.17. Next Steps