8.8. On-Device Training (ODT)

8.8.1. Overview

On-Device Training (ODT) enables machine learning models to continue training directly on microcontrollers after deployment. Rather than deploying a frozen, inference-only model, ODT ships both a frozen feature-extraction backbone and a trainable model head to the device, allowing the head to adapt to local data under real-world operating conditions.

Traditional workflow:

Train on PC → Compile model → Deploy to MCU → Inference only

On-device training workflow:

Train on PC → Split model → Compile frozen part → Export trainable part
→ Deploy to MCU → Continue training on MCU → Inference

8.8.2. Use Cases

ODT is most useful in the following situations:

1. Data Drift: The deployment environment differs from the training environment. Example: a fan blade anomaly detector trained in a lab encounters different vibration characteristics when installed on a factory floor. ODT allows the model to adapt to the new operating conditions without re-deployment.
2. Privacy and Security: Raw sensor data cannot leave the device due to regulatory or security constraints (e.g., medical devices, industrial systems). ODT allows local adaptation without transmitting sensitive data.
3. Personalization: Each installation has unique characteristics. A motor vibration model trained on one motor type may need to adapt to a different motor. ODT enables per-installation customization without training a separate model for each installation on the PC.
4. Reduced Re-deployment Cost: Without ODT, adapting to a new environment requires a full round trip: collect data → ship to PC → retrain → recompile → reflash. ODT removes this round trip.
5. Zero-shot Deployment: The trainable portion can be deployed with zero epochs of PC-side training, allowing the device to train the head entirely from scratch using locally collected data.

8.8.3. Architecture: Frozen + Trainable Split

The model is split into two parts at deployment time:

Frozen Part (Feature Extractor)

Contains pre-trained convolutional layers, embeddings, or backbone
Deployed as compiled inference code
Not modified during on-device training
Typically 70-90% of model parameters

Trainable Part (Classification Head)

Lightweight dense layers or simple linear classifier
Deployed as weights + gradient computation code
Updated during on-device training
Typically 10-30% of model parameters

Memory implications:

Frozen part: weights only (inference)
Trainable part: weights + activations + gradients (training)
Additional training memory ≈ trainable head weights + activations + gradient buffers (the frozen part adds only its inference footprint)
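
The split can be illustrated with a small sketch in plain Python. It is not the code generated for the MCU; the layer shapes, the function names (`frozen_backbone`, `head_probs`, `sgd_step`) and the toy data are all assumptions made for illustration. It shows a fixed feature extractor feeding a small softmax head, where only the head's weights and biases are updated.

```python
import math

def frozen_backbone(x, w_frozen):
    """Frozen feature extractor (hypothetical): one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_frozen]

def head_probs(feat, w_head, b_head):
    """Trainable head: linear layer followed by a numerically stable softmax."""
    logits = [sum(w * f for w, f in zip(row, feat)) + b
              for row, b in zip(w_head, b_head)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(feat, label, w_head, b_head, lr):
    """One plain SGD step on the head only; returns the loss before the update."""
    probs = head_probs(feat, w_head, b_head)
    loss = -math.log(probs[label])
    for k, row in enumerate(w_head):
        # Gradient of cross-entropy w.r.t. logit k is (p_k - one_hot_k).
        grad = probs[k] - (1.0 if k == label else 0.0)
        for j, f in enumerate(feat):
            row[j] -= lr * grad * f
        b_head[k] -= lr * grad
    return loss

# Toy setup (assumed values): 3 inputs -> 3 frozen features -> 2 classes.
w_frozen = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
frozen_snapshot = [row[:] for row in w_frozen]
w_head = [[0.0] * 3 for _ in range(2)]
b_head = [0.0, 0.0]
samples = [([1.0, 0.2, 0.1], 0), ([-1.0, 0.1, 0.3], 1)]

losses = []
for _ in range(50):
    epoch_loss = 0.0
    for x, y in samples:
        feat = frozen_backbone(x, w_frozen)  # backbone is only read, never written
        epoch_loss += sgd_step(feat, y, w_head, b_head, lr=0.1)
    losses.append(epoch_loss / len(samples))
```

Because the backbone is only evaluated, no gradients or activations need to be stored for it; the training buffers scale with the size of the head alone.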

8.8.4. Supported Task Types

ODT is available for:

Time Series Classification — accelerometer, audio, sensor signals
Time Series Regression — forecasting, sensor readings
Time Series Anomaly Detection — detecting out-of-distribution patterns
Image Classification — visual recognition (with reduced input size)

Each task type has its own trainable architecture optimized for MCU memory constraints.

8.8.5. Workflow

Phase 1: PC-side Preparation

Train full model on PC with training dataset
Extract frozen backbone (feature extractor) and trainable head (classifier)
Compile frozen backbone to MCU code (NPU or CPU inference)
Export trainable head weights in quantized format
Generate trainable architecture code for MCU
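
The source does not specify the quantized format used for the exported head weights. As a rough illustration only, the sketch below assumes symmetric per-tensor int8 quantization; the helper names `quantize_int8` and `dequantize` are hypothetical and not part of any tool API.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization (assumed scheme): w ~= scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the shared scale."""
    return [scale * v for v in q]

# Example head weights (made-up values).
head_weights = [0.5, -1.0, 0.25, 0.0]
q_weights, q_scale = quantize_int8(head_weights)
restored = dequantize(q_weights, q_scale)
```

With this scheme the reconstruction error per weight is bounded by half the scale, which is one reason on-device training runs at lower numerical precision than PC training.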

Phase 2: MCU Deployment

Flash frozen backbone + trainable head to device
Device runs inference with frozen backbone
When adaptation is needed, device collects local data
Device fine-tunes trainable head using local data (SGD, Adam, etc.)
Updated weights remain on device

Phase 3: Continuous Adaptation (Optional)

Device periodically retrains on new data batches
Frozen backbone remains unchanged
Trainable head converges to new environment characteristics
Inference accuracy improves with local adaptation
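
One way to picture periodic retraining on new data batches is a small sample buffer that hands over a batch once enough local samples have been collected. The class below is a hypothetical sketch, not part of the generated firmware; `AdaptationBuffer` and its methods are illustrative names, and the trigger policy (retrain whenever `batch_size` samples are available) is an assumption.

```python
class AdaptationBuffer:
    """Collects (features, label) pairs and releases them in fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.samples = []

    def add(self, features, label):
        """Store one sample; return a full batch when ready, otherwise None."""
        self.samples.append((features, label))
        if len(self.samples) < self.batch_size:
            return None
        batch, self.samples = self.samples, []
        return batch

# Usage: the caller would run a training pass on each returned batch.
buffer = AdaptationBuffer(batch_size=3)
results = [buffer.add([float(i)], i % 2) for i in range(4)]
```

Here the third call returns a batch of three samples and the buffer starts filling again, so memory use for collected data stays bounded by one batch.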

8.8.6. Configuration

Enable ODT in your config file:

common:
    task_type: "timeseries_classification"  # or regression, anomaly_detection
    model_name: "generic_timeseries_cnn"

ondevice_training:
    enabled: true
    split_layer: "before_dense"              # where to split model
    trainable_layers: 2                      # number of trainable layers
    training_method: "sgd"                   # sgd, adam, rmsprop
    learning_rate: 0.001
    epochs_per_batch: 5                      # epochs when training on device
    batch_size: 32
    optimizer_state_size: "minimal"          # minimal, full

Supported Configurations:

Parameter	Description	Example Values
`split_layer`	Which layer to split at (frozen before, trainable after)	"before_dense", "before_classifier"
`trainable_layers`	Number of trainable layers	1, 2, 3
`training_method`	Optimizer algorithm	"sgd", "adam", "rmsprop"
`learning_rate`	Learning rate for on-device training	0.0001 to 0.1
`epochs_per_batch`	Epochs per training batch on MCU	1 to 20
`batch_size`	Training batch size	8, 16, 32, 64
`optimizer_state_size`	Memory mode for optimizer state	"minimal", "full"
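
A quick sanity check of the `ondevice_training` section can catch typos before a build. The sketch below is hypothetical and not part of the tool: it treats the example values in the table as the accepted set, which is an assumption, since the table lists examples rather than an exhaustive specification. It works on a plain dictionary, such as one produced by any YAML loader.

```python
ENUM_FIELDS = {
    "split_layer": {"before_dense", "before_classifier"},
    "training_method": {"sgd", "adam", "rmsprop"},
    "optimizer_state_size": {"minimal", "full"},
}
RANGE_FIELDS = {"learning_rate": (0.0001, 0.1), "epochs_per_batch": (1, 20)}
POSITIVE_INT_FIELDS = ("trainable_layers", "batch_size")

def validate_odt_config(cfg):
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    for key, allowed in ENUM_FIELDS.items():
        if cfg.get(key) not in allowed:
            problems.append(f"{key}: expected one of {sorted(allowed)}, got {cfg.get(key)!r}")
    for key, (low, high) in RANGE_FIELDS.items():
        value = cfg.get(key)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            problems.append(f"{key}: expected a number in [{low}, {high}], got {value!r}")
    for key in POSITIVE_INT_FIELDS:
        value = cfg.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key}: expected a positive integer, got {value!r}")
    return problems

# The example configuration from this section, as a dictionary.
example_cfg = {
    "enabled": True,
    "split_layer": "before_dense",
    "trainable_layers": 2,
    "training_method": "sgd",
    "learning_rate": 0.001,
    "epochs_per_batch": 5,
    "batch_size": 32,
    "optimizer_state_size": "minimal",
}
issues = validate_odt_config(example_cfg)
```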

8.8.7. Memory Considerations

ODT requires additional MCU memory for:

Trainable weights — typically 1-10 KB
Activations — forward pass outputs (5-50 KB)
Gradients — backprop gradients (5-50 KB)
Optimizer state — learning rate schedules, momentum (2-20 KB)

Total ODT overhead: roughly 13-130 KB (the sum of the ranges above), depending on configuration

Devices with sufficient memory:

MSPM0G5187 (160 KB SRAM) — recommended
CC1312, CC1314, CC1352, CC1354, CC2755 (20-60 KB SRAM) — limited configs
CC35X1 (512 KB SRAM) — full support
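
The per-component ranges above can be added up to bound the ODT overhead and compared with a device's SRAM. The sketch below uses only the figures quoted in this section; the `reserved_kb` parameter, which stands for memory needed by the rest of the application, is a made-up placeholder, and the function names are hypothetical.

```python
# Component ranges in KB, as listed in this section.
ODT_OVERHEAD_KB = {
    "trainable_weights": (1, 10),
    "activations": (5, 50),
    "gradients": (5, 50),
    "optimizer_state": (2, 20),
}

def overhead_range_kb(components=ODT_OVERHEAD_KB):
    """Sum the lower and upper bounds of every component."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high

def fits(sram_kb, overhead_kb, reserved_kb):
    """True if the ODT overhead fits after reserving SRAM for the application."""
    return overhead_kb + reserved_kb <= sram_kb

low_kb, high_kb = overhead_range_kb()
```

For example, with an assumed 20 KB reserved for the application, the upper bound fits in 160 KB of SRAM but not in 60 KB, which matches the "limited configs" note for the smaller devices.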

8.8.8. Limitations

Frozen backbone is immutable — only trainable head adapts
Memory-constrained training — smaller models and batch sizes than PC training
Limited dataset — device collects data incrementally, not large static datasets
No distributed training — single-device training only
Lower numerical precision — quantized weights and activations

8.8.9. Best Practices

1. Choose the right split point

Split after feature extraction, before classification
Frozen part should be robust to environment variations
Trainable part should be small enough for MCU memory

2. Pre-train the frozen backbone thoroughly

Use large diverse training dataset on PC
Ensure backbone generalizes well
Frozen part quality determines ceiling performance

3. Start training early

Begin on-device training shortly after deployment
Device needs representative local data to converge
Don’t wait for accuracy degradation to trigger retraining

4. Monitor training convergence

Log training loss on device (periodic dumps to host)
Stop training if loss plateaus
Retrain with different learning rate if needed
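
A plateau check can be as simple as comparing the best recent loss with the best earlier loss. The sketch below is one possible criterion, not the tool's; the function name `loss_plateaued`, the window size and the relative threshold are assumptions to tune per application.

```python
def loss_plateaued(losses, window=5, min_improvement=0.01):
    """True if the last `window` losses improve on the best earlier loss
    by less than `min_improvement` (relative to that earlier best)."""
    if len(losses) <= window:
        return False  # not enough history to judge
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent < min_improvement * abs(best_before)

# Example loss histories (made-up values).
still_improving = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2]
flat = [1.0, 0.5, 0.5, 0.5, 0.5]
```

With `window=3`, the first history is still improving, while the second has stalled and would be a candidate for stopping or for a retry with a different learning rate.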

5. Use minimal optimizer state when memory is tight

SGD with minimal state (just gradients, no momentum)
Adam requires full optimizer state: two moment estimates per trainable parameter (about 2× memory)
Trade optimizer capability for memory savings
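
The memory trade-off between optimizers can be estimated by counting the per-parameter buffers each one keeps. The sketch below assumes 4-byte values and the textbook formulations of each optimizer (plain SGD keeps no state, momentum and RMSprop keep one buffer, Adam keeps two); the memory layout of the actual on-device implementation may differ, and the names are hypothetical.

```python
# Extra per-parameter buffers kept by each optimizer (textbook formulations).
STATE_BUFFERS = {"sgd": 0, "sgd_momentum": 1, "rmsprop": 1, "adam": 2}

def training_buffer_bytes(num_params, optimizer, bytes_per_value=4):
    """Bytes for weights, gradients, and optimizer state of the trainable head."""
    per_param = 1 + 1 + STATE_BUFFERS[optimizer]  # weights + gradients + state
    return num_params * per_param * bytes_per_value

sgd_bytes = training_buffer_bytes(1000, "sgd")
adam_bytes = training_buffer_bytes(1000, "adam")
```

Under these assumptions, a 1000-parameter head needs 8000 bytes with plain SGD and 16000 bytes with Adam, before counting activations.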

8.8.10. Examples

See the following examples for ODT workflows:

Fan Blade Fault Classification — anomaly detection with on-device adaptation
Motor Bearing Fault — fault detection with environment-specific training

8.8.12. FAQ

Q: Can I train the frozen part on device?

A: No. The frozen part is compiled to MCU code and cannot be modified. Only the trainable head can be updated.

Q: How much accuracy improvement can I expect?

A: Typical improvements: 2-5% accuracy gain after 100-500 device training iterations with local data.

Q: What if the device runs out of memory during training?

A: Reduce batch size, epochs per batch, or trainable layer count. Use minimal optimizer state (SGD instead of Adam).

Q: Can I update the trainable weights remotely?

A: Yes. Export the trained weights from the device, send them to the host, verify them, then send the updated weights back to the device.
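
The transfer format for exported weights is not described here. As an illustration of the "verify" step only, the sketch below assumes int8 weights and a made-up framing (element count, payload, CRC32 of the payload, all little-endian); `pack_weights` and `unpack_weights` are hypothetical helpers, not part of any device or host API.

```python
import struct
import zlib

def pack_weights(q_weights):
    """Serialize int8 weights as: uint32 count, payload, uint32 CRC32 of the payload."""
    payload = struct.pack(f"<{len(q_weights)}b", *q_weights)
    header = struct.pack("<I", len(q_weights))
    return header + payload + struct.pack("<I", zlib.crc32(payload))

def unpack_weights(blob):
    """Return the weights, or raise ValueError if length or checksum does not match."""
    (count,) = struct.unpack_from("<I", blob, 0)
    if len(blob) != 4 + count + 4:
        raise ValueError("unexpected blob length")
    payload = bytes(blob[4:4 + count])
    (crc,) = struct.unpack_from("<I", blob, 4 + count)
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return list(struct.unpack(f"<{count}b", payload))

# Round trip with made-up weights.
blob = pack_weights([12, -127, 0, 64])
received = unpack_weights(blob)
```

A checksum only detects transmission errors; checking that the new weights actually perform well would still need a validation pass on the host.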

Q: Is ODT compatible with NPU inference?

A: Yes. The frozen backbone can run inference on the NPU while the trainable head is trained on the CPU; both are supported together.

8.8.13. Troubleshooting

Training loss not decreasing:

Increase learning rate (start with 0.01)
Ensure local data is representative
Check if trainable head has sufficient parameters

Out of memory during training:

Reduce batch size to 8 or 16
Reduce epochs per batch to 1-3
Use minimal optimizer state (SGD)
Reduce trainable layer count to 1

Inference accuracy dropped after training:

Frozen backbone may not generalize to new data
Retrain frozen backbone on PC with larger dataset
Reduce learning rate to prevent overfitting on small device dataset

8.8.14. Further Reading

For in-depth documentation, see: