7.8.30. Google Speech Command Recognition

12-class keyword spotting from audio, using MFCC features and a DSCNN model on the NPU-equipped MSPM0G5187.

7.8.30.1. Overview

Task: Audio Classification (12-class keyword spotting)
Application: Voice command recognition, keyword spotting
Dataset: Google Speech Commands v0.02 (12-class variant)
Model: DSCNN (Depthwise Separable CNN)
Device: MSPM0G5187 (NPU-accelerated)

7.8.30.2. Keyword Classes

Type	Labels
Known keywords (10)	`down`, `go`, `left`, `no`, `off`, `on`, `right`, `stop`, `up`, `yes`
Unknown	`_unknown_` — all non-keyword words
Silence	`_silence_` — 1-second background noise clips
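
As a minimal sketch of how the 12 classes in the table fit together, the snippet below maps a spoken word to a class label. The helper name `to_class_label` and the ordering of `CLASSES` are assumptions made for illustration; the example's own dataset script may order or name things differently.

```python
# The 10 known keywords from the table above.
KEYWORDS = ("down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes")
UNKNOWN = "_unknown_"
SILENCE = "_silence_"

# Assumed class ordering, for illustration only; the real index order may differ.
CLASSES = KEYWORDS + (UNKNOWN, SILENCE)


def to_class_label(word=None):
    """Map a spoken word to one of the 12 class labels (hypothetical helper).

    Pass None for a background-noise clip that contains no spoken word.
    """
    if word is None:
        return SILENCE
    word = word.lower()
    # Any word outside the 10 keywords falls into the catch-all class.
    return word if word in KEYWORDS else UNKNOWN
```

For example, `to_class_label("yes")` keeps its own label, while a non-keyword such as `"bed"` is folded into `_unknown_`.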

7.8.30.3. Device Support

Device	Notes	Configuration File
`MSPM0G5187`	MSPM0 with NPU	`config_MSPM0.yaml`

7.8.30.4. Running the Example

Step 1: Generate dataset

The dataset must be prepared before training:

cd tinyml-modelzoo/examples/google_speech_command
python generate_dataset.py

This downloads Google Speech Commands v0.02 via TorchAudio and prepares a TensorLab-ready structure under SpeechCommands/classes/.

Step 2: Run training

Linux:

cd tinyml-modelzoo
./run_tinyml_modelzoo.sh examples/google_speech_command/config_MSPM0.yaml

Windows:

cd tinyml-modelzoo
run_tinyml_modelzoo.bat examples\google_speech_command\config_MSPM0.yaml

7.8.30.5. Configuration

common:
  task_type: 'audio_classification'
  target_device: 'MSPM0G5187'

dataset:
  dataset_name: 'google_speech_commands_12class'
  input_data_path: 'https://software-dl.ti.com/C2000/esd/mcu_ai/datasets/google_speech_commands_12class.zip'

data_processing_feature_extraction:
  feature_extraction_name: 'GoogleSpeechCommands_MFCC_Default'

training:
  model_name: 'DSCNN_NPU'
  training_epochs: 20
  batch_size: 64
  learning_rate: 0.1
  weight_decay: 1e-5
  quantization: 2

testing:
  enable: True

compilation:
  enable: True

7.8.30.6. Feature Extraction (MFCC)

MFCCs (Mel Frequency Cepstral Coefficients) compactly represent speech frequency characteristics for keyword spotting.

Parameter	Value
Sampling rate	16000 Hz
Audio duration	1000 ms
Frame length	30 ms
Frame step	20 ms
MFCC coefficients	10
Mel bins	40

Output feature shape: [N, 1, 49, 10] (batch × 1 channel × 49 time frames × 10 MFCCs)
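
The 49 time frames follow from the framing parameters in the table. The sketch below reproduces that arithmetic, assuming the clip is framed without padding so that a trailing partial frame is dropped; the function name `mfcc_feature_shape` is hypothetical and not part of the example's code.

```python
def mfcc_feature_shape(sample_rate_hz=16000, duration_ms=1000,
                       frame_length_ms=30, frame_step_ms=20, num_mfcc=10):
    """Return (channels, time_frames, mfccs) for one clip.

    Assumes no padding: only frames that fit entirely inside the clip count.
    """
    n_samples = sample_rate_hz * duration_ms // 1000         # 16000 samples
    frame_length = sample_rate_hz * frame_length_ms // 1000  # 480 samples
    frame_step = sample_rate_hz * frame_step_ms // 1000      # 320 samples
    n_frames = 1 + (n_samples - frame_length) // frame_step
    return (1, n_frames, num_mfcc)


shape = mfcc_feature_shape()  # (1, 49, 10) with the default parameters
```

With a batch dimension N in front, this matches the [N, 1, 49, 10] shape stated above.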

7.8.30.7. Model: DSCNN

Depthwise Separable CNN splits standard convolution into depthwise (spatial filtering) + pointwise (channel mixing) operations, reducing computation while maintaining accuracy.

Model	Filters	Description
`DSCNN_NPU`	64	Depthwise separable CNN optimized for NPU, 12-class output

Architecture: Conv10×4 / stride 2 → Dropout → (Depthwise3×3 + Pointwise1×1) ×4 → Dropout → AdaptiveAvgPool → FC (12 classes)
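
To make the savings from the depthwise + pointwise split concrete, the sketch below counts weights for one 3×3 block with 64 input and 64 output channels. It assumes each block keeps the channel count at 64 and ignores biases and any normalization layers; the function names are illustrative only.

```python
def standard_conv_weights(k_h, k_w, c_in, c_out):
    """Weights in a standard convolution: every output channel sees every input channel."""
    return k_h * k_w * c_in * c_out


def separable_conv_weights(k_h, k_w, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = k_h * k_w * c_in  # one k_h x k_w filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 3, 64, 64)    # 36864 weights
separable = separable_conv_weights(3, 3, 64, 64)  # 576 + 4096 = 4672 weights
reduction = standard / separable                  # roughly 7.9x fewer weights
```

Because both layer types are applied at every output position, the multiply-accumulate count shrinks by the same ratio under these assumptions.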

7.8.30.8. System Components

Hardware

MSPM0G5187 microcontroller with integrated NPU
Microphone input

Software

Code Composer Studio (CCS) 12.x or later
MSPM0 SDK 2.10.04 or later
Additional Python dependencies: torch, torchaudio, scipy, pydub, numpy

7.8.30.9. Next Steps

Learn about the audio task type: Supported Task Types
Deploy to device: NPU Device Deployment
Browse similar examples: Dynamic Hand Gesture Recognition