7.8.30. Google Speech Command Recognition

Twelve-class keyword spotting on audio, using MFCC features and a DSCNN model running on the NPU-equipped MSPM0G5187.

7.8.30.1. Overview

  • Task: Audio Classification (12-class keyword spotting)

  • Application: Voice command recognition, keyword spotting

  • Dataset: Google Speech Commands v0.02 (12-class variant)

  • Model: DSCNN (Depthwise Separable CNN)

  • Device: MSPM0G5187 (NPU-accelerated)

7.8.30.2. Keyword Classes

Type                 Labels
-------------------  --------------------------------------------------
Known keywords (10)  down, go, left, no, off, on, right, stop, up, yes
Unknown              _unknown_ (all non-keyword words)
Silence              _silence_ (1-second background-noise clips)
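The grouping in the table above can be sketched as a small label-mapping helper. This is an illustrative sketch only: the function name `to_12_class` and the `CLASSES` ordering are hypothetical and are not taken from `generate_dataset.py`. The only outside fact used is that the raw v0.02 archive keeps its noise recordings in a folder named `_background_noise_`.

```python
# The ten target words from the "Known keywords" row.
KEYWORDS = {"down", "go", "left", "no", "off", "on", "right", "stop", "up", "yes"}

# Folder in the raw v0.02 archive that holds the background-noise recordings.
BACKGROUND_DIR = "_background_noise_"


def to_12_class(word: str) -> str:
    """Map a raw dataset folder name to one of the 12 training labels.

    Hypothetical helper: the real preparation script may do this differently.
    """
    if word == BACKGROUND_DIR:
        return "_silence_"
    if word in KEYWORDS:
        return word
    return "_unknown_"  # every other spoken word ("bed", "cat", ...)


# Assumed class ordering: keywords alphabetically, then the two special labels.
CLASSES = sorted(KEYWORDS) + ["_unknown_", "_silence_"]

print(to_12_class("yes"), to_12_class("bed"), to_12_class(BACKGROUND_DIR))
print(len(CLASSES))
```

Because `_unknown_` absorbs every non-keyword word, it can end up with far more clips than any single keyword; whether and how the prepared dataset rebalances it is not stated here.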

7.8.30.3. Device Support

Device      Notes           Configuration File
----------  --------------  ------------------
MSPM0G5187  MSPM0 with NPU  config_MSPM0.yaml

7.8.30.4. Running the Example

Step 1: Generate dataset

The dataset must be prepared before training:

cd tinyml-modelzoo/examples/google_speech_command
python generate_dataset.py

This downloads Google Speech Commands v0.02 via TorchAudio and prepares a TensorLab-ready structure under SpeechCommands/classes/.

Step 2: Run training

# Paths below are relative to the directory that contains tinyml-modelzoo
cd tinyml-modelzoo
./run_tinyml_modelzoo.sh examples/google_speech_command/config_MSPM0.yaml

7.8.30.5. Configuration

common:
  task_type: 'audio_classification'
  target_device: 'MSPM0G5187'

dataset:
  dataset_name: 'google_speech_commands_12class'
  input_data_path: 'https://software-dl.ti.com/C2000/esd/mcu_ai/datasets/google_speech_commands_12class.zip'

data_processing_feature_extraction:
  feature_extraction_name: 'GoogleSpeechCommands_MFCC_Default'

training:
  model_name: 'DSCNN_NPU'
  training_epochs: 20
  batch_size: 64
  learning_rate: 0.1
  weight_decay: 1e-5
  quantization: 2

testing:
  enable: True

compilation:
  enable: True

7.8.30.6. Feature Extraction (MFCC)

MFCCs (Mel-frequency cepstral coefficients) are a compact representation of the frequency characteristics of speech and are widely used as input features for keyword spotting.

Parameter          Value
-----------------  --------
Sampling rate      16000 Hz
Audio duration     1000 ms
Frame length       30 ms
Frame step         20 ms
MFCC coefficients  10
Mel bins           40

Output feature shape: [N, 1, 49, 10] (batch × 1 channel × 49 time frames × 10 MFCCs)
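The 49 time frames follow from the framing parameters. A minimal sketch of the arithmetic, assuming the clip is not padded and only complete frames are kept (the helper name `mfcc_frame_count` is hypothetical, not part of the toolchain):

```python
def mfcc_frame_count(sample_rate_hz: int, duration_ms: int,
                     frame_ms: int, step_ms: int) -> int:
    """Number of complete analysis frames in a clip (no padding assumed)."""
    total_samples = sample_rate_hz * duration_ms // 1000  # 16000 samples
    frame_samples = sample_rate_hz * frame_ms // 1000     # 480 samples
    step_samples = sample_rate_hz * step_ms // 1000       # 320 samples
    return 1 + (total_samples - frame_samples) // step_samples


frames = mfcc_frame_count(16000, 1000, 30, 20)
feature_shape = (1, frames, 10)  # (channels, time frames, MFCC coefficients)
print(frames, feature_shape)     # 49 (1, 49, 10)
```

With a 30 ms frame and a 20 ms step, consecutive frames overlap by 10 ms; the last 10 ms of the clip do not fill another complete frame, which is why the count is 49 rather than 50.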

7.8.30.7. Model: DSCNN

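A depthwise separable CNN splits each standard convolution into two steps: a depthwise convolution (spatial filtering applied to each channel separately) followed by a pointwise 1×1 convolution (channel mixing). This reduces the weight count and computation considerably, typically with only a small loss of accuracy.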

Model      Filters  Description
---------  -------  ----------------------------------------------------------
DSCNN_NPU  64       Depthwise separable CNN optimized for NPU, 12-class output

Architecture: Conv 10×4 (stride 2) → Dropout → 4 × (Depthwise 3×3 + Pointwise 1×1) → Dropout → AdaptiveAvgPool → FC (12 classes)
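To give a feel for the saving from the depthwise + pointwise split, the sketch below counts weights for one 3×3 block. It assumes 64 input and 64 output channels (taken from the "Filters" column) and ignores biases and any normalization layers; the function names are illustrative, and the actual DSCNN_NPU layer definitions may differ.

```python
def standard_conv_weights(c_in: int, c_out: int, kh: int, kw: int) -> int:
    """Weights in an ordinary convolution: one kh x kw kernel per (in, out) pair."""
    return c_in * c_out * kh * kw


def separable_conv_weights(c_in: int, c_out: int, kh: int, kw: int) -> int:
    """Weights in a depthwise (kh x kw per channel) + pointwise (1x1) pair."""
    depthwise = c_in * kh * kw  # one spatial kernel per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Assumed channel counts: 64 in, 64 out, 3x3 kernel.
std = standard_conv_weights(64, 64, 3, 3)
sep = separable_conv_weights(64, 64, 3, 3)
print(std, sep, round(std / sep, 2))  # 36864 4672 7.89
```

Under these assumptions each separable block needs roughly one eighth of the weights of a standard 3×3 convolution, and the multiply-accumulate count per output position scales the same way.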

7.8.30.8. System Components

Hardware

  • MSPM0G5187 microcontroller with integrated NPU

  • Microphone input

Software

  • Code Composer Studio (CCS) 12.x or later

  • MSPM0 SDK 2.10.04 or later

  • Additional Python dependencies: torch, torchaudio, scipy, pydub, numpy

7.8.30.9. Next Steps