7.8.30. Google Speech Command Recognition
12-class keyword spotting from audio, using MFCC features and a DSCNN model on the MSPM0G5187 with NPU.
7.8.30.1. Overview
Task: Audio Classification (12-class keyword spotting)
Application: Voice command recognition, keyword spotting
Dataset: Google Speech Commands v0.02 (12-class variant)
Model: DSCNN (Depthwise Separable CNN)
Device: MSPM0G5187 (NPU-accelerated)
7.8.30.2. Keyword Classes
| Type | Labels |
|---|---|
| Known keywords (10) | |
| Unknown | |
| Silence | |
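As a rough illustration of how raw Speech Commands labels collapse into 12 classes, the sketch below maps any word outside the known set to "unknown". The keyword list is an assumption: it is the ten-command set commonly used for the 12-class benchmark, not a list confirmed by this example. The names `KNOWN_KEYWORDS` and `to_12_class` are hypothetical.

```python
# ASSUMPTION: the ten commands commonly used for the 12-class benchmark.
# The labels this example actually uses are not listed in this section.
KNOWN_KEYWORDS = ("yes", "no", "up", "down", "left", "right",
                  "on", "off", "stop", "go")
CLASS_NAMES = KNOWN_KEYWORDS + ("unknown", "silence")


def to_12_class(label, is_silence=False):
    """Map a raw dataset label to one of the 12 class names (hypothetical helper)."""
    if is_silence:
        return "silence"
    # Any word that is not a known keyword is folded into "unknown".
    return label if label in KNOWN_KEYWORDS else "unknown"


if __name__ == "__main__":
    for word in ("yes", "marvin", "stop"):
        print(word, "->", to_12_class(word))
    print("number of classes:", len(CLASS_NAMES))
```

Folding all other words into a single "unknown" class keeps the output layer small while still letting the model reject out-of-vocabulary speech.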
7.8.30.3. Device Support
| Device | Notes | Configuration File |
|---|---|---|
| | MSPM0 with NPU | |
7.8.30.4. Running the Example
Step 1: Generate dataset
The dataset must be prepared before training:
cd tinyml-modelzoo/examples/google_speech_command
python generate_dataset.py
This downloads Google Speech Commands v0.02 via TorchAudio and prepares a TensorLab-ready structure under SpeechCommands/classes/.
Step 2: Run training
Linux:
cd tinyml-modelzoo
./run_tinyml_modelzoo.sh examples/google_speech_command/config_MSPM0.yaml
Windows:
cd tinyml-modelzoo
run_tinyml_modelzoo.bat examples\google_speech_command\config_MSPM0.yaml
7.8.30.5. Configuration
common:
  task_type: 'audio_classification'
  target_device: 'MSPM0G5187'
dataset:
  dataset_name: 'google_speech_commands_12class'
  input_data_path: 'https://software-dl.ti.com/C2000/esd/mcu_ai/datasets/google_speech_commands_12class.zip'
data_processing_feature_extraction:
  feature_extraction_name: 'GoogleSpeechCommands_MFCC_Default'
training:
  model_name: 'DSCNN_NPU'
  training_epochs: 20
  batch_size: 64
  learning_rate: 0.1
  weight_decay: 1e-5
  quantization: 2
testing:
  enable: True
compilation:
  enable: True
7.8.30.6. Feature Extraction (MFCC)
MFCCs (Mel Frequency Cepstral Coefficients) compactly represent speech frequency characteristics for keyword spotting.
| Parameter | Value |
|---|---|
| Sampling rate | 16000 Hz |
| Audio duration | 1000 ms |
| Frame length | 30 ms |
| Frame step | 20 ms |
| MFCC coefficients | 10 |
| Mel bins | 40 |
Output feature shape: [N, 1, 49, 10] (batch × 1 channel × 49 time frames × 10 MFCCs)
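The 49 time frames follow from the framing parameters above. A minimal sketch of the arithmetic, assuming the trailing partial frame is dropped rather than padded (the function name `mfcc_feature_shape` is hypothetical):

```python
def mfcc_feature_shape(sample_rate_hz=16000, duration_ms=1000,
                       frame_length_ms=30, frame_step_ms=20, num_mfcc=10):
    """Return (channels, time_frames, mfcc_coefficients) for one clip."""
    num_samples = sample_rate_hz * duration_ms // 1000       # 16000 samples
    frame_length = sample_rate_hz * frame_length_ms // 1000  # 480 samples
    frame_step = sample_rate_hz * frame_step_ms // 1000      # 320 samples
    # ASSUMPTION: only complete frames are kept (no padding at the end).
    num_frames = 1 + (num_samples - frame_length) // frame_step
    return (1, num_frames, num_mfcc)


if __name__ == "__main__":
    print(mfcc_feature_shape())  # (1, 49, 10)
```

With a batch dimension N in front, this matches the [N, 1, 49, 10] shape stated above.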
7.8.30.7. Model: DSCNN
A depthwise separable CNN splits a standard convolution into two steps: a depthwise convolution (spatial filtering, one filter per channel) followed by a pointwise 1×1 convolution (channel mixing). This reduces computation while maintaining accuracy.
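To make the saving concrete, the sketch below compares weight counts for a standard 3×3 convolution and its depthwise separable equivalent at 64 input and 64 output channels, the filter count listed for this model. Biases and any normalization parameters are ignored, and the function names are hypothetical.

```python
def standard_conv_weights(k_h, k_w, c_in, c_out):
    """Weights in a standard convolution: every output channel sees every input channel."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_weights(k_h, k_w, c_in, c_out):
    """Weights in a depthwise (one k_h x k_w filter per input channel) plus pointwise 1x1 pair."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    std = standard_conv_weights(3, 3, 64, 64)
    sep = depthwise_separable_weights(3, 3, 64, 64)
    print("standard:", std)             # 36864
    print("depthwise separable:", sep)  # 4672
    print("reduction: %.1fx" % (std / sep))
```

The multiply-accumulate count per output position shrinks by the same factor, since each weight is applied once per output pixel in both cases.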
| Model | Filters | Description |
|---|---|---|
| | 64 | Depthwise separable CNN optimized for NPU, 12-class output |
Architecture: Conv 10×4 (stride 2) → Dropout → (Depthwise 3×3 + Pointwise 1×1) × 4 → Dropout → AdaptiveAvgPool → FC (12 classes)
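The layer sequence above can be traced as tensor shapes for a single [1, 49, 10] MFCC input. This is only a sketch under stated assumptions: the padding values, the stride of 1 in the depthwise layers, and pooling to a 1×1 output are not given in this section and are chosen here for illustration. Dropout is omitted because it does not change shapes.

```python
def conv_out(size, kernel, stride, padding):
    """Standard convolution output-size formula along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_dscnn_shapes(height=49, width=10, filters=64, num_classes=12):
    """Return a list of (layer_name, output_shape) pairs; shapes are (C, H, W)."""
    shapes = []
    # ASSUMPTION: padding of (5, 1) on the first convolution.
    h = conv_out(height, kernel=10, stride=2, padding=5)
    w = conv_out(width, kernel=4, stride=2, padding=1)
    shapes.append(("conv 10x4 stride 2", (filters, h, w)))
    for i in range(1, 5):
        # ASSUMPTION: depthwise layers use stride 1 and padding 1 (size-preserving).
        h = conv_out(h, kernel=3, stride=1, padding=1)
        w = conv_out(w, kernel=3, stride=1, padding=1)
        shapes.append(("depthwise 3x3 #%d" % i, (filters, h, w)))
        shapes.append(("pointwise 1x1 #%d" % i, (filters, h, w)))
    # ASSUMPTION: adaptive average pooling reduces each channel to 1x1.
    shapes.append(("adaptive avg pool", (filters, 1, 1)))
    shapes.append(("fc", (num_classes,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_dscnn_shapes():
        print("%-22s %s" % (name, shape))
```

Under these assumptions the first convolution reduces the 49×10 input to 25×5, and the four depthwise separable blocks keep that size until pooling.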
7.8.30.8. System Components
Hardware
MSPM0G5187 microcontroller with integrated NPU
Microphone input
Software
Code Composer Studio (CCS) 12.x or later
MSPM0 SDK 2.10.04 or later
Additional Python dependencies:
torch, torchaudio, scipy, pydub, numpy
7.8.30.9. Next Steps
Learn about audio task type: Supported Task Types
Deploy to device: NPU Device Deployment
Browse similar examples: Dynamic Hand Gesture Recognition