SimpleLink SDK Voice Detection Plugin User's Guide

Table of Contents

Introduction

The SimpleLink SDK Voice Detection Plugin enables a user to perform voice activity detection (VAD) and speaker-dependent voice command detection (VCD). Easy-to-use APIs are provided that can enable voice applications and enhance existing applications with keyword recognition. The plugin is implemented in fixed-point C for ultra-low power. The VCD library is also configurable for each application, allowing the user to control or minimize the resources used. This plugin can be utilized in a variety of applications that require speaker-dependent keyword recognition or a hands-free user interface. The range of applications can be further expanded by using this plugin in conjunction with other SDK plugins, for example using the SimpleLink MSP432 SDK Bluetooth Plugin for keyword recognition in an IoT application.

Prior to using a speaker-dependent speech recognizer, the user must train the recognizer with each phrase to recognize by speaking the phrase several times. During training, the recognizer creates a model of each phrase to use during the recognition process. The performance of the recognizer is thus tied to the speaker that trained each phrase. If another speaker tries to use the recognizer, performance will likely degrade due to differences in the way the phrase is spoken.

Resources in addition to the plugin include an example project illustrating the use of the libraries, which is explained further in the Understanding the Voice Detection Example section.

Prerequisites

Understanding the Voice Detection Plugin

Folder Structure

The folder structure for the SimpleLink MSP432 SDK Voice Detection Plugin is made to complement the standard folder structure that the platform SDK adheres to. From the root directory, the following folders are available:

VCD Features

Library Features

The VCD Library software provides a comprehensive set of APIs for speaker-dependent speech recognition for a wide variety of embedded applications. VCD APIs allow an application to perform the following operations.

Performance Features

VCD contains features to balance performance with available processing resources.

VCD Concepts

Using VCD in an application typically involves four main steps:

1) Initializing the VCD instance

2) Enrolling (creating) phrase models

3) Training existing phrase models by updating the model parameters with additional data

4) Performing recognition

Initializing VCD

The application starts using VCD by initializing the recognizer. Initialization sets up the recognizer instance for use. It requires the user to provide the number of models, if any, that are stored in memory and a pointer array specifying the memory address of where model data is stored. Model memory will usually be non-volatile, such as flash or FRAM. If flash memory is used, the starting location must be on a flash sector boundary, and the maximum size of a model must be a multiple of the sector size.

Initialization also requires the user to provide the location and size of RAM that VCD can use for processing. This memory is divided into a small amount of memory that must persist during the use of VCD, and a larger amount of processing memory used only during model creation, update, and recognition. The application can reuse the processing memory when VCD is not performing one of these operations.

The example project includes code that illustrates and assists in initializing the models and properly setting up the RAM memory. This is explained in more detail in the Implementation Requirements section of this document.

Enrolling Phrase Models

The application must create models of the phrases to be recognized. It does this by prompting the user to speak a phrase and passing the speech data to the VCD API, which creates an initial model for the phrase. A block diagram of the process is shown below. Within VCD, the front-end processing converts the sampled audio data into a representation of the signal. When the audio activity detector locates speech, it triggers the creation of the model. A delay compensates for the time it takes the activity detector to respond. The created model is stored in model memory through an application-defined function.

This process must be repeated to create each phrase model to be recognized. The VCD library provides APIs to check the duration of a potential model and a quality measure that estimates how well the model will perform. An application can use these to decide whether to actually create the model, or to guide the user toward providing a better one.

Creating a Model

Updating Phrase Models

The application can update a specific model by prompting the user to identify the phrase to update and then to speak that phrase again. The speech data is passed to the VCD API to update the phrase model. The processing, as shown below, is similar to creating a model, except that the phrase model is retrieved from memory and merged with the new speech data to produce an improved model. The model can be updated any number of times to further improve it; typically at least two or three updates should be performed.

Updating a Model

Recognizing Phrases

The application uses VCD with the trained models to perform recognition of the phrases as shown below. The application does this by capturing audio data and passing it to the VCD API that searches for the presence of the trained phrases in the audio data. Upon locating one of the phrases in the audio data, VCD notifies the application. After notification, the application can take action based on the recognition result. It can also use VCD APIs to immediately continue the recognition search for another phrase.

Recognizing Speech

The VCD recognizer searches the incoming audio data for occurrences of one of the phrase models. The search engine block diagram below illustrates how VCD performs the search. The search starts when the audio activity detector finds a signal other than background noise. Due to detector delay, the audio signal at the beginning of the search contains a small amount of background noise. After the initial background noise, the search engine checks for an audio signal of any duration that does not match any of the model phrases, for example speech other than the model phrases. It then searches for speech that matches one of the model phrases. At the same time, as illustrated at the bottom of the figure, it also searches for an audio signal that does not contain any of the model phrases. The search engine also determines whether there is additional audio after one of the model phrases. The search terminates when the input signal returns to background noise or, depending on option settings, when it finds the end of a portion of the signal corresponding to one of the model phrases. The available search options are described in Run-time Options.

Search Engine Block Diagram

Using the VCD Library

As with any speech recognizer, the application designer must carefully consider many system-level issues to obtain the best performance. This section provides some details about the VCD Library and discusses practical steps to obtain good recognition performance. It also provides specific design requirements that an application must meet to use VCD.

Implementation Requirements

The VCD Library requires the user to control how models will be stored and managed in memory. The application must handle the memory management of models during VCD initialization, enrollment, update, and deletion. The demo project includes a source file (vcd_user.c) with implementations for these APIs using flash memory. For more detail on how vcd_user.c implements these functions, please read the Understanding the Voice Detection Example section.

For more detail on the specific requirements for the user-defined APIs, see the VCD API.

Initialization

The user must create an instance of and maintain a VCD_Handle object that holds VCD recognizer state data. All VCD operations must be performed through this handle.

A VCD_ConfigStruct must also be created and filled with model and memory information. For model data to persist between power cycles, the user must provide the configuration structure with the number of models that are stored in memory along with a list of addresses that point to where the model data is stored.

Pointers to the recognizer instance and the configuration structure must be passed to the VCD_init() function, which will initialize all parameters of the recognizer instance to default values. This must be the first function called when starting to use VCD.
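The initialization flow just described can be sketched as follows. The stand-in types, field names, and return value below are assumptions for illustration only; the real VCD_Handle, VCD_ConfigStruct, and VCD_init() declarations come from the plugin headers, and vcd_user.c in the example project shows a complete setup.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real VCD types; field names are assumptions. */
typedef struct { int ready; } VCD_Handle;
typedef struct {
    int_least16_t   numModels;       /* models already stored in memory */
    int_least16_t **modelAddresses;  /* where each stored model lives   */
} VCD_ConfigStruct;

/* Stand-in for the real VCD_init(): sets the instance to default values. */
static int VCD_init(VCD_Handle *vh, const VCD_ConfigStruct *cfg)
{
    if (vh == NULL || cfg == NULL || cfg->numModels < 0) {
        return -1;
    }
    vh->ready = 1;  /* the real init sets all recognizer defaults here */
    return 0;
}

int vcdSetupExample(void)
{
    /* Address list for stored models, e.g. flash sector addresses. */
    static int_least16_t *modelAddrs[4];
    VCD_ConfigStruct cfg = { 0, modelAddrs };  /* no models enrolled yet */
    VCD_Handle handle;                         /* holds recognizer state */
    return VCD_init(&handle, &cfg);            /* must be the first VCD call */
}
```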

Enrollment

The prototype for the function that VCD calls during model enrollment is:

VCD_MessageEnum VCD_writeModel( VCD_Handle vh, int_least16_t *ptrModel, int_least32_t mSize, int_least16_t **mAddress);

VCD_writeModel() will be called by the VCD Library during model enrollment to handle storing the model data in memory (usually non-volatile memory, such as flash or FRAM). VCD provides the function with ptrModel, a pointer to the beginning of the model data to be written, and mSize, the size of the model data. The user must return the memory address where the model was written in mAddress. The function should return a VCD_MessageEnum value reflecting the result of writing the model to memory.
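A minimal implementation of this callback might look like the sketch below. It stores models in a RAM array so the example is self-contained; the demo's vcd_user.c writes to flash instead, and the enum value names here are assumptions standing in for the real VCD_MessageEnum values.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real definitions come from the plugin headers. */
typedef void *VCD_Handle;
typedef enum { VCD_MsgNone, VCD_MsgFail } VCD_MessageEnum;

#define MAX_MODELS      4
#define MAX_MODEL_WORDS 256

/* Simulated non-volatile model storage (the demo uses flash sectors). */
static int_least16_t modelStore[MAX_MODELS][MAX_MODEL_WORDS];
static int numModelsStored = 0;

/* Called by the VCD Library during enrollment: store the model data
   and report back where it was written. */
VCD_MessageEnum VCD_writeModel(VCD_Handle vh, int_least16_t *ptrModel,
                               int_least32_t mSize, int_least16_t **mAddress)
{
    (void)vh;
    if (numModelsStored >= MAX_MODELS || mSize > MAX_MODEL_WORDS) {
        return VCD_MsgFail;
    }
    memcpy(modelStore[numModelsStored], ptrModel,
           (size_t)mSize * sizeof(int_least16_t));
    *mAddress = modelStore[numModelsStored];  /* tell VCD where the model lives */
    numModelsStored++;
    return VCD_MsgNone;
}
```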

Update

The prototype for the function that VCD calls during model update is:

VCD_MessageEnum VCD_updateModel(VCD_Handle vh, int_least16_t *ptrModel, int_least32_t mSize, int_least16_t *mAddress);

VCD_updateModel() will be called by the VCD Library during model update to handle storing the updated model data. VCD expects the user to overwrite the original model with the new updated model data so that the address of the model stays the same. VCD provides the function with ptrModel, a pointer to the beginning of the new model data to be written; mSize, the size of the model data; and mAddress, the address where the user stored the original model data. The function should return a VCD_MessageEnum value reflecting the result of updating the model in memory.

Delete

The prototype for the function that VCD calls for deleting a model is:

VCD_MessageEnum VCD_clearModel(int_least16_t *mAddress);

VCD_clearModel() will be called during model deletion to handle clearing the model data from memory. The VCD library will provide it the address of the model to delete, mAddress. The function should return a VCD_MessageEnum value depending on the results of clearing the model from memory.
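The deletion callback follows the same pattern. The sketch below clears a RAM buffer so it is self-contained; the demo's flash-based vcd_user.c would erase the model's flash sectors instead, and the enum value names are assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real VCD_MessageEnum. */
typedef enum { VCD_MsgNone, VCD_MsgFail } VCD_MessageEnum;

/* Maximum model size in int_least16_t words (illustrative value). */
#define MAX_MODEL_WORDS 256

/* Called during model deletion: clear the model data at mAddress. */
VCD_MessageEnum VCD_clearModel(int_least16_t *mAddress)
{
    if (mAddress == NULL) {
        return VCD_MsgFail;
    }
    memset(mAddress, 0, MAX_MODEL_WORDS * sizeof(int_least16_t));
    return VCD_MsgNone;
}
```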

Practical System Issues

Libraries Used

The Voice Detection plugin includes and uses a separate Voice Activity Detection (VAD) library. The VAD library can be used as a stand-alone library to detect changes in audio signal levels. API documentation covers both of these libraries.

The VCD Library utilizes the Driver Library (DriverLib). This is included in the SimpleLink MSP432 SDK.

The VCD demo project illustrates how to use these libraries when building an application. The demo also includes source files (vcd_user.c and vcd_user.h) that help an application easily set up and specify the RAM allocated to the recognizer.

Power Consumption

VCD relies on the application to capture audio samples, typically from an ADC peripheral, and pass them to VCD APIs during model creation, update, and recognition. This data capture can represent a large portion of the power consumption. If an application does not require VCD processing during part of its operation, then it can reduce power by stopping audio data capture during that time. This is often done in a “push-to-talk” scenario, requiring the user to indicate when recognition should be performed.

VCD also reduces processor power consumption by using audio activity detection to limit processing when only background noise is present.

Estimates of current memory and power requirements of the example demo application can be found in the Release Notes.

Recognition Performance

Speech recognition in real-world environments is always a challenge. While the VCD recognizer implements several algorithms to address these issues, there are many practical steps to take that can improve performance.

Good recognition performance depends on the quality of the speech signal. Prior to using VCD, ensure that the captured audio data is not corrupted by interference or electrical noise, and that the audio signal is not clipped or otherwise distorted. The gain of the audio channel should be set so that the loudest expected speech signal rarely causes clipping. Introducing automatic gain control is not recommended.

Users should be instructed to speak clearly at a comfortable speaking rate.

Phrase model creation and training should be done in a typical, fairly quiet environment.

Sometimes a user may find that the recognizer has difficulty recognizing a specific phrase. Often this can be improved by additional training of the phrase. If this does not correct the issue, the application should use the VCD APIs to allow the user to delete the model for that phrase, and re-create and re-train the model.

Speech recognizers typically have difficulty with confusion between similar sounding or rhyming words. To improve performance, the recognition phrases should be as distinct as possible. If possible, encourage use of phrases that contain several syllables with different sounding vowels. Longer phrases will work better. VCD checks during model creation to make sure a phrase is not too short, and provides a quality measure to indicate the likelihood of good performance of a phrase. The minimum length of a valid phrase is an adjustable parameter, which an application can increase. The maximum length of a phrase is set during the application build process, and depends on the amount of memory allocated to hold model data. If a phrase is provided during model creation that has a marginal quality measure, the user should be encouraged to try enunciating the phrase more carefully, or to choose a different phrase that provides a better quality measure.

Audio Data Capture

Audio data must be captured at an 8kHz sampling rate. The sampled data must be in 2’s complement int_least16_t format. VCD expects an array of captured data to be provided every 20 milliseconds (160 samples in the array). Typically this capture is implemented using an ADC peripheral and a DMA channel. The example application code contains an audio collection API that illustrates data capture.
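The required capture format can be expressed in a few constants, with a ping-pong buffer as one possible way to hand a completed frame to VCD while the DMA fills the next one. The buffering scheme below is an illustration; the example project's audio collection API shows a concrete implementation.

```c
#include <stdint.h>

/* Audio format required by VCD: 8 kHz sampling, 2's complement
   int_least16_t samples, delivered in 20 ms frames. */
#define VCD_SAMPLE_RATE_HZ  8000
#define VCD_FRAME_MS        20
#define VCD_FRAME_SAMPLES   ((VCD_SAMPLE_RATE_HZ / 1000) * VCD_FRAME_MS)  /* 160 */

/* Ping-pong buffers: the DMA fills one frame while the application
   hands the other to the VCD APIs. */
static int_least16_t frameBuf[2][VCD_FRAME_SAMPLES];
static volatile int fillIndex = 0;  /* buffer the DMA is currently filling */

/* Returns the frame that is ready for VCD processing. */
int_least16_t *vcd_readyFrame(void)
{
    return frameBuf[1 - fillIndex];
}
```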

Model Data Storage Requirements

Phrase model data is stored in units of int_least16_t words. Each model requires overhead of 6 words plus the number of words needed to store the model name. The model name is a string of up to 16 ASCII characters including the trailing NULL. The model phrase data requires 16 words per 20 milliseconds of speech in the phrase. The application is responsible for providing the memory to hold the model data, as well as information regarding the maximum number of models and maximum model size that VCD should support.

If models are stored in flash, each model must start on a sector boundary and each model consumes a multiple of the sector size. The number of sectors needed for each model depends on the maximum size of a phrase model.
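Using the figures above (6 overhead words, the words holding the name, and 16 words per 20 milliseconds of speech), storage needs can be estimated as in the sketch below. How the name string is packed into 16-bit words is not specified here, so it is left as an input; the 4096-byte sector size used in the test is an assumption to check against the device datasheet.

```c
#include <stdint.h>

/* Estimate model storage in int_least16_t words: 6 overhead words,
   plus the words holding the (up to 16-char) name, plus 16 words per
   20 ms of speech in the phrase. */
int_least32_t modelWords(int_least32_t nameWords, int_least32_t phraseMs)
{
    return 6 + nameWords + 16 * (phraseMs / 20);
}

/* Round a maximum model size in bytes up to whole flash sectors, since
   each model must start on a sector boundary and consume whole sectors. */
int_least32_t modelSectors(int_least32_t maxModelBytes, int_least32_t sectorBytes)
{
    return (maxModelBytes + sectorBytes - 1) / sectorBytes;
}
```

For example, a 1-second phrase whose name occupies 8 words needs 6 + 8 + 16 * 50 = 814 words (1628 bytes), which fits in one 4096-byte sector.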

The example application provided with the VCD Library includes header and source files (vcd_user.h and vcd_user.c) that can be used to configure the number of models, model size, and allocate model memory.

Processing Memory Requirements

During operation, VCD requires some processing memory which is dependent on the number and size of models. The application is responsible for providing this memory. The memory is divided into two types. Persistent processing memory must not be altered outside of the VCD Library function calls as long as the library is being used. Temporary memory is only needed while the application is performing enrollment, update, or recognition. Otherwise the application may reuse the temporary memory. The amount of persistent memory required is fairly small, but the temporary memory can become large depending on the size and number of models. The release notes describe the processing memory requirements.

The example application provided with the VCD Library includes header and source files (vcd_user.h and vcd_user.c) that can be used to easily configure and allocate the processing memory.

Model Naming and Indices

VCD requires each model to be labeled with a unique C string name when created. The maximum length of a name is 16 characters including the trailing NULL. Names are case sensitive. The model name “Filler” is reserved for a unique model.

VCD assigns each model an index number. The unique “Filler” model is always index 0. This model cannot be deleted. All other models always have sequential indices starting at 1. This means that the index associated with a model may change. For example, if a model is deleted all indices will be adjusted to remain sequential. Indices are typically used to enumerate all existing models while model names are used to access a particular model.
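This index bookkeeping can be illustrated with a small sketch (pure illustration, not plugin code): index 0 is reserved for “Filler” and cannot be deleted, and deleting any other model shifts later models down so that indices stay sequential.

```c
#include <string.h>

#define MAX_MODELS 8

/* Illustrative model list: "Filler" is always index 0. */
static const char *names[MAX_MODELS] = { "Filler", "Lights", "Fan", "Door" };
static int count = 4;

/* Delete a model by index; later models shift down to keep indices
   sequential, so an index may refer to a different model afterwards. */
void deleteModel(int idx)
{
    if (idx <= 0 || idx >= count) {
        return;  /* index 0 ("Filler") cannot be deleted */
    }
    for (int i = idx; i < count - 1; i++) {
        names[i] = names[i + 1];
    }
    count--;
}
```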

Number of Phrase Models Supported

The number of phrase models that VCD can support is limited by the available memory for models and processing, as well as the platform processing power. The processing load is a complex function of the size and similarity of the phrase models, the environment in which the recognition takes place, how closely the input signal matches the models, and the settings of run-time parameters. Some experimentation usually must be performed to ensure that the processing can be accomplished in real-time.

Run-time Options

VCD allows setting several options at run-time to configure the recognizer for a particular application and to tune the recognizer to balance recognition performance with processing capabilities. All parameters are initially set to defaults, which should be acceptable in the majority of applications. Altering most of the parameters requires an understanding of speech recognition and in particular the VCD processing algorithms. However, some of the parameters which may be of interest to most application designers are described in this section. For further information on all options, see the VCD API guide.

minModelStates (default: 20, i.e. 400 milliseconds): Sets the minimum duration of a phrase model in 20 millisecond units. VCD will not create a model if the phrase length is shorter than this parameter.

maxModelStates (default: 125, i.e. 2500 milliseconds): Sets the maximum duration of a phrase model in 20 millisecond units, which can be used to limit memory storage for phrase models. It must not be greater than the value set when configuring and allocating phrase model memory.

waitCommit (default: False): If set, VCD will wait until it detects that speech has finished before it outputs a recognition result. Otherwise, VCD outputs a recognition result as soon as a model phrase matching the input speech is located during the recognition search.

keyword (default: N/A): A pointer to the name of a model that should be used as a special “keyword”. If this parameter points to a valid phrase model name, and waitCommit is not set, then VCD treats this model specially during the recognition search. A match to the keyword model is the only recognition result that will be output as soon as it is located while the recognizer is still searching for other models. This can be valuable when VCD is used in a keyword spotting application. For example, if the keyword phrase is “Hello My Application”, and another model phrase is “Hello”, then VCD will not immediately terminate the search after “Hello” is located, but will continue until it locates “Hello My Application”, or only one other model has matched the input speech, or speech has ended.

iMaxDelta (default: 4000): While the implementation of this parameter is complex, it essentially adjusts a trade-off between the amount of processing used during the recognition search and recognition accuracy. The best value depends on the application and phrase models, so some experimentation is necessary. Values larger than the default reduce processor load but can decrease recognition accuracy; smaller values increase processor load but may improve accuracy.

vadAlpha (default: 31128): A parameter of the VAD library that adjusts sensitivity to the audio signal. It is a positive number no greater than 32767. Setting this parameter to a larger value requires a speech signal to be consistently above the background noise level for a longer period of time in order to start the recognizer search.

vadSigSNRThresh (default: 1532): A parameter of the VAD library that adjusts sensitivity to the audio signal. It is a positive number no greater than 32767. Setting this parameter to a larger value requires a speech signal to be further above the background noise level in order to start the recognizer search.
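As an illustration, the defaults above can be collected into a parameter structure like the one below. The structure and field grouping are a sketch only; the actual mechanism for reading and setting options is defined by the VCD API guide and may differ.

```c
#include <stdint.h>
#include <stddef.h>

/* Run-time option defaults gathered into a struct for illustration. */
typedef struct {
    int_least16_t  minModelStates;   /* 20  -> 400 ms minimum phrase    */
    int_least16_t  maxModelStates;   /* 125 -> 2500 ms maximum phrase   */
    int            waitCommit;       /* 0: report a match immediately   */
    const char    *keyword;          /* NULL: no special keyword model  */
    int_least16_t  iMaxDelta;        /* 4000: search-effort trade-off   */
    int_least16_t  vadAlpha;         /* 31128: VAD time sensitivity     */
    int_least16_t  vadSigSNRThresh;  /* 1532: VAD level sensitivity     */
} VcdRuntimeOptions;

static const VcdRuntimeOptions kDefaults = {
    20, 125, 0, NULL, 4000, 31128, 1532
};
```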

Timeouts

During model creation VCD will timeout after four seconds if no speech is found.

During model update VCD will report an update failure if it is taking too long to complete enrollment. This length of time is the length of the model plus four seconds. Generally this will occur if speech is not found within the first four seconds.

During recognition processing, if the search engine runs for ten seconds without finding any recognition result, it will be forced to commit a result as a type of watchdog timeout. The search engine can then be restarted to continue recognition if desired.

The Voice Detection Example

The Voice Detection Plugin includes a complete example program that illustrates the use of the VCD and VAD libraries through a menu-driven application that can enroll, update, and delete models, and can recognize speech in a continuous mode. This section describes the hardware and software operation of the example.

Example Software

It is assumed that the SimpleLink MSP432 SDK has already been installed in a directory we will refer to as MSP432_SDK_INSTALL_DIR. If you have not already done so, refer to the SimpleLink MSP432 SDK User’s Guide for instructions.

In addition to the libraries used by the plugin, the VCD example application utilizes the Graphics Library for creating graphical interfaces on the display. The Graphics Library is already included with the MSP432 SDK and the files that are needed are included in the example applications, so no further files need to be downloaded.

Example Hardware

The example code runs on an MSP432P401R LaunchPad.

The example can use one of two displays (the example code must be built separately for each display type):

1) 430BOOST-SHARP96 Sharp Memory LCD BoosterPack. The user interface menus are shown on the display, and the two user buttons on the LaunchPad are used for navigating the menus and selecting menu actions. The hardware configuration is shown below:

Sharp Stack Configuration

2) BOOSTXL-K350QVG-S1 Kentec QVGA Display BoosterPack. The user interface menus may be navigated and selected either by touch or using the two LaunchPad user buttons. The hardware configuration is shown below:

Kentec Stack Configuration

The following steps must be taken to use the Kentec display and touch control:

Kentec Jumper Wire Configuration

Audio input is provided by the microphone and preamplifier on a BOOSTXL-AUDIO Audio BoosterPack. Hardware configuration of the mic power and mic out connections is necessary based on the display used.
