SimpleLink SDK Voice Detection Plugin User's Guide

Table of Contents

Introduction

The SimpleLink SDK Voice Detection Plugin enables a user to perform voice activity detection (VAD) and speaker-dependent voice command detection (VCD). Easy-to-use APIs are provided that can enable voice applications and enhance existing applications with keyword recognition. The plugin is implemented in fixed-point C for ultra-low power. The VCD library is also configurable for each application, allowing the user to control or minimize the resources used. This plugin can be utilized in a variety of applications that require speaker-dependent keyword recognition or a hands-free user interface. The range of applications can be further expanded by using this plugin in conjunction with other SDK plugins, for example using the SimpleLink MSP432 SDK Bluetooth Plugin for keyword recognition in an IoT application.

Prior to using a speaker-dependent speech recognizer, the user must train the recognizer with each phrase to recognize by speaking the phrase several times. During training, the recognizer creates a model of each phrase to use during the recognition process. The performance of the recognizer is thus tied to the speaker that trained each phrase. If another speaker tries to use the recognizer, performance will likely degrade due to differences in the way the phrase is spoken.

Resources in addition to the plugin include an example project illustrating the use of the libraries, which is explained further in the Understanding the Voice Detection Example section.

Prerequisites

Understanding the Voice Detection Plugin

Folder Structure

The folder structure for the SimpleLink MSP432 SDK Voice Detection Plugin is made to complement the standard folder structure that the platform SDK adheres to. From the root directory, the following folders are available:

VCD Features

Library Features

The VCD Library software provides a comprehensive set of APIs for speaker-dependent speech recognition for a wide variety of embedded applications. VCD APIs allow an application to perform the following operations.

Performance Features

VCD contains features to balance performance with available processing resources.

VCD Concepts

Using VCD in an application typically involves four main steps:

1) Initializing the VCD instance

2) Enrolling (creating) phrase models

3) Training existing phrase models by updating the model parameters with additional data

4) Performing recognition

Initializing VCD

The application starts using VCD by initializing the recognizer. Initialization sets up the recognizer instance for use. It requires the user to provide the number of models, if any, that are stored in memory and a pointer array specifying the memory address of where model data is stored. Model memory will usually be non-volatile, such as flash or FRAM. If flash memory is used, the starting location must be on a flash sector boundary, and the maximum size of a model must be a multiple of the sector size.

Initialization also requires the user to provide the location and size of RAM that VCD can use for processing. This memory is divided into a small amount of memory that must persist during the use of VCD, and a larger amount of processing memory used only during model creation, update, and recognition. The application can reuse the processing memory when VCD is not performing one of these operations.

The example project includes code that illustrates and assists in initializing the models and properly setting up the RAM memory. This is explained in more detail in the Implementation Requirements section of this document.

Enrolling Phrase Models

The application must create models of the phrases to be recognized. It does this by prompting the user to speak a phrase and passing the speech data to the VCD API, which creates an initial model for the phrase. A block diagram of the process is shown below. Within VCD, the front-end processing converts the sampled audio data into a representation of the signal. When the audio activity detector locates speech, it triggers the creation of the model. A delay compensates for the time it takes the activity detector to respond. The created model is stored in model memory through an application-defined function.

This process must be repeated to create each phrase model to be recognized. The VCD library provides APIs to check the duration of a potential model and a quality measure that estimates how well the model will perform. An application can use these to decide whether to actually create the model, or to guide the user toward providing a better one.

Creating a Model

Updating Phrase Models

The application can update a specific model by prompting the user to identify the phrase to update and then to speak that phrase again. The speech data is passed to the VCD API to update the phrase model. The processing, as shown below, is similar to creating a model, except that the phrase model is retrieved from memory and merged with the new speech data to produce an improved model. The model can be updated any number of times to further improve it; typically at least two or three updates should be performed.

Updating a Model

Recognizing Phrases

The application uses VCD with the trained models to perform recognition of the phrases as shown below. The application does this by capturing audio data and passing it to the VCD API that searches for the presence of the trained phrases in the audio data. Upon locating one of the phrases in the audio data, VCD notifies the application. After notification, the application can take action based on the recognition result. It can also use VCD APIs to immediately continue the recognition search for another phrase.

Recognizing Speech

The VCD recognizer searches the incoming audio data for occurrences of one of the phrase models. The search engine block diagram below illustrates how VCD performs the search. The search starts when the audio activity detector finds a signal other than background noise. Due to detector delay, the audio signal at the beginning of the search contains a small amount of background noise. After the initial background noise, the search engine checks for an audio signal of any duration that does not match any of the model phrases, for example speech other than the model phrases. It then searches for speech that matches one of the model phrases. At the same time, as illustrated at the bottom of the figure, it also searches for an audio signal that does not contain any of the model phrases. The search engine also determines whether there is additional audio after one of the model phrases. The search terminates when the input signal returns to background noise or, depending on option settings, when it finds the end of a portion of the signal corresponding to one of the model phrases. The available search options are described in Run-time Options.

Search Engine Block Diagram

Using the VCD Library

As with any speech recognizer, the application designer must carefully consider many system-level issues to obtain the best performance. This section provides some details about the VCD Library and discusses practical steps to obtain good recognition performance. It also provides specific design requirements that an application must meet to use VCD.

Implementation Requirements

The VCD Library requires the user to control how models will be stored and managed in memory. The application must handle the memory management of models during VCD initialization, enrollment, update, and deletion. The demo project includes a source file (vcd_user.c) with implementations for these APIs using flash memory. For more detail on how vcd_user.c implements these functions, please read the Understanding the Voice Detection Example section.

For more detail on the specific requirements for the user-defined APIs, see the VCD API.

Initialization

The user must create an instance of and maintain a VCD_Handle object that holds VCD recognizer state data. All VCD operations must be performed through this handle.

A VCD_ConfigStruct must also be created and filled with model and memory information. For model data to persist between power cycles, the user must provide the configuration structure with the number of models that are stored in memory along with a list of addresses that point to where the model data is stored.

Pointers to the recognizer instance and the configuration structure must be passed to the VCD_init() function, which will initialize all parameters of the recognizer instance to default values. This must be the first function called when starting to use VCD.
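The initialization flow just described can be sketched as follows. The stand-in types, field names, and return value below are assumptions for illustration only; the real VCD_Handle, VCD_ConfigStruct, and VCD_init() declarations come from the plugin headers, and vcd_user.c in the example project shows a complete setup.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real VCD types; field names are assumptions. */
typedef struct { int ready; } VCD_Handle;
typedef struct {
    int_least16_t   numModels;       /* models already stored in memory */
    int_least16_t **modelAddresses;  /* where each stored model lives   */
} VCD_ConfigStruct;

/* Stand-in for the real VCD_init(): sets the instance to default values. */
static int VCD_init(VCD_Handle *vh, const VCD_ConfigStruct *cfg)
{
    if (vh == NULL || cfg == NULL || cfg->numModels < 0) {
        return -1;
    }
    vh->ready = 1;  /* the real init sets all recognizer defaults here */
    return 0;
}

int vcdSetupExample(void)
{
    /* Address list for stored models, e.g. flash sector addresses. */
    static int_least16_t *modelAddrs[4];
    VCD_ConfigStruct cfg = { 0, modelAddrs };  /* no models enrolled yet */
    VCD_Handle handle;                         /* holds recognizer state */
    return VCD_init(&handle, &cfg);            /* must be the first VCD call */
}
```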

Enrollment

The prototype for the function that VCD calls during model enrollment is:

VCD_MessageEnum VCD_writeModel( VCD_Handle vh, int_least16_t *ptrModel, int_least32_t mSize, int_least16_t **mAddress);

VCD_writeModel() will be called by the VCD Library during model enrollment to handle storing the model data in memory (usually non-volatile memory, such as flash or FRAM). VCD provides the function with ptrModel, a pointer to the beginning of the model data to be written, and mSize, the size of the model data. The user must return the memory address where the model was written in mAddress. The function should return a VCD_MessageEnum value reflecting the result of writing the model to memory.
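A minimal implementation of this callback might look like the sketch below. It stores models in a RAM array so the example is self-contained; the demo's vcd_user.c writes to flash instead, and the enum value names here are assumptions standing in for the real VCD_MessageEnum values.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real definitions come from the plugin headers. */
typedef void *VCD_Handle;
typedef enum { VCD_MsgNone, VCD_MsgFail } VCD_MessageEnum;

#define MAX_MODELS      4
#define MAX_MODEL_WORDS 256

/* Simulated non-volatile model storage (the demo uses flash sectors). */
static int_least16_t modelStore[MAX_MODELS][MAX_MODEL_WORDS];
static int numModelsStored = 0;

/* Called by the VCD Library during enrollment: store the model data
   and report back where it was written. */
VCD_MessageEnum VCD_writeModel(VCD_Handle vh, int_least16_t *ptrModel,
                               int_least32_t mSize, int_least16_t **mAddress)
{
    (void)vh;
    if (numModelsStored >= MAX_MODELS || mSize > MAX_MODEL_WORDS) {
        return VCD_MsgFail;
    }
    memcpy(modelStore[numModelsStored], ptrModel,
           (size_t)mSize * sizeof(int_least16_t));
    *mAddress = modelStore[numModelsStored];  /* tell VCD where the model lives */
    numModelsStored++;
    return VCD_MsgNone;
}
```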

Update

The prototype for the function that VCD calls during model update is:

VCD_MessageEnum VCD_updateModel(VCD_Handle vh, int_least16_t *ptrModel, int_least32_t mSize, int_least16_t *mAddress);

VCD_updateModel() will be called by the VCD Library during model update to handle storing the updated model data. VCD expects the user to overwrite the original model with the new updated model data so that the address of the model stays the same. VCD provides the function with ptrModel, a pointer to the beginning of the new model data to be written; mSize, the size of the model data; and mAddress, the address where the user stored the original model data. The function should return a VCD_MessageEnum value reflecting the result of updating the model in memory.

Delete

The prototype for the function that VCD calls for deleting a model is:

VCD_MessageEnum VCD_clearModel(int_least16_t *mAddress);

VCD_clearModel() will be called during model deletion to handle clearing the model data from memory. The VCD library will provide it the address of the model to delete, mAddress. The function should return a VCD_MessageEnum value depending on the results of clearing the model from memory.
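The deletion callback follows the same pattern. The sketch below clears a RAM buffer so it is self-contained; the demo's flash-based vcd_user.c would erase the model's flash sectors instead, and the enum value names are assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real VCD_MessageEnum. */
typedef enum { VCD_MsgNone, VCD_MsgFail } VCD_MessageEnum;

/* Maximum model size in int_least16_t words (illustrative value). */
#define MAX_MODEL_WORDS 256

/* Called during model deletion: clear the model data at mAddress. */
VCD_MessageEnum VCD_clearModel(int_least16_t *mAddress)
{
    if (mAddress == NULL) {
        return VCD_MsgFail;
    }
    memset(mAddress, 0, MAX_MODEL_WORDS * sizeof(int_least16_t));
    return VCD_MsgNone;
}
```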

Practical System Issues

Libraries Used

The Voice Detection plugin includes and uses a separate Voice Activity Detection (VAD) library. The VAD library can be used as a stand-alone library to detect changes in audio signal levels. API documentation covers both of these libraries.

The VCD Library utilizes the Driver Library (DriverLib). This is included in the SimpleLink MSP432 SDK.

The VCD demo project illustrates how to use these libraries when building an application. The demo also includes source files (vcd_user.c and vcd_user.h) that help an application easily set up and specify the RAM allocated to the recognizer.

Power Consumption

VCD relies on the application to capture audio samples, typically from an ADC peripheral, and pass them to VCD APIs during model creation, update, and recognition. This data capture can represent a large portion of the power consumption. If an application does not require VCD processing during part of its operation, then it can reduce power by stopping audio data capture during that time. This is often done in a “push-to-talk” scenario, requiring the user to indicate when recognition should be performed.

VCD also reduces processor power consumption by using audio activity detection to limit processing when only background noise is present.

Estimates of current memory and power requirements of the example demo application can be found in the Release Notes.

Recognition Performance

Speech recognition in real-world environments is always a challenge. While the VCD recognizer implements several algorithms to address these issues, there are many practical steps to take that can improve performance.

Good recognition performance depends on the quality of the speech signal. Prior to using VCD, ensure that the captured audio data is not corrupted by interference or electrical noise, and that the audio signal is not clipped or otherwise distorted. The gain of the audio channel should be set so that the loudest expected speech signal rarely causes clipping. Introducing automatic gain control is not recommended.

Users should be instructed to speak clearly at a comfortable speaking rate.

Phrase model creation and training should be done in a typical, fairly quiet environment.

Sometimes a user may find that the recognizer has difficulty recognizing a specific phrase. Often this can be improved by additional training of the phrase. If this does not correct the issue, the application should use the VCD APIs to allow the user to delete the model for that phrase, and re-create and re-train the model.

Speech recognizers typically have difficulty with confusion between similar sounding or rhyming words. To improve performance, the recognition phrases should be as distinct as possible. If possible, encourage use of phrases that contain several syllables with different sounding vowels. Longer phrases will work better. VCD checks during model creation to make sure a phrase is not too short, and provides a quality measure to indicate the likelihood of good performance of a phrase. The minimum length of a valid phrase is an adjustable parameter, which an application can increase. The maximum length of a phrase is set during the application build process, and depends on the amount of memory allocated to hold model data. If a phrase is provided during model creation that has a marginal quality measure, the user should be encouraged to try enunciating the phrase more carefully, or to choose a different phrase that provides a better quality measure.

Audio Data Capture

Audio data must be captured at an 8kHz sampling rate. The sampled data must be in 2’s complement int_least16_t format. VCD expects an array of captured data to be provided every 20 milliseconds (160 samples in the array). Typically this capture is implemented using an ADC peripheral and a DMA channel. The example application code contains an audio collection API that illustrates data capture.
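The required capture format can be expressed in a few constants, with a ping-pong buffer as one possible way to hand a completed frame to VCD while the DMA fills the next one. The buffering scheme below is an illustration; the example project's audio collection API shows a concrete implementation.

```c
#include <stdint.h>

/* Audio format required by VCD: 8 kHz sampling, 2's complement
   int_least16_t samples, delivered in 20 ms frames. */
#define VCD_SAMPLE_RATE_HZ  8000
#define VCD_FRAME_MS        20
#define VCD_FRAME_SAMPLES   ((VCD_SAMPLE_RATE_HZ / 1000) * VCD_FRAME_MS)  /* 160 */

/* Ping-pong buffers: the DMA fills one frame while the application
   hands the other to the VCD APIs. */
static int_least16_t frameBuf[2][VCD_FRAME_SAMPLES];
static volatile int fillIndex = 0;  /* buffer the DMA is currently filling */

/* Returns the frame that is ready for VCD processing. */
int_least16_t *vcd_readyFrame(void)
{
    return frameBuf[1 - fillIndex];
}
```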

Model Data Storage Requirements

Phrase model data is stored in units of int_least16_t words. Each model requires overhead of 6 words plus the number of words needed to store the model name. The model name is a string of up to 16 ASCII characters including the trailing NULL. The model phrase data requires 16 words per 20 milliseconds of speech in the phrase. The application is responsible for providing the memory to hold the model data, as well as information regarding the maximum number of models and maximum model size that VCD should support.

If models are stored in flash, each model must start on a sector boundary and each model consumes a multiple of the sector size. The number of sectors needed for each model depends on the maximum size of a phrase model.
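Using the figures above (6 overhead words, the words holding the name, and 16 words per 20 milliseconds of speech), storage needs can be estimated as in the sketch below. How the name string is packed into 16-bit words is not specified here, so it is left as an input; the 4096-byte sector size used in the test is an assumption to check against the device datasheet.

```c
#include <stdint.h>

/* Estimate model storage in int_least16_t words: 6 overhead words,
   plus the words holding the (up to 16-char) name, plus 16 words per
   20 ms of speech in the phrase. */
int_least32_t modelWords(int_least32_t nameWords, int_least32_t phraseMs)
{
    return 6 + nameWords + 16 * (phraseMs / 20);
}

/* Round a maximum model size in bytes up to whole flash sectors, since
   each model must start on a sector boundary and consume whole sectors. */
int_least32_t modelSectors(int_least32_t maxModelBytes, int_least32_t sectorBytes)
{
    return (maxModelBytes + sectorBytes - 1) / sectorBytes;
}
```

For example, a 1-second phrase whose name occupies 8 words needs 6 + 8 + 16 * 50 = 814 words (1628 bytes), which fits in one 4096-byte sector.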

The example application provided with the VCD Library includes header and source files (vcd_user.h and vcd_user.c) that can be used to configure the number of models, model size, and allocate model memory.

Processing Memory Requirements

During operation, VCD requires some processing memory which is dependent on the number and size of models. The application is responsible for providing this memory. The memory is divided into two types. Persistent processing memory must not be altered outside of the VCD Library function calls as long as the library is being used. Temporary memory is only needed while the application is performing enrollment, update, or recognition. Otherwise the application may reuse the temporary memory. The amount of persistent memory required is fairly small, but the temporary memory can become large depending on the size and number of models. The release notes describe the processing memory requirements.

The example application provided with the VCD Library includes header and source files (vcd_user.h and vcd_user.c) that can be used to easily configure and allocate the processing memory.

Model Naming and Indices

VCD requires each model to be labeled with a unique C string name when created. The maximum length of a name is 16 characters including the trailing NULL. Names are case sensitive. The model name “Filler” is reserved for a unique model.

VCD assigns each model an index number. The unique “Filler” model is always index 0. This model cannot be deleted. All other models always have sequential indices starting at 1. This means that the index associated with a model may change. For example, if a model is deleted all indices will be adjusted to remain sequential. Indices are typically used to enumerate all existing models while model names are used to access a particular model.
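This index bookkeeping can be illustrated with a small sketch (pure illustration, not plugin code): index 0 is reserved for “Filler” and cannot be deleted, and deleting any other model shifts later models down so that indices stay sequential.

```c
#include <string.h>

#define MAX_MODELS 8

/* Illustrative model list: "Filler" is always index 0. */
static const char *names[MAX_MODELS] = { "Filler", "Lights", "Fan", "Door" };
static int count = 4;

/* Delete a model by index; later models shift down to keep indices
   sequential, so an index may refer to a different model afterwards. */
void deleteModel(int idx)
{
    if (idx <= 0 || idx >= count) {
        return;  /* index 0 ("Filler") cannot be deleted */
    }
    for (int i = idx; i < count - 1; i++) {
        names[i] = names[i + 1];
    }
    count--;
}
```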

Number of Phrase Models Supported

The number of phrase models that VCD can support is limited by the available memory for models and processing, as well as the platform processing power. The processing load is a complex function of the size and similarity of the phrase models, the environment in which the recognition takes place, how closely the input signal matches the models, and the settings of run-time parameters. Some experimentation usually must be performed to ensure that the processing can be accomplished in real-time.

Run-time Options

VCD allows setting several options at run-time to configure the recognizer for a particular application and to tune the recognizer to balance recognition performance with processing capabilities. All parameters are initially set to defaults, which should be acceptable in the majority of applications. Altering most of the parameters requires an understanding of speech recognition and in particular the VCD processing algorithms. However, some of the parameters which may be of interest to most application designers are described in this section. For further information on all options, see the VCD API guide.

minModelStates (default: 20, i.e. 400 milliseconds): Sets the minimum duration of a phrase model in 20 millisecond units. VCD will not create a model if the phrase length is shorter than this parameter.

maxModelStates (default: 125, i.e. 2500 milliseconds): Sets the maximum duration of a phrase model in 20 millisecond units, which can be used to limit memory storage for phrase models. It must not be greater than the value set when configuring and allocating phrase model memory.

waitCommit (default: False): If set, VCD will wait until it detects that speech has finished before it outputs a recognition result. Otherwise, VCD outputs a recognition result as soon as a model phrase matching the input speech is located during the recognition search.

keyword (default: N/A): A pointer to the name of a model that should be used as a special “keyword”. If this parameter points to a valid phrase model name, and waitCommit is not set, then VCD treats this model specially during the recognition search. A match to the keyword model is the only recognition result that will be output as soon as it is located while the recognizer is still searching for other models. This can be valuable when VCD is used in a keyword spotting application. For example, if the keyword phrase is “Hello My Application”, and another model phrase is “Hello”, then VCD will not immediately terminate the search after “Hello” is located, but will continue until it locates “Hello My Application”, or only one other model has matched the input speech, or speech has ended.

iMaxDelta (default: 4000): While the implementation of this parameter is complex, it essentially adjusts a trade-off between the amount of processing used during the recognition search and recognition accuracy. The best value depends on the application and phrase models, so some experimentation is necessary. Values larger than the default reduce processor load but can decrease recognition accuracy; smaller values increase processor load but may improve accuracy.

vadAlpha (default: 31128): A parameter of the VAD library that adjusts sensitivity to the audio signal. It is a positive number no greater than 32767. Setting this parameter to a larger value requires a speech signal to be consistently above the background noise level for a longer period of time in order to start the recognizer search.

vadSigSNRThresh (default: 1532): A parameter of the VAD library that adjusts sensitivity to the audio signal. It is a positive number no greater than 32767. Setting this parameter to a larger value requires a speech signal to be further above the background noise level in order to start the recognizer search.
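As an illustration, the defaults above can be collected into a parameter structure like the one below. The structure and field grouping are a sketch only; the actual mechanism for reading and setting options is defined by the VCD API guide and may differ.

```c
#include <stdint.h>
#include <stddef.h>

/* Run-time option defaults gathered into a struct for illustration. */
typedef struct {
    int_least16_t  minModelStates;   /* 20  -> 400 ms minimum phrase    */
    int_least16_t  maxModelStates;   /* 125 -> 2500 ms maximum phrase   */
    int            waitCommit;       /* 0: report a match immediately   */
    const char    *keyword;          /* NULL: no special keyword model  */
    int_least16_t  iMaxDelta;        /* 4000: search-effort trade-off   */
    int_least16_t  vadAlpha;         /* 31128: VAD time sensitivity     */
    int_least16_t  vadSigSNRThresh;  /* 1532: VAD level sensitivity     */
} VcdRuntimeOptions;

static const VcdRuntimeOptions kDefaults = {
    20, 125, 0, NULL, 4000, 31128, 1532
};
```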

Timeouts

During model creation VCD will timeout after four seconds if no speech is found.

During model update VCD will report an update failure if it is taking too long to complete enrollment. This length of time is the length of the model plus four seconds. Generally this will occur if speech is not found within the first four seconds.

During recognition processing, if the search engine runs for ten seconds without finding any recognition result, it will be forced to commit a result as a type of watchdog timeout. The search engine can then be restarted to continue recognition if desired.

The Voice Detection Example

The Voice Detection Plugin includes a complete example program that illustrates the use of the VCD and VAD libraries through a menu-driven application that can enroll, update, and delete models, and can recognize speech in a continuous mode. This section describes the hardware and software operation of the example.

Example Software

It is assumed that the SimpleLink MSP432 SDK has already been installed in a directory we will refer to as MSP432_SDK_INSTALL_DIR. If you have not already done so, refer to the SimpleLink MSP432 SDK User’s Guide for instructions.

In addition to the libraries used by the plugin, the VCD example application utilizes the Graphics Library for creating graphical interfaces on the display. The Graphics Library is already included with the MSP432 SDK and the files that are needed are included in the example applications, so no further files need to be downloaded.

Example Hardware

The example code runs on an MSP432P401R LaunchPad.

The example can use one of two displays (the example code must be built separately for each display type):

1) 430BOOST-SHARP96 Sharp Memory LCD BoosterPack. The user interface menus are shown on the display, and the two user buttons on the LaunchPad are used for navigating the menus and selecting menu actions. The hardware configuration is shown below:

Sharp Stack Configuration

2) BOOSTXL-K350QVG-S1 Kentec QVGA Display BoosterPack. The user interface menus may be navigated and selected either by touch or using the two LaunchPad user buttons. The hardware configuration is shown below:

Kentec Stack Configuration

The following steps must be taken to use the Kentec display and touch control:

Kentec Jumper Wire Configuration

Audio input is provided by the microphone and preamplifier on a BOOSTXL-AUDIO Audio BoosterPack. Hardware configuration of the mic power and mic out connections is necessary based on the display used.
