Speech Recognizer Project Users Guide  v1.00.00.00
Using MinHMM

As with any speech recognizer, the application designer must carefully consider many system-level issues to obtain the best performance. This chapter provides details about the MinHMM Library and discusses practical steps for achieving good recognition performance. It also lists specific design requirements that an application must meet to use MinHMM.

Practical System Issues

Libraries Used

The MinHMM Library uses the Driverlib Library. The MinHMMDemo example application also uses the Graphics Library.

The MinHMM Library project includes and uses a separate audio activity detection library. The detection library can be used as a stand-alone library to detect changes in audio signal levels. API documentation covers both of these libraries.

The MinHMMDemo example application project illustrates how to use these libraries when building an application. The example also includes source files that help an application set up and specify the RAM allocated to the recognizer.

Power Consumption

MinHMM relies on the application to capture audio samples, typically from an ADC peripheral, and pass them to MinHMM APIs during model creation, update, and recognition. This data capture can represent a large portion of the power consumption. If an application does not require MinHMM processing during part of its operation, then it can reduce power by stopping audio data capture during that time. This is often done in a "push-to-talk" scenario, requiring the user to indicate when recognition should be performed.

MinHMM also reduces processor power consumption by using audio activity detection to limit processing when only background noise is present.

Estimates of current memory and power requirements of the example demo application can be found in the release notes.

Recognition Performance

Speech recognition in real-world environments is always a challenge. While the MinHMM recognizer implements several algorithms to address these issues, there are many practical steps to take that can improve performance.

Good recognition performance depends on the quality of the speech signal. Prior to using MinHMM, ensure that the captured audio data is not corrupted by interference or electrical noise, and that the audio signal is not clipped or otherwise distorted. The gain of the audio channel should be set to a level at which the loudest expected speech signal rarely causes clipping. Introducing automatic gain control is not recommended.

Users should be instructed to speak clearly at a comfortable speaking rate.

Phrase model creation and training should be done in a typical, fairly quiet environment.

Sometimes a user may find that the recognizer has difficulty recognizing a specific phrase. Often this can be improved by additional training of the phrase. If this does not correct the issue, the application should use the MinHMM APIs to allow the user to delete the model for that phrase, and re-create and re-train the model.

Speech recognizers typically have difficulty distinguishing similar-sounding or rhyming words. To improve performance, the recognition phrases should be as distinct as possible. If possible, encourage use of phrases that contain several syllables with different-sounding vowels. Longer phrases generally work better. During model creation, MinHMM checks that a phrase is not too short, and provides a quality measure to indicate the likelihood of good performance for the phrase. The minimum length of a valid phrase is an adjustable parameter, which an application can increase. The maximum length of a phrase is set during the application build process, and depends on the amount of memory allocated to hold model data. If a phrase provided during model creation has a marginal quality measure, the user should be encouraged to enunciate the phrase more carefully, or to choose a different phrase that produces a better quality measure.

Usage Details

This section provides important details the application designer must know when using the MinHMM Library.

Audio Data Capture

Audio data must be captured at an 8 kHz sampling rate. The sampled data must be in 2's complement int_least16_t format. MinHMM expects an array of captured data to be provided every 20 milliseconds (160 samples per array). Typically this capture is implemented using an ADC peripheral and a DMA channel. The example application code contains an audio collection API that illustrates data capture.

Model Data Storage Requirements

Phrase model data is stored in units of int_least16_t words. Each model requires overhead of 5 words plus the number of words needed to store the model name. The model name is a string of up to 16 ASCII characters including the trailing NULL. The model phrase data requires 16 words per 20 milliseconds of speech in the phrase. The application is responsible for providing the memory to hold the model data, as well as information regarding the maximum number of models and maximum model size that MinHMM should support.

If models are stored in flash, each model must start on a sector boundary and each model consumes a multiple of the sector size. The number of sectors needed for each model depends on the maximum size of a phrase model.

The example application provided with the MinHMM Library includes header and source files (minhmm_user.h and minhmm_user.c) that can be used to configure the number of models, model size, and allocate model memory.

Processing Memory Requirements

During operation, MinHMM requires some processing memory which is dependent on the number and size of models. The application is responsible for providing this memory. The memory is divided into two types. Persistent processing memory must not be altered outside of the MinHMM Library function calls as long as the library is being used. Temporary memory is only needed while the application is performing enrollment, update, or recognition. Otherwise the application may reuse the temporary memory. The amount of persistent memory required is fairly small, but the temporary memory can become large depending on the size and number of models. The release notes describe the processing memory requirements.

The example application provided with the MinHMM Library includes header and source files (minhmm_user.h and minhmm_user.c) that can be used to easily configure and allocate the processing memory.

Model Naming and Indices

MinHMM requires each model to be labeled with a unique C name string when created. The maximum length of a name string is 16 characters including a trailing NULL. Names are case sensitive. The model name "Filler" is reserved for a unique model.

MinHMM assigns each model an index number. The unique "Filler" model is always index 0. This model cannot be deleted. All other models always have sequential indices starting at 1. This means that the index associated with a model may change. For example, if a model is deleted all indices will be adjusted to remain sequential. Indices are typically used to enumerate all existing models while model names are used to access a particular model.

Number of Phrase Models Supported

The number of phrase models that MinHMM can support is limited by the available memory for models and processing, as well as the platform processing power. The processing load is a complex function of the size and similarity of the phrase models, the environment in which the recognition takes place, how closely the input signal matches the models, and the settings of run-time parameters. Some experimentation usually must be performed to ensure that the processing can be accomplished in real-time.

Run-time Options

MinHMM allows setting several options at run-time to configure the recognizer for a particular application and to tune the recognizer to balance recognition performance with processing capabilities. All parameters are initially set to defaults, which should be acceptable in the majority of applications. Altering most of the parameters requires an understanding of speech recognition and in particular the MinHMM processing algorithms. However, some of the parameters which may be of interest to most application designers are described in this section. For further information on all options, see the MinHMM API guide.

  • minModelStates

    This parameter sets the minimum duration of a phrase model in 20 millisecond units. MinHMM will not create a model if the phrase length is shorter than this parameter. The default is 20 (400 milliseconds).

  • maxModelStates

    This parameter sets the maximum duration of a phrase model in 20 millisecond units, which can be used to limit memory storage for phrase models. It must not be greater than the value set when configuring and allocating phrase model memory.

  • waitCommit

    This parameter is a boolean value. If set, MinHMM will wait until it detects that speech has finished before it outputs a recognition result. Otherwise, MinHMM will output a recognition result as soon as a model phrase is located that matches the input speech during recognition search.

  • keyword

    This parameter is a pointer to the name of a model that should be used as a special "keyword". If this parameter points to a valid phrase model name, and waitCommit is not set, then MinHMM treats this model specially during recognition search. A match to the keyword model is the only recognition result that will be output as soon as located if the recognizer is still searching for other models. This can be valuable when MinHMM is used in a keyword spotting application. For example, if the keyword phrase is "Hello My Application", and another model phrase is "Hello", then MinHMM will not immediately terminate search after "Hello" is located, but will continue the search until it locates "Hello My Application", or only one other model has matched the input speech, or speech has ended. By default keyword does not point to any valid model.

  • iMaxDelta

    While the implementation of this parameter is complex, it is basically used to adjust a trade-off between amount of processing used during recognition search and recognition accuracy performance. The best value to use depends on the application and phrase models, so some experimentation is necessary. It is a positive number set to 4000 by default. Larger values than the default reduce processor load but can also decrease recognition accuracy. Smaller values increase processor load but may improve performance.

  • vadAlpha

    This is a parameter of the voice activity detector library that adjusts sensitivity to the audio signal. It is a positive number no greater than 1. By default it is set to 31128. Setting this parameter to a larger value will require a speech signal to be consistently above the background noise level for a longer period of time in order to start the recognizer search.

  • vadSigSNRThresh

    This is a parameter of the voice activity detector library that adjusts sensitivity to the audio signal. It is a positive number no greater than 1. By default it is set to 1532. Setting this parameter to a larger value will require a speech signal to be further above the background noise level in order to start the recognizer search.

Timeouts

During model creation MinHMM will timeout after four seconds if no speech is found.

During model update MinHMM will report an update failure if it is taking too long to complete enrollment. This length of time is the length of the model plus four seconds. Generally this will occur if speech is not found within the first four seconds.

During recognition processing, if the search engine runs for ten seconds without finding any recognition result, it will be forced to commit a result as a type of watchdog timeout. The search engine can then be restarted to continue recognition if desired.