I have created a stand-alone Matlab demo of the noise-robust Automatic Speech Recognition (ASR) techniques I worked on over the past few years. All these techniques rely on finding a sparse linear combination of noise-free speech exemplars, which is then used either to make an estimate of the clean speech or to perform exemplar-based ASR. For an overview of the methods and the relevant background, my thesis [1] is a good starting point. The demo works on a noise-robust digit recognition task, AURORA-2 [2].

Download

You can grab the full archive here (81 MB; includes the exemplar dictionary and example noisy speech files), or an archive with just the Matlab code here. To get started quickly, simply execute the top-level Matlab script sparse_ASR.m; see the Getting started section below for details.

UPDATE (September 2014): A version of the Matlab code package that also includes feature extraction for a user-provided sentence is now available (use it in conjunction with the full version, which contains the exemplar dictionary). Grab it here.

The techniques implemented in the demo
  1. Sparse Imputation (SI) [3,4]. Uses a missing data mask to find a linear combination of clean speech exemplars based only on the reliable (approximately noise-free) features of the noisy speech.
  2. Feature Enhancement (FE) [5]. Takes a source separation approach by decomposing the noisy speech into a linear combination of speech and noise exemplars (see the sketch after this list).
  3. Sparse Classification (SC) [5]. Like FE, but associates each speech exemplar with HMM-state labels and uses the exemplar activations directly as evidence for the underlying states.
  4. Hybrid SC/FE (SCFE) [6]. Combines SC and FE at the state posterior level using the product rule, with a conventional GMM generating the FE posteriors.
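
To make the shared machinery concrete, below is a minimal sketch of the kind of sparsity-penalised non-negative decomposition that underlies FE and SC: multiplicative updates minimising a generalised Kullback-Leibler divergence with an L1 penalty on the activations. All dimensions, values and variable names are made up for illustration; this is not the demo's actual implementation.

  % Minimal sketch of the sparse decomposition behind FE/SC
  % (illustrative only; dimensions and values are made up).
  B = 23; T = 10;                    % Mel bands, window length in frames
  ns = 50; nn = 20;                  % number of speech / noise exemplars
  A_speech = rand(B*T, ns);          % stacked speech exemplar windows
  A_noise  = rand(B*T, nn);          % stacked noise exemplar windows
  A = [A_speech, A_noise];           % dictionary, one exemplar per column
  y = rand(B*T, 1);                  % stacked noisy observation window
  lambda = [1.5*ones(ns,1); 1.0*ones(nn,1)];  % sparsity penalties (arbitrary)
  x = ones(ns+nn, 1);                % non-negative activations
  for iter = 1:100                   % multiplicative KL-divergence updates
    x = x .* (A' * (y ./ (A*x + eps))) ./ (A' * ones(B*T,1) + lambda);
  end
  speech_est = A_speech * x(1:ns);   % FE: estimate of the clean speech
  % SC instead maps the speech activations x(1:ns) to HMM-state
  % evidence via the state labels associated with each exemplar.

The demo applies the same idea to stacked windows of Mel-scale exemplar features sliding over the utterance.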

What is included in the demo
  • A few AURORA-2 example files (only the extracted Mel features, not the original data).
  • A speech and noise exemplar dictionary, created using the multi-condition AURORA-2 training set.
  • A simple Matlab implementation of a conventional GMM-HMM speech recognizer for use with SI/FE/SCFE. It is a straightforward implementation of a word-model based, 16-state-per-word HMM with a 32-mixture GMM per state, operating on per-file mean/variance normalized MFCC features (a sketch of this normalization follows this list). Two acoustic models are included: one trained on the clean speech training set of AURORA-2 and one trained on the multi-condition training set.
  • Visualisations of clean, noisy, noise-only and enhanced spectrograms, with optional mean (or mean&variance) normalization to better gauge the effect on a speech recognizer employing these normalizations. Additionally, visualisations of the state posteriorgrams obtained with SC/GMM.
  • GPU acceleration of FE/SC/SCFE using GPUmat, as described in [7], provided that 1) a suitable GPU is available, 2) GPUmat is installed, and 3) the flag nmf.usegpu in the demo code is set to one.
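
For reference, the per-file mean/variance normalization mentioned above simply standardises each feature dimension over the frames of a single file. A minimal sketch (the demo's own implementation may differ in detail):

  % Sketch of per-file mean/variance normalization of an MFCC matrix
  % (coefficients x frames); illustrative, not the demo's exact code.
  mfcc = randn(13, 200);                    % stand-in for real features
  mu    = mean(mfcc, 2);                    % per-coefficient mean
  sigma = std(mfcc, 0, 2);                  % per-coefficient std. dev.
  mfcc_mvn = bsxfun(@rdivide, bsxfun(@minus, mfcc, mu), sigma + eps);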

What is NOT included in the demo
  • Mel-feature extraction. The provided dictionary and example files are log-Mel scaled spectrogram features; to process your own files you can use, for example, voicebox or HTK. UPDATE: Feature extraction is included in the separate September 2014 version listed above.
  • Missing data mask estimation. SI relies on a missing data mask that indicates which noisy speech features remain (approximately) uncorrupted by noise. The demo uses the so-called oracle mask, which is created from knowledge of the underlying clean speech and noise (a sketch follows this list). For more realistic mask estimation methods, see [1,3,4,8,9] and the references therein.
  • More recent advances such as artificial noise exemplars [5], noise sniffing [6], convolutive NMF [10], hierarchical exemplar dictionaries [11], observation uncertainties [12,13] or soft missing data masks [14].
  • Speech enhancement. Although simple to implement [15], the current demo performs only feature enhancement, mainly because I opted to include extracted log-Mel spectrogram features rather than raw waveforms. For samples of speech enhancement with this framework, see Tuomas Virtanen's website.
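
As an illustration of the oracle mask: since the clean speech and noise underlying each noisy file are known, a feature can be labelled reliable whenever the speech locally dominates the noise by some margin. A minimal sketch under that assumption (the demo's exact criterion and threshold may differ):

  % Sketch of an oracle missing-data mask (illustrative). S and N stand
  % in for the log-Mel spectrograms of the underlying clean speech and
  % noise (bands x frames).
  B = 23; T = 100;
  S = randn(B, T); N = randn(B, T);  % placeholders for real features
  mask_thr = 0;                      % hypothetical local SNR threshold
  mask = (S - N) >= mask_thr;        % true = reliable, false = corrupted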

Getting started

To get started quickly, simply execute the top-level Matlab script sparse_ASR.m. This will execute all methods in turn on 4 SNR conditions: clean speech, 5 dB, 0 dB and -5 dB. You can vary the noise type and test sets by modifying the setlist and condlist variables.

You may then want to examine the effect of various parameters, such as the number of speech and noise exemplars in the dictionary (numspeechexemplars and numnoiseexemplars), the sliding window size (windowlength) and the window shift (Delta). Interesting method-specific parameters are:
  • For SI, the mask threshold that governs the balance between features labelled as uncorrupted and corrupted by noise (si.mask_thr), the minimum number of uncorrupted features in a window (si.num_reliables_thr) and the number of iterations used to obtain a sparse representation (si.numiterations).
  • For SC/FE/SCFE, the speech and noise sparsity regularisation (nmf.sparsity_speech and nmf.sparsity_noise) and the number of iterations used to obtain a sparse representation (nmf.numiterations).
  • For SCFE, the stream weights that govern the balance between the SC and FE-GMM streams (scfe.streamweight.sc and scfe.streamweight.gmm).
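
By way of illustration only (the parameter names are the demo's, but the values below are arbitrary placeholders, not recommended defaults):

  % Illustrative settings; the names come from the demo, the values
  % are arbitrary placeholders rather than recommended defaults.
  numspeechexemplars = 4000;     % speech exemplars in the dictionary
  numnoiseexemplars  = 4000;     % noise exemplars in the dictionary
  windowlength       = 10;       % sliding window size (frames)
  Delta              = 1;        % window shift (frames)
  si.mask_thr             = -9;  % SI: mask threshold
  si.num_reliables_thr    = 10;  % SI: min. reliable features per window
  si.numiterations        = 30;  % SI: sparse representation iterations
  nmf.sparsity_speech     = 1.5; % FE/SC: speech sparsity penalty
  nmf.sparsity_noise      = 1.0; % FE/SC: noise sparsity penalty
  nmf.numiterations       = 300; % FE/SC: multiplicative update iterations
  scfe.streamweight.sc    = 0.5; % SCFE: weight of the SC stream
  scfe.streamweight.gmm   = 0.5; % SCFE: weight of the FE-GMM stream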

Another interesting aspect is the impact of the acoustic model that is used to evaluate the clean speech estimates. With the variable gmm_modeltype you can choose between a model trained on clean speech and a model trained on noisy speech (several noise conditions down to 5 dB). Finally, it may be worthwhile to compare the different visualization options governed by visualisation.meanvar: when set to 1 or 2, the spectrograms are mean or mean&variance normalized, respectively, which gives a more accurate picture of how the subsequent speech recognizer (which employs mean&variance normalization) interprets the features.
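
Purely as an illustration (the variable names come from the demo, but the values shown, including the strings accepted by gmm_modeltype, are assumptions on my part):

  % Illustrative settings; the accepted values for gmm_modeltype are an
  % assumption -- consult the demo code for the actual options.
  gmm_modeltype = 'clean';      % hypothetical: clean-speech acoustic model
  visualisation.meanvar = 2;    % 1 = mean, 2 = mean&variance normalization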

Related work

The Technical University of Munich has released openBliSSART, a C++ toolbox for NMF-based blind source separation. On his website, Emanuel Vincent also offers a number of software tools for performing and evaluating source separation. The SMALL project has released a toolbox for audio inpainting, a field that has some similarities with missing data imputation for ASR.

Licensing

The demo Matlab code is released under the GNU General Public License. The included Matlab files from other authors fall under the licenses indicated in their respective headers.

Acknowledgements

I thank my collaborators Tuomas Virtanen, Antti Hurmalainen and Hugo Van hamme for their contributions to the code. I also acknowledge the authors of the included Matlab files solvelasso.m (from SparseLab), strdist.m and hmmViterbiC.m (from PMTK3).

References

[1] Noise robust ASR: Missing data techniques and beyond (Jort F. Gemmeke). PhD thesis, Radboud Universiteit Nijmegen, The Netherlands, 2011. [bib] [pdf]
[2] The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions (H. G. Hirsch, D. Pearce). In Proceedings of the ISCA workshop ASR2000, Paris, France, 2000. [pdf]
[3] Compressive Sensing for Missing Data Imputation in Noise Robust Speech Recognition (Jort F. Gemmeke, Hugo Van hamme, Bert Cranen, Lou Boves). In IEEE Journal of Selected Topics in Signal Processing, volume 4, 2010. [pdf] [bib]
[4] Sparse imputation for large vocabulary noise robust ASR (Jort F. Gemmeke, Bert Cranen, Ulpu Remes). In Computer Speech & Language, volume 25, 2011. [pdf] [bib] [doi]
[5] Exemplar-based sparse representations for noise robust automatic speech recognition (Jort F. Gemmeke, Tuomas Virtanen, Antti Hurmalainen). In IEEE Transactions on Audio, Speech and Language Processing, volume 19, 2011. [pdf] [bib] [doi]
[6] Advances in noise robust digit recognition using hybrid exemplar-based techniques (Jort F. Gemmeke, Hugo Van hamme). In Proc. INTERSPEECH, 2012. [pdf] [bib]
[7] Toward a practical implementation of exemplar-based noise robust ASR (Jort F. Gemmeke, Antti Hurmalainen, Tuomas Virtanen, Yang Sun). In Proc. EUSIPCO, 2011. [pdf] [bib]
[8] Automatic speech recognition using missing data techniques: Handling of real-world data (Jort F. Gemmeke, Maarten Van Segbroeck, Yujun Wang, Bert Cranen, Hugo Van hamme). Chapter in Robust Speech Recognition of Uncertain or Missing Data (Dorothea Kolossa, Reinhold Haeb-Umbach, eds.), Springer Verlag, 2011. [pdf] [bib]
[9] Mask estimation and imputation methods for missing data speech recognition in a multisource reverberant environment (Sami Keronen, Heikki Kallasjoki, Ulpu Remes, Guy J. Brown, Jort F. Gemmeke, Kalle J. Palomäki). In Computer Speech & Language, 2012. [pdf] [bib]
[10] Non-negative matrix deconvolution in noise robust speech recognition (Antti Hurmalainen, Jort F. Gemmeke, Tuomas Virtanen). In Proc. International Conference on Acoustics, Speech and Signal Processing, 2011. [pdf] [bib]
[11] An hierarchical exemplar-based sparse model of speech, with an application to ASR (Jort F. Gemmeke, Hugo Van hamme). In Proc. Automatic Speech Recognition and Understanding Workshop, 2011. [pdf] [bib]
[12] Observation uncertainty measures for sparse imputation (Jort F. Gemmeke, Ulpu Remes, Kalle J. Palomäki). In Proc. INTERSPEECH, 2010. [pdf] [bib]
[13] Uncertainty measures for improving exemplar-based source separation (Heikki Kallasjoki, Ulpu Remes, Jort F. Gemmeke, Tuomas Virtanen, Kalle J. Palomäki). In Proc. INTERSPEECH, 2011. [pdf] [bib]
[14] Sparse Imputation for noise robust speech recognition using soft masks (Jort F. Gemmeke, Bert Cranen). In Proc. International Conference on Acoustics, Speech and Signal Processing, 2009. [pdf] [bib]
[15] Exemplar-based speech enhancement and its application to noise-robust automatic speech recognition (Jort F. Gemmeke, Tuomas Virtanen, Antti Hurmalainen). In Proc. International Workshop on Machine Listening in Multisource Environments, 2011. [pdf] [bib]