Speech recognition

"…whoever has spoken…into the mouthpiece of the phonograph, and whose words are recorded by it, has the assurance that his speech may be reproduced audibly in his own tones long after he himself has tuned to dust… Speech has become, as it were, immortal."

-Scientific American, Nov. 17, 1877 (Early Voices, Death by Gramophone)

Word cloud generated from research papers on speech signal processing, Machine Intelligence Lab

MILAB_intro_220428_short.pdf

Introduction to audio and speech signal processing, Machine Intelligence Lab

Presented at BK Colloquium, School of Electronics, KNU, April 28, 2022.

How do humans do it?

Articulation produces sound waves which the ear conveys to the brain for processing

Automatic speech recognition aims at getting a machine to understand spoken language. By "understand" we might mean "react appropriately." Several prerequisites must be met to achieve automatic speech recognition, such as:

  • Digital signal processing: extract acoustic features from speech (see the sketch after this list)

  • Machine learning: convert acoustic features to symbols

  • Language processing: turn the recognized symbols into understanding
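
To make the first step concrete, here is a minimal feature-extraction sketch in Python. It assumes librosa is installed and a hypothetical 16 kHz mono recording named utterance.wav; 13 MFCCs with 25 ms frames and a 10 ms hop are typical choices, not values prescribed by this page.

    import librosa

    # Load a hypothetical 16 kHz mono recording (librosa resamples if needed).
    y, sr = librosa.load("utterance.wav", sr=16000)

    # 13 MFCCs per frame: 25 ms window (400 samples), 10 ms hop (160 samples).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

    # First- and second-order derivatives are commonly appended as extra features.
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    print(mfcc.shape, delta.shape, delta2.shape)   # (13, n_frames) each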

Research topics:

  • Speech recognition under noisy conditions

  • Speaker model adaptation

  • Speech applications

How speech recognition works in under 4 minutes

Short and fast, but it gives a good insight into automatic speech recognition.

How might computers do it?

Digitization; Acoustic analysis of the speech signal; Linguistic interpretation

From Why is Automatic Speech Recognition complex?, Vivoka (https://vivoka.com/automatic-speech-recognition-complexity/)
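
To make the first two stages concrete, the sketch below digitizes a recording (16-bit PCM rescaled to floating point) and performs a basic short-time acoustic analysis: overlapping Hamming-windowed frames turned into magnitude spectra, i.e. a spectrogram. The filename and the 25 ms / 10 ms frame settings are illustrative assumptions.

    import numpy as np
    from scipy.io import wavfile

    # Digitization: read 16-bit PCM samples (assumed mono) and rescale to [-1, 1).
    sr, samples = wavfile.read("utterance.wav")   # hypothetical recording
    samples = samples.astype(np.float32) / 32768.0

    # Acoustic analysis: 25 ms frames with a 10 ms hop, Hamming-windowed.
    frame_len, hop = int(0.025 * sr), int(0.010 * sr)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)

    # Short-time magnitude spectrum of every frame (the columns of a spectrogram).
    spectra = np.abs(np.fft.rfft(frames, n=512, axis=1))
    print(spectra.shape)   # (n_frames, 257)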

Automatic speech recognition is hard and complex

Speech recognition is one of the biggest challenges in machine learning. Although speech is a very easy and natural interface for humans, it is not such a simple task for computers, because it requires high-level expertise from many research fields: signal processing, probability and statistics, computer algorithms, optimization, natural language processing, etc.

Speaker Adaptation Using i-Vector Based Clustering

Quick summary: using a set of speaker-group models improves speech recognition performance.

Method: using i-vectors (intermediate vectors), speakers are clustered into several groups, and group-dependent models are applied to unknown speakers to improve speech recognition performance.

  • (SCIE) Minsoo Kim, Gil-Jin Jang, Ji-Hwan Kim, Minho Lee. Speaker Adaptation Using i-Vector Based Clustering. KSII Transactions on Internet and Information Systems, Vol.14, No.7, pp. 2785-2799, July 31, 2020

  • Minsu Kim, Gil-Jin Jang, Ji-Hwan Kim. i-Vector-based Speaker Clustering and Model Adaptation for Large Scale Speech Recognition. International Conference on Electronics, Information and Communication (ICEIC 2018). pp. 1146-1150. Sheraton Waikiki Hotel, Honolulu, Hawaii, USA. January 26, 2018.

Training

Speakers are divided into C groups (clusters), and the speaker-independent (SI) model is retrained with the training data of each speaker group.
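
A hedged sketch of this clustering step, assuming one i-vector per training speaker has already been extracted (the dimensionality, the number of clusters C, and the use of scikit-learn's KMeans are illustrative assumptions, not the exact recipe):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    ivectors = rng.standard_normal((200, 400))   # stand-in for 200 speakers' 400-dim i-vectors
    C = 4                                        # number of speaker clusters (illustrative)

    km = KMeans(n_clusters=C, n_init=10, random_state=0).fit(ivectors)
    cluster_of_speaker = km.labels_              # group index of every training speaker
    # The SI acoustic model is then retrained once per cluster, using only
    # the utterances of that cluster's speakers.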

Testing

Once an unknown utterance is entered, the closest speaker group (cluster) is found by i-vector similarity, and the corresponding group-dependent model is used for recognition.
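
Cluster selection at test time can be sketched as a nearest-centroid search; cosine similarity is a common similarity measure for i-vectors, though its use here is an assumption:

    import numpy as np

    def closest_cluster(ivec, centroids):
        # Return the index of the centroid most similar to the test i-vector
        # (cosine similarity between the i-vector and each cluster centroid).
        sims = centroids @ ivec / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(ivec))
        return int(np.argmax(sims))

    # With the KMeans fit above: group = closest_cluster(test_ivec, km.cluster_centers_),
    # and the model retrained on that group's speakers is used for decoding.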

Model

Input: short-time spectrogram

Acoustic model: HMM-DNN hybrid with bi-directional LSTM

Toolkit: Kaldi
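
The system itself is built with Kaldi; purely to illustrate the neural half of the HMM-DNN hybrid, here is a minimal PyTorch sketch of a bidirectional LSTM that maps spectrogram frames to per-frame HMM-state scores (layer sizes and the number of states are assumptions, not the paper's configuration):

    import torch
    import torch.nn as nn

    class BLSTMAcousticModel(nn.Module):
        # Bidirectional LSTM mapping spectrogram frames to HMM-state logits.
        def __init__(self, n_feats=40, hidden=320, n_layers=3, n_states=2000):
            super().__init__()
            self.blstm = nn.LSTM(n_feats, hidden, n_layers,
                                 batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_states)

        def forward(self, x):              # x: (batch, time, n_feats)
            h, _ = self.blstm(x)           # h: (batch, time, 2 * hidden)
            return self.out(h)             # per-frame state logits for HMM decoding

    model = BLSTMAcousticModel()
    logits = model(torch.randn(2, 100, 40))   # two utterances, 100 frames each
    print(logits.shape)                       # torch.Size([2, 100, 2000])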

Hierarchical Phoneme Classification for Improved Speech Recognition

Quick summary: using phoneme-group models improves phoneme classification performance.

Method: the 39 phonemes of the TIMIT database are grouped into 5 groups (fricatives, affricates, stops, nasals, and vowels) using hierarchical clustering, and different model configurations are used to classify phonemes within each group, as sketched below.

  • (SCIE) Donghoon Oh, Jeong-Sik Park, Ji-Hwan Kim, Gil-Jin Jang*. Hierarchical Phoneme Classification for Improved Speech Recognition. Applied Sciences (MDPI), 2021, 11:428, pp. 1-17.
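
A hedged sketch of the two-stage decision: a group classifier picks the phoneme group first, and a group-specific classifier then picks the phoneme within that group. The group contents and the stand-in classifiers are illustrative, not the paper's trained models.

    import numpy as np

    # Illustrative phoneme groups (the paper derives its 5 groups from TIMIT
    # via hierarchical clustering; these members are examples only).
    GROUPS = {
        "stops":  ["p", "t", "k", "b", "d", "g"],
        "nasals": ["m", "n", "ng"],
        "vowels": ["iy", "ih", "eh", "ae", "aa"],
    }

    def classify_hierarchically(features, group_classifier, phoneme_classifiers):
        # Stage 1: decide the phoneme group; Stage 2: decide the phoneme
        # within that group using a classifier trained on that group only.
        group = group_classifier(features)
        phoneme = phoneme_classifiers[group](features)
        return group, phoneme

    # Toy usage with stand-in classifiers.
    group_clf = lambda x: "nasals"
    phone_clfs = {g: (lambda x, g=g: GROUPS[g][0]) for g in GROUPS}
    print(classify_hierarchically(np.zeros(40), group_clf, phone_clfs))   # ('nasals', 'm')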

Overall model structure (figures): BLSTM+FCN and BLSTM+BLSTM+FCN configurations

Improved performance (figure)