Speech recognition is commonly understood as the ability to identify and respond to the sounds of human speech. In more scientific terms, it is the interdisciplinary subfield of computational linguistics that develops methods and technologies enabling computers to recognize spoken language and convert it into text.
Math is the study of topics such as quantity, structure, space, and change. We see math all the time in our everyday lives: you use it to build things, at the grocery store, in baking, and, importantly, when you deal with money.
Speech recognition is a supervised machine learning problem. Statistical models learn the patterns in the audio waveform that make up the sounds of speech. To transcribe new speech automatically, a recognizer combines two models: an acoustic model and a language model. The acoustic model captures which sounds were spoken, and the language model captures which word sequences are common and which are rare.
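To picture how an acoustic model and a language model can work together, here is a minimal sketch in Python. The candidate phrases and all of the probabilities are made up for illustration; a real recognizer would compute them from trained models. The idea is simply that each candidate transcription gets a combined score, and the highest score wins.

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio.
# "acoustic" stands in for P(audio | words) from an acoustic model,
# "language" stands in for P(words) from a language model.
# All numbers are invented for this example.
CANDIDATES = {
    "recognize speech": {"acoustic": 0.40, "language": 0.010},
    "wreck a nice beach": {"acoustic": 0.45, "language": 0.0001},
}


def combined_log_score(scores):
    """Add log-probabilities, which is the same as multiplying probabilities."""
    return math.log(scores["acoustic"]) + math.log(scores["language"])


def best_transcription(candidates):
    """Return the candidate text with the highest combined score."""
    return max(candidates, key=lambda text: combined_log_score(candidates[text]))


if __name__ == "__main__":
    print(best_transcription(CANDIDATES))
```

In this toy example the second phrase matches the audio slightly better, but the language score says it is a much rarer thing to say, so the first phrase is chosen.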
Our speech is composed of phonemes, which are the individual sounds of a language, such as the consonant ‘d’ sound at the beginning of ‘dog’ or the ‘oo’ sound in the middle of ‘cook’. "A dictionary maps words in the language to their phonetic pronunciation, and enables us to model acoustics at the phoneme level." There are roughly 40 to 50 phonemes in the English language, depending on how they are counted, compared to hundreds of thousands of words in the vocabulary. In practice, it makes more sense to record spoken examples of these few dozen phonemes than to record every one of the words the English language has.
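The dictionary idea can be sketched in a few lines of Python. The entries below are hand-written for illustration, using ARPAbet-style symbols with stress marks left out; a real system would load a full pronunciation dictionary instead. The names PRONUNCIATIONS, to_phonemes, and phoneme_inventory are made up for this example.

```python
# A tiny, hand-written pronunciation dictionary (illustrative entries only).
# Each word maps to the list of phonemes that make it up.
PRONUNCIATIONS = {
    "dog": ["D", "AO", "G"],
    "cook": ["K", "UH", "K"],
    "good": ["G", "UH", "D"],
}


def to_phonemes(sentence):
    """Turn a sentence into one flat list of phonemes, word by word."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])  # KeyError if the word is unknown
    return phonemes


def phoneme_inventory(words):
    """Collect the distinct phonemes needed to pronounce the given words."""
    return {phoneme for word in words for phoneme in PRONUNCIATIONS[word]}


if __name__ == "__main__":
    print(to_phonemes("good dog"))
    print(sorted(phoneme_inventory(PRONUNCIATIONS)))
```

Notice that the same few phonemes are reused across different words. That reuse is what makes it practical to model sounds rather than whole words.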
The language model tells us which word sequences are common. For example, the phrase “How are you today?” is a common question in the English language, whereas “Noise proves problematic for speech recognition” is quite rare. The model assigns appropriate probabilities to these phrases, and to every other sentence, by learning from a text corpus of billions of words.
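As a rough illustration of how a model can learn such probabilities from text, here is a minimal bigram sketch in Python. It counts how often each word follows another in a tiny made-up corpus (four sentences instead of billions of words) and uses add-one smoothing so unseen word pairs still get a small probability. This is one simple technique chosen for the example, not necessarily what any particular recognizer uses.

```python
from collections import Counter

# A tiny stand-in corpus, invented for this example.
CORPUS = [
    "how are you today",
    "how are you",
    "how are things today",
    "are you there",
]


def train_bigrams(sentences):
    """Count single words (as contexts) and adjacent word pairs."""
    contexts, pairs = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]  # start/end markers
        contexts.update(words[:-1])           # every word that has a successor
        pairs.update(zip(words, words[1:]))   # (previous word, next word)
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, pairs, len(vocab)


def sentence_probability(sentence, contexts, pairs, vocab_size):
    """Multiply add-one-smoothed bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for prev, cur in zip(words, words[1:]):
        probability *= (pairs[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
    return probability


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    for text in ("how are you today", "today you are how"):
        print(text, sentence_probability(text, *model))
```

The familiar word order comes out with a much higher probability than the scrambled one, which is the kind of preference a language model contributes.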
The acoustic model describes the sounds we make when we speak. For a small vocabulary, such as the digits 0-9 alone, it is possible to model the acoustic details of each individual word. As the vocabulary expands, it becomes more difficult, and eventually impossible, to record enough spoken samples of every word. Therefore, acoustics need to be modeled at a finer level of granularity: individual sounds rather than whole words.
This page by Kayla G. ('19)