My publications in this field are (link):
B. Choudhury, C. C. Bhanja, T. S. Choudhury, A. Pramanik and R. H. Laskar, “A Comparative Study of Discriminative Approaches for Classifying Languages into Tonal and Non-Tonal Categories at Syllabic Level,” 10th INDIACom New Delhi, 2016, pp. 1260-1264.
content incomplete, more to come
Signal processing is the process of extracting relevant information from the speech signal in an efficient, robust manner. A speech recognition system comprises a collection of algorithms drawn from a wide variety of disciplines, including statistical pattern recognition, communication theory, signal processing, combinatorial mathematics, and linguistics, among others. Although each of these areas is relied on to varying degrees in different recognizers, perhaps the greatest common denominator of all recognition systems is the signal-processing front end, which converts the speech waveform into some type of parametric representation for further analysis and processing.
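To make the front end concrete, here is a minimal sketch in Python (NumPy only) that converts a waveform into a sequence of short-time magnitude spectra; the 25 ms frame, 10 ms hop, and 0.97 pre-emphasis coefficient are common but illustrative choices, not values mandated by any particular recognizer.

# Minimal signal-processing front end: waveform -> short-time magnitude spectra.
import numpy as np

def front_end(signal, sample_rate=16000, frame_ms=25, hop_ms=10, pre_emph=0.97):
    # Pre-emphasis boosts the high frequencies attenuated by the glottal source.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    window = np.hamming(frame_len)
    spectra = []
    for i in range(n_frames):
        frame = emphasized[i * hop_len : i * hop_len + frame_len]
        # Windowing reduces spectral leakage at the frame edges.
        spectra.append(np.abs(np.fft.rfft(frame * window)))
    return np.array(spectra)   # shape: (n_frames, frame_len // 2 + 1)

# Usage on one second of synthetic audio:
audio = np.random.randn(16000)
print(front_end(audio).shape)   # (98, 201)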
Speech recognition can be defined as the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. After text-to-speech (TTS) and interactive voice response (IVR) systems, automatic speech recognition (ASR) is one of the fastest-developing fields in speech science and engineering, and it stands as the next major innovation in man-machine interaction. Modern speech recognition systems can handle vocabularies of thousands of words. ASR has applications in many aspects of daily life, for example telephone services, assistive applications for the physically handicapped and for illiterate users, and many others in the field of computer science. Speech serves as both an input and an output modality in Human-Computer Interaction (HCI) design. HCI involves the design, implementation, and evaluation of interactive systems in the context of the users' tasks and work.
Signal processing is commonly used in sound mixers (music recording), audio processors (FM broadcasting), synthesizers (sound synthesis), and voice calls (noise reduction and speech codecs), among other applications.
Why It Is Difficult
1. Acoustic patterns vary from instance to instance
– Natural variations: Even the same person never speaks anything exactly the same way twice
– Systematic variations
• Human physiology: squeaky voice vs deep voice
• Speaking style: clear, spontaneous, slurred or sloppy
• Speaking rate: fast or slow speech. Speaking rate can change within a single sentence
• Emotional state: happy, sad, etc.
• Emphasis: stressed speech vs unstressed speech
• Accents, dialects, foreign words
– Environmental or background noise
2. Linguistic patterns are hard to characterize
– Large vocabulary and infinite language
– Absence of word boundary markers in continuous speech
– Inherent ambiguities: “I scream” or “Ice cream”? Both are linguistically plausible; other context cues are needed
Speech Recognition Systems
Speech recognition is a technology that allows control of machines by voice, in the form of isolated or connected word sequences. It involves the recognition and understanding of spoken language by machine. Speech recognition is based on pattern recognition: the objective is to take an input pattern, the speech signal, and classify it as a sequence of stored patterns that have been precisely defined. These stored patterns may be made up of units we call phonemes. If speech patterns were invariant and unchanging, there would be no problem: simply compare sequences of features with the stored patterns and report exact matches when they occur. But the fundamental difficulty of speech recognition is that the speech signal is highly variable across speakers, speaking rates, content, and acoustic conditions. The task is to determine which variations in the speech are relevant to recognition and which are not.
Feature Extraction and Matching
Feature extraction is the process of deriving from the voice signal data that can later be used to represent each word. Feature matching is the procedure that identifies a new word by comparing the features extracted from the voice input with those of a set of known words. All speech recognition systems operate in two distinct phases: an enrollment (training) phase and a testing phase.
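As one classical illustration of these two phases, the sketch below enrolls a feature-sequence template per word and matches a test utterance by dynamic time warping (DTW); DTW is just one possible matching procedure, and the 13-dimensional random "features" stand in for real MFCC-style vectors.

# Template matching over feature sequences with dynamic time warping (DTW).
import numpy as np

def dtw_distance(a, b):
    # Accumulated-cost matrix over Euclidean frame-to-frame distances.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allow match, insertion, and deletion steps.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

def recognize(test_features, templates):
    # templates: dict mapping word label -> enrolled feature sequence.
    return min(templates, key=lambda w: dtw_distance(test_features, templates[w]))

# Toy usage with random 13-dimensional "MFCC" sequences:
rng = np.random.default_rng(0)
templates = {"yes": rng.normal(size=(40, 13)), "no": rng.normal(size=(35, 13))}
test = templates["yes"] + 0.1 * rng.normal(size=(40, 13))
print(recognize(test, templates))   # "yes"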
Sources of Information Useful for Language ID
What makes this problem so challenging and interesting? In monolingual spoken language systems, the objective is to determine the content of the speech, typically implemented as phoneme recognition coupled with word recognition and sentence recognition. This requires that researchers cue in on small portions of the speech (frames, phonemes, syllables, sub-word units, and so on) to determine what the speaker said. In contrast, in text-independent language identification, phonemes and other sub-word units alone are not sufficient cues, since many phonemes, syllables, and even words are shared across languages. One also needs to examine the sentence as a whole to determine the acoustic signature of the language: the unique characteristics that make one language sound distinct from another.
Decoding this "acoustic signature" requires information from several sources:
– Acoustic phonetics: Phonetic inventories differ from language to language. Even when languages share identical phones, the frequencies of occurrence of those phones differ across languages.
– Prosodics: Languages vary in the duration of phones, speech rate, and intonation (pitch contour). Tonal languages (i.e., languages in which the intonation of a word determines its meaning), such as Mandarin and Vietnamese, have very different intonation characteristics from stress languages such as English.
– Phonotactics: Phonotactics refers to the rules that govern the combinations of the different phones in a language. There is wide variance in phonotactic rules across languages. For example, the phone cluster /sr/ is very common in the Dravidian language Tamil, whereas it is not a legal cluster in English.
– Vocabulary: Conceptually, the most important difference between languages is that they use different sets of words. Thus, a non-native speaker of English is likely to use the phonemic inventory, prosodic patterns, and even (approximately) the phonotactics of her/his native language, but will be judged to speak English if the vocabulary used is that of English.
Language-specific features of speech: acoustic phonetics, phonotactics, prosody, vocabulary, and lexical structure.
Speaker-specific features of speech: prosody, idiolect, semantics, excitation characteristics, and vocal tract shape and size.
Types of Language Identification Schemes
Utilizing Acoustic and Phonetic Information – The purely acoustic LID approach aims at capturing the essential differences between languages by modeling the distributions of spectral features directly. Because different languages contain different phonemes, the earliest automatic LID systems were developed mainly around differences in spectral content among languages. During the training phase, a set of prototypical short-term spectral features is computed from the raw speech. During recognition, the same type of spectral features is computed from the test utterance and compared with the training prototypes. The language of the training speech behind the model yielding the maximum likelihood is hypothesized as the language of the utterance.
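A minimal sketch of this approach, assuming one Gaussian mixture model (GMM) per language over short-term spectral features (scikit-learn's GaussianMixture is used here purely for illustration; a real system would use MFCCs with deltas, many more mixture components, and far more training data):

# Acoustic LID: one GMM per language; pick the maximum-likelihood model.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_language_models(train_data, n_components=8):
    # train_data: dict mapping language -> stacked feature frames (frames x dims).
    models = {}
    for lang, feats in train_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        models[lang] = gmm.fit(feats)
    return models

def identify_language(models, test_feats):
    # score_samples gives per-frame log-likelihoods; sum over the utterance.
    return max(models, key=lambda lang: models[lang].score_samples(test_feats).sum())

# Toy usage with synthetic "spectral" features:
rng = np.random.default_rng(0)
train = {"en": rng.normal(0.0, 1.0, size=(500, 13)),
         "zh": rng.normal(1.5, 1.0, size=(500, 13))}
models = train_language_models(train)
test = rng.normal(1.5, 1.0, size=(100, 13))
print(identify_language(models, test))   # "zh"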
Utilizing Phonotactic Information – Phonotactics is one of the most widely used information sources in the LID task, and Phoneme Recognition followed by Language Modeling (PRLM) has been among the best-performing LID approaches in recent NIST Language Recognition Evaluations (LRE). The phonotactic approach requires a phoneme recognizer, so one of its limitations is that phonetically transcribed speech data must be available in order to develop the front end for phoneme recognition.
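The following sketch captures the PRLM idea under strong simplifications: the phoneme recognizer is stubbed out (its output phone strings are given directly), and each language is modeled by an add-one-smoothed phone-bigram model; real systems use richer n-grams and smoothing.

# PRLM sketch: per-language phone-bigram models score decoded phone strings.
from collections import Counter
import math

def train_bigram_model(phone_sequences):
    bigrams, unigrams = Counter(), Counter()
    for seq in phone_sequences:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def log_prob(model, seq, vocab_size=50):
    bigrams, unigrams = model
    # Add-one smoothing keeps unseen bigrams from zeroing out a score.
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
               for a, b in zip(seq, seq[1:]))

# Toy usage: "decoded" phone strings for two languages.
models = {
    "ta": train_bigram_model([["s", "r", "i", "s", "r", "a"]] * 20),
    "en": train_bigram_model([["s", "t", "r", "i", "ng"]] * 20),
}
test = ["s", "r", "i"]   # /sr/ cluster, common in Tamil
print(max(models, key=lambda l: log_prob(models[l], test)))   # "ta"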
Utilizing Prosodic Information – Prosodic information is primarily encoded in two signal components of human speech: fundamental frequency (F0) and amplitude. Properties of the F0 and amplitude contours can therefore be exploited for the LID task. In linguistic terms, prosody comprises duration, pitch patterns, and stress patterns, so different prosody-based LID systems may rely on different combinations of these prosodic features.
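As an illustration, the sketch below extracts the two contours named above: a per-frame F0 estimate via a simple autocorrelation method and a per-frame RMS amplitude. The frame sizes and the 50-400 Hz search range are assumptions; production systems use sturdier pitch trackers such as RAPT or YIN.

# Prosodic contours: per-frame F0 (autocorrelation) and RMS amplitude.
import numpy as np

def prosodic_contours(signal, sr=16000, frame_len=640, hop=320):
    f0s, amps = [], []
    lag_min, lag_max = sr // 400, sr // 50   # search 400 Hz down to 50 Hz
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start : start + frame_len]
        amps.append(np.sqrt(np.mean(frame ** 2)))        # RMS amplitude
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1 :]
        lag = lag_min + np.argmax(ac[lag_min:lag_max])   # strongest period
        f0s.append(sr / lag)
    return np.array(f0s), np.array(amps)

# Usage on a synthetic 120 Hz tone: F0 estimates should sit near 120.
t = np.arange(16000) / 16000
f0, amp = prosodic_contours(np.sin(2 * np.pi * 120 * t))
print(f0.round(1)[:5], amp.round(3)[:5])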
Utilizing Word Level Information – The most effective approach to LID would ideally utilize complete knowledge of the lexical and grammatical information of a language. This requires decoding the incoming utterance into strings of words and analyzing them, for which large-vocabulary continuous speech recognizers (LVCSR) can be used. This type of speech recognizer implicitly incorporates both acoustic and phonetic features into the recognition process. With the further incorporation of language-specific vocabulary and grammar rules for determining the correct word sequence, LVCSR-based LID systems can be expected to produce very high accuracy, since they utilize many levels of speech information.
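Schematically, an LVCSR-based LID system runs one full recognizer per language and picks the language whose decoder best explains the utterance, as in the hypothetical sketch below; the recognizers themselves are stubbed out, since building them requires transcribed speech, a lexicon, and a language model per language.

# LVCSR-based LID skeleton: run each language's recognizer, keep the best score.
from typing import Callable, Dict, Tuple

# Each recognizer maps audio to (best word sequence, total decode score),
# where the score already combines acoustic and language-model terms.
Recognizer = Callable[[bytes], Tuple[str, float]]

def lvcsr_lid(audio: bytes, recognizers: Dict[str, Recognizer]) -> str:
    scores = {lang: rec(audio)[1] for lang, rec in recognizers.items()}
    return max(scores, key=scores.get)

# Toy stand-ins for trained recognizers:
recognizers = {
    "en": lambda audio: ("the cat sat", -120.0),
    "hi": lambda audio: ("billi baithi", -95.0),
}
print(lvcsr_lid(b"...", recognizers))   # "hi": best (least negative) score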
The acoustic features are easy to obtain (low cost in training and testing) but volatile in the face of speech variations such as speaker or channel variations. The phonotactic approaches seem to provide the best performance-to-cost ratio and are capable of modeling an arbitrary number of languages without linguistic knowledge of all the target languages during the training phase; but since they model phonotactic constraints, they do not perform well for utterance durations of less than 5 seconds. Though prosodic features are robust to channel mismatch, LID systems based on prosodic features are relatively rare, the reason being the difficulty of modeling prosodic characteristics properly. It should be noted, however, that prosodic information can still be used for distinguishing language groups (such as tonal and non-tonal languages). While LVCSR-based LID systems can be considered the most accurate solution to the LID task, their main problem is that they cannot identify languages without enough transcribed speech data to develop the phoneme recognizer, as well as word-transcribed training data for the large-vocabulary continuous speech recognizer.
A brief summary of the various spoken language features:

Acoustic
– Representation: spectral features such as MFCC, LFCC
– Advantage: low cost in training and testing (in terms of both the data required and computational complexity); can be used in combination with other spoken language features (e.g., via GMMs and/or SVMs)
– Constraint: volatile under speech variations such as speaker or channel variations

Phonotactic
– Representation: sequences of sub-word labels
– Advantage: good performance-to-cost ratio; no linguistic knowledge of all target languages required to train
– Constraint: useful only for test utterances longer than about 5 seconds

Prosodic
– Representation: features derived from duration, F0, and amplitude
– Advantage: robust to channel variation
– Constraint: mostly suitable for distinguishing language groups

Word Level (LVCSR or keyword spotting)
– Representation: sequences of word transcriptions
– Advantage: high accuracy; can be used for testing short utterances
– Constraint: significant training effort and linguistic input required
Read this book for a detailed analysis of everything related to speech signal processing.