Abhijeet Sangwan

General Automatic Speech Recognition

 Improved Parcel Sorting by Combining Automatic Speech and Character Recognition  IEEE ESPA 2012
Abstract: Automatic postal sorting systems have traditionally relied on optical character recognition (OCR) technology. While OCR systems perform well for flat mail items such as envelopes, the performance deteriorates for parcels. In this study, we propose a new multimodal solution for parcel sorting which combines automatic speech recognition (ASR) technology with OCR in order to deliver better performance. Our multimodal approach is based on estimating OCR output confidence, and then optionally using ASR system output when OCR results show low confidence. In particular, we propose a Levenshtein edit distance (LED) based measure to compute OCR confidence. Based on the OCR confidence measure, a dynamic fusion strategy is developed that forms its final decision based on (i) OCR output alone, (ii) ASR output alone, or (iii) a combination of ASR and OCR outputs. The proposed system is evaluated on speech and image data collected in real-world conditions. Our experiments show that the proposed multimodal solution achieves an overall zip code recognition rate of 90.2%, a substantial improvement over the ASR-alone (81%) and OCR-alone (80.6%) systems. This advancement represents an important contribution that leverages OCR and ASR technologies to improve address recognition for parcels.
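As a concrete illustration of the LED-based confidence and dynamic fusion described above, here is a minimal Python sketch. The confidence mapping, the thresholds, and the agreement-based middle regime are illustrative assumptions, not the published system.

```python
# Sketch of LED-based OCR confidence and a dynamic OCR/ASR fusion rule.
# Thresholds and the fusion policy are hypothetical, for illustration only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ocr_confidence(hypothesis: str, lexicon: list) -> float:
    """Confidence from the edit distance to the closest valid zip code:
    1.0 for an exact lexicon match, decaying as the nearest match
    gets farther away (illustrative mapping)."""
    d = min(levenshtein(hypothesis, w) for w in lexicon)
    return 1.0 / (1.0 + d)

def fuse(ocr_out: str, asr_out: str, conf: float,
         hi: float = 0.9, lo: float = 0.4) -> str:
    """Dynamic fusion: trust OCR alone at high confidence, ASR alone at
    low confidence, and combine both in between (here: fall back to ASR
    unless the two modalities agree)."""
    if conf >= hi:
        return ocr_out
    if conf < lo:
        return asr_out
    return ocr_out if ocr_out == asr_out else asr_out
```

A usage sketch: with `lexicon = ["75080", "10001"]`, an OCR read of `"75O8O"` gets a low confidence and the decision defers to the ASR hypothesis.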

Automatic Multimedia Content Analysis and Detection

 Automatic Excitement-Level Detection for Sports Highlights Generation  Interspeech 2010
 Abstract: The problem of automatic excitement detection in baseball videos is considered and applied for highlight generation. This paper focuses on detecting exciting events in video using complementary information from the audio and video domains. First, a new measure for non-stationarity which is extremely effective in separating background from speech is proposed. This new feature is employed in an unsupervised GMM-based segmentation algorithm that identifies the sports commentators' speech within the crowd background. Thereafter, the “level-of-excitement” is measured using features such as pitch, F1–F3 center frequencies, and spectral center of gravity extracted from the commentators' speech. Our experiments using actual baseball videos show that these features are well correlated with human assessment of excitability. Furthermore, slow-motion replay and baseball pitching scenes from the video are also detected to estimate scene end-points. Finally, audio/video information is fused to rank-order scenes by “excitability” in order to generate highlights of user-defined time lengths. The techniques described in this paper are generic and applicable to a variety of topics and video/acoustic domains.
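The "spectral center of gravity" feature mentioned above can be sketched as a magnitude-weighted mean frequency of an analysis frame. This single-frame, rectangular-window form is an illustrative assumption, not the paper's exact feature extraction.

```python
import numpy as np

def spectral_center_of_gravity(frame: np.ndarray, fs: float) -> float:
    """Magnitude-weighted mean frequency (Hz) of one frame: an
    illustrative form of the excitement-related spectral feature."""
    spec = np.abs(np.fft.rfft(frame))             # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return float(np.sum(freqs * spec) / np.sum(spec))
```

For a pure tone the measure collapses to the tone frequency; for excited, tense speech it tends to shift upward with spectral energy.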

Phonological Features Based Speech Modeling: Applications in Speech Recognition, Language Identification, and Accent Assessment

 Automatic analysis of Mandarin accented English using phonological features Speech Communication, Elsevier
 Abstract: The problem of accent analysis and modeling has been considered from a variety of domains, including linguistic structure, statistical analysis of speech production features, and HMM/GMM (Hidden Markov Model / Gaussian Mixture Model) classification. These studies, however, fail to connect a temporal view of speech production to a final classification strategy. Here, a novel accent analysis system and methodology which exploits the power of phonological features (PFs) is presented. The proposed system exploits the knowledge of articulation embedded in phonology by building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word by a speaker on a scale of native-like (+1) to non-native-like (-1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The experimental results demonstrate the capability of the proposed system to perform quantitative as well as qualitative analysis of foreign accents. The work developed in this study can be easily expanded into language learning systems, and has potential impact in the areas of speaker recognition and ASR (automatic speech recognition).
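One way to realize a native-like (+1) to non-native-like (-1) scale from the two Markov models is to squash the log-likelihood difference of a word under each model. The tanh mapping below is an illustrative assumption, not the paper's published measure.

```python
import math

def accentedness(ll_native: float, ll_nonnative: float) -> float:
    """Map the log-likelihood of a word's PF sequence under the native
    vs. non-native Markov models to a bounded score: +1 = native-like,
    -1 = non-native-like. tanh is a hypothetical squashing choice."""
    return math.tanh(ll_native - ll_nonnative)
```

A word whose PF transitions and durations are better explained by the native model scores positive; the score saturates as the evidence grows.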

 Using Human Perception for Automatic Accent Assessment
 Interspeech 2011
 Abstract: In this study, a new algorithm for automatic accent evaluation of native and non-native speakers is presented. The proposed system consists of two main steps: alignment and scoring. At the alignment step, the speech utterance is processed using a Weighted Finite State Transducer (WFST) based technique to automatically estimate the pronunciation errors. Subsequently, in the scoring step, a Maximum Entropy (ME) based technique is employed to assign perceptually motivated scores to pronunciation errors. The combination of the two steps yields an approach that measures accent based on the perceptual impact of pronunciation errors, and is termed the Perceptual WFST (P-WFST). The P-WFST is evaluated on American English (AE) spoken by native and non-native (native speakers of Mandarin-Chinese) speakers from the CUAccent corpus. The proposed P-WFST algorithm shows higher and more consistent correlation with human-evaluated accent scores, when compared to the Goodness Of Pronunciation (GOP) algorithm.

 Language Identification Using a Combined Articulatory Prosody Framework
 ICASSP 2011
 Abstract: This study presents new advancements in our articulatory-based language identification (LID) system. Our LID system automatically identifies language-features (LFs) from a phonological features
(PFs) based representation of speech. While our baseline system uses a static PF-representation for extracting LFs, the new system is based on a dynamic PF representation for feature extraction. Interestingly, the new LFs outperform our baseline system by 11.8% absolute in a difficult 5-way classification task of South Indian Languages. Additionally, we incorporate pitch and energy based features in our new system to leverage prosody in classification. In particular, we employ a Legendre polynomial based contour-estimation to capture shape parameters which are used in classification. Additionally, the fusion of PF and prosody-based LFs further improves the overall classification result by 16.5% absolute over the baseline system. Finally, the proposed articulatory language ID system is combined with a PPRLM (parallel phone recognition language model) system to obtain an overall classification accuracy of 86.6%.
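The Legendre-polynomial contour estimation mentioned above can be sketched with NumPy's Legendre least-squares fit: the low-order coefficients summarize the shape (mean, tilt, curvature) of a pitch or energy contour. The domain mapping and polynomial order here are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

def contour_shape(contour: np.ndarray, order: int = 3) -> np.ndarray:
    """Fit a low-order Legendre polynomial to a pitch/energy contour and
    return the coefficients as shape features. The contour is mapped onto
    the Legendre domain [-1, 1] regardless of its length."""
    x = np.linspace(-1.0, 1.0, len(contour))
    return legendre.legfit(x, contour, order)
```

For example, a linearly rising contour yields a dominant first-order coefficient, while a flat contour is captured entirely by the zeroth-order term.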

 Keyword Recognition with Phone Confusion Networks and Phonological Features based Keyword Threshold Detection
 Abstract: In this study, a new keyword spotting system (KWS) that utilizes phone confusion networks (PCNs) is presented. The new system exploits the compactness and accuracy of phone confusion networks to deliver fast and accurate results. Special design considerations are provided within the new algorithm to account for phone recognizer induced insertion and deletion errors. Furthermore, this study proposes a new threshold estimation technique that uses the keyword constituent phones and phonological features (PFs) for threshold computation. The new threshold estimation technique is able to deliver thresholds that improve the overall F-score for keyword detection. The final integrated system is able to achieve a better balance between precision and recall.

 Automatic Language Analysis and Identification based on Speech Production Knowledge  ICASSP 2010
 Abstract: In this paper, a language analysis and classification system that leverages knowledge of speech production is proposed. The proposed scheme automatically extracts key production traits (or “hot-spots”) that are strongly tied to the underlying language structure. In particular, the speech utterance is first parsed into consonant and vowel clusters. Subsequently, the production traits for each cluster are represented by the corresponding temporal evolution of speech articulatory states. It is hypothesized that a selection of these production traits is strongly tied to the underlying language, and can be exploited for language ID. The new scheme is evaluated on our South Indian Languages (SInL) corpus, which consists of 5 closely related languages spoken in India, namely, Kannada, Tamil, Telugu, Malayalam, and Marathi. Good accuracy is achieved with a rate of 65% obtained in a difficult 5-way classification task with about 4 seconds of train and test speech data per utterance. Furthermore, the proposed scheme is also able to automatically identify key production traits of each language (e.g., dominant vowels, stop-consonants, fricatives, etc.).
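The first step above, parsing an utterance into maximal consonant and vowel clusters, can be sketched as a run-length grouping. This toy operates on graphemes with a hypothetical vowel set, whereas the paper operates on recognized phone sequences.

```python
import itertools

# Illustrative grapheme-level stand-in for the broad phone classes.
VOWELS = set("aeiou")

def cv_clusters(symbols: str) -> list:
    """Split a sequence into maximal consonant and vowel clusters,
    the first step of the hot-spot extraction (sketch)."""
    return ["".join(group) for _, group in
            itertools.groupby(symbols, key=lambda s: s in VOWELS)]
```

Each cluster would then be characterized by the temporal evolution of articulatory states within it.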
 Leveraging Speech Production Knowledge for Improved Speech Recognition  ASRU 2009
 Abstract: This study presents a novel phonological methodology for speech recognition based on phonological features (PFs) which leverages the relationship between speech phonology and phonetics. In particular, the proposed scheme estimates the likelihood of observing speech phonology given an associative lexicon. In this manner, the scheme is capable of choosing the most likely hypothesis (word candidate) among a group of competing alternative hypotheses. The framework employs the Maximum Entropy (ME) model to learn the relationship between phonetics and phonology. Subsequently, we extend the ME model to a ME-HMM (maximum entropy-hidden Markov model) which captures the speech production and linguistic relationship between phonology and words. The proposed ME-HMM model is applied to the task of re-processing N-best lists with good results.

 Speech Under Stress: A Production based Framework
 Abstract: This paper examines the impact of physical stress on speech. The methodology adopted here identifies inter-utterance breathing (IUB) pattern as a key intermediate variable while studying the relationship between physical stress and speech. Additionally, this work connects high-level prosodical changes in the speech signal (energy, pitch, and duration) to the corresponding breathing patterns. Our results demonstrate the diversity of breathing and articulation patterns that speakers employ in order to compensate for the increased body oxygen demand. Here, we identify the normalized value of breathing energy rate (proportional to minute volume) acquired from a conventional as well as physiological microphone as a reliable and accurate estimator of physical stress. Additionally, we also show that the prosodical patterns (pitch, energy, and duration) of highlevel speech structure shows good correlation with the normalizedbreathing energy rate. In this manner, the study establishes the interconnection between temporal speech structure and physical stress through breathing.

  Phonological Features Based Variable Frame Rate Scheme for Improved Speech Recognition
 ASRU 2007
  Abstract: In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing is not capable of accurately reflecting the dynamics of continuous speech. On the other hand, the proposed VFR scheme adapts the temporal representation of the speech signal by tying the framing strategy with the detected phone class sequence. The phone classes are detected and segmented by using appropriately trained phonological features (PFs). In this manner, the proposed scheme is capable of tracking the evolution of speech due to the underlying phonetic content, and exploiting the non-uniform information flow-rate of speech by using a variable framing strategy. The new VFR scheme is applied to automatic speech recognition of TIMIT and NTIMIT corpora, where it is compared to a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results with relative reductions of 24% and 8% in WER (word error rate) for TIMIT and NTIMIT tasks, respectively.
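The idea of tying the framing strategy to the detected phone class sequence can be sketched as class-dependent frame shifts: dense framing in fast-changing classes, sparse framing in steady ones. The class-to-shift table and segment interface below are illustrative assumptions, not the paper's configuration.

```python
def vfr_frame_times(segments) -> list:
    """Given (broad_class, duration_seconds) segments from a PF-based
    segmenter, emit analysis-frame start times with a class-dependent
    shift. The shift table is hypothetical, for illustration."""
    shift = {"stop": 0.005, "vowel": 0.020}   # seconds; default below
    times, seg_start = [], 0.0
    for cls, dur in segments:
        step = shift.get(cls, 0.010)          # 10 ms default shift
        t = seg_start
        while t < seg_start + dur - 1e-9:     # frames within the segment
            times.append(round(t, 4))
            t += step
        seg_start += dur
    return times
```

A 40 ms vowel thus contributes two frames at a 20 ms shift, while a 10 ms stop contributes two frames at a 5 ms shift, reflecting the non-uniform information flow-rate of speech.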

Competitive Neyman-Pearson Hypothesis Testing + Voice Activity Detection

 Design and Performance Analysis of Bayesian, Neyman-Pearson and Competitive Neyman-Pearson Voice Activity Detectors  IEEE Signal Processing Journal
 Abstract: In this paper, the Bayesian, Neyman-Pearson (NP) and Competitive Neyman-Pearson (CNP) detection approaches are analyzed using a perceptually modified Ephraim-Malah (PEM) model, based on which a few practical voice activity detectors (VADs) are developed. Voice activity detection is treated as a composite hypothesis testing problem with a free parameter formed by the prior signal-to-noise ratio (SNR). It is revealed that a high prior SNR is more likely to be associated with the ‘speech hypothesis’ than the ‘pause hypothesis’ and vice-versa, and the CNP approach exploits this relation by setting a variable upper bound for the probability of false alarm. The proposed VADs are tested under different noises and various SNRs, using speech samples from the SWITCHBOARD database, and are compared with adaptive multi-rate (AMR) VADs. Our results show that the CNP VAD outperforms the NP and Bayesian VADs, and compares well with the AMR VADs. The CNP VAD is also computationally inexpensive, making it a good candidate for applications in communication systems.
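The NP-style design above fixes a false-alarm budget and derives the decision threshold from the pause-hypothesis distribution. A minimal sketch, assuming an exponential model for frame energy under the pause hypothesis (an illustrative stand-in for the PEM model statistic):

```python
import math

def np_threshold(noise_mean: float, p_fa: float) -> float:
    """Threshold for a target false-alarm probability when frame energy
    under the pause hypothesis H0 is exponential with the given mean:
    P(E > g | H0) = exp(-g / noise_mean) = p_fa  =>  g = -mean * ln(p_fa).
    The exponential model is an assumption for this sketch."""
    return -noise_mean * math.log(p_fa)

def np_vad(frame_energy: float, noise_mean: float,
           p_fa: float = 0.05) -> bool:
    """Declare speech when the frame energy exceeds the NP threshold."""
    return frame_energy > np_threshold(noise_mean, p_fa)
```

The CNP variant described in the abstract would go one step further, letting the false-alarm bound vary with the prior SNR rather than fixing a single `p_fa`.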

 Environmentally Aware Voice Activity Detector  Interspeech 2007
 Abstract: Traditional voice activity detectors (VADs) tend to be deaf to the acoustical background noise, as they (i) utilize a single operating point for all SNRs (signal-to-noise ratios) and noise types, and (ii) attempt to learn the background noise model online from finite data length. In this paper, we address the aforementioned issues by designing an environmentally aware (EA) VAD. The EA VAD scheme builds prior offline knowledge of commonly encountered acoustical backgrounds, and also combines the recently proposed competitive Neyman-Pearson (CNP) VAD with a SVM (support vector machine) based noise classifier. In operation, the EA VAD obtains accurate noise models of the acoustical background by employing the noise classifier and its prior knowledge of the noise type, and thereafter uses this information to set the best operating point and initialization parameters for the CNP VAD. The superior performance of the EA VAD scheme over the standard AMR (adaptive multi-rate) VADs in low SNR is confirmed in a simulation study, where speech and noise data were drawn from the SWITCHBOARD and NOISEX databases. We report an absolute improvement of 10-15% in detection rates over AMR VADs in low SNR for different noise types.

 On the Competitive Neyman-Pearson Approach for Composite Hypothesis Testing and its application in Voice Activity Detection  ICASSP 2006
 Abstract: The problem of composite hypothesis testing where the probability law governing the generation of the free parameter is not explicitly known is considered. It is shown that unlike the Neyman-Pearson (NP) approach, the competitive NP (CNP) approach models incomplete prior information about the source into the detector design by setting a variable upper bound for the probability of false-alarm term. Further, the CNP and NP approaches are employed to develop the CNP and NP detectors for voice activity detection (VAD), where the prior SNR is shown to be the free parameter of the composite hypothesis. We test the CNP and NP detectors using speech samples from the SWITCHBOARD database which are suitably corrupted using different noises and various SNRs. Our simulation results show that the CNP detector outperforms its NP counterpart and is comparable to the adaptive multi-rate (AMR) VADs.

 Improved Voice Activity Detection via Contextual Information and Noise Suppression  ISCAS 2005
 Abstract: In this paper, we develop a contextual voice activity detection (VAD) scheme which combines both contextual and frame-specific information to improve detection. Unlike many VAD algorithms which assume that the cues to activity lie within the frame alone, our scheme seeks information for activity in the current as well as the neighboring frames. The new approach provides good robustness in low SNR when the speech frame is corrupted and an alternate reliable source of activity information is necessary. Further, we present a simple noise suppression scheme to enhance the VAD performance at low SNR. The noise suppressor provides a spectrally reshaped signal to the VAD. Finally, we combine the contextual VAD and the noise suppression scheme with a basic detector to form a comprehensive VAD. The proposed comprehensive VAD system is tested on speech samples from the SWITCHBOARD database. Various noises under different SNRs are added to the speech signals. Experimental results show that the proposed VAD outperforms the standard algorithm ETSI AMR VAD-1.
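The use of neighboring-frame context can be sketched as a majority vote over a window of per-frame decisions, so an isolated noisy decision is overruled by its neighbors. The window size and voting rule are illustrative assumptions, not the paper's scheme.

```python
def contextual_vad(frame_decisions, window: int = 2) -> list:
    """Smooth raw per-frame speech/pause decisions (1 = speech) by a
    strict-majority vote over the current frame and up to `window`
    neighbors on each side (truncated at the edges)."""
    n = len(frame_decisions)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        votes = frame_decisions[lo:hi]
        out.append(sum(votes) * 2 > len(votes))   # strict majority
    return out
```

A single dropped frame inside a speech run is restored, and a single false alarm inside a pause is suppressed.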

 Comparison of Voice Activity Detection Algorithms for VoIP   ISCC 2002
 Abstract: We discuss techniques for Voice Activity Detection (VAD) for Voice over Internet Protocol (VoIP). VAD reduces the bandwidth requirement of a voice session, thereby increasing bandwidth efficiency. Such a scheme would be implemented in the application layer, so the VAD is independent of the lower layers in the network stack [1]. In this paper, we compare the quality of speech, level of compression and computational complexity for three time-domain and three frequency-domain VAD algorithms. Implementation of the time-domain algorithms is direct and they are computationally simple. However, better speech quality is obtained with the frequency-domain algorithms. A comparison of the relative merits and demerits, along with the subjective quality of speech after removal of silence periods, is presented for all the algorithms. A quantitative measurement of speech quality for different algorithms is also presented.

 VAD Techniques for Real-Time Speech Transmission on the Internet  HSNMC 2002

Speech Feature: Warped Discrete Cosine Transform Cepstrum (WDCTC)

 Warped Discrete Cosine Transform Cepstrum: A New Feature For Speech Processing Eusipco 2005
 Abstract: In this paper, we propose a new feature for speech recognition and speaker identification applications. The new feature is termed the warped discrete cosine transform cepstrum (WDCTC). The feature is obtained by replacing the discrete cosine transform (DCT) with the warped discrete cosine transform (WDCT, [4]) in the discrete cosine transform cepstrum (DCTC, [2]). The WDCT is implemented as a cascade of the DCT and IIR all-pass filters. We incorporate a nonlinear frequency scale in the DCTC which closely follows the Bark scale. This is accomplished by setting the all-pass filter parameter using an expression given by Smith and Abel [5]. To evaluate the efficacy of the new feature in speech processing applications, we compare the performance of the WDCTC with MFCC (mel-frequency cepstral coefficients) in vowel recognition and speaker identification tasks. Our experimental results show that the WDCTC consistently performs better than MFCC in both noisy and noiseless conditions.
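The Smith and Abel expression referenced above has a known closed form: it gives the first-order all-pass coefficient that makes the warped frequency axis approximate the psychoacoustic Bark scale at a given sampling frequency.

```python
import math

def bark_allpass_coefficient(fs: float) -> float:
    """Smith & Abel closed-form all-pass coefficient for Bark-scale
    frequency warping at sampling frequency fs (Hz):
    rho = 1.0674 * sqrt((2/pi) * atan(0.06583 * fs_kHz)) - 0.1916"""
    fs_khz = fs / 1000.0
    return 1.0674 * math.sqrt((2.0 / math.pi)
                              * math.atan(0.06583 * fs_khz)) - 0.1916
```

At 8 kHz this gives a coefficient of roughly 0.40, increasing with sampling frequency as a stronger warp is needed to match the Bark scale.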

 Theoretical Complex Cepstrum of DCT and WDCT Filters  IEEE Signal Processing Letters
 Abstract: In this letter, we derive the theoretical complex cepstrum (TCC) of the discrete cosine transform (DCT) and warped DCT (WDCT) filters. Using these derivations, we intend to develop an analytic model of the warped discrete cosine transform cepstrum (WDCTC), which was recently introduced as a speech processing feature. In our derivation, we start with the filter bank structure for the DCT, where each basis is represented by a finite impulse response (FIR) filter. The WDCT filter bank is obtained by substituting z −1 in the DCT filter bank with a first-order all-pass filter. Using the filter bank structures, we first derive the transfer functions for the DCT and WDCT, and subsequently the TCC for each filter is computed. We analyze the DCT and WDCT filter transfer functions and the TCC by illustrating the corresponding pole-zero maps and cepstral sequences. Moreover, we also use the derived TCC expressions to compute the cepstral sequence for a synthetic vowel /aa/, where observations on the theoretical cepstrum agree well with our practical findings.

 Statistical Properties of the Warped Discrete Cosine Transform Cepstrum Compared with MFCC  Interspeech 2005
 Abstract: In this paper, we continue our investigation of the warped discrete cosine transform cepstrum (WDCTC), which was earlier introduced as a new speech processing feature [1]. Here, we study the statistical properties of the WDCTC and compare them with the mel-frequency cepstral coefficients (MFCC). We report some interesting properties of the WDCTC when compared to the MFCC: its statistical distribution is more Gaussian-like with lower variance, it obtains better vowel cluster separability, it forms tighter vowel clusters, and it generates better codebooks. Further, we employ the WDCTC and MFCC features in a 5-vowel recognition task using Vector Quantization (VQ) and 1-Nearest Neighbour (1-NN) classifiers. In our experiments, the WDCTC consistently outperforms the MFCC.

 Performance Analysis Of The Warped Discrete Cosine Transform Cepstrum with MFCC using different classifiers  MLSP 2005
 Abstract: In this paper, we continue our investigation of the warped discrete cosine transform cepstrum (WDCTC), which was earlier introduced as a new speech processing feature [1]. Here, we study the statistical properties of the WDCTC and compare them with the mel-frequency cepstral coefficients (MFCC). We report some interesting properties of the WDCTC when compared to the MFCC: its statistical distribution is more Gaussian-like with lower variance, it obtains better vowel cluster separability, it forms tighter vowel clusters, and it generates better codebooks. Further, we employ the WDCTC and MFCC features in a 5-vowel recognition task using Vector Quantization (VQ) and 1-Nearest Neighbour (1-NN) classifiers. In our experiments, the WDCTC consistently outperforms the MFCC.

 Complex Cepstrum of Discrete Hartley And Warped Discrete Hartley Filters  DSP Workshop


Center for Robust Speech Systems, University of Texas at Dallas (CRSS)

Human Language Technology, IBM T.J. Watson Research Center