General Automatic Speech Recognition
Abstract: Automatic postal sorting systems have traditionally relied on optical character recognition (OCR) technology. While OCR systems perform well for flat mail items such as envelopes, their performance deteriorates for parcels. In this study, we propose a new multimodal solution for parcel sorting which combines automatic speech recognition (ASR) technology with OCR in order to deliver better performance. Our multimodal approach is based on estimating OCR output confidence, and then optionally using ASR system output when OCR results show low confidence. In particular, we propose a Levenshtein edit distance (LED) based measure to compute OCR confidence. Based on this confidence measure, a dynamic fusion strategy is developed that forms its final decision from (i) the OCR output alone, (ii) the ASR output alone, or (iii) a combination of the ASR and OCR outputs. The proposed system is evaluated on speech and image data collected in real-world conditions. Our experiments show that the proposed multimodal solution achieves an overall zip code recognition rate of 90.2%, a substantial improvement over the ASR-alone (81%) and OCR-alone (80.6%) systems. This advancement represents an important contribution that leverages OCR and ASR technologies to improve address recognition for parcels.
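A minimal sketch of the LED-based confidence measure and dynamic fusion strategy described above, assuming a lexicon of valid zip codes; the thresholds, function names, and middle-band combination rule are illustrative assumptions, not the paper's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def ocr_confidence(ocr_hyp: str, zip_lexicon: list[str]) -> float:
    """Map the distance to the closest valid zip code onto [0, 1]."""
    d = min(levenshtein(ocr_hyp, z) for z in zip_lexicon)
    return max(0.0, 1.0 - d / max(len(ocr_hyp), 1))

def fuse(ocr_hyp: str, asr_hyp: str, zip_lexicon: list[str],
         low: float = 0.4, high: float = 0.8) -> str:
    """(i) OCR alone, (ii) ASR alone, or (iii) a combined decision."""
    c = ocr_confidence(ocr_hyp, zip_lexicon)
    if c >= high:
        return ocr_hyp   # high confidence: trust OCR
    if c < low:
        return asr_hyp   # low confidence: fall back to ASR
    # middle band: pick the lexicon entry closest to both hypotheses
    return min(zip_lexicon,
               key=lambda z: levenshtein(ocr_hyp, z) + levenshtein(asr_hyp, z))
```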
|
Automatic Multimedia Content Analysis and Detection
Abstract: The problem of automatic excitement detection in baseball videos is considered and applied to highlight generation. This paper focuses on detecting exciting events in video using complementary information from the audio and video domains. First, a new measure of non-stationarity, which is extremely effective in separating background from speech, is proposed. This new feature is employed in an unsupervised GMM-based segmentation algorithm that identifies the sports commentators' speech within the crowd background. Thereafter, the “level-of-excitement” is measured using features such as pitch, F1–F3 center frequencies, and spectral center of gravity extracted from the commentators' speech. Our experiments using actual baseball videos show that these features are well correlated with human assessment of excitability. Furthermore, slow-motion replays and baseball pitching scenes are also detected in the video to estimate scene end-points. Finally, audio/video information is fused to rank-order scenes by “excitability” in order to generate highlights of user-defined time lengths. The techniques described in this paper are generic and applicable to a variety of topics and video/acoustic domains.
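As an illustration of one of the acoustic features named above, here is a minimal sketch of the spectral center of gravity for a single speech frame; the windowing choice is an assumption, not necessarily what the paper used.

```python
import numpy as np

def spectral_center_of_gravity(frame: np.ndarray, fs: float) -> float:
    """Magnitude-weighted mean frequency (Hz) of one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
```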
|
Phonological Features Based Speech Modeling: Applications in Speech Recognition, Language Identification, and Accent Assessment
Abstract: The problem of accent analysis and modeling has been considered from a variety of domains, including linguistic structure, statistical analysis of speech production features, and HMM/GMM (Hidden Markov Model / Gaussian Mixture Model) classification. These studies, however, fail to connect speech production, viewed from a temporal perspective, to a final classification strategy. Here, a novel accent analysis system and methodology which exploits the power of phonological features (PFs) is presented. The proposed system exploits the knowledge of articulation embedded in phonology by building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word by a speaker on a scale from native-like (+1) to non-native-like (−1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The experimental results demonstrate the capability of the proposed system to perform quantitative as well as qualitative analysis of foreign accents. The work developed in this study can be easily expanded into language learning systems, and has potential impact in the areas of speaker recognition and ASR (automatic speech recognition).
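A minimal sketch of the accentedness idea: first-order Markov models of PF state-transitions for native and non-native speech score a PF sequence, and the length-normalized log-likelihood ratio is squashed onto [−1, +1]. The model interface, probability flooring, and tanh squashing are assumptions for illustration, not the paper's exact formulation.

```python
import math
from collections import defaultdict

class MarkovModel:
    """First-order Markov model over discrete PF states."""
    def __init__(self, sequences):
        counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            for s, t in zip(seq, seq[1:]):
                counts[s][t] += 1
        self.logp = {s: {t: math.log(c / sum(nxt.values()))
                         for t, c in nxt.items()}
                     for s, nxt in counts.items()}

    def log_prob(self, seq, floor=math.log(1e-6)):
        return sum(self.logp.get(s, {}).get(t, floor)
                   for s, t in zip(seq, seq[1:]))

def accentedness(pf_seq, native_mm, nonnative_mm) -> float:
    """+1 ~ native-like articulation, -1 ~ non-native-like (assumed map)."""
    llr = native_mm.log_prob(pf_seq) - nonnative_mm.log_prob(pf_seq)
    return math.tanh(llr / max(len(pf_seq) - 1, 1))
```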
|
Abstract: In this study, a new algorithm for automatic accent evaluation of native and non-native speakers is presented. The proposed system consists of two main steps: alignment and scoring. In the alignment step, the speech utterance is processed using a Weighted Finite State Transducer (WFST) based technique to automatically estimate the pronunciation errors. Subsequently, in the scoring step, a Maximum Entropy (ME) based technique is employed to assign perceptually motivated scores to pronunciation errors. The combination of the two steps yields an approach that measures accent based on the perceptual impact of pronunciation errors, and is termed the Perceptual WFST (P-WFST). The P-WFST is evaluated on American English (AE) spoken by native and non-native (native speakers of Mandarin Chinese) speakers from the CUAccent corpus. The proposed P-WFST algorithm shows higher and more consistent correlation with human-evaluated accent scores when compared to the Goodness Of Pronunciation (GOP) algorithm.
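A toy sketch of the scoring step only: errors found by the WFST alignment are weighted by perceptually motivated weights of the kind an ME model might learn. The error labels and weight values below are invented for illustration.

```python
# Assumed error labels and ME-style perceptual weights (illustrative).
PERCEPTUAL_WEIGHTS = {"sub:vowel": 0.8, "sub:consonant": 0.5,
                      "del": 0.9, "ins": 0.4}

def pwfst_score(errors: list[str]) -> float:
    """Average perceptual weight of the aligned pronunciation errors;
    higher means a stronger perceived accent."""
    if not errors:
        return 0.0
    return sum(PERCEPTUAL_WEIGHTS.get(e, 0.0) for e in errors) / len(errors)
```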
|
Abstract: This study presents new advancements in our articulatory-based language identification (LID) system. Our LID system automatically identifies language features (LFs) from a phonological features (PFs) based representation of speech. While our baseline system uses a static PF representation for extracting LFs, the new system is based on a dynamic PF representation for feature extraction. Interestingly, the new LFs outperform our baseline system by 11.8% absolute on a difficult 5-way classification task of South Indian languages. Additionally, we incorporate pitch- and energy-based features in our new system to leverage prosody in classification. In particular, we employ a Legendre polynomial based contour estimation to capture shape parameters which are used in classification. The fusion of PF and prosody-based LFs further improves the overall classification result by 16.5% absolute over the baseline system. Finally, the proposed articulatory language ID system is combined with a PPRLM (parallel phone recognition language model) system to obtain an overall classification accuracy of 86.6%.
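A minimal sketch of the Legendre polynomial contour estimation mentioned above: a pitch or energy contour is fit on [−1, 1] and the coefficients serve as shape features. The polynomial order is an assumed value.

```python
import numpy as np

def legendre_shape_params(contour: np.ndarray, order: int = 4) -> np.ndarray:
    """Fit the contour with Legendre polynomials; the low-order
    coefficients capture its mean, slope, and curvature."""
    x = np.linspace(-1.0, 1.0, len(contour))
    return np.polynomial.legendre.legfit(x, contour, deg=order)
```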
|
Abstract: In this study, a new keyword spotting system (KWS) that utilizes phone confusion networks (PCNs) is presented. The new system exploits the compactness and accuracy of phone confusion networks to deliver fast and accurate results. Special design considerations are provided within the new algorithm to account for phone-recognizer-induced insertion and deletion errors. Furthermore, this study proposes a new threshold estimation technique that uses the keyword's constituent phones and phonological features (PFs) for threshold computation. The new technique is able to deliver thresholds that improve the overall F-score for keyword detection. The final integrated system achieves a better balance between precision and recall.
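A toy sketch of a keyword-dependent threshold of the kind described above, where more confusable constituent phones lower the acceptance threshold; the confusability table and the formula are invented for illustration.

```python
# Assumed per-phone confusability priors (illustrative values only).
PHONE_CONFUSABILITY = {"aa": 0.15, "t": 0.35, "s": 0.20}

def keyword_threshold(phones: list[str], base: float = 0.5) -> float:
    """Keywords built from easily confused phones get a lower detection
    threshold so that recall does not collapse."""
    penalty = sum(PHONE_CONFUSABILITY.get(p, 0.25) for p in phones) / len(phones)
    return base * (1.0 - penalty)
```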
|
Abstract: In this paper, a language analysis and classification system that leverages knowledge of speech production is proposed. The proposed scheme automatically extracts key production traits (or “hot-spots”) that are strongly tied to the underlying language structure. In particular, the speech utterance is first parsed into consonant and vowel clusters. Subsequently, the production traits for each cluster are represented by the corresponding temporal evolution of speech articulatory states. It is hypothesized that a selection of these production traits is strongly tied to the underlying language, and can be exploited for language ID. The new scheme is evaluated on our South Indian Languages (SInL) corpus, which consists of 5 closely related languages spoken in India, namely Kannada, Tamil, Telugu, Malayalam, and Marathi. Good accuracy is achieved, with a rate of 65% obtained on a difficult 5-way classification task with about 4 seconds of training and test speech data per utterance. Furthermore, the proposed scheme is also able to automatically identify key production traits of each language (e.g., dominant vowels, stop-consonants, fricatives, etc.).
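A minimal sketch of the first step, parsing a phone sequence into consonant/vowel clusters; the phone inventory is an illustrative assumption.

```python
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o"}  # assumed inventory

def cv_clusters(phones: list[str]) -> list[tuple[str, list[str]]]:
    """Group consecutive phones into labeled consonant/vowel clusters."""
    clusters: list[tuple[str, list[str]]] = []
    for p in phones:
        label = "V" if p in VOWELS else "C"
        if clusters and clusters[-1][0] == label:
            clusters[-1][1].append(p)
        else:
            clusters.append((label, [p]))
    return clusters

# cv_clusters(["k", "a", "nn", "a"]) ->
# [("C", ["k"]), ("V", ["a"]), ("C", ["nn"]), ("V", ["a"])]
```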
|
Abstract: This study presents a novel methodology for speech recognition based on phonological features (PFs), which leverages the relationship between speech phonology and phonetics. In particular, the proposed scheme estimates the likelihood of observing speech phonology given an associative lexicon. In this manner, the scheme is capable of choosing the most likely hypothesis (word candidate) among a group of competing alternative hypotheses. The framework employs the Maximum Entropy (ME) model to learn the relationship between phonetics and phonology. Subsequently, we extend the ME model to an ME-HMM (maximum entropy-hidden Markov model), which captures the speech production and linguistic relationship between phonology and words. The proposed ME-HMM model is applied to the task of re-processing N-best lists with good results.
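A minimal sketch of the N-best re-processing step: each hypothesis is re-ranked by a weighted sum of its original ASR score and the ME-HMM phonology likelihood. The log_prob interface and the interpolation weight are assumptions for illustration.

```python
def rescore_nbest(nbest, me_hmm, weight: float = 0.3) -> str:
    """nbest: list of (hypothesis, asr_log_score) pairs; returns the
    hypothesis with the best interpolated score."""
    best = max(nbest,
               key=lambda h: (1 - weight) * h[1] + weight * me_hmm.log_prob(h[0]))
    return best[0]
```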
|
Abstract: This paper examines the impact of physical stress on speech. The methodology adopted here identifies the inter-utterance breathing (IUB) pattern as a key intermediate variable in studying the relationship between physical stress and speech. Additionally, this work connects high-level prosodic changes in the speech signal (energy, pitch, and duration) to the corresponding breathing patterns. Our results demonstrate the diversity of breathing and articulation patterns that speakers employ in order to compensate for the increased body oxygen demand. Here, we identify the normalized breathing energy rate (proportional to minute volume), acquired from a conventional as well as a physiological microphone, as a reliable and accurate estimator of physical stress. Additionally, we show that the prosodic patterns (pitch, energy, and duration) of high-level speech structure show good correlation with the normalized breathing energy rate. In this manner, the study establishes the interconnection between temporal speech structure and physical stress through breathing.
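A minimal sketch of the normalized breathing energy rate: energy per second of the breath signal, divided by the speaker's resting value. The band-limiting and normalization details are assumptions, not the paper's exact procedure.

```python
import numpy as np

def normalized_breathing_energy_rate(breath: np.ndarray, fs: float,
                                     rest_rate: float) -> float:
    """Energy per second of the breath-band signal, normalized by the
    speaker's resting (no physical stress) energy rate."""
    energy_rate = np.sum(breath.astype(float) ** 2) / (len(breath) / fs)
    return energy_rate / rest_rate
```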
|
Abstract: In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing is not capable of accurately reflecting the dynamics of continuous speech. On the other hand, the proposed VFR scheme adapts the temporal representation of the speech signal by tying the framing strategy with the detected phone class sequence. The phone classes are detected and segmented by using appropriately trained phonological features (PFs). In this manner, the proposed scheme is capable of tracking the evolution of speech due to the underlying phonetic content, and exploiting the non-uniform information flow-rate of speech by using a variable framing strategy. The new VFR scheme is applied to automatic speech recognition of TIMIT and NTIMIT corpora, where it is compared to a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results with relative reductions of 24% and 8% in WER (word error rate) for TIMIT and NTIMIT tasks, respectively.
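A toy sketch of the variable framing strategy: each detected broad phone class gets its own frame hop, short for transient classes and long for steady ones. The hop values and class names are invented for illustration.

```python
# Assumed class-dependent frame hops in milliseconds (illustrative).
HOP_MS = {"stop": 5, "fricative": 8, "vowel": 15, "silence": 25}

def vfr_frame_starts(segments, default_hop: int = 10) -> list[int]:
    """segments: (phone_class, start_ms, end_ms) triples from the PF-based
    segmenter; returns frame start times with a class-dependent hop."""
    starts: list[int] = []
    for cls, beg, end in segments:
        hop = HOP_MS.get(cls, default_hop)
        starts.extend(range(int(beg), int(end), hop))
    return starts
```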
|
Competitive Neyman-Pearson Hypothesis Testing + Voice Activity Detection
Abstract: In this paper, the Bayesian, Neyman-Pearson (NP), and competitive Neyman-Pearson (CNP) detection approaches are analyzed using a perceptually modified Ephraim-Malah (PEM) model, based on which a few practical voice activity detectors are developed. Voice activity detection is treated as a composite hypothesis testing problem with a free parameter formed by the prior signal-to-noise ratio (SNR). It is revealed that a high prior SNR is more likely to be associated with the ‘speech hypothesis’ than the ‘pause hypothesis’ and vice versa, and the CNP approach exploits this relation by setting a variable upper bound on the probability of false alarm. The proposed VADs are tested under different noises and various SNRs, using speech samples from the SWITCHBOARD database, and are compared with the adaptive multi-rate (AMR) VADs. Our results show that the CNP VAD outperforms the NP and Bayesian VADs, and compares well to the AMR VADs. The CNP VAD is also computationally inexpensive, making it a good candidate for applications in communication systems.
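A minimal sketch of the CNP idea described above: the upper bound on false-alarm probability, and hence the likelihood-ratio threshold, varies with the estimated prior SNR instead of being fixed. The bound schedule and the bound-to-threshold map below are illustrative assumptions.

```python
import numpy as np

def cnp_threshold(prior_snr_db: float) -> float:
    """High prior SNR favors the speech hypothesis, so a looser
    false-alarm bound (lower threshold) is allowed there."""
    alpha = np.interp(prior_snr_db, [-5.0, 20.0], [0.01, 0.10])
    return float(-np.log(alpha))  # simple monotone bound-to-threshold map

def cnp_vad_decision(log_likelihood_ratio: float, prior_snr_db: float) -> bool:
    return log_likelihood_ratio > cnp_threshold(prior_snr_db)
```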
|
Abstract: Traditional voice activity detectors (VADs) tend to be deaf to the acoustical background noise, as they (i) utilize a single operating point for all SNRs (signal-to-noise ratios) and noise types, and (ii) attempt to learn the background noise model online from a finite data length. In this paper, we address the aforementioned issues by designing an environmentally aware (EA) VAD. The EA VAD scheme builds prior offline knowledge of commonly encountered acoustical backgrounds, and combines the recently proposed competitive Neyman-Pearson (CNP) VAD with an SVM (support vector machine) based noise classifier. In operation, the EA VAD obtains accurate noise models of the acoustical background by employing the noise classifier and its prior knowledge of the noise type, and thereafter uses this information to set the best operating point and initialization parameters for the CNP VAD. The superior performance of the EA VAD scheme over the standard AMR (adaptive multi-rate) VADs at low SNR is confirmed in a simulation study, where speech and noise data were drawn from the SWITCHBOARD and NOISEX databases. We report an absolute improvement of 10–15% in detection rates over the AMR VADs at low SNR for different noise types.
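A minimal sketch of the EA VAD control flow: an offline-trained noise classifier selects the noise type, which in turn selects the CNP operating point. The operating-point table and the interfaces (e.g., an sklearn-style classifier) are assumptions for illustration.

```python
# Assumed noise-type -> false-alarm-bound table (illustrative values).
OPERATING_POINTS = {"babble": 0.02, "car": 0.08, "white": 0.05}

def ea_vad_frame(features, noise_clf, cnp_vad) -> bool:
    """Classify the background, then run the CNP VAD at the matching
    operating point (interfaces assumed for illustration)."""
    noise_type = noise_clf.predict([features])[0]
    alpha = OPERATING_POINTS.get(noise_type, 0.05)
    return cnp_vad(features, false_alarm_bound=alpha)
```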
|
Abstract: The problem of composite hypothesis testing where the probability law governing the generation of the free parameter is not explicitly known is considered. It is shown that, unlike the Neyman-Pearson (NP) approach, the competitive NP (CNP) approach models incomplete prior information about the source into the detector design by setting a variable upper bound on the probability-of-false-alarm term. Further, the CNP and NP approaches are employed to develop CNP and NP detectors for voice activity detection (VAD), where the prior SNR is shown to be the free parameter of the composite hypothesis. We test the CNP and NP detectors using speech samples from the SWITCHBOARD database which are suitably corrupted using different noises and various SNRs. Our simulation results show that the CNP detector outperforms its NP counterpart and is comparable to the adaptive multi-rate (AMR) VADs.
|
Abstract: In this paper, we develop a contextual voice activity detection (VAD) scheme which combines both contextual and frame-specific information to improve detection. Unlike many VAD algorithms which assume that the cues to activity lie within the frame alone, our scheme seeks information for activity in the current as well as the neighboring frames. The new approach provides good robustness at low SNR, when the speech frame is corrupted and an alternate reliable source of activity information is necessary. Further, we present a simple noise suppression scheme to enhance the VAD performance at low SNR. The noise suppressor provides a spectrally reshaped signal to the VAD. Finally, we combine the contextual VAD and the noise suppression scheme with a basic detector to form a comprehensive VAD. The proposed comprehensive VAD system is tested on speech samples from the SWITCHBOARD database, with various noises added to the speech signals under different SNRs. Experimental results show that the proposed VAD outperforms the standard ETSI AMR VAD-1 algorithm.
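A minimal sketch of the contextual decision: per-frame activity scores are averaged over neighboring frames before thresholding; the window size and uniform weights are assumptions.

```python
import numpy as np

def contextual_vad(frame_scores: np.ndarray, context: int = 2,
                   threshold: float = 0.5) -> np.ndarray:
    """Smooth frame-level activity scores over 2*context+1 neighbors,
    then threshold to get the per-frame activity decision."""
    kernel = np.ones(2 * context + 1) / (2 * context + 1)
    return np.convolve(frame_scores, kernel, mode="same") > threshold
```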
|
Abstract: We discuss techniques for Voice Activity Detection (VAD) for Voice over Internet Protocol (VoIP). VAD aids in reducing the bandwidth requirement of a voice session, thereby increasing bandwidth efficiency. Such a scheme would be implemented in the application layer, so the VAD is independent of the lower layers in the network stack [1]. In this paper, we compare the quality of speech, the level of compression, and the computational complexity of three time-domain and three frequency-domain VAD algorithms. The time-domain algorithms are direct to implement and computationally simple; however, better speech quality is obtained with the frequency-domain algorithms. A comparison of the relative merits and demerits, along with the subjective quality of speech after removal of silence periods, is presented for all the algorithms. A quantitative measurement of speech quality for the different algorithms is also presented.
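As an example of the computationally simple time-domain class compared above, here is a minimal short-time-energy VAD; the frame length, noise-floor estimate, and margin are illustrative assumptions, not one of the paper's specific algorithms.

```python
import numpy as np

def energy_vad(x: np.ndarray, fs: int, frame_ms: int = 20,
               margin_db: float = 6.0) -> np.ndarray:
    """Per-frame speech/silence decision from short-time energy against
    an assumed noise floor (10th-percentile energy) plus a margin."""
    n = int(fs * frame_ms / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n)
    e_db = 10 * np.log10(np.mean(frames.astype(float) ** 2, axis=1) + 1e-12)
    return e_db > (np.percentile(e_db, 10) + margin_db)
```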
|
Speech Feature: Warped Discrete Cosine Transform Cepstrum (WDCTC)
Abstract: In this paper, we propose a new feature for speech recognition and speaker identification applications, termed the warped discrete cosine transform cepstrum (WDCTC). The feature is obtained by replacing the discrete cosine transform (DCT) with the warped discrete cosine transform (WDCT, [4]) in the discrete cosine transform cepstrum (DCTC [2]). The WDCT is implemented as a cascade of the DCT and IIR all-pass filters. We thereby incorporate a nonlinear frequency scale in the DCTC which closely follows the Bark scale; this is accomplished by setting the all-pass filter parameter using an analytic expression given by Smith and Abel [5]. To evaluate the efficacy of the new feature in speech processing applications, the performance of the WDCTC is compared to mel-frequency cepstral coefficients (MFCC) in vowel recognition and speaker identification experiments. Our experimental results show that the WDCTC consistently outperforms the MFCC in both noisy and noiseless conditions.
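For reference, a sketch of the closed-form Bark-warping all-pass coefficient from Smith and Abel [5], as used to set the WDCT warping parameter; the helper name is ours and the quoted output value is approximate.

```python
import math

def bark_allpass_coefficient(fs: float) -> float:
    """Smith-Abel closed-form approximation of the all-pass parameter
    that maps the unwarped frequency axis to the Bark scale (fs in Hz)."""
    return 1.0674 * math.sqrt((2.0 / math.pi)
                              * math.atan(0.06583 * fs / 1000.0)) - 0.1916

# e.g., bark_allpass_coefficient(16000) is roughly 0.58
```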
|
Abstract: In this letter, we derive the theoretical complex cepstrum (TCC) of the discrete cosine transform (DCT) and warped DCT (WDCT) filters. Using these derivations, we develop an analytic model of the warped discrete cosine transform cepstrum (WDCTC), which was recently introduced as a speech processing feature. In our derivation, we start with the filter bank structure for the DCT, where each basis is represented by a finite impulse response (FIR) filter. The WDCT filter bank is obtained by substituting z^-1 in the DCT filter bank with a first-order all-pass filter. Using the filter bank structures, we first derive the transfer functions for the DCT and WDCT, and subsequently the TCC for each filter is computed. We analyze the DCT and WDCT filter transfer functions and the TCC by illustrating the corresponding pole-zero maps and cepstral sequences. Moreover, we use the derived TCC expressions to compute the cepstral sequence for a synthetic vowel /aa/, where the observations on the theoretical cepstrum corroborate well with our practical findings.
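A minimal numerical companion to the derivation: the first-order all-pass substitution evaluated on the unit circle, whose phase gives the warped frequency axis. The sign conventions here are one common choice, not necessarily the letter's.

```python
import numpy as np

def warped_frequency(omega: np.ndarray, a: float) -> np.ndarray:
    """Replace z^-1 by the all-pass A(z) = (z^-1 - a) / (1 - a*z^-1);
    the negated phase of A on the unit circle is the warped frequency."""
    z_inv = np.exp(-1j * omega)
    return -np.angle((z_inv - a) / (1.0 - a * z_inv))
```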
|
Abstract: In this paper, we continue our investigation of the warped discrete cosine transform cepstrum (WDCTC), which was earlier introduced as a new speech processing feature [1]. Here, we study the statistical properties of the WDCTC and compare them with the mel-frequency cepstral coefficients (MFCC). We report some interesting properties of the WDCTC when compared to the MFCC: its statistical distribution is more Gaussian-like with lower variance, it obtains better vowel cluster separability, it forms tighter vowel clusters, and it generates better codebooks. Further, we employ the WDCTC and MFCC features in a 5-vowel recognition task using Vector Quantization (VQ) and 1-Nearest Neighbour (1-NN) classifiers. In our experiments, the WDCTC consistently outperforms the MFCC.
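An illustrative Fisher-style separability measure of the kind used to compare WDCTC and MFCC vowel clusters; the exact measure in the paper may differ.

```python
import numpy as np

def fisher_separability(class_feats: list[np.ndarray]) -> float:
    """Ratio of between-class scatter to mean within-class variance for
    per-class feature matrices (rows = frames, cols = coefficients)."""
    means = np.array([f.mean(axis=0) for f in class_feats])
    grand = means.mean(axis=0)
    between = float(np.mean(np.sum((means - grand) ** 2, axis=1)))
    within = float(np.mean([f.var(axis=0).sum() for f in class_feats]))
    return between / (within + 1e-12)
```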
|
Complex Cepstrum of Discrete Hartley and Warped Discrete Hartley Filters |
DSP Workshop |
Affiliation
Current: Center for Robust Speech Systems, University of Texas at Dallas (CRSS)
Previous: Human Language Technology, IBM T.J. Watson Research Center
|