Siri Speech Team at Apple, Cupertino, USA
Acoustic Modeling for Robust Automatic Speech Recognition
EDEE, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
IDIAP Research Institute, Martigny, Switzerland
Supervisor: Prof. Herve Bourlard
Project: Parsimonious Hierarchical Automatic Speech Recognition (Funding source: SNSF PHASER Project)
Code (Kaldi): Eigenposteriors (GitHub)
Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Sparse Modeling of Deep Neural Network Posterior Probabilities
Formulating a novel compressive sensing (CS) perspective on automatic speech recognition (ASR). Exploiting low-rank structure in deep neural network (DNN) posterior probabilities using sparse coding techniques. Modeling the acoustic space of DNN outputs as a union of low-dimensional subspaces, which motivates the application of dictionary learning to ASR.
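As a rough illustration of what sparse coding of a posterior vector over a dictionary means, here is a minimal sketch using plain matching pursuit. The algorithm choice, the function name `matching_pursuit`, and the toy dictionary are assumptions made for illustration only; the project's actual sparse coding and dictionary learning procedures are not described here.

```python
def matching_pursuit(x, atoms, n_nonzero):
    """Greedy sparse code of vector x over unit-norm dictionary atoms.

    Illustrative only: `atoms` is a hypothetical toy dictionary, not a
    dictionary learned from DNN posteriors.
    """
    residual = list(x)
    code = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # Correlate the current residual with every atom.
        corrs = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        # Select the atom with the largest absolute correlation.
        best = max(range(len(atoms)), key=lambda j: abs(corrs[j]))
        code[best] += corrs[best]
        # Remove the selected atom's contribution from the residual.
        residual = [r - corrs[best] * a for r, a in zip(residual, atoms[best])]
    return code, residual


s = 2 ** -0.5
# Toy dictionary: two "subspace" atoms and one single-class atom (all unit norm).
atoms = [[s, s, 0.0, 0.0], [0.0, 0.0, s, s], [1.0, 0.0, 0.0, 0.0]]
posterior = [0.4, 0.4, 0.1, 0.1]  # toy 4-class posterior vector
code, residual = matching_pursuit(posterior, atoms, n_nonzero=2)
```

In this toy case two atoms are enough to reconstruct the posterior, so the residual is numerically zero and the third coefficient stays at zero.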
Relevant Publications:
IDIAP Research Institute, Martigny, Switzerland
Supervisor: Prof. Herve Bourlard
Handling Overlapping Speech During Speaker Diarization
In speaker diarization systems, the presence of overlapping speech degrades diarization performance at two steps. The first issue arises when overlapping speech corrupts the quality of the pure speaker models computed from the audio. The second arises when the system tries to label overlapping speech segments with only a single speaker. We approach this problem by modelling overlapping speech with a vector Taylor series (VTS) approximation: overlapping speech is modelled as the corruption of one pure speaker model by another speaker model. A similar approach can be used to model the corruption of a pure speaker model by specific sound classes such as music, laughter, or clapping. The approach was evaluated on the AMI Meeting Corpus.
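The idea of treating overlap as one speaker model corrupted by another can be sketched with a first-order VTS expansion. The sketch below assumes diagonal Gaussians in a log-spectral domain, independent speakers, and the mismatch function y = log(exp(a) + exp(b)); none of these details, nor the function name `vts_overlap_gaussian`, are stated in the project description, so treat them as illustrative assumptions.

```python
import math


def vts_overlap_gaussian(mu_a, var_a, mu_b, var_b):
    """First-order VTS approximation of y = log(exp(a) + exp(b)), per dimension.

    Assumes (hypothetically) diagonal Gaussian speaker models a and b in a
    log-spectral domain, with a and b independent.
    """
    mu_y, var_y = [], []
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        # Mismatch function evaluated at the expansion point (the two means).
        mu_y.append(ma + math.log1p(math.exp(mb - ma)))
        # Jacobian dy/da at the means; dy/db is 1 - g.
        g = 1.0 / (1.0 + math.exp(mb - ma))
        # Linearised variance of the overlapped signal.
        var_y.append(g * g * va + (1.0 - g) * (1.0 - g) * vb)
    return mu_y, var_y


# Two speakers with equal energy in a single log-spectral bin.
mu, var = vts_overlap_gaussian([0.0], [1.0], [0.0], [1.0])
```

When the second speaker is much quieter than the first in a given bin, the Jacobian approaches 1 and the overlap model reduces to the first speaker's model, which matches the intuition that a faint interferer barely changes the observation.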
Relevant Publications:
Indian Institute of Technology Kanpur
Supervisors: Prof. Harish Karnick (IIT Kanpur), Prof. Bhiksha Raj (LTI, CMU)
Automated Analysis of Indian Classical Music
Created a framework for robust automated analysis of Indian classical music using machine learning and signal processing techniques. Implemented scale-independent "raga" identification using chromagram patterns and "swara"-based features, with state-of-the-art results. The work demonstrates the approach on 8 ragas: Darbari, Khamaj, Malhar, Sohini, Bahar, Basant, Bhairavi and Yaman.
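One common way to make chroma features independent of the performer's chosen scale is to rotate each 12-bin chroma vector so that the tonic falls at index 0. The sketch below assumes the tonic bin is already known; how the tonic was found, and whether this exact normalisation was used in the project, is not stated here. The function name `tonic_normalize` is hypothetical.

```python
def tonic_normalize(chroma, tonic_bin):
    """Rotate a 12-bin chroma vector so the tonic lands at index 0.

    Illustrative assumption: the tonic bin is given; tonic estimation is
    outside the scope of this sketch.
    """
    if len(chroma) != 12:
        raise ValueError("expected a 12-bin chroma vector")
    k = tonic_bin % 12
    return chroma[k:] + chroma[:k]


# The same pitch-class pattern performed with two different tonics.
pattern = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
shifted = pattern[-2:] + pattern[:-2]  # same pattern, tonic moved up by 2 bins

same = tonic_normalize(pattern, 0) == tonic_normalize(shifted, 2)
```

After normalisation the two renditions yield identical vectors, so a downstream classifier sees the pattern relative to the tonic and not the absolute pitch.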
Relevant Publications:
Carnegie Mellon University, Pittsburgh, USA
Supervisor: Prof. Bhiksha Raj
Multimedia Content Analysis
Developed a GMM-HMM based system for detecting points of significant change in the structure of audio, and worked with Acoustic Unit Descriptors as features for event detection and context recognition. Evaluated the approach on the TRECVID Multimedia Event Detection 2011 corpus, which has 15 distinct audio events; achieved 99% accuracy on 2 events (music and wedding audio) and an average accuracy of 91% over 10 different audio events.
Relevant Publications:
Language Identification (LID) using Spectro-Temporal Patch Features
Defined a library of randomly selected spectro-temporal patterns from spoken examples of a language and derived features from the correlations of this library with spectrograms of the speech signal. Trained a discriminative classifier on these features to detect the presence of a language in a recording. Evaluated on the VoxForge (English, Hindi, German, Farsi) and CallFriend (English and Russian) corpora.
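A minimal sketch of how correlation features against a patch library could be computed: each patch is slid over the spectrogram and the best normalised correlation is kept as one feature. The use of cosine similarity, max pooling over positions, and the names `patch_feature` and `patch_features` are assumptions for illustration; the description above does not specify these details.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two flat vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def patch_feature(spec, patch):
    """Best normalised correlation of one patch over all spectrogram positions.

    `spec` and `patch` are lists of rows (frequency) by columns (time).
    """
    ph, pw = len(patch), len(patch[0])
    flat_patch = [v for row in patch for v in row]
    scores = []
    for f in range(len(spec) - ph + 1):
        for t in range(len(spec[0]) - pw + 1):
            window = [spec[f + i][t + j] for i in range(ph) for j in range(pw)]
            scores.append(_cosine(flat_patch, window))
    # Max pooling over positions (an assumed choice); 0.0 if the patch does not fit.
    return max(scores, default=0.0)


def patch_features(spec, library):
    """One feature per library patch, in library order."""
    return [patch_feature(spec, p) for p in library]


# Toy spectrogram that contains the first patch exactly.
spec = [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0]]
library = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
features = patch_features(spec, library)
```

The resulting feature vector has one entry per library patch and could then be passed to any discriminative classifier.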
Relevant Publications: