• Research Internship (Mar'17 - Jun'17)

Siri Speech Team at Apple, Cupertino, USA

Acoustic Modeling for Robust Automatic Speech Recognition

  • Research Assistant (Ph.D. Candidate) Aug'14 - Jul'18 (Expected)

EDEE, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

IDIAP Research Institute, Martigny, Switzerland

Supervisor: Prof. Herve Bourlard

Project: Parsimonious Hierarchical Automatic Speech Recognition (Funding source: SNSF PHASER Project)

Code (Kaldi): Eigenposteriors (GitHub)

Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models

    • Senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high-dimensional unstructured noise, whereas the informative components are structured and low-dimensional.
    • We exploit principal component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft targets for DNN acoustic modeling, which also enables training with untranscribed data.
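
The low-rank idea above can be illustrated with a minimal sketch. It assumes PCA is applied globally to log-posteriors and that the reconstruction is renormalized with a softmax; the function name `low_rank_soft_targets`, the `rank` parameter, and these design choices are illustrative assumptions, not the project's actual Kaldi implementation.

```python
import numpy as np

def low_rank_soft_targets(posteriors, rank, eps=1e-10):
    """Hypothetical helper: denoise DNN posteriors by a rank-limited PCA
    reconstruction in the log domain (an assumption of this sketch).

    posteriors: array of shape (num_frames, num_senones), rows sum to 1.
    Returns an array of the same shape whose rows are valid distributions.
    """
    log_post = np.log(posteriors + eps)
    mean = log_post.mean(axis=0, keepdims=True)
    centered = log_post - mean
    # Principal directions are the right singular vectors of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                          # (rank, num_senones)
    recon = centered @ basis.T @ basis + mean  # low-rank reconstruction
    # Softmax renormalization so each row is a probability vector again.
    recon = recon - recon.max(axis=1, keepdims=True)
    probs = np.exp(recon)
    return probs / probs.sum(axis=1, keepdims=True)
```

The returned rows could then serve as soft targets in place of one-hot labels; since no transcription is needed to compute them, the same step applies to untranscribed audio.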

Sparse Modeling of Deep Neural Network Posterior Probabilities

Working towards a novel compressive sensing (CS) perspective on automatic speech recognition. Exploiting low-rank structure in deep neural network (DNN) posterior probabilities using sparse coding techniques. Modeling the acoustic space of DNN outputs as a union of low-dimensional subspaces, which enables the application of dictionary learning to ASR.

Relevant Publications:

  • Intern, Speech & Audio Processing Group Jun'13 - May'14

IDIAP Research Institute, Martigny, Switzerland

Supervisor: Prof. Herve Bourlard

Handling Overlapping Speech During Speaker Diarization

In speaker diarization systems, the presence of overlapping speech degrades diarization performance at two steps. The first issue arises when overlapping speech corrupts the quality of the pure speaker models computed from the audio. The second issue arises when the system tries to label overlapping speech segments with only a single speaker. We approach this problem by modeling overlapping speech with a Vector Taylor Series (VTS) approximation: overlapping speech is modeled as the corruption of one pure speaker model by another speaker model. A similar approach can be used to model the corruption of a pure speaker model by specific sound classes such as music, laughter, and clapping. The approach was evaluated on the AMI Meeting Corpus.

Relevant Publications:

  • Master's Thesis, Computer Science & Engineering Jan'12 - May'13

Indian Institute of Technology Kanpur

Supervisors: Prof. Harish Karnick (IIT Kanpur), Prof. Bhiksha Raj (LTI, CMU)

Automated Analysis of Indian Classical Music

Created a framework for robust automated analysis of Indian classical music using machine learning and signal processing techniques. Implemented scale-independent "raga" identification using chromagram patterns and "swara" based features with state-of-the-art results. The work demonstrates the approach on eight ragas, namely Darbari, Khamaj, Malhar, Sohini, Bahar, Basant, Bhairavi and Yaman.
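
One common way to make a chroma feature scale-independent is to rotate it so that the tonic sits at a fixed index. The sketch below assumes the tonic can be approximated by the most prominent pitch class, which is a simplification; the function name `tonic_normalized_chroma` is hypothetical, and the thesis may have used a different tonic estimate.

```python
import numpy as np

def tonic_normalized_chroma(chroma):
    """Hypothetical helper: 12-bin pitch-class profile rotated to the tonic.

    chroma: array of shape (num_frames, 12) with non-negative energies.
    Assumption: the strongest average pitch class is taken as the tonic.
    """
    profile = chroma.mean(axis=0)
    profile = profile / profile.sum()   # normalize to a distribution
    tonic = int(np.argmax(profile))     # assumed tonic estimate
    return np.roll(profile, -tonic)     # tonic moves to index 0
```

Under this assumption, a performance transposed to a different scale (a circular shift of the 12 chroma bins) maps to the same feature vector.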

Relevant Publications:

  • Summer Intern, Language Technologies Institute May'11 - Jul'11

Carnegie Mellon University, Pittsburgh, USA

Supervisor: Prof. Bhiksha Raj

Multimedia Content Analysis

Developed a GMM-HMM based system for detecting points of significant change in the structure of audio, and worked with Acoustic Unit Descriptors as features for event detection and context recognition. Evaluated the approach on the TRECVID Multimedia Event Detection 2011 corpus, which has 15 distinct audio events; achieved 99% accuracy on 2 events (music and wedding audio) and an average accuracy of 91% over 10 different audio events.

Relevant Publications:

Language Identification (LID) using Spectro-Temporal Patch Features

Defined a randomly selected library of spectro-temporal patterns from spoken examples of a language and derived features from the correlations of this library with spectrograms obtained from the speech signal. Trained a discriminative classifier on these features to detect the presence of a language in a recording. Evaluation was done on the VoxForge (English, Hindi, German, Farsi) and CallFriend (English and Russian) corpora.
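
The patch-library features described above can be sketched as follows. This is an illustrative reading that assumes normalized cross-correlation and max-pooling over all spectrogram positions; the function names `sample_patches` and `patch_features`, the patch size, and the pooling choice are assumptions rather than details taken from the original work.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sample_patches(spectrogram, num_patches, patch_shape, rng):
    """Draw random spectro-temporal patches from a (freq, time) spectrogram."""
    h, w = patch_shape
    n_freq, n_time = spectrogram.shape
    rows = rng.integers(0, n_freq - h + 1, size=num_patches)
    cols = rng.integers(0, n_time - w + 1, size=num_patches)
    return np.stack([spectrogram[r:r + h, c:c + w] for r, c in zip(rows, cols)])

def _unit_rows(x, eps):
    """Zero-mean, unit-norm rows so dot products are correlation coefficients."""
    x = x - x.mean(axis=1, keepdims=True)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def patch_features(spectrogram, patches, eps=1e-8):
    """One feature per library patch: its best normalized correlation
    with any same-sized window of the spectrogram (max-pooling is assumed)."""
    h, w = patches.shape[1:]
    windows = sliding_window_view(spectrogram, (h, w)).reshape(-1, h * w)
    windows = _unit_rows(windows, eps)
    library = _unit_rows(patches.reshape(len(patches), -1), eps)
    return (windows @ library.T).max(axis=0)   # shape: (num_patches,)
```

The resulting fixed-length vector per recording could then be passed to any discriminative classifier, with one library drawn per target language.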

Relevant Publications: