Research

Siri Speech
Parsimonious Hierarchical Automatic Speech Recognition
Handling Overlapping Speech During Speaker Diarization
Automated Analysis of Indian Classical Music
Multimedia Content Analysis
Language Identification (LID) using Spectro-Temporal Patch Features

Research Internship (Mar'17 - Jun'17)

Siri Speech Team at Apple, Cupertino, USA

Acoustic Modeling for Robust Automatic Speech Recognition

Research Assistant (Ph.D. Candidate) Aug'14 - Jul'18 (Expected)

EDEE, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

IDIAP Research Institute, Martigny, Switzerland

Supervisor: Prof. Herve Bourlard

Project: Parsimonious Hierarchical Automatic Speech Recognition (Funding source: SNSF PHASER Project)

Codes (Kaldi): Eigenposteriors (github)

Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models

Senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high dimensional unstructured noise, whereas the informative components are structured and low-dimensional.
We exploit principal component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft targets for DNN acoustic modeling, which also enables training with untranscribed data.
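The low-rank part of this idea can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions, not necessarily the procedure used in the publications below: a single global PCA is applied to log-posteriors, the function name `low_rank_soft_targets` and the 95% variance threshold are hypothetical, and the input is assumed to be a frames-by-senones matrix of posteriors.

```python
import numpy as np

def low_rank_soft_targets(posteriors, var_kept=0.95, eps=1e-10):
    """Reconstruct posteriors from their top principal components.

    posteriors: (n_frames, n_senones) array, each row sums to 1.
    Returns the renormalized low-rank posteriors and the rank used.
    """
    # Work in the log domain (an assumption of this sketch).
    log_p = np.log(posteriors + eps)
    mean = log_p.mean(axis=0, keepdims=True)
    centered = log_p - mean

    # Rows of vt are the principal directions of the centered data.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(var_ratio, var_kept) + 1)

    # Project onto the top-k directions and map back.
    basis = vt[:k]
    recon = centered @ basis.T @ basis + mean

    # Back to the probability simplex: exponentiate and renormalize.
    soft = np.exp(recon)
    soft /= soft.sum(axis=1, keepdims=True)
    return soft, k
```

The returned rows can then serve as soft targets in place of one-hot labels when training a student network.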

Relevant Publications:
P. Dighe, A. Asaei and H. Bourlard “Exploiting Eigenposteriors for Semi-supervised Training of DNN Acoustic Models with Sequence Discrimination”, in Interspeech 2017, Stockholm, Sweden.
P. Dighe, A. Asaei and H. Bourlard “Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models”, in ICASSP 2017, New Orleans, USA.
P. Dighe, G. Luyet, A. Asaei and H. Bourlard “Exploiting Low-dimensional Structures to enhance DNN Based Acoustic Modeling in Speech Recognition”, in ICASSP 2016, Shanghai, China.

Sparse Modeling of Deep Neural Network Posterior Probabilities

Formulating a novel compressive sensing (CS) perspective on automatic speech recognition. Exploiting low-rank structure in deep neural network (DNN) posterior probabilities using sparse coding techniques. Modeling the acoustic space of DNN outputs as a union of low-dimensional subspaces, which motivates the use of dictionary learning for ASR.
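As a rough illustration of sparse coding over a dictionary, the sketch below recovers a sparse code for one vector with the iterative shrinkage-thresholding algorithm (ISTA). The solver choice, the function name `ista_sparse_code` and the parameter values are assumptions made for this example; the publications below may use a different formulation and solver.

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Approximately minimize 0.5*||x - D a||^2 + lam*||a||_1 with ISTA.

    x: (dim,) vector to encode, D: (dim, n_atoms) dictionary.
    """
    # Lipschitz constant of the gradient: squared largest singular value of D.
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the quadratic term
        z = a - grad / L                   # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a
```

In a union-of-subspaces view, the atoms with non-zero coefficients indicate which low-dimensional subspace the vector is assigned to.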

Relevant Publications:

P. Dighe, G. Luyet, A. Asaei and H. Bourlard “Exploiting Low-dimensional Structures to enhance DNN Based Acoustic Modeling in Speech Recognition”, in ICASSP 2016, Shanghai, China.
P. Dighe, A. Asaei and H. Bourlard “Sparse Modeling of Neural Network Posterior Probabilities for Exemplar-based Speech Recognition”, in Speech Communication: Special Issue on Advances in Sparse Modeling and Low-rank Modeling for Speech Processing, 2015.
D Ram, A Asaei, P Dighe and H Bourlard “Sparse Modeling of Posterior Exemplars for Keyword Detection”, in Interspeech, 2015.

Intern, Speech & Audio Processing Group Jun'13 - May'14

IDIAP Research Institute, Martigny, Switzerland

Supervisor: Prof. Herve Bourlard

Handling Overlapping Speech During Speaker Diarization

In speaker diarization systems, overlapping speech degrades performance at two steps. First, overlapping speech corrupts the pure speaker models estimated from the audio. Second, the system labels overlapping speech segments with only a single speaker. We approach this problem by modeling overlapping speech with a Vector Taylor Series (VTS) approximation, treating it as the corruption of one pure speaker model by another speaker model. A similar approach can be used to model the corruption of a pure speaker model by specific sound classes such as music, laughter or clapping. The approach was evaluated on the AMI Meeting Corpus.
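A first-order Vector Taylor Series combination of two Gaussians can be sketched as follows. This is a simplified illustration, not the system from the publications below: it assumes log-spectral features, diagonal covariances, and the common log-add mismatch function y = a + log(1 + exp(b - a)); the function name `vts_overlap_gaussian` is hypothetical.

```python
import numpy as np

def vts_overlap_gaussian(mu_a, var_a, mu_b, var_b):
    """Approximate the Gaussian of overlapped speech from two speaker Gaussians.

    All inputs are 1-D arrays (diagonal-covariance Gaussians in the
    log-spectral domain, which is an assumption of this sketch).
    """
    # Zeroth-order term: mismatch function evaluated at the means.
    mu_y = np.logaddexp(mu_a, mu_b)

    # Jacobian dy/da at the expansion point; dy/db equals 1 - g.
    g = np.exp(mu_a - mu_y)

    # First-order propagation of the variances.
    var_y = g ** 2 * var_a + (1.0 - g) ** 2 * var_b
    return mu_y, var_y
```

When speaker b is much quieter than speaker a, g approaches 1 and the result falls back to speaker a's Gaussian, which matches the intuition of one model being corrupted by the other.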

Relevant Publications:

P Dighe, M Ferras, H Bourlard “Detecting and labeling speakers on overlapping speech using vector Taylor series” in Interspeech 2014, Singapore.
P Dighe, M Ferras, H Bourlard “Modeling Overlapping Speech using Vector Taylor Series” in IEEE Speaker Odyssey Workshop 2014, Joensuu, Finland.

Master's Thesis, Computer Science & Engineering Jan'12 - May'13

Indian Institute of Technology Kanpur

Supervisor: Prof Harish Karnick (IIT Kanpur), Prof Bhiksha Raj (LTI, CMU)

Automated Analysis of Indian Classical Music

Created a framework for robust automated analysis of Indian classical music using machine learning and signal processing techniques. Implemented scale-independent "raga" identification using chromagram patterns and "swara" based features with state-of-the-art results. The work demonstrates the approach on 8 ragas, namely Darbari, Khamaj, Malhar, Sohini, Bahar, Basant, Bhairavi and Yaman.
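One simple way to make a chromagram-derived feature independent of the performer's scale is to rotate the 12-bin pitch-class histogram so that a reference bin always sits at index 0. The sketch below takes the strongest pitch class as that reference, which is a simplifying assumption of this example rather than a statement of the method in the publications below; the function name `tonic_normalized_histogram` is hypothetical.

```python
import numpy as np

def tonic_normalized_histogram(chroma):
    """Return a 12-bin pitch-class histogram rotated to a common reference.

    chroma: (12, n_frames) non-negative chromagram.
    The strongest pitch class is moved to index 0 (assumed reference).
    """
    hist = chroma.sum(axis=1)
    hist = hist / hist.sum()              # normalize to a distribution
    shift = int(np.argmax(hist))          # index of the strongest pitch class
    return np.roll(hist, -shift)          # circular shift removes transposition
```

Because a transposition of the performance is a circular shift of the chromagram rows, two renditions of the same material in different scales map to the same rotated histogram.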

Relevant Publications:

P Dighe, P Agrawal, H Karnick, S Thota, B Raj "Scale Independent Raga Identification Using Chromagram Patterns and Swara Based Features" in ICME 2013, San Jose, USA.
P Dighe, H Karnick, B Raj "Swara Histogram Based Structural Analysis and Identification of Indian Classical Ragas" in ISMIR 2013, Curitiba, Brazil.

Summer Intern, Language Technologies Institute May'11 - Jul'11

Carnegie Mellon University, Pittsburgh, USA

Supervisor: Prof. Bhiksha Raj

Multimedia Content Analysis

Developed a GMM-HMM based system for detecting points of significant change in the structure of audio, and worked with Acoustic Unit Descriptors as features for event detection and context recognition. Evaluated the approach on the TRECVID Multimedia Event Detection 2011 corpus, which has 15 distinct audio events, achieving 99% accuracy on 2 events (music and wedding audio) and an average accuracy of 91% over 10 different audio events.

Relevant Publications:

A Kumar, P Dighe, R Singh, S Chaudhuri, B Raj “Audio event detection from acoustic unit patterns” in ICASSP 2012, Kyoto, Japan.

Language Identification (LID) using Spectro-Temporal Patch Features

Defined a randomly selected library of spectro-temporal patterns from spoken examples of a language and derived features from the correlations of this library with spectrograms obtained from the speech signal. Built a discriminative classifier on these features to detect the presence of a language in a recording. Evaluation was done on the VoxForge (English, Hindi, German, Farsi) and CallFriend (English and Russian) corpora.
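The patch-correlation idea can be sketched in a few lines of NumPy. The details here are assumptions made for illustration: patches keep their original frequency position, the score is a normalized correlation, and the feature is the maximum score over time; the helper names `sample_patches` and `patch_features` are hypothetical.

```python
import numpy as np

def sample_patches(spec, n_patches, height, width, rng):
    """Draw random (frequency offset, patch) pairs from a spectrogram."""
    patches = []
    for _ in range(n_patches):
        f = int(rng.integers(0, spec.shape[0] - height + 1))
        t = int(rng.integers(0, spec.shape[1] - width + 1))
        patches.append((f, spec[f:f + height, t:t + width].copy()))
    return patches

def patch_features(spec, patches):
    """One feature per patch: best normalized correlation over time."""
    feats = []
    for f, patch in patches:
        h, w = patch.shape
        p = patch - patch.mean()
        p_norm = np.linalg.norm(p) + 1e-12
        best = -1.0
        # Slide the patch along time at its own frequency band.
        for t in range(spec.shape[1] - w + 1):
            win = spec[f:f + h, t:t + w]
            q = win - win.mean()
            score = float(np.sum(p * q) / (p_norm * (np.linalg.norm(q) + 1e-12)))
            best = max(best, score)
        feats.append(best)
    return np.array(feats)
```

The resulting fixed-length vector, one entry per library patch, can be passed to any discriminative classifier.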

Relevant Publications:

K Sahni, P Dighe, R Singh, B Raj "Language Identification using Spectro-Temporal Patch features" in SAPA SCALE Conference 2012, Portland, Oregon, USA.