Paralinguistic cues, the non-lexical components of speech, play a crucial role in human-human interaction. Paralinguistic tasks, particularly the detection of emotional expression from speech, have limited access to large datasets with accurate labels, which makes it difficult to train models that capture paralinguistic attributes via the supervised learning paradigm. In this work, we propose the Expressive Voice Conversion Autoencoder (EVoCA), a framework for capturing paralinguistic (e.g., emotion) attributes from large-scale (i.e., 200 hours) audio-textual data without requiring manual emotion annotations. The proposed network learns what makes speech expressive, in an unsupervised manner, through conversion between synthesized (neutral) speech and real (expressive) speech. We demonstrate that the learned EVoCA embeddings outperform Mel-spectrum-based acoustic features and other current unsupervised methods on emotion and speaking-style classification tasks. A minimal sketch of the conversion idea follows below.
-- NAACL 2021 [Accepted]
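As a rough illustration of the conversion-autoencoder idea described above, here is one way it could look in PyTorch. This is a hedged sketch, not the paper's actual architecture: the `ExpressiveEncoder`/`ConversionDecoder` names, layer sizes, and L1 reconstruction loss are all assumptions, and it assumes the neutral and expressive mel-spectrograms are frame-aligned.

```python
# Illustrative sketch only: an encoder extracts an expressive embedding from
# real speech, and a decoder converts synthesized-neutral speech toward the
# real speech conditioned on that embedding.
import torch
import torch.nn as nn

class ExpressiveEncoder(nn.Module):
    """Encodes a real (expressive) utterance into a fixed-size embedding."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, mel):             # mel: (batch, time, n_mels)
        _, h = self.rnn(mel)            # h: (1, batch, 256)
        return self.proj(h.squeeze(0))  # (batch, emb_dim)

class ConversionDecoder(nn.Module):
    """Converts a synthesized (neutral) utterance toward the real one,
    conditioned on the expressive embedding."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels + emb_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, neutral_mel, emb):
        # Broadcast the embedding across every frame of the neutral input.
        emb_t = emb.unsqueeze(1).expand(-1, neutral_mel.size(1), -1)
        h, _ = self.rnn(torch.cat([neutral_mel, emb_t], dim=-1))
        return self.out(h)

encoder, decoder = ExpressiveEncoder(), ConversionDecoder()
real_mel = torch.randn(4, 200, 80)     # real, expressive speech (dummy data)
neutral_mel = torch.randn(4, 200, 80)  # TTS-synthesized neutral speech
emb = encoder(real_mel)                # the learned expressive embedding
recon = decoder(neutral_mel, emb)
loss = nn.functional.l1_loss(recon, real_mel)  # reconstruction objective
```

Because the decoder already receives the neutral content, the embedding is pushed to carry only what the neutral signal lacks, i.e., the expressive residual.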
Automatic speech recognition (ASR) is a key component of automatic aphasic speech analysis. However, the current approach of using a standard, one-size-fits-all ASR model can be sub-optimal given the wide range of speech intelligibility that exists both within and across speakers. In this work, we investigate the importance of speech intelligibility for ASR modeling. We show how speech intelligibility can be estimated with a neural network, and how intelligibility variability can be addressed within our acoustic model architecture using a mixture of experts (sketched below). Our results show that this model yields significant phone recognition improvements over a traditional, one-size-fits-all model.
-- Interspeech 2020 [Paper]
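The mixture-of-experts idea above can be sketched as follows, assuming PyTorch. This is not the paper's exact model: the number of experts, the layer sizes, and gating directly on a scalar intelligibility estimate are illustrative assumptions.

```python
# Illustrative sketch: an estimated intelligibility score gates a weighted
# combination of expert subnetworks inside the acoustic model.
import torch
import torch.nn as nn

class IntelligibilityGatedMoE(nn.Module):
    def __init__(self, in_dim=40, hid_dim=256, n_experts=3):
        super().__init__()
        # One expert per intelligibility band (e.g., low / mid / high).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
            for _ in range(n_experts)
        )
        # Gate maps the scalar intelligibility estimate to expert weights.
        self.gate = nn.Sequential(nn.Linear(1, n_experts), nn.Softmax(dim=-1))

    def forward(self, feats, intelligibility):
        # feats: (batch, in_dim); intelligibility: (batch, 1), in [0, 1]
        w = self.gate(intelligibility)                     # (batch, n_experts)
        outs = torch.stack([e(feats) for e in self.experts], dim=1)
        return (w.unsqueeze(-1) * outs).sum(dim=1)         # weighted mix

moe = IntelligibilityGatedMoE()
feats = torch.randn(8, 40)  # acoustic features per frame (dummy data)
intel = torch.rand(8, 1)    # intelligibility predicted by a separate network
hidden = moe(feats, intel)  # would feed a downstream phone classifier
```

Soft gating lets low- and high-intelligibility speech be routed to different parameters while still training the whole model jointly.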
This work presents a pipeline for an automatic, end-to-end classification system that uses speech as the primary input for predicting Huntington disease. We explore transcript-based features to capture speech characteristics of interest, and we use methods such as k-nearest neighbors (with Euclidean and dynamic-time-warping distances) as well as more modern neural network approaches for classification; a toy DTW-based k-NN sketch follows below.
-- Interspeech 2018 [Paper]
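For concreteness, here is a minimal NumPy sketch of k-nearest-neighbor classification under a dynamic-time-warping distance. The feature sequences and function names are hypothetical; this is the textbook DTW/k-NN recipe, not the paper's specific pipeline.

```python
# Illustrative sketch: DTW distance between 1-D feature sequences,
# plus a majority-vote k-NN classifier built on top of it.
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping alignment cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three neighboring alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_predict(query, train_seqs, train_labels, k=3):
    """Label a query sequence by majority vote of its k DTW-nearest neighbors."""
    dists = [dtw_distance(query, s) for s in train_seqs]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

DTW matters here because utterances differ in length and speaking rate, so a warped alignment compares sequences more fairly than a fixed Euclidean distance.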
This work investigates the use of mobile devices for extracting and analyzing acoustic features to detect mild traumatic brain injury (mTBI). Our results suggest a strong correlation between certain temporal and frequency-domain features and the likelihood of a concussion; a sketch of this style of feature extraction and correlation analysis appears below.
-- IEEE Journal of Biomedical and Health Informatics [Paper]
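As a loose illustration of this kind of analysis, the sketch below extracts a few simple temporal and frequency features with librosa and checks their correlation with labels. The specific feature set, the `clips`/`labels` data, and the file paths are all hypothetical, not the study's.

```python
# Illustrative sketch: extract basic temporal/frequency acoustic features
# and test their correlation with concussion labels.
import librosa
import numpy as np
from scipy.stats import pearsonr

def acoustic_features(path):
    y, sr = librosa.load(path, sr=16000)
    return {
        "duration_s": len(y) / sr,                                    # temporal
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y).mean(),
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        "rms_energy": librosa.feature.rms(y=y).mean(),
    }

# Hypothetical inputs: paths to speech clips and 0/1 concussion labels.
clips = ["a.wav", "b.wav", "c.wav", "d.wav"]
labels = np.array([0, 0, 1, 1])
feats = [acoustic_features(p) for p in clips]
for name in feats[0]:
    r, p = pearsonr([f[name] for f in feats], labels)
    print(f"{name}: r={r:.2f} (p={p:.3f})")
```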