Resources
Emotional Speech Dataset (ESD) is available here.
The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers, covering five emotion categories (neutral, happy, angry, sad, and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
Please cite:
K. Zhou, B. Sisman, R. Liu and H. Li, "Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 920-924, doi: 10.1109/ICASSP39728.2021.9413391.
Zhou, K., Sisman, B., Liu, R., & Li, H. (2022). Emotional voice conversion: Theory, databases and ESD. Speech Communication, 137, 1-18.
EmoSpoof-TTS Dataset (Interspeech 2025) is available here.
Traditional anti-spoofing focuses on models and datasets built from synthetic speech in a mostly neutral emotional state, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a diverse dataset containing over 29 hours of emotionally expressive synthetic speech generated with recent TTS models. Our analysis using EmoSpoof-TTS shows that existing anti-spoofing models struggle with emotional synthetic speech, exposing the risk of emotion-targeted attacks. EmoSpoof-TTS is among the first corpora specifically targeting emotionally expressive synthetic speech for spoofing analysis.
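Anti-spoofing robustness of the kind analyzed above is usually reported as an equal error rate (EER) over detector scores. As a minimal sketch (the function below is illustrative and not part of the EmoSpoof-TTS release), EER can be computed from bona fide and spoof scores like this:

```python
import numpy as np

def compute_eer(bona_fide_scores, spoof_scores):
    """Equal error rate: the operating point where the false-acceptance
    rate (spoof accepted) meets the false-rejection rate (bona fide rejected)."""
    thresholds = np.sort(np.concatenate([bona_fide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_fide_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))   # threshold where the two rates cross
    return (far[idx] + frr[idx]) / 2.0

# Toy scores: well-separated classes yield an EER near zero.
bona = np.array([0.9, 0.8, 0.85, 0.95])
spoof = np.array([0.1, 0.2, 0.15, 0.3])
print(compute_eer(bona, spoof))  # 0.0
```

An emotion-targeted evaluation would simply compute this metric separately per emotion category of the spoofed speech, revealing which emotions degrade the detector most.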
Please cite: Mahapatra, A., Ulgen, I.R., Reddy Naini, A., Busso, C., Sisman, B. (2025) Can Emotion Fool Anti-spoofing? Proc. Interspeech 2025, 5628-5632, doi: 10.21437/Interspeech.2025-1234
Emotional Speaking Style Captions for MSP Podcast release 1.12 (EmoRankCLAP) (Interspeech 2025) is available here.
We release a collection of natural-language emotional speaking style descriptions derived from the MSP-Podcast corpus (release 1.12). These descriptions are generated from dimensional emotion attributes (valence and arousal) to help bridge the gap between the speech and text modalities in CLAP-style models. Traditionally, speaking style annotations have been limited to speaker traits and categorical emotions. Our dataset extends beyond such fixed categories, enabling the development of emotion understanding models that capture the subtleties of emotion on a continuous scale. All captions were generated using OpenAI’s o1 large language model with the following prompt: “Given the following scale of emotions – valence (1 = very negative; 7 = very positive), arousal (1 = very calm; 7 = very active), write a sentence describing a speaking style that is {VALENCE} on valence and {AROUSAL} on arousal. Do not use any numbers in the sentence. The sentence should start with: The person is speaking ...”
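The prompt above is a template with two slots. A minimal sketch of how it could be instantiated per utterance before being sent to a language model (the API call itself is omitted, and the helper name is our own, not part of the release):

```python
# Template copied from the caption-generation prompt; {VALENCE} and
# {AROUSAL} are the two slots filled per utterance.
PROMPT_TEMPLATE = (
    "Given the following scale of emotions - valence (1 = very negative; "
    "7 = very positive), arousal (1 = very calm; 7 = very active), write a "
    "sentence describing a speaking style that is {VALENCE} on valence and "
    "{AROUSAL} on arousal. Do not use any numbers in the sentence. "
    "The sentence should start with: The person is speaking ..."
)

def build_caption_prompt(valence: float, arousal: float) -> str:
    """Fill the valence/arousal slots with an utterance's attribute ratings."""
    return PROMPT_TEMPLATE.format(VALENCE=valence, AROUSAL=arousal)

# Example: a pleasant, low-energy utterance (valence 6.2, arousal 2.5).
print(build_caption_prompt(6.2, 2.5))
```

Because the attributes are continuous, nearby ratings produce subtly different prompts, which is what allows the resulting captions to describe emotion on a continuous scale rather than in fixed categories.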
Please cite:
Chandra, S.S., Goncalves, L., Lu, J., Busso, C., Sisman, B. (2025) EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast. Proc. Interspeech 2025, 3000-3004, doi: 10.21437/Interspeech.2025-1198
MSP-Podcast: Lotfian, R., Busso, C. (2017) Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4), 471-483.
NaturalVoices Dataset (Interspeech 2024) is available here.
Voice conversion (VC) research has traditionally depended on scripted or acted speech, which lacks the spontaneity of real-life conversations. Natural speech data for VC remains scarce, and our study focuses on filling this gap. We introduce a novel data-sourcing pipeline that enables the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information from speech, such as emotion and signal-to-noise ratio (SNR), from raw podcast data, utilizing recent deep learning methods and providing flexibility and ease of use. NaturalVoices is a large-scale, spontaneous, expressive, and emotional speech dataset, comprising over 4,000 hours of speech sourced from the original podcasts in the MSP-Podcast dataset. Objective and subjective evaluations demonstrate the effectiveness of our pipeline in providing natural and expressive data for VC, suggesting the potential of NaturalVoices for broader speech generation tasks.
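To illustrate one stage of a pipeline like the one described above, SNR can be roughly estimated from the audio alone by treating the quietest frames as the noise floor and the loudest as signal. The sketch below is a simplified illustration under that assumption, not the pipeline's actual method:

```python
import numpy as np

def estimate_snr_db(signal, frame_len=1024):
    """Rough SNR estimate: compare mean energy of the loudest 10% of
    frames (signal) against the quietest 10% (assumed noise floor)."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort((frames ** 2).mean(axis=1))
    k = max(1, n_frames // 10)
    noise_power = energies[:k].mean()
    signal_power = energies[-k:].mean()
    return 10.0 * np.log10(signal_power / max(noise_power, 1e-12))

# Toy check: one second of faint noise followed by a loud tone in the
# same noise should yield a clearly high SNR estimate.
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(16000)
tone = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000.0)
x = np.concatenate([noise, tone + 0.01 * rng.standard_normal(16000)])
print(estimate_snr_db(x))  # well above 20 dB
```

In a sourcing pipeline, per-utterance scores like this one make it possible to filter raw podcast audio down to clean, usable segments before any VC training.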
Please cite:
Salman, A.N., Du, Z., Chandra, S.S., Ülgen, İ.R., Busso, C., Sisman, B. (2024) Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline. Proc. Interspeech 2024, 4358-4362, doi: 10.21437/Interspeech.2024-1256