EmoSpoof-TTS Dataset (Interspeech 2025) is available here.
Traditional anti-spoofing research focuses on models and datasets built from synthetic speech in a mostly neutral emotional state, neglecting diverse emotional variation. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this gap by introducing EmoSpoof-TTS, a diverse dataset containing over 29 hours of emotionally expressive synthetic speech generated with recent TTS models. Our analysis using EmoSpoof-TTS shows that existing anti-spoofing models struggle with emotional synthetic speech, exposing the risk of emotion-targeted attacks. EmoSpoof-TTS is among the first corpora specifically targeting emotionally expressive synthetic speech for spoofing analysis.
Please cite: Mahapatra, A., Ulgen, I.R., Naini, A.R., Busso, C. and Sisman, B., 2025. Can Emotion Fool Anti-spoofing? Proc. Interspeech 2025.
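The per-emotion robustness analysis described above is typically reported with the equal error rate (EER) of a spoofing detector, computed separately for each emotion category. The sketch below is only an illustrative implementation of EER from raw detector scores, not the paper's evaluation code; the score arrays in the usage comment are hypothetical.

```python
import numpy as np

def equal_error_rate(bona_scores, spoof_scores):
    """EER from detector scores, where higher scores mean "more likely bona fide".

    Sweeps every observed score as a threshold and returns the operating
    point where the false acceptance rate (spoof accepted) and false
    rejection rate (bona fide rejected) are closest.
    """
    bona = np.asarray(bona_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.unique(np.concatenate([bona, spoof]))
    best_gap, eer = 1.0, 1.0
    for th in thresholds:
        far = np.mean(spoof >= th)  # spoofed speech accepted as bona fide
        frr = np.mean(bona < th)    # bona fide speech rejected as spoofed
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Hypothetical usage: group scores by emotion and compare per-emotion EERs,
# e.g. equal_error_rate(bona_by_emotion["angry"], spoof_by_emotion["angry"]).
```

Computing one EER per emotion category rather than a single pooled EER is what exposes emotion-dependent weaknesses of a detector.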
NaturalVoices Dataset (Interspeech 2024) is available here.
Voice conversion (VC) research has traditionally relied on scripted or acted speech, which lacks the spontaneity of real-life conversations. Natural speech data for VC remains scarce, and our study fills this gap. We introduce a novel data-sourcing pipeline that enables the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information from speech, such as emotion and signal-to-noise ratio (SNR), from raw podcast data using recent deep learning methods, offering flexibility and ease of use. NaturalVoices is a large-scale, spontaneous, expressive, and emotional speech dataset comprising over 4,000 hours of speech sourced from the original podcasts in the MSP-Podcast dataset. Objective and subjective evaluations demonstrate the effectiveness of our pipeline in providing natural and expressive data for VC, suggesting the potential of NaturalVoices for broader speech generation tasks.
Please cite: Salman, A.N., Du, Z., Chandra, S.S., Ülgen, İ.R., Busso, C., Sisman, B. (2024) Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline. Proc. Interspeech 2024, 4358-4362, doi: 10.21437/Interspeech.2024-1256
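The pipeline annotates each utterance with attributes such as SNR. The actual NaturalVoices pipeline relies on recent deep learning methods for this; the sketch below is only a crude energy-based stand-in to illustrate the idea of per-utterance SNR estimation, and the frame length and noise quantile are arbitrary assumed values.

```python
import numpy as np

def estimate_snr_db(signal, frame_len=1024, noise_quantile=0.1):
    """Rough energy-based SNR estimate (in dB) for one utterance.

    Frames in the lowest-energy quantile are treated as a noise-only
    floor; the average excess energy over that floor is treated as
    speech. This is an illustrative heuristic, not the deep-learning
    estimator used in the NaturalVoices pipeline, and it assumes the
    utterance contains some low-energy (non-speech) frames.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames ** 2, axis=1)
    noise_power = max(np.quantile(energies, noise_quantile), 1e-12)
    speech_power = max(energies.mean() - noise_power, 1e-12)
    return 10.0 * np.log10(speech_power / noise_power)
```

Annotations like this allow downstream VC training to filter the raw podcast audio, e.g. keeping only utterances above a chosen SNR threshold.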
Emotional Speech Dataset (ESD; ICASSP 2021 and Speech Communication 2022) is available here.
The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers, covering 5 emotion categories (neutral, happy, angry, sad, and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
Please cite:
K. Zhou, B. Sisman, R. Liu and H. Li, "Seen and Unseen Emotional Style Transfer for Voice Conversion with A New Emotional Speech Dataset," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 920-924, doi: 10.1109/ICASSP39728.2021.9413391.
K. Zhou, B. Sisman, R. Liu and H. Li, "Emotional voice conversion: Theory, databases and ESD," Speech Communication, vol. 137, pp. 1-18, 2022.
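For multi-speaker, multi-emotion studies with a corpus like ESD, a small helper that indexes utterances by speaker and emotion is often the first step. The directory layout assumed below (root/<speaker>/<emotion>/*.wav) is a hypothetical convention for illustration only; consult the actual ESD release for its exact structure and file naming.

```python
from collections import defaultdict
from pathlib import Path

# The five ESD emotion categories; capitalized folder names are an assumption.
EMOTIONS = ("Neutral", "Happy", "Angry", "Sad", "Surprise")

def index_esd(root):
    """Index ESD-style utterances into {(speaker, emotion): [wav paths]}.

    Assumes a hypothetical layout root/<speaker>/<emotion>/*.wav; adapt
    the glob pattern to the real release's directory structure.
    """
    index = defaultdict(list)
    for wav in sorted(Path(root).glob("*/*/*.wav")):
        speaker, emotion = wav.parts[-3], wav.parts[-2]
        if emotion in EMOTIONS:
            index[(speaker, emotion)].append(wav)
    return dict(index)
```

An index keyed by (speaker, emotion) makes it straightforward to draw the source/target pairs needed for multi-speaker or cross-lingual emotional voice conversion experiments.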