Speech foundation models are emerging as a universal solution to a wide range of speech tasks. Indeed, their superior performance has extended beyond ASR. For instance, Whisper has proven to be a noise-robust audio event tagger [1], showcasing its potential beyond its original training objectives. However, the limitations and risks associated with speech foundation models have not been thoroughly studied. For example, foundation models have been found to exhibit biases related to paralinguistic features such as emotion [2] and accent [3, 4], as well as noise [5]. In addition, foundation models present ethical challenges, including privacy, sustainability, fairness, safety [6], and social bias [7]. The responsible use of speech foundation models has attracted increasing attention not only in the speech community but also in the language community, including organizations like OpenAI [8].
Thus, it is necessary to investigate speech foundation models with respect to de-biasing (e.g., consistent accuracy across languages, genders, and ages), enhancing factuality (avoiding mistakes in critical applications), and preventing malicious use (e.g., using TTS to attack speaker verification systems, or deploying models for surveillance).
In this special session, we aim to look beyond the individual areas covered by regular sessions, such as speech recognition, paralinguistics, and speech synthesis, which lack a particular focus on foundation models, especially their limitations and risks; to address the risks of foundation models and the speech-LLMs recently emerging globally [9]; to catch up with the NLP/ML community by addressing responsible foundation models within the speech community; and to draw attention from various speech areas to exchange ideas on this emerging topic.
*Speech foundation models are broadly defined, including emerging speech-LLMs.
TOPICS
including but not limited to:
Fairness of speech foundation models for understudied tasks
prosody in context
dialog behaviors
speech-based healthcare
emotion in conversations
disfluencies and non-verbal vocalizations, including fillers, backchannels, and laughter
Limitations of speech foundation models and/or their solutions
inability to capture certain information
biases and inconsistent accuracy across different types of speech
risks that propagate in actual use
Potential risks and security concerns of foundation models
unauthorized voice generation
generating copyrighted content
ungrounded inference/sensitive trait attribution
disallowed content in audio output
erotic and violent speech output
gender unfairness
deepfake and anti-spoofing
Interpretability and generalizability of foundation models
why the models perform well or poorly in certain types of speech, tasks, scenarios
Multimodal speech foundation models
misalignment among audio, textual, visual, neural, and multisensory information
Joint training of diverse speech tasks using foundation models
multitasking that was traditionally infeasible
Adaptation methods for low-resource/out-of-domain speech using foundation models
parameter-efficient tuning
transfer learning
knowledge distillation
Robustness of speech foundation models
to noisy, disfluent, stuttered, dysarthric, dysphonic, and atypical speech
Integrating tech and non-tech elements to ensure speech responsibility
social theories and ethical standards
And many more that are not covered by regular sessions!
Paper Submission
Special sessions are essentially an extension of the topics covered by regular sessions; the expected paper quality and the submission and review processes are identical.
Papers submitted to this session should follow the regular Interspeech paper guidelines, submission procedure, and review process.
Accepted papers will appear in the main proceedings and the ISCA archive, and presentations will take place during the main conference, just as with regular sessions.
Be sure to select “14.07 Responsible Speech Foundation Models (Special Session)” as your paper subject area when making a submission in the system.
Paper submission deadline: 12 Feb 2025
Paper update deadline: 19 Feb 2025
Author notification: 21 May 2025
We plan for the session to consist entirely of poster presentations, following a short introductory talk.
In the event of a large number of accepted papers, we plan to invite one or two keynote speakers, ideally with experience spanning both the speech community and relevant non-speech AI communities.
Best paper award: Although ISCA/Interspeech 2025 offers no official awards specifically designated for special sessions, we would like to select one accepted paper for a best paper award in our session and present its authors with a certificate.
Organizers
Yuanchao Li, University of Edinburgh
Jennifer Williams, University of Southampton
Tiantian Feng, USC
Vikramjit Mitra, Apple AI/ML
Yuan Gong, xAI
Bowen Shi, Meta AI
Catherine Lai, University of Edinburgh
Peter Bell, University of Edinburgh
References
[1] Y. Gong, S. Khurana, L. Karlinsky, and J. Glass. “Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers”. Interspeech 2023.
[2] Y. Li, Y. Mohamied, P. Bell, and C. Lai. “Exploration of a self-supervised speech model: A study on emotional corpora”. IEEE Spoken Language Technology Workshop (SLT) 2022.
[3] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch, and P. Bell. “The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR”. IEEE ICASSP 2023.
[4] K. Chang, Y.-H. Chou, J. Shi, H.-M. Chen, N. Holliday, O. Scharenborg, and D. R. Mortensen. “Self-supervised speech representations still struggle with African American Vernacular English”. Interspeech 2024.
[5] V. Mitra, V. Kowtha, H.-Y. S. Chien, E. Azemi, and C. Avendano. “Pre-Trained Model Representations and Their Robustness Against Noise for Speech Emotion Analysis”. IEEE ICASSP 2023.
[6] T. Feng, R. Hebbar, and S. Narayanan. “TrustSER: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition”. IEEE ICASSP 2024.
[7] Y.-C. Lin, T.-Q. Lin, H.-C. Lin, A. T. Liu, and H.-Y. Lee. “On the social bias of speech self-supervised models”. Interspeech 2024.
[8] OpenAI. “GPT-4o System Card”. 2024.
[9] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. “On the opportunities and risks of foundation models”. arXiv:2108.07258. 2021.
Contact: yuanchao.li (at) ed.ac.uk