Speech foundation models are emerging as a universal solution to a wide range of speech tasks. Indeed, their superior performance has extended beyond ASR. For instance, Whisper has proven to be a noise-robust audio event tagger [1], showcasing its potential beyond its original training objectives. However, the limitations and risks associated with speech foundation models have not been thoroughly studied. For example, foundation models have been found to exhibit biases related to paralinguistic features such as emotion [2] and accent [3, 4], as well as noise [5]. In addition, foundation models present ethical challenges, including privacy, sustainability, fairness, safety [6], and social bias [7]. The responsible use of speech foundation models has attracted increasing attention not only in the speech community but also in the language community, including organizations like OpenAI [8].
Thus, it is necessary to investigate speech foundation models with respect to de-biasing (e.g., consistent accuracy across languages, genders, and ages), enhancing factuality (avoiding mistakes in critical applications), and preventing malicious use (e.g., using TTS to attack speaker verification systems, or deploying models for surveillance).
In this special session, we aim to look beyond the individual areas covered by regular sessions, such as speech recognition, paralinguistics, and speech synthesis, which lack a particular focus on foundation models, especially their limitations and risks; to address the risks of foundation models and the speech-LLMs recently emerging globally [9]; to catch up with the NLP/ML community by addressing responsible foundation models within the speech community; and to draw attention from various speech areas to exchange ideas on this emerging topic.
*Speech foundation models are broadly defined, including emerging speech-LLMs.
TOPICS
including but not limited to:
Fairness of speech foundation models for understudied tasks
prosody in context
dialog behaviors
speech-based healthcare
emotion in conversations
disfluencies and non-verbal vocalizations, including fillers, backchannels, and laughter
Limitations of speech foundation models and/or their solutions
inability to capture certain information
biases and inconsistent accuracy across different types of speech
risks that propagate in actual use
Potential risks and security concerns of foundation models
unauthorized voice generation
generating copyrighted content
ungrounded inference/sensitive trait attribution
disallowed content in audio output
erotic and violent speech output
gender unfairness
deepfake and anti-spoofing
Interpretability and generalizability of foundation models
why the models perform well or poorly in certain types of speech, tasks, scenarios
Multimodal speech foundation models
misalignment among audio, textual, visual, neural, and multisensory information
Joint training of diverse speech tasks using foundation models
multitasking that was traditionally infeasible
Adaptation methods for low-resource/out-of-domain speech using foundation models
parameter-efficient tuning
transfer learning
knowledge distillation
Robustness of speech foundation models
to noisy, disfluent, stuttered, dysarthric, dysphonic, and atypical speech
Integrating tech and non-tech elements to ensure speech responsibility
social theories and ethical standards
And many more that are not covered by regular sessions!
Paper Submission
Special sessions are essentially an extension of the topics covered by regular sessions; the expected paper quality and the submission and review processes are identical.
Papers submitted to this session should follow the regular Interspeech paper guidelines, submission procedure, and review process.
Accepted papers will appear in the main proceedings and the ISCA archive, and presentations will take place during the main conference, just as with regular sessions.
Be sure to select “14.07 Responsible Speech Foundation Models (Special Session)” as your paper subject area when making a submission in the system.
Paper submission deadline: 12 Feb 2025
Paper update deadline: 19 Feb 2025
Author notification: 21 May 2025
We plan for the session to consist entirely of poster presentations, following a short introductory talk.
In the event of a large number of accepted papers, we plan to invite one or two keynote speakers, ideally with experience spanning both the speech community and relevant non-speech AI communities.
Best paper award: Although ISCA/Interspeech 2025 offers no official awards specifically designated for special sessions, we would like to select one accepted paper for a best paper award in our session and present its authors with a certificate.
Organizers
Yuanchao Li, University of Edinburgh
Jennifer Williams, University of Southampton
Tiantian Feng, USC
Vikramjit Mitra, Apple AI/ML
Yuan Gong, xAI
Bowen Shi, Meta AI
Catherine Lai, University of Edinburgh
Peter Bell, University of Edinburgh
References
[1] Y. Gong, S. Khurana, L. Karlinsky, and J. Glass. “Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers”. Interspeech 2023.
[2] Y. Li, Y. Mohamied, P. Bell, and C. Lai. “Exploration of a self-supervised speech model: A study on emotional corpora”. IEEE Spoken Language Technology Workshop (SLT) 2022.
[3] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch, and P. Bell. “The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR”. IEEE ICASSP 2023.
[4] K. Chang, Y.-H. Chou, J. Shi, H.-M. Chen, N. Holliday, O. Scharenborg, and D. R. Mortensen. “Self-supervised speech representations still struggle with African American Vernacular English”. Interspeech 2024.
[5] V. Mitra, V. Kowtha, H.-Y. S. Chien, E. Azemi, and C. Avendano. “Pre-Trained Model Representations and Their Robustness Against Noise for Speech Emotion Analysis”. IEEE ICASSP 2023.
[6] T. Feng, R. Hebbar, and S. Narayanan. “TrustSER: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition”. IEEE ICASSP 2024.
[7] Y.-C. Lin, T.-Q. Lin, H.-C. Lin, A. T. Liu, and H.-Y. Lee. “On the social bias of speech self-supervised models”. Interspeech 2024.
[8] OpenAI. “GPT-4o System Card”. 2024.
[9] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. “On the opportunities and risks of foundation models”. arXiv:2108.07258. 2021.
Contact: yuanchao.li (at) ed.ac.uk