Responsible Speech Foundation Models I

Speech foundation models are emerging as a universal solution to a wide range of speech tasks, and their superior performance has extended beyond ASR. For instance, Whisper has proven to be a noise-robust audio event tagger [1], showcasing potential beyond its original training objectives. Despite these advancements, the limitations and risks associated with speech foundation models have not been thoroughly studied. For example, wav2vec 2.0 has been found to exhibit biases across paralinguistic features, emotions [2], and accents [3], while HuBERT lacks noise robustness in certain downstream tasks [4]. Beyond these issues, foundation models raise ethical concerns, including privacy, sustainability, fairness, and safety [5]. Furthermore, the risks and biases of one model may propagate when it is used alongside other models, especially within a unified framework such as Seamless [6].

Thus, it is necessary to investigate speech foundation models with respect to de-biasing (e.g., achieving consistent accuracy across languages, genders, and ages), enhancing factuality (e.g., avoiding errors in critical applications), and preventing malicious use (e.g., using TTS to attack speaker verification systems, or deploying models for surveillance), among various other aspects.

In this special session, we focus on the responsible aspects of speech foundation models, which are not adequately covered by regular sessions. We aim to facilitate knowledge sharing across diverse speech areas and to pioneer discussions on both technical and non-technical issues. Furthermore, in line with the Interspeech 2024 "Speech and Beyond" theme, we aim to foster connections with other communities, such as NLP and ML, which have long been investigating responsible and trustworthy models [7]. Theoretical and position papers from those communities offering views, directions, ideas, or solutions for bridging the gap between speech and NLP/ML are also welcome (e.g., on integrating speech foundation models with LLMs in dialog systems).

Thursday, September 5, 10:00 - 12:00 | Location: Yanis Club

Accepted Papers

#102  Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction. Daniela Wiepert (Mayo Clinic); Rene L Utianski (Mayo Clinic); Joseph Duffy (Mayo Clinic); John Stricker (Mayo Clinic); Leland Barnard (Mayo Clinic); David Jones (Mayo Clinic); Hugo Botha (Mayo Clinic)

#454  On the social bias of speech self-supervised models. Yi-Cheng Lin (National Taiwan University); Tzu-Quan Lin (National Taiwan University); Hsi-Che Lin (National Taiwan University); Andy T. Liu (National Taiwan University); Hung-yi Lee (National Taiwan University) -- Best Paper Runner-up

#971  Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System. Lingwei Meng (The Chinese University of Hong Kong); Jiawen Kang (The Chinese University of Hong Kong); Yuejiao Wang (The Chinese University of Hong Kong); Zengrui Jin (The Chinese University of Hong Kong); Xixin Wu (The Chinese University of Hong Kong); Xunying Liu (The Chinese University of Hong Kong); Helen Meng (The Chinese University of Hong Kong)

#1073  Emo-bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition. Yi-Cheng Lin (National Taiwan University); Haibin Wu (National Taiwan University); Huang-Cheng Chou (National Tsing Hua University); Chi-Chun Lee (National Tsing Hua University); Hung-yi Lee (National Taiwan University)

#1212  Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Zakaria Aldeneh (Apple); Takuya Higuchi (Apple); Jee-weon Jung (Carnegie Mellon University); Skyler Seto (Apple); Tatiana Likhomanenko (Apple); Stephen Shum (Apple); Ahmed Hussen Abdelaziz (Apple); Shinji Watanabe (Carnegie Mellon University); Barry-John Theobald (Apple)

#2105  Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models. Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Tobias Bocklet (Technische Hochschule Nürnberg Georg Simon Ohm)

#2394  Self-supervised Speech Representations Still Struggle with African American Vernacular English. Kalvin Chang (Carnegie Mellon University); Yi-Hui Chou (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University); Hsuan-Ming Chen (Carnegie Mellon University); Nicole Holliday (Pomona College); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology); David R. Mortensen (Language Technologies Institute, Carnegie Mellon University) -- Honorable Mention

#2494  Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems. Ajinkya Kulkarni (ValidSoft, MBZUAI); Atharva Kulkarni (Erisha); Miguel Couceiro (LORIA); Isabel Trancoso (INESC-ID) -- Best Paper Award

Topics

Including, but not limited to:

Bias and fairness of speech foundation models (e.g., across languages, accents, genders, and ages)

Privacy and safety

Sustainability

Factuality and reliability in critical applications

Prevention of malicious use (e.g., TTS attacks on speaker verification systems, surveillance)

Bridging responsible and trustworthy AI research between the speech and NLP/ML communities

And many more aspects of speech foundation models that are not covered by regular sessions!

Paper Submission


Paper submission deadline: 2 March 2024

Paper update deadline: 11 March 2024

Author notification: 6 June 2024


We plan for the session to consist entirely of poster presentations, following a short introductory talk.

In the event of a large number of accepted papers, we plan to invite one or two keynote speakers, ideally with experience spanning both the speech community and related AI communities beyond speech.


Best paper award: Although ISCA/Interspeech 2024 offers no official awards designated for special sessions, we will select one accepted paper for a best paper award in our session and provide it with a certificate.

Organizers

Yuanchao Li, University of Edinburgh

Jennifer Williams, University of Southampton

Tiantian Feng, USC

Vikramjit Mitra, Apple AI/ML

Yuan Gong, MIT

Bowen Shi, Meta AI

Catherine Lai, University of Edinburgh

Peter Bell, University of Edinburgh

References

[1] Y. Gong, S. Khurana, L. Karlinsky, and J. Glass. “Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers”. In: Interspeech 2023.

[2] Y. Li, Y. Mohamied, P. Bell, and C. Lai. “Exploration of a self-supervised speech model: A study on emotional corpora”. In: 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE. 2023, pp. 868–875.

[3] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch, and P. Bell. “The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR”. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5.

[4] V. Mitra, V. Kowtha, H.-Y. S. Chien, E. Azemi, and C. Avendano. “Pre-Trained Model Representations and Their Robustness Against Noise for Speech Emotion Analysis”. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2023, pp. 1–5.

[5] T. Feng, R. Hebbar, and S. Narayanan. “TrustSER: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition”. In: arXiv preprint arXiv:2305.11229 (2023).

[6] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim, et al. “Seamless: Multilingual Expressive and Streaming Speech Translation”. In: arXiv preprint arXiv:2312.05187 (2023).

[7] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. “On the opportunities and risks of foundation models”. In: arXiv preprint arXiv:2108.07258 (2021).

Contact: yuanchao.li (at) ed.ac.uk