The rise of large foundation models has transformed natural language processing and speech understanding, driving major progress in recognition, synthesis, and translation. Through large-scale pre-training, these models acquire strong generalization capabilities across tasks. However, pre-training alone is rarely sufficient for real-world use, as models often lack sensitivity to specialized domains, ethical considerations, and cultural nuances.
Post-training has therefore emerged as a crucial stage in developing speech foundation models. It advances research in four main directions. First, it supports adaptation, enabling models to adjust to new languages, acoustic environments, and user contexts for improved personalization and cultural relevance [1, 2]. Second, it fosters alignment, embedding safety principles and human values to ensure fairness and responsible behavior [3]. Third, it enhances reasoning, strengthening the ability to perform audio and multimodal inference, sustain multi-turn interactions, and capture deeper semantic meaning [4, 5]. Finally, it improves efficiency through model compression, knowledge distillation, and continual learning, making models scalable and sustainable [6].
Despite these advancements, post-training faces persistent challenges. Limited and imbalanced data hinder adaptation across diverse linguistic and acoustic settings. The definition and quantification of human values remain ambiguous, complicating reliable alignment. Reasoning and interpretability are difficult to evaluate, and over-alignment may reduce flexibility and utility. Furthermore, ensuring transparency and fairness in sensitive domains such as healthcare and education demands careful data design and ethical oversight.
Addressing these challenges requires collaboration across the speech community. Researchers, engineers, and practitioners must work together to build post-training methodologies that enhance adaptation, reasoning, and ethical integrity, ensuring that speech foundation models remain powerful, interpretable, and aligned with societal values.
Methods for adapting speech foundation models to new contexts and applications.
Cross-cultural adaptation and sensitivity in speech model post-training.
Methods for aligning speech technologies with human values, ethics, and safety.
Fairness and bias mitigation in post-training for inclusive speech technologies.
Reasoning-aware post-training for improved downstream speech tasks.
Speech foundation model distillation, quantization, and pruning techniques in post-training.
Post-training for language preservation and documentation.
Post-training applications in healthcare, accessibility, education, and other high-impact domains.
Advancements in multimodal reasoning abilities through post-training.
Continual and lifelong adaptation in speech models.
And many more that are not covered by regular sessions!
The Special Session follows the same guidelines as the Interspeech 2026 regular sessions. When submitting your paper, select “Post-Training of Speech Foundation Models” as the subject area. Submitted papers will go through the same review process as regular papers.
Paper Submission Portal Opens: 17 January 2026
Paper Submission Deadline: 25 February 2026
Paper Update Deadline: 4 March 2026
Rebuttal Period: 24 April – 1 May 2026
Paper Acceptance Notification: 5 June 2026
For any changes or updates to these dates, please refer to the official Interspeech 2026 website.
Yang Xiao, The University of Melbourne (Primary contact)
Xiangyu Zhang, University of New South Wales
Ziyang Ma, Nanyang Technological University
Siyi Wang, The University of Melbourne
Jiaheng Dong, The University of Melbourne
Eng Siong Chng, Nanyang Technological University
Ting Dang, The University of Melbourne
[1] J. Dong, H. Jia, S. Chatterjee, A. Ghosh, J. Bailey, T. Dang. "E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models." NeurIPS 2025.
[2] Y. Xiao, T. Peng, Y. Zhou, R. K. Das. "AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation." Interspeech 2025.
[3] Z. Ma, X. Li, Y. Song, W. Chen, C. Du, J. Wu, Y. Chen, Z. Chen, Y. Wang, Y. Wang, X. Chen. "Towards Reliable Large Audio Language Model." ACL 2025.
[4] F. Tian, X. Zhang, Y. Zhang, H. Zhang, Y. Li, D. Liu, Y. Deng, D. Wu, J. Chen, L. Zhao, C. Yao, H. Liu, E. S. Chng, X. Yang, X. Zhang, D. Jiang, G. Yu. "Step-Audio-R1 Technical Report." Technical Report, 2025.
[5] Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y. W. Chao, R. Xu, W. Chen, Y. Chen, Z. Chen, J. Cong, K. Li, K. Li, S. Li, X. Li, X. Li, Z. Lian, Y. Liang, M. Liu, Z. Niu, T. Wang, Y. Wang, Y. Wang, Y. Wu, G. Yang, J. Yu, R. Yuan, Z. Zheng, Z. Zhou, H. Zhu, W. Xue, E. Benetos, K. Yu, E. S. Chng, X. Chen. "MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix." NeurIPS 2025.
[6] Y. Xiao, T. Peng, R. K. Das, Y. Hu, H. Zhuang. "AnalyticKWS: Towards Exemplar-Free Analytic Class Incremental Learning for Small-footprint Keyword Spotting." ACL 2025.