We are live streaming the event! Note that Day 1 and Day 2 use different links. (Both links are now deactivated.)
Day 1
9:15 - 9:50 Registration & Breakfast
9:50 - 10:00 Opening Remarks
10:00 - 11:00 Keynote Talk: Hung-yi Lee
11:00 - 11:10 Break
11:10 - 11:50 Invited Talk: Neil Zeghidour
11:50 - 12:30 Invited Talk: Mingqiu Wang
12:30 - 13:05 Lightning Talks Session 1
13:05 - 14:40 Lunch & Poster Session 1
14:40 - 15:20 Invited Talk: Yossi Adi
15:20 - 16:00 Invited Talk: Wei-Ning Hsu
16:00 - 16:10 Break
16:10 - 17:30 Nuts & Bolts Session 1: Democratization of Speech Foundation Models
Led by: Shinji Watanabe
Day 2
9:00 - 9:30 Breakfast
9:30 - 10:30 Keynote Talk: Noah Smith
10:30 - 10:40 Break
10:40 - 11:20 Invited Talk: David Harwath
11:20 - 12:00 Invited Talk: Tatiana Likhomanenko
12:00 - 12:40 Lightning Talks Session 2
12:40 - 14:15 Lunch & Poster Session 2
14:15 - 15:00 Nuts & Bolts Session 2: Evaluation
Led by: Hung-yi Lee
15:00 - 15:10 Break, panel setup
15:10 - 16:10 Nuts & Bolts Session 3: The Industry Experience
Panel: industry speakers
16:10 - Close
Abstract: This talk highlights recent advancements in Spoken Language Models (SLMs), focusing on enabling text-based Large Language Models (LLMs) to seamlessly process and generate speech while retaining their universal capabilities. Starting from traditional text-based LLMs, we explore methods to integrate speech comprehension and generation without causing catastrophic forgetting of their original skills. We introduce novel speech representation learning techniques specifically tailored for SLMs and present analyses of their internal representations. Additionally, we discuss benchmark evaluations designed for SLMs, assessing their universal capabilities, instruction-following proficiency, reasoning abilities, and effectiveness in full-duplex dialogues.
Bio:
Hung-yi Lee is a professor in the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment in the university's Department of Computer Science & Information Engineering. His recent research focuses on developing technology that can reduce the need for annotated data in speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He received a Salesforce Research Deep Learning Grant in 2019, an AWS ML Research Award in 2020, the Outstanding Young Engineer Award from the Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and the 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He runs a YouTube channel teaching deep learning in Mandarin, which has more than 300,000 subscribers.
Abstract: Neural language models with billions of parameters and trained on trillions of words are powering the fastest-growing computing applications in history and generating discussion and debate around the world. Yet most scientists cannot study or improve those state-of-the-art models because the organizations deploying them keep their data and machine learning processes secret. I believe that the path to models that are usable by all, at low cost, customizable for areas of critical need like the sciences, and whose capabilities and limitations are made transparent and understandable, is radically open development, with academic and not-for-profit researchers empowered to do reproducible science. In this talk, I’ll discuss some of the work our team is doing to radically open up the science of language modeling and make it possible to explore new scientific questions and democratize control of the future of this fascinating and important technology.
The work I’ll present was led by a large team at the Allen Institute for Artificial Intelligence in Seattle, with collaboration from the Paul G. Allen School at the University of Washington and various kinds of support and coordination from many organizations, including the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, AMD, CSC - IT Center for Science (Finland), Databricks, Together.ai, and the National AI Research Resource Pilot. In August, the team was awarded a $75M mid-scale research infrastructure grant from the National Science Foundation, with additional support from NVIDIA, enabling continued work for five years.
Bio:
Noah A. Smith is a researcher in natural language processing and machine learning, serving as the Amazon Professor at the University of Washington and Senior Director of NLP Research at the Allen Institute for AI. He co-directs the OLMo open language modeling initiative. His current work spans language, music, and AI research methodology, with a strong emphasis on mentoring—his former mentees now hold faculty and leadership roles worldwide. Smith is a Fellow of the Association for Computational Linguistics and has received numerous awards for research and innovation. More up-to-date information can be found at https://nasmith.github.io/.
Abstract: Spoken language models (SLMs) have emerged as an interesting and promising research direction. Yet, the term ‘spoken language model’ is interpreted differently across research groups. In this talk, I will present my perspective on the current landscape of SLMs, with particular emphasis on their scaling behavior, evaluation methods, and their internal thinking process. Building on this foundation, we will discuss current SLM evaluations and what I believe should be the role of speech and audio in these models. I will conclude by introducing a novel data generation pipeline, alongside empirical evidence demonstrating that fine-tuning SLMs on this data leads to notable improvements in prosodic understanding, without compromising performance on other downstream tasks.
Bio:
Yossi Adi is an Assistant Professor at the School of Computer Science and Engineering at the Hebrew University of Jerusalem, and a Research Scientist on the FAIR team at Meta. Yossi completed his Ph.D. in computer science at Bar-Ilan University and is a recipient of the IAAI Best Doctoral Dissertation Award and the Alon scholarship. Yossi's research interests are in speech and language processing using machine learning and deep learning models. Yossi's research spans core machine learning and deep learning algorithms, their applications to spoken language processing, and the impact of the technology on social systems.
Abstract: LLMs have not only revolutionized text-based natural language processing, but their multimodal extensions have proven to be extremely powerful models for vision, speech, and natural sounds. By tokenizing input from disparate modalities and mapping them into the same input/output space as text, these models are capable of learning not only how to reason over multimodal inputs, but also generate new speech, audio, and visual outputs guided by text instructions - multiplying the number of capabilities these models have. A key to unlocking these new capabilities is the curation of new datasets that reflect the tasks to be learned, as well as new training methods that align with those tasks. In my talk, I will discuss several recent works in this direction from my lab at UT Austin.
The first part of my talk will describe our work on VoiceCraft, a neural codec language model capable of performing voice cloning text-to-speech synthesis, as well as targeted edits of speech recordings where words can be arbitrarily inserted, deleted, or substituted in the waveform itself. I will also describe multimodal extensions to this model that are trained to synchronize the generated speech with videos of talking heads, enabling the model to be used for video dubbing.
While voice cloning TTS has many use cases, in other circumstances users may want to control the vocal characteristics of a speech synthesis system with a text prompt, such as "a young woman with a British accent speaking in an authoritative tone." In the second part of my talk, I will discuss our recent work building ParaSpeechCaps, a dataset containing 2,700 hours of transcribed speech accompanied by rich stylistic captions. I will show experimental results demonstrating how we were able to use this dataset to train text-controllable TTS models that can not only manipulate basic attributes of the speech signal such as pitch and speaking rate, but also more abstract and higher-level vocal styles such as "husky", "nasal", "sleepy", and so forth.
Finally, I will discuss our work on spatial sound understanding. Sound event localization and detection is a classic task in the speech and audio community, and involves predicting the class of a sound source as well as localizing it (e.g. predicting the direction of arrival). We extend this task to encompass higher-level reasoning about multiple sources within a physical environment by proposing the SpatialSoundQA dataset. This dataset contains over 800,000 ambisonic waveforms and accompanying question-answer pairs, and evaluates models on their ability to answer natural language questions such as “Is the sound of the telephone further to the left than the sound of the barking dog?” I will also describe our BAT model, an extension of the LLaMA LLM that is capable of taking spatial audio recordings as input and reasoning about them using natural language.
Bio:
David Harwath is an assistant professor in the computer science department at UT Austin, where he leads the Speech, Audio, and Language Technologies (SALT) Lab. His group's research focuses on developing novel machine learning methods applied to speech, audio, and multimodal data for tasks such as automatic speech recognition, text-to-speech synthesis, and acoustic scene analysis. He has received the NSF CAREER award (2023), an ASRU best paper nomination (2015), and was awarded the 2018 George M. Sprowls Award for best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).
Abstract: The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device conversational agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this talk, I will revisit cascaded systems and present a novel low-latency variant that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. This cascaded system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) a state-action augmented LLM, (c) text-to-speech synthesis, (d) a neural vocoder, and (e) speaker modeling. The proposed cascaded system achieves sub-second response latency with complete on-device processing.
Bio:
Tatiana is a research scientist on the Machine Learning Research team at Apple. Prior to Apple, she was a postdoctoral research scientist on the speech recognition team at Facebook AI Research. Back in the day, Tatiana received a Ph.D. in mixed-type partial differential equations from Moscow State University. For several years she worked on applications of machine learning to high-energy physics at CERN before moving to deep learning. The main focus of her research in recent years has been speech recognition and generation, private federated learning, and general machine learning problems including optimization, scaling laws, and efficient architectures. More details can be found at https://github.com/tlikhomanenko.
Abstract: As audio interfaces move from novelty to necessity, generative models are beginning to transform how machines understand, produce, and reason over spoken language. This talk will explore the evolving landscape of audio in the AGI era. We will look at recent advances in contextual text-to-speech (TTS) and native audio dialogue systems that seamlessly integrate speech, reasoning, and interaction, as well as their productization in Gemini. On the research side, we will discuss post-training methodologies, including supervised fine-tuning (SFT), knowledge distillation, and reinforcement learning (RL), as well as emerging techniques for integrating thinking, tool use, and multi-turn audio reasoning into conversational agents.
Bio:
Mingqiu Wang is a Research Engineer at Google DeepMind, where she leads audio post-training in Gemini. She has co-led the launches of native audio generation models in Gemini, powering 1) real-time native audio dialog, 2) contextual speech synthesis, and 3) sophisticated audio understanding for products including Google Cloud (AI Studio and Vertex AI), Astra, Gemini Live, the Gemini App, and NotebookLM.
She also leads core research on advanced training recipes, from SFT to distillation to RLHF, as well as reasoning/thinking and agentic/tool-using capabilities for audio. Before joining DeepMind, she worked at Google Brain as a key contributor to Bard and developed the foundational speech-to-X model that became the audio backbone for Gemini 1.0.
Abstract: Audio analysis and audio synthesis require modeling long-term, complex phenomena and have historically been tackled in an asymmetric fashion, with specific analysis models that differ from their synthesis counterparts. In this presentation, we will introduce the concept of audio language models, a recent innovation aimed at overcoming these limitations. By discretizing audio signals using a neural audio codec, we can frame both audio generation and audio understanding as similar autoregressive sequence-to-sequence tasks, capitalizing on the well-established Transformer architecture commonly used in language modeling. This approach unlocks novel capabilities in areas such as textless speech modeling, zero-shot voice conversion, text-to-music generation, and even real-time spoken dialogue. Furthermore, we will illustrate how the integration of analysis and synthesis within a single model enables the creation of versatile audio models capable of handling a wide range of tasks involving audio as inputs or outputs. We will conclude by highlighting the promising prospects offered by these models and discussing the key challenges that lie ahead in their development.
Bio:
Neil is co-founder and Chief Modeling Officer of the Kyutai non-profit research lab. He was previously at Google DeepMind, where he started and led a team working on generative audio. Before that, Neil spent three years at Facebook AI Research, working on automatic speech recognition and audio understanding. He graduated with a PhD in machine learning from Ecole Normale Supérieure (Paris), and holds an MSc in machine learning from Ecole Normale Supérieure (Saclay) and an MSc in quantitative finance from Université Paris Dauphine. In parallel with his research activities, Neil teaches speech processing technologies at the École Normale Supérieure (Saclay).
Abstract: Audio generation technologies have advanced rapidly over the past three years. In addition to achieving higher quality, they have also become much more universal and controllable, both in the variety of audio they can generate and in the variety of inputs one may use for prompting. In this talk, I will discuss the keys that led to this breakthrough and dive into recent works from our team: Voicebox, Audiobox, and Movie Gen Audio. These works span speech generation, self-supervised pre-training for diffusion-style models, general audio generation, and video-conditioned audio generation.
Bio:
Wei-Ning Hsu is a Research Scientist leading multimodal audio generation and understanding efforts at Meta Superintelligence Lab. Previously, he led speech self-supervised learning efforts at FAIR. Prior to joining Meta full-time, he worked at MERL, Google Brain, and FAIR as a research intern. He received his Ph.D. and S.M. from the Massachusetts Institute of Technology in 2020 and 2018, respectively, and his B.S. from National Taiwan University in 2014.
He has led multiple pioneering efforts in large-scale unified audio generation, including Meta Movie Gen Audio, the first model capable of generating long-form cinematic soundtracks for videos with motion-synced sound and music, as well as Voicebox and Audiobox, the first multilingual text-to-speech generation models for high-fidelity voice and style cloning and text-based voice design, which power Meta AI Voice and Instagram Dubbing. He was also the inventor of HuBERT, which was the first speech tokenizer and served as the foundation of text-free spoken language models (GSLM, dGSLM, SpiritLM), text-free speech translation models (SeamlessExpressive and Hokkien-English Translation), and speech LLMs (Llama4, Moshi).
Note: Lightning talks will also be presented during the poster session on the same day.
Authors: Kang-wook Kim, Sehun Lee, Sang Hoon Woo, Gunhee Kim
Abstract:
One factor contributing to the performance discrepancy between large language models and spoken language models is the modality gap in their representations. To address this issue, we introduce SubAlign, the first speech tokenization framework to explicitly segment speech at the subword level corresponding to large language model vocabularies. Each resulting SubAlign unit is composed of the textual content, acoustic features, and duration associated with its respective subword. Building on this framework, we present SubAlign-SLM, a spoken language model trained on SubAlign units, and demonstrate the effectiveness of SubAlign on downstream tasks. Extensive automatic and human evaluations show that SubAlign-SLM surpasses baseline models, demonstrating the potential of SubAlign for speech processing applications.
Authors: Ju-Chieh Chou, Jiawei Zhou, Karen Livescu
Abstract:
Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details.
In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens.
We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
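To make the flow-matching objective concrete, here is a minimal editorial sketch of conditional flow matching: a small network predicts the velocity between a noise sample and the target acoustic vector, conditioned on embedded semantic tokens. The module names, dimensions, and single-token conditioning are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalFlowMatcher(nn.Module):
    """Minimal conditional flow-matching sketch (hypothetical dimensions)."""
    def __init__(self, n_semantic_tokens=500, dim=256, acoustic_dim=80):
        super().__init__()
        self.embed = nn.Embedding(n_semantic_tokens, dim)
        # Velocity network: takes the noisy acoustic frame, time t, and semantic context.
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + dim + 1, 512), nn.SiLU(),
            nn.Linear(512, acoustic_dim),
        )

    def loss(self, semantic_tokens, acoustic_target):
        # semantic_tokens: (batch,), acoustic_target: (batch, acoustic_dim)
        cond = self.embed(semantic_tokens)
        x0 = torch.randn_like(acoustic_target)          # noise sample
        t = torch.rand(acoustic_target.size(0), 1)      # random time in [0, 1]
        xt = (1 - t) * x0 + t * acoustic_target         # linear interpolation path
        target_velocity = acoustic_target - x0          # constant velocity along the path
        pred = self.net(torch.cat([xt, cond, t], dim=-1))
        return ((pred - target_velocity) ** 2).mean()

model = ConditionalFlowMatcher()
tokens = torch.randint(0, 500, (8,))
frames = torch.randn(8, 80)
print(model.loss(tokens, frames).item())
```

At inference time, the learned velocity field would be integrated from noise to produce the continuous acoustic representation for each generated semantic token.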
Authors: Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Abstract:
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. An anonymous demo is available at https://paraspeechcaps.github.io/ .
Authors: Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe
Abstract:
Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length speech input and computationally expensive S3Ms. In this work, we reduce both the attention window and the model size while preserving the effectiveness of DSUs. Our results demonstrate that we can reduce floating-point operations (FLOPs) by 50% with only a relative increase of 6.5% in character error rate (CER) on the ML-SUPERB 1h dataset. These findings highlight the potential of DSUs for real-time speech processing in resource-constrained environments.
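For readers unfamiliar with how discrete speech units are typically obtained, the sketch below clusters frame-level features with k-means and maps each frame of a new utterance to its nearest centroid. The random features, 768-dimensional size, and 100 clusters are placeholders; real DSUs would use features from a self-supervised model such as HuBERT or WavLM and the paper's own configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level features from a self-supervised speech model (S3M),
# e.g. one 768-dim vector per 20 ms frame.
rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 768)).astype(np.float32)

# Learn a codebook: each cluster centroid becomes one discrete speech unit (DSU).
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(features)

# Tokenize a new utterance by assigning every frame to its nearest centroid.
utterance = rng.normal(size=(300, 768)).astype(np.float32)
dsu_sequence = kmeans.predict(utterance)
print(dsu_sequence[:20])  # sequence of unit IDs, one per frame
```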
Authors: Martijn Bartelds, Ananjan Nandi, Moussa Koulako Bala Doumbouya, Dan Jurafsky, Tatsunori Hashimoto, Karen Livescu
Abstract:
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the diverse ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and, while motivated by multilingual ASR, offers the potential for reducing group disparities in other domains with similar challenges.
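The core idea of a smoothed worst-group weight update can be illustrated as follows. The exponentiated-gradient form, smoothing constant, and step size are assumptions for illustration; the paper's exact objective may differ.

```python
import torch

def update_group_weights(weights, group_losses, eta=0.1, smoothing=1.0):
    """Exponentiated-gradient-style group weight update with a smoothing term.

    The smoothing term damps the update for groups whose losses stay high,
    preventing the objective from fixating on consistently high-loss groups
    (illustrative of the idea behind CTC-DRO; constants and the exact form
    are assumptions, not the paper's).
    """
    # Larger loss -> larger weight, but the denominator tempers extreme losses.
    scaled = eta * group_losses / (smoothing + group_losses)
    new_weights = weights * torch.exp(scaled)
    return new_weights / new_weights.sum()

weights = torch.full((5,), 0.2)               # uniform over 5 language groups
group_losses = torch.tensor([2.0, 1.0, 0.5, 4.0, 0.8])
for _ in range(3):
    weights = update_group_weights(weights, group_losses)
print(weights)
```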
Authors: Sreyan Ghosh, Ramani Duraiswami
Abstract:
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Authors: Jingyi Chen, Ju-Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault
Abstract:
Diffusion-based text-to-speech (TTS) models have recently achieved remarkable success in generating high-fidelity, natural-sounding speech. These models synthesize audio by iteratively denoising a latent representation, allowing them to capture complex acoustic and prosodic patterns. However, their performance is often limited by inefficiencies in generation and a lack of alignment with human preferences—particularly in capturing natural intonation, rhythm, and expressiveness. Standard training objectives may not fully reflect the perceptual criteria that listeners use to judge speech quality, motivating the need for new fine-tuning strategies that incorporate human feedback.
To address this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), a novel reinforcement learning with human feedback (RLHF) framework for fine-tuning diffusion-based TTS models. DLPO introduces a reward formulation that combines human preference scores with the model’s original diffusion training loss. This approach aligns the reinforcement learning objective with the underlying generative structure of the diffusion model, enabling more stable and effective optimization. The inclusion of the original training loss as a regularizer serves two key roles: (1) it preserves the model’s ability to generate coherent and high-quality speech, and (2) it mitigates over-optimization to noisy or imperfect feedback signals, which are common in human evaluation of speech.
We apply DLPO to WaveGrad 2, a non-autoregressive diffusion TTS model designed for efficient waveform synthesis. WaveGrad 2 provides a strong foundation for testing RLHF strategies due to its streamlined architecture and high-quality baseline performance. In the DLPO framework, we use naturalness ratings from human evaluators to guide learning while maintaining consistency with the diffusion model’s generative prior. This dual-objective design allows DLPO to improve perceptual quality without sacrificing stability or intelligibility.
As illustrated in Figure 1, DLPO operates in a three-stage loop: (1) a pretrained diffusion model generates speech samples from text prompts; (2) a reward model assigns scalar-valued scores to the generated audio based on human preferences; and (3) the diffusion model is updated using a policy gradient objective that integrates both the reward signal and the diffusion loss. This iterative loop enables DLPO to progressively align synthesized speech with human judgments while preserving high audio fidelity and generative consistency.
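A minimal sketch of the reward formulation and policy-gradient step described above is given below, assuming per-sample log-probabilities, reward-model scores, and diffusion losses are already available. The variable names, the simple baseline, and the regularization weight are placeholders rather than the authors' training code.

```python
import torch

def dlpo_step(log_probs, preference_scores, diffusion_losses, beta=0.1):
    """One illustrative policy-gradient step in the spirit of DLPO.

    log_probs:         log-probability the diffusion policy assigns to each sample
    preference_scores: scalar human-preference rewards for the generated speech
    diffusion_losses:  the model's original denoising loss on the same samples
    beta:              weight of the diffusion-loss regularizer (assumed value)
    """
    # Combined reward: prefer what humans like, but stay close to the
    # generative objective so quality and stability are preserved.
    reward = preference_scores - beta * diffusion_losses
    advantage = reward - reward.mean()               # simple baseline
    loss = -(advantage.detach() * log_probs).mean()  # REINFORCE-style objective
    return loss

log_probs = torch.randn(16, requires_grad=True)
pref = torch.rand(16)          # stand-in for reward-model scores
diff_loss = torch.rand(16)     # stand-in for per-sample diffusion losses
loss = dlpo_step(log_probs, pref, diff_loss)
loss.backward()
print(loss.item())
```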
Authors: Kuan-Po Huang
Abstract:
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
Authors: Yuan Tseng, Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
Abstract:
Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appear in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure contamination impact, LLMs trained with/without contamination are compared. A contaminated LLM is more likely to generate test sentences it has seen during training. Then, speech recognisers based on LLMs are compared. They show only subtle error rate differences if the LLM is contaminated, but assign significantly higher probabilities to transcriptions seen during LLM training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
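The probability comparison at the heart of this analysis can be sketched with a standard causal LM: score each evaluation sentence by the total log-probability the model assigns to it, then compare scores between models trained with and without the contaminated data. The model name below is a placeholder, and this is not the authors' evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the study compares LLMs trained with and without contamination.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def sentence_logprob(text):
    """Total log-probability the LLM assigns to a candidate test sentence."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token; undo the averaging.
    return -out.loss.item() * (ids.size(1) - 1)

# A contaminated model is expected to assign noticeably higher log-probability
# to sentences it saw during pretraining than an uncontaminated one does.
print(sentence_logprob("this is a candidate evaluation sentence"))
```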
Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Abstract:
Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-thought (CoT) formulation, ensuring that training on conversational data remains closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our method achieves over 1.5 ROUGE-1 improvement over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets, while being compute-efficient enough to train on just 300 hours of public human-human conversation data, such as the Switchboard corpus. We will publicly release our models and training code.
Authors: Jaeyeon Kim, Heeseung Yun, Sang Hoon Woo, Chao-Han Huck Yang, Gunhee Kim
Abstract:
Large audio language models (LALMs) extend language understanding into the auditory domain, yet their ability to perform low-level listening, such as pitch and duration detection, remains underexplored.
However, low-level listening is critical for real-world, out-of-distribution tasks where models must reason about unfamiliar sounds based on fine-grained acoustic cues.
To address this gap, we introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. WoW-Bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom’s taxonomy, to assess the abilities to remember, understand, apply, and analyze sound events. For the Cognition benchmark, we additionally introduce distractor questions to evaluate whether models are truly solving problems through listening rather than relying on other heuristics. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.
Authors: Siqi Ouyang, Xi Xu, Lei Li
Abstract:
Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the historical speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates simultaneous speech translation (SST) as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy.
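As a rough illustration of why KV cache management matters for unbounded streams, the sketch below bounds the cache by keeping a short prefix plus the most recent positions. This generic sliding-window trick only stands in for the kind of strategy the paper motivates; the authors' actual approach may differ.

```python
import torch

def trim_kv_cache(past_key_values, max_len=1024, keep_prefix=64):
    """Bound a decoder KV cache for unbounded streaming inference.

    Keeps a short prefix (e.g. instructions / earliest context) plus the most
    recent positions, dropping the middle. Illustrative only.
    """
    trimmed = []
    for key, value in past_key_values:            # one (K, V) pair per layer
        seq_len = key.size(2)                     # shape: (batch, heads, seq, head_dim)
        if seq_len <= max_len:
            trimmed.append((key, value))
            continue
        recent = max_len - keep_prefix
        key = torch.cat([key[:, :, :keep_prefix], key[:, :, -recent:]], dim=2)
        value = torch.cat([value[:, :, :keep_prefix], value[:, :, -recent:]], dim=2)
        trimmed.append((key, value))
    return trimmed

# Toy cache: 2 layers, batch 1, 8 heads, 2000 cached positions, head dim 64.
cache = [(torch.randn(1, 8, 2000, 64), torch.randn(1, 8, 2000, 64)) for _ in range(2)]
cache = trim_kv_cache(cache)
print(cache[0][0].shape)  # torch.Size([1, 8, 1024, 64])
```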
Authors: Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
Abstract:
Automated Audio Captioning (AAC) aims to generate natural language descriptions of audio. Evaluating these machine-generated captions is a complex task, demanding an understanding of audio-scenes, sound-object recognition, temporal coherence, and environmental context. While existing methods focus on a subset of such capabilities, they often fail to provide a comprehensive score aligning with human judgment. Here, we introduce CLAIR-A, a simple and flexible approach that uses large language models (LLMs) in a zero-shot manner to produce a "semantic distance" score for captions. In our experiments, CLAIR-A more closely matches human ratings than other metrics, outperforming the domain-specific FENSE metric by 5.8% and surpassing the best general-purpose measure by up to 11% on the Clotho-Eval dataset. Moreover, CLAIR-A allows the LLM to explain its scoring, with these explanations rated up to 30% better by human evaluators than those from baseline methods.
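The zero-shot LLM-as-judge scoring can be sketched as a single prompt that asks for a numeric score and a short justification. The prompt wording, JSON format, and the llm callable are assumptions for illustration, not the CLAIR-A implementation.

```python
import json

PROMPT = """You are evaluating an automated audio captioning system.
Candidate caption: "{candidate}"
Reference captions: {references}
On a scale of 0 to 100, how semantically close is the candidate to the references?
Respond with JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def llm_judge_score(candidate, references, llm):
    """Zero-shot LLM-as-judge scoring in the spirit of CLAIR-A.

    `llm` is any callable mapping a prompt string to a completion string
    (an assumption here; the paper's exact prompt and backbone differ).
    """
    reply = llm(PROMPT.format(candidate=candidate, references=references))
    result = json.loads(reply)
    return result["score"] / 100.0, result["reason"]

# Stubbed LLM so the sketch runs end-to-end without an API key.
fake_llm = lambda prompt: '{"score": 82, "reason": "Both describe a dog barking outdoors."}'
score, reason = llm_judge_score("a dog barks twice", ["a dog barking outside"], fake_llm)
print(score, reason)
```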
Authors: Yi-Jen Shih, David Harwath, Alex Dimakis, Zoi Gkalitsiou
Abstract:
Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal and automated screening for stuttering, enabling speech pathologists to identify and follow up with patients who are most likely to be diagnosed with a stuttering speech disorder. Previous research in this area has predominantly focused on utterance-level detection, which is not sufficient for clinical settings where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.
Authors: Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath
Abstract:
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot text-to-speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://voicecraft-x.github.io/.
Authors: Zijia Liu, Xiaocheng Yang, Dilek Hakkani-Tür
Abstract:
Enhancing human-computer interaction to be as natural as human-to-human conversation requires models that can perceive and appropriately react to nuanced emotional cues. A primary challenge lies in recognizing emotion from a user's speech and generating empathetic responses in real-time to maintain conversational flow. To address this, we propose a novel multimodal large speech-language model capable of real-time emotion tracking and empathetic responding within multi-turn dialogues. Our method utilizes a three-stage training framework to progressively build the model's capabilities. The initial stage involves supervised finetuning on text and emotion-based objectives. This is followed by an unsupervised finetuning stage to align speech and text embeddings. The final stage employs supervised multi-turn finetuning to effectively process conversational history and context. The model's architecture integrates a speech encoder and a large language model, and is trained on benchmark multimodal datasets including MELD, CMU-MOSEI, and IEMOCAP. This work contributes an end-to-end solution for developing more socially aware and emotionally intelligent conversational agents.
Authors: Kehinde Abdulsalam Elelu, Joshua E Siegel, Saffary Ali, Duc Hung Luong, Babatunde Simeon, Ebuka Okpala
Abstract:
Evaluating the perceptual quality of AI-generated music remains a challenge in music information retrieval and computational creativity applications. Approaches such as those adopted in the MusicEval and AudioMOS challenges primarily rely on CLAP, a contrastive audio-text model, to extract embeddings for Mean Opinion Score (MOS) prediction. While CLAP excels at coarse audio-text alignment, it struggles to capture fine-grained musical attributes such as timbral richness, rhythmic precision, and structural coherence, leading to suboptimal alignment with expert human evaluations. We introduce ConvM2D2, a novel dual-branch neural architecture that leverages M2D2, a second-generation masked modeling framework, as the upstream audio encoder for MOS prediction. M2D2 is trained to reconstruct masked audio segments, enabling it to capture temporally and acoustically detailed features that more closely reflect human perceptual criteria. The ConvM2D2 model processes audio and text embeddings jointly through specialized convolutional and multi-layer perceptron pathways to predict both Overall Musical Quality and Textual Alignment scores. We evaluate ConvM2D2 on the MusicEval benchmark, comparing its performance against other models, and achieve improvements across all evaluation metrics (MSE, LCC, SRCC, and KTAU) at both the utterance and system level. ConvM2D2 reaches a system-level LCC of 0.964 and reduces MSE by 88% compared to the baseline, demonstrating strong alignment with human judgments across both overall musical quality and textual alignment tasks. This improvement indicates that ConvM2D2 evaluates AI-generated music much more like a musical expert would, making it easier to identify, improve, and recommend better-sounding music.
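To picture the dual-branch design in general terms, the sketch below fuses a convolutional pathway over frame-level audio embeddings with an MLP over a pooled text embedding to predict the two scores. All dimensions and layer choices are assumptions; the actual ConvM2D2 architecture built on M2D2 embeddings is more elaborate.

```python
import torch
import torch.nn as nn

class DualBranchMOSPredictor(nn.Module):
    """Rough sketch of a dual-branch MOS predictor: a convolutional branch over
    audio embeddings plus an MLP branch over a text embedding, fused to predict
    Overall Musical Quality and Textual Alignment (dimensions are assumptions)."""
    def __init__(self, audio_dim=768, text_dim=512, hidden=256):
        super().__init__()
        self.audio_branch = nn.Sequential(
            nn.Conv1d(audio_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, audio_frames, text_embedding):
        # audio_frames: (batch, frames, audio_dim); text_embedding: (batch, text_dim)
        a = self.audio_branch(audio_frames.transpose(1, 2)).squeeze(-1)
        t = self.text_branch(text_embedding)
        quality, alignment = self.head(torch.cat([a, t], dim=-1)).unbind(-1)
        return quality, alignment

model = DualBranchMOSPredictor()
q, al = model(torch.randn(4, 500, 768), torch.randn(4, 512))
print(q.shape, al.shape)  # torch.Size([4]) torch.Size([4])
```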
Authors: Nevasini Sasikumar
Abstract:
We propose TriStream-Omni, a novel architecture that extends LLaMA-Omni2-0.5B's speech-language capabilities to include vision processing while maintaining sub-600ms latency. Our approach introduces three groundbreaking innovations:
First, we implement Sparse Temporal Vision Encoding (STVE), which processes visual inputs through a lightweight MobileViT backbone with temporal pooling, reducing computational overhead by 73% compared to traditional vision transformers. STVE extracts only salient visual tokens using learned importance masks, dynamically adjusting token density based on image complexity.
Second, our Asynchronous Tri-Modal Fusion (ATF) mechanism enables parallel processing of speech, text, and vision streams through independent encoding pathways that converge via learned routing weights. Unlike conventional sequential processing, ATF employs a novel "fusion-on-demand" strategy where modalities are combined only when cross-modal reasoning is required, preserving the model's original 583ms speech latency for audio-only queries.
Third, we introduce Cascaded Mixture-of-Experts (CMoE) routing, where specialized expert networks handle different modal combinations. Each expert (speech-only, vision-only, speech-vision, full tri-modal) is activated based on input characteristics, allowing the 0.5B model to achieve performance comparable to 3B parameter models. The cascade design processes simple queries through lightweight experts first, engaging complex tri-modal experts only when necessary, reducing average compute by 67%.
Authors: Jingyi Chen, Pichao Wang, Andrew Perrault, Micha Elsner
Abstract:
Speech-to-speech emotion transfer modifies the emotional tone of a speech signal while preserving its linguistic content and speaker identity. This task is vital for expressive speech synthesis, emotionally adaptive voice conversion, and naturalistic human-computer interaction. However, models trained directly on real speech often struggle to generalize due to high variability in emotional expression, caused by factors such as speaker individuality, cultural norms, and contextual nuances. These challenges make it difficult for models to disentangle and manipulate emotional signals without compromising speech naturalness or identity.
In this work, we introduce EmoTransfer, an end-to-end audio-conditioned emotion transfer model that bypasses the need for text input and directly operates on acoustic features. To overcome the limitations of real-world data, we propose a curriculum learning paradigm in which the model is first trained on synthetic speech generated by a controllable text-to-speech (TTS) system with predefined emotional attributes. This phase provides a low-noise, highly structured environment for learning foundational emotion representations. The model is then fine-tuned on a progressively mixed dataset of real and synthetic speech to adapt to the complexity of real-world emotional variation.
Our experiments show that curriculum learning significantly enhances the model’s generalization ability. EmoTransfer achieves a top-1 emotion similarity of 98% in the same-speaker/text setting and 97% in cross-speaker/text conditions, outperforming state-of-the-art baselines. Moreover, it receives the highest Mean Opinion Score (MOS) across emotion naturalness and speaker similarity in subjective evaluations.
To analyze how curriculum learning shapes emotion representation, we perform a t-SNE analysis on the latent emotion embeddings across four training settings (Figure 1). The model trained solely on synthetic data transfers emotion well within synthetic speech but exhibits poor clustering and generalization to real speech. Conversely, the model trained only on real speech displays entangled and poorly separated emotional clusters, indicating difficulty in learning distinct emotion patterns. When trained on a mixed dataset with 60% real and 40% synthetic data, the emotional embeddings begin to separate more clearly. Finally, fine-tuning on a dataset with 80% real and 20% synthetic speech results in well-structured and clustered emotion representations in t-SNE space, corresponding to improved emotion transfer performance. These findings underscore the effectiveness of curriculum learning in bridging the gap between synthetic and real speech and establishing robust emotional representations.
In addition to the proposed methodology, we contribute: (1) EmoTransfer, a novel audio-conditioned speech emotion transfer model; (2) a curriculum learning framework that strategically combines synthetic and real data; (3) an in-depth analysis of emotion representation learning using t-SNE visualizations; and (4) a publicly available dataset of 27 speakers, 9 emotions, and 27,000 audio samples, released at https://huggingface.co/datasets/anonymousforemotion/ET. Demos and source code are available at https://demopagea.github.io/EmoTransfer-demo/ and https://anonymous.4open.science/r/EmoTransfer-F882/.
Authors: Jiatong Shi, Bo-Hao Su, Shikhar Bharadwaj, Yiwen Zhao, Shih-Heng Wang, Jionghao Han, Haoran Wang, Wei Wang, Wenhao Feng, Yuxun Tang, Siddhant Arora, Jinchuan Tian, William Chen, Hye-jin Shim, Wangyou Zhang, Wen-Chin Huang, Shinji Watanabe
Abstract:
We present VERSA-v2, a major upgrade of the Versatile Evaluation of Speech and Audio (VERSA) toolkit for standardized and scalable evaluation across speech, audio, and music tasks. It features a modular, object-oriented architecture that simplifies metric integration and now supports over 100 metrics, organized into curated task-specific packs. VERSA-v2 also introduces interactive visualizations, per-metric profiling, and prompt-based evaluation using both text- and audio-based large language models (LLMs). These advancements make VERSA-v2 a robust, extensible, and LLM-enabled platform for comprehensive and interpretable speech and audio evaluation.
Authors: William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe
Abstract:
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. Scaling to larger models can improve ASR performance across the board, in both low and high resource languages, improving the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d for future studies.
Authors: Chibuzor Okocha
Abstract:
Recent advances in multimodal and speech-native large language models (LLMs) have delivered impressive speech recognition, translation, understanding, and question-answering capabilities for high-resource languages. However, African languages and non-native French or English accents remain dramatically underrepresented in benchmarks, limiting the understanding and applicability of leading LLMs for millions of francophone and anglophone users in low-resource settings. We present AfriVox, an open-source benchmark (including novel domain-specific and unscripted datasets) across 20 African languages, African-accented French, Arabic, and 100+ African English accents, contrasting leading multimodal speech LLMs with traditional unimodal automatic speech recognition (ASR) and speech translation (AST) models. Our analysis reveals significant language coverage variation, surprising LLM translation performance gains (e.g., Gemini), robustness concerns with unscripted speech, and substantial performance disparities for "supported" African languages. We profile the strengths, limitations, and language support of each model, and conduct the first targeted fine-tuning of a modern speech LLM (Qwen2.5-Omni) for three Nigerian languages, exceeding SOTA and achieving up to 54% relative WER reduction and significant BLEU gains, offering practical guidance for implementers seeking to serve local language users.
Authors: Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Sai Adupa, Lekha Bollinani, Hafiz Malik
Abstract:
Recent advances in speech synthesis have introduced unprecedented challenges in maintaining voice authenticity, particularly concerning public figures who are frequent targets of impersonation attacks. This paper presents a comprehensive methodology for collecting, curating, and generating synthetic speech data for political figures, along with a detailed analysis of the challenges encountered. We introduce a systematic approach that incorporates an automated pipeline for collecting high-quality bona fide speech samples, featuring transcription-based segmentation that significantly improves the quality of synthetic speech. We experimented with various synthesis approaches, from single-speaker to zero-shot synthesis, and documented the evolution of our methodology. The resulting dataset comprises bona fide and synthetic speech samples from ten public figures, demonstrating superior quality with an NISQA-TTS naturalness score of 3.69 and the highest human misclassification rate of 61.9%.
Authors: Jinchuan Tian, William Chen, Yifan Peng, Jiatong Shi, Siddhant Arora, Shikhar Bharadwaj, Takashi Maekaku, Yusuke Shinohara, Keita Goto, Xiang Yue, Chao-Han Huck Yang, Shinji Watanabe
Abstract:
This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) with up to 7B parameters.
Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens.
We demonstrate that our OpusLMs achieve comparable (or even superior) performance to existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities.
Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection.
The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
Authors: Chung-Ming Chien, Karen Livescu
Abstract:
Research about joint speech-text generation with language models has gained significant interest in recent years. These models aim to leverage the content generation capabilities acquired through text-based pre-training to improve long-context coherence in speech generation, a known challenge for pure speech models such as generative spoken language models (GSLMs). Additionally, information from the speech modality can provide valuable insights that do not exist in written language, potentially enhancing the model's capabilities in understanding and generating language. However, adapting pre-trained text-based language models to handle new sequence formats, often consisting of interleaved text and speech tokens, requires substantial training data and computational resources. In this research, we explore the possibility of decomposing the task into two parts, each handled by a model focused on a specific modality—one for text and one for speech. While both models have access to information from both modalities, they remain focused on generation within their respective domains. By avoiding the need to adapt models to new sequence formats, we aim to reduce the computational costs and resources required to develop joint speech-text generation frameworks, with the goal of facilitating the development of speech conversation systems using academic-level resources in the future.
Authors: Prasanth
Abstract:
The Conformer architecture has set a high standard in automatic speech recognition (ASR) by effectively combining convolutional neural networks with multi-head self-attention modules, enabling the modeling of both local and global dependencies. However, the quadratic computational and memory complexity of standard multi-head self-attention limits the scalability of Conformer models, especially for long audio sequences and real-time applications. In this work, we propose integrating Multi-Head Latent Attention (MLA), a low-rank attention approximation, into the Conformer encoder to reduce complexity without sacrificing performance. MLA introduces a fixed number of latent vectors that mediate attention computation, reducing the attention cost from O(n^2) to O(nk), where k << n. We describe the architectural modifications for seamless integration and present comprehensive experiments on the LibriSpeech dataset. Our MLA-Conformer achieves word error rates of 2.3% and 4.7% on the test-clean and test-other subsets, respectively, compared to the baseline Conformer's 2.1% and 4.3%. These results demonstrate that MLA-Conformer provides an effective trade-off between efficiency and accuracy, making it suitable for deployment in resource-constrained and real-time speech recognition scenarios.
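A minimal sketch of attention mediated by a fixed set of k latent vectors is shown below: the latents first summarize the n input frames, and each frame then reads back from the latents, so both steps cost O(nk) rather than O(n^2). Layer sizes and the exact two-step routing are illustrative, not the MLA-Conformer implementation.

```python
import torch
import torch.nn as nn

class LatentAttention(nn.Module):
    """Attention mediated by k learned latent vectors (cost O(n*k) instead of O(n^2))."""
    def __init__(self, dim=256, num_heads=4, num_latents=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (batch, n, dim)
        lat = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        # Step 1: k latents summarize the n input frames (cost ~ n*k).
        summary, _ = self.pool(lat, x, x)
        # Step 2: each frame reads back from the k latents (cost ~ n*k).
        out, _ = self.broadcast(x, summary, summary)
        return out

frames = torch.randn(2, 1000, 256)                 # 1000 acoustic frames
print(LatentAttention()(frames).shape)             # torch.Size([2, 1000, 256])
```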
Authors: Takyoung Kim, Dilek Hakkani-Tür
Abstract:
This research focuses on the development of explainable and automatic interruption criteria for conversational AI systems. Current half-duplex dialogue agents often struggle with turn-taking, resulting in interactions that feel unnatural or inefficient. To address this, we propose a framework that enables AI to identify and execute "good interruptions": interruptions that are timely, context-aware, and perceived as cooperative rather than disruptive. Crucially, these interruptions are grounded in observable conversational cues, ensuring that the system's behavior is both interpretable and justifiable.
By formalizing the conditions under which an interruption is appropriate, this work supports large-scale dialogue generation systems with transparent decision-making processes. A key application is in language support scenarios. For instance, when a non-native speaker hesitates while searching for a word, an AI agent equipped with our model could recognize the pause, infer the context, and offer the appropriate term in a way that feels helpful rather than intrusive. Similarly, in educational settings, an AI tutor could provide real-time feedback or clarification without breaking the conversational flow.
Ultimately, this research advances the goal of more natural and trustworthy human-AI interaction by integrating explainability directly into the mechanics of conversational behavior, with particular attention to the nuanced dynamics of interruption. This is expected to reduce an excessive dependency on black-box proprietary language models for generating interruptive behavior.
Authors: Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe
Abstract:
Speech foundation models achieve strong generalization across languages and acoustic conditions, but require significant computational resources for inference. In the context of speech foundation models, pruning techniques have been studied that dynamically optimize model structures based on the target audio by leveraging external context. In this work, we extend this line of research and propose context-driven dynamic pruning, a technique that optimizes model computation at inference time based on both the context across input frames and additional external context. We employ the Open Whisper-style Speech Model (OWSM) and incorporate speaker embeddings, acoustic event embeddings, and language information as additional context. By incorporating the speaker embedding, our method achieves a reduction of 56.7 GFLOPs while improving BLEU scores by a relative 26.1% compared to the fully fine-tuned OWSM model.
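One way to picture context-driven pruning is a small gate that maps an external context embedding (for example, a speaker embedding) to per-block keep/skip decisions. The sketch below is purely illustrative; the OWSM-based method, its gating granularity, and its training are more involved.

```python
import torch
import torch.nn as nn

class ContextPruningGate(nn.Module):
    """Predicts per-block keep probabilities from an external context embedding.

    Illustrative only: a speaker/event/language embedding is mapped to gates
    deciding which encoder blocks to execute at inference time.
    """
    def __init__(self, context_dim=192, num_blocks=12):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(context_dim, 128), nn.ReLU(),
            nn.Linear(128, num_blocks),
        )

    def forward(self, context, threshold=0.5):
        keep_prob = torch.sigmoid(self.scorer(context))   # (batch, num_blocks)
        return keep_prob > threshold                       # boolean execution mask

gate = ContextPruningGate()
speaker_embedding = torch.randn(1, 192)    # e.g. from an x-vector extractor
mask = gate(speaker_embedding)
print(mask)                                # blocks marked False would be skipped
```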
Authors: Zhu Zhu, Ling Sun, Shuju Shi
Abstract:
Speech foundation models like Whisper have set new benchmarks in various ASR tasks, but they often underperform for second-language (L2) learners due to accent variation, disfluencies, and mispronunciations — speech characteristics that are underrepresented during current model pretraining. In this study, we examine how to adapt large speech foundation models—specifically Whisper—to better serve L2 speakers.
Our framework begins with fine-grained error analysis across speaker proficiency levels, which identifies systematic failure modes such as hesitation insertions and high deletion, insertion, and substitution rates in low-proficiency groups. This motivates adaptation strategies that explicitly account for proficiency-driven variation in L2 speech. Based on these insights, we implement: (1) parameter-efficient multitask learning via LoRA to jointly model transcription and speaker proficiency, and (2) targeted data augmentation simulating disfluency patterns to mitigate recognition bias toward fluent speech.
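A minimal sketch of the parameter-efficient setup, assuming the Hugging Face transformers and peft libraries: LoRA adapters are attached to Whisper's attention projections, and a small auxiliary head (hypothetical here) could be trained for proficiency prediction alongside transcription. The rank, target modules, and head are assumptions, not the authors' recipe.

```python
import torch.nn as nn
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
d_model = base.config.d_model

# LoRA adapters on the attention projections; only these (plus any task heads) train.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Hypothetical auxiliary head for proficiency prediction, applied to pooled
# encoder states in a multitask training loop (loop not shown).
proficiency_head = nn.Linear(d_model, 3)  # e.g. low / mid / high proficiency
```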
Preliminary results show that our proficiency-aware multitask model reduces WER across all proficiency levels, with the largest absolute improvement of 4.7% observed in the low proficiency group.
Building on our current framework, we plan to explore several extensions to further enhance adaptation for low-proficiency L2 speech. These include prompt-based decoding with speech-aware LLMs and N-best hypothesis reranking using both phoneme- and word-level representations. We will also investigate dynamic thresholding mechanisms to better handle hesitation phenomena during decoding. These directions aim to expand the adaptability and interpretability of our pipeline, and provide deeper insights into modeling underrepresented L2 speaker populations.
Authors: Qianyi He, Yuan Chang Leong, Monica Rosenberg
Abstract:
The Algonauts 2025 Challenge called on the community to develop encoding models that predict whole-brain fMRI responses to naturalistic multimodal movies. In this submission, we propose a sequence-to-sequence Transformer that autoregressively predicts fMRI activity from visual, auditory, and language inputs. Stimulus features were extracted using pretrained models including VideoMAE, HuBERT, Qwen, and BridgeTower. The decoder integrates information from prior brain states and current stimuli via dual cross-attention mechanisms that attend to both perceptual information extracted from the stimulus and narrative information provided by high-level summaries of the content. One core innovation of our approach is the use of sequences of multimodal context to predict sequences of brain activity, enabling the model to capture long-range temporal structure in both stimuli and neural responses. Another is the combination of a shared encoder with a partially subject-specific decoder, which leverages common representational structure across subjects while accounting for individual variability. Our model achieves strong performance on both in-distribution and out-of-distribution data, demonstrating the effectiveness of temporally-aware, multimodal sequence modeling for brain activity prediction. The code is available at https://github.com/Angelneer926/Algonauts_challenge.
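The released code is linked above; purely as an illustration of the described decoder, the block below sketches a layer with causal self-attention over past brain states plus two cross-attention streams, one over stimulus features and one over narrative-summary embeddings (dimensions and layer layout are our assumptions, not the repository's exact architecture).

```python
# Illustrative dual-cross-attention decoder block (our reading of the abstract).
import torch
import torch.nn as nn

class DualCrossAttnBlock(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.stim_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.narr_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, brain, stim, narrative, causal_mask=None):
        # Autoregressive self-attention over previous fMRI time points.
        h = self.norms[0](brain + self.self_attn(brain, brain, brain,
                                                 attn_mask=causal_mask)[0])
        # Cross-attend to perceptual features extracted from the movie stimulus.
        h = self.norms[1](h + self.stim_attn(h, stim, stim)[0])
        # Cross-attend to high-level narrative summaries of the content.
        h = self.norms[2](h + self.narr_attn(h, narrative, narrative)[0])
        return self.norms[3](h + self.ffn(h))
```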
Authors: Chin-Jou Li
Abstract:
We present **POWSM**, a multitask speech foundation model for phonetic transcription. Trained from scratch on 17k hours of multilingual speech from the IPAPack++ dataset, POWSM jointly learns tasks including phone recognition, ASR, and audio-guided phoneme-to-grapheme (P2G) and grapheme-to-phoneme (G2P) mappings.
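As a rough illustration of how such multitask conditioning is often realized in Whisper/OWSM-style models (the token names below are hypothetical, not POWSM's actual vocabulary), a task tag in the decoder prompt selects which of the four tasks the shared model performs:

```python
# Hedged sketch of task-token conditioning for a multitask phonetic model.
# Token strings are assumptions for illustration only.
TASK_TOKENS = {
    "phone_recognition": "<|pr|>",
    "asr": "<|asr|>",
    "p2g": "<|p2g|>",   # audio-guided phoneme-to-grapheme
    "g2p": "<|g2p|>",   # audio-guided grapheme-to-phoneme
}

def build_decoder_prompt(task: str, lang: str = "<|unk|>") -> str:
    """Return the text prefix fed to the decoder before the target tokens."""
    return f"<|startoftranscript|>{lang}{TASK_TOKENS[task]}"

# e.g. build_decoder_prompt("p2g") -> "<|startoftranscript|><|unk|><|p2g|>"
```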
Preliminary results show that training from scratch outperforms fine-tuning Whisper, and that multitask learning improves both phone error rate (PER) and articulatory feature edit distance (PFER). Future directions include analyzing the benefits and trade-offs of multitask learning, and scaling to additional tasks to further enhance phonetic alignment and generalization.
Authors: Massa Baali, Shuo Han, Syed Abdul Hannan, Soham Deshmukh, Rita Singh, Bhiksha Raj
Abstract:
Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning, allowing it to generate detailed captions from speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition. The code is available at: https://github.com/massabaali7/CoLMbo
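One common way to realize this kind of prompt-based conditioning, sketched below with assumed names and dimensions rather than CoLMbo's actual code, is to project the speaker embedding into the language model's embedding space and prepend it to the embedded prompt before generation.

```python
# Minimal speaker-conditioned prompting sketch (illustrative assumptions only).
import torch
import torch.nn as nn

class SpeakerPrefix(nn.Module):
    def __init__(self, spk_dim: int = 192, lm_dim: int = 768, n_prefix: int = 4):
        super().__init__()
        # Map one speaker embedding to a few "soft prompt" vectors.
        self.proj = nn.Linear(spk_dim, lm_dim * n_prefix)
        self.n_prefix, self.lm_dim = n_prefix, lm_dim

    def forward(self, spk_emb, prompt_embeds):
        # spk_emb: (B, spk_dim); prompt_embeds: (B, T, lm_dim)
        prefix = self.proj(spk_emb).view(-1, self.n_prefix, self.lm_dim)
        # The concatenated sequence is fed to a decoder-only LM via its
        # input embeddings so it can describe dialect, gender, age, etc.
        return torch.cat([prefix, prompt_embeds], dim=1)
```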
Authors: Wenchen Shi
Abstract:
Code-switching automatic speech recognition (CS-ASR) remains a technological challenge, especially for low-resource language pairs. Jopara, the translanguaging mode common in Paraguayan daily life, is still understudied in both sociolinguistic and computational research. This project aims to develop the first CS-ASR system for Spanish-Guarani speech (Jopara) using a hybrid Wav2Vec2-Transformer architecture and a novel Language Alignment Loss (LAL) framework.
Our approach introduces two key innovations: (1) adapting LAL to a Romance-Indigenous language pair by focusing on acoustic differences, and (2) implementing a semi-automated, time-efficient pipeline for token-level language tagging that combines preprocessing with the OpenAI API and manual correction. Crucially, this approach alleviates the data scarcity problem that often limits low-resource code-switching research, since it requires only intonation-unit-level language annotation rather than costly and labor-intensive frame-level labeling. This reduces annotation and forced-alignment effort while still enabling effective model supervision.
So far, we have successfully completed tokenization and language labeling of code-switched audio, created gold-standard language tags, and benchmarked multilingual ASR baselines such as facebook/mms-1b-all, which yielded high error rates (WER 86.74%, CER 49.37%), highlighting the difficulty of the task.
Ongoing work focuses on several fronts. We are completing the integration of LAL into model training so that cross-attention can assign accurate frame-level language labels from the token-level annotations. Training optimizes a combined loss function:
$$
\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{CTC}} + (1-\alpha) \mathcal{L}_{\text{Attn}} + \beta \mathcal{L}_{\text{LAL}}
$$
where $\alpha$ balances the CTC and attention objectives and $\beta$ weights the LAL term. We are also tuning the batch size, label smoothing, and learning rate schedule for optimal performance.
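A minimal sketch of this objective, with placeholder weights and helper names of our choosing:

```python
# Hedged sketch of the combined objective above. ctc_loss and attn_loss are the
# usual hybrid CTC/attention terms; lal_loss penalizes disagreement between the
# frame-level language assignments derived from cross-attention and the
# token-level language tags. alpha and beta are placeholders to be tuned.
def combined_loss(ctc_loss, attn_loss, lal_loss, alpha=0.3, beta=0.1):
    return alpha * ctc_loss + (1.0 - alpha) * attn_loss + beta * lal_loss
```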
In addition, we are experimenting with alternative encoder architectures, such as the Conformer model, to better capture the acoustic variability in code-switched speech. Planned post-processing includes evaluating the impact of language-aware large language models (LLMs) for n-best hypothesis rescoring and error correction on ASR outputs.
Authors: Chutong Meng, Antonios Anastasopoulos
Abstract:
Recent advances in multilingual automatic speech recognition (ASR), driven by models such as Whisper, MMS, SeamlessM4T, OWSM, and OWLS, have extended the number of supported languages to 100+ and the number of model parameters to 1B+.
While scaling up model and data sizes helps alleviate the curse of multilinguality and improves overall performance, several issues linger. In particular, we explore whether massively multilingual ASR models suffer from optimization imbalance across languages, and whether model performance can be further boosted by resolving such imbalance.
Importantly, any solution must be efficient and scalable with respect to both the number of model parameters and the number of languages.
The curse of multilinguality has been a long-standing challenge in multilingual natural language and speech processing. Put simply, when model capacity is fixed, supporting more languages can hurt performance on some of them, because languages compete for the available capacity.
Apart from increasing model capacity, another possible direction is to proactively resolve optimization imbalance between languages.
Existing approaches include adding language-specific modules (adapters, MoE) to reduce negative transfer between languages, and applying multi-task learning algorithms that better balance the training dynamics across languages.
However, training language-specific modules can be troublesome as the number of languages grows.
Grouping "similar" languages can partly resolve the issue, but it raises a new question of how to find optimal groups.
Therefore, this work focuses on the multi-task learning aspect.
From a multi-task learning perspective, multilingual ASR can be framed as learning multiple tasks simultaneously, with each language treated as a separate task.
In this view, task imbalance may arise when some tasks/languages are severely under-optimized.
To address this, a series of optimization algorithms have been proposed.
Some modify the gradients to resolve gradient conflicts (PCGrad, Gradient Vaccine), while others reweight task losses to balance optimization (FAMO, GEO4Align).
To the best of our knowledge, few works have applied these methods in the context of multilingual ASR.
Moreover, most of the optimization methods do not scale well, as they introduce significant memory and/or computation overhead when the number of tasks and/or the model size increases. This makes them difficult to apply in multilingual ASR, where 100+ tasks are present and models have 1B+ parameters.
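For reference, the projection step of PCGrad mentioned above can be sketched as follows, using flattened per-language gradients; this is an illustration of the method, not an efficient implementation, and storing one such gradient per language is precisely the overhead that makes naive application difficult at 1B+ parameters.

```python
# Sketch of the PCGrad projection step for per-language gradients.
import random
import torch

def pcgrad(per_task_grads):
    """per_task_grads: list of 1-D tensors, one flattened gradient per language."""
    projected = []
    for g_i in per_task_grads:
        g = g_i.clone()
        others = [g_j for g_j in per_task_grads if g_j is not g_i]
        random.shuffle(others)                 # project against others in random order
        for g_j in others:
            dot = torch.dot(g, g_j)
            if dot < 0:                        # conflict: remove component along g_j
                g = g - (dot / (g_j.norm() ** 2 + 1e-12)) * g_j
        projected.append(g)
    # Sum of projected gradients gives the combined update direction.
    return torch.stack(projected).sum(dim=0)
```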
In this project, we:
- Investigate whether language/task imbalance exists in massive multilingual ASR models by examining gradient conflicts and training dynamics.
- Incorporate multi-task optimization methods into multilingual ASR training, and evaluate their impact on both overall and per-language performance.
- Explore the development of more efficient alternatives if the current methods are not suitable for multilingual ASR.
- Evaluate scalability of the methods with respect to the number of languages, data size, and model size.
- Explore broader applications: if this direction proves effective, we will consider extending it to multilingual self-supervised training, as well as multilingual multi-task models with ASR and speech translation support.
Authors: Jiamin Yang, Marcelo Beramendi Caballero, Karen Livescu
Abstract:
In speech models, CNNs are widely used as local feature extractors. Recent work has shown that representations across different models seem to be converging, even when trained on different data. We hypothesize that the CNN output distributions of different speech models are highly similar, suggesting that they could be replaced by a single universal feature extractor. Additionally, since previous work has shown that the convolutional layers account for 33% of the multiply-accumulate operations in the entire forward computation, there is room to improve the efficiency of this universal model. We offer indicative support for the hypothesis through similarity analysis, and develop a simple three-layer model as the universal feature extractor through distillation from the transformer encoder inputs (i.e., the CNN outputs) of HuBERT-base, Data2vec-base, and WavLM-base. Tested on SUPERB, the model largely retains the performance of the three vanilla teacher models while achieving a 20x reduction in memory usage and a 10x decrease in runtime.
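A hedged sketch of such a distillation setup (layer sizes, strides, and loss are our assumptions, not the paper's exact recipe): a three-layer CNN student regresses onto the frame-level outputs of the frozen teacher front-ends, with one projection head per teacher to absorb dimensional differences.

```python
# Illustrative student CNN and multi-teacher distillation loss (assumed design).
import torch
import torch.nn as nn

class StudentCNN(nn.Module):
    def __init__(self, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        # Strides chosen so the overall downsampling (8*5*8 = 320 samples)
        # roughly matches the 20 ms frame rate of wav2vec2-style front-ends.
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=10, stride=8), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=5), nn.GELU(),
            nn.Conv1d(hidden, out_dim, kernel_size=8, stride=8), nn.GELU(),
        )

    def forward(self, wav):                  # wav: (B, samples)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)   # (B, frames, out_dim)

def distill_loss(student_feats, teacher_feats, heads):
    """teacher_feats: dict name -> (B, T, D_teacher); heads: dict name -> nn.Linear."""
    total = 0.0
    for name, target in teacher_feats.items():
        T = min(student_feats.size(1), target.size(1))       # align frame counts
        total = total + nn.functional.mse_loss(heads[name](student_feats[:, :T]),
                                               target[:, :T])
    return total
```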