AAAI 2026 Workshop 📍 Singapore 📅 26 January 2026
About
Topics of Interest
The topics of interest include, but are not limited to:
• Robust audio and multimodal scene analysis under real-world constraints (asynchronous microphones, reverberant environments, video-audio desynchronization)
• Data augmentation and synthetic generation for spatial audio and video-audio learning
• Multimodal large-scale models (LLMs, VLMs) for audio-language(-video) retrieval and understanding
• Perceptual representation learning for 3D sound event localization, detection, and audio-visual grounding
• Foundation models and adaptation for audio, speech, and video-audio tasks
• Speech, audio, and video-conditioned generation (avatars, dubbing, cross-modal synthesis)
• Audio-visual scene understanding and reasoning
• Audio and speech quality assessment: evaluation, metrics, and perception-driven benchmarks
• Multimodal safeguards and robustness in audio-video-language modeling
• Efficient evaluation frameworks for reasoning, generation, and multimodal integration (LLM + audio + video)
• Benchmarking dataset creation and sharing across audio, speech, and video modalities
• Applications of microphone array processing and virtual microphone techniques in multimodal systems
• Real-world applications: entertainment & media (karaoke, video avatars), manufacturing (process monitoring), sustainability (forest restoration, biodiversity monitoring), education (classroom video-audio analysis), healthcare (elderly care monitoring), security (robotic patrolling, surveillance with video-audio fusion)
Submission Requirements
Please prepare your submission using the AAAI template available at https://aaai.org/conference/aaai/aaai-26/main-technical-track-call/. Full papers should not exceed 8 pages (excluding references), while short papers are limited to 4 pages (excluding references). Accepted papers may be included under either the archival track or the non-archival track. Non-archival track papers will not be included in the proceedings and may consist of previously published work or preliminary studies. Archival track submissions must present original work that has not been published or submitted elsewhere. The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.
Submission Portal
Workshop Format
The workshop is a one-day event designed to balance depth with participant engagement, including invited talks, paper presentations, and discussions.
Invited Speakers
Wenwu Wang, Full Professor, University of Surrey
Wenwu Wang (Fellow, IEEE) is a Professor in Signal Processing and Machine Learning and Associate Head of External Engagement in the School of Computer Science and Electronic Engineering, University of Surrey, UK. He is also an AI Fellow at the Surrey Institute for People Centred Artificial Intelligence. His current research interests include signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co-)authored over 300 papers in these areas. His work has been recognized with more than 15 accolades, including the Audio Engineering Society Best Technical Paper Award (2025), the IEEE Signal Processing Society Young Author Best Paper Award (2022), the ICAUS Best Paper Award (2021), the DCASE Judge’s Award (2020, 2023, and 2024), the DCASE Reproducible System Award (2019 and 2020), and the LVA/ICA Best Student Paper Award (2018). He is a Senior Area Editor (2025-2027) of the IEEE Open Journal of Signal Processing and an Associate Editor (2024-2026) for IEEE Transactions on Multimedia. He was a Senior Area Editor (2019-2023) and Associate Editor (2014-2018) for IEEE Transactions on Signal Processing, and an Associate Editor (2020-2025) for IEEE/ACM Transactions on Audio, Speech, and Language Processing. He is Chair (2025-2027) of the EURASIP Technical Area Committee on Acoustic, Speech and Music Signal Processing and an elected Member (2021-2026) of the IEEE SPS Signal Processing Theory and Methods Technical Committee. He was the elected Chair (2023-2024) of the IEEE Signal Processing Society (SPS) Machine Learning for Signal Processing Technical Committee and a Board Member (2023-2024) of the IEEE SPS Technical Directions Board. He has served on the organising committees of INTERSPEECH 2022, IEEE ICASSP 2019 & 2024, IEEE MLSP 2013 & 2024, and SSP 2009, and was Technical Program Co-Chair of IEEE MLSP 2025. He was elected an IEEE Fellow for contributions to audio classification, generation, and source separation, and has been an invited keynote or plenary speaker at more than 20 international conferences and workshops.
Abstract: Large Language Models (LLMs) are increasingly being applied to audio processing, where they help interpret and generate meaningful patterns from complex sound inputs such as speech, music, and sound effects. When combined with acoustic models, LLMs offer significant potential for solving a wide range of challenges in audio processing, understanding and generation. This talk will highlight several recent developments in large audio-language models (LALMs), focusing on new algorithms and their applications to audio-centric tasks. Topics will include audio-text fusion and alignment, cross-modality audio applications, the construction of audio-language datasets, and emerging research directions in audio-language learning. We will showcase our recent work in areas such as audio generation and storytelling (e.g., AudioLDM, AudioLDM2, WavJourney), audio source separation (e.g., AudioSep), audio captioning and reasoning/question answering (e.g., ACTUAL and APT-LLMs), neural audio coding (e.g., SemantiCodec), audio editing (e.g., WavCraft), and the datasets (e.g., WavCaps, Sound-VECaps, AudioSetCaps) used to train and evaluate large audio-language models.
Hung-yi Lee, Full Professor, National Taiwan University
Hung-yi Lee is a Professor in the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment in the Department of Computer Science & Information Engineering. His recent research focuses on developing technology that reduces the amount of annotated data required for speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He received the Salesforce Research Deep Learning Grant in 2019, the AWS ML Research Award in 2020, the Outstanding Young Engineer Award from the Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and the 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He is a Fellow of the International Speech Communication Association (ISCA). He runs a YouTube channel teaching deep learning in Mandarin, which has more than 300,000 subscribers.
Abstract: This talk highlights recent advancements in Spoken Language Models (SLMs), focusing on enabling text-based Large Language Models (LLMs) to seamlessly process and generate speech while retaining their universal capabilities. Starting from traditional text-based LLMs, we explore methods to integrate speech comprehension and generation without causing catastrophic forgetting of their original skills. We introduce novel speech representation learning techniques specifically tailored for SLMs and present analyses of their internal representations. Additionally, we discuss benchmark evaluations designed for SLMs, assessing their universal capabilities, instruction-following proficiency, reasoning abilities, and effectiveness in full-duplex dialogues. Finally, we will discuss how to enable SLMs to think and speak simultaneously.
Tsubasa Takahashi, Principal Researcher, Acompany Co., Ltd.
Tsubasa Takahashi is a Principal Research Scientist at Acompany Co., Ltd. He received his PhD in Computer Science from the University of Tsukuba, Japan, in 2014. He has held research and R&D leadership positions at major technology companies, including NEC, LINE, and Turing, and was a visiting scholar at Carnegie Mellon University from 2015 to 2016. His research interests include privacy-preserving machine learning, Confidential AI, adversarial robustness, and the security of large-scale multimodal and autonomous systems.
Abstract: Alongside the rapid adoption of generative AI systems, physical AI systems that perceive and act in the real world are advancing at an accelerating pace. As these systems are increasingly deployed in everyday and industrial environments, they are required to process rich contextual information from both personal and operational domains, making safety and trustworthiness central technical challenges. Looking ahead, a key question is how to enable advanced AI capabilities while preserving the confidentiality of personal data and sensitive business information. This talk first reviews recent research trends aimed at improving the safety and reliability of multimodal AI systems. It then introduces Confidential AI, an approach in which AI models are executed within trusted execution environments (TEEs). By providing end-to-end protection from prompt inputs to inference outputs, Confidential AI enables secure AI deployment while maintaining data confidentiality. The talk concludes by discussing emerging considerations for jointly addressing safety and confidentiality when deploying physical AI systems across diverse application domains.
Björn Schuller, Full Professor, Imperial College London, UK & TUM, Germany
Björn Schuller (Fellow, IEEE) received the Diploma, Ph.D., and Habilitation degrees in electrical engineering and information technology from the Technical University of Munich, Munich, Germany, in 1999, 2006, and 2012, respectively. He was subsequently appointed as an Adjunct Teaching Professor. He is currently a Full Professor of artificial intelligence and the Head of the Group on Language, Audio, and Music (GLAM), Imperial College London, London, U.K., and the Chair of Health Informatics, Technical University of Munich. He also holds various other professorships and affiliations worldwide. He has (co-)authored more than 1,400 publications, with more than 60,000 citations and an h-index of 110. He is a Fellow of the ACM, BCS, ELLIS, ISCA, and AAAC, where he also served as President. He is a Golden Core Awardee of the IEEE Computer Society and an Elected Full Member of Sigma Xi.
Abstract: Audio is the modality that refuses to sit quietly in the background. And it has an unfair advantage: it arrives early, travels far and lightweight, and carries structure that vision often misses—rhythm, prosody, timing, interaction, and the subtle signatures of context. It leaks emotion, state, and intention—often before a single word is recognised and long after the camera fails. Yet, in most multimodal pipelines, audio is treated as an accessory stream: aligned late, fused shallowly, and evaluated on curated benchmarks that rarely resemble the messiness of the real world. This talk therefore argues for an “audio first” multimodal stack: let intelligent listening set the rhythm for fusion, guide attention across modalities, and keep reasoning on track when the world is noisy, messy, or incomplete. We will connect recent progress in self-supervised audio representation learning, paralinguistic and affective computing, reasoning, and multimodal alignment to concrete application patterns: health and wellbeing monitoring, human–AI interaction, and media and music understanding. Be invited to listen – and watch…
Yu Tsao, Deputy Director, Academia Sinica
Yu Tsao (Senior Member, IEEE) received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher at the National Institute of Information and Communications Technology (NICT), Tokyo, Japan, where he conducted research and product development in multilingual speech-to-speech translation systems, focusing on automatic speech recognition. He is currently a Research Fellow (Professor) and the Deputy Director at the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. He also holds a joint appointment as a Professor in the Department of Electrical Engineering at Chung Yuan Christian University, Taoyuan, Taiwan. His research interests include assistive oral communication technologies, audio coding, and bio-signal processing. He serves as an Associate Editor for IEEE Transactions on Consumer Electronics and IEEE Signal Processing Letters. He received the Outstanding Research Award from Taiwan’s National Science and Technology Council (NSTC), the 2025 IEEE Chester W. Sall Memorial Award, and served as the corresponding author of a paper that won the 2021 IEEE Signal Processing Society Young Author Best Paper Award.
Abstract: This presentation is organized into two major sections: assistive hearing technologies and assistive speaking technologies. The first section addresses recent advancements in assistive hearing technologies, focusing on the application of cutting-edge AI-based classification algorithms for diagnosing hearing disorders. We will present case studies including the automated detection of otitis media with effusion and the assessment of vestibular hypofunction in adults. In addition, we will introduce AI-driven methods for speech generation and quality assessment tailored to hearing assistive devices, such as hearing aids and cochlear implants. The second section explores assistive speaking technologies, emphasizing AI-enabled diagnostic tools for speech disorders. Topics include the classification of pathological speech types and the evaluation of speech severity. We will also highlight speech enhancement techniques targeting disordered speech, including conditions resulting from oral surgery, dysarthria, and electrolaryngeal voice production. Overall, this presentation aims to demonstrate the transformative potential of neural-based approaches in improving communication accessibility for individuals with speech and hearing impairments and to foster continued interdisciplinary research in this emerging field.
Program
Time Session
8:45 - 9:00 Opening Remarks
9:00 - 9:30 Invited Talk 1: Hung-yi Lee
9:30 - 10:30 Poster Session 1 (6 posters)
Lina-Speech: Gated Linear Attention and Initial-State Tuning for Multi-Sample Prompting Text-To-Speech Synthesis
AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences
Semi-supervised Acoustic Scene Classification under Spatial-Temporal Variability with a CRNN-based Model
Online Independent Low-Rank Matrix Analysis as a Lightweight and Trainable Model for Real-Time Multichannel Music Source Separation
Granular Control of Nonverbal Expressions for Achieving Natural Emotional Text-to-Speech System
Toward High-Quality Cross-lingual Text-to-Speech Synthesis In Low-Resource Scenarios
10:30 - 11:00 Coffee Break
11:00 - 11:30 Invited Talk 2: Yu Tsao
11:30 - 11:40 Photo Session
11:40 - 12:10 Invited Talk 3: Tsubasa Takahashi
12:10 - 14:00 Lunch Break
14:00 - 15:00 Poster Session 2 (7 posters)
Train multi-modal LLM to understand diverse speech paralinguistics by distilling from teacher with meta-information prompt
Latent-RQ: Enhancing Speech Pre-training with Latent Representations and Random Quantization
Can You Hear Naples? Building and Benchmarking a Neapolitan Speech Corpus
AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
BSLM: A Bi-Level Speech-Language Model for the Joint Modeling of Discrete and Continuous Tokens
Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis
Generalizable speech deepfake detection via meta-learned LoRA
15:00 - 15:30 Invited Talk 4: Björn Schuller
15:30 - 16:00 Coffee Break (Award Voting)
16:00 - 16:30 Invited Talk 5: Wenwu Wang
16:30 - 17:30 Best Paper/Poster/Presentation Award Announcement + Closing Remarks
17:30 - 17:40 Photo Session
Important Dates
24 October 2025: Paper Submission Deadline
9 November 2025: Paper Notification
16 November 2025: Early Bird Registration
26 January 2026: Workshop
Workshop Committee
Nancy F. Chen, A*STAR Institute for Infocomm Research (A*STAR I²R), Singapore, nfychen@a-star.edu.sg
Nobutaka Ono, Tokyo Metropolitan University, Japan, onono@tmu.ac.jp
Xiaoxue Gao, A*STAR Institute for Infocomm Research (A*STAR I²R), Singapore, Gao_Xiaoxue@a-star.edu.sg
Keisuke Imoto, Kyoto University, Japan, keisuke.imoto@ieee.org
Tatsuya Komatsu, LY Corporation, Japan, komatsu.tatsuya@lycorp.co.jp