Audio-Centric AI: Towards Real-World Multimodal Reasoning and Application Use Cases (Audio-AAAI)

AAAI 2026 Workshop 📍 Singapore 📅 26 January 2026

This Audio-AAAI workshop focuses on recent and emerging directions in the field of audio understanding, particularly beyond conventional tasks such as sound event detection (SED) and acoustic scene classification (ASC). There is growing interest in intelligent systems that can understand and act on rich auditory scenes in realistic environments. With the increasing prevalence of audio-capable edge devices, smart assistants, and wearable sensors, understanding audio in a robust, multimodal, and privacy-conscious manner is more crucial than ever. While major conferences like AAAI, NeurIPS, and ICASSP have explored isolated components of this area, this workshop aims to provide a comprehensive platform that brings together researchers from the AI, audio signal processing, and machine learning communities. Related past workshops include the DCASE Challenges and Workshops, which focus on sound event and scene recognition; the WASPAA and EUSIPCO workshops, which focus on audio processing; and the 2023 NeurIPS Workshop on Machine Learning for Audio, which focused on generative models.

This workshop builds on prior community-building efforts to create a focused venue for advancing audio intelligence in real-world applications. It aims to tackle key technical challenges including multimodal contextualization, the development of robust evaluation benchmarks, and strategies for successful industry deployment. Such a dialogue is especially timely given the rapid advancements in AI and the growing demand for intelligent, context-aware systems across diverse domains ranging from robotics to healthcare.

Topics of Interest

The topics of interest include, but are not limited to:

• Robust audio and multimodal scene analysis under real-world constraints (asynchronous microphones, reverberant environments, video-audio desynchronization)

• Data augmentation and synthetic generation for spatial audio and video-audio learning

• Multimodal large-scale models (LLMs, VLMs) for audio-language(-video) retrieval and understanding

• Perceptual representation learning for 3D sound event localization, detection, and audio-visual grounding

• Foundation models and adaptation for audio, speech, and video-audio tasks

• Speech, audio, and video-conditioned generation (avatars, dubbing, cross-modal synthesis)

• Audio-visual scene understanding and reasoning

• Audio and speech quality assessment: evaluation, metrics, and perception-driven benchmarks

• Multimodal safeguards and robustness in audio-video-language modeling

• Efficient evaluation frameworks for reasoning, generation, and multimodal integration (LLM + audio + video)

• Benchmarking dataset creation and sharing across audio, speech, and video modalities

• Applications of microphone array processing and virtual microphone techniques in multimodal systems

• Real-world applications: entertainment & media (karaoke, video avatars), manufacturing (process monitoring), sustainability (forest restoration, biodiversity monitoring), education (classroom video-audio analysis), healthcare (elderly care monitoring), security (robotic patrolling, surveillance with video-audio fusion)

Submission Requirements

Please prepare your submission using the AAAI template available at https://aaai.org/conference/aaai/aaai-26/main-technical-track-call/. Full papers should not exceed 8 pages (excluding references), while short papers are limited to 4 pages (excluding references). Papers may be submitted under either the archival track or the non-archival track. Archival track submissions must present original work that has not been published or submitted elsewhere. Non-archival track papers will not be included in the proceedings and may consist of previously published work or preliminary studies. The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.

Submission Portal

http://cmt3.research.microsoft.com/AudioAAAI2026/

Workshop Format

The workshop is a one-day event designed to balance depth and participant engagement. It includes invited talks, paper presentations, and discussions.

Invited Speakers

Wenwu Wang, Full Professor, University of Surrey, UK.
Tsubasa Takahashi, Principal Researcher, Tutoring Inc, USA.
Björn Schuller, Full Professor, Technical University of Munich, Germany.
Hung-yi Lee, Full Professor, National Taiwan University.
Yu Tsao, Deputy Director, Academia Sinica.

Important Dates

24 October 2025: Paper Submission
 9 November 2025: Paper Notification
16 November 2025: Early Bird Registration
26 January 2026: Workshop Program

Workshop Committee

Nancy F. Chen, A*STAR Institute for Infocomm Research (A*STAR I²R), Singapore, nfychen@a-star.edu.sg
Nobutaka Ono, Tokyo Metropolitan University, Japan, onono@tmu.ac.jp
Xiaoxue Gao, A*STAR Institute for Infocomm Research (A*STAR I²R), Singapore, Gao_Xiaoxue@a-star.edu.sg
Keisuke Imoto, Kyoto University, Japan, keisuke.imoto@ieee.org
Tatsuya Komatsu, LY Corporation, Japan, komatsu.tatsuya@lycorp.co.jp
