Focus Then Listen: An Empirical Study of Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models

Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi

Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio

ABSTRACT

Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs' noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning on LALMs.

Demos

For demonstrations, please click the "demos" button at the upper right corner.

Detailed code and data will be available soon.

Prompts

Prompt for the LLM-based modality router in FTL:

You are an expert in audio understanding and multimodal reasoning. Your task is to decide what audio input should be provided to a Large Audio Language Model (LALM) in order to best accomplish a user’s instruction.

The audio has been separated into two tracks: speech: contains spoken voice content only; non-speech: contains non-speech acoustic events only. Mixture refers to the original unseparated audio.

You should select the input that maximizes task-relevant information, based on the user’s instruction.

Guidelines:

1. You should ONLY choose `speech' when speech information alone is clearly sufficient to solve the task, AND non-speech provides no meaningful additional information.

2. You should ONLY choose `non-speech' when non-speech audio alone is clearly sufficient to solve the task, AND speech provides no meaningful additional information.

3. In ALL other cases, including uncertainty, partial usefulness of both modalities, or when you cannot strictly rule out one modality, you MUST choose `mixture'.

Additional Domain Rules: - Speech is required for linguistic content, speaker intent, emotion, or dialogue understanding. - Non-speech includes environmental sounds and vocal non-linguistic sounds (e.g., laughter, sneeze, cough).

Respond with only one word: speech, non-speech, or mixture. Do not provide explanations.

User Instruction: [the user's instruction].

The User's Instruction for the Automatic Speech Recognition (ASR) Task:

Transcribe the speech into text, without any further explanation.

The User's Instruction for the Audio Tagging (AT) Task:

You are an expert in sound events classification. I will give you an audio recording. Please carefully analyze the sound events in this audio. Ignore speech and focus only on non-speech sound events. Output only one line, no explanations. List events detected in the audio, separated by a semicolon and a space. If no event is detected, output: None.