Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical aspects of audio information. Our experimental results reveal significant limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a multi-turn chain-of-thought approach, which significantly improves model performance across the proposed tasks.
Baselines: For open-source large audio-language models, we select Qwen2-Audio-7B-Instruct [5], Qwen-Audio-Chat-7B [6], SALMONN-7B [7], SALMONN-13B [7], LTU-AS [8], Gazelle [9], and GAMA [10]. For closed-source models, we select Gemini-1.5-flash-001 [11] and Gemini-1.5-pro-001 [11].
In addition to end-to-end models, we also include a cascade pipeline as a baseline for comparison. The cascade pipeline first uses a specialized audio captioning model to generate a caption for the audio; the caption is then fed into a text-based large language model, which uses this description to answer the corresponding question. In this setup, we use EnCLAP [12] as the specialized audio captioning model and LLaMA-3.1-8B-Instruct [13] as the large language model. In all experiments, all models use greedy decoding without any system prompt.
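The following is a minimal sketch of this caption-then-answer flow, assuming a Hugging Face transformers text-generation pipeline for the language model; the model identifier is an assumption, and generate_caption is a hypothetical placeholder for EnCLAP inference, whose actual API is not described here.

```python
# Minimal sketch of the cascade baseline: audio -> caption -> text-only LLM answer.
from transformers import pipeline

def generate_caption(audio_path: str) -> str:
    """Hypothetical placeholder for EnCLAP audio captioning inference."""
    raise NotImplementedError("Plug in the EnCLAP captioning model here.")

# Text-only LLM that answers the question from the generated caption.
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def cascade_answer(audio_path: str, question: str) -> str:
    caption = generate_caption(audio_path)
    prompt = (
        f"Audio description: {caption}\n"
        f"Question: {question}\n"
        "Answer with yes or no."
    )
    # Greedy decoding, mirroring the decoding setup described above.
    out = llm(prompt, max_new_tokens=16, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()
```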
As our questions are discriminative, we expect models to follow instructions and respond with “yes” or “no”. When parsing model responses, we use exact match to extract these answers. If an answer cannot be extracted, the response is excluded and counted against the instruction-following rate.
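A minimal sketch of this parsing rule is given below; the helper names are illustrative, and the exact normalization may differ slightly from our implementation.

```python
# Exact-match extraction of "yes"/"no" answers; unparsable responses are
# excluded and counted against the instruction-following rate.
def parse_answer(response: str):
    normalized = response.strip().lower().strip(".!\"'")
    if normalized == "yes":
        return "yes"
    if normalized == "no":
        return "no"
    return None  # could not extract an answer

def instruction_following_rate(responses):
    parsed = [parse_answer(r) for r in responses]
    return sum(p is not None for p in parsed) / len(responses)
```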
In addition to the original question format, we designed the following experimental settings:
(a) Quotation Marks: Using quotation marks to emphasize the sound event in the question, for example, “Is there a ‘cat meowing’ in the audio?”. We refer to this as the emphasis setting. In addition, we also use bold text to emphasize the sound event, rendering it in boldface instead of quotation marks.
(b) Negative Questions: Reformulating the original question into a negative question, such as changing “Is there a cat meowing in the audio?” to “Isn’t there a cat meowing in the audio?”. We refer to this as the negative questions setting.
(c) Silent Input: The original audio input is replaced with a silent audio clip before being fed into the model.
(d) Multi-turn and Thoughtful Chain of Hearings (MATCH): Previous work [14] highlights that large audio-language models are proficient at audio captioning but less effective at answering discriminative questions. Thus, our approach first has the model describe the audio before prompting it to answer the question in a second turn, as sketched below.
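The following is a minimal sketch of the MATCH prompting flow, assuming a generic chat-style interface to the audio-language model; lalm_chat is a hypothetical wrapper, and the exact prompts may differ from those used in our experiments.

```python
# Minimal sketch of MATCH: turn 1 elicits an audio description, turn 2 asks the
# discriminative question conditioned on that description.
def lalm_chat(messages, audio_path):
    """Hypothetical wrapper that sends a multi-turn conversation plus the audio
    to a chat-style large audio-language model and returns its reply."""
    raise NotImplementedError

def match_answer(audio_path: str, question: str) -> str:
    # Turn 1: ask the model to describe what it hears.
    messages = [{"role": "user", "content": "Describe the audio in detail."}]
    description = lalm_chat(messages, audio_path)

    # Turn 2: ask the original yes/no question, with the description in context.
    messages += [
        {"role": "assistant", "content": description},
        {"role": "user", "content": f"{question} Answer with yes or no."},
    ]
    return lalm_chat(messages, audio_path)
```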