SSEU-Bench

Independent and Joint Understanding

To thoroughly investigate the understanding capabilities of LALMs, we propose two evaluation paradigms: "independent understanding" and "joint understanding". In independent understanding, the LALM is required to focus on a single task (i.e., ASR, ASC, or AT). In joint understanding, the LALM is expected to consider the correlations among speech, scene, and events, and to generate predictions for all three tasks.

Prompts: Independent Understanding

For ASR, we base each prompt on the default ASR prompt provided in the corresponding LALM’s example code. The prompts are listed below.

(1) LTU-AS: "Closed-ended question: Can you identify the spoken text?"

(2) Qwen2-Audio-Instruct: "What does the person say? Directly output the spoken text."

(3) Kimi-Audio: "Please transcribe the following audio:"

(4) Step-Audio 2 Mini: The default ASR prompt is in Mandarin ("请记录下你所听到的语音内容。", i.e., "Please write down the speech content you hear."), but our data is in English. We therefore built an English prompt with a similar meaning: "Transcribe all spoken text accurately. Output Format: Spoken text: [Transcribe all spoken words, phrases, and sentences clearly.]" We tested both prompts, and the English one yielded a lower average WER across all conditions, so we use it in our experiments.
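The prompt comparison above relies on WER. As an illustration only, a minimal sketch of a word-level WER computation is given below. The lowercasing, the whitespace tokenization, and the helper names `extract_transcript` and `word_error_rate` are assumptions made for this example; the text normalization actually used in the benchmark is not specified here.

```python
def extract_transcript(output: str) -> str:
    """Drop an optional 'Spoken text:' label and surrounding brackets."""
    text = output.strip()
    prefix = "spoken text:"
    if text.lower().startswith(prefix):
        text = text[len(prefix):]
    return text.strip().strip("[]").strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: case-insensitive comparison on whitespace-separated tokens.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Averaging `word_error_rate` over all utterances for each candidate prompt gives one simple way to compare two prompts on the same data.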

For ASC, we use the same prompt for all LALMs, which is "You are an expert in acoustic scene classification.\n I will give you an audio recording.\n Your task is to identify the environment where the audio was recorded.\n Directly output the name of the scene without any explanations.\n Please follow the required output format.\n Scene: <xxx>, where xxx represents the scene."
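Because the ASC prompt fixes the output format to "Scene: <xxx>", the predicted label can be recovered with a small parser. The sketch below is a hypothetical post-processing step, not the benchmark's own code: the function name `parse_scene`, the lowercasing, and the tolerance for missing angle brackets are all assumptions.

```python
import re
from typing import Optional

# Matches a line such as "Scene: <park>" or "Scene: park".
_SCENE_RE = re.compile(r"^[ \t]*Scene[ \t]*:[ \t]*(.+?)[ \t]*$",
                       re.IGNORECASE | re.MULTILINE)


def parse_scene(output: str) -> Optional[str]:
    """Return the scene label from a model response, or None if absent."""
    match = _SCENE_RE.search(output)
    if match is None:
        return None
    # Assumption: brackets are decoration and labels compare case-insensitively.
    scene = match.group(1).strip("<>[] ").strip().lower()
    return scene or None
```

Returning `None` for malformed responses lets the caller decide whether to count them as errors or to retry.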

For AT, we use the same prompt for all LALMs, which is "You are an expert in sound events classification.\n I will give you an audio recording. Please carefully analyze the sound events in this audio.\n Ignore speech and focus only on non-speech sound events.\n Output only one line, no explanations.\n List distinguishable events detected in the audio, separated by a semicolon and a space."
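The AT prompt asks for a single line of events separated by a semicolon and a space. A minimal sketch of turning such a line into a list of tags follows; the name `parse_events`, the lowercasing, and the removal of duplicates are assumptions for illustration, and mapping free-text tags onto a fixed label set is outside the scope of this example.

```python
from typing import List


def parse_events(output: str) -> List[str]:
    """Split the first non-empty line of a response into event tags."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    line = lines[0]
    # Tolerate an optional "Events:" label in front of the list.
    if line.lower().startswith("events:"):
        line = line[len("events:"):]

    seen = set()
    events = []
    for item in line.split(";"):
        label = item.strip().strip("[]").strip().lower()
        # Keep the first occurrence of each tag, preserving order.
        if label and label not in seen:
            seen.add(label)
            events.append(label)
    return events
```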

Prompts: Joint Understanding

Directly Prompting

**Task Instructions:**

You are an advanced audio analysis system. Listen carefully to the provided audio clip, which contains both spoken content and background environmental sounds.

Analyze the audio comprehensively and provide responses for all three categories below in the exact format specified.

**Audio Analysis Requirements:**

The audio contains a user speech combined with non-speech background audio.

Your task is to:

1. Transcribe all spoken text accurately

2. Identify the acoustic scene/environment

3. Detect and list all audio events present, ignore speech and focus only on non-speech sound events

**Output Format:**

Please provide your analysis in exactly this format:

Spoken text: [Transcribe all spoken words, phrases, and sentences clearly]

Scene: [Identify the acoustic environment/setting where this audio was recorded]

Events: [List all distinguishable sound events separated by semicolons]
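The three-line output format requested by the direct prompt can be split into per-task predictions. The following is a sketch under stated assumptions: `parse_joint_output` is a hypothetical helper, the first occurrence of each label is taken as the answer, and square brackets echoed from the template are stripped.

```python
import re
from typing import Dict, List, Optional, Union

# One labelled field per line, e.g. "Scene: street".
_FIELD_RE = re.compile(
    r"^[ \t]*(Spoken text|Scene|Events)[ \t]*:[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_joint_output(output: str) -> Dict[str, Union[Optional[str], List[str]]]:
    """Extract transcript, scene, and event list from a joint response."""
    raw: Dict[str, str] = {}
    for label, value in _FIELD_RE.findall(output):
        # setdefault keeps the first occurrence of each label.
        raw.setdefault(label.lower(), value.strip().strip("[]").strip())

    events = [e.strip().lower() for e in raw.get("events", "").split(";")
              if e.strip()]
    return {
        "spoken_text": raw.get("spoken text"),  # None if the field is missing
        "scene": raw.get("scene"),
        "events": events,
    }
```

For example, a response of "Spoken text: hello there", "Scene: [street]", and "Events: car horn; footsteps" on separate lines yields the transcript `hello there`, the scene `street`, and the two events as a list.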

Chain-of-Thought Prompting

**Task Instructions:**

You are an advanced audio analysis system. Listen carefully to the provided audio clip, which contains both spoken content and background environmental sounds.

Analyze the audio comprehensively and provide responses for all three categories below in the exact format specified.

**Audio Analysis Requirements:**

The audio contains a user speech combined with non-speech background audio.

Your task is to:

1. Transcribe all spoken text accurately

2. Identify the acoustic scene/environment

3. Detect and list all audio events present, ignore speech and focus only on non-speech sound events

Please address these tasks step by step:

Step 1 — Energy & Onset: Compare energy between speech and non-speech background; estimate when the speaker starts talking.

Step 2 — ASR: Transcribe all spoken text accurately.

Step 3 — Scene candidates: Focus on non-speech audio; generate a short ranked list of plausible scene candidates.

Step 4 — Event candidates: Detect non-speech distinct sound events; create a ranked candidate list of events present.

Step 5 — Correlation check: Cross-validate scene and events; remove inconsistent items; choose one final scene and multiple events that are mutually consistent.

**Output Format:**

Please provide your analysis in exactly this format:

Reasoning process: [Short descriptions of reasoning process for each step, from 1 to 5]

Spoken text: [Transcribe all spoken words, phrases, and sentences clearly]

Scene: [Identify the acoustic environment/setting where this audio was recorded]

Events: [List all distinguishable sound events separated by semicolons]
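With the chain-of-thought format, the response carries a reasoning section before the three answer fields, and that reasoning may itself mention scenes or events. One possible way to handle this, sketched below, is to cut the response at the last line that starts with "Spoken text:" and treat everything before it as reasoning. The helper name `split_reasoning` and this cutting rule are assumptions, not a description of the benchmark's own post-processing.

```python
import re
from typing import Tuple

_ANSWER_START = re.compile(r"^[ \t]*Spoken text[ \t]*:",
                           re.IGNORECASE | re.MULTILINE)
_REASONING_LABEL = re.compile(r"^[ \t]*Reasoning process[ \t]*:[ \t]*",
                              re.IGNORECASE | re.MULTILINE)


def split_reasoning(output: str) -> Tuple[str, str]:
    """Return (reasoning, answer_block) from a chain-of-thought response."""
    starts = [m.start() for m in _ANSWER_START.finditer(output)]
    if not starts:
        # No answer fields found: treat the whole response as reasoning.
        return output.strip(), ""
    cut = starts[-1]  # Assumption: the final answer begins at the last label.
    reasoning = _REASONING_LABEL.sub("", output[:cut], count=1).strip()
    return reasoning, output[cut:].strip()
```

The returned answer block has the same three-field layout as the direct prompt's output, so the same field extraction can be applied to it.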
