We show two demos here; more can be obtained by running the inference code. For links to the code and data, please refer to the "Links" page.
Independent Understanding (SNR=0dB)
ASR (GT: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.)
LTU-AS: This is a very common type of post. One shuiny miny red and yellow with little or no green colors.
Qwen2-Audio-Instruct: This is a very common type of tool, one showing mainly red and yellow with a little bit green and blue.
Kimi-Audio: This is a very common type of film, one showing mainly red and yellow with little or no green or blue.
Step-Audio 2 Mini: this is a very common type of food one showing mainly red and yellow with little or no green at all.
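One way to quantify how far each transcription above drifts from the ground truth is word error rate (WER). The following is a minimal sketch, assuming a simple normalization (lowercasing and stripping punctuation) and a word-level edit distance; the function names are hypothetical and this is not necessarily the scoring procedure used for the reported results.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


gt = ("This is a very common type of bow, one showing mainly red and yellow, "
      "with little or no green or blue.")
kimi = ("This is a very common type of film, one showing mainly red and yellow "
        "with little or no green or blue.")
print(f"Kimi-Audio WER: {word_error_rate(gt, kimi):.3f}")
```

Under this normalization, the Kimi-Audio output differs from the ground truth by a single substituted word ("film" for "bow") out of 21 reference words.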
ASC (GT: home)
LTU-AS: Outside, rural or natural
Qwen2-Audio-Instruct: elevator
Kimi-Audio: fishing
Step-Audio 2 Mini: kitchen
AT (GT: Vacuum Cleaner)
LTU-AS: Male speech, man speaking; Mechanisms; Generic impact sounds; Tick; Surface contact; Human voice
Qwen2-Audio-Instruct: mechanisms; printer; clock ticking; surface contact; generic impact sounds; tap
Kimi-Audio: Drill; Power_tool; Tools
Step-Audio 2 Mini: A man speaks; a vacuum cleaner runs
Joint Understanding (SNR=0dB)
GT
Spoken text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Scene: City Center
Events: Car, Large Vehicle
Directly Prompting
Qwen2-Audio-Instruct:
Spoken text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Scene: car
Events: car; tire squeal; engine accelerating - revving - vroom; car racing - auto racing
Kimi-Audio:
Spoken text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Scene: outdoor
Events: rain; vehicle
Step-Audio 2 Mini:
Spoken text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Scene: Outdoor
Events: Rain falling; vehicle traffic passing by; thunder in the distance; birds chirping; leaves rustling
Chain-of-Thought Prompting
Kimi-Audio:
Reasoning process:
1. The energy of the audio is dominated by the sound of rain; the speech by the woman is quieter and occurs later. The onset of speech is around 1.5 seconds.
2. The spoken text is "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain."
3. The acoustic scene is likely an outdoor environment with heavy rain, possibly near a road given the faint sound of a car.
4. The distinct sound events detected are 'car; rain'
5. Cross-validating the scene and events, the rain event is consistent with the outdoor acoustic scene, and the car event could be a distant vehicle sound. Thus, the final scene is 'outdoor with heavy rain' and the final events are 'car; rain'.
Spoken text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Scene: outdoor with heavy rain
Events: car; rain
Step-Audio 2 Mini:
Reasoning process:
1. The energy of the speech is much higher than the background noise, indicating that the speaker started talking after the background noise had already begun.
2. The ASR transcription is "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain."
3. The acoustic environment has a mix of speech and background noise. The background noise is characterized by low-frequency rumbling, which suggests a potential outdoor setting like a street or a busy area.
4. The events detected include: car engine; footsteps; door closing; music playing; phone ringing.
5. The scene and events are mutually consistent, as the presence of a car engine, footsteps, door closing, music playing, and phone ringing are typical of an outdoor or semi-public space.
Spoken text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Scene: Street
Events: Car engine; footsteps; door closing; music playing; phone ringing
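The joint-understanding responses above all end with the same three fields (Spoken text, Scene, Events), which makes them easy to compare against the ground truth programmatically. Below is a minimal sketch of such a comparison, assuming the field labels appear at the start of a line and that events are separated by semicolons or commas; `parse_response` is a hypothetical helper, not part of the released inference code.

```python
FIELDS = ("spoken text", "scene", "events")


def parse_response(text):
    """Extract the Spoken text / Scene / Events fields from a response."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        # Reasoning steps such as "4. The events detected include: ..." are
        # skipped because their prefix is not one of the three field labels.
        if sep and key in FIELDS:
            fields[key] = value.strip()
    # Split events on ';' or ',' and lowercase them for comparison.
    raw_events = fields.get("events", "").replace(",", ";")
    fields["events"] = [e.strip().lower() for e in raw_events.split(";") if e.strip()]
    return fields


ground_truth = parse_response(
    "Scene: City Center\n"
    "Events: Car, Large Vehicle"
)
kimi_cot = parse_response(
    "Scene: outdoor with heavy rain\n"
    "Events: car; rain"
)

matched = set(ground_truth["events"]) & set(kimi_cot["events"])
print("Matched events:", sorted(matched))
```

Exact string matching is deliberately strict here: a prediction such as "vehicle" would not be counted as matching "large vehicle", so a real evaluation would likely need a label mapping or a fuzzier comparison.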