"Can Omni-MLLMs reason like humans from sight and sound?"
Question: Please identify the categories of sound-emitting objects in the given audio and how many different instances of each category are making sounds.
Answer: {"keyboard": "1", "dog": "2"}
Question: Please identify the categories of visible objects in the given image and how many different instances of each category are visible.
Answer: { "horse": "2", "man": "3" }
Question: Please generate the bounding-box of sound-emitting instances in the given image with the help of the given audio.
Answer: {"dog_1": [40.0, 169.0, 143.0, 90.0], "dog_2": [423.0, 67.0, 177.0, 92.0]}
Question: Are the contexts of audio and visual content matching?
Answer: No.
Question: Based on the given image as a reference, identify which sounds in the provided audio list contain the objects corresponding to the image.
Answer: [2, 4, 5]
Question: Based on the given audio as a reference, identify which images in the provided image list contain the objects corresponding to the sound.
Answer: [1, 2, 6, 7]
Question: Based on the given audio as a reference, identify which images in the provided image list contain the objects corresponding to the sound.
Answer: A black cat is playing with water in the sink while a man and woman talk quietly in the background.
Question: Are the talking people visible in the video?
Answer: No.
Question: Is the man making sound in the audio?
Answer: No.
Question: Is the instrument on the left more rhythmic than the instrument on the right?
Answer: Yes.
Question: Segment the object in the given frames based on the given text reference. Reference: The object making the loudest sound.
Answer: { "frame_0": [ 452, 198, 433, 520 ], "frame_1": [ 318, 146, 433, 574 ], "frame_2": [ 368, 123, 456, 597 ], "frame_3": [ 354, 120, 455, 600 ], "frame_4": [ 342, 101, 416, 579 ], "frame_5": [ 317, 79, 619, 574 ], "frame_6": [ 208, 104, 890, 568 ], "frame_7": [ 199, 9, 995, 711 ], "frame_8": [ 114, 21, 870, 699 ], "frame_9": [ 0, 54, 571, 666 ] }
Question: Please compare the two given sound segments. Which one has the softer volume?
Answer: Last.
Question: What is the surface roughness of the rotating object?
Answer: Rough.
Question: Which object is making the louder sound?
Answer: Pyramid.
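
Several of the answers above are JSON objects (instance counts, bounding boxes), so a response can be checked mechanically once it is parsed. The following is a minimal sketch of such a check, not the benchmark's official scoring. It assumes that boxes are given as [x, y, width, height], which the examples do not state, and the function names `parse_answer`, `count_match`, and `iou_xywh` are hypothetical.

```python
import json


def parse_answer(text):
    """Parse a JSON-style answer string; return None for free-text answers."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def count_match(pred, gold):
    """Exact match on category counts, tolerating "2" versus 2."""
    def normalize(counts):
        return {name: int(value) for name, value in counts.items()}
    return normalize(pred) == normalize(gold)


def iou_xywh(a, b):
    """Intersection over union of two boxes, ASSUMING [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap extent along each axis, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Usage with answers copied from the examples above.
gold_counts = parse_answer('{"keyboard": "1", "dog": "2"}')
gold_boxes = parse_answer(
    '{"dog_1": [40.0, 169.0, 143.0, 90.0], "dog_2": [423.0, 67.0, 177.0, 92.0]}'
)
```

Free-text answers such as "No." or "Rough." are not valid JSON, so `parse_answer` returns None for them and they would need a separate string comparison.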