"Can Omni-MLLMs reasoning like humans from sight and sound?"
Current evaluations of audio-visual intelligence in Omni-MLLMs typically rely on one or a few isolated tasks,
overlooking both the homogeneity and the heterogeneity among these tasks.
This limitation hinders a deeper assessment of Omni-MLLMs’ audio-visual intelligence and restricts further exploration of the underlying capabilities.
Inspired by human perceptual processes, we categorize the evaluation of audio-visual tasks into three stages:
Perception, Understanding, and Reasoning.
In addition, to assess whether Omni-MLLMs exhibit generalization abilities akin to human audio-visual intelligence, we construct out-of-domain datasets to evaluate their Primitive Sensation capabilities.
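The taxonomy above can be summarized as three in-domain stages plus an out-of-domain probe. The sketch below only illustrates this grouping: the stage names follow the text, while the example task identifiers are hypothetical placeholders, not actual AVI-Bench task names.

```python
# Illustrative sketch of the evaluation taxonomy (stage names from the paper;
# task names below are hypothetical placeholders).
from enum import Enum


class EvalStage(Enum):
    PERCEPTION = "perception"                    # in-domain, stage 1
    UNDERSTANDING = "understanding"              # in-domain, stage 2
    REASONING = "reasoning"                      # in-domain, stage 3
    PRIMITIVE_SENSATION = "primitive_sensation"  # out-of-domain generalization probe


# Hypothetical mapping used only to show how tasks group into stages.
TASK_TO_STAGE = {
    "audio_visual_event_perception": EvalStage.PERCEPTION,
    "audio_visual_captioning": EvalStage.UNDERSTANDING,
    "audio_visual_question_answering": EvalStage.REASONING,
    "synthetic_tone_shape_matching": EvalStage.PRIMITIVE_SENSATION,
}
```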
Experimental results on Omni-MLLMs demonstrate that (1) their reasoning abilities are constrained by their perception and understanding capacities. Moreover, all tested Omni-MLLMs perform poorly in the Primitive Sensation evaluation, indicating that (2) out-of-domain generalization of audio-visual intelligence remains a significant challenge for current Omni-MLLMs.
AVI-Bench construction pipeline. Media data collected online is treated as in-domain data, while manually constructed media data is treated as out-of-domain data. Both types undergo manual verification, and the online-collected data is re-annotated where necessary.
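The following is a minimal sketch of the data-flow implied by the caption, assuming hypothetical field names and verification/re-annotation hooks; only the in-domain/out-of-domain split, the manual verification of both splits, and the re-annotation of online-collected data follow the text.

```python
# Minimal sketch of the construction pipeline described in the caption.
# MediaSample fields and the verify/reannotate hooks are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class MediaSample:
    media_path: str
    collected_online: bool          # True: collected online; False: manually constructed
    annotation: str | None = None
    verified: bool = False


def build_benchmark(
    samples: list[MediaSample],
    verify: Callable[[MediaSample], bool],     # hypothetical manual-verification hook
    reannotate: Callable[[MediaSample], str],  # hypothetical re-annotation hook
) -> dict[str, list[MediaSample]]:
    """Split samples by provenance, then verify and (if needed) re-annotate."""
    splits: dict[str, list[MediaSample]] = {"in_domain": [], "out_of_domain": []}
    for s in samples:
        s.verified = verify(s)                 # both splits undergo manual verification
        if s.collected_online:
            if s.annotation is None:           # online-collected data is re-annotated as needed
                s.annotation = reannotate(s)
            splits["in_domain"].append(s)
        else:
            splits["out_of_domain"].append(s)  # manually constructed -> out-of-domain
    return splits
```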