"Can Omni-MLLMs reason like humans from sight and sound?"
AVI-Bench will be continuously maintained as a long-term, open community resource to advance cutting-edge research and technological progress in human-like audio-visual intelligence.
Task Adaptive
Models demonstrate strong overall performance across a wide range of audio-visual tasks.
Modal Adaptive
Models demonstrate strong performance on both audio and visual modalities.
Stage Adaptive
Models demonstrate strong performance in both perception and understanding, enabling better audio-visual reasoning.
Domain Adaptive
Models show human-like domain generalization.