The rapid advancement of large vision-language models (VLMs) has introduced challenges in evaluating their reasoning across multiple modalities. Existing benchmarks provide limited insights into how models understand and reason over semantically equivalent information across modalities, which is crucial because a robust model should demonstrate consistent comprehension regardless of how information is represented. To address this gap, we introduce SEAM, a benchmark dataset for cross-modal reasoning that ensures semantically equivalent inputs are presented in distinct and standardized notations. By employing fundamentally distinct notation systems across modalities, in contrast to OCR-based image-text pairing, our benchmark provides a rigorous assessment of the textual-symbolic versus visual-spatial reasoning capabilities of VLMs in chess, chemistry, music, and graph theory. Our findings highlight key limitations in current VLMs and inform future advancements in cross-modal reasoning.
We present SEAM (Semantically Equivalent Across Modalities Benchmark), a novel dataset designed to evaluate the cross-modal reasoning abilities of vision-language models (VLMs). Unlike existing benchmarks, SEAM ensures that inputs across vision and language modalities are semantically equivalent, providing a rigorous and fair assessment of how well VLMs process identical information presented in different forms.
Each domain uses a standard textual notation, illustrated below:
FEN:
1k6/ppp3p1/8/1P5p/8/P3n2P/2P1r1P1/B2rNRK1 w - - 5 32
SMILES:
C1=CC(=CC=C1C[C@@H](C(=O)O)N)N.O
ABC:
X:1829
L:1/16
M:2/4
K:G
(3(BcB) A2d>c ... |]
Adjacency Matrix:
[[0, 1, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 1],
[1, 0, 0, 0, 1, 0]]
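These textual notations are machine-readable, which is what makes semantic equivalence checkable in the first place. As a minimal illustration (not the official SEAM pipeline), the examples above parse directly with open-source libraries: python-chess for FEN and RDKit for SMILES (both named in the methodology below), plus NetworkX as an illustrative choice for the adjacency matrix; the ABC tune is omitted for brevity.

```python
# Minimal parsing sketch for the example notations above (not the SEAM pipeline).
import chess                      # pip install python-chess
import networkx as nx             # pip install networkx  (illustrative choice)
import numpy as np
from rdkit import Chem            # pip install rdkit

# Chess: the FEN string fully determines the position.
board = chess.Board("1k6/ppp3p1/8/1P5p/8/P3n2P/2P1r1P1/B2rNRK1 w - - 5 32")
print(board.turn == chess.WHITE)  # True: White to move

# Chemistry: SMILES parses to a molecular graph; canonical SMILES is a
# normalized form that makes equivalence checks straightforward.
mol = Chem.MolFromSmiles("C1=CC(=CC=C1C[C@@H](C(=O)O)N)N.O")
print(Chem.MolToSmiles(mol))

# Graph theory: the adjacency matrix maps directly onto an undirected graph.
A = np.array([[0, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 0],
              [0, 1, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 1],
              [1, 0, 0, 0, 1, 0]])
G = nx.from_numpy_array(A)
print(G.number_of_edges())        # 7
```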
We crafted SEAM to test VLMs fairly across text and vision—here’s how:
Picking Domains: We selected chess, chemistry, music, and graph theory because each has a real-world, standardized textual notation (FEN, SMILES, ABC, adjacency matrices) alongside a conventional visual rendering.
Ensuring Equivalence: Domain tools such as python-chess and RDKit generate the visual rendering directly from the textual notation, so both modalities carry identical information (see the parsing sketch above).
Crafting Questions: We built 3,200 multiple-choice questions across 16 tasks, with problems sourced from existing datasets (e.g., Lichess games) or generated programmatically with domain tools.
Balancing Difficulty: Incorrect options are constructed to be plausible, e.g., by offsetting numeric answers or selecting semantically similar alternatives, so that modality gaps are revealed rather than masked by easy distractors (see the distractor sketch after this list).
Standardizing Visuals: Images are rendered at 400x400 pixels (600x600 for music) to match the clean, standardized text inputs (see the rendering sketch after this list).
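As a concrete illustration of the offset strategy for numeric answers, the sketch below perturbs a correct count by small non-zero deltas; the function name and parameters are hypothetical, not SEAM's actual generation code.

```python
import random

def offset_distractors(correct: int, num_options: int = 4, max_offset: int = 3) -> list[int]:
    """Hypothetical sketch: build a plausible option set around a numeric answer
    by applying small non-zero offsets (e.g., for an edge-counting question)."""
    candidates = [correct + d for d in range(-max_offset, max_offset + 1)
                  if d != 0 and correct + d >= 0]
    options = [correct] + random.sample(candidates, num_options - 1)
    random.shuffle(options)
    return options

print(offset_distractors(7))  # e.g. [6, 7, 9, 5] for a graph with 7 edges
```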
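For the visual modality, the same notations can be rendered to fixed-size images. The snippet below is a sketch using python-chess's SVG export and RDKit's drawing utilities; the output sizes follow the 400x400 convention above, but the exact renderer settings used for SEAM may differ.

```python
import chess, chess.svg
from rdkit import Chem
from rdkit.Chem import Draw

# Chess: render the FEN example to a 400x400 SVG board.
board = chess.Board("1k6/ppp3p1/8/1P5p/8/P3n2P/2P1r1P1/B2rNRK1 w - - 5 32")
with open("board.svg", "w") as f:
    f.write(chess.svg.board(board=board, size=400))

# Chemistry: render the SMILES example to a 400x400 raster image.
mol = Chem.MolFromSmiles("C1=CC(=CC=C1C[C@@H](C(=O)O)N)N.O")
Draw.MolToImage(mol, size=(400, 400)).save("molecule.png")
```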
Contact [email] for more information about the project.