The rapid advancement of large vision-language models (VLMs) has introduced challenges in evaluating their reasoning across multiple modalities. Existing benchmarks provide limited insights into how models understand and reason over semantically equivalent information across modalities, which is crucial because a robust model should demonstrate consistent comprehension regardless of how information is represented. To address this gap, we introduce SEAM, a benchmark dataset for cross-modal reasoning that ensures semantically equivalent inputs are presented in distinct and standardized notations. By employing fundamentally distinct notation systems across modalities, in contrast to OCR-based image-text pairing, our benchmark provides a rigorous assessment of the textual-symbolic versus visual-spatial reasoning capabilities of VLMs in chess, chemistry, music, and graph theory. Our findings highlight key limitations in current VLMs and inform future advancements in cross-modal reasoning.
We present SEAM (Semantically Equivalent Across Modalities Benchmark), a novel dataset designed to evaluate the cross-modal reasoning abilities of vision-language models (VLMs). Unlike existing benchmarks, SEAM ensures that inputs across vision and language modalities are semantically equivalent, providing a rigorous and fair assessment of how well VLMs process identical information presented in different forms.
Each domain uses a standard textual notation, illustrated below:
FEN:
1k6/ppp3p1/8/1P5p/8/P3n2P/2P1r1P1/B2rNRK1 w - - 5 32
SMILES:
C1=CC(=CC=C1C[C@@H](C(=O)O)N)N.O
ABC:
X:1829
L:1/16
M:2/4
K:G
(3(BcB) A2d>c ... |]
Adjacency Matrix:
[[0, 1, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 0],
[0, 1, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 0],
[0, 1, 1, 0, 0, 1],
[1, 0, 0, 0, 1, 0]]
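These textual notations are machine-readable, which is what makes semantic equivalence checkable in the first place. As a minimal illustration (not the official SEAM pipeline), the examples above parse directly with open-source libraries: python-chess for FEN and RDKit for SMILES (both named in the methodology below), plus NetworkX as an illustrative choice for the adjacency matrix; the ABC tune is omitted for brevity.

```python
# Minimal parsing sketch for the example notations above (not the SEAM pipeline).
import chess                      # pip install python-chess
import networkx as nx             # pip install networkx  (illustrative choice)
import numpy as np
from rdkit import Chem            # pip install rdkit

# Chess: the FEN string fully determines the position.
board = chess.Board("1k6/ppp3p1/8/1P5p/8/P3n2P/2P1r1P1/B2rNRK1 w - - 5 32")
print(board.turn == chess.WHITE)  # True: White to move

# Chemistry: SMILES parses to a molecular graph; canonical SMILES is a
# normalized form that makes equivalence checks straightforward.
mol = Chem.MolFromSmiles("C1=CC(=CC=C1C[C@@H](C(=O)O)N)N.O")
print(Chem.MolToSmiles(mol))

# Graph theory: the adjacency matrix maps directly onto an undirected graph.
A = np.array([[0, 1, 0, 0, 0, 1],
              [1, 0, 1, 1, 1, 0],
              [0, 1, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 1],
              [1, 0, 0, 0, 1, 0]])
G = nx.from_numpy_array(A)
print(G.number_of_edges())        # 7
```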
We crafted SEAM to test VLMs fairly across text and vision—here’s how:
Picking Domains: We selected chess, chemistry, music, and graph theory because each has a real-world, standardized textual notation (FEN, SMILES, ABC, adjacency matrices) alongside a conventional visual rendering.
Ensuring Equivalence: Domain tools such as python-chess and RDKit generate the visual rendering directly from the textual notation, so both modalities carry identical information (see the parsing sketch above).
Crafting Questions: We built 3,200 multiple-choice questions across 16 tasks, with problems sourced from existing datasets (e.g., Lichess games) or generated programmatically with domain tools.
Balancing Difficulty: Incorrect options are constructed to be plausible, e.g., by offsetting numeric answers or selecting semantically similar alternatives, so that modality gaps are revealed rather than masked by easy distractors (see the distractor sketch after this list).
Standardizing Visuals: Images are rendered at 400x400 pixels (600x600 for music) to match the clean, standardized text inputs (see the rendering sketch after this list).
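As a concrete illustration of the offset strategy for numeric answers, the sketch below perturbs a correct count by small non-zero deltas; the function name and parameters are hypothetical, not SEAM's actual generation code.

```python
import random

def offset_distractors(correct: int, num_options: int = 4, max_offset: int = 3) -> list[int]:
    """Hypothetical sketch: build a plausible option set around a numeric answer
    by applying small non-zero offsets (e.g., for an edge-counting question)."""
    candidates = [correct + d for d in range(-max_offset, max_offset + 1)
                  if d != 0 and correct + d >= 0]
    options = [correct] + random.sample(candidates, num_options - 1)
    random.shuffle(options)
    return options

print(offset_distractors(7))  # e.g. [6, 7, 9, 5] for a graph with 7 edges
```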
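For the visual modality, the same notations can be rendered to fixed-size images. The snippet below is a sketch using python-chess's SVG export and RDKit's drawing utilities; the output sizes follow the 400x400 convention above, but the exact renderer settings used for SEAM may differ.

```python
import chess, chess.svg
from rdkit import Chem
from rdkit.Chem import Draw

# Chess: render the FEN example to a 400x400 SVG board.
board = chess.Board("1k6/ppp3p1/8/1P5p/8/P3n2P/2P1r1P1/B2rNRK1 w - - 5 32")
with open("board.svg", "w") as f:
    f.write(chess.svg.board(board=board, size=400))

# Chemistry: render the SMILES example to a 400x400 raster image.
mol = Chem.MolFromSmiles("C1=CC(=CC=C1C[C@@H](C(=O)O)N)N.O")
Draw.MolToImage(mol, size=(400, 400)).save("molecule.png")
```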
Contact [email] for more information about the project.