We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision-language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires integrating multiple cognitive capabilities, such as recognizing objects across views, grounding their relative positions, and mentally simulating their transformations, SpinBench decomposes this ability into a set of fine-grained diagnostic categories. These categories target core geometric primitives, including translation, rotation, relative object pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open-source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, inconsistency under symmetrical and syntactic reformulations, and failures in premise-based linguistic reasoning. Scaling analysis shows both smooth improvements and emergent capabilities. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models.
In the figure below, Cohen's kappa values (κ) measure chance-adjusted performance, where κ = 0 indicates chance-level performance and κ = 1 perfect accuracy.
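For concreteness, here is a minimal sketch of how chance-adjusted accuracy follows from the standard Cohen's κ formula, κ = (p_o − p_e) / (1 − p_e), assuming p_e is simply the chance rate of random guessing on a multiple-choice task; the paper's exact aggregation across categories may differ.

```python
def cohens_kappa(accuracy: float, chance_rate: float) -> float:
    """Chance-adjusted accuracy: kappa = (p_o - p_e) / (1 - p_e).

    kappa = 0 corresponds to chance-level performance and kappa = 1
    to perfect accuracy. `chance_rate` is the expected accuracy of
    random guessing, e.g. 0.25 for a 4-way multiple-choice question.
    """
    if not 0.0 <= chance_rate < 1.0:
        raise ValueError("chance_rate must lie in [0, 1)")
    return (accuracy - chance_rate) / (1.0 - chance_rate)

# Example: 62% accuracy on a 4-option task (25% chance) -> kappa ~ 0.493
print(cohens_kappa(0.62, 0.25))
```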
In the figure below, the left panel shows model rankings by overall accuracy (top) and pairwise consistency percentage (bottom), with colors indicating consistency levels. The right scatter plot shows a strong positive correlation (Pearson r = 0.874, p < 0.05) between the two metrics.
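A minimal sketch of how such an accuracy-consistency correlation could be computed. The per-model scores below are synthetic and for illustration only (the paper reports these metrics for 37 real VLMs); only the correlation procedure itself carries over.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical per-model metrics: overall accuracy and the percentage of
# question pairs answered consistently under reformulation. The linear
# relation here is injected by construction, purely for demonstration.
n_models = 37
accuracy = rng.uniform(0.3, 0.8, n_models)
consistency = 0.9 * accuracy + rng.normal(0.0, 0.05, n_models)

# Pearson correlation between the two per-model metrics.
r, p = pearsonr(accuracy, consistency)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```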
In the figure below, each line shows Cohen's κ (chance-adjusted accuracy) as a function of model size for four model families. While overall performance improves gradually with scale, different task types exhibit distinct scaling patterns.