Spatial reasoning is the ability to understand, manipulate, and infer relationships between objects, and between objects and oneself, within a space: where things are, how they relate to each other (above/below, left/right, near/far), and how they would move or transform.
In the context of VLMs, it refers to reasoning about geometry, position, and relations in visual scenes, beyond simply recognizing objects or matching text to images.
VLMs often lack real spatial reasoning because:
They focus on semantics, not geometry: image encoders (e.g., CLIP, ViT) learn what objects are, but precise spatial layout is largely lost when patches are flattened into a token sequence (see the sketch after this list).
Positional information gets diluted: cross-attention mixes tokens from across the image, so spatial order fades as features propagate through the model.
Training data has weak spatial supervision: captions describe “what” but rarely “where.”
Models rely on text correlations: they guess spatial relations from language co-occurrence instead of from visual structure.
There is no 3D grounding: most models see only single 2D images, not depth or multiple views, so they cannot reason about occlusion or viewpoint.
In short, today’s VLMs see scenes as sets of labels, not as space.
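To make the first point concrete, here is a minimal sketch (NumPy only; the image size, patch size, and embedding dimension are assumed for illustration) of ViT-style patchification: the 2D grid of patches becomes a flat 1D token sequence, and the only trace of where each patch came from is the added positional embedding.

```python
# Minimal sketch of ViT-style patchification (assumed shapes, random weights):
# the 2D grid of patches is flattened into a 1D token sequence, so the encoder
# only "knows" where a patch was through the added positional embedding.
import numpy as np

H, W, C = 224, 224, 3        # image size (assumed)
P = 16                       # patch size (assumed)
D = 768                      # embedding dimension (assumed)

image = np.random.rand(H, W, C)

# Cut into non-overlapping P x P patches and flatten each patch.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (196, 768): 2D grid -> 1D sequence

# Linear projection + learned positional embeddings (randomly initialized here).
W_proj = np.random.randn(P * P * C, D) * 0.02
pos_emb = np.random.randn(patches.shape[0], D) * 0.02
tokens = patches @ W_proj + pos_emb               # (196, D)

print(tokens.shape)  # (196, 768): row/column structure is gone from the shape;
                     # spatial layout survives only inside pos_emb.
```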
There are many different approaches to improve spatial reasoning in VLMs/MLLMs. Some focus on providing better spatial data, others redesign the model architecture, and some enhance reasoning during inference or through embodied interaction. Below are the main categories of methods to improve spatial reasoning, along with exemplar research papers for each:
1. Data-Centric (add explicit spatial supervision):
Build large-scale synthetic or auto-generated spatial VQA/relational pairs; include multi-view/3D cues (depth, point clouds).
Example: SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, which trains VLMs for quantitative spatial reasoning on roughly 2 billion 3D-aware VQA pairs generated from 10 million images (CVPR 2024).
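For intuition on what such data generation looks like, below is a toy sketch (not the SpatialVLM pipeline; the object names, coordinates, and question templates are invented) of how quantitative spatial QA pairs can be auto-generated once objects have metric 3D positions, e.g. obtained from depth estimation and camera intrinsics.

```python
# Toy auto-generation of spatial QA pairs from metric 3D object positions.
# Positions are assumed to come from depth + intrinsics; everything here is made up.
import random

objects = {
    "chair": (0.4, 0.0, 2.1),   # (x, y, z) in metres, camera frame (assumed)
    "lamp":  (-0.8, 0.3, 3.5),
}

def distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def make_qa(objs):
    (n1, p1), (n2, p2) = random.sample(list(objs.items()), 2)
    templates = [
        (f"How far apart are the {n1} and the {n2}?",
         f"About {distance(p1, p2):.1f} metres."),
        (f"Is the {n1} to the left of the {n2}?",      # x increases to the right (assumed)
         "Yes." if p1[0] < p2[0] else "No."),
        (f"Which is closer to the camera, the {n1} or the {n2}?",
         f"The {n1}." if p1[2] < p2[2] else f"The {n2}."),
    ]
    return random.choice(templates)

question, answer = make_qa(objects)
print(question, answer)
```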
2. Architecture / Feature Design (make space first-class):
Modify model architectures or features to better capture spatial cues: preserve positional signals, integrate region segmentation, depth modules, 3D scene-graphs, or multi-view fusion.
Example: Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models, which analyzes how VLMs treat vision embeddings as semantic “bags of tokens” and proposes token-norm normalization and mid-layer spatial features to restore spatial awareness (arXiv 2025).
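As an illustration of this direction, here is a minimal sketch in the spirit of (but not identical to) that paper’s findings: take vision tokens from a middle layer instead of the final one and normalize per-token norms before handing them to the language model. The layer index and the exact normalization are assumptions; the snippet uses HuggingFace’s CLIP vision encoder.

```python
# Sketch: grab mid-layer vision tokens and equalize their norms.
# Layer 6 and unit-norm scaling are illustrative choices, not the paper's exact recipe.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))          # placeholder image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

mid = out.hidden_states[6]                    # mid-layer tokens, shape (1, 50, 768)
patch_tokens = mid[:, 1:, :]                  # drop the CLS token

# Normalize each token to unit norm so no single token dominates the projector input.
patch_tokens = patch_tokens / patch_tokens.norm(dim=-1, keepdim=True)
print(patch_tokens.shape)                     # (1, 49, 768)
```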
3. Reasoning-Time Methods (better inference):
Apply chain-of-thought (CoT) decompositions for spatial steps, self-consistency / test-time search, or perspective-aware prompting (mental imagery).
Example: Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning, which studies CoT prompting and self-consistency at inference time alongside reinforcement-learning fine-tuning for spatial reasoning (arXiv 2025).
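A rough sketch of the inference-time part, self-consistency over chain-of-thought samples, is shown below. query_vlm is a hypothetical stand-in for whatever VLM API you call, and the prompt template is purely illustrative.

```python
# Self-consistency for spatial questions: sample several chain-of-thought
# completions and keep the majority-vote answer.
from collections import Counter

COT_PROMPT = (
    "Question: {question}\n"
    "First list each object and its position in the image, "
    "then reason step by step about their spatial relation, "
    "and finish with 'Answer: <left/right/above/below>'."
)

def query_vlm(image, prompt, temperature=0.7):
    """Hypothetical VLM call; replace with your model or API client of choice."""
    raise NotImplementedError

def self_consistent_answer(image, question, n_samples=5):
    votes = []
    for _ in range(n_samples):
        completion = query_vlm(image, COT_PROMPT.format(question=question))
        # Take the text after the final 'Answer:' as this sample's vote.
        votes.append(completion.rsplit("Answer:", 1)[-1].strip().lower())
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / n_samples   # majority answer and its agreement rate
```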
4. Evaluation & Diagnostics (targeted benchmarks):
Develop purpose-built benchmarks to test spatial reasoning: left/right, above/below, metric distance, and frames of reference (egocentric vs allocentric).
Example: Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models, which defines key elements of spatial reasoning and benchmarks 13 state-of-the-art VLMs, revealing major deficits (arXiv 2025).
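A tiny diagnostic harness in this spirit (not the benchmark from the paper; the image paths, questions, and gold answers are placeholders) might score accuracy per relation type, so that failures on left/right versus metric distance show up separately.

```python
# Toy per-category scoring of spatial questions; items are placeholders.
from collections import defaultdict

# Each item: (image path, question, gold answer, relation category).
ITEMS = [
    ("img_001.jpg", "Is the mug left of the laptop?", "yes", "left/right"),
    ("img_002.jpg", "Is the lamp above the shelf?", "no", "above/below"),
    ("img_003.jpg", "Roughly how many metres separate the two chairs?", "2", "metric"),
]

def evaluate(model_fn):
    """model_fn(image_path, question) -> answer string (your VLM wrapper)."""
    correct, total = defaultdict(int), defaultdict(int)
    for image_path, question, gold, category in ITEMS:
        prediction = model_fn(image_path, question).strip().lower()
        total[category] += 1
        correct[category] += int(gold in prediction)
    return {c: correct[c] / total[c] for c in total}   # accuracy per relation type
```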
5. Embodied / 3D Scene Grounding (close the loop):
Tie VLMs to real 3D geometry and actions (robotics, interactive environments, multi-view scenes) so spatial reasoning becomes physically grounded.
Example: SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models, which introduces a region-aware plugin and 3D scene-graph training data to enable grounded spatial reasoning for embodied VLMs (NeurIPS 2024).
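To show what “physically grounded” means in practice, here is a minimal sketch that back-projects a depth map into a metric point cloud using pinhole camera intrinsics and answers a “which is closer” question from geometry rather than pixels. The intrinsics, depth values, and object masks are placeholders.

```python
# Back-project depth to a camera-frame point cloud and compare object distances.
# Intrinsics, depth, and masks are made up for illustration.
import numpy as np

H, W = 480, 640
fx = fy = 525.0                    # assumed pinhole intrinsics
cx, cy = W / 2, H / 2

depth = np.full((H, W), 2.0)       # placeholder depth map in metres
masks = {                          # placeholder binary masks per detected object
    "table": np.zeros((H, W), bool),
    "door":  np.zeros((H, W), bool),
}
masks["table"][200:300, 100:200] = True
masks["door"][100:400, 500:600] = True

# Back-project every pixel (u, v, depth) to camera-frame XYZ.
u, v = np.meshgrid(np.arange(W), np.arange(H))
X = (u - cx) * depth / fx
Y = (v - cy) * depth / fy
points = np.stack([X, Y, depth], axis=-1)       # (H, W, 3)

# Per-object 3D centroids, then answer the spatial question from metric geometry.
centroids = {name: points[m].mean(axis=0) for name, m in masks.items()}
closer = min(centroids, key=lambda n: np.linalg.norm(centroids[n]))
print(f"The {closer} is closer to the camera.")
```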