Toward Deliberate Multimodal Reasoning: Integrating Imagination and Inference

Recent multimodal language models demonstrate remarkable progress in combining visual perception with linguistic understanding, yet they still fall short of genuine spatial and temporal reasoning: the ability to imagine, simulate, and reason about how the world changes over time. While current systems can describe what they see, they struggle to anticipate what might happen next or to infer unseen relationships across viewpoints and objects. Human cognition, by contrast, depends heavily on the interplay between language and visual imagination: when reasoning about a physical scene, we not only think in words but also visualize transformations, interactions, and causal sequences [1, 2].


My ongoing research explores a new inference-time paradigm that aims to bring this type of deliberative multimodal reasoning to large models. Instead of producing an answer in a single forward pass, the framework treats inference as an adaptive exploration process, allocating computational effort dynamically based on the complexity or uncertainty of the reasoning task. Through this lens, inference becomes a controlled form of “multimodal thinking,” in which the model constructs, evaluates, and refines hypothetical multimodal states that evolve through space, time, or perspective.

At the core of this paradigm lies the idea that reasoning should unfold not only symbolically but also perceptually. The model alternates between textual deliberation (expressing logic, conditions, or intentions) and visual imagination (depicting or internally representing possible outcomes). These alternating steps form what can be viewed as multimodal cognitive trajectories: sequences of interleaved textual and visual states that the model can explore, compare, and prune according to coherence and plausibility criteria. Each trajectory embodies a distinct “train of thought” through both language and imagery, bridging two traditionally separate reasoning paradigms.
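
To make the exploration loop concrete, the sketch below shows one way interleaved trajectories could be branched, scored, and pruned at inference time. It is a minimal illustration under stated assumptions, not an implementation of the framework: `propose_steps` and `coherence` are hypothetical stand-ins for a multimodal model's step generator and a consistency critic.

```python
# Minimal sketch: beam-search-style exploration over interleaved multimodal
# reasoning states. All model calls are stand-ins; a real system would back
# `propose_steps` and `coherence` with a multimodal model and a critic.
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    modality: str   # "text" (deliberation) or "image" (imagined state)
    content: str    # token string or a handle to an imagined frame

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    score: float = 0.0

def propose_steps(traj: Trajectory, branch: int) -> list:
    """Stand-in for the model: alternate textual deliberation with visual
    imagination, producing `branch` candidate continuations."""
    next_mod = "image" if (traj.steps and traj.steps[-1].modality == "text") else "text"
    return [Step(next_mod, f"{next_mod}-hypothesis-{i}") for i in range(branch)]

def coherence(traj: Trajectory, step: Step) -> float:
    """Stand-in for a plausibility/consistency critic that judges how well
    the new step agrees with the trajectory so far (here: random)."""
    return random.random()

def explore(depth: int = 6, beam: int = 3, branch: int = 4) -> Trajectory:
    frontier = [Trajectory()]
    for _ in range(depth):
        candidates = []
        for traj in frontier:
            for step in propose_steps(traj, branch):
                s = coherence(traj, step)
                candidates.append(Trajectory(traj.steps + [step], traj.score + s))
        # Prune: keep only the `beam` most coherent trajectories alive.
        frontier = sorted(candidates, key=lambda t: -t.score)[:beam]
    return frontier[0]

if __name__ == "__main__":
    best = explore()
    print([f"{s.modality}: {s.content}" for s in best.steps])
```

The beam structure here is only one plausible exploration strategy; tree search or best-first expansion would fit the same interface of proposing, scoring, and pruning interleaved states.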


Conceptually, the framework unifies symbolic “Chain-of-Thought” reasoning in language models [3] with visual “Chain-of-Frames” reasoning emerging in video and world-model research [4]. In doing so, it enables a single process in which visual and linguistic reasoning develop jointly and guide each other. Unlike earlier approaches that relied on additional training or architectural modification, this formulation operates entirely at inference time, focusing on how reasoning depth and quality can scale with compute. It therefore provides a new lens on test-time computation: rather than fixing the amount of reasoning the model performs, inference itself becomes a controllable resource, with more steps, branches, or refinements allocated when a problem demands deeper understanding.
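
As a toy illustration of compute as a controllable resource, the sketch below scales reasoning depth and branching with a crude uncertainty estimate. The `estimate_uncertainty` heuristic is purely a placeholder of my own; a real controller might instead use answer-distribution entropy or disagreement among sampled trajectories.

```python
# Sketch of a test-time compute controller: reasoning depth and branching
# grow with a (stand-in) uncertainty estimate instead of being fixed.
import math

def estimate_uncertainty(question: str) -> float:
    # Placeholder heuristic: pretend longer questions are harder (0..1).
    return min(1.0, len(question) / 200)

def compute_budget(question: str, base_depth: int = 2, max_depth: int = 10,
                   base_beam: int = 1, max_beam: int = 5):
    """Map uncertainty onto an exploration budget (depth, beam width)."""
    u = estimate_uncertainty(question)
    depth = base_depth + math.ceil(u * (max_depth - base_depth))
    beam = base_beam + math.ceil(u * (max_beam - base_beam))
    return depth, beam

# An easy query gets a shallow budget; a harder one gets more steps/branches.
print(compute_budget("What color is the cube?"))
print(compute_budget("If the stack of three blocks is pushed from the left "
                     "while the table tilts, where does each block land?"))
```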


From a practical standpoint, this direction is architecture-agnostic. It can be instantiated within autoregressive transformers, diffusion-based world models, or hybrid multimodal architectures, each interpreting “deliberation” in its own representational space. Across these variants, the same high-level principles hold: inference proceeds through structured exploration; reasoning outcomes are guided by consistency across modalities (a signal sketched below); and computation is distributed adaptively rather than uniformly. The result is a form of computational introspection, where a model can not only generate but also examine and refine its own intermediate reasoning states.

More broadly, this research seeks to answer a fundamental question in the evolution of large models: how can we move from pattern recognition toward genuine multimodal cognition, that is, systems capable of imagining, evaluating, and reasoning about the world they describe? By embedding visual imagination into the reasoning loop, this framework pushes beyond descriptive intelligence toward predictive and spatial understanding, enabling models to reason about cause, effect, and transformation in grounded contexts such as physical reasoning, robotics, or embodied interaction [5, 6].
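
The cross-modal consistency signal mentioned above can be illustrated with a small sketch: embed a textual claim and an imagined visual state into a shared space and score their agreement. The shared CLIP-style embedding space is an assumption, and the hash-seeded encoders below are stand-ins for real ones.

```python
# Sketch of a cross-modal consistency score used to guide exploration.
# The encoders are deterministic stand-ins, not real text/image models.
import hashlib
import numpy as np

def _embed(key: str, dim: int = 8) -> np.ndarray:
    # Stand-in encoder: a unit-norm pseudo-embedding seeded by the input.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def consistency(claim: str, imagined_frame: str) -> float:
    """Cosine agreement between a textual claim and an imagined visual state,
    both mapped into a shared embedding space (CLIP-style, assumed)."""
    return float(_embed("text:" + claim) @ _embed("image:" + imagined_frame))

# Trajectories whose imagined frames disagree with their own textual claims
# would be the first candidates for pruning during exploration.
print(consistency("the red block falls off the table", "frame_042"))
```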


This line of work contributes to the broader effort to make large models compute-aware, introspective, and deliberative—treating reasoning not as a static output but as a dynamic process that unfolds across space, time, and imagination. It offers a scalable pathway toward models that can explain their conclusions, visualize their reasoning, and adapt computation to the demands of each problem—bringing machine reasoning a step closer to the flexible, imaginative intelligence observed in humans.