Toward Deliberate Multimodal Reasoning: Integrating Imagination and Inference

Recent multimodal language models demonstrate remarkable progress in combining visual perception with linguistic understanding, yet they still fall short of genuine spatial and temporal reasoning: the ability to imagine, simulate, and reason about how the world changes over time. While current systems can describe what they see, they struggle to anticipate what might happen next or to infer unseen relationships across viewpoints and objects. Human cognition, by contrast, depends heavily on the interplay between language and visual imagination: when reasoning about a physical scene, we not only think in words but also visualize transformations, interactions, and causal sequences [1, 2].


My ongoing research explores a new inference-time paradigm that aims to bring this type of deliberative multimodal reasoning to large models. Instead of producing an answer in a single forward pass, the framework treats inference as an adaptive exploration process, allocating computational effort dynamically based on the complexity or uncertainty of the reasoning task. Through this lens, inference becomes a controlled form of “multimodal thinking,” in which the model constructs, evaluates, and refines hypothetical multimodal states that evolve through space, time, or perspective.

At the core of this paradigm lies the idea that reasoning should unfold not only symbolically but also perceptually. The model alternates between textual deliberation (expressing logic, conditions, or intentions) and visual imagination (depicting or internally representing possible outcomes). These alternating steps form what can be viewed as multimodal cognitive trajectories: sequences of interleaved textual and visual states that the model can explore, compare, and prune according to coherence and plausibility criteria. Each trajectory embodies a distinct “train of thought” through both language and imagery, bridging two traditionally separate reasoning paradigms.
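
To make the exploration loop concrete, the sketch below shows one way interleaved trajectories could be branched, scored, and pruned at inference time. It is a minimal illustration under stated assumptions, not an implementation of the framework: `propose_steps` and `coherence` are hypothetical stand-ins for a multimodal model's step generator and a consistency critic.

```python
# Minimal sketch: beam-search-style exploration over interleaved multimodal
# reasoning states. All model calls are stand-ins; a real system would back
# `propose_steps` and `coherence` with a multimodal model and a critic.
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    modality: str   # "text" (deliberation) or "image" (imagined state)
    content: str    # token string or a handle to an imagined frame

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    score: float = 0.0

def propose_steps(traj: Trajectory, branch: int) -> list:
    """Stand-in for the model: alternate textual deliberation with visual
    imagination, producing `branch` candidate continuations."""
    next_mod = "image" if (traj.steps and traj.steps[-1].modality == "text") else "text"
    return [Step(next_mod, f"{next_mod}-hypothesis-{i}") for i in range(branch)]

def coherence(traj: Trajectory, step: Step) -> float:
    """Stand-in for a plausibility/consistency critic that judges how well
    the new step agrees with the trajectory so far (here: random)."""
    return random.random()

def explore(depth: int = 6, beam: int = 3, branch: int = 4) -> Trajectory:
    frontier = [Trajectory()]
    for _ in range(depth):
        candidates = []
        for traj in frontier:
            for step in propose_steps(traj, branch):
                s = coherence(traj, step)
                candidates.append(Trajectory(traj.steps + [step], traj.score + s))
        # Prune: keep only the `beam` most coherent trajectories alive.
        frontier = sorted(candidates, key=lambda t: -t.score)[:beam]
    return frontier[0]

if __name__ == "__main__":
    best = explore()
    print([f"{s.modality}: {s.content}" for s in best.steps])
```

The beam structure here is only one plausible exploration strategy; tree search or best-first expansion would fit the same interface of proposing, scoring, and pruning interleaved states.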


Conceptually, the framework unifies symbolic “Chain-of-Thought” reasoning in language models [3] with visual “Chain-of-Frames” reasoning emerging in video and world-model research [4]. In doing so, it enables a single process in which visual and linguistic reasoning develop jointly and guide each other. Unlike earlier approaches that relied on additional training or architectural modification, this formulation operates entirely at inference time, focusing on how reasoning depth and quality can scale with compute. It therefore provides a new lens on test-time computation: rather than fixing the amount of reasoning the model performs, inference itself becomes a controllable resource, with more steps, branches, or refinements allocated when a problem demands deeper understanding.
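
As a toy illustration of compute as a controllable resource, the sketch below scales reasoning depth and branching with a crude uncertainty estimate. The `estimate_uncertainty` heuristic is purely a placeholder of my own; a real controller might instead use answer-distribution entropy or disagreement among sampled trajectories.

```python
# Sketch of a test-time compute controller: reasoning depth and branching
# grow with a (stand-in) uncertainty estimate instead of being fixed.
import math

def estimate_uncertainty(question: str) -> float:
    # Placeholder heuristic: pretend longer questions are harder (0..1).
    return min(1.0, len(question) / 200)

def compute_budget(question: str, base_depth: int = 2, max_depth: int = 10,
                   base_beam: int = 1, max_beam: int = 5):
    """Map uncertainty onto an exploration budget (depth, beam width)."""
    u = estimate_uncertainty(question)
    depth = base_depth + math.ceil(u * (max_depth - base_depth))
    beam = base_beam + math.ceil(u * (max_beam - base_beam))
    return depth, beam

# An easy query gets a shallow budget; a harder one gets more steps/branches.
print(compute_budget("What color is the cube?"))
print(compute_budget("If the stack of three blocks is pushed from the left "
                     "while the table tilts, where does each block land?"))
```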


From a practical standpoint, this direction is architecture-agnostic. It can be instantiated within autoregressive transformers, diffusion-based world models, or hybrid multimodal architectures, each interpreting “deliberation” in its own representational space. Across these variants, the same high-level principles hold: inference proceeds through structured exploration; reasoning outcomes are guided by consistency across modalities (a signal sketched below); and computation is distributed adaptively rather than uniformly. The result is a form of computational introspection, where a model can not only generate but also examine and refine its own intermediate reasoning states.

More broadly, this research seeks to answer a fundamental question in the evolution of large models: how can we move from pattern recognition toward genuine multimodal cognition, that is, systems capable of imagining, evaluating, and reasoning about the world they describe? By embedding visual imagination into the reasoning loop, this framework pushes beyond descriptive intelligence toward predictive and spatial understanding, enabling models to reason about cause, effect, and transformation in grounded contexts such as physical reasoning, robotics, or embodied interaction [5, 6].
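
The cross-modal consistency signal mentioned above can be illustrated with a small sketch: embed a textual claim and an imagined visual state into a shared space and score their agreement. The shared CLIP-style embedding space is an assumption, and the hash-seeded encoders below are stand-ins for real ones.

```python
# Sketch of a cross-modal consistency score used to guide exploration.
# The encoders are deterministic stand-ins, not real text/image models.
import hashlib
import numpy as np

def _embed(key: str, dim: int = 8) -> np.ndarray:
    # Stand-in encoder: a unit-norm pseudo-embedding seeded by the input.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def consistency(claim: str, imagined_frame: str) -> float:
    """Cosine agreement between a textual claim and an imagined visual state,
    both mapped into a shared embedding space (CLIP-style, assumed)."""
    return float(_embed("text:" + claim) @ _embed("image:" + imagined_frame))

# Trajectories whose imagined frames disagree with their own textual claims
# would be the first candidates for pruning during exploration.
print(consistency("the red block falls off the table", "frame_042"))
```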


This line of work contributes to the broader effort to make large models compute-aware, introspective, and deliberative—treating reasoning not as a static output but as a dynamic process that unfolds across space, time, and imagination. It offers a scalable pathway toward models that can explain their conclusions, visualize their reasoning, and adapt computation to the demands of each problem—bringing machine reasoning a step closer to the flexible, imaginative intelligence observed in humans.