Reinforcing Spatial Reasoning and Human Alignment in VLMs/MLLMs
We are developing practical methods to enhance spatial reasoning and human-aligned multimodal understanding in VLMs through a combination of structured reasoning and reinforcement fine-tuning. The goal is to move beyond surface-level perception and enable models to reason about geometry, relationships, and affordances in ways consistent with human spatial intuition.
We focus on parameter-efficient adaptation (LoRA, visual adapters) for supervised fine-tuning (SFT), followed by reinforcement fine-tuning (RLFT) with DPO, GRPO, and related formulations, which serve as stable and compute-feasible mechanisms for refining multimodal reasoning policies. Building on recent advances, we observe that structured prompting and scene-level intermediate reasoning substantially improve consistency between visual inputs and textual reasoning traces. Following Ji et al. [1], we adopt scene-graph and optical-flow CoT structures to explicitly align perception with relational reasoning steps, leading to stronger generalization under domain shifts.
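A minimal sketch of this parameter-efficient SFT setup is shown below, using Hugging Face PEFT for LoRA and a scene-graph CoT instruction template. The backbone name, target modules, and the template wording are illustrative assumptions rather than fixed choices of our pipeline.

```python
# Sketch: LoRA-based SFT of a VLM backbone with a scene-graph CoT template.
# Model name and target modules are assumptions; adjust per architecture.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

model_name = "Qwen/Qwen2-VL-2B-Instruct"  # assumed backbone
model = AutoModelForVision2Seq.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (architecture-dependent)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)    # only adapter weights remain trainable

# Scene-graph CoT template: the model is supervised to emit an explicit
# relational scene description before producing the final answer.
SCENE_GRAPH_COT = (
    "First, list the objects and their pairwise spatial relations as a scene graph "
    "(e.g., <cup, on, table>). Then reason over the graph step by step. "
    "Finally, answer the question."
)
```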
To further refine alignment with human spatial reasoning, we incorporate RLFT with preference-driven and group-relative rewards. Consistent with the findings of Tan et al. [2], this two-stage process (SFT followed by GRPO fine-tuning) yields superior robustness and data efficiency across counting, structure perception, and spatial transformation tasks. Unlike SFT alone, which tends to overfit to linguistic surface patterns, RLFT encourages feedback-driven correction and better cross-modal grounding.
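To make the group-relative reward concrete, the sketch below shows how GRPO-style advantages are typically computed: several responses are sampled per prompt and each one is scored against the group's mean reward. The reward values in the example are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: z-score each sampled response's reward
    against the mean and std of its group (all responses to one prompt)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to one spatial-reasoning question,
# rewarded 1.0 for a correct answer with a well-formed trace, else 0.0.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> approximately [ 1.0, -1.0,  1.0, -1.0]
```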
Finally, extending these ideas to 3D reasoning, we integrate physics-aware reward modeling as in Pan et al. [3]. The resulting 3D-SPO framework enforces spatial plausibility through object-level physics modulation and trajectory-level aggregation, allowing VLMs to internalize physical constraints and human-like spatial preferences. Together, these strategies enable systematic scaling of multimodal reasoning and alignment without prohibitive compute or data requirements.
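The sketch below illustrates the general shape of such a reward in the spirit of [3]: per-object physics plausibility scores modulate a base task reward at each step, and step rewards are aggregated over the trajectory. All names and penalty weights (penetration_depth, is_floating, the 5.0 and 0.5 factors) are hypothetical placeholders, not the reward used in [3].

```python
# Sketch of a physics-aware reward: object-level modulation + trajectory aggregation.
from dataclasses import dataclass
from typing import List

@dataclass
class ObjectState:
    penetration_depth: float  # overlap with other geometry, in meters (assumed field)
    is_floating: bool         # unsupported object hanging in mid-air (assumed field)

def object_physics_score(obj: ObjectState) -> float:
    """Score in [0, 1]: 1.0 for a physically plausible placement."""
    score = 1.0
    score -= min(obj.penetration_depth * 5.0, 0.5)  # penalize interpenetration
    if obj.is_floating:
        score -= 0.5                                # penalize unsupported objects
    return max(score, 0.0)

def step_reward(task_reward: float, objects: List[ObjectState]) -> float:
    """Object-level physics modulation of the base task reward."""
    physics = sum(object_physics_score(o) for o in objects) / max(len(objects), 1)
    return task_reward * physics

def trajectory_reward(step_rewards: List[float], gamma: float = 0.99) -> float:
    """Trajectory-level aggregation: discounted sum over placement/reasoning steps."""
    return sum((gamma ** t) * r for t, r in enumerate(step_rewards))
```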
References
[1] B. Ji, S. Agrawal, Q. Tang, Y. Wu. Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning. arXiv:2507.13362, 2025.
[2] H. Tan, Y. Ji, X. Hao, X. Chen, P. Wang, Z. Wang, S. Zhang. Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models. arXiv:2503.20752, 2025.
[3] Z. Pan, H. Liu. MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse. arXiv:2503.18470, 2025.