Reinforcing Spatial Reasoning and Human Alignment in VLMs/MLLMs
We are developing practical methods to enhance spatial reasoning and human-aligned multimodal understanding in VLMs through a combination of structured reasoning and reinforcement fine-tuning (RLFT). The goal is to move beyond surface-level perception and enable models to reason about geometry, relationships, and affordances in ways consistent with human spatial intuition.
We focus on parameter-efficient adaptation (LoRA, visual adapters) followed by RLFT using Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and related formulations as stable, compute-feasible mechanisms for refining multimodal reasoning policies. Building on recent advances, we observe that structured prompting and scene-level intermediate reasoning substantially improve consistency between visual inputs and textual reasoning traces. Following Ji et al. [1], we adopt scene-graph and optical-flow chain-of-thought (CoT) structures that explicitly align perception with relational reasoning steps, leading to stronger generalization under domain shifts.
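To make the scene-level intermediate reasoning concrete, the sketch below shows one way a scene-graph-style CoT prompt could be assembled. The step names, relation syntax, and wording are illustrative assumptions, not the exact prompt format used by Ji et al. [1].

```python
# Illustrative sketch: building a scene-graph-style chain-of-thought prompt.
# Field names and prompt wording are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class SpatialQuery:
    question: str  # e.g. "Is the mug left of the laptop?"


def build_scene_graph_cot_prompt(query: SpatialQuery) -> str:
    """Ask the model to externalize a scene graph before answering."""
    return (
        "You will answer a spatial question about the image.\n"
        "Step 1 - Objects: list each visible object as `id: category (attributes)`.\n"
        "Step 2 - Relations: list pairwise spatial relations as\n"
        "         `subject -> relation -> object` (e.g. `mug_1 -> left_of -> laptop_1`).\n"
        "Step 3 - Answer: using only the relations above, answer the question.\n\n"
        f"Question: {query.question}\n"
        "Step 1 - Objects:"
    )


if __name__ == "__main__":
    print(build_scene_graph_cot_prompt(SpatialQuery("Is the mug left of the laptop?")))
```

Keeping the relation list explicit in the output makes individual reasoning steps easy to verify or reward during subsequent RLFT.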
To further refine alignment with human spatial reasoning, we incorporate RLFT-based optimization using preference-driven and group-relative rewards. Consistent with the findings of Tan et al. [2], this two-stage process (supervised fine-tuning to activate reasoning, followed by GRPO-based reinforcement to enhance it) yields superior robustness and data efficiency across counting, structure perception, and spatial transformation tasks. Unlike SFT, which tends to overfit to linguistic surface patterns, RLFT encourages feedback-driven correction and stronger cross-modal grounding.
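As a concrete reference point for the group-relative objective, the following minimal PyTorch sketch computes GRPO-style advantages by normalizing each sampled response's reward against the other responses in its group, then applies a clipped policy-gradient surrogate. The reward values, group size, and clipping range are placeholders, not the settings used in Reason-RFT [2].

```python
# Minimal sketch of GRPO-style group-relative advantages and a clipped policy
# loss in plain PyTorch; numbers below are illustrative placeholders.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses.

    Each response's advantage is its reward normalized against the other
    responses sampled for the same prompt (no learned value function).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate applied per sampled response."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # 2 prompts, 4 sampled responses each; rewards could come from answer
    # correctness checks or a preference model.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])
    adv = group_relative_advantages(rewards)
    logp_old = torch.randn(2, 4)
    logp_new = logp_old + 0.05 * torch.randn(2, 4)
    print(clipped_policy_loss(logp_new, logp_old, adv))
```

Because advantages are computed within each sampled group, no separate value network is needed, which is part of what keeps this stage compute-feasible after parameter-efficient SFT.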
Finally, extending these ideas to 3D reasoning, we integrate physics-aware reward modeling as in Pan et al. [3]. The resulting 3D-SPO framework enforces spatial plausibility through object-level physics modulation and trajectory-level aggregation, allowing VLMs to internalize physical constraints and human-like spatial preferences. Together, these strategies enable systematic scaling of multimodal reasoning and alignment without prohibitive compute or data requirements.
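The sketch below illustrates the general shape of such a physics-aware reward: per-object plausibility terms (here, floating and interpenetration penalties) are weighted and then aggregated over the steps of a layout trajectory. The penalty terms, weights, and aggregation rule are assumptions chosen for illustration; they are not the exact reward used in MetaSpatial [3] or in our 3D-SPO formulation.

```python
# Illustrative sketch of a physics-aware reward: object-level plausibility
# scores aggregated over a multi-step 3D layout trajectory. All terms and
# weights below are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class PlacedObject:
    name: str
    z_min: float          # lowest point of the bounding box above the support surface (m)
    penetration: float    # overlap depth with other objects (m, >= 0)


def object_plausibility(obj: PlacedObject,
                        w_float: float = 1.0,
                        w_collide: float = 2.0) -> float:
    """Per-object score in (0, 1]: penalize floating and interpenetration."""
    floating = max(obj.z_min, 0.0)
    penalty = w_float * floating + w_collide * obj.penetration
    return 1.0 / (1.0 + penalty)


def trajectory_reward(steps: List[List[PlacedObject]]) -> float:
    """Average the per-step mean object score over the whole trajectory."""
    step_scores = [
        sum(object_plausibility(o) for o in objs) / len(objs)
        for objs in steps if objs
    ]
    return sum(step_scores) / len(step_scores) if step_scores else 0.0


if __name__ == "__main__":
    step = [PlacedObject("chair", z_min=0.0, penetration=0.0),
            PlacedObject("lamp", z_min=0.3, penetration=0.05)]
    print(trajectory_reward([step]))
```

Rewards of this form can plug directly into the group-relative objective sketched above, so physical plausibility shapes the policy at both the object and trajectory level.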
References
[1] B. Ji, S. Agrawal, Q. Tang, Y. Wu. Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning. arXiv:2507.13362, 2025.
[2] H. Tan, Y. Ji, X. Hao, X. Chen, P. Wang, Z. Wang, S. Zhang. Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models. arXiv:2503.20752, 2025.
[3] Z. Pan, H. Liu. MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse. arXiv:2503.18470, 2025.