Fine-Tuning and Scaling Toward Spatial Reasoning in Multimodal LLMs

Before examining test-time scaling and inference-level reasoning, my research explores the feasibility and scaling behavior of three fine-tuning paradigms as mechanisms for improving spatial reasoning in multimodal large language models: Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and Reinforcement Learning Fine-Tuning (RLFT).
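As a concrete reference point, the sketch below shows what the PEFT setting typically looks like in practice: a LoRA configuration built with Hugging Face's peft library. The base checkpoint and every hyperparameter here are illustrative assumptions, not the configuration used in this work; the same recipe carries over when the backbone is a multimodal model.

```python
# Minimal LoRA sketch with Hugging Face's `peft` library.
# Assumptions: the base checkpoint and all hyperparameters below are
# illustrative placeholders, not this project's actual setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA freezes the base weights and learns low-rank adapters, here on the
# attention query/value projections of a LLaMA-style architecture.
config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # module names are model-specific
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Training then proceeds as ordinary SFT over the adapter parameters alone, which is what makes PEFT's scaling behavior cheap to probe across many model sizes.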

The goal is to determine how these post-training adaptation techniques influence the emergence of structured spatial reasoning, scene comprehension, and action-grounded inference when models operate in embodied or human-centered environments (see Practical Focus and Dataset Design below for details on the pilot VQA dataset and fine-tuning setup).
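To make the target behavior concrete, a single spatial-VQA training example might look like the record below. The schema, field names, and contents are hypothetical illustrations, not the pilot dataset's actual format.

```python
# Hypothetical spatial-VQA record (illustrative only; not the pilot
# dataset's real schema). It pairs an image with a question that requires
# 3D spatial grounding and a short, action-relevant answer.
example = {
    "image": "kitchen_scene_0042.jpg",  # placeholder filename
    "question": "Which object is within reach of the person's left hand?",
    "answer": "the mug on the counter",
    "rationale": "The mug is roughly 20 cm from the hand; the kettle is "
                 "across the counter and out of reach.",
    "skills": ["relative_distance", "egocentric_frame", "object_grounding"],
}
```

Records of this shape can support both SFT (supervising on the answer and rationale) and RLFT (rewarding correct, well-grounded answers).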


Scaling as the Foundation

The foundation for this exploration originates in the scaling laws of neural language models [1]. These laws established that performance improves smoothly with model size, dataset size, and training compute, following predictable power-law trends (Figure 1 in [1], reproduced below). This regularity in pretraining efficiency forms the theoretical basis for examining how post-training methods such as SFT and RLFT can extend scaling behavior toward multimodal and reasoning tasks, particularly those requiring 3D spatial understanding.
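For reference, the single-variable fits reported in [1] take the power-law form below, where L is test loss, N is non-embedding parameter count, D is dataset size in tokens, and C_min is compute along the compute-efficient frontier; the exponents are quoted approximately from that paper.

```latex
% Power-law fits from [1] (exponents approximate).
\begin{align}
  L(N) &= \left(\frac{N_c}{N}\right)^{\alpha_N},
    & \alpha_N &\approx 0.076 \\
  L(D) &= \left(\frac{D_c}{D}\right)^{\alpha_D},
    & \alpha_D &\approx 0.095 \\
  L(C_{\mathrm{min}}) &= \left(\frac{C_c^{\mathrm{min}}}{C_{\mathrm{min}}}\right)^{\alpha_C^{\mathrm{min}}},
    & \alpha_C^{\mathrm{min}} &\approx 0.050
\end{align}
```

Each fit holds in the regime where the other two factors are not the bottleneck.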