Fine-Tuning and Scaling Laws for Spatial Reasoning in Multimodal LLMs

Before examining test-time (inference-time) scaling, my research explores the feasibility and scaling behavior of two fine-tuning paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning Fine-Tuning (RLFT), as mechanisms for improving spatial reasoning in multimodal large language models.

The goal is to determine how these post-training adaptation techniques influence the emergence of spatial reasoning, scene comprehension, and action-grounded inference when models are applied to embodied or human-centered environments; the pilot vision–language question answering (VQA) dataset and fine-tuning setup are described under Practical Focus and Dataset Design below.
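To make the contrast between the two paradigms concrete, the sketch below shows their training objectives in schematic form. It is illustrative only, not the project's training code: the tensors, shapes, and reward values are hypothetical placeholders standing in for a multimodal LLM's answer logits, the log-probabilities of sampled answers, and a spatial-QA correctness reward.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    """SFT objective: next-token cross-entropy against the reference answer tokens."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 1..T-1
        answer_ids[:, 1:].reshape(-1),                # shifted ground-truth targets
        ignore_index=-100,                            # convention: prompt/question tokens masked with -100
    )

def rlft_loss(answer_log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """RLFT objective (REINFORCE-style): upweight sampled answers in proportion to reward."""
    advantage = rewards - rewards.mean()              # simple mean baseline to reduce variance
    return -(advantage.detach() * answer_log_probs.sum(dim=-1)).mean()

# Toy example: batch of 2 answers, 5 tokens each, vocabulary of 10.
logits = torch.randn(2, 5, 10, requires_grad=True)       # stand-in for the model's output logits
answer_ids = torch.randint(0, 10, (2, 5))                 # stand-in for reference answer token ids
print(sft_loss(logits, answer_ids))

answer_log_probs = torch.randn(2, 5).requires_grad_()     # stand-in for per-token log-probs of sampled answers
rewards = torch.tensor([1.0, 0.0])                        # e.g. exact-match correctness of each sampled answer
print(rlft_loss(answer_log_probs, rewards))
```

The key difference this highlights is the supervision signal: SFT learns directly from reference answers, while RLFT only needs a scalar reward on sampled answers, which is what allows it to optimize task-level criteria such as spatial correctness.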


Scaling as the Foundation

The foundation for this exploration is the Scaling Laws for Neural Language Models paper [1]. These laws established that performance improves smoothly with model size, dataset size, and compute, following predictable power-law trends (Figure 1 in [1]; see below). This regularity in pretraining efficiency forms the theoretical basis for examining how post-training methods such as SFT and RLFT can extend scaling behavior toward multimodal and reasoning tasks, particularly those requiring 3D spatial understanding.
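For reference, those power laws can be written in the schematic form below, where L is test loss, N is (non-embedding) parameter count, D is dataset size in tokens, C_min is optimally allocated compute, and N_c, D_c, C_c are fitted constants; the exponents are the approximate values reported in [1].

```latex
% Schematic restatement of the pretraining power laws in [1]; exponents are the
% approximate values reported there, and N_c, D_c, C_c are fitted constants.
\begin{align}
  L(N)        &\approx \left(\frac{N_c}{N}\right)^{\alpha_N},               & \alpha_N        &\approx 0.076, \\
  L(D)        &\approx \left(\frac{D_c}{D}\right)^{\alpha_D},               & \alpha_D        &\approx 0.095, \\
  L(C_{\min}) &\approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, & \alpha_C^{\min} &\approx 0.050.
\end{align}
```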