Practical Focus and Dataset Design
Given the high computational requirements of full multimodal fine-tuning, which typically demands several million paired samples for models in the 3B to 7B parameter range, this research focuses on parameter-efficient fine-tuning (LoRA and visual adapter tuning) and reinforcement-learning fine-tuning (RLFT) as practical methods for improving visual–spatial reasoning, cross-modal alignment, and task-specific adaptation under limited compute and data budgets.
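As a concrete illustration, the sketch below shows how a LoRA adapter might be attached to one of the candidate backbones using Hugging Face PEFT. The model name, target modules, and rank are placeholder assumptions for illustration, not the final training configuration.

```python
# Minimal LoRA setup sketch (assumed stack: transformers + peft).
# Backbone, target modules, and hyperparameters are illustrative placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed backbone; LLaVA-7B would be handled similarly
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=16,             # low-rank dimension (placeholder)
    lora_alpha=32,    # scaling factor (placeholder)
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Only the adapter weights are updated during training, which is what keeps the compute and memory footprint compatible with the limited-resource setting described above.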
To support these experiments, a pilot VQA benchmark is being developed to evaluate and enhance multimodal reasoning in embodied and collaborative dynamic environments. The dataset consists of 5k~10k RGB images, yielding approximately 25k~50k question–answer pairs. It covers roughly 150~250 kitchen collaboration or multi-agent scenes, including handover, cooking, and cleaning tasks. Each scene contains about 20 images captured from diverse viewpoints, with 5~8 questions per image emphasizing reasoning, spatial relations, and context understanding.
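For orientation, a single benchmark record might look roughly like the following. All field names and values are hypothetical placeholders, not the finalized schema.

```python
# Illustrative (hypothetical) structure of one question-answer record in the pilot benchmark.
example_record = {
    "scene_id": "kitchen_handover_017",          # one of the ~150-250 collaboration scenes
    "image_id": "kitchen_handover_017_view05",   # ~20 viewpoints captured per scene
    "task": "handover",                          # handover / cooking / cleaning
    "question": "Which object is the person on the left reaching for, "
                "and is it within the robot's graspable workspace?",
    "answer": "The cutting board; yes, it lies within the robot's reachable area.",
    "question_type": "spatial_relation",         # e.g. spatial_relation, affordance, occlusion
}
```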
The design of this benchmark prioritizes spatial reasoning and multimodal alignment rather than surface-level perception. It focuses on object relationships, relative positions, affordances, occlusion, and coordinated interactions between agents and objects in shared workspaces. These characteristics make it suitable for training and evaluating models such as LLaVA-7B or Qwen2-VL-7B on tasks that require consistent spatial understanding in human–robot collaboration.
For RLFT, the motivation extends beyond the limitations of SFT. While supervised fine-tuning (SFT) provides a means of aligning visual and textual modalities, it relies heavily on large, high-quality datasets, which are often scarce for specialized domains such as spatial reasoning or human–robot collaboration. RL fine-tuning offers a complementary approach, improving reasoning behavior through feedback rather than direct supervision.
In this framework, reward functions are designed to encourage internal consistency between visual context and predicted spatial actions, reinforce multi-step reasoning chains, and penalize ambiguous or physically implausible predictions. This enables models to self-correct and generalize beyond the examples seen during fine-tuning. In this work, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) are being used as practical and stable formulations for scaling RLFT to multimodal settings. Incorporating these methods allows a systematic evaluation of how models refine their reasoning policies and spatial inference capabilities without requiring extensive additional supervision, while keeping training computationally feasible.
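To make these objectives concrete, the sketch below implements the standard DPO loss on per-sequence log-probabilities, alongside the group-relative advantage normalization that GRPO applies to responses sampled for the same prompt. Tensor names, the beta value, and the usage lines are assumptions for illustration, not the project's final implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected one,
    measured as log-probability ratios against a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of responses
    sampled for the same (image, question) prompt, avoiding a learned critic."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical usage with per-sequence log-probabilities and scalar rewards:
# loss = dpo_loss(pi_lp_chosen, pi_lp_rejected, ref_lp_chosen, ref_lp_rejected)
# adv = grpo_advantages(torch.tensor([0.8, 0.2, 1.0, 0.0]))  # one group of 4 samples
```

Both formulations avoid training a separate value model, which is part of what keeps RLFT computationally feasible in this setting: DPO learns directly from preference pairs, while GRPO derives advantages from within-group reward comparisons.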
Results are in progress — stay tuned.