Practical Focus and Dataset Design
Full multimodal fine-tuning is computationally demanding, typically requiring several million paired samples for models in the 3B to 7B parameter range. This research therefore focuses on parameter-efficient fine-tuning (LoRA and visual adapter tuning) and reinforcement learning fine-tuning (RLFT) as practical methods for improving visual–spatial reasoning, cross-modal alignment, and task-specific adaptation under limited compute and data budgets.
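As a minimal illustration of the parameter-efficient setup, the sketch below applies a LoRA adapter to a LLaVA-style checkpoint using the Hugging Face PEFT library; the checkpoint name, rank, and target module list are assumptions and would need to be matched to the actual model architecture.

```python
# Minimal LoRA setup sketch (assumed checkpoint and hyperparameters).
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",   # assumed LLaVA-7B checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                          # low-rank update dimension
    lora_alpha=32,                 # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # matches attention projections; restrict the
                                          # pattern if only the language backbone should be adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Visual adapter tuning follows the same pattern, except the trainable parameters sit in the vision–language projector rather than in the attention layers.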
To support these experiments, a pilot visual question answering (VQA) benchmark is being developed to evaluate and enhance multimodal reasoning in embodied and collaborative environments. The dataset consists of 5,000 to 10,000 static RGB images, yielding approximately 25,000 to 50,000 question–answer pairs. It covers 150 to 250 kitchen collaboration or multi-agent scenes, including handover, cooking, and cleaning tasks. Each scene contains about 20 images captured from diverse viewpoints, with five to eight questions per image emphasizing reasoning, spatial relations, and context understanding.
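To make the intended annotation structure concrete, the sketch below shows one possible question–answer record; all field names and values are hypothetical and only illustrate the scene–image–question hierarchy described above.

```python
# Hypothetical annotation record for the pilot VQA benchmark (illustrative only).
example_record = {
    "scene_id": "kitchen_handover_017",        # one of ~150-250 collaboration scenes
    "task_type": "handover",                   # handover | cooking | cleaning
    "image_id": "kitchen_handover_017_view04", # one of ~20 viewpoints per scene
    "question": "Which object lies between the human's hand and the robot gripper?",
    "answer": "the red mug",
    "question_category": "spatial_relation",   # e.g. spatial_relation | affordance | occlusion
}
```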
The design of this benchmark prioritizes spatial reasoning and multimodal alignment rather than surface-level perception. It focuses on object relationships, relative positions, affordances, occlusion, and coordinated interactions between agents and objects in shared workspaces. These characteristics make it suitable for training and evaluating models such as LLaVA-7B or Qwen2-VL-7B on tasks that require consistent spatial understanding in human–robot collaboration.
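A minimal zero-shot evaluation sketch on one benchmark image is shown below, assuming the Hugging Face Transformers interface for Qwen2-VL-7B-Instruct; the exact preprocessing calls may differ slightly between library versions, and the image path and question are hypothetical.

```python
# Sketch of zero-shot VQA evaluation on a single benchmark image
# (assumed Qwen2-VL-7B-Instruct checkpoint and Transformers interface).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

image = Image.open("kitchen_handover_017_view04.png")  # hypothetical benchmark image
question = "Which object lies between the human's hand and the robot gripper?"

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": question},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # compared against the reference answer for scoring
```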
For RLFT, the motivation extends beyond the limitations of supervised fine-tuning (SFT) and LoRA-based adaptation. While SFT and LoRA provide efficient means of aligning visual and textual modalities, they rely heavily on large, high-quality datasets, which are often unavailable for specialized domains such as spatial reasoning or human–robot collaboration. RLFT offers a complementary approach by improving reasoning behavior through feedback rather than direct supervision.
In this framework, reward functions are designed to encourage internal consistency between visual context and predicted spatial actions, reinforce multi-step reasoning chains, and penalize ambiguous or physically implausible predictions. This enables models to self-correct and generalize beyond the examples seen during fine-tuning. In this work, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) are used as practical and stable formulations for scaling RLFT to multimodal settings. Incorporating these methods allows a systematic evaluation of how models refine their reasoning policies and spatial inference capabilities without requiring extensive additional supervision, while keeping training computationally feasible.
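For reference, the sketch below writes out the standard DPO objective and a GRPO-style group-relative advantage in plain PyTorch; variable names are illustrative, and the snippet is framework-agnostic rather than the exact training code used here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: increase the policy's preference margin for the
    chosen answer over the rejected one, relative to a frozen reference model."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantage: rewards for a group of responses sampled from the
    same prompt, normalized by the group mean and std (no learned critic)."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: spatial-consistency rewards for four sampled answers to one prompt.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(grpo_advantages(rewards))
```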
Results are in progress — stay tuned.