Practical Focus and Dataset Design
Given the high computational requirements of full multimodal fine-tuning, which typically demands several million paired samples for models in the 3B to 7B parameter range, this research focuses on parameter-efficient fine-tuning (LoRA and visual adapter tuning) and reinforcement-learning fine-tuning (RLFT) as practical methods for improving visual–spatial reasoning, cross-modal alignment, and task-specific adaptation under limited compute and data budgets.
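As a concrete illustration, the sketch below shows how a LoRA adapter might be attached to one of the candidate backbones using Hugging Face PEFT. The model name, target modules, and rank are placeholder assumptions for illustration, not the final training configuration.

```python
# Minimal LoRA setup sketch (assumed stack: transformers + peft).
# Backbone, target modules, and hyperparameters are illustrative placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed backbone; LLaVA-7B would be handled similarly
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=16,             # low-rank dimension (placeholder)
    lora_alpha=32,    # scaling factor (placeholder)
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Only the adapter weights are updated during training, which is what keeps the compute and memory footprint compatible with the limited-resource setting described above.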
To support these experiments, a pilot VQA benchmark is being developed to evaluate and enhance multimodal reasoning in embodied and collaborative dynamic environments. The dataset consists of 5k~10k RGB images, yielding approximately 25k~50k question–answer pairs. It covers roughly 150~250 kitchen collaboration or multi-agent scenes, including handover, cooking, and cleaning tasks. Each scene contains about 20 images captured from diverse viewpoints, with 5~8 questions per image emphasizing reasoning, spatial relations, and context understanding.
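For orientation, a single benchmark record might look roughly like the following. All field names and values are hypothetical placeholders, not the finalized schema.

```python
# Illustrative (hypothetical) structure of one question-answer record in the pilot benchmark.
example_record = {
    "scene_id": "kitchen_handover_017",          # one of the ~150-250 collaboration scenes
    "image_id": "kitchen_handover_017_view05",   # ~20 viewpoints captured per scene
    "task": "handover",                          # handover / cooking / cleaning
    "question": "Which object is the person on the left reaching for, "
                "and is it within the robot's graspable workspace?",
    "answer": "The cutting board; yes, it lies within the robot's reachable area.",
    "question_type": "spatial_relation",         # e.g. spatial_relation, affordance, occlusion
}
```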
The design of this benchmark prioritizes spatial reasoning and multimodal alignment rather than surface-level perception. It focuses on object relationships, relative positions, affordances, occlusion, and coordinated interactions between agents and objects in shared workspaces. These characteristics make it suitable for training and evaluating models such as LLaVA-7B or Qwen2-VL-7B on tasks that require consistent spatial understanding in human–robot collaboration.
For RLFT, the motivation extends beyond the limitations of SFT. While supervised fine-tuning (SFT) provides a means of aligning visual and textual modalities, it relies heavily on large, high-quality datasets, which are often scarce for specialized domains such as spatial reasoning or human–robot collaboration. RL fine-tuning offers a complementary approach, improving reasoning behavior through feedback rather than direct supervision.
In this framework, reward functions are designed to encourage internal consistency between visual context and predicted spatial actions, reinforce multi-step reasoning chains, and penalize ambiguous or physically implausible predictions. This enables models to self-correct and generalize beyond the examples seen during fine-tuning. In this work, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) are being used as practical and stable formulations for scaling RLFT to multimodal settings. Incorporating these methods allows a systematic evaluation of how models refine their reasoning policies and spatial inference capabilities without requiring extensive additional supervision, while keeping training computationally feasible.
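To make these objectives concrete, the sketch below implements the standard DPO loss on per-sequence log-probabilities, alongside the group-relative advantage normalization that GRPO applies to responses sampled for the same prompt. Tensor names, the beta value, and the usage lines are assumptions for illustration, not the project's final implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected one,
    measured as log-probability ratios against a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def grpo_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of responses
    sampled for the same (image, question) prompt, avoiding a learned critic."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical usage with per-sequence log-probabilities and scalar rewards:
# loss = dpo_loss(pi_lp_chosen, pi_lp_rejected, ref_lp_chosen, ref_lp_rejected)
# adv = grpo_advantages(torch.tensor([0.8, 0.2, 1.0, 0.0]))  # one group of 4 samples
```

Both formulations avoid training a separate value model, which is part of what keeps RLFT computationally feasible in this setting: DPO learns directly from preference pairs, while GRPO derives advantages from within-group reward comparisons.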
Results are in progress — stay tuned.