Practical Focus and Dataset Design
Full multimodal fine-tuning is computationally demanding, typically requiring several million paired samples for models in the 3B to 7B parameter range. This research therefore focuses on parameter-efficient fine-tuning (LoRA and visual adapter tuning) and reinforcement learning fine-tuning (RLFT) as practical methods for improving visual–spatial reasoning, cross-modal alignment, and task-specific adaptation under limited compute and data budgets.
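As a minimal illustration of the parameter-efficient setup, the sketch below applies a LoRA adapter to a LLaVA-style checkpoint using the Hugging Face PEFT library; the checkpoint name, rank, and target module list are assumptions and would need to be matched to the actual model architecture.

```python
# Minimal LoRA setup sketch (assumed checkpoint and hyperparameters).
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",   # assumed LLaVA-7B checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                          # low-rank update dimension
    lora_alpha=32,                 # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # matches attention projections; restrict the
                                          # pattern if only the language backbone should be adapted
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Visual adapter tuning follows the same pattern, except the trainable parameters sit in the vision–language projector rather than in the attention layers.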
To support these experiments, a pilot visual question answering (VQA) benchmark is being developed to evaluate and enhance multimodal reasoning in embodied and collaborative environments. The dataset consists of 5,000 to 10,000 static RGB images, yielding approximately 25,000 to 50,000 question–answer pairs. It covers 150 to 250 kitchen collaboration or multi-agent scenes, including handover, cooking, and cleaning tasks. Each scene contains about 20 images captured from diverse viewpoints, with five to eight questions per image emphasizing reasoning, spatial relations, and context understanding.
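To make the intended annotation structure concrete, the sketch below shows one possible question–answer record; all field names and values are hypothetical and only illustrate the scene–image–question hierarchy described above.

```python
# Hypothetical annotation record for the pilot VQA benchmark (illustrative only).
example_record = {
    "scene_id": "kitchen_handover_017",        # one of ~150-250 collaboration scenes
    "task_type": "handover",                   # handover | cooking | cleaning
    "image_id": "kitchen_handover_017_view04", # one of ~20 viewpoints per scene
    "question": "Which object lies between the human's hand and the robot gripper?",
    "answer": "the red mug",
    "question_category": "spatial_relation",   # e.g. spatial_relation | affordance | occlusion
}
```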
The design of this benchmark prioritizes spatial reasoning and multimodal alignment rather than surface-level perception. It focuses on object relationships, relative positions, affordances, occlusion, and coordinated interactions between agents and objects in shared workspaces. These characteristics make it suitable for training and evaluating models such as LLaVA-7B or Qwen2-VL-7B on tasks that require consistent spatial understanding in human–robot collaboration.
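A minimal zero-shot evaluation sketch on one benchmark image is shown below, assuming the Hugging Face Transformers interface for Qwen2-VL-7B-Instruct; the exact preprocessing calls may differ slightly between library versions, and the image path and question are hypothetical.

```python
# Sketch of zero-shot VQA evaluation on a single benchmark image
# (assumed Qwen2-VL-7B-Instruct checkpoint and Transformers interface).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

image = Image.open("kitchen_handover_017_view04.png")  # hypothetical benchmark image
question = "Which object lies between the human's hand and the robot gripper?"

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": question},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # compared against the reference answer for scoring
```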
For RLFT, the motivation extends beyond the limitations of supervised fine-tuning (SFT) and LoRA-based adaptation. While SFT and LoRA provide efficient means of aligning visual and textual modalities, they rely heavily on large, high-quality datasets, which are often unavailable for specialized domains such as spatial reasoning or human–robot collaboration. RLFT offers a complementary approach by improving reasoning behavior through feedback rather than direct supervision.
In this framework, reward functions are designed to encourage internal consistency between visual context and predicted spatial actions, reinforce multi-step reasoning chains, and penalize ambiguous or physically implausible predictions. This enables models to self-correct and generalize beyond the examples seen during fine-tuning. In this work, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) are used as practical and stable formulations for scaling RLFT to multimodal settings. Incorporating these methods allows a systematic evaluation of how models refine their reasoning policies and spatial inference capabilities without requiring extensive additional supervision, while keeping training computationally feasible.
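For reference, the sketch below writes out the standard DPO objective and a GRPO-style group-relative advantage in plain PyTorch; variable names are illustrative, and the snippet is framework-agnostic rather than the exact training code used here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: increase the policy's preference margin for the
    chosen answer over the rejected one, relative to a frozen reference model."""
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantage: rewards for a group of responses sampled from the
    same prompt, normalized by the group mean and std (no learned critic)."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# Example: spatial-consistency rewards for four sampled answers to one prompt.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(grpo_advantages(rewards))
```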
Results are in progress — stay tuned.