VLM4VLA
Revisiting Vision-Language Models in Vision-Language-Action Models
Anonymous Submission
By building on Vision-Language Models (VLMs) and their powerful multi-modal understanding capabilities, Vision-Language-Action (VLA) policies have demonstrated promising performance on a variety of tasks. Recent work has largely focused on designing more sophisticated action networks and introducing diverse auxiliary training tasks, showcasing the effectiveness and generalization potential of VLA methods. However, the performance of VLAs is likely constrained by the underlying VLM backbone, which typically constitutes the largest part of the model. Furthermore, a lack of understanding of the foundational capabilities of these VLMs within the VLA context makes it difficult to discern whether performance gains are attributable to complex architectural designs or to the backbone itself. To bridge this gap, we introduce VLM4VLA, a unified training and evaluation framework designed for the systematic study of the VLM's impact on VLA performance. We design a minimalist yet effective network architecture to adapt VLMs into VLAs, then conduct over 100 experiments under strictly controlled, identical settings to evaluate VLAs built upon seven advanced VLMs and seven different auxiliary multimodal tasks. Contrary to common expectations, we find that models excelling on general-purpose VQA tasks do not necessarily yield better VLA performance. For instance, Kosmos (1.7B) outperforms Qwen-2.5VL (3.8B) and Paligemma (2.9B) in multiple environments. Moreover, beyond the ability to solve embodied tasks, VLAs also place high demands on the general skills of the VLM. Our findings reveal a significant gap between the current state of VLMs and their application in VLAs, a gap that will require collaborative efforts from both research communities to close.
Unified Evaluation Framework
Build VLM4VLA, a scalable and fair evaluation framework that integrates different VLMs into VLAs in a unified and lightweight manner.
Comprehensive Experimental Study
Conduct comprehensive experiments to study the influence of VLM backbone on embodied manipulation tasks, covering VLM architecture, post-training fine-tuning data, and vision modules.
Practical Insights
Analyze experimental results to provide practical insights, offering a reference for backbone selection and performance baselines for the VLA community.
We find that the performance requirements for VLMs in embodied manipulation tasks do not fully align with their VQA capabilities. Specifically, and contrary to common expectations, VLMs that perform well on general VQA benchmarks are not necessarily better when used in VLAs. For instance, we find that Kosmos (1.7B) outperforms Qwen-2.5VL (3.8B) and Paligemma (2.9B) in multiple environments. Furthermore, in our experiments where we post-train Qwen-2.5VL on various auxiliary Embodied-QA tasks, we discover that fine-tuning on most of these tasks leads to a performance degradation in the resulting VLA.
Our research goal is to conduct a comprehensive and fair comparison of the performance of various VLMs on end-to-end manipulation tasks. To this end, our research design adheres to the following principles:
Fairness and Reproducibility: We employ a consistent model architecture and training/testing settings across multiple simulation environments to ensure fair and reproducible comparisons.
Minimalist Design: We encapsulate VLMs within a simple yet effective VLA framework, thereby minimizing the influence of complex, extraneous policy designs on the comparison.
Leveraging Inherent Knowledge: The VLA design fully leverages the inherent knowledge of the VLM. Crucially, we ensure that the input sequence format is consistent with what each VLM was exposed to during its instruction-SFT phase. We exclude any robotic priors beyond vision and language, such as proprioceptive state, tactile feedback, or environmental rewards.
We detail the method for constructing a consistent VLA from various VLMs within the VLM4VLA framework. Our objective is to build a VLA architecture that is generic across different VLMs, lightweight, and capable of fully leveraging the VLM's intrinsic knowledge.
Learnable Action Query Token: We introduce a learnable action query token to extract embodiment-related knowledge from the VLM. The representation of this token is then decoded into an action chunk.
Input Format Adaptation: To align with the pre-training input format of each model, we adopt a model-specific token concatenation scheme for each VLM4VLA instance.
Minimalist Policy Head: We take the last_hidden_state of the <ActionQuery> token, as encoded by the VLM, and decode it into an action chunk using a small MLP-based policy head.
action = MLP(VLM([<img> ... <img> <text> ... <text> <ActionQuery>]))
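As a concrete illustration, below is a minimal PyTorch-style sketch of this design, assuming a backbone that accepts precomputed input embeddings and exposes last_hidden_state; names such as ActionQueryPolicy, hidden_dim, action_dim, and chunk_size are illustrative, not taken from our implementation.

import torch
import torch.nn as nn

class ActionQueryPolicy(nn.Module):
    """Sketch: a learnable <ActionQuery> token plus a small MLP head on top of a VLM."""
    def __init__(self, vlm, hidden_dim, action_dim=7, chunk_size=8):
        super().__init__()
        self.vlm = vlm  # any backbone exposing last_hidden_state
        # learnable <ActionQuery> embedding appended after the image/text tokens
        self.action_query = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        # minimalist MLP policy head decoding the query state into an action chunk
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim * chunk_size),
        )
        self.action_dim, self.chunk_size = action_dim, chunk_size

    def forward(self, input_embeds, attention_mask):
        b = input_embeds.size(0)
        # sequence layout: [<img> ... <img> <text> ... <text> <ActionQuery>]
        tokens = torch.cat([input_embeds, self.action_query.expand(b, -1, -1)], dim=1)
        mask = torch.cat([attention_mask, attention_mask.new_ones(b, 1)], dim=1)
        hidden = self.vlm(inputs_embeds=tokens, attention_mask=mask).last_hidden_state
        query_state = hidden[:, -1]  # hidden state of the <ActionQuery> token
        return self.head(query_state).view(b, self.chunk_size, self.action_dim)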
Calvin ABC-D
We evaluate on the Calvin ABC-D benchmark. We train the model for 30k steps on the A, B, and C splits and evaluate it on 1000 task sequences, each chaining 5 consecutive sub-tasks; during testing, the policy is required to complete as many of the 5 chained tasks as possible, and success is reported at chain lengths 1-5. This setup challenges the VLM's ability to generalize to novel visual scenes.
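For reference, here is a small sketch of how this chained-task metric can be computed, assuming each rollout is recorded as a list of per-subtask success flags; the function name and data layout are assumptions for illustration.

from typing import Dict, List, Tuple

def calvin_chain_metrics(chain_results: List[List[bool]]) -> Tuple[Dict[int, float], float]:
    # chain_results[i][k] is True iff subtask k of sequence i succeeded
    completed = []
    for chain in chain_results:
        n = 0
        for ok in chain:
            if not ok:
                break  # a sequence ends at the first failed subtask
            n += 1
        completed.append(n)
    num_chains = len(completed)
    # success rate at each chain length 1..5, plus the average completed length
    success_at = {k: sum(c >= k for c in completed) / num_chains for k in range(1, 6)}
    avg_len = sum(completed) / num_chains
    return success_at, avg_len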
SimplerEnv Bridge
To better differentiate the performance of various VLM-based policies, we choose the WidowX (Bridge V2) task suite, which is more challenging than the Fractal suite. We train for 50k steps on Bridge-V2 and run 24 trials with random initializations for each of the four scenes.
Libero-Long
Among the five task suites in Libero, we evaluate different models on the most challenging one, Libero-Long, which consists of 10 tasks involving a variety of objects and skills. All models are trained for 50k steps on the training split and evaluated with 50 randomly initialized trials per task.
We evaluate the performance of different VLMs across three simulation environments. The results show varying degrees of linear correlation between VLA performance in the different evaluation environments and general VLM capabilities, indicating that the performance requirements for VLMs in embodied manipulation tasks do not fully align with their VQA capabilities.
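As an illustration of this analysis, the sketch below computes a Pearson correlation between per-model VQA benchmark scores and VLA success rates in one environment; the function name and dictionary layout are assumptions, and no actual scores are included.

from typing import Dict, Tuple
import numpy as np
from scipy.stats import pearsonr

def vqa_vla_correlation(vqa_scores: Dict[str, float], vla_success: Dict[str, float]) -> Tuple[float, float]:
    # match models that appear in both tables, then correlate the two score lists
    models = sorted(set(vqa_scores) & set(vla_success))
    x = np.array([vqa_scores[m] for m in models])
    y = np.array([vla_success[m] for m in models])
    r, p_value = pearsonr(x, y)
    return r, p_value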
We study the impact of different VLM auxiliary tasks on VLA performance. Recent work has proposed using robotic data to construct VQA datasets for improving VLM backbones, but few studies have investigated whether this additional continual finetuning actually benefits VLAs on downstream tasks. We construct or collect several SFT tasks for the VLM, including VQA datasets and generation tasks.
RoboPoint
A pointing-task dataset collected in simulation. Given an image and a target location, the model is required to output the 2D coordinates that satisfy the target requirement. It contains 1.432M samples.
Vica-332k
A spatial understanding dataset constructed from RGB-D datasets. It covers a wide range of capabilities, including size estimation, position understanding, and distance estimation.
BridgeVQA
A spatial understanding question-answering dataset annotated from Bridge-v2, Fractal, and Calvin ABC data using VQASynth.
Robo2VLM
An action-oriented question-answering dataset built from 176k real robot trajectories, containing 667k VQA pairs.
RoboBrain2
A large-scale embodied VQA dataset together with a VLM finetuned from Qwen2.5VL-7B. The tasks include pointing, planning, and trajectory marking.
Omni-Generation
Integrates a diffusion model into QwenVL-7B and trains it jointly on image generation, depth map generation, and semantic segmentation map generation.
Overall, all models underperform the original baseline, with most exhibiting a slight degradation in performance. For Qwen2.5VL-3B, the model finetuned on Vica332k performs better than those finetuned on other datasets. This could be attributed to the dataset's broad data coverage and diverse task types, which may prevent the model from overfitting to a narrow set of capabilities and consequently degrading others.
Important Insight: Existing embodied VQA-style tasks do not offer a clear benefit for training end-to-end VLAs to execute downstream manipulation tasks. This suggests that VLAs may require broad, general capabilities, beyond just embodied skills, to perform well on downstream tasks.
We find that freezing the vision encoder during VLM4VLA training leads to significant performance degradation for all models on both the Calvin and Simpler benchmarks. This strongly suggests that finetuning the vision encoder is crucial when adapting a VLM into a VLA.
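A minimal sketch of this ablation, assuming a PyTorch VLM whose image backbone is reachable through a vision_encoder attribute (the attribute name differs between VLM implementations):

def set_vision_encoder_trainable(vlm, trainable: bool) -> None:
    # toggle gradient flow through the vision encoder before building the optimizer
    for param in vlm.vision_encoder.parameters():
        param.requires_grad = trainable

# frozen-vision-encoder ablation (degrades performance in our comparison):
#   set_vision_encoder_trainable(model.vlm, trainable=False)
# default VLM4VLA setting, with the vision encoder finetuned jointly:
#   set_vision_encoder_trainable(model.vlm, trainable=True)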
Our research reveals a significant gap between the capabilities of current VLMs and the demands of VLA embodied tasks. Specifically, we observe a notable discrepancy between a VLM's performance on standard VQA benchmarks and its actual effectiveness when deployed in a VLA.
Core Insight: To further advance the capabilities of VLA policies, it is not enough to simply design complex action networks; we must also understand how the critical VLM backbone component influences overall performance and strategically enhance it for embodied tasks.
We hope that this work will inspire future research in both the VLM and VLA domains, fostering collaborative development between the two research communities.