The failure modes of X-VLA observed during real-world deployment can be summarized as follows:
The current X-VLA-0.9B model is constrained by its relatively modest parameter count and learning capacity. Consequently, it may exhibit limited reasoning ability when handling long-horizon tasks or highly complex language instructions.
Similar to many VLA-based policies, X-VLA may struggle with precise positioning and may occasionally fail to execute accurate grasps. Nevertheless, we observe a consistent recovery behavior: the policy often retries its motions and can eventually “escape” failure states. This behavior is particularly evident in our cloth-folding video, where the model gradually corrects its actions over repeated attempts.
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning and introduce a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which collectively empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulation environments as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves state-of-the-art performance over a sweep of benchmark suites, demonstrating superior results across a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks.
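At its core, the Soft Prompt mechanism can be illustrated with a minimal sketch (PyTorch is assumed; names such as `SoftPromptBank` and `num_prompt_tokens` are illustrative, not the released implementation): each data source or embodiment owns a small set of learnable prompt tokens, which are looked up by an embodiment ID and prepended to the Transformer's input sequence.

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """One set of learnable prompt tokens per data source / embodiment (illustrative)."""
    def __init__(self, num_embodiments: int, num_prompt_tokens: int, hidden_dim: int):
        super().__init__()
        # A separate set of learnable embeddings for each distinct data source.
        self.prompts = nn.Parameter(
            torch.randn(num_embodiments, num_prompt_tokens, hidden_dim) * 0.02
        )

    def forward(self, embodiment_id: torch.Tensor) -> torch.Tensor:
        # embodiment_id: (batch,) integer IDs identifying each sample's data source.
        return self.prompts[embodiment_id]  # (batch, num_prompt_tokens, hidden_dim)

# Usage: prepend embodiment-specific prompts before a standard Transformer encoder.
bank = SoftPromptBank(num_embodiments=8, num_prompt_tokens=16, hidden_dim=768)
tokens = torch.randn(4, 256, 768)              # vision / language / state tokens
embodiment_id = torch.tensor([0, 2, 2, 5])     # dataset of origin for each sample
x = torch.cat([bank(embodiment_id), tokens], dim=1)
```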
Figure 1: Outline of the X-VLA model.
Our design introduces a streamlined encoding pipeline that integrates soft prompts and explicitly disentangles high- and low-dimensional input streams. This architecture yields improved training stability and consistently stronger validation performance.
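One way to picture this disentanglement (a sketch under our own assumptions; module names such as `proprio_mlp` are hypothetical) is to route high-dimensional observations such as images through a vision backbone and low-dimensional signals such as proprioception through a small MLP, merging the resulting tokens with the soft prompts only at the Transformer encoder.

```python
import torch
import torch.nn as nn

hidden_dim = 768

# High-dimensional stream: pre-patchified images -> vision tokens (stand-in for any ViT-style backbone).
vision_backbone = nn.Sequential(nn.Flatten(2), nn.Linear(16 * 16 * 3, hidden_dim))

# Low-dimensional stream: proprioceptive state -> a single token via a small MLP.
proprio_mlp = nn.Sequential(nn.Linear(14, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True),
    num_layers=2,
)

images = torch.randn(4, 196, 3, 16, 16)        # (batch, patches, C, H, W)
proprio = torch.randn(4, 14)                   # joint positions / gripper state
soft_prompt = torch.randn(4, 16, hidden_dim)   # from the embodiment-specific prompt bank

vision_tokens = vision_backbone(images)            # (4, 196, hidden_dim)
proprio_token = proprio_mlp(proprio).unsqueeze(1)  # (4, 1, hidden_dim)
fused = encoder(torch.cat([soft_prompt, vision_tokens, proprio_token], dim=1))
```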
Figure 2: Detailed architecture of our X-VLA model.
Figure 3: Details of our evaluation benchmarks.
Table 1: Comparison of specialist and generalist models on simulation benchmarks.
We also evaluate X-VLA-0.9B on a physical robotic platform following the BridgeData-v2 benchmark. Our X-VLA surpasses the other baselines across all five tasks, each testing a distinct axis of capability, demonstrating the superior adaptability of our X-VLA.
Figure 4: We evaluate our X-VLA model on three distinct real-world embodiments, each under specific task setups, including simple manipulation, dexterous manipulation, and fast adaptation experiments using parameter-efficient fine-tuning (PEFT) techniques.
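The exact PEFT recipe is not spelled out in this overview; one natural instantiation in the soft-prompt setting (sketched below under our own assumptions, not the released adaptation code) is to freeze the pretrained Transformer and train only a freshly initialized prompt for the unseen embodiment.

```python
import torch
import torch.nn as nn

# Hedged sketch: prompt-only adaptation to a new embodiment.
# `backbone` stands in for the pretrained X-VLA Transformer encoder.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2
)
for p in backbone.parameters():
    p.requires_grad_(False)                      # keep pretrained weights frozen

new_prompt = nn.Parameter(torch.randn(1, 16, 768) * 0.02)   # only these parameters train
optimizer = torch.optim.AdamW([new_prompt], lr=1e-4)

tokens = torch.randn(4, 256, 768)                # encoded observations from the new robot
out = backbone(torch.cat([new_prompt.expand(4, -1, -1), tokens], dim=1))
```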
Figure 5: We provide qualitative results for our dexterous manipulation model fine-tuned from the pretrained X-VLA-0.9B and introduce a high-quality cloth-folding dataset, Soft-FOLD.
Building on Soft Prompts, we introduce X-VLA, a neat VLA architecture designed for stable pretraining on heterogeneous datasets and efficient adaptation to new domains. In this section, we first present the overall architectural design, followed by several key techniques for large-scale pretraining. See our paper for more details.
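Since the action expert is trained with flow matching, the pretraining objective can be sketched as a generic conditional flow-matching loss (written in PyTorch under our own assumptions; the MLP head and names such as `FlowMatchingActionHead` are illustrative, not the released training code): interpolate noise with the ground-truth action chunk at a random time t and regress the predicted velocity toward the straight-line target.

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Generic conditional flow-matching head over action chunks (illustrative)."""
    def __init__(self, action_dim: int, chunk_len: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        in_dim = action_dim * chunk_len + cond_dim + 1   # noisy actions + context + time
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, action_dim * chunk_len),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def velocity(self, noisy_actions, t, cond):
        x = torch.cat([noisy_actions.flatten(1), cond, t[:, None]], dim=-1)
        return self.net(x).view(-1, self.chunk_len, self.action_dim)

def flow_matching_loss(head, actions, cond):
    # actions: (batch, chunk_len, action_dim); cond: (batch, cond_dim) fused context.
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], device=actions.device)
    # Linear interpolation between noise (t=0) and data (t=1).
    x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
    target_velocity = actions - noise            # straight-line velocity field
    pred = head.velocity(x_t, t, cond)
    return torch.mean((pred - target_velocity) ** 2)

# Usage sketch
head = FlowMatchingActionHead(action_dim=7, chunk_len=16, cond_dim=768)
loss = flow_matching_loss(head, torch.randn(4, 16, 7), torch.randn(4, 768))
loss.backward()
```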
Table 2: We evaluate the pretraining (PT) validation error and adaptation (AD) success rates on the Simpler-WidowX benchmark. Green, Red, and Gray denote positive, negative, and moderate effects, respectively. Bold scores are SOTA results. We can see that naively training on heterogeneous data leads to degradation. Also, as validation error decreases during pretraining, the adaptation success rate increases progressively, demonstrating a strong correlation between the two. Therefore, we use the validation error as a proxy for pretraining performance throughout this paper. It is evident that each component in Section 4 contributes positively to pretraining.