QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
______________________________________________________________________________________________
Pengxiang Ding, Han Zhao, Wenxuan Song, Wenjie Zhang, Siteng Huang, Ningxi Yang, Donglin Wang
Westlake University, Zhejiang University
ECCV 2024
Here we offer demos of both simulated and real-world scenarios to
show the effectiveness and generalization of our work, including:
1. Effectiveness in seen scenes
2. Sim2Real transfer capabilities
3. Robustness to different initial localizations
4. Robustness to different workspace sizes
5. Robustness in unseen scenes
1. Effectiveness in seen scenes
We show both simulated and real-world demos of all six tasks:
Go through, Crawl, Distinguish, Go avoid, Unload and Go to
Go through
Correct: Go through the square tunnel
Crawl the bar
Distinguish
Go avoid
Unload
Go to
2. Sim2Real transfer capabilities
(Failure Case Analysis)
Here we show results under different sim2real training paradigms, together with a failure case analysis.
Comparing Model 3 with Models 1 & 2: After fine-tuning, the average motion length is significantly reduced, closely matching the performance of models trained exclusively on real data. This indicates that incorporating real data helps the model recognize objects more precisely and quickly, mitigating some of the discrepancies between real and simulated data.
Comparing Model 2 with Model 1: With a reduced amount of simulation data, the model's yaw control degrades, leading to operational failures. This underscores the importance of a sufficiently large dataset for effectively acquiring VLA skills.
Comparing Model 3 with Models 1 & 4: With a co-training approach, the model learns a gait pattern in the simulated environment that closely resembles the real-world gait. This demonstrates the effectiveness of simulation-based learning: the acquired skills transfer effectively to real-world scenarios.
1. Simulation + Real Data
2. 10% Simulation + Real Data
3. Simulation Data
4. Real Data
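The four paradigms above differ only in how much simulation data is mixed with the real-world data before training. As a minimal illustrative sketch (not the authors' code; the function name and parameters are hypothetical), the mixing step can be expressed as subsampling the simulation set to a given fraction and shuffling it together with the real samples:

```python
import random

def build_cotrain_dataset(sim_data, real_data, sim_fraction=1.0, seed=0):
    """Illustrative co-training mixer (hypothetical, not the authors' code).

    Subsamples the simulation set to `sim_fraction` of its size
    (e.g. 0.1 for the "10% Simulation + Real Data" paradigm) and
    shuffles it together with all real-world samples.
    """
    rng = random.Random(seed)
    n_sim = int(len(sim_data) * sim_fraction)
    mixed = rng.sample(sim_data, n_sim) + list(real_data)
    rng.shuffle(mixed)
    return mixed

# Example: paradigm 2 ("10% Simulation + Real Data") with toy samples.
sim = [("sim", i) for i in range(100)]
real = [("real", i) for i in range(20)]
dataset = build_cotrain_dataset(sim, real, sim_fraction=0.1)
# dataset now holds 10 simulation samples and all 20 real samples.
```

Paradigms 3 and 4 correspond to `sim_fraction=1.0` with an empty `real_data` list, and `sim_fraction=0.0` with only real data, respectively.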
3. Robustness to different initial localizations
We compare results across different initial localizations.
The results show that our model is robust to the initial localization.
4. Robustness to different workspace sizes
We compare results in workspaces of different sizes.
The results show that our model is robust to the workspace size.
Large workspace
Small workspace
5. Robustness in unseen scenes
We compare results with unseen objects and unseen verbal instructions.