QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

______________________________________________________________________________________________

Pengxiang Ding, Han Zhao, Wenxuan Song, Wenjie Zhang, Siteng Huang, Ningxi Yang, Donglin Wang

Westlake University, Zhejiang University

ECCV 2024

Here, we provide demos in both simulated and real-world scenarios to show the effectiveness and generalization of our work, including:

1. Effectiveness in seen scenes

2. Sim2Real transfer capabilities

3. Robustness to different initial localization

4. Robustness to different workspaces

5. Robustness in unseen scenes

1. Effectiveness in seen scenes

We show both simulated and real-world scenarios for all six tasks: Go through, Crawl, Distinguish, Go avoid, Unload, and Go to.

Go through

go_through_square.mp4

61_1715934244.mp4

Correct: Go through the square tunnel

go_through_tri.mp4

Crawl the bar

crawl_2_1_无音轨版.mp4

crawl_2_1_仿真.mp4

Distinguish

distinguish_letter_A.mp4

63_1715934558.mp4

Go avoid

avoid_1.mp4

62_1715934457.mp4

Unload

unload_5_5.mp4

69_1715934784.mp4

Go to

goto - 使用 Clipchamp_1715952896350 制作.mp4

54_1715934121.mp4

2. Sim2Real transfer capabilities

(Failure Case Analysis)

Here we show results under different sim2real training paradigms, along with a failure case analysis.

Comparing Model 3 with Models 1 & 2: With fine-tuning, the average motion length is significantly reduced and closely matches that of models trained exclusively on real data. This indicates that incorporating real data helps the model recognize objects more precisely and more quickly, mitigating some of the discrepancies between real and simulated data.
Comparing Model 2 with Model 1: With a reduced amount of simulation data, the model's yaw control degrades, leading to operational failures. This underscores the importance of an adequate dataset size for acquiring VLA skills effectively.
Comparing Model 3 with Models 1 & 4: With a co-training approach, the model learns a gait pattern in simulation that closely resembles the one seen in real-world data. This demonstrates the effectiveness of simulation-based learning, since the acquired skills transfer to real-world scenarios.

go_1_5.mp4

1. Simulation + Real Data

go_1_6.mp4

2. 10% Simulation + Real Data

go_1_7.mp4

3. Simulation Data

go_1_8.mp4

4. Real Data

3. Robustness to different initial localization

We compare results from different initial locations.

The results show that our model is robust to changes in initial localization.

实验大腿们群！ 2024-05-05 00.08.45.mp4

实验大腿们群！ 2024-05-05 00.08.48.mp4

实验大腿们群！ 2024-05-05 00.08.41.mp4

4. Robustness to different workspaces

We compare results across workspaces of different sizes.

The results show that our model is robust to changes in workspace size.

67_1715934747.mp4

Large workspace

goto - 使用 Clipchamp_1715952896350 制作.mp4

Small workspace

5. Robustness in unseen scenes

We compare results on unseen objects and unseen verbal information.

Unseen Verbal Information

77_1715951094.mp4

74_1715939378.mp4

Unseen Object

76_1715940284.mp4

73_1715938060.mp4