We use our LAM to extract latent actions from the reference video (left), then apply them in the SIMPLER environment by generating actions with our proprio-state FDM (right), demonstrating strong movement transfer; a code sketch of this pipeline follows the captions below.
Bridge dataset action transfer to WidowX SIMPLER robot
Human action transfer to RT-1 SIMPLER robot
Human action transfer to RT-1 SIMPLER robot
RT-1 dataset action transfer to RT-1 SIMPLER robot
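The sketch below illustrates the transfer loop described above: encode each consecutive pair of reference frames into a latent action, then roll the proprio-state FDM forward so the target robot reproduces the movement. All class and method names (`LatentActionModel`, `ProprioStateFDM`, `encode`, `step`) are illustrative placeholders under our assumptions, not the released API.

```python
import numpy as np

class LatentActionModel:
    """Placeholder LAM: encodes a frame pair into a latent action."""
    def encode(self, frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
        # A real LAM would run a learned inverse-dynamics-style encoder here.
        return np.zeros(8, dtype=np.float32)

class ProprioStateFDM:
    """Placeholder forward dynamics model over proprioceptive state."""
    def step(self, state: np.ndarray, latent: np.ndarray) -> np.ndarray:
        # A real FDM would predict the next robot state from (state, latent).
        return state

def transfer(video: list[np.ndarray], init_state: np.ndarray,
             lam: LatentActionModel, fdm: ProprioStateFDM) -> list[np.ndarray]:
    """Extract one latent action per reference-frame pair, then roll the
    proprio-state FDM forward to replay the movement on the target robot."""
    states = [init_state]
    for f_t, f_t1 in zip(video, video[1:]):
        z = lam.encode(f_t, f_t1)          # latent action from reference video
        states.append(fdm.step(states[-1], z))
    return states
```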
Latent actions are predicted by the latent action expert, conditioned on the first frame and the language instruction. A small image-reconstruction FDM is then used to visualize the planned movement; see the sketch after the instruction list below.
move the cone from the middle of the table to the left side of the table
pick up the pot
pick brown chip bag from top drawer and place on counter
pick pepsi can from middle drawer
open bottom drawer
move the blue spoon into the bowl
move the cone from the middle of the table to the upper side of the table
pick up the red object
close top drawer
close middle drawer
open middle drawer
pick up the spoon
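As a companion to the description above, here is a minimal plan-then-visualize sketch. The names (`LatentActionExpert`, `ImageReconstructionFDM`, `plan`, `step`) are again hypothetical placeholders: the expert predicts a latent action sequence from the first frame and the instruction, and the small image-reconstruction FDM decodes each latent into the next frame.

```python
import numpy as np

class LatentActionExpert:
    """Placeholder expert: predicts latent actions from the first frame
    and the language instruction."""
    def plan(self, first_frame: np.ndarray, instruction: str,
             horizon: int) -> list[np.ndarray]:
        # A real expert would condition a learned policy on (frame, text).
        return [np.zeros(8, dtype=np.float32) for _ in range(horizon)]

class ImageReconstructionFDM:
    """Placeholder image FDM: decodes (frame, latent) into the next frame."""
    def step(self, frame: np.ndarray, latent: np.ndarray) -> np.ndarray:
        # A real FDM would render the predicted next observation.
        return frame

def visualize_plan(first_frame: np.ndarray, instruction: str,
                   expert: LatentActionExpert, fdm: ImageReconstructionFDM,
                   horizon: int = 16) -> list[np.ndarray]:
    """Roll the predicted latent actions through the image FDM to render
    the planned movement as a frame sequence."""
    frames = [first_frame]
    for z in expert.plan(first_frame, instruction, horizon):
        frames.append(fdm.step(frames[-1], z))
    return frames
```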
Generalization to a different block color
"Put the blue block from the table into the blue bowl"
Generalization to a different background color
"Put the green block in the blue bowl onto the table"
"Put the green block from the table into the blue bowl"
"Put the green block in the blue bowl onto the table"
"Push the green block to Position 4"
"Push the green block to Position 1"
"Stack the wooden block onto the green block"
"Unstack the wooden block fromthe green block"
"Pouring orange juice into the cup"
"Pick the onion into the basket"
"Straighten the cup"
"Pick the apple into the blue bowl"
"Pick the yellow toy into the basket"
"Pouring orange juice into the cup"
"Stack the blue cube on the red cube"
"Pick the mango into the green plate"
"Stack the blue cube on the red cube"
"Flick the ball"
"Flick the ball"