We use our LAM to extract latent actions from the reference video (left), then apply them in the SIMPLER environment by generating actions with our proprio-state FDM (right), demonstrating strong movement transfer; a code sketch of this pipeline follows the captions below.
Bridge dataset action transfer to WidowX SIMPLER robot
Human action transfer to RT-1 SIMPLER robot
Human action transfer to RT-1 SIMPLER robot
RT-1 dataset action transfer to RT-1 SIMPLER robot
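The sketch below illustrates the transfer loop described above: encode each consecutive pair of reference frames into a latent action, then roll the proprio-state FDM forward so the target robot reproduces the movement. All class and method names (`LatentActionModel`, `ProprioStateFDM`, `encode`, `step`) are illustrative placeholders under our assumptions, not the released API.

```python
import numpy as np

class LatentActionModel:
    """Placeholder LAM: encodes a frame pair into a latent action."""
    def encode(self, frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
        # A real LAM would run a learned inverse-dynamics-style encoder here.
        return np.zeros(8, dtype=np.float32)

class ProprioStateFDM:
    """Placeholder forward dynamics model over proprioceptive state."""
    def step(self, state: np.ndarray, latent: np.ndarray) -> np.ndarray:
        # A real FDM would predict the next robot state from (state, latent).
        return state

def transfer(video: list[np.ndarray], init_state: np.ndarray,
             lam: LatentActionModel, fdm: ProprioStateFDM) -> list[np.ndarray]:
    """Extract one latent action per reference-frame pair, then roll the
    proprio-state FDM forward to replay the movement on the target robot."""
    states = [init_state]
    for f_t, f_t1 in zip(video, video[1:]):
        z = lam.encode(f_t, f_t1)          # latent action from reference video
        states.append(fdm.step(states[-1], z))
    return states
```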
Latent actions are predicted by the latent action expert, conditioned on the first frame and the language instruction. A small image-reconstruction FDM is then used to visualize the planned movement; see the sketch after the instruction list below.
move the cone from the middle of the table to the left side of the table
pick up the pot
pick brown chip bag from top drawer and place on counter
pick pepsi can from middle drawer
open bottom drawer
move the blue spoon into the bowl
move the cone from the middle of the table to the upper side of the table
pick up the red object
close top drawer
close middle drawer
open middle drawer
pick up the spoon
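As a companion to the description above, here is a minimal plan-then-visualize sketch. The names (`LatentActionExpert`, `ImageReconstructionFDM`, `plan`, `step`) are again hypothetical placeholders: the expert predicts a latent action sequence from the first frame and the instruction, and the small image-reconstruction FDM decodes each latent into the next frame.

```python
import numpy as np

class LatentActionExpert:
    """Placeholder expert: predicts latent actions from the first frame
    and the language instruction."""
    def plan(self, first_frame: np.ndarray, instruction: str,
             horizon: int) -> list[np.ndarray]:
        # A real expert would condition a learned policy on (frame, text).
        return [np.zeros(8, dtype=np.float32) for _ in range(horizon)]

class ImageReconstructionFDM:
    """Placeholder image FDM: decodes (frame, latent) into the next frame."""
    def step(self, frame: np.ndarray, latent: np.ndarray) -> np.ndarray:
        # A real FDM would render the predicted next observation.
        return frame

def visualize_plan(first_frame: np.ndarray, instruction: str,
                   expert: LatentActionExpert, fdm: ImageReconstructionFDM,
                   horizon: int = 16) -> list[np.ndarray]:
    """Roll the predicted latent actions through the image FDM to render
    the planned movement as a frame sequence."""
    frames = [first_frame]
    for z in expert.plan(first_frame, instruction, horizon):
        frames.append(fdm.step(frames[-1], z))
    return frames
```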
Generalization to a different block color
"Put the blue block from the table into the blue bowl"
Generalization to a different background color
"Put the green block in the blue bowl onto the table"
"Put the green block from the table into the blue bowl"
"Put the green block in the blue bowl onto the table"
"Push the green block to Position 4"
"Push the green block to Position 1"
"Stack the wooden block onto the green block"
"Unstack the wooden block fromthe green block"
"Pouring orange juice into the cup"
"Pick the onion into the basket"
"Straighten the cup"
"Pick the apple into the blue bowl"
"Pick the yellow toy into the basket"
"Pouring orange juice into the cup"
"Stack the blue cube on the red cube"
"Pick the mango into the green plate"
"Stack the blue cube on the red cube"
"Flick the ball"
"Flick the ball"