[Figure: multi-round long-video generation results on robot manipulation tasks (2-3 rounds, 4-6s per task): move_coke_can_near_water_bottle, move_blue_plastic_bottle_near_rxbar_blueberry, open_top_drawer, pick_green_can_from_bottom_drawer_and_place_on_counter, pick_coke_can, pick_brown_chip_bag, pick_pepsi_can_from_bottom_shelf_of_fridge, move_pepsi_can_near_paper_bowl.]
We do not include a dedicated model design for ego-motion. However, we include video data with keypoint annotations in our training set: we specifically extract the subset of Ego-Exo4D clips that carry both keypoint and language annotations (around 10k clips). We believe ego-motion modeling will be an insightful task for future research.
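This selection step amounts to intersecting two annotation indices by clip id. A minimal Python sketch is shown below; the file names and JSON fields are hypothetical placeholders, not the official Ego-Exo4D schema.

    import json

    def load_ids(path, key):
        # Return the set of clip ids listed in an annotation index file.
        with open(path) as f:
            return {record[key] for record in json.load(f)}

    # Hypothetical index files for the two annotation types.
    keypoint_ids = load_ids("annotations/hand_keypoints_index.json", "clip_uid")
    language_ids = load_ids("annotations/narrations_index.json", "clip_uid")

    # Keep only clips annotated with both modalities (~10k clips in our split).
    selected = sorted(keypoint_ids & language_ids)
    print(f"{len(selected)} clips carry both keypoint and language annotations")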
For discussion, several methods could help: adding keypoint tokens, adding a dedicated decoder, incorporating motion capture into the VLM, or combining it with a motion-guidance ControlNet for the VDM. However, limited by the data scale, we are still in the process of experimenting with these methods.
In this section, we apply a gesture-capture model to EVA's egocentric videos and show that EVA's generation results contain high-quality ego-gestures, which can be further exploited in future research.
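A minimal sketch of this gesture-capture step is given below, using MediaPipe Hands as a stand-in detector applied to each generated frame; it is not meant to imply which gesture-capture model is used in our experiments.

    import cv2
    import mediapipe as mp

    def extract_hand_keypoints(video_path):
        # Run a hand-landmark detector over every frame of a generated video.
        keypoints_per_frame = []
        cap = cv2.VideoCapture(video_path)
        with mp.solutions.hands.Hands(static_image_mode=False,
                                      max_num_hands=2,
                                      min_detection_confidence=0.5) as hands:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
                result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                keypoints_per_frame.append(result.multi_hand_landmarks)
        cap.release()
        return keypoints_per_frame

    landmarks = extract_hand_keypoints("eva_egocentric_generation.mp4")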
❌ Robot hands appear in human egocentric scenes
❌ Unknown camera motion
We fine-tuned EVA without Ensemble-LoRA on the EVA datasets. The failure cases show that the imbalanced data scale confuses the model, producing artifacts such as unknown camera motion in robot scenes or robot hands appearing in human egocentric videos. By separating task-specific LoRAs, our model avoids this issue.
In the ablation study of E-LoRA, we compare our model (EVA-Generator) with a variant trained on all data together without Ensemble-LoRA (DynamiCrafter-Tune) in Tab. 2. As shown in the table, EVA-Generator sacrifices DD (14.28 lower) but is much better on all other metrics, especially GCE (6.11 better) and FVD (58.24 better). This large performance gap stems from the imbalanced data distribution across embodied scenes, and Ensemble-LoRA efficiently resolves the issue by assigning different scenes to different LoRAs.
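The following PyTorch sketch illustrates the Ensemble-LoRA idea on a single linear layer: the shared backbone weight stays frozen, each embodied scene owns its own low-rank adapter, and the active adapter is selected per scene. The layer and scene names are illustrative; in EVA-Generator the adapters live inside the video diffusion backbone rather than in one standalone layer.

    import torch
    import torch.nn as nn

    class EnsembleLoRALinear(nn.Module):
        def __init__(self, in_dim, out_dim, scenes, rank=8, alpha=16):
            super().__init__()
            self.base = nn.Linear(in_dim, out_dim)
            for p in self.base.parameters():
                p.requires_grad_(False)          # shared backbone stays frozen
            self.scale = alpha / rank
            # One (A, B) pair per embodied scene, e.g. "robot_tabletop", "human_ego".
            self.lora_A = nn.ParameterDict(
                {s: nn.Parameter(torch.randn(rank, in_dim) * 0.01) for s in scenes})
            self.lora_B = nn.ParameterDict(
                {s: nn.Parameter(torch.zeros(out_dim, rank)) for s in scenes})
            self.active = scenes[0]

        def set_scene(self, scene):
            # Route all subsequent forward passes through this scene's adapter.
            self.active = scene

        def forward(self, x):
            delta = (x @ self.lora_A[self.active].T) @ self.lora_B[self.active].T
            return self.base(x) + self.scale * delta

    layer = EnsembleLoRALinear(256, 256, scenes=["robot_tabletop", "human_ego"])
    layer.set_scene("human_ego")
    out = layer(torch.randn(4, 256))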
Given four prompts in the following order (zero-shot):
1. Lift pink block
2. Put into drawer
3. Open the drawer
4. Lift the blue block
The pink block changes to blue at around 5 seconds, showing the weakness of short-term and long-term memory in the current world model.
In step 3, the drawer is already open, so the robot only touches the handle, which shows that the world model has some reasoning ability.
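The rollout behind these observations can be sketched as a simple loop in which each round is conditioned on the last frame of the previous round. The generate_clip call is a hypothetical wrapper around the EVA generator, not its actual interface.

    prompts = [
        "Lift pink block",
        "Put into drawer",
        "Open the drawer",
        "Lift the blue block",
    ]

    def rollout(generator, first_frame, prompts, frames_per_round=16):
        video = [first_frame]
        for prompt in prompts:
            # Condition the next round on the most recent generated frame.
            clip = generator.generate_clip(image=video[-1], text=prompt,
                                           num_frames=frames_per_round)
            video.extend(clip[1:])       # drop the duplicated conditioning frame
        return video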
The video generation results of EVA can be transferred to 7-dimensional robot motion. We trained a video-to-action model and applied the predicted plans in simulation environments. The results show that high-quality generations can be directly transferred to successful planning.
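A minimal sketch of such a video-to-action head is shown below: it regresses a 7-D action (translation, rotation, gripper) from a pair of consecutive generated frames. The encoder architecture and dimensions are assumptions for illustration, not the trained model from our experiments.

    import torch
    import torch.nn as nn

    class Video2Action(nn.Module):
        def __init__(self, feat_dim=512, action_dim=7):
            super().__init__()
            self.encoder = nn.Sequential(         # small frame-pair encoder
                nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim), nn.ReLU(),
            )
            self.head = nn.Linear(feat_dim, action_dim)

        def forward(self, frame_t, frame_t1):
            # Stack the current and next generated frame along the channel axis.
            x = torch.cat([frame_t, frame_t1], dim=1)   # (B, 6, H, W)
            return self.head(self.encoder(x))           # (B, 7) actions

    model = Video2Action()
    actions = model(torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128))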
We demonstrate how Finish-Thinking long-video generation can be applied to robot tasks. Within a limited number of frames, the robot can only open the drawer halfway. EVA extends the frames via self-ask and successfully increases the task completion rate.
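The self-ask extension can be sketched as the loop below, where a VLM judges task completion from the last generated frame and the generator continues from that frame until the task is finished. Both helpers (vlm_ask and generate_clip) are hypothetical stand-ins, not the actual EVA interfaces.

    def finish_thinking(generator, vlm_ask, first_frame, task, max_rounds=4):
        video = [first_frame]
        for _ in range(max_rounds):
            clip = generator.generate_clip(image=video[-1], text=task)
            video.extend(clip[1:])
            # Self-ask: let the VLM check the last frame for task completion.
            if vlm_ask(video[-1], f"Is the task '{task}' finished?") == "yes":
                break
        return video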