[Figure: multi-round long-video generation results on robot manipulation tasks (2-3 rounds, 4-6s per task): move_coke_can_near_water_bottle, move_blue_plastic_bottle_near_rxbar_blueberry, open_top_drawer, pick_green_can_from_bottom_drawer_and_place_on_counter, pick_coke_can, pick_brown_chip_bag, pick_pepsi_can_from_bottom_shelf_of_fridge, move_pepsi_can_near_paper_bowl.]
We do not include a dedicated model design for ego-motion. However, we include video data with keypoint annotations in our training set: we specifically extract the subset of Ego-Exo4D clips that carry both keypoint and language annotations (around 10k clips). We believe ego-motion modeling will be an insightful task for future research.
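This selection step amounts to intersecting two annotation indices by clip id. A minimal Python sketch is shown below; the file names and JSON fields are hypothetical placeholders, not the official Ego-Exo4D schema.

    import json

    def load_ids(path, key):
        # Return the set of clip ids listed in an annotation index file.
        with open(path) as f:
            return {record[key] for record in json.load(f)}

    # Hypothetical index files for the two annotation types.
    keypoint_ids = load_ids("annotations/hand_keypoints_index.json", "clip_uid")
    language_ids = load_ids("annotations/narrations_index.json", "clip_uid")

    # Keep only clips annotated with both modalities (~10k clips in our split).
    selected = sorted(keypoint_ids & language_ids)
    print(f"{len(selected)} clips carry both keypoint and language annotations")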
For discussion, several methods could help: adding keypoint tokens, adding a dedicated decoder, incorporating motion capture into the VLM, or combining it with a motion-guidance ControlNet for the VDM. However, limited by the data scale, we are still in the process of experimenting with these methods.
In this section, we apply a gesture-capture model to EVA's egocentric videos and show that EVA's generation results contain high-quality ego-gestures, which can be further exploited in future research.
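A minimal sketch of this gesture-capture step is given below, using MediaPipe Hands as a stand-in detector applied to each generated frame; it is not meant to imply which gesture-capture model is used in our experiments.

    import cv2
    import mediapipe as mp

    def extract_hand_keypoints(video_path):
        # Run a hand-landmark detector over every frame of a generated video.
        keypoints_per_frame = []
        cap = cv2.VideoCapture(video_path)
        with mp.solutions.hands.Hands(static_image_mode=False,
                                      max_num_hands=2,
                                      min_detection_confidence=0.5) as hands:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
                result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                keypoints_per_frame.append(result.multi_hand_landmarks)
        cap.release()
        return keypoints_per_frame

    landmarks = extract_hand_keypoints("eva_egocentric_generation.mp4")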
❌ Robot hands appear in human egocentric scenes
❌ Unknown camera motion
We fine-tuned EVA without Ensemble-LoRA on the EVA datasets. The failure cases show that the imbalanced data scale confuses the model, producing artifacts such as unknown camera motion in robot scenes or robot hands appearing in human egocentric videos. By separating task-specific LoRAs, our model avoids this issue.
In the ablation study of E-LoRA, we compare our model (EVA-Generator) with a variant trained on all data together without Ensemble-LoRA (DynamiCrafter-Tune) in Tab. 2. As shown in the table, EVA-Generator sacrifices DD (14.28 lower) but is much better on all other metrics, especially GCE (6.11 better) and FVD (58.24 better). This large performance gap stems from the imbalanced data distribution across embodied scenes, and Ensemble-LoRA efficiently resolves the issue by assigning different scenes to different LoRAs.
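The following PyTorch sketch illustrates the Ensemble-LoRA idea on a single linear layer: the shared backbone weight stays frozen, each embodied scene owns its own low-rank adapter, and the active adapter is selected per scene. The layer and scene names are illustrative; in EVA-Generator the adapters live inside the video diffusion backbone rather than in one standalone layer.

    import torch
    import torch.nn as nn

    class EnsembleLoRALinear(nn.Module):
        def __init__(self, in_dim, out_dim, scenes, rank=8, alpha=16):
            super().__init__()
            self.base = nn.Linear(in_dim, out_dim)
            for p in self.base.parameters():
                p.requires_grad_(False)          # shared backbone stays frozen
            self.scale = alpha / rank
            # One (A, B) pair per embodied scene, e.g. "robot_tabletop", "human_ego".
            self.lora_A = nn.ParameterDict(
                {s: nn.Parameter(torch.randn(rank, in_dim) * 0.01) for s in scenes})
            self.lora_B = nn.ParameterDict(
                {s: nn.Parameter(torch.zeros(out_dim, rank)) for s in scenes})
            self.active = scenes[0]

        def set_scene(self, scene):
            # Route all subsequent forward passes through this scene's adapter.
            self.active = scene

        def forward(self, x):
            delta = (x @ self.lora_A[self.active].T) @ self.lora_B[self.active].T
            return self.base(x) + self.scale * delta

    layer = EnsembleLoRALinear(256, 256, scenes=["robot_tabletop", "human_ego"])
    layer.set_scene("human_ego")
    out = layer(torch.randn(4, 256))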
Given four prompts in the following order (zero-shot):
1. Lift pink block
2. Put into drawer
3. Open the drawer
4. Lift the blue block
The pink block changes to blue at around 5 seconds, showing the weakness of short-term and long-term memory in the current world model.
In step 3, the drawer is already open, so the robot only touches the handle, which shows that the world model has some reasoning ability.
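The rollout behind these observations can be sketched as a simple loop in which each round is conditioned on the last frame of the previous round. The generate_clip call is a hypothetical wrapper around the EVA generator, not its actual interface.

    prompts = [
        "Lift pink block",
        "Put into drawer",
        "Open the drawer",
        "Lift the blue block",
    ]

    def rollout(generator, first_frame, prompts, frames_per_round=16):
        video = [first_frame]
        for prompt in prompts:
            # Condition the next round on the most recent generated frame.
            clip = generator.generate_clip(image=video[-1], text=prompt,
                                           num_frames=frames_per_round)
            video.extend(clip[1:])       # drop the duplicated conditioning frame
        return video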
The video generation results of EVA can be transferred to 7-dimensional robot motion. We trained a video-to-action model and applied the predicted plans in simulation environments. The results show that high-quality generations can be directly transferred to successful planning.
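A minimal sketch of such a video-to-action head is shown below: it regresses a 7-D action (translation, rotation, gripper) from a pair of consecutive generated frames. The encoder architecture and dimensions are assumptions for illustration, not the trained model from our experiments.

    import torch
    import torch.nn as nn

    class Video2Action(nn.Module):
        def __init__(self, feat_dim=512, action_dim=7):
            super().__init__()
            self.encoder = nn.Sequential(         # small frame-pair encoder
                nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim), nn.ReLU(),
            )
            self.head = nn.Linear(feat_dim, action_dim)

        def forward(self, frame_t, frame_t1):
            # Stack the current and next generated frame along the channel axis.
            x = torch.cat([frame_t, frame_t1], dim=1)   # (B, 6, H, W)
            return self.head(self.encoder(x))           # (B, 7) actions

    model = Video2Action()
    actions = model(torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128))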
We demonstrate how Finish-Thinking long-video generation can be applied to robot tasks. Within a limited number of frames, the robot can only open the drawer halfway. EVA extends the frames via self-ask and successfully increases the task completion rate.
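The self-ask extension can be sketched as the loop below, where a VLM judges task completion from the last generated frame and the generator continues from that frame until the task is finished. Both helpers (vlm_ask and generate_clip) are hypothetical stand-ins, not the actual EVA interfaces.

    def finish_thinking(generator, vlm_ask, first_frame, task, max_rounds=4):
        video = [first_frame]
        for _ in range(max_rounds):
            clip = generator.generate_clip(image=video[-1], text=task)
            video.extend(clip[1:])
            # Self-ask: let the VLM check the last frame for task completion.
            if vlm_ask(video[-1], f"Is the task '{task}' finished?") == "yes":
                break
        return video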