Longer video generation results
EVA-Bench & dataset metadata examples.
Raw code (for rebuttal demo only, not yet ready for publication)
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. To address this challenge, we propose Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. It leverages the complementary strengths of pre-trained vision-language and video generation models, enabling them to function as a world model in embodied scenarios. To support RoG, we introduce the Embodied Video Anticipation Benchmark (EVA-Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios, using both in-domain and OOD datasets. Building on this foundation, we devise a world model, the Embodied Video Anticipator (EVA), which follows a multi-stage training paradigm to generate high-fidelity video frames and applies an autoregressive strategy to enable adaptive generalization to longer video sequences. Extensive experiments demonstrate the efficacy of EVA on various downstream tasks such as video generation and robotics, paving the way for large-scale pre-trained models in real-world video prediction applications.
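For intuition, the sketch below illustrates how such an autoregressive rollout with Reflection of Generation could be wired together: a video generator predicts a short clip, a VLM reflects on it (describing the outcome, checking completion, proposing the next step), and the last generated frame conditions the next round. The `vlm` and `generator` interfaces and their method names are hypothetical placeholders for illustration, not the released EVA API.
# Minimal, illustrative sketch only; `vlm` and `generator` are hypothetical objects.
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    frames: List   # generated frames for one round
    caption: str   # VLM description of what happened

def rollout(vlm, generator, first_frame, instruction, max_rounds=4):
    """Autoregressively extend a video: generate a short clip, let the VLM reflect
    on it, and condition the next round on the last generated frame."""
    history: List[Clip] = []
    cond_frame = first_frame
    for _ in range(max_rounds):
        frames = generator.predict(cond_frame, instruction)   # hypothetical generator call
        caption = vlm.describe(frames)                        # reflection: what happened?
        history.append(Clip(frames, caption))
        if vlm.is_completed(frames, instruction):             # reflection: is the goal reached?
            break
        instruction = vlm.next_step(frames, instruction)      # reflection: what should happen next?
        cond_frame = frames[-1]                               # autoregressive conditioning
    return history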
🙋Rotate the pink block left.
🤖EVA Video Prediction:
🙋Sweep the pink block to the left.
🤖EVA Video Prediction:
🙋Rotate the pink block right.
🤖EVA Video Prediction:
🙋Sweep the pink block to the right.
🤖EVA Video Prediction:
"chatunivi_base": "The video shows a robotic arm cutting a piece of wood.",
"qwen2_vl_7b": "Fold the left sleeve of the sweatshirt.",
"gpt4o": "A robotic arm is folding a sweater sleeve on a table.",
🤖 "EVA": "Fold the left arm of sweatshirt."
"chatunivi_base": "A man is cutting up vegetables in a kitchen while another man watches.",
"qwen2_vl_7b": "A person is cutting a tomato on a cutting board.",
"gpt4o": "A person is preparing vegetables on a cutting board in a kitchen.",
🤖"EVA": "Cut bell pepper with knife on the chopping board."
"chatunivi_base": "No", ☑️
"qwen2_vl_7b": "yes", ❌
"gpt4o": "Yes.", ❌
🤖 "EVA": "No" ☑️
"chatunivi_base": "No", ❌
"qwen2_vl_7b": "yes", ☑️
"gpt4o": "No.", ❌
🤖 "EVA": "Yes" ☑️
"chatunivi_base": "No", ☑️
"qwen2_vl_7b": "no", ☑️
"gpt4o": "No.", ☑️
🤖"EVA": "No" ☑️
NEXT-STEP🤖: Move Chips near the 7-up can. (2 rounds, 4s)
move_coke_can_near_water_bottle (2 rounds, 4s)
move_blue_plastic_bottle_near_rxbar_blueberry (2 rounds, 4s)
open_top_drawer (3 rounds, 6s)
pick_green_can_from_bottom_drawer_and_place_on_counter (8s)
pick green can from top drawer and place on counter (10s)
pick green can from top drawer and place on counter (20s)
pick green can from bottom drawer and place on counter (12s)
pick sponge from middle drawer and place on counter
We also provide results on custom videos. It is interesting to find that such a video generation world model has the potential to serve as a video handbook for future VR devices, teaching people how to use a kettle or a tap, etc.
Turn on the Tap
Place the kettle
Pick green can
Open the button of the kettle
We applied a gesture detection model to EVA's generated results. The experiments show that EVA can generate reliable human gesture motions together with high-fidelity video.
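The section does not specify which gesture detector was used; as a hedged stand-in, the snippet below runs MediaPipe Hands over the frames of a generated clip and counts how many frames contain a plausible hand, which is one simple way to sanity-check gesture quality.
# Illustrative only: MediaPipe Hands is used here as an example detector.
import cv2
import mediapipe as mp

def count_hand_detections(video_path: str) -> int:
    """Return how many frames of a generated video contain a detected hand."""
    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)
    cap = cv2.VideoCapture(video_path)
    detected = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            detected += 1
    cap.release()
    hands.close()
    return detected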
We use a simple conditional MLP as the decoder to translate the generated video into robot actions.
EVA's video generation results can be converted into 7-dimensional robot motions. We trained a video2action model and executed the predicted plans in simulation environments. The results show that high-quality generations can be directly converted into successful plans.
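As a rough illustration of the decoder described above, the sketch below shows a conditional MLP that maps a predicted-frame embedding plus an instruction embedding to a 7-dimensional action. The feature extractors, dimensions, and layer sizes are assumptions for illustration and do not reflect the exact released implementation.
# Minimal video2action sketch; all dimensions are assumed, not EVA's actual config.
import torch
import torch.nn as nn

class Video2Action(nn.Module):
    def __init__(self, frame_dim: int = 512, text_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(frame_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 7),   # e.g. xyz delta, rotation delta, gripper state
        )

    def forward(self, frame_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Condition the action prediction on both the predicted frame and the instruction.
        return self.mlp(torch.cat([frame_feat, text_feat], dim=-1))

# Usage sketch: frame_feat / text_feat could come from, e.g., CLIP image and text encoders.
# actions = Video2Action()(frame_feat, text_feat)   # shape: (batch, 7)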
We also provide multi-round, long-horizon task generation. The video reveals some instructive difficulties for the world model: a long-term memory system might be necessary, and self-awareness, i.e., reflection of generation, is important.
Given four prompts in the following order (zero-shot):
1. Lift pink block
2. Put into drawer
3. Open the drawer
4. Lift the blue block
The pink block changes to blue around the 5-second mark, showing the weakness of the current world model's short-term and long-term memory.
In step 3, the drawer is already open, so the arm only touches the handle, which shows that the world model has some reasoning ability.
The distribution (top) and selected frames (left) of the EVA Benchmark.
[
{
"id": "2535951750640968742",
"video": "/share/world_model/robovqa/videos/10821748764221072760.mp4",
"task": "NextStep",
"question": "Based on the given video, current goal is: Please place the cold cups into the tray. Q: immediate next step?",
"answer": "A: Remove the glass from right stack put it in the tray",
"conversations": [
{
"from": "human",
"value": "This is a 'Next-Step' task where you need to predict the next step in a sequence of steps.<video> Based on the given video, current goal is: Please place the cold cups into the tray. Q: immediate next step?"
},
{
"from": "gpt",
"value": "A: Remove the glass from right stack put it in the tray"
}
]
},
{
"id": "EgoExo4D_KeyStep_1998",
"video": "/share/world_model/egoexo4d/keystep_train_takes-cut/nus_covidtest_02_2/aria01_214-1_0000009.mp4",
"task": "HowTo",
"question": "What is the way to swirl the collection swab in the nostril?",
"answer": "Swirl the collection swab in the nostril.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> What is the way to swirl the collection swab in the nostril?"
},
{
"from": "gpt",
"value": "Swirl the collection swab in the nostril."
}
]
},
{
"id": "15562648869436311870",
"video": "/share/world_model/robovqa/videos/9394608913112580403.mp4",
"caption": "put the pip corn packet on the table",
"nextvideo": "/share/world_model/robovqa/videos/1113294827289372720.mp4",
"question": "This video is about how to put the pip corn packet on the table. Current goal is: please remove the snacks from the tray. If this video is the final step, please answer 'Yes.'. If no, please predict what is the next step should do.",
"answer": "put the pip corn packet on the table",
"task": "<task:immediate_planning_with_context20:freeform>",
"conversations": [
{
"from": "human",
"value": "This is a 'Task-Planning' task where you need to determine the next step required to complete the task.<video> This video is about how to put the pip corn packet on the table. Current goal is: please remove the snacks from the tray. If this video is the final step, please answer 'Yes.'. If no, please predict what is the next step should do."
},
{
"from": "gpt",
"value": "put the pip corn packet on the table"
}
]
},
{
"id": "EgoExo4D_KeyStep_4853",
"video": "/share/world_model/ego_robo/stepcompleted/EgoExo4D_KeyStep_4853_40percent.mp4",
"task": "StepCompleted",
"question": "Based on the given video, has the action been executed?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Based on the given video, has the action been executed?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "1655112556696676852",
"video": "/share/world_model/robovqa/videos/12189026333842243844.mp4",
"task": "StepCompleted",
"question": "Has the action to put the orange in the big bowl completed?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Has the action to put the orange in the big bowl completed?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "robot_vqa_5787632916096419092",
"video": "/share/world_model/robovqa/videos/7647415357755296087.mp4",
"conversations": [
{
"from": "human",
"value": "This is a video understanding task where you need to focus on the main subjects, their actions, and the background scenes to answer the following question.<video> current goal is: place sugar packets in the correct sections of the clear plastic tray Q: immediate next step?"
},
{
"from": "gpt",
"value": "A: Drop sugar packet in a organizer"
}
]
},
{
"id": "EgoExo4D_KeyStep_1878",
"video": "/share/world_model/ego_robo/stepcompleted/EgoExo4D_KeyStep_1878_40percent.mp4",
"task": "StepCompleted",
"question": "The video is describing the step to Arrange the covid test pack materials on the table, is this activity concluded?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the step to Arrange the covid test pack materials on the table, is this activity concluded?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "11654493023765072006",
"video": "/share/world_model/robovqa/videos/7394941969709917478.mp4",
"task": "HowTo",
"question": "Could you explain how to move away from the white box?",
"answer": "Move away from the white box.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> Could you explain how to move away from the white box?"
},
{
"from": "gpt",
"value": "Move away from the white box."
}
]
},
{
"id": "6099260259542926199",
"video": "/share/world_model/robovqa/videos/18398277088045882422.mp4",
"task": "StepCompleted",
"question": "The video is describing the activity to grab the objects, is this activity completed?",
"answer": "Yes",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the activity to grab the objects, is this activity completed?"
},
{
"from": "gpt",
"value": "Yes"
}
]
},
{
"id": "EgoExo4D_KeyStep_3344",
"video": "/share/world_model/ego_robo/stepcompleted/EgoExo4D_KeyStep_3344_40percent.mp4",
"task": "StepCompleted",
"question": "Based on the given video, is this process done?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Based on the given video, is this process done?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "calvinrobot_vqa_1210939_1211003",
"video": "/share/world_model/calvinRobo/calvin/dataset/videos_ABCD_7/1210939_1211003.mp4",
"task": "HowTo",
"question": "What is the way to grasp the red block lying on the shelf?",
"answer": "Grasp the red block lying on the shelf.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> What is the way to grasp the red block lying on the shelf?"
},
{
"from": "gpt",
"value": "Grasp the red block lying on the shelf."
}
]
},
{
"id": "calvinrobot_vqa_1844327_1844391",
"video": "/share/world_model/calvinRobo/calvin/dataset/videos_ABCD_7/1844327_1844391.mp4",
"task": "HowTo",
"question": "What is the best way to rotate the blue block 90 degrees to the left?",
"answer": "Rotate the blue block 90 degrees to the left.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> What is the best way to rotate the blue block 90 degrees to the left?"
},
{
"from": "gpt",
"value": "Rotate the blue block 90 degrees to the left."
}
]
},
{
"id": "15342981815691041261",
"video": "/share/world_model/robovqa/videos/9216678769626174720.mp4",
"task": "StepCompleted",
"question": "Based on the given video, has the step to Put the spoon in the paper bowl completed?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Based on the given video, has the step to Put the spoon in the paper bowl completed?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "8507301063372860449",
"video": "/share/world_model/robovqa/videos/13556453805290541046.mp4",
"task": "StepCompleted",
"question": "The video is describing the activity to drop the cup in a bin, is this activity completed?",
"answer": "Yes",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the activity to drop the cup in a bin, is this activity completed?"
},
{
"from": "gpt",
"value": "Yes"
}
]
},
{
"id": "17160077548367934395",
"video": "/share/world_model/robovqa/videos/4738293974986050859.mp4",
"task": "StepCompleted",
"question": "The video is describing the step to drop blue chip bag into the drawer, is this step completed?",
"answer": "Yes",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the step to drop blue chip bag into the drawer, is this step completed?"
},
{
"from": "gpt",
"value": "Yes"
}
]
},
{
"id": "1550535357727507718",
"video": "/share/world_model/robovqa/videos/80375803329766278.mp4",
"task": "NextStep",
"question": "Based on the given video, current goal is: please put coffee pods back into the case Q: immediate next step?",
"answer": "A: done",
"conversations": [
{
"from": "human",
"value": "This is a 'Next-Step' task where you need to predict the next step in a sequence of steps.<video> Based on the given video, current goal is: please put coffee pods back into the case Q: immediate next step?"
},
{
"from": "gpt",
"value": "A: done"
}
]
},
{
"id": "4005568423641573528",
"video": "/share/world_model/robovqa/videos/16770588140512569214.mp4",
"task": "HowTo",
"question": "How can I put apple in the bowl?",
"answer": "put apple in the bowl?",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> How can I put apple in the bowl?"
},
{
"from": "gpt",
"value": "put apple in the bowl?"
}
]
},
{
"id": "robot_vqa_332998056731364634",
"video": "/share/world_model/robovqa/videos/6278762415297476422.mp4",
"conversations": [
{
"from": "human",
"value": "This is a video understanding task where you need to focus on the main subjects, their actions, and the background scenes to answer the following question.<video> Q: what action is possible right now?"
},
{
"from": "gpt",
"value": "A: Tilt the glass above white cup"
}
]
},
]
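For reference, a minimal sketch of how entries like the ones above could be consumed: load the JSON array, group the entries by task, and yield (video, question, answer) triples for a VLM to answer. The file name below is a placeholder, not the released metadata path.
# Sketch only; "metadata.json" is a placeholder file name.
import json
from collections import defaultdict

def load_eva_bench(metadata_path: str = "metadata.json"):
    with open(metadata_path, "r") as f:
        entries = json.load(f)
    by_task = defaultdict(list)
    for item in entries:
        # Some entries (plain VQA) carry no "task" field; keep them under a default key.
        task = item.get("task", "VQA")
        question = item["conversations"][0]["value"]
        answer = item["conversations"][1]["value"]
        by_task[task].append((item["video"], question, answer))
    return by_task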
evasl.py Code
# Imports required to run this snippet: the pycocoevalcap caption metrics and
# torchmetrics' CLIPScore, plus OpenCV and NumPy for frame handling.
import json
import os
from collections import defaultdict

import cv2
import numpy as np
import torch
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from torchmetrics.multimodal.clip_score import CLIPScore


class Scorer():
    def __init__(self, ref, gt):
        self.ref = ref
        self.gt = gt
        print('setting up scorers...')
        self.scorers = [
            (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
            (Meteor(), "METEOR"),
            (Rouge(), "ROUGE_L"),
            (Cider(), "CIDEr"),
            (Spice(), "SPICE"),
        ]
def compute_scores(self):
total_scores = {}
for scorer, method in self.scorers:
print('computing %s score...' % (scorer.method()))
score, scores = scorer.compute_score(self.gt, self.ref)
if type(method) == list:
for sc, scs, m in zip(score, scores, method):
print("%s: %0.3f" % (m, sc))
total_scores["Bleu"] = score
else:
print("%s: %0.3f" % (method, score))
total_scores[method] = score
return total_scores
def calculate_clip(pds):
    metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    metric = metric.cuda()
    CLIP_T_scores = []
    for idx in pds.keys():
        video_path = pds[idx][0]['video'].replace('share', 'home')
        caption = pds[idx][0]['caption']
        # Score the last frame of each video against its caption.
        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - 1)
        ret, frame = cap.read()
        cap.release()
        # OpenCV returns BGR HxWxC frames; CLIPScore expects RGB CxHxW tensors.
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        CLIP_T_score = metric(torch.from_numpy(frame).permute(2, 0, 1).cuda(), caption)
        CLIP_T_scores.append(CLIP_T_score.detach().cpu().numpy())
    return np.array(CLIP_T_scores).mean()
def run_eval(gt_path, pred_path):
preds = defaultdict(list)
gts = defaultdict(list)
preds_all = []
gts_all = []
with open(gt_path, "r") as f:
gt = json.load(f)
with open(pred_path, "r") as f:
pred = json.load(f)
idx = 0
for d, domain in enumerate(gt):
print("evaluating ", domain['domain'])
for i, data in enumerate(domain['meta']):
video, captions = data["video"], data["answer"]
if '/share/world_model/egoexo4d' in video:
idx = idx-1
continue
# pds[i].append({"video":video, "caption":pd[video]})
# pds_all += [pd[video]]
gts[idx+i].append({"video": video, "caption": captions})
gts_all += [captions]
idx += len(domain['meta'])
idx = 0
for d, domain in enumerate(pred):
if 'meta' in domain.keys():
for i, data in enumerate(domain['meta']):
video, captions = data["video"], data["answer"]
# pds[i].append({"video":video, "caption":pd[video]})
# pds_all += [pd[video]]
preds[idx+i].append({"video": video, "caption": captions})
preds_all += [captions]
else:
video, captions = domain["video"], domain["vlm_pred"]
if '/share/world_model/egoexo4d' in video:
continue
preds[idx].append({"video": video, "caption": captions})
preds_all += [captions]
idx += 1
tokenizer = PTBTokenizer()
pds = tokenizer.tokenize(preds)
gts = tokenizer.tokenize(gts)
clip_score = calculate_clip(preds)
scorer = Scorer(pds, gts)
total_score = scorer.compute_scores()
total_score['clip_score'] = clip_score
    # EVA-SL: average seven terms (Bleu_1, Bleu_2, METEOR, ROUGE_L, CIDEr/10,
    # SPICE, and a rescaled CLIP score) into a single language score.
    total = 0
    for key, value in total_score.items():
        if key == 'clip_score':
            total = total + ((1 / value - 0.02) / 0.08)
        elif key == 'CIDEr':
            total = total + value / 10
        elif isinstance(value, list):
            total = total + value[0] + value[1]
        else:
            total = total + value
    total_score['evasl'] = total / 7
return total_score
if __name__ == '__main__':
res_folders = '/home/world_model/EVA-Bench/results/vlm_res/icml_rebuttal'
gt = '/home/world_model/EVA-Bench/metadata'
preds_path = os.listdir(res_folders)
result_list = {}
# preds_path = ['GPT4O']
for model in preds_path:
result_list[model] = {}
pred = os.path.join(res_folders, model)
files = os.listdir(pred)
for file in files:
if not '.json' in file:
continue
result_list[model][file.replace('.json', '')] = run_eval(
os.path.join(gt, file), os.path.join(pred, file))
for model, values in result_list.items():
print(f"## {model}")
for key, value in values.items():
scores = []
for metric, score in value.items():
if isinstance(score, list):
scores.extend(score)
else:
scores.append(score)
formatted_numbers = [f"{num:.4f}" for num in scores]
scores = "& ".join(formatted_numbers)
print(f"### {key}\n {scores}")
Video generation evaluation code (CLIP-T / CLIP-I / VBench)
# Setup assumed for this snippet: the imports below are required, while the paths,
# the CLIP-I backbone (ViT-B/16), and the task list are placeholders, since the
# snippet starts mid-script and does not define them.
import os

import clip  # OpenAI CLIP, used here for the CLIP-I image-image similarity
import cv2
import numpy as np
import pandas as pd
import torch
from PIL import Image
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16").cuda()
clip_I_model, clip_I_preprocess = clip.load("ViT-B/16", device="cuda")
result_folder = "/path/to/generation/results"   # placeholder
gtFolder = "/path/to/ground_truth_metadata"     # placeholder
tasks = os.listdir(result_folder)               # placeholder: one sub-folder per task

for task in tasks:
    task_folder = os.path.join(result_folder, task)
models = os.listdir(task_folder)
df = pd.read_csv(f"{gtFolder}/{task}.csv")
model_clip_results = {}
for model in models:
video_folder = os.path.join(os.path.join(
task_folder, model), 'samples_separate')
videos = os.listdir(video_folder)
        os.system(f"""torchrun --nproc_per_node=1 --standalone vbench_eva.py \
            --dimension 'subject_consistency' 'background_consistency' 'motion_smoothness' 'dynamic_degree' 'aesthetic_quality' \
            --videos_path {video_folder} \
            --exp_name '{model}' \
            --output_path "evaluation_results/Open-X/{task}" \
            --mode=custom_input""")
print(task)
CLIP_T_scores = []
CLIP_I_scores = []
for video in videos:
prompt = video.replace('_frame_0_sample0.mp4', '').replace(
'_sample0.mp4', '').replace('?', '').replace('_', ' ')
# prompt = video.replace('_sample0.mp4','').replace('_',' ')
row = df[df['name'] == prompt]
# name = row.iloc[0]['name']
page_dir = row.iloc[0]['page_dir'].replace('share', 'home')
gt_path = f"{page_dir}/{row.iloc[0]['videoid']}.mp4"
# print(gt_path)
video_path = os.path.join(video_folder, video)
            # Last frame of the generated video.
            cap = cv2.VideoCapture(video_path)
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - 1)
            ret, frame = cap.read()
            cap.release()
            # Last frame of the ground-truth video.
            gt_cap = cv2.VideoCapture(gt_path)
            gt_total_frames = int(gt_cap.get(cv2.CAP_PROP_FRAME_COUNT))
            gt_cap.set(cv2.CAP_PROP_POS_FRAMES, gt_total_frames - 1)
            gt_ret, gt_frame = gt_cap.read()
            gt_cap.release()
            # OpenCV decodes BGR HxWxC; CLIP expects RGB (CxHxW for CLIPScore).
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            gt_frame = cv2.cvtColor(gt_frame, cv2.COLOR_BGR2RGB)
            CLIP_T_score = metric(torch.from_numpy(frame).permute(2, 0, 1).cuda(), prompt)
            CLIP_T_scores.append(CLIP_T_score.detach().cpu().numpy())
try:
pred_feature = clip_I_model.encode_image(clip_I_preprocess(
Image.fromarray(frame)).unsqueeze(0).to('cuda'))
gt_feature = clip_I_model.encode_image(clip_I_preprocess(
Image.fromarray(gt_frame)).unsqueeze(0).to('cuda'))
pred_feature = pred_feature / \
pred_feature.norm(dim=1, keepdim=True).to(torch.float32)
gt_feature = gt_feature / \
gt_feature.norm(dim=1, keepdim=True).to(torch.float32)
logit_scale = clip_I_model.logit_scale.exp()
clip_I_score = logit_scale * (pred_feature * gt_feature).sum()
CLIP_I_scores.append(clip_I_score.detach().cpu().numpy())
            except Exception:
                continue
# print(score)
model_clip_results[model] = {
"CLIP-I": np.array(CLIP_I_scores).mean(), "CLIP-T": np.array(CLIP_T_scores).mean()}
for model, values in model_clip_results.items():
print(f"{model}: {values}")