Longer video generation results
EVA-Bench & dataset metadata examples.
Raw code (for rebuttal demo only, not yet ready for publication)
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. To address this challenge, we propose Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. It leverages the complementary strengths of pre-trained vision-language and video generation models, enabling them to function as a world model in embodied scenarios. To support RoG, we introduce the Embodied Video Anticipation Benchmark (EVA-Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios, using both in-domain and OOD datasets. Building on this foundation, we devise a world model, the Embodied Video Anticipator (EVA), which follows a multi-stage training paradigm to generate high-fidelity video frames and applies an autoregressive strategy to enable adaptive generalization to longer video sequences. Extensive experiments demonstrate the efficacy of EVA on various downstream tasks such as video generation and robotics, paving the way for large-scale pre-trained models in real-world video prediction applications.
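For intuition, the sketch below illustrates how such an autoregressive rollout with Reflection of Generation could be wired together: a video generator predicts a short clip, a VLM reflects on it (describing the outcome, checking completion, proposing the next step), and the last generated frame conditions the next round. The `vlm` and `generator` interfaces and their method names are hypothetical placeholders for illustration, not the released EVA API.
# Minimal, illustrative sketch only; `vlm` and `generator` are hypothetical objects.
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    frames: List   # generated frames for one round
    caption: str   # VLM description of what happened

def rollout(vlm, generator, first_frame, instruction, max_rounds=4):
    """Autoregressively extend a video: generate a short clip, let the VLM reflect
    on it, and condition the next round on the last generated frame."""
    history: List[Clip] = []
    cond_frame = first_frame
    for _ in range(max_rounds):
        frames = generator.predict(cond_frame, instruction)   # hypothetical generator call
        caption = vlm.describe(frames)                        # reflection: what happened?
        history.append(Clip(frames, caption))
        if vlm.is_completed(frames, instruction):             # reflection: is the goal reached?
            break
        instruction = vlm.next_step(frames, instruction)      # reflection: what should happen next?
        cond_frame = frames[-1]                               # autoregressive conditioning
    return history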
🙋Rotate the pink block left.
🤖EVA Video Prediction:
🙋Sweep the pink block to the left.
🤖EVA Video Prediction:
🙋Rotate the pink block right.
🤖EVA Video Prediction:
🙋Sweep the pink block to the right.
🤖EVA Video Prediction:
"chatunivi_base": "The video shows a robotic arm cutting a piece of wood.",
"qwen2_vl_7b": "Fold the left sleeve of the sweatshirt.",
"gpt4o": "A robotic arm is folding a sweater sleeve on a table.",
🤖 "EVA": "Fold the left arm of sweatshirt."
"chatunivi_base": "A man is cutting up vegetables in a kitchen while another man watches.",
"qwen2_vl_7b": "A person is cutting a tomato on a cutting board.",
"gpt4o": "A person is preparing vegetables on a cutting board in a kitchen.",
🤖"EVA": "Cut bell pepper with knife on the chopping board."
"chatunivi_base": "No", ☑️
"qwen2_vl_7b": "yes", ❌
"gpt4o": "Yes.", ❌
🤖 "EVA": "No" ☑️
"chatunivi_base": "No", ❌
"qwen2_vl_7b": "yes", ☑️
"gpt4o": "No.", ❌
🤖 "EVA": "Yes" ☑️
"chatunivi_base": "No", ☑️
"qwen2_vl_7b": "no", ☑️
"gpt4o": "No.", ☑️
🤖"EVA": "No" ☑️
NEXT-STEP🤖: Move Chips near the 7-up can. (2 rounds, 4s)
move_coke_can_near_water_bottle (2 rounds, 4s)
move_blue_plastic_bottle_near_rxbar_blueberry (2 rounds, 4s)
open_top_drawer (3 rounds, 6s)
pick_green_can_from_bottom_drawer_and_place_on_counter (8s)
pick green can from top drawer and place on counter (10s)
pick green can from top drawer and place on counter (20s)
pick green can from bottom drawer and place on counter (12s)
pick sponge from middle drawer and place on counter
We also provide results on custom videos. It is interesting to find that such a video generation world model has the potential to serve as a video handbook for future VR devices, teaching people how to use a kettle or a tap, etc.
Turn on the Tap
Place the kettle
Pick green can
Open the button of the kettle
We applied a gesture detection model to EVA's generated results. The experiments show that EVA can generate reliable human gesture motions together with high-fidelity video.
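The section does not specify which gesture detector was used; as a hedged stand-in, the snippet below runs MediaPipe Hands over the frames of a generated clip and counts how many frames contain a plausible hand, which is one simple way to sanity-check gesture quality.
# Illustrative only: MediaPipe Hands is used here as an example detector.
import cv2
import mediapipe as mp

def count_hand_detections(video_path: str) -> int:
    """Return how many frames of a generated video contain a detected hand."""
    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)
    cap = cv2.VideoCapture(video_path)
    detected = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            detected += 1
    cap.release()
    hands.close()
    return detected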
We use a simple conditional MLP as the decoder to translate the generated video into robot actions.
EVA's video generation results can be converted into 7-dimensional robot motions. We trained a video2action model and executed the predicted plans in simulation environments. The results show that high-quality generations can be directly converted into successful plans.
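As a rough illustration of the decoder described above, the sketch below shows a conditional MLP that maps a predicted-frame embedding plus an instruction embedding to a 7-dimensional action. The feature extractors, dimensions, and layer sizes are assumptions for illustration and do not reflect the exact released implementation.
# Minimal video2action sketch; all dimensions are assumed, not EVA's actual config.
import torch
import torch.nn as nn

class Video2Action(nn.Module):
    def __init__(self, frame_dim: int = 512, text_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(frame_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 7),   # e.g. xyz delta, rotation delta, gripper state
        )

    def forward(self, frame_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Condition the action prediction on both the predicted frame and the instruction.
        return self.mlp(torch.cat([frame_feat, text_feat], dim=-1))

# Usage sketch: frame_feat / text_feat could come from, e.g., CLIP image and text encoders.
# actions = Video2Action()(frame_feat, text_feat)   # shape: (batch, 7)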
We also provide multi-round, long-horizon task generation. The video reveals some instructive difficulties for the world model: a long-term memory system might be necessary, and self-awareness, i.e., reflection of generation, is important.
Given four prompts in the following order (zero-shot):
1. Lift pink block
2. Put into drawer
3. Open the drawer
4. Lift the blue block
The pink block changes to blue around the 5-second mark, showing the weakness of the current world model's short-term and long-term memory.
In step 3, the drawer is already open, so the arm only touches the handle, which shows that the world model has some reasoning ability.
The distribution (top) and selected frames (left) of the EVA Benchmark.
[
{
"id": "2535951750640968742",
"video": "/share/world_model/robovqa/videos/10821748764221072760.mp4",
"task": "NextStep",
"question": "Based on the given video, current goal is: Please place the cold cups into the tray. Q: immediate next step?",
"answer": "A: Remove the glass from right stack put it in the tray",
"conversations": [
{
"from": "human",
"value": "This is a 'Next-Step' task where you need to predict the next step in a sequence of steps.<video> Based on the given video, current goal is: Please place the cold cups into the tray. Q: immediate next step?"
},
{
"from": "gpt",
"value": "A: Remove the glass from right stack put it in the tray"
}
]
},
{
"id": "EgoExo4D_KeyStep_1998",
"video": "/share/world_model/egoexo4d/keystep_train_takes-cut/nus_covidtest_02_2/aria01_214-1_0000009.mp4",
"task": "HowTo",
"question": "What is the way to swirl the collection swab in the nostril?",
"answer": "Swirl the collection swab in the nostril.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> What is the way to swirl the collection swab in the nostril?"
},
{
"from": "gpt",
"value": "Swirl the collection swab in the nostril."
}
]
},
{
"id": "15562648869436311870",
"video": "/share/world_model/robovqa/videos/9394608913112580403.mp4",
"caption": "put the pip corn packet on the table",
"nextvideo": "/share/world_model/robovqa/videos/1113294827289372720.mp4",
"question": "This video is about how to put the pip corn packet on the table. Current goal is: please remove the snacks from the tray. If this video is the final step, please answer 'Yes.'. If no, please predict what is the next step should do.",
"answer": "put the pip corn packet on the table",
"task": "<task:immediate_planning_with_context20:freeform>",
"conversations": [
{
"from": "human",
"value": "This is a 'Task-Planning' task where you need to determine the next step required to complete the task.<video> This video is about how to put the pip corn packet on the table. Current goal is: please remove the snacks from the tray. If this video is the final step, please answer 'Yes.'. If no, please predict what is the next step should do."
},
{
"from": "gpt",
"value": "put the pip corn packet on the table"
}
]
},
{
"id": "EgoExo4D_KeyStep_4853",
"video": "/share/world_model/ego_robo/stepcompleted/EgoExo4D_KeyStep_4853_40percent.mp4",
"task": "StepCompleted",
"question": "Based on the given video, has the action been executed?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Based on the given video, has the action been executed?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "1655112556696676852",
"video": "/share/world_model/robovqa/videos/12189026333842243844.mp4",
"task": "StepCompleted",
"question": "Has the action to put the orange in the big bowl completed?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Has the action to put the orange in the big bowl completed?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "robot_vqa_5787632916096419092",
"video": "/share/world_model/robovqa/videos/7647415357755296087.mp4",
"conversations": [
{
"from": "human",
"value": "This is a video understanding task where you need to focus on the main subjects, their actions, and the background scenes to answer the following question.<video> current goal is: place sugar packets in the correct sections of the clear plastic tray Q: immediate next step?"
},
{
"from": "gpt",
"value": "A: Drop sugar packet in a organizer"
}
]
},
{
"id": "EgoExo4D_KeyStep_1878",
"video": "/share/world_model/ego_robo/stepcompleted/EgoExo4D_KeyStep_1878_40percent.mp4",
"task": "StepCompleted",
"question": "The video is describing the step to Arrange the covid test pack materials on the table, is this activity concluded?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the step to Arrange the covid test pack materials on the table, is this activity concluded?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "11654493023765072006",
"video": "/share/world_model/robovqa/videos/7394941969709917478.mp4",
"task": "HowTo",
"question": "Could you explain how to move away from the white box?",
"answer": "Move away from the white box.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> Could you explain how to move away from the white box?"
},
{
"from": "gpt",
"value": "Move away from the white box."
}
]
},
{
"id": "6099260259542926199",
"video": "/share/world_model/robovqa/videos/18398277088045882422.mp4",
"task": "StepCompleted",
"question": "The video is describing the activity to grab the objects, is this activity completed?",
"answer": "Yes",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the activity to grab the objects, is this activity completed?"
},
{
"from": "gpt",
"value": "Yes"
}
]
},
{
"id": "EgoExo4D_KeyStep_3344",
"video": "/share/world_model/ego_robo/stepcompleted/EgoExo4D_KeyStep_3344_40percent.mp4",
"task": "StepCompleted",
"question": "Based on the given video, is this process done?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Based on the given video, is this process done?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "calvinrobot_vqa_1210939_1211003",
"video": "/share/world_model/calvinRobo/calvin/dataset/videos_ABCD_7/1210939_1211003.mp4",
"task": "HowTo",
"question": "What is the way to grasp the red block lying on the shelf?",
"answer": "Grasp the red block lying on the shelf.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> What is the way to grasp the red block lying on the shelf?"
},
{
"from": "gpt",
"value": "Grasp the red block lying on the shelf."
}
]
},
{
"id": "calvinrobot_vqa_1844327_1844391",
"video": "/share/world_model/calvinRobo/calvin/dataset/videos_ABCD_7/1844327_1844391.mp4",
"task": "HowTo",
"question": "What is the best way to rotate the blue block 90 degrees to the left?",
"answer": "Rotate the blue block 90 degrees to the left.",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> What is the best way to rotate the blue block 90 degrees to the left?"
},
{
"from": "gpt",
"value": "Rotate the blue block 90 degrees to the left."
}
]
},
{
"id": "15342981815691041261",
"video": "/share/world_model/robovqa/videos/9216678769626174720.mp4",
"task": "StepCompleted",
"question": "Based on the given video, has the step to Put the spoon in the paper bowl completed?",
"answer": "No",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> Based on the given video, has the step to Put the spoon in the paper bowl completed?"
},
{
"from": "gpt",
"value": "No"
}
]
},
{
"id": "8507301063372860449",
"video": "/share/world_model/robovqa/videos/13556453805290541046.mp4",
"task": "StepCompleted",
"question": "The video is describing the activity to drop the cup in a bin, is this activity completed?",
"answer": "Yes",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the activity to drop the cup in a bin, is this activity completed?"
},
{
"from": "gpt",
"value": "Yes"
}
]
},
{
"id": "17160077548367934395",
"video": "/share/world_model/robovqa/videos/4738293974986050859.mp4",
"task": "StepCompleted",
"question": "The video is describing the step to drop blue chip bag into the drawer, is this step completed?",
"answer": "Yes",
"conversations": [
{
"from": "human",
"value": "This is a 'Step-Completed' task where you need to predict if a step is completed or not.<video> The video is describing the step to drop blue chip bag into the drawer, is this step completed?"
},
{
"from": "gpt",
"value": "Yes"
}
]
},
{
"id": "1550535357727507718",
"video": "/share/world_model/robovqa/videos/80375803329766278.mp4",
"task": "NextStep",
"question": "Based on the given video, current goal is: please put coffee pods back into the case Q: immediate next step?",
"answer": "A: done",
"conversations": [
{
"from": "human",
"value": "This is a 'Next-Step' task where you need to predict the next step in a sequence of steps.<video> Based on the given video, current goal is: please put coffee pods back into the case Q: immediate next step?"
},
{
"from": "gpt",
"value": "A: done"
}
]
},
{
"id": "4005568423641573528",
"video": "/share/world_model/robovqa/videos/16770588140512569214.mp4",
"task": "HowTo",
"question": "How can I put apple in the bowl?",
"answer": "put apple in the bowl?",
"conversations": [
{
"from": "human",
"value": "This is a 'How-to' task where you need to explain how to accomplish a specific task.<video> How can I put apple in the bowl?"
},
{
"from": "gpt",
"value": "put apple in the bowl?"
}
]
},
{
"id": "robot_vqa_332998056731364634",
"video": "/share/world_model/robovqa/videos/6278762415297476422.mp4",
"conversations": [
{
"from": "human",
"value": "This is a video understanding task where you need to focus on the main subjects, their actions, and the background scenes to answer the following question.<video> Q: what action is possible right now?"
},
{
"from": "gpt",
"value": "A: Tilt the glass above white cup"
}
]
},
]
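For reference, a minimal sketch of how entries like the ones above could be consumed: load the JSON array, group the entries by task, and yield (video, question, answer) triples for a VLM to answer. The file name below is a placeholder, not the released metadata path.
# Sketch only; "metadata.json" is a placeholder file name.
import json
from collections import defaultdict

def load_eva_bench(metadata_path: str = "metadata.json"):
    with open(metadata_path, "r") as f:
        entries = json.load(f)
    by_task = defaultdict(list)
    for item in entries:
        # Some entries (plain VQA) carry no "task" field; keep them under a default key.
        task = item.get("task", "VQA")
        question = item["conversations"][0]["value"]
        answer = item["conversations"][1]["value"]
        by_task[task].append((item["video"], question, answer))
    return by_task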
evasl.py Code
# Imports required to run this snippet: the pycocoevalcap caption metrics and
# torchmetrics' CLIPScore, plus OpenCV and NumPy for frame handling.
import json
import os
from collections import defaultdict

import cv2
import numpy as np
import torch
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.spice.spice import Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from torchmetrics.multimodal.clip_score import CLIPScore


class Scorer():
    def __init__(self, ref, gt):
        self.ref = ref
        self.gt = gt
        print('setting up scorers...')
        self.scorers = [
            (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
            (Meteor(), "METEOR"),
            (Rouge(), "ROUGE_L"),
            (Cider(), "CIDEr"),
            (Spice(), "SPICE"),
        ]
def compute_scores(self):
total_scores = {}
for scorer, method in self.scorers:
print('computing %s score...' % (scorer.method()))
score, scores = scorer.compute_score(self.gt, self.ref)
if type(method) == list:
for sc, scs, m in zip(score, scores, method):
print("%s: %0.3f" % (m, sc))
total_scores["Bleu"] = score
else:
print("%s: %0.3f" % (method, score))
total_scores[method] = score
return total_scores
def calculate_clip(pds):
    metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    metric = metric.cuda()
    CLIP_T_scores = []
    for idx in pds.keys():
        video_path = pds[idx][0]['video'].replace('share', 'home')
        caption = pds[idx][0]['caption']
        # Score the last frame of each video against its caption.
        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - 1)
        ret, frame = cap.read()
        cap.release()
        # OpenCV returns BGR HxWxC frames; CLIPScore expects RGB CxHxW tensors.
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        CLIP_T_score = metric(torch.from_numpy(frame).permute(2, 0, 1).cuda(), caption)
        CLIP_T_scores.append(CLIP_T_score.detach().cpu().numpy())
    return np.array(CLIP_T_scores).mean()
def run_eval(gt_path, pred_path):
preds = defaultdict(list)
gts = defaultdict(list)
preds_all = []
gts_all = []
with open(gt_path, "r") as f:
gt = json.load(f)
with open(pred_path, "r") as f:
pred = json.load(f)
idx = 0
for d, domain in enumerate(gt):
print("evaluating ", domain['domain'])
for i, data in enumerate(domain['meta']):
video, captions = data["video"], data["answer"]
if '/share/world_model/egoexo4d' in video:
idx = idx-1
continue
# pds[i].append({"video":video, "caption":pd[video]})
# pds_all += [pd[video]]
gts[idx+i].append({"video": video, "caption": captions})
gts_all += [captions]
idx += len(domain['meta'])
idx = 0
for d, domain in enumerate(pred):
if 'meta' in domain.keys():
for i, data in enumerate(domain['meta']):
video, captions = data["video"], data["answer"]
# pds[i].append({"video":video, "caption":pd[video]})
# pds_all += [pd[video]]
preds[idx+i].append({"video": video, "caption": captions})
preds_all += [captions]
else:
video, captions = domain["video"], domain["vlm_pred"]
if '/share/world_model/egoexo4d' in video:
continue
preds[idx].append({"video": video, "caption": captions})
preds_all += [captions]
idx += 1
tokenizer = PTBTokenizer()
pds = tokenizer.tokenize(preds)
gts = tokenizer.tokenize(gts)
clip_score = calculate_clip(preds)
scorer = Scorer(pds, gts)
total_score = scorer.compute_scores()
total_score['clip_score'] = clip_score
    # EVA-SL: average seven terms (Bleu_1, Bleu_2, METEOR, ROUGE_L, CIDEr/10,
    # SPICE, and a rescaled CLIP score) into a single language score.
    total = 0
    for key, value in total_score.items():
        if key == 'clip_score':
            total = total + ((1 / value - 0.02) / 0.08)
        elif key == 'CIDEr':
            total = total + value / 10
        elif isinstance(value, list):
            total = total + value[0] + value[1]
        else:
            total = total + value
    total_score['evasl'] = total / 7
return total_score
if __name__ == '__main__':
res_folders = '/home/world_model/EVA-Bench/results/vlm_res/icml_rebuttal'
gt = '/home/world_model/EVA-Bench/metadata'
preds_path = os.listdir(res_folders)
result_list = {}
# preds_path = ['GPT4O']
for model in preds_path:
result_list[model] = {}
pred = os.path.join(res_folders, model)
files = os.listdir(pred)
for file in files:
if not '.json' in file:
continue
result_list[model][file.replace('.json', '')] = run_eval(
os.path.join(gt, file), os.path.join(pred, file))
for model, values in result_list.items():
print(f"## {model}")
for key, value in values.items():
scores = []
for metric, score in value.items():
if isinstance(score, list):
scores.extend(score)
else:
scores.append(score)
formatted_numbers = [f"{num:.4f}" for num in scores]
scores = "& ".join(formatted_numbers)
print(f"### {key}\n {scores}")
Video generation evaluation code (CLIP-T / CLIP-I / VBench)
# Setup assumed for this snippet: the imports below are required, while the paths,
# the CLIP-I backbone (ViT-B/16), and the task list are placeholders, since the
# snippet starts mid-script and does not define them.
import os

import clip  # OpenAI CLIP, used here for the CLIP-I image-image similarity
import cv2
import numpy as np
import pandas as pd
import torch
from PIL import Image
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16").cuda()
clip_I_model, clip_I_preprocess = clip.load("ViT-B/16", device="cuda")
result_folder = "/path/to/generation/results"   # placeholder
gtFolder = "/path/to/ground_truth_metadata"     # placeholder
tasks = os.listdir(result_folder)               # placeholder: one sub-folder per task

for task in tasks:
    task_folder = os.path.join(result_folder, task)
models = os.listdir(task_folder)
df = pd.read_csv(f"{gtFolder}/{task}.csv")
model_clip_results = {}
for model in models:
video_folder = os.path.join(os.path.join(
task_folder, model), 'samples_separate')
videos = os.listdir(video_folder)
        os.system(f"""torchrun --nproc_per_node=1 --standalone vbench_eva.py \
            --dimension 'subject_consistency' 'background_consistency' 'motion_smoothness' 'dynamic_degree' 'aesthetic_quality' \
            --videos_path {video_folder} \
            --exp_name '{model}' \
            --output_path "evaluation_results/Open-X/{task}" \
            --mode=custom_input""")
print(task)
CLIP_T_scores = []
CLIP_I_scores = []
for video in videos:
prompt = video.replace('_frame_0_sample0.mp4', '').replace(
'_sample0.mp4', '').replace('?', '').replace('_', ' ')
# prompt = video.replace('_sample0.mp4','').replace('_',' ')
row = df[df['name'] == prompt]
# name = row.iloc[0]['name']
page_dir = row.iloc[0]['page_dir'].replace('share', 'home')
gt_path = f"{page_dir}/{row.iloc[0]['videoid']}.mp4"
# print(gt_path)
video_path = os.path.join(video_folder, video)
            # Last frame of the generated video.
            cap = cv2.VideoCapture(video_path)
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
            cap.set(cv2.CAP_PROP_POS_FRAMES, total_frames - 1)
            ret, frame = cap.read()
            cap.release()
            # Last frame of the ground-truth video.
            gt_cap = cv2.VideoCapture(gt_path)
            gt_total_frames = int(gt_cap.get(cv2.CAP_PROP_FRAME_COUNT))
            gt_cap.set(cv2.CAP_PROP_POS_FRAMES, gt_total_frames - 1)
            gt_ret, gt_frame = gt_cap.read()
            gt_cap.release()
            # OpenCV decodes BGR HxWxC; CLIP expects RGB (CxHxW for CLIPScore).
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            gt_frame = cv2.cvtColor(gt_frame, cv2.COLOR_BGR2RGB)
            CLIP_T_score = metric(torch.from_numpy(frame).permute(2, 0, 1).cuda(), prompt)
            CLIP_T_scores.append(CLIP_T_score.detach().cpu().numpy())
try:
pred_feature = clip_I_model.encode_image(clip_I_preprocess(
Image.fromarray(frame)).unsqueeze(0).to('cuda'))
gt_feature = clip_I_model.encode_image(clip_I_preprocess(
Image.fromarray(gt_frame)).unsqueeze(0).to('cuda'))
pred_feature = pred_feature / \
pred_feature.norm(dim=1, keepdim=True).to(torch.float32)
gt_feature = gt_feature / \
gt_feature.norm(dim=1, keepdim=True).to(torch.float32)
logit_scale = clip_I_model.logit_scale.exp()
clip_I_score = logit_scale * (pred_feature * gt_feature).sum()
CLIP_I_scores.append(clip_I_score.detach().cpu().numpy())
            except Exception:
                continue
# print(score)
model_clip_results[model] = {
"CLIP-I": np.array(CLIP_I_scores).mean(), "CLIP-T": np.array(CLIP_T_scores).mean()}
for model, values in model_clip_results.items():
print(f"{model}: {values}")