1. More video prediction results in different stages of tool-use tasks (validation dataset)
Ground truth:
(first row)
Predictions:
(second row)
"Grasp the electrical drill and move to target"
"Use the hammer to hit the nail"
Ground truth:
(first row)
Predictions:
(second row)
"Transfer liquid from pot to blue bowl"
"use tool to suction black liquid/drop a liquid into orange liquid/grasp the suction tool/drop a liquid into black cup"
2. Predictions before/after fine-tuning on tool-use tasks
Ground truth:
(first row)
Predictions:
(second row)
Before fine-tuning on tool-use tasks, the video model already predicts reasonable futures zero-shot (it grasps different parts of the spoon).
After fine-tuning on tool-use tasks, the video model's predicted futures align with the demonstrations (it always grasps the handle of the spoon).