VPP_rebuttal

Videos for VPP rebuttal

For Reviewer Y6gd:

1. More video prediction results at different stages of tool-use tasks (validation dataset)

Ground truth:

(first row)

Predictions:

(second row)

"Grasp the electrical drill and move to target"

"Use the hammer to hit the nail"

Ground truth:

(first row)

Predictions:

(second row)

"Transfer liquid from pot to blue bowl"

"use tool to suction black liquid/drop a liquid into orange liquid/grasp the suction tool/drop a liquid into black cup"

2. Predictions before/after fine-tuning on tool-use tasks

Ground truth:

(first row)

Predictions:

(second row)

Before fine-tuning on tool tasks, the video model already predicts reasonable futures zero-shot (it grasps different parts of the spoon).

After fine-tuning on tool tasks, the predicted futures align with the demonstrations (it always grasps the handle of the spoon).
