Q Value Visualizations
Our demonstrations were labeled with a positive reward at the end of each trajectory and negative rewards elsewhere. A correct value function should therefore trend upward over the course of task execution. All methods achieve this on the training data, but our method also does so on a held-out dataset and when novel distractors are introduced, where the other methods degrade.
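To make the labeling scheme concrete, below is a minimal sketch of how such sparse rewards and the resulting discounted returns could be computed for a demonstration trajectory. The exact reward magnitudes (+1 at the final step, -1 elsewhere) and the discount factor are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

def label_sparse_rewards(traj_len, success_reward=1.0, step_reward=-1.0):
    """Assumed labeling: positive reward on the final transition of a
    successful demonstration, negative reward on every other step."""
    rewards = np.full(traj_len, step_reward, dtype=np.float32)
    rewards[-1] = success_reward
    return rewards

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo returns under this labeling increase toward the end of a
    successful trajectory -- the upward trend a correct value function
    should reproduce when evaluated along the demonstration."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: returns rise monotonically toward the final, positively labeled step.
print(discounted_returns(label_sparse_rewards(traj_len=10)))
```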
Grad-CAM
Grad-CAM visuals superimposed on frames from robot data. Regions highlighted in green denote the image patches that most strongly influence the learned policy. Without video pre-training (PTR), background areas in the image exert a significant influence on the policy's output. In contrast, initializing the policy with the video pre-trained representation enables it to focus on gripper and object positions, which are crucial for solving the task.
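For reference, the heatmaps above can be produced with standard Grad-CAM: gradients of the policy (or Q-value) output with respect to a convolutional feature map are averaged per channel, used to weight the activations, and the result is upsampled to the image resolution. The sketch below is a generic PyTorch implementation under assumed model and layer names (`model`, `conv_layer`), not the exact code used for these figures.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image, output_index=None):
    """Compute a Grad-CAM heatmap for `model` at `conv_layer`.

    `image` is a (1, C, H, W) float tensor; the heatmap is returned at the
    input resolution, normalized to [0, 1].
    """
    activations, gradients = {}, {}

    def fwd_hook(module, inp, out):
        activations["value"] = out

    def bwd_hook(module, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h_fwd = conv_layer.register_forward_hook(fwd_hook)
    h_bwd = conv_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        output = model(image)                    # e.g. Q-values or action logits
        if output_index is None:
            output_index = output.argmax(dim=-1)
        score = output.gather(-1, output_index.view(1, 1)).squeeze()
        score.backward()
    finally:
        h_fwd.remove()
        h_bwd.remove()

    acts = activations["value"]                  # (1, K, h, w) feature maps
    grads = gradients["value"]                   # (1, K, h, w) gradients
    weights = grads.mean(dim=(2, 3), keepdim=True)   # channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze().detach()
```

The resulting heatmap can then be overlaid on the input frame (e.g., as a green-tinted mask) to show which image patches most influence the policy's output, as in the figures above.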