Scenario 1: Performance on various tasks
We showcase successful trajectories collected by our method and compare them to unsuccessful ones collected by other methods. For success rates, see the quantitative results section below. Left: V-PTR (ours). Right (clockwise from top left): PTR (Kumar et al. 2023), VIP (Ma et al. 2022), R3M (Nair et al. 2022), and masked visual pre-training (Xiao et al. 2022).
Croissant from bowl
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Sweet potato on plate
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Knife in pot
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Cucumber in pot
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Open Microwave
V-PTR (ours)
PTR
VIP
Sweep Beans
V-PTR (ours)
PTR
VIP
Even when successful, VIP performs qualitatively worse (i.e., it sweeps a smaller area).
Scenario 2: Performing tasks with novel distractors
We introduce various distractors to test whether the robot is able to identify the correct object to pick up. The order of videos is the same as before.
Croissant from bowl with distractors
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Sweet potato on plate with distractors
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Knife in pot with distractors
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Cucumber in pot with distractors
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Our training data and evaluation setups use a variety of initial positions and distractor objects for each target task. Here are some examples of the data diversity in our training data for the "take croissant out of colander" task:
Scenario 3: New target objects
We replace the target object in one of the tasks (croissant out of bowl) with various other objects. Our method still achieves some success, while the other methods struggle.
Various objects in colander
V-PTR (ours)
PTR
Masked Visual Pre-training
VIP
R3M
Quantitative