Vision-Based Manipulators Need to Also See from Their Hands

Kyle Hsu*, Moo Jin Kim*, Rafael Rafailov, Jiajun Wu, Chelsea Finn

Stanford University

ICLR 2022 (Oral Presentation)

Paper: https://arxiv.org/abs/2203.12677

Code (Cube Grasping): https://github.com/moojink/cube-grasping

*Co-first authorship. Order determined by coin flip.

Abstract: We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations. Compared with the more commonly used global third-person perspective, a hand-centric (eye-in-hand) perspective affords reduced observability, but we find that it consistently improves training efficiency and out-of-distribution generalization. These benefits hold across a variety of learning algorithms, experimental settings, and distribution shifts, and for both simulated and real-world robot apparatuses. However, this is only the case when hand-centric observability is sufficient; otherwise, including a third-person perspective is necessary for learning, but also harms out-of-distribution generalization. To mitigate this, we propose to regularize the third-person information stream via a variational information bottleneck. On six representative manipulation tasks with varying hand-centric observability adapted from the Meta-World benchmark, this enables a state-of-the-art reinforcement learning agent operating from both perspectives to improve its out-of-distribution generalization on every task. While some practitioners have long put cameras in the hands of robots, our work systematically analyzes the benefits of doing so and provides simple and broadly applicable insights for improving vision-based robotic manipulation.
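For readers who want a concrete picture of the proposed regularization, here is a minimal PyTorch sketch of a variational information bottleneck (VIB) applied to the third-person image stream. It is an illustrative sketch under assumed details rather than the paper's exact implementation: the encoder architecture, the names ThirdPersonVIBEncoder and feature_dim, and the loss wiring at the end are assumptions made for this example.

```python
# Minimal, illustrative sketch of a VIB-regularized third-person encoder
# (not the authors' exact implementation; architecture and names are assumed).
import torch
import torch.nn as nn

class ThirdPersonVIBEncoder(nn.Module):
    """Encodes a third-person RGB image into a stochastic representation z_3."""
    def __init__(self, feature_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # LazyLinear infers the flattened conv output size on the first call.
        self.fc = nn.LazyLinear(2 * feature_dim)  # outputs [mu, log_std]

    def forward(self, third_person_obs):
        h = self.conv(third_person_obs)
        mu, log_std = self.fc(h).chunk(2, dim=-1)
        std = torch.exp(log_std)
        z3 = mu + std * torch.randn_like(std)  # reparameterization trick
        # KL( q(z_3 | o_3) || N(0, I) ), averaged over the batch.
        kl = 0.5 * (mu.pow(2) + std.pow(2) - 2 * log_std - 1).sum(-1).mean()
        return z3, kl

# During agent updates, z_3 is concatenated with the hand-centric features and the
# KL term is added to the usual loss, scaled by a bottleneck coefficient (beta):
#   z = torch.cat([z_hand, z3], dim=-1)
#   loss = task_loss(z, ...) + beta * kl
```

Penalizing the KL to a fixed prior limits how much information from the third-person view reaches the policy, matching the goal stated above: keep the third-person view available when it is needed for learning while mitigating its harm to out-of-distribution generalization.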

Hand-Centric vs. Third-Person Perspectives

Cube Grasping Experimental Setup

Distribution Shift via Transformations on Table Height

Visualization of the table height distribution shift used in the cube grasping experiments, along with the distribution of initial object and end-effector positions.

z_shift = -0.10

z_shift = -0.05

z_shift = 0

z_shift = +0.05

z_shift = +0.10

Distribution Shift via Transformations on Distractor Objects

Visualization of the distractor objects distribution shift used in the cube grasping experiments, along with the distribution of initial object and end-effector positions. The distribution shift "3 black (test)" is not depicted below due to space constraints. Note that, as in the other experiment variants, the task is to grasp and lift the dark brown textured cube.

mix

3 red

3 green

3 blue

3 brown

3 white

Distribution Shift via Transformations on Table Textures

Visualization of the table textures distribution shift used in the cube grasping experiments, along with the distribution of initial object and end-effector positions.

5 textures

20 held-out textures

Cube Grasping Experimental Results

DAgger and DrQ results for cube grasping. The first, second, and third rows respectively contain results for the table height, distractor objects, and table textures experiment variants. Compared to the third-person perspective (dashed lines), the hand-centric perspective (solid lines) leads to better out-of-distribution generalization performance across all three distribution shifts for both DAgger and DrQ. For DrQ, we also see appreciable improvements in sample efficiency when using the hand-centric perspective.

DAC results for cube grasping. Left: base variant (initial object and end-effector position randomization) with no distribution shift between demo collection and training. Center: base variant with table height shift between collection of 25 demos and training. Right: base variant plus three distractor objects with no distribution shift between demo collection and training. Across the three experiment variants, the hand-centric perspective (solid lines) enables the agent to generalize in- and out-of-distribution more efficiently and effectively than the third-person perspective (dashed lines).

Cube Grasping Learned Policy Videos

The following videos are of DAgger-trained policies; we omit DrQ-trained/DAC-trained policy videos to avoid redundancy, as they illustrate largely the same differences in performance between the hand-centric and third-person perspectives.

Distribution Shift via Transformations on Table Height

Rollouts of the learned policy on the train and test conditions for the table height experiment variant. For DAgger, we train on z_shift = 0 and test on z_shift = {-0.10, -0.05, +0.05, +0.10}. The agent achieves higher success rates given the hand-centric view. The images appear blurrier than in the visualizations above because these are the actual 84x84 image observations given to the agent.

z_shift = -0.10 (test)

z_shift = -0.05 (test)

z_shift = 0 (train)

z_shift = +0.05 (test)

z_shift = +0.10 (test)

Distribution Shift via Transformations on Distractor Objects

Rollouts of the learned policy on the train and test conditions for the distractor objects experiment variant. For DAgger, we train on a mix of distractors (1 red, 1 green, 1 blue) and test on {3 red, 3 green, 3 blue, 3 brown, 3 white, 3 black}. The agent achieves overall higher success rates given the hand-centric view, but the differences in performance are less drastic: the hand-centric perspective leads to better generalization performance only for the "3 brown" (depicted below) and "3 black" (not depicted) distribution shifts.

mix (train)

3 red (test)

3 green (test)

3 blue (test)

3 brown (test)

3 white (test)

Distribution Shift via Transformations on Table Textures

Rollouts of the learned policies on the train and test conditions for the table textures experiment variant. We train on a set of 5 table textures from the Describable Textures Dataset (DTD) (Cimpoi et al., 2014) and test on a set of 20 held-out table textures. The agent achieves higher success rates given the hand-centric view.

5 textures (train)

20 held-out textures (test)

Combining Hand-Centric and Third-Person Perspectives

Meta-World Experimental Setup

Visualization of the train and test distributions in six tasks adapted from the Meta-World benchmark (Yu et al., 2020). The last two, reach-hard and peg-insert-side-hard, are custom-made. The hand-centric observability decreases from top to bottom (high, moderate, low). The train and test distributions are disjoint: at test time, we resample initial object positions whenever they could have been seen during training (a minimal, illustrative sketch of this resampling follows the task panels below).

handle-press-side (train)

handle-press-side (test)

button-press (train)

button-press (test)

soccer (train)

soccer (test)

peg-insert-side (train)

peg-insert-side (test)

reach-hard (train)

reach-hard (test)

peg-insert-side-hard (train)

peg-insert-side-hard (test)
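As a purely illustrative example of how disjoint train/test initial-position distributions can be constructed by resampling, the sketch below rejection-samples a test-time position from a wider range and discards any candidate that falls inside the training range; the bounds and the function name sample_test_position are hypothetical, not the benchmark's actual parameters.

```python
# Illustrative sketch of resampling test-time initial object positions so they are
# disjoint from the training distribution (bounds below are hypothetical).
import numpy as np

def sample_test_position(rng, train_low, train_high, test_low, test_high):
    """Rejection-sample a position from the test range that lies outside the train range."""
    while True:
        pos = rng.uniform(test_low, test_high)
        seen_in_training = np.all(pos >= train_low) and np.all(pos <= train_high)
        if not seen_in_training:
            return pos

rng = np.random.default_rng(0)
pos = sample_test_position(
    rng,
    train_low=np.array([-0.05, 0.60]), train_high=np.array([0.05, 0.70]),
    test_low=np.array([-0.15, 0.50]), test_high=np.array([0.15, 0.80]),
)
```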

Meta-World Experimental Results

DrQ-v2 results for Meta-World. Each row contains results for two manipulation tasks that roughly exhibit the same level of hand-centric observability, which decreases from top to bottom (high, moderate, low). Using the proposed approach (both perspectives with a VIB on the third-person perspective's representation, represented by orange curves) leads to the best out-of-distribution generalization performance for all levels of hand-centric observability (though it is matched by the hand-centric perspective when hand-centric observability is high).

Meta-World Learned Policy Videos

Rollouts of the learned DrQ-v2 policies on the train (left two columns) and test conditions (right two columns) for the six Meta-World tasks. We compare the vanilla combination of both hand-centric and third-person views with the version that regularizes the third-person information stream with a variational information bottleneck (VIB). Each GIF shows 3 episodes. The VIB-regularized agent is often able to solve tasks with initial object configurations that confuse the other (vanilla) approach.

Note that these videos differ from the ones shown for the cube grasping environment: here we show policies trained with both hand-centric and third-person perspectives combined, whereas in cube grasping the policies were trained with a single perspective. Therefore, to compare the GIFs below, one should compare different columns to each other (e.g., first vs. second column, and third vs. fourth column) rather than different rows.

handle-press-side (train)

both views (vanilla)

handle-press-side (train)

both views + VIB(z_3)

handle-press-side (test)

both views (vanilla)

handle-press-side (test)

both views + VIB(z_3)

button-press (train)

both views (vanilla)

button-press (train)

both views + VIB(z_3)

button-press (test)

both views (vanilla)

button-press (test)

both views + VIB(z_3)

soccer (train)

both views (vanilla)

soccer (train)

both views + VIB(z_3)

soccer (test)

both views (vanilla)

soccer (test)

both views + VIB(z_3)

peg-insert-side (train)

both views (vanilla)

peg-insert-side (train)

both views + VIB(z_3)

peg-insert-side (test)

both views (vanilla)

peg-insert-side (test)

both views + VIB(z_3)

reach-hard (train)

both views (vanilla)

reach-hard (train)

both views + VIB(z_3)

reach-hard (test)

both views (vanilla)

reach-hard (test)

both views + VIB(z_3)

peg-insert-side-hard (train)

both views (vanilla)

peg-insert-side-hard (train)

both views + VIB(z_3)

peg-insert-side-hard (test)

both views (vanilla)

peg-insert-side-hard (test)

both views + VIB(z_3)

Hand-Centric vs. Third-Person Perspective on a Real Robot

The video below demonstrates how a behavioral cloning policy using the hand-centric perspective outperforms a policy using the third-person perspective when tested against unseen distractor objects and human interventions. Results are shown on a real robot apparatus.

seeing-from-hands_uncut-video.mp4