Generalization Through Hand-Eye Coordination:

An Action Space for Learning Spatially-Invariant Visuomotor Control

Chen Wang* Rui Wang* Ajay Mandlekar Silvio Savarese Li Fei-Fei Danfei Xu

Contact: chenwj@stanford.edu

Video Summary

Motivation

Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data.
IL methods often fail to generalize to new scene configurations not covered by training data.
Humans are good at generalization.

Goal: Learn policies that can generalize through a human-inspired hand-eye coordination mechanism.

The Tea-Serving Example

A manipulation example: tea-serving. In this task, the robot needs to grasp the teapot and pour tea into the cup. The figure shows what the robot needs to do at the beginning of the task: move to the teapot handle. Left: how a human reasons about actions when teleoperating the robot to serve tea (the human's gaze focuses on the object). Right: what the robot should learn to do: (1) place a visual keypoint on the right object; (2) reason about its 3D location; and (3) compute the action accordingly.

Intuition:

Human Hand-eye Coordination

Visual Attention: human vision focuses on the object to grasp. The "keypoint" is on the teapot handle.

Action Space: human action is object-grounded, being "attracted" to the teapot handle.

We want the robot to learn to imitate this structure: (1) find a visual keypoint, (2) project the keypoint to 3D, and (3) compute actions according to the keypoint.

Contributions

1. We develop a novel action space, Hand-eye Action Networks (HAN), for learning human-like hand-eye coordination behaviors end-to-end from human-teleoperated demonstrations.

2. To enable tight coupling between visual attention and robot actions, we develop a novel 3D attention mechanism that learns to generate 3D keypoints at task-relevant locations for guiding robot movements, without direct keypoint supervision.

3. We evaluate on three simulated continuous control tasks of varying difficulties, and demonstrate that our action space enables a policy to generalize to tasks with unseen environment configurations in a zero-shot manner. We also show that the learned action space qualitatively exhibits human-like coordinated hand-eye movements.

Data Collection

We collect datasets via human teleoperation with Roboturk.

Hand-Eye Action Network (HAN)

Hand-eye Action Space

Instead of letting a policy network predict the action directly, we predict a keypoint (kp), an offset and a scalar gain k. We then combine the predictions with the end effector (ee) position to compute the final action, as illustrated in the figure.

The task of HAN is to learn to predict values within this hand-eye action space, before formulating the final action (delta position) of the robot.
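The combination step can be sketched in a few lines. The exact formula is given only in the figure, which is not reproduced here, so the form below is an assumption: the keypoint plus the offset defines a target, and the action is the gain times the vector from the end effector to that target. The function name `hand_eye_action` and the numeric values are hypothetical.

```python
import numpy as np

def hand_eye_action(kp, offset, gain, ee_pos):
    """Delta-position action under an ASSUMED form: gain * ((kp + offset) - ee_pos)."""
    kp = np.asarray(kp, dtype=float)
    offset = np.asarray(offset, dtype=float)
    ee_pos = np.asarray(ee_pos, dtype=float)
    target = kp + offset              # object-grounded target point
    return gain * (target - ee_pos)   # move the end effector toward the target

# Hypothetical values, in meters.
kp = np.array([0.5, 0.1, 0.2])
offset = np.array([0.0, 0.0, 0.05])
ee = np.array([0.4, 0.1, 0.3])
action = hand_eye_action(kp, offset, 0.5, ee)

# Translating the whole scene (keypoint and end effector together)
# leaves the action unchanged under this assumed form.
shift = np.array([0.3, -0.2, 0.0])
shifted_action = hand_eye_action(kp + shift, offset, 0.5, ee + shift)
```

Under this assumed form the action depends only on the position of the end effector relative to the keypoint, which is one way to read the "spatially-invariant" property in the title.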

HAN Architecture

HAN is a policy network - it takes an image as observation, and outputs the action for the robot.

We train this network end-to-end with supervision from the human demonstration dataset of (observation, action) pairs.
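As a rough illustration of what supervision from (observation, action) pairs means, the sketch below computes a behavior-cloning loss between predicted and demonstrated actions. The page does not state which loss is used; the mean squared error here, and the function name `bc_loss`, are assumptions.

```python
import numpy as np

def bc_loss(pred_actions, demo_actions):
    """Mean over the batch of the squared L2 distance between predicted
    and demonstrated actions (an ASSUMED choice of loss)."""
    pred = np.asarray(pred_actions, dtype=float)
    demo = np.asarray(demo_actions, dtype=float)
    per_sample = np.sum((pred - demo) ** 2, axis=-1)
    return float(np.mean(per_sample))

# Two hypothetical (predicted, demonstrated) delta-position pairs.
pred = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
demo = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
loss = bc_loss(pred, demo)
```

Because the final action is computed from the keypoint, offset and gain, a loss on the action alone can still shape all three predictions, which is consistent with training without direct keypoint labels.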

HAN consists of three major components:

3D visual attention network

Take N random local crops ("regions") on the observation image; pass through a CNN to acquire N feature embeddings. Predict a 2D keypoint in each region, and project from 2D to 3D, obtaining N 3D keypoint coordinates.
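The 2D-to-3D step can be illustrated with a standard pinhole back-projection. The page does not say how the projection is done, so this is only a sketch under stated assumptions: a depth value is available at the keypoint pixel, the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are known, and the keypoint is predicted in crop-local pixel coordinates. All names and values are hypothetical.

```python
import numpy as np

def backproject_keypoint(uv_crop, crop_origin, depth, fx, fy, cx, cy):
    """Lift a crop-local 2D keypoint to a 3D point in the camera frame
    using the pinhole model (ASSUMED; the source does not specify it)."""
    # Convert crop-local pixel coordinates to full-image pixel coordinates.
    u = crop_origin[0] + uv_crop[0]
    v = crop_origin[1] + uv_crop[1]
    # Pinhole back-projection: pixel plus depth gives a camera-frame point.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth], dtype=float)

# Hypothetical 640x480 camera with focal length 500 px.
fx = fy = 500.0
cx, cy = 320.0, 240.0

# A keypoint at crop pixel (20, 40) inside a crop whose top-left corner is (400, 200).
p_cam = backproject_keypoint((20.0, 40.0), (400.0, 200.0), 2.0, fx, fy, cx, cy)
```

Applying this to each of the N regions would give the N 3D keypoint candidates; a further camera-to-robot transform would be needed to express them in the same frame as the end effector.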

Attention switching network

Predict a confidence score for each region. Use the confidence score to compute a weighted average of all keypoint coordinates, as the final keypoint.
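The weighted average can be sketched as follows. The page does not say how the confidence scores are normalized; the softmax used here is an assumption, and `fuse_keypoints` is a hypothetical name.

```python
import numpy as np

def fuse_keypoints(keypoints, scores):
    """Confidence-weighted average of N 3D keypoints.
    Softmax normalization of the scores is an ASSUMPTION."""
    keypoints = np.asarray(keypoints, dtype=float)  # shape (N, 3)
    scores = np.asarray(scores, dtype=float)        # shape (N,)
    weights = np.exp(scores - scores.max())         # shift for numerical stability
    weights /= weights.sum()
    return weights @ keypoints, weights

# Three hypothetical region keypoints.
kps = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])

# Equal scores give the plain mean of the keypoints.
mean_kp, w_equal = fuse_keypoints(kps, [0.0, 0.0, 0.0])

# One dominant score pulls the final keypoint onto that region's keypoint.
focused_kp, w_focused = fuse_keypoints(kps, [0.0, 50.0, 0.0])
```

With a normalization like this, a change in which region has the highest score moves the final keypoint from one object to another, which is one way the attention could switch between stages of a task.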

Action target network

Use an MLP to predict a 3D offset and a scalar gain. Keypoint coordinates, 3D offset and gain form the hand-eye action space. The final action for the robot can be computed from these values. Gripper open/close is predicted separately.

Experiments

We study HAN with 3 experiments. Expert demonstrations are shown below. During training, objects are initialized within blue line-bounded regions. During testing, objects are initialized within red line-bounded regions.

Lifting

The robot needs to lift the cube to a target height.

Stacking

The robot needs to stack the red cube on top of the green plate.

Tool-using

The robot cannot reach the blue cube at the beginning, so it needs to use the red tool to hook the blue cube closer, then place the blue cube inside the brown hole.

Quantitative Results

In all three experiments, HAN outperforms vanilla behavior cloning (BC) and model variants that lack the hand-eye action space or the visual attention mechanism.

Qualitative Results

The annotations for HAN used in the videos below.

Rollout videos of HAN after training in three environments

Lifting

Stacking

Tool-Using

Key snapshots from the Stacking and Tool-Using rollouts, showing the effect of attention switching during multi-stage tasks.
