Generalization Through Hand-Eye Coordination:
An Action Space for Learning Spatially-Invariant Visuomotor Control
Video Summary
Motivation
- Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data.
- IL methods often fail to generalize to new scene configurations not covered by training data.
- Humans are good at generalization.
Goal: Learn policies that can generalize through a human-inspired hand-eye coordination mechanism.
The Tea-Serving Example
A manipulation example: tea-serving. In this task, the robot needs to grasp the teapot and pour tea into the cup. The figure shows what the robot needs to do at the beginning of the task: go to the teapot handle. Left: how a human reasons about actions when teleoperating the robot to serve tea (the human's gaze focuses on the object). Right: what the robot should learn to do (1. place a visual keypoint on the right object; 2. reason about its 3D location; and 3. compute the action accordingly).
Intuition:
Human Hand-eye Coordination
Visual Attention: human vision focuses on the object to grasp. The "keypoint" is on the teapot handle.
Action Space: human action is object-grounded, being "attracted" to the teapot handle.
We want the robot to learn to imitate this structure: (1) find a visual keypoint, (2) project the keypoint to 3D, and (3) compute actions according to the keypoint.
Contributions
1. We develop a novel action space, Hand-eye Action Networks (HAN), for learning human-like hand-eye coordination behaviors end-to-end from human-teleoperated demonstrations.
2. To enable tight coupling between visual attention and robot actions, we develop a novel 3D attention mechanism that learns to generate 3D keypoints at task-relevant locations for guiding robot movements, without direct keypoint supervision.
3. We evaluate on three simulated continuous control tasks of varying difficulties, and demonstrate that our action space enables a policy to generalize to tasks with unseen environment configurations in a zero-shot manner. We also show that the learned action space qualitatively exhibits human-like coordinated hand-eye movements.
Data Collection
We collect datasets via human teleoperation with RoboTurk.
Hand-Eye Action Network (HAN)
Hand-eye Action Space
Instead of letting a policy network predict the action directly, we predict a keypoint (kp), a 3D offset, and a scalar gain k. We then combine these predictions with the end-effector (ee) position to compute the final action, as illustrated in the figure.
The task of HAN is to learn to predict values within this hand-eye action space, before formulating the final action (delta position) of the robot.
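The text does not spell out how the predictions are combined, so the following is only a minimal sketch under an assumed rule: the end effector is pulled toward a target point (keypoint plus offset), scaled by the gain, i.e. action = k * (kp + offset - ee). The function name `hand_eye_action` and this exact formula are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def hand_eye_action(
    keypoint: Sequence[float],
    offset: Sequence[float],
    gain: float,
    ee_pos: Sequence[float],
) -> List[float]:
    """Compute a delta-position action from hand-eye action space values.

    ASSUMED rule (not confirmed by the source):
        action = gain * ((keypoint + offset) - ee_pos)
    All vectors are 3D positions expressed in the same coordinate frame.
    """
    return [
        gain * (kp + off - ee)
        for kp, off, ee in zip(keypoint, offset, ee_pos)
    ]


if __name__ == "__main__":
    # Hypothetical numbers: keypoint on a handle, small upward offset.
    action = hand_eye_action(
        keypoint=[0.5, 0.0, 0.2],
        offset=[0.0, 0.0, 0.1],
        gain=0.5,
        ee_pos=[0.1, 0.0, 0.1],
    )
    print(action)
```

Under this assumed rule, translating the keypoint and the end effector by the same vector leaves the action unchanged, which is one way to read the "spatially-invariant" property in the title: the action depends on where the hand is relative to the attended object, not on absolute scene coordinates.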
HAN Architecture
HAN is a policy network - it takes an image as observation, and outputs the action for the robot.
We train this network end-to-end with supervision from the human demonstration dataset of (observation, action) pairs.
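As a rough illustration of supervised training on (observation, action) pairs, the sketch below computes a mean squared error between the actions a policy predicts and the actions the human demonstrator took. The choice of mean squared error and the name `behavior_cloning_loss` are assumptions for illustration; the source does not state which loss is used.

```python
from typing import Sequence


def behavior_cloning_loss(
    predicted_actions: Sequence[Sequence[float]],
    demo_actions: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over a batch of actions.

    ASSUMPTION: MSE is used here only as a common imitation-learning
    objective; the actual training loss is not given in the source.
    """
    if len(predicted_actions) != len(demo_actions):
        raise ValueError("batch sizes differ")
    total = 0.0
    count = 0
    for pred, demo in zip(predicted_actions, demo_actions):
        if len(pred) != len(demo):
            raise ValueError("action dimensions differ")
        for p, d in zip(pred, demo):
            total += (p - d) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty batch")
    return total / count


if __name__ == "__main__":
    # Hypothetical batch of two 3D delta-position actions.
    predicted = [[0.10, 0.00, 0.05], [0.00, 0.02, 0.00]]
    demonstrated = [[0.12, 0.00, 0.05], [0.00, 0.00, 0.00]]
    print(behavior_cloning_loss(predicted, demonstrated))
```

Because the loss is defined only on the final action, any intermediate quantity the network produces on the way to that action receives gradient signal without needing its own labels.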
HAN consists of three major components:
- 3D visual attention network
Take N random local crops ("regions") on the observation image; pass through a CNN to acquire N feature embeddings. Predict a 2D keypoint in each region, and project from 2D to 3D, obtaining N 3D keypoint coordinates.
- Attention switching network
Predict a confidence score for each region. Use the confidence score to compute a weighted average of all keypoint coordinates, as the final keypoint.
- Action target network
Use an MLP to predict a 3D offset and a scalar gain. Keypoint coordinates, 3D offset and gain form the hand-eye action space. The final action for the robot can be computed from these values. Gripper open/close is predicted separately.
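The first two components can be sketched as follows: lift each region's 2D keypoint to 3D, then fuse the N candidates with confidence weights. Several details here are assumptions, since the source does not specify them: the 2D-to-3D step is written as pinhole back-projection with a known depth and camera intrinsics (`fx`, `fy`, `cx`, `cy`), and the confidence scores are normalized with a softmax before averaging. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> List[float]:
    """Lift pixel (u, v) with known depth to a 3D point in the camera frame.

    ASSUMPTION: standard pinhole camera model with known intrinsics.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return [x, y, depth]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-region confidence scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_keypoints(keypoints_3d: Sequence[Sequence[float]],
                   scores: Sequence[float]) -> List[float]:
    """Confidence-weighted average of N candidate 3D keypoints.

    ASSUMPTION: weights come from a softmax over the raw scores.
    """
    weights = softmax(scores)
    return [
        sum(w * kp[i] for w, kp in zip(weights, keypoints_3d))
        for i in range(3)
    ]


if __name__ == "__main__":
    # Hypothetical intrinsics and two candidate regions.
    intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    candidates = [
        backproject(320.0, 240.0, 1.0, **intrinsics),
        backproject(420.0, 240.0, 1.0, **intrinsics),
    ]
    print(fuse_keypoints(candidates, scores=[2.0, 0.0]))
```

A soft weighted average keeps the final keypoint differentiable with respect to every region's score, which fits end-to-end training; when one score dominates, the fused keypoint approaches that region's keypoint, so attention can effectively switch between regions.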
Experiments
We study HAN with three experiments. Expert demonstrations are shown below. During training, objects are initialized within the regions bounded by blue lines. During testing, objects are initialized within the regions bounded by red lines.
Lifting
The robot needs to lift the cube to a certain height.
Stacking
The robot needs to stack the red cube on top of the green plate.
Tool-using
The robot cannot reach the blue cube at the beginning, so it needs to use the red tool to hook the blue cube closer, then place the blue cube inside the brown hole.