PolyBot: Training One Policy Across Robots While Embracing Variability
Jonathan Yang, Dorsa Sadigh, Chelsea Finn
jyang27@cs.stanford.edu
Motivation
When collecting large datasets to scale robotic learning, it is important to ensure the data is applicable to a wide range of robotic setups.
However, a large domain shift exists between robotic platforms, stemming from four sources of variation: control scheme, camera viewpoint, kinematic configuration, and end-effector morphology.
Collecting data that spans this domain shift is infeasible due to the large cost of building new robotic setups.
Prior works have focused on generalizing across some of these factors of variation while fixing others.
Fixing factors of variation reduces the domain shift, but at the cost of the generality of the setup.
Unlike these works, we do not constrain the camera viewpoint, embodiment, or low-level controller to be fixed.
Method
Our framework aligns the input, output, and internal representation spaces of our policy across embodiments.
We propose a set of design choices that aligns the observation and action spaces across robots, greatly reducing the domain shift without sacrificing the generality of our setup.
We utilize 3D-printed wrist-camera mounts that keep the end-effector in view, without assuming a fixed camera angle.
We use a shared higher-level environment and inverse-kinematics solver, but allow a modular low-level controller for each robot that can vary.
We then train a task-conditioned multiheaded policy, where each head captures robot-specific dynamics information.
Finally, we exploit a consistent, low-dimensional proprioceptive state signal to align our policy's internal representations, using a contrastive learning approach that maps similar states across trajectories together.
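The multiheaded design above can be sketched as a shared trunk with one action head per robot. All names, dimensions, and the linear/tanh architecture below are illustrative assumptions, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiHeadPolicy:
    """Sketch: a shared trunk produces a robot-agnostic representation;
    each robot-specific head maps it to that robot's action space."""
    def __init__(self, obs_dim, hidden_dim, action_dim, robots):
        self.W_trunk = rng.standard_normal((obs_dim, hidden_dim)) * 0.1
        self.heads = {r: rng.standard_normal((hidden_dim, action_dim)) * 0.1
                      for r in robots}

    def act(self, obs, robot):
        h = np.tanh(obs @ self.W_trunk)   # shared internal representation
        return h @ self.heads[robot]      # robot-specific dynamics head

policy = MultiHeadPolicy(obs_dim=8, hidden_dim=16, action_dim=7,
                         robots=["franka", "sawyer", "widowx"])
obs = rng.standard_normal(8)
a = policy.act(obs, "widowx")
print(a.shape)  # (7,)
```

In this sketch only the head weights differ per robot, so data from every robot trains the shared trunk while each head absorbs embodiment-specific dynamics.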
Contrastive Internal Representation Alignment
Our use of a shared inverse-kinematics solver yields a proprioceptive signal, expressed with respect to the robot base, that is consistent across robots.
For each trajectory, we define a "fixed state" as a state which has a consistent notion of goal completion across robots.
We compute the difference between the current state and fixed state for all states in the trajectory.
Then, we map similar differences together across trajectories.
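The three steps above can be sketched with an InfoNCE-style objective over differences to the fixed state, where matching differences across trajectories form positive pairs. The loss form, temperature, and the toy trajectories are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss pulling the anchor embedding toward the positive
    and away from the negatives (a generic sketch of contrastive alignment)."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(anchor, positive)] +
                      [sim(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                    # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Toy differences to the fixed state for two trajectories (one row per
# timestep); rows with similar differences should be mapped together.
d1 = np.array([[0.40, 0.20], [0.20, 0.10], [0.00, 0.00]])
d2 = np.array([[0.41, 0.19], [0.19, 0.11], [0.00, 0.01]])

# Similar differences across trajectories are positives; dissimilar ones
# are negatives, so the matched pairing yields the lower loss.
loss = info_nce(d1[0], d2[0], [d2[1], d2[2]])
print(loss < info_nce(d1[0], d2[2], [d2[0], d2[1]]))  # True
```

Because the differences are taken against each trajectory's own fixed state, the same loss can be applied across robots without sharing an absolute coordinate frame.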
Our Robots!
We evaluate our method on the Franka Emika Panda, Rethink Robotics Sawyer, and Trossen Robotics WidowX 250S. Each of these robots has a different size and kinematic configuration.
Evaluation
We evaluate on two different sets of tasks: Pick/Place and Shelf Manipulation.
Each set of tasks has a dataset shared across the three robots.
We then collect a dataset for each task variant on the other robots.
We show successful few-shot transfer with only 5 demonstrations on the target robot.
Distractor Pick/Place
New Object Pick/Place
New Container Pick/Place
Reversed Container Shelf Manipulation
New Compartment Shelf Manipulation
Paper