Model-Based Visual Planning with Self-Supervised Functional Distances
Stephen Tian1, Suraj Nair2, Frederik Ebert1,
Sudeep Dasari3, Benjamin Eysenbach3, Chelsea Finn2 and Sergey Levine1
1 University of California, Berkeley, 2 Stanford University, 3 Carnegie Mellon University
A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model and a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely from offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods.
We want to acquire general-purpose knowledge from unlabeled data to perform downstream tasks
The aim is to complete tasks specified by goal images, learning from an offline, unlabeled dataset
Learning goal-reaching policies with model-free offline RL is challenging, and methods based on dynamics models require planning heuristics
Our method, Model-Based Reinforcement Learning with Offline Learned Distances (MBOLD), has two main learned components:
Predictive dynamics model: In our experiments, we directly predict future image observations given the current image and an action sequence
Functional distance: Predicts the minimum number of timesteps required to transition between two states
The dynamics model and functional distance are used together during sampling-based planning.
The dynamics model and distance function are both learned from offline, unlabeled data. To learn an optimal distance function from suboptimal data, we relabel goals for each transition and perform offline Q-learning.
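The idea of relabeling goals and running offline Q-learning can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the MBOLD implementation: it uses a small tabular chain of states instead of images, relabels every transition with every state as a goal, and learns a cost-to-go table directly. The names `step`, `learn_distances`, and `distance` are hypothetical.

```python
import numpy as np

N_STATES = 6   # toy chain: states 0..5 (assumption, stands in for images)
N_ACTIONS = 2  # action 0 moves left, action 1 moves right

def step(s, a):
    """Toy ground-truth dynamics, used only to generate the offline dataset."""
    return min(N_STATES - 1, s + 1) if a == 1 else max(0, s - 1)

def learn_distances(transitions, n_sweeps=20):
    """Tabular offline Q-learning over goal-relabeled transitions.

    D[s, a, g] estimates the number of steps needed to reach goal g
    after taking action a in state s. No rewards are stored in the data;
    the target is derived from whether the next state equals the goal.
    """
    big = float(N_STATES)  # cap for goals not yet known to be reachable
    D = np.full((N_STATES, N_ACTIONS, N_STATES), big)
    for _ in range(n_sweeps):
        for s, a, s_next in transitions:
            for g in range(N_STATES):  # relabel this transition with each goal
                if s_next == g:
                    target = 1.0
                else:
                    target = min(big, 1.0 + D[s_next, :, g].min())
                D[s, a, g] = target  # dynamics are deterministic, so no step size
    return D

def distance(D, s, g):
    """Functional distance: minimum predicted steps from s to g."""
    return 0.0 if s == g else float(D[s, :, g].min())

# One suboptimal trajectory: sweep right to the end, then sweep back left.
actions = [1] * N_STATES + [0] * N_STATES
transitions, s = [], 0
for a in actions:
    s_next = step(s, a)
    transitions.append((s, a, s_next))
    s = s_next

D = learn_distances(transitions)
print(distance(D, 1, 0), distance(D, 0, 5))
```

In the recorded trajectory, state 0 is only revisited ten steps after state 1, yet the learned distance from state 1 to state 0 is one step, because the Q-learning backup takes a minimum over actions rather than following the data-collection behavior.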
The learned dynamics model rolls out candidate action sequences. The best actions are identified based on which future states the distance function finds are closest to the specified goal.
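A minimal sketch of this planning loop follows, assuming a simple random-shooting planner, which may differ from the sampling-based optimizer actually used. The functions `toy_dynamics` and `toy_distance` are hypothetical stand-ins for the learned video prediction model and the learned distance function; the state here is a 2D point rather than an image.

```python
import numpy as np

def toy_dynamics(state, action):
    """Stand-in for the learned dynamics model: the point moves by the action."""
    return state + action

def toy_distance(state, goal):
    """Stand-in for the learned distance: steps needed if each axis
    can move at most one unit per step."""
    return float(np.abs(goal - state).max())

def plan(state, goal, dynamics, distance, horizon=5, n_samples=256, rng=None):
    """Random shooting: sample action sequences, roll each one out with the
    dynamics model, and score the final predicted state with the distance."""
    rng = np.random.default_rng(0) if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, state.shape[0]))
    costs = np.empty(n_samples)
    for i in range(n_samples):
        s = state
        for t in range(horizon):
            s = dynamics(s, candidates[i, t])
        costs[i] = distance(s, goal)
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

start = np.array([0.0, 0.0])
goal = np.array([3.0, -2.0])
best_actions, best_cost = plan(start, goal, toy_dynamics, toy_distance)
print(best_actions[0], best_cost)
```

In a closed-loop controller one would typically execute only the first action of `best_actions`, observe the new state, and plan again.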
Our method is able to solve the task by first moving the object to the desired position and then moving the robot arm. Prior methods often ignore the object position and only match the arm.
MBOLD plans to move around the drawer handle to relocate the door, again displacing the robot from its goal position to do so.
Our method solves a real-world drawer opening task.
Third-person robot execution (3x)
MBOLD solves the drawer closing task, even when directly matching the arm position would fail.
Third-person robot execution (3x)
Visual Foresight (Ebert et al., 2018) simply matches the goal arm position when using pixel-wise MSE as a planning cost.
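One plausible reason for this behavior, offered here as an illustration rather than a result from the experiments, is that the arm covers far more pixels than the object, so it dominates a pixel-wise cost. The toy renderer below is hypothetical: it draws a large bright block for the arm and a small dim block for the object, then compares the MSE of two candidate outcomes against the goal image.

```python
import numpy as np

def render(arm_col, obj_col, size=32):
    """Hypothetical toy image: a large arm block and a small object block."""
    img = np.zeros((size, size))
    img[0:16, arm_col:arm_col + 6] = 1.0    # arm covers 16 x 6 = 96 pixels
    img[28:30, obj_col:obj_col + 2] = 0.5   # object covers 2 x 2 = 4 pixels
    return img

def pixel_mse(a, b):
    """Pixel-wise mean squared error between two images."""
    return float(np.mean((a - b) ** 2))

goal = render(arm_col=20, obj_col=20)
arm_matched = render(arm_col=20, obj_col=4)   # arm at goal, object not moved
obj_matched = render(arm_col=4, obj_col=20)   # object at goal, arm elsewhere

cost_arm_matched = pixel_mse(arm_matched, goal)
cost_obj_matched = pixel_mse(obj_matched, goal)
print(cost_arm_matched, cost_obj_matched)
```

In this toy setting the outcome that leaves the object in the wrong place receives the lower cost, so a planner minimizing pixel-wise MSE would prefer it over the outcome that actually moves the object.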