Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models

Abstract:

Humans are masters at quickly learning many complex tasks, relying on an approximate understanding of the dynamics of their environments. In much the same way, we would like our learning agents to quickly adapt to new tasks. In this paper, we explore how model-based Reinforcement Learning (RL) can enhance transfer to new tasks. We develop an algorithm that learns an action-conditional, predictive model of expected future observations, rewards and values from which a policy can be derived by following the gradient of the estimated value along imagined trajectories. We show how robust policy optimization can be achieved on robot manipulation tasks even with approximate models that are learned directly from vision and proprioception. We evaluate the efficacy of our approach in a transfer learning scenario, re-using previously learned models on tasks with different reward structures and visual distractors, and show a significant improvement in learning speed compared to strong off-policy baselines.
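To make the policy-improvement step concrete, the sketch below rolls a learned latent dynamics model forward for H = 5 imagined steps under the policy, sums the predicted rewards, bootstraps with a learned value function, and then ascends the gradient of that estimate. This is a minimal PyTorch sketch under our own assumptions (network sizes, a deterministic policy, and the module names `policy`, `transition`, `reward`, `value`), not the paper's exact architecture or training procedure.

```python
# Minimal sketch of the H-step "imagined value" objective; module names,
# sizes, and the deterministic policy are our own assumptions.
import torch
import torch.nn as nn

H, GAMMA = 5, 0.99           # IVG(5): 5-step imagined rollouts
LATENT, ACTION = 128, 5      # assumed latent size; 5 action dimensions

policy = nn.Sequential(nn.Linear(LATENT, 256), nn.ELU(), nn.Linear(256, ACTION), nn.Tanh())
transition = nn.Sequential(nn.Linear(LATENT + ACTION, 256), nn.ELU(), nn.Linear(256, LATENT))
reward = nn.Sequential(nn.Linear(LATENT + ACTION, 256), nn.ELU(), nn.Linear(256, 1))
value = nn.Sequential(nn.Linear(LATENT, 256), nn.ELU(), nn.Linear(256, 1))

def imagined_value(z):
    """Roll the latent model forward H steps under the policy and return the
    discounted sum of predicted rewards plus a bootstrapped terminal value."""
    total = 0.0
    for t in range(H):
        a = policy(z)
        za = torch.cat([z, a], dim=-1)
        total = total + (GAMMA ** t) * reward(za)
        z = transition(za)
    return total + (GAMMA ** H) * value(z)

# Policy improvement: ascend the gradient of the imagined value estimate.
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
z0 = torch.randn(16, LATENT)            # a batch of encoded observations
loss = -imagined_value(z0).mean()
opt.zero_grad()
loss.backward()
opt.step()
```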

Inputs:

  1. RGB images from two cameras located to the left and right of the robot (64 x 64 resolution)
  2. Proprioception data (joint angles & velocities, finger position & velocity, and grasp sensor state; 17-dimensional) -- one possible way to fuse these inputs is sketched below
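Below is a hedged sketch of how the two camera streams and the proprioceptive vector listed above might be fused into a single latent state for the predictive model. The layer sizes, the shared CNN, and the class name `ObservationEncoder` are assumptions for illustration, not the paper's reported architecture.

```python
# Hedged sketch of an observation encoder for the inputs listed above; the
# architecture details are assumptions, not the authors' exact model.
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    def __init__(self, latent=128):
        super().__init__()
        # One CNN shared between the left and right 64x64 RGB cameras.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ELU(),   # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ELU(),  # 31 -> 14
            nn.Conv2d(64, 64, 4, stride=2), nn.ELU(),  # 14 -> 6
            nn.Flatten(),
        )
        self.proprio = nn.Sequential(nn.Linear(17, 64), nn.ELU())  # 17-D proprioception
        self.fuse = nn.Linear(2 * 64 * 6 * 6 + 64, latent)

    def forward(self, left_img, right_img, proprio):
        feats = torch.cat(
            [self.cnn(left_img), self.cnn(right_img), self.proprio(proprio)], dim=-1)
        return self.fuse(feats)

enc = ObservationEncoder()
z = enc(torch.randn(16, 3, 64, 64), torch.randn(16, 3, 64, 64), torch.randn(16, 17))
print(z.shape)  # torch.Size([16, 128])
```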

Actions:

5 action dimensions

Setup:

2 learners (batch size 16), 8 actors
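As a rough illustration of this actor/learner split, the sketch below pairs 8 actor processes with 2 learner processes that consume batches of 16 transitions each. The queue-based plumbing and the placeholder transition data are assumptions for illustration, not the infrastructure used in the paper.

```python
# Hedged sketch of the 8-actor / 2-learner setup; the queue plumbing and
# placeholder data are illustrative assumptions only.
import multiprocessing as mp
import random

NUM_ACTORS, NUM_LEARNERS, BATCH_SIZE = 8, 2, 16
ACTORS_PER_LEARNER = NUM_ACTORS // NUM_LEARNERS

def actor(traj_queue, actor_id):
    # Each actor rolls out episodes and ships trajectories to its learner.
    for episode in range(10):
        trajectory = [(f"obs_{actor_id}_{episode}_{t}", random.random()) for t in range(5)]
        traj_queue.put(trajectory)
    traj_queue.put(None)  # tell the learner this actor is done

def learner(traj_queue, learner_id):
    replay, finished = [], 0
    while finished < ACTORS_PER_LEARNER:
        traj = traj_queue.get()
        if traj is None:
            finished += 1
            continue
        replay.extend(traj)
        if len(replay) >= BATCH_SIZE:
            batch = random.sample(replay, BATCH_SIZE)
            # ... model, value and policy updates would happen here ...
            print(f"learner {learner_id}: update on a batch of {len(batch)} transitions")

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(NUM_LEARNERS)]
    procs = [mp.Process(target=actor, args=(queues[i % NUM_LEARNERS], i))
             for i in range(NUM_ACTORS)]
    procs += [mp.Process(target=learner, args=(queues[j], j)) for j in range(NUM_LEARNERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```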

Learning from scratch:

Lift-B (IVG(5), Multitask)

lift_b.mp4

Stack-B (IVG(5), Multitask)

stack_b.mp4

Transferring learned models:
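The transfer runs below re-use a dynamics model learned on a source task while the task-specific components are re-learned for the new reward structure. A minimal sketch of that setup, assuming the same module names as the rollout sketch above and a hypothetical checkpoint path, might look like this (whether the transferred model is frozen or fine-tuned is our assumption, not a statement of the paper's exact procedure):

```python
# Hedged sketch of transferring a learned latent dynamics model to a new task;
# checkpoint path, freezing choice, and head sizes are illustrative assumptions.
import torch
import torch.nn as nn

LATENT, ACTION = 128, 5

# Dynamics model with the same shape as the one trained on the source task.
transition = nn.Sequential(nn.Linear(LATENT + ACTION, 256), nn.ELU(), nn.Linear(256, LATENT))
# In practice its weights would be restored from the source-task run, e.g.:
# transition.load_state_dict(torch.load("checkpoints/lift_b_transition.pt"))  # hypothetical path
for p in transition.parameters():
    p.requires_grad_(False)  # keep the transferred dynamics fixed in this sketch

# Fresh task-specific heads for the target task (e.g. Lift-R / Stack-R).
policy = nn.Sequential(nn.Linear(LATENT, 256), nn.ELU(), nn.Linear(256, ACTION), nn.Tanh())
reward = nn.Sequential(nn.Linear(LATENT + ACTION, 256), nn.ELU(), nn.Linear(256, 1))
value = nn.Sequential(nn.Linear(LATENT, 256), nn.ELU(), nn.Linear(256, 1))

# Only the new heads are optimized on the target task.
opt = torch.optim.Adam(
    list(policy.parameters()) + list(reward.parameters()) + list(value.parameters()),
    lr=3e-4,
)
```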

Lift-R (IVG(5), Multitask)

lift_r.mp4

Stack-R (IVG(5), Multitask)

stack_r.mp4

Distractor: Lift-R (IVG(5), Multitask)

lift_distractor1.mp4

Distractor: Stack-R (IVG(5), Multitask)

stack_distractor1.mp4

Match positions (IVG(5), Multitask)

move_to_target.mp4