Estimating Q(s, s') with Deep Deterministic Dynamics Gradients

[Code] [Paper]

Ashley D. Edwards, Himanshu Sahni, Rosanne Liu, Jane Hung, Ankit Jain, Rui Wang, Adrien Ecoffet, Thomas Miconi, Charles Isbell, Jason Yosinski

Abstract: In this paper, we introduce a novel form of a value function, Q(s, s'), that expresses the utility of transitioning from a state s to a neighboring state s' and then acting optimally thereafter. In order to derive an optimal policy, we develop a novel forward dynamics model that learns to make next-state predictions that maximize Q(s, s'). This formulation decouples actions from values while still learning off-policy. We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies.
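To make the idea of a state-to-state value concrete, here is a minimal tabular sketch. It is not the D3G implementation. It assumes a hypothetical five-state chain world, a known neighbor function in place of the learned dynamics model, and a backup rule that follows one natural reading of the abstract's definition: Q(s, s') = r + gamma * max over neighbors s'' of Q(s', s''). All names here (`neighbors`, `collect_random_transitions`, `learn_qss`, `greedy_next_state`) are illustrative. Note that the data contains only states, rewards, and termination flags, with no actions.

```python
import random

# Hypothetical toy setup (not from the paper): a 1-D chain of states 0..4,
# where reaching state 4 gives reward 1 and ends the episode.
N_STATES = 5
GOAL = 4
GAMMA = 0.9


def neighbors(s):
    """States reachable from s in one step. In D3G a learned model would
    propose the next state; here the neighbor set is assumed known."""
    return [n for n in (s - 1, s + 1) if 0 <= n < N_STATES]


def collect_random_transitions(num_steps, seed=0):
    """Random walk that records (s, s_next, reward, done) with no actions."""
    rng = random.Random(seed)
    data, s = [], 0
    for _ in range(num_steps):
        s_next = rng.choice(neighbors(s))
        done = s_next == GOAL
        data.append((s, s_next, 1.0 if done else 0.0, done))
        s = 0 if done else s_next  # reset to the start after reaching the goal
    return data


def learn_qss(data, sweeps=50):
    """Tabular backup: Q(s, s') = r + gamma * max_{s''} Q(s', s'').
    A learning rate of 1 is used because the toy world is deterministic."""
    q = [[0.0] * N_STATES for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s, s_next, r, done in data:
            if done:
                bootstrap = 0.0
            else:
                bootstrap = max(q[s_next][n] for n in neighbors(s_next))
            q[s][s_next] = r + GAMMA * bootstrap
    return q


def greedy_next_state(q, s):
    """Pick the neighboring state with the highest Q(s, s')."""
    return max(neighbors(s), key=lambda n: q[s][n])


if __name__ == "__main__":
    q = learn_qss(collect_random_transitions(2000))
    s, path = 0, [0]
    while s != GOAL:
        s = greedy_next_state(q, s)
        path.append(s)
    print(path)
```

Because values are attached to state pairs rather than actions, the same table could in principle be reused under a different action set that induces the same transitions. In this sketch the maximization is an explicit search over a known neighbor set, whereas the abstract describes replacing that search with a forward dynamics model trained to predict the maximizing next state.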

A D3G-trained dynamics model gradually learns to make grid-like predictions from the start to the goal when trained to solve a grid world task.

Given a sequence of states, rewards, and termination conditions (i.e., no actions!) obtained from a random policy, a D3G-trained dynamics model imagines balancing a pole*.

...it also imagines moving reacher to a target location.

* The model predicts state vectors which we render in the MuJoCo simulator.