Variational Reparametrized Policy Learning with Differentiable Physics

(1) Bandit 1 training visualization

Blue: Reinforce, Orange: Adam, Green: Our method.

The action distribution is plotted and normalized for visualization.

(2) Double pendulum swing-up: trajectory distribution visualization

Reinforce

Adam

Our method

(3) Landscape of the expected reward after Gaussian filtering

The global optimum is marked with the red dotted line.
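
One way to read the caption above: for a Gaussian policy with mean mu and fixed standard deviation sigma, the expected reward as a function of mu is the raw reward convolved with a Gaussian kernel. The sketch below illustrates that smoothing on a grid. The reward function, the grid, and the value of sigma are hypothetical choices made for illustration only; they are not the tasks or settings used on this page.

```python
import math

def reward(a):
    # Hypothetical 1-D reward with many local optima (an assumption,
    # not one of the tasks shown on this page).
    return math.exp(-(a - 1.0) ** 2) + 0.3 * math.cos(8.0 * a)

def gaussian_kernel(sigma, step, truncate=4.0):
    # Discrete Gaussian weights on a grid with spacing `step`,
    # truncated at `truncate` standard deviations and normalized to sum to 1.
    radius = int(truncate * sigma / step)
    weights = [math.exp(-0.5 * (k * step / sigma) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smoothed_reward(grid, sigma):
    # Approximates E_{a ~ N(mu, sigma^2)}[reward(a)] for every mu on the grid.
    step = grid[1] - grid[0]
    kernel = gaussian_kernel(sigma, step)
    radius = len(kernel) // 2
    values = []
    for mu in grid:
        acc = 0.0
        for k, w in enumerate(kernel):
            acc += w * reward(mu + (k - radius) * step)
        values.append(acc)
    return values

def count_local_maxima(ys):
    # Counts strict interior local maxima of a sampled curve.
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])

grid = [-3.0 + 0.01 * i for i in range(601)]
raw = [reward(a) for a in grid]
smooth = smoothed_reward(grid, sigma=0.5)
best_mu = grid[max(range(len(smooth)), key=lambda i: smooth[i])]
```

In this toy setting the filtered landscape has fewer local maxima than the raw one, and its peak sits near the center of the broad bump in the hypothetical reward. A larger sigma smooths more aggressively but also blurs the location of narrow optima.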
