(1) Bandit 1 training visualization
Blue: Reinforce, Orange: Adam, Green: Our method.
The plotted action distribution is normalized for visualization.
(2) Double pendulum swing-up: trajectory distribution visualization
Panels: Reinforce, Adam, Our method.
(3) Landscape of expected reward after Gaussian filtering
The global optimum is marked with the red dotted line.
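As a rough illustration of what a Gaussian-filtered reward landscape can look like, the sketch below smooths a one-dimensional reward curve with a normalized Gaussian kernel. Everything in it is an assumption made for illustration and is not taken from the experiments: the reward function, the kernel width `sigma`, the action grid, and the helper name `gaussian_smooth` are all hypothetical. One common reading of such a filter is that the smoothed value at a point approximates the expected reward of a Gaussian policy centered there with standard deviation `sigma`.

```python
import numpy as np


def gaussian_smooth(values, sigma, dx):
    """Smooth a 1-D array sampled at spacing dx with a Gaussian of std sigma."""
    # Truncate the kernel at roughly four standard deviations.
    radius = int(np.ceil(4 * sigma / dx))
    offsets = np.arange(-radius, radius + 1) * dx
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so constant inputs are preserved
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(values, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


# Hypothetical reward: a broad, lower peak near -1.0 and a narrow, taller peak near 1.5.
actions = np.linspace(-3.0, 3.0, 601)
reward = (
    np.exp(-0.5 * ((actions + 1.0) / 0.5) ** 2)
    + 1.5 * np.exp(-0.5 * ((actions - 1.5) / 0.05) ** 2)
)

smoothed = gaussian_smooth(reward, sigma=0.3, dx=actions[1] - actions[0])

print("argmax of raw reward:     ", actions[np.argmax(reward)])
print("argmax of smoothed reward:", actions[np.argmax(smoothed)])
```

In this toy setup the narrow peak is the maximum of the raw reward, but after filtering the broad peak dominates, which shows how the location of the optimum in the smoothed landscape can depend on the filter width.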