Deep RL, Robotics

Learning to Explore via Meta-Policy Gradient (ICML 2018, paper, video)

Two-Agent Markov Decision Process:

We have two agents: a Teacher with exploration policy $\pi_e$ and a Student with exploitation policy $\pi$. The Teacher learns to generate exploration trajectories based on the Student's performance improvement, while the Student learns a policy that maximizes its expected long-term return. In this MDP, the state is the student policy $\pi$ (the Student is part of the environment), the agent is the Teacher, the action is the set of trajectories $D_0$ generated by executing the teacher's policy $\pi_e$, and the transition function is a policy updater, e.g. DDPG. Finally, we define the meta-reward as the performance improvement $R(\pi, D_0) = R(\pi') - R(\pi)$, where $\pi' \leftarrow \mathrm{DDPG}(\pi, D_0)$.
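Below is a minimal sketch of this teacher-student loop on a 1-D toy task. The helpers `evaluate_return` and `ddpg_update`, and the Gaussian teacher policy, are illustrative stand-ins rather than the paper's implementation; the sketch only shows how the meta-reward $R(\pi') - R(\pi)$ drives a REINFORCE-style update of the teacher.

```python
# Sketch of the teacher-student meta-policy gradient loop (toy 1-D task).
# `evaluate_return` and `ddpg_update` are hypothetical stand-ins for the
# student's return estimate and the DDPG policy updater, respectively.
import numpy as np

rng = np.random.default_rng(0)

def evaluate_return(student_w, n_episodes=10):
    """Stand-in for R(pi): average return of the student; optimum at w* = 2."""
    return float(np.mean([-(student_w - 2.0) ** 2 + rng.normal(0, 0.1)
                          for _ in range(n_episodes)]))

def ddpg_update(student_w, trajectories, lr=0.1):
    """Stand-in for the policy updater pi' <- DDPG(pi, D0):
    nudge the student toward the best action found in the exploration data."""
    best = max(trajectories, key=lambda a: -(a - 2.0) ** 2)
    return student_w + lr * (best - student_w)

# Teacher: Gaussian exploration policy pi_e with a learnable mean mu_e.
mu_e, sigma_e, meta_lr = 0.0, 1.0, 0.05
student_w = 0.0

for meta_step in range(200):
    # 1. Teacher's "action": exploration trajectories D0 sampled from pi_e.
    D0 = rng.normal(mu_e, sigma_e, size=16)
    # 2. Environment transition: the policy updater produces the new student.
    r_before = evaluate_return(student_w)
    student_w_new = ddpg_update(student_w, D0)
    # 3. Meta-reward: performance improvement R(pi') - R(pi).
    meta_reward = evaluate_return(student_w_new) - r_before
    # 4. REINFORCE-style update of the teacher mean using the meta-reward.
    grad_log = np.sum((D0 - mu_e) / sigma_e ** 2)
    mu_e += meta_lr * meta_reward * grad_log
    student_w = student_w_new

print("student parameter:", round(student_w, 3), "teacher mean:", round(mu_e, 3))
```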

Demo videos: Pendulum, Inverted Double Pendulum.

Stochastic Variance Reduction for Policy Gradient Estimation (arXiv, video)

The variance of policy gradient estimates obtained from simulation is often excessive, leading to poor sample efficiency. In this paper, we apply the stochastic variance reduced gradient (SVRG) estimator to model-free policy gradients to significantly improve sample efficiency. The SVRG estimate is incorporated into a trust-region Newton conjugate gradient framework for policy optimization. On several MuJoCo tasks, our method achieves significantly better performance than state-of-the-art model-free policy gradient methods for robotic continuous control, such as trust region policy optimization (TRPO).
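As a sketch of the core idea, the snippet below applies the SVRG-corrected estimator $g_i(\theta) - g_i(\tilde\theta) + \mu$ to a plain least-squares objective that stands in for the policy-gradient objective. The paper's additional ingredients (importance weighting of the snapshot gradients, since trajectories come from the old policy, and the trust-region Newton-CG step) are omitted here.

```python
# Sketch of the SVRG gradient estimator on a stand-in least-squares objective;
# not the paper's trust-region Newton-CG policy optimizer.
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def grad_i(theta, idx):
    """Per-sample gradients of 0.5 * (x_i^T theta - y_i)^2."""
    residual = X[idx] @ theta - y[idx]
    return X[idx] * residual[:, None]            # shape (len(idx), d)

theta = np.zeros(d)
lr, n_epochs, batch = 0.05, 10, 32

for epoch in range(n_epochs):
    # Snapshot point theta_tilde and its full-batch gradient mu.
    theta_tilde = theta.copy()
    mu = grad_i(theta_tilde, np.arange(N)).mean(axis=0)
    for _ in range(N // batch):
        idx = rng.integers(0, N, size=batch)
        # SVRG estimator: g_i(theta) - g_i(theta_tilde) + mu has much lower
        # variance than the plain mini-batch gradient g_i(theta).
        g = (grad_i(theta, idx).mean(axis=0)
             - grad_i(theta_tilde, idx).mean(axis=0)
             + mu)
        theta -= lr * g

print("fitted theta:", np.round(theta, 3))
```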

Demo videos (MuJoCo): Cheetah, Walker, Ant, Swimmer, Hopper.