Reinforcement Learning with Deep Energy-Based Policies
All videos can be downloaded from this Google Drive folder: https://drive.google.com/drive/folders/0B_KFuCNKS7ZVRlFUOFBUSVZLOUE?usp=sharing
A swimmer snake robot
Note: we use an Ornstein–Uhlenbeck process (theta = 0.15, sigma = 0.1) to add exploration noise to the action outputs of the deterministic policies trained with DDPG.
No additional noise is added to the stochastic policy trained with soft Q-learning; the only stochasticity comes from the policy itself.
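For reference, here is a minimal sketch of such an OU noise generator; the class interface and the mu = 0, dt = 1 defaults are our own assumptions, not the authors' implementation.

    import numpy as np

    class OUNoise:
        """Ornstein-Uhlenbeck process for DDPG exploration noise (sketch)."""
        def __init__(self, action_dim, theta=0.15, sigma=0.1, mu=0.0, dt=1.0):
            self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
            self.state = np.full(action_dim, mu, dtype=float)

        def reset(self):
            self.state[:] = self.mu

        def sample(self):
            # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
            dx = (self.theta * (self.mu - self.state) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.state)))
            self.state = self.state + dx
            return self.state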
video link: http://youtu.be/8ysBFCDp1e8
A quadrupedal robot exploring a maze
Note: as above, OU noise (theta = 0.15, sigma = 0.1) is added to the DDPG policies, and no additional noise is added to the soft Q-learning policy.
The reward is a Gaussian function of the robot's position, with mean equal to the goal position.
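As a rough illustration of such a reward (a minimal sketch; the width sigma below is an assumed parameter, not reported here):

    import numpy as np

    def gaussian_reward(position, goal, sigma=1.0):
        # Unnormalized Gaussian: reward is 1 at the goal and decays
        # smoothly with squared distance from it.
        sq_dist = np.sum((np.asarray(position) - np.asarray(goal)) ** 2)
        return float(np.exp(-sq_dist / (2.0 * sigma ** 2)))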
video link: http://youtu.be/ppdIdYdD_U0
Pretraining a quadrupedal robot
A quadrupedal robot is trained with reward = (speed of its center of mass). The ideal maximum entropy policy should move in all directions with equal probability, but deterministic or uni-modal policies are typically unable to achieve this. The video below demonstrates how an energy-based stochastic policy can correctly represent the maximum entropy policy, which takes the form pi(a|s) ∝ exp(Q(s, a) / alpha), where alpha is the temperature.
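As a toy illustration of sampling from such an energy-based policy (the paper's actual sampler is an amortized SVGD network; the discretized action grid and the bimodal Q below are made up for illustration):

    import numpy as np

    def sample_energy_based(q_values, alpha=1.0, n_samples=1):
        # Softmax over Q / alpha is the maximum entropy action distribution.
        logits = q_values / alpha
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return np.random.choice(len(q_values), size=n_samples, p=probs)

    actions = np.linspace(-1.0, 1.0, 101)
    q = np.maximum(-(actions - 0.6) ** 2, -(actions + 0.6) ** 2)  # toy bimodal Q
    samples = actions[sample_energy_based(q, alpha=0.1, n_samples=5)]
    print(samples)  # samples cluster around both modes, -0.6 and +0.6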
Note: the policies are not perfect, so it is common that the robot flips over.
Similarly, OU noise (theta = 0.15, sigma = 0.1) is added to DDPG policies.
video link: http://youtu.be/KpDVM4h8m4g
Fine-tuning a pretrained policy in new environments
A quadrupedal robot is pretrained on empty ground with reward = (speed of its center of mass). The robot is then placed in three new environments, where it can transfer knowledge from the pretraining environment. Soft Q-learning yields a maximum entropy policy, which is advantageous for such pretrain-and-finetune tasks.
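A hedged sketch of this pretrain-and-finetune protocol, where SoftQAgent, make_env, the environment names, and the step counts are all hypothetical stand-ins rather than the authors' code:

    # Pretrain on empty ground with reward = center-of-mass speed.
    pretrain_env = make_env('EmptyGround')
    agent = SoftQAgent(pretrain_env)
    agent.train(num_steps=1000000)            # learn the maximum entropy policy
    agent.save('pretrained_quadruped.ckpt')

    # Fine-tune the pretrained policy in each new environment.
    for name in ['NewEnv1', 'NewEnv2', 'NewEnv3']:
        env = make_env(name)
        finetuned = SoftQAgent(env)
        finetuned.load('pretrained_quadruped.ckpt')  # transfer pretrained weights
        finetuned.train(num_steps=200000)            # fine-tune on the new task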
video link: http://youtu.be/7Nm1N6sUoVs