Reinforcement Learning with Deep Energy-Based Policies

All videos are downloadable at this Google Drive folder: https://drive.google.com/drive/folders/0B_KFuCNKS7ZVRlFUOFBUSVZLOUE?usp=sharing

A swimmer snake robot

Note: we use an Ornstein–Uhlenbeck process (theta = 0.15, sigma = 0.1) to add exploration noise to the action outputs of the deterministic policies trained with DDPG.

There is no additional noise added to the stochastic policy trained with soft Q-learning. The only stochasticity comes from the policy itself.
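
For reference, here is a minimal sketch (in NumPy; the class name, action dimension, and usage are illustrative, not taken from the DDPG implementation used here) of how Ornstein–Uhlenbeck noise with these parameters can be generated and added to a deterministic policy's actions:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Ornstein-Uhlenbeck process: dx = theta * (mu - x) dt + sigma dW."""
    def __init__(self, action_dim, theta=0.15, sigma=0.1, mu=0.0, dt=1.0):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        # Restart the process at its mean at the beginning of each episode.
        self.state[:] = self.mu

    def sample(self):
        # Euler-Maruyama discretization of the OU stochastic differential equation.
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state.copy()

# Illustrative usage: perturb a (hypothetical) deterministic DDPG action.
noise = OrnsteinUhlenbeckNoise(action_dim=2)
deterministic_action = np.zeros(2)   # stand-in for ddpg_policy(observation)
noisy_action = deterministic_action + noise.sample()
```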

A quadrupedal robot exploring a maze

Note: we use an Ornstein–Uhlenbeck process (theta = 0.15, sigma = 0.1) to add exploration noise to the action outputs of the deterministic policies trained with DDPG.

There is no additional noise added to the stochastic policy trained with soft Q-learning. The only stochasticity comes from the policy itself.

The reward is a Gaussian function of the robot's position, with mean equal to the goal position.
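
As a rough sketch of what such a reward looks like (the width sigma and any scaling are assumptions for illustration, not the values used in the experiments):

```python
import numpy as np

def gaussian_reward(position, goal, sigma=1.0):
    """Reward peaks at the goal and decays smoothly with squared distance.

    `sigma` controls how quickly the reward falls off; its value here is
    an illustrative assumption.
    """
    squared_dist = np.sum((np.asarray(position) - np.asarray(goal)) ** 2)
    return np.exp(-squared_dist / (2.0 * sigma ** 2))

# Example: reward is 1 at the goal and decreases with distance from it.
print(gaussian_reward([0.0, 0.0], goal=[0.0, 0.0]))  # 1.0
print(gaussian_reward([1.0, 1.0], goal=[0.0, 0.0]))  # < 1.0
```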

Pretraining a quadrupedal robot

A quadrupedal robot is trained with reward = (speed of its center of mass). The ideal maximum entropy policy should move uniformly in all directions. However, deterministic or uni-modal policies typically cannot achieve this effect. The video below demonstrates how an energy-based stochastic policy can correctly represent a maximum entropy policy of the following form:

π(a_t | s_t) ∝ exp( Q(s_t, a_t) / α )

Note: the policies are not perfect, so it is common for the robot to flip over.

Similarly, OU noise (theta = 0.15, sigma = 0.1) is added to DDPG policies.
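
To make the shape of such an energy-based policy concrete, here is a small illustrative sketch that samples actions from pi(a|s) ∝ exp(Q(s, a)/alpha) over a discretized 2-D action grid. The grid, temperature alpha, and toy Q-function are assumptions for illustration only; the actual soft Q-learning algorithm trains an amortized sampling network rather than discretizing the action space:

```python
import numpy as np

def sample_energy_based_action(q_values_fn, state, alpha=1.0, grid_size=21, rng=None):
    """Draw one action from pi(a|s) proportional to exp(Q(s, a) / alpha).

    Discretizes a 2-D action space onto a grid and samples from the resulting
    Boltzmann distribution. This is only a simplified illustration of the
    distribution that the trained sampling network approximates.
    """
    rng = np.random.default_rng() if rng is None else rng
    axis = np.linspace(-1.0, 1.0, grid_size)
    actions = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
    q_values = np.array([q_values_fn(state, a) for a in actions])
    logits = q_values / alpha
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]

# Toy Q-function that rewards moving fast in *any* direction: the resulting
# maximum entropy policy is multi-modal, roughly uniform over headings.
toy_q = lambda s, a: np.linalg.norm(a)
print(sample_energy_based_action(toy_q, state=None))
```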

Fine-tuning a pretrained policy in new environments

A quadrupedal robot is pretrained on empty ground with reward = (speed of its center of mass). The robot is then placed in three new environments, where it can transfer knowledge from the pretraining environment. Soft Q-learning yields a maximum entropy policy that is advantageous for such pretrain-and-fine-tune tasks.