Soft Actor-Critic Algorithms and Applications
Tuomas Haarnoja*, Aurick Zhou*, Kristian Hartikainen*, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine.
UC Berkeley / Google
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision-making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we focus on soft actor-critic (SAC), an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, in which the actor aims to maximize expected reward while also maximizing entropy: that is, to succeed at the task while acting as randomly as possible. We further extend this method with a number of modifications that substantially accelerate training and improve stability with respect to hyperparameters, including a constrained formulation that automatically tunes the temperature. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds. Finally, we demonstrate real-world learning on challenging tasks such as locomotion for a quadrupedal robot and manipulation with a dexterous hand.
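The automatic temperature tuning mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the temperature alpha is adjusted by gradient descent on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)], so alpha shrinks when the policy's entropy exceeds the target and grows when it falls below. The function name `temperature_step` and the toy sampled log-probabilities are assumptions for illustration.

```python
import numpy as np

# Hedged sketch of SAC-style automatic temperature (alpha) tuning.
# J(alpha) = E[-alpha * (log_pi + target_entropy)]; we descend on log(alpha)
# so alpha stays positive.

def temperature_step(log_alpha, log_probs, target_entropy, lr=1e-2):
    """One gradient-descent step on log(alpha).

    log_probs: sampled log pi(a|s) values from the current policy.
    """
    alpha = np.exp(log_alpha)
    # Chain rule: dJ/d(log_alpha) = -alpha * E[log_pi + target_entropy]
    grad = -alpha * np.mean(log_probs + target_entropy)
    return log_alpha - lr * grad

# Toy usage: log-probs around 0.5 imply entropy near -0.5, which is above
# the target of -1.0, so alpha should decrease from its initial value 1.0.
rng = np.random.default_rng(0)
log_alpha = 0.0          # alpha = 1.0 initially
target_entropy = -1.0    # a common heuristic is -dim(action space)
for _ in range(100):
    log_probs = rng.normal(loc=0.5, scale=0.1, size=32)
    log_alpha = temperature_step(log_alpha, log_probs, target_entropy)
alpha = np.exp(log_alpha)
```

The design choice of descending on log(alpha) rather than alpha directly is a common practical trick to keep the temperature positive without projection.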
Rollouts from SAC policy trained for Dynamixel Claw task from vision. The robot must rotate the valve so that the colored peg faces the right. The video embedded in the bottom right corner shows the frames as seen by the policy. (UC Berkeley)
Testing robustness of the learned policy against visual perturbations on the same valve-rotation task. (UC Berkeley)
We trained the Minitaur robot to walk in 2 hours. (Google Brain)
Even though the policy was trained on flat terrain, it generalizes surprisingly well to unseen terrains. (Google Brain)
Tuomas Haarnoja*, Aurick Zhou*, Kristian Hartikainen*, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, Sergey Levine. Soft Actor-Critic Algorithms and Applications. arXiv preprint, 2018.