Conservative Q-Learning for Offline RL
Aviral Kumar1, Aurick Zhou1, George Tucker2, Sergey Levine1,2
1UC Berkeley 2Google Research, Brain Team
Update (11/29): We have updated the results in the paper to be compatible with the D4RL-v2 environments; the code to reproduce these experiments, along with wandb links, can be found here.
Abstract
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
Motivation
Standard Q-learning methods often learn highly overestimated Q-values in offline settings, due to out-of-distribution actions, finite-sample error, and function approximation error.
This erroneous overestimation often gives rise to poor policies, and without further environment interaction the algorithm cannot correct such errors.
In this work, we propose a Q-learning method that learns provably lower-bounded Q-function estimates, and optimizes the policy against this learned Q-function.
Algorithm Summary
CQL is a Q-learning or actor-critic algorithm that learns Q-functions such that the expected value of a policy under the learned Q-function lower-bounds the true policy value. To obtain such lower-bounded Q-values, CQL additionally minimizes the Q-function under a chosen distribution \(\mu(a|s)\), while maximizing it under the data distribution, and trains the Q-function using the following objective:

\[
\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)}\big[Q(s,a)\big] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_\beta(a|s)}\big[Q(s,a)\big] \Big) + \frac{1}{2}\, \mathbb{E}_{s,a,s' \sim \mathcal{D}}\Big[ \big( Q(s,a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s,a) \big)^2 \Big],
\]

where \(\mathcal{D}\) is the offline dataset, \(\hat{\pi}_\beta\) its (empirical) behavior distribution, \(\hat{\mathcal{B}}^{\pi}\) the empirical Bellman backup operator, and \(\alpha\) a trade-off coefficient.
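To make the critic update concrete, below is a minimal sketch of how the conservative regularizer can be added on top of a standard Bellman error loss. It assumes PyTorch and hypothetical modules `q_net` (the Q-function), `q_target` (its target copy), and `policy` (returning reparameterized action samples and log-probabilities); the log-sum-exp of the CQL(H) variant is approximated with uniform-random and policy action samples, and the importance-sampling corrections used in the full implementation are omitted.

```python
# Minimal sketch of a CQL(H)-style critic loss for continuous actions.
# `q_net`, `q_target`, and `policy` are hypothetical modules, not part of the
# released code; `batch` is a tuple of tensors drawn from the offline dataset D.
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, q_target, policy, batch, alpha=5.0, gamma=0.99,
                    num_random=10):
    s, a, r, s_next, done = batch

    # Standard Bellman error term (SAC-style target, entropy bonus omitted).
    with torch.no_grad():
        a_next, _ = policy.sample(s_next)
        target_q = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    bellman_error = F.mse_loss(q_net(s, a), target_q)

    # Conservative regularizer: push Q-values down under a broad action
    # distribution (uniform-random + policy samples approximating the
    # log-sum-exp of CQL(H)) and push them up on dataset actions.
    batch_size, action_dim = a.shape
    rand_a = torch.empty(batch_size, num_random, action_dim,
                         device=a.device).uniform_(-1, 1)
    q_rand = torch.stack([q_net(s, rand_a[:, i]) for i in range(num_random)], dim=1)
    pi_a, _ = policy.sample(s)
    q_pi = q_net(s, pi_a).unsqueeze(1)
    logsumexp_q = torch.logsumexp(torch.cat([q_rand, q_pi], dim=1), dim=1)
    conservative_term = (logsumexp_q - q_net(s, a)).mean()

    return bellman_error + alpha * conservative_term
```

In practice the policy samples are used only to evaluate the regularizer here; the actor is updated by a separate optimizer, as sketched next.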
The learned Q-values are then used for policy optimization, analogously to policy iteration:

\[
\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(a|s)}\big[ \hat{Q}^{k+1}(s,a) \big].
\]
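A corresponding sketch of the policy improvement step, under the same assumed interfaces as above: in the continuous-control setting the actor is updated SAC-style to maximize the conservative Q-function (plus an entropy bonus) over states from the dataset.

```python
# Sketch of the policy improvement step against the conservative Q-function.
# `q_net` and `policy` are the same hypothetical modules as in the critic sketch.
def policy_loss(q_net, policy, states, entropy_weight=0.2):
    actions, log_probs = policy.sample(states)   # reparameterized action samples
    q_values = q_net(states, actions)            # conservative Q estimates
    # Minimizing this is gradient ascent on E[Q(s, a) - entropy_weight * log pi(a|s)].
    return (entropy_weight * log_probs - q_values).mean()
```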
Empirical Results
CQL learns lower-bounded Q-values. We compared the learned Q-values against the true discounted policy return at intermediate iterations during training, and observed that CQL indeed learns lower-bounded Q-value estimates. In contrast, prior methods based on ensembles (Ens.) or policy constraints (BEAR) tend to overestimate Q-values.
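As an illustration of this diagnostic, one way to estimate the gap is to roll out the learned policy, accumulate the discounted Monte Carlo return, and subtract it from the Q-value the critic predicts at the start of the episode. Here `env`, `policy`, and `q_net` are assumed interfaces (a Gymnasium-style environment, a reparameterized policy, and a scalar-output critic), not part of the released code.

```python
# Sketch: average gap between the critic's predicted Q-value at the initial
# state-action pair and the discounted Monte Carlo return of the rollout.
# A non-positive average gap suggests the Q-values are not overestimated.
import torch

def q_value_gap(env, policy, q_net, num_episodes=10, gamma=0.99):
    gaps = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, ret, discount, first_q = False, 0.0, 1.0, None
        while not done:
            s = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            a, _ = policy.sample(s)
            if first_q is None:
                first_q = q_net(s, a).item()  # critic's prediction at the start
            obs, r, terminated, truncated, _ = env.step(a.squeeze(0).detach().numpy())
            done = terminated or truncated
            ret += discount * r
            discount *= gamma
        gaps.append(first_q - ret)
    return sum(gaps) / len(gaps)
```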
CQL outperforms prior methods on realistic complex datasets. We evaluated CQL on a number of D4RL datasets, with complex data distributions and hard control problems, and observed that CQL outperforms prior methods, sometimes by 2-5x.
For any questions or suggestions, please email: aviralk@berkeley.edu.
Other useful resources for offline RL:
Offline RL Tutorial (Levine, Kumar, Tucker, Fu): https://arxiv.org/abs/2005.01643
Offline RL datasets (Fu, Kumar, Nachum, Tucker, Levine): https://arxiv.org/abs/2004.07219, https://github.com/rail-berkeley/d4rl