Conservative Q-Learning
for Offline RL

Aviral Kumar1, Aurick Zhou1, George Tucker2, Sergey Levine1,2

1UC Berkeley 2Google Research, Brain Team

TL;DR: An offline RL method that learns lower-bounded policy value functions

Update (11/29): We have updated the results in the paper to be compatible with the D4RL-v2 environments; the code to reproduce these experiments, along with wandb links, can be found here.

Abstract

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

Motivation

  • Standard Q-learning methods often learn highly overestimated Q-values in offline RL settings, due to out-of-distribution actions, finite-sample error, and function approximation error.

  • This erroneous overestimation often gives rise to poor policies, and without online interaction the algorithm is unable to correct such errors.

  • In this work, we propose a Q-learning method that learns provably lower-bounded Q-function estimates, and optimizes the policy against this learned Q-function.

Algorithm Summary

CQL is a Q-learning or actor-critic algorithm that learns Q-functions such that the expected value of a policy under the learned Q-function lower-bounds the true policy value. In order to obtain such lower-bounded Q-values, CQL additionally minimizes the Q-function under a chosen action distribution, while maximizing it under the data distribution, and trains the Q-function using the following objective:
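As a sketch, in the paper's notation (with $\alpha$ a regularizer weight, $\mu$ the chosen action distribution, $\hat{\pi}_\beta$ the behavior policy that generated the dataset $\mathcal{D}$, and $\hat{\mathcal{B}}^\pi$ the empirical Bellman backup), the training objective takes the form:

$$
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\left(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot|s)}\big[Q(s,a)\big] \;-\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim\hat{\pi}_\beta(\cdot|s)}\big[Q(s,a)\big]\right) \;+\; \frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\left[\Big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\Big)^{2}\right]
$$

The first term pushes Q-values down under the chosen distribution $\mu$ and up under the data distribution, which is what yields the lower bound; the second term is the usual Bellman error.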

The learned Q-values are then used for policy optimization, in a similar fashion as policy iteration:
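As a sketch, the policy improvement step maximizes the conservative Q-function over the dataset states:

$$
\hat{\pi} \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi(\cdot|s)}\big[\hat{Q}^{\pi}(s,a)\big]
$$

To make the regularizer concrete, here is a minimal PyTorch-style sketch of adding a CQL penalty to a standard Bellman error for a discrete-action Q-network, using the soft-maximum (log-sum-exp) over actions as the chosen distribution; the names `q_net`, `target_q_net`, `batch`, and `cql_alpha` are illustrative placeholders, not the released implementation:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, cql_alpha=1.0):
    """Illustrative CQL-style loss for a discrete-action Q-network (not the authors' code)."""
    obs, actions, rewards, next_obs, dones = batch  # tensors sampled from the offline dataset

    # Standard Bellman error on dataset transitions.
    q_values = q_net(obs)                                        # [batch, num_actions]
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    bellman_error = F.mse_loss(q_taken, td_target)

    # CQL regularizer: push Q-values down under a soft-maximum over all actions
    # (log-sum-exp) and up on the actions actually present in the dataset.
    cql_regularizer = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return bellman_error + cql_alpha * cql_regularizer
```

In continuous-control settings, the same idea is typically implemented by approximating the log-sum-exp term with sampled actions (e.g., from the current policy) rather than enumerating all actions.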

Empirical Results

CQL learns lower-bounded Q-values. In the table below, we present the difference between the learned Q-value and the discounted policy return at intermediate iterations during training. Observe that CQL indeed learns lower-bounded Q-value estimates. On the other hand, prior methods that use ensembles (Ens.) or policy constraints (BEAR) tend to overestimate Q-values.

CQL outperforms prior methods on realistic complex datasets. We evaluated CQL on a number of D4RL datasets, with complex data distributions and hard control problems, and observed that CQL outperforms prior methods, sometimes by 2-5x.

For any questions or suggestions, please email: aviralk@berkeley.edu.

Other useful resources for offline RL:

  1. Offline RL Tutorial (Levine, Kumar, Tucker, Fu): https://arxiv.org/abs/2005.01643

  2. Offline RL datasets (Fu, Kumar, Nachum, Tucker, Levine): https://arxiv.org/abs/2004.07219, https://github.com/rail-berkeley/d4rl