Conservative Q-Learning
for Offline RL

Aviral Kumar1, Aurick Zhou1, George Tucker2, Sergey Levine1,2

1UC Berkeley 2Google Research, Brain Team

TL;DR: An offline RL method that learns lower-bounded policy value functions

Update (11/29): We have updated the results in the paper to be compatible with the D4RL-v2 environments; the code to reproduce these experiments, along with wandb links, can be found here.

Abstract

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

Motivation

  • Standard Q-learning methods often learn highly overestimated Q-values in offline RL settings, due to out-of-distribution actions, finite-sample error, and function approximation error.

  • This erroneous overestimation often gives rise to poor policies, and without online interaction the algorithm is unable to correct such errors.

  • In this work, we propose a Q-learning method that learns provably lower-bounded Q-function estimates, and optimizes the policy against this learned Q-function.

Algorithm Summary

CQL is a Q-learning or actor-critic algorithm that learns Q-functions such that the expected value of a policy under the learned Q-function lower-bounds the true policy value. In order to obtain such lower-bounded Q-values, CQL additionally minimizes the Q-function under a chosen action distribution, while maximizing it under the data distribution, and trains the Q-function using the following objective:
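As a sketch, in the paper's notation (with $\alpha$ a regularizer weight, $\mu$ the chosen action distribution, $\hat{\pi}_\beta$ the behavior policy that generated the dataset $\mathcal{D}$, and $\hat{\mathcal{B}}^\pi$ the empirical Bellman backup), the training objective takes the form:

$$
\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\left(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot|s)}\big[Q(s,a)\big] \;-\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim\hat{\pi}_\beta(\cdot|s)}\big[Q(s,a)\big]\right) \;+\; \frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\left[\Big(Q(s,a) - \hat{\mathcal{B}}^{\pi}\hat{Q}^{k}(s,a)\Big)^{2}\right]
$$

The first term pushes Q-values down under the chosen distribution $\mu$ and up under the data distribution, which is what yields the lower bound; the second term is the usual Bellman error.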

The learned Q-values are then used for policy optimization, in a similar fashion as policy iteration:
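As a sketch, the policy improvement step maximizes the conservative Q-function over the dataset states:

$$
\hat{\pi} \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi(\cdot|s)}\big[\hat{Q}^{\pi}(s,a)\big]
$$

To make the regularizer concrete, here is a minimal PyTorch-style sketch of adding a CQL penalty to a standard Bellman error for a discrete-action Q-network, using the soft-maximum (log-sum-exp) over actions as the chosen distribution; the names `q_net`, `target_q_net`, `batch`, and `cql_alpha` are illustrative placeholders, not the released implementation:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, cql_alpha=1.0):
    """Illustrative CQL-style loss for a discrete-action Q-network (not the authors' code)."""
    obs, actions, rewards, next_obs, dones = batch  # tensors sampled from the offline dataset

    # Standard Bellman error on dataset transitions.
    q_values = q_net(obs)                                        # [batch, num_actions]
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_q_net(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    bellman_error = F.mse_loss(q_taken, td_target)

    # CQL regularizer: push Q-values down under a soft-maximum over all actions
    # (log-sum-exp) and up on the actions actually present in the dataset.
    cql_regularizer = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return bellman_error + cql_alpha * cql_regularizer
```

In continuous-control settings, the same idea is typically implemented by approximating the log-sum-exp term with sampled actions (e.g., from the current policy) rather than enumerating all actions.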

Empirical Results

CQL learns lower-bounded Q-values. In the table below, we present the difference between the learned Q-value and the discounted policy return at intermediate iterations during training. Observe that CQL indeed learns lower-bounded Q-value estimates. On the other hand, prior methods that use ensembles (Ens.) or policy constraints (BEAR) tend to overestimate Q-values.

CQL outperforms prior methods on realistic complex datasets. We evaluated CQL on a number of D4RL datasets, with complex data distributions and hard control problems, and observed that CQL outperforms prior methods, sometimes by 2-5x.

For any questions or suggestions, please email: aviralk@berkeley.edu.

Other useful resources for offline RL:

  1. Offline RL Tutorial (Levine, Kumar, Tucker, Fu): https://arxiv.org/abs/2005.01643

  2. Offline RL datasets (Fu, Kumar, Nachum, Tucker, Levine): https://arxiv.org/abs/2004.07219, https://github.com/rail-berkeley/d4rl