Multi-Game Decision Transformers

Kuang-Huei Lee*, Ofir Nachum*, Sherry Yang, Lisa Lee, Daniel Freeman, Winnie Xu, Sergio Guadarrama, Ian Fischer, Eric Jang, Henryk Michalewski, Igor Mordatch*

Google Research

Paper | Code and Pre-Trained Models | Google AI Blog | Twitter Thread

Abstract

A longstanding goal of the field of AI is a strategy for compiling diverse experience into a highly capable, generalist agent. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model -- with a single set of weights -- trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.

Overview of the training and evaluation setup

We observe expert-level game-play in the interactive setting after offline learning from trajectories ranging from beginner to expert.

Decision Transformers Architecture

Following [1], we pose offline reinforcement learning as a sequence modeling problem. Returns, actions, and rewards are tokenized, and the model is trained to predict the next discrete return, action, and reward tokens in the sequence with a standard cross-entropy loss. However, unlike [1], our design allows the model to predict the return distribution and sample from it, instead of relying on a user to manually select an expert-level return at inference time.
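To make the sequence layout and the return sampling concrete, the following is a minimal NumPy sketch (not the released JAX code). It assumes each timestep contributes observation-patch tokens followed by return, action, and reward tokens, and that return bins are ordered from lowest to highest; the exponential tilt toward high-return bins is an illustrative stand-in for the expert-conditioned return sampling, not the exact formulation in the paper.

import numpy as np

def build_token_sequence(obs_patch_tokens, returns, actions, rewards):
    # Interleave per-timestep tokens as [observation patches..., return, action, reward]
    # (the exact ordering is an assumption made for illustration).
    seq = []
    for t in range(len(actions)):
        seq.extend(int(tok) for tok in obs_patch_tokens[t])
        seq.extend([int(returns[t]), int(actions[t]), int(rewards[t])])
    return np.array(seq, dtype=np.int64)

def sample_expert_return(return_logits, kappa=10.0, rng=None):
    # Sample a return token after tilting the model's predicted return distribution
    # toward high-return bins; the exact weighting used in the paper may differ.
    if rng is None:
        rng = np.random.default_rng()
    return_logits = np.asarray(return_logits, dtype=float)
    log_p = return_logits - np.logaddexp.reduce(return_logits)   # log-softmax
    bins = np.arange(len(return_logits))                         # assumed low-to-high return order
    tilted = log_p + kappa * bins / (len(bins) - 1)
    tilted -= np.logaddexp.reduce(tilted)
    return rng.choice(bins, p=np.exp(tilted))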

Experiments

How do different online and offline methods perform in the multi-game regime?

We train a single agent that achieves 126% of human-level performance simultaneously across 41 Atari games after training on offline expert and non-expert datasets.

Aggregates of human-normalized scores across 41 Atari games. Grey bars are single-game specialist models while blue bars are generalists. We also report the performance of Deep Q-Network (DQN) [2], Batch-Constrained Q-learning (BCQ) [11], Behavioral Cloning (BC) [4], and Conservative Q-Learning (CQL) [6].
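For reference, a human-normalized score is computed per game as 100 * (score - random) / (human - random), and a robust aggregate such as the median is then taken across games. The snippet below sketches that computation in Python; the per-game agent, random, and human values are placeholders for illustration, not the paper's numbers.

import numpy as np

def normalized_score(score, random_score, reference_score):
    # 100 means matching the reference (human or DQN) on this game.
    return 100.0 * (score - random_score) / (reference_score - random_score)

# Placeholder per-game values purely for illustration, not the paper's numbers.
agent_scores = {"Breakout": 280.0, "Seaquest": 4500.0, "Asterix": 8000.0}
random_scores = {"Breakout": 1.7, "Seaquest": 68.4, "Asterix": 210.0}
human_scores = {"Breakout": 30.5, "Seaquest": 42054.7, "Asterix": 8503.3}

per_game = [normalized_score(agent_scores[g], random_scores[g], human_scores[g])
            for g in agent_scores]
print("median human-normalized score: %.1f%%" % np.median(per_game))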

How do different methods scale with model size?

In large language and vision models, the lowest achievable training loss typically decreases predictably with increasing model size [10]. We investigate whether a similar trend holds for interactive in-game performance -- not just training loss -- and find a similar power-law trend.

Multi-Game Decision Transformer performance reliably increases over two orders of magnitude of model size, whereas the other methods either saturate or grow much more slowly. We also find that larger models train faster, in the sense of reaching higher in-game performance after observing the same number of tokens.

Scaling of median scores for all training games.

Scaling of median scores for all novel games after fine-tuning DT and CQL.
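A simple way to check for the power-law relationship described above, score ≈ a * N^b in model size N, is a linear fit in log-log space. Below is a short NumPy sketch; the model sizes and scores are placeholder values for illustration, not the measurements behind the figures.

import numpy as np

# Placeholder (model size in parameters, median normalized score) pairs for illustration.
sizes = np.array([10e6, 40e6, 200e6])
scores = np.array([45.0, 80.0, 110.0])

# Fit log(score) = log(a) + b * log(size), i.e. score ~ a * size**b.
b, log_a = np.polyfit(np.log(sizes), np.log(scores), deg=1)
print("fitted exponent b = %.3f" % b)
print("extrapolated score at 1B parameters = %.1f" % (np.exp(log_a) * 1e9 ** b))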

How effective are different methods at transfer to novel games?

Pretraining for rapid adaptation to new games has not been widely explored on Atari, despite being a natural and well-motivated task given its relevance to how humans transfer knowledge to new games.

Pretraining with the DT objective performs the best across all games. All methods with pretraining outperform training CQL from scratch, which supports our hypothesis that pretraining on other games helps rapid learning of a new game. CPC and BERT underperform DT, suggesting that learning state representations alone is not sufficient for good transfer performance. While ACL adds an action-prediction auxiliary loss to BERT, it has little effect, suggesting that modeling actions in the right way on the offline data is important for good transfer performance.

Fine-tuning performance on 1% of the data of 5 held-out games after pretraining on the other 41 games using DT [1], CQL [6], CPC [7], BERT [8], and ACL [9].

We pretrain all methods on the full datasets of the 41 training games (50M steps each) and fine-tune one model per held-out game using 1% (500k steps) of that game's data. All methods are fine-tuned for 100,000 steps, which is much shorter than training any agent from scratch.
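In outline, the per-game fine-tuning protocol looks like the sketch below. The functions load_pretrained_dt, load_game_dataset, and training_step are hypothetical placeholder names used only to show the structure (subsample 1% of the held-out game's data, initialize from the multi-game checkpoint, and train for 100,000 steps with the same objective); they are not the released API.

# Sketch of the per-game fine-tuning loop; load_pretrained_dt, load_game_dataset,
# and training_step are hypothetical placeholders, not the released API.
NUM_FINETUNE_STEPS = 100_000
DATA_FRACTION = 0.01  # 1% of the held-out game's offline data (500k steps)

def finetune_on_game(game_name, pretrained_checkpoint):
    model = load_pretrained_dt(pretrained_checkpoint)              # weights from 41-game pretraining
    dataset = load_game_dataset(game_name, fraction=DATA_FRACTION)
    for _ in range(NUM_FINETUNE_STEPS):
        batch = dataset.sample_batch()
        model = training_step(model, batch)                        # same DT objective as in pretraining
    return model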

Does the multi-game decision transformer improve upon its training data?

We evaluate whether decision transformer with expert action inference is capable of acting better than the best demonstrations seen during training. We see significant improvement over the training data in a number of games.

Percent improvement of the top-3 decision transformer rollouts over the best score in the training dataset. 0% indicates no improvement. The top-3 metric (instead of the mean) is used to compare more fairly against the best -- rather than the average expert -- demonstration score.
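Concretely, this metric can be computed as in the sketch below. Whether "top-3" means the mean or the maximum of the three highest-scoring rollouts is an assumption here (the mean is used), and the scores are placeholder values.

import numpy as np

def top3_improvement(rollout_scores, best_training_score):
    # Percent improvement of the three best rollouts over the best score
    # present in the training data; taking the mean of the top-3 is an assumption.
    top3 = np.sort(np.asarray(rollout_scores, dtype=float))[-3:]
    return 100.0 * (top3.mean() - best_training_score) / best_training_score

# Placeholder example: the best rollouts slightly exceed the best demonstration.
print("%.1f%%" % top3_improvement([1200, 1500, 1450, 1600], best_training_score=1400))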

Does expert action inference improve upon behavioral cloning?

We find that median performance across all games is indeed significantly improved by generating optimality-conditioned actions. Decision transformer outperforms behavioral cloning in 31 out of 41 games.

Mean and standard deviation of scores across all games. We show DQN-normalized scores in this figure for clearer presentation.

Are there benefits to specifically using transformer architecture?

Decision Transformer is an Upside-Down RL (UDRL) [12] implementation that uses the transformer architecture and treats RL as a sequence modeling problem. To understand the benefit of the transformer architecture, we compare against a UDRL implementation that uses feed-forward, convolutional Impala [13] networks.
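For context, UDRL conditions a policy directly on a command such as a desired return. A minimal feed-forward version of that idea looks like the NumPy sketch below (an illustration of the general UDRL setup, not the Impala-based network used in the comparison): the observation features and the target return are concatenated and mapped to action logits.

import numpy as np

def udrl_action_logits(obs_features, target_return, w1, b1, w2, b2):
    # Feed-forward upside-down-RL policy sketch: condition on a desired
    # return by appending it to the observation features.
    x = np.concatenate([obs_features, [target_return]])
    h = np.maximum(0.0, w1 @ x + b1)  # ReLU hidden layer
    return w2 @ h + b2                # logits over discrete actions

# Placeholder shapes: 64 observation features + 1 return -> 128 hidden -> 18 Atari actions.
rng = np.random.default_rng(0)
w1, b1 = 0.05 * rng.normal(size=(128, 65)), np.zeros(128)
w2, b2 = 0.05 * rng.normal(size=(18, 128)), np.zeros(18)
logits = udrl_action_logits(rng.normal(size=64), target_return=90.0, w1=w1, b1=b1, w2=w2, b2=b2)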

What does multi-game decision transformer attend to?

We find that specific heads and layers attend to meaningful image patches such as player character, player's free movement space, non-player objects, and other environment features:

Attention visualization examples (game: attended region) -- Seaquest: player; Frostbite: player; Asterix: player; Breakout: ball; Breakout: free movement space; Asterix: frozen characters; Seaquest: bullets; Breakout: unbroken blocks.
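To show how such maps can be obtained, below is a generic Python sketch that turns a transformer attention matrix over image-patch tokens into a per-patch heatmap for a chosen layer and head. The patch-grid size and the attention tensor layout are assumptions for illustration, not the exact internals of the released model.

import numpy as np

def patch_attention_map(attn, query_index, patch_token_indices, grid=(6, 6)):
    # attn: [num_heads, seq_len, seq_len] attention weights for one layer.
    # Returns per-head heatmaps of shape [num_heads, grid_h, grid_w] showing how
    # much the query token attends to each observation-patch token.
    weights = attn[:, query_index, patch_token_indices]      # [num_heads, num_patches]
    weights = weights / weights.sum(axis=-1, keepdims=True)  # renormalize over patches
    return weights.reshape(attn.shape[0], *grid)

# Placeholder tensor: 8 heads, 64-token sequence whose first 36 tokens are image patches.
attn = np.random.default_rng(0).random((8, 64, 64))
heatmaps = patch_attention_map(attn, query_index=40, patch_token_indices=np.arange(36))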

References

[1] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. NeurIPS 2021.

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature 2015.

[3] Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL Unplugged: A suite of benchmarks for offline reinforcement learning. NeurIPS 2020.

[4] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 1991.

[5] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. ICML 2017.

[6] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. NeurIPS 2020.

[7] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.

[9] Mengjiao Yang and Ofir Nachum. Representation matters: Offline pretraining for sequential decision making. ICML 2021.

[10] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.

[11] Scott Fujimoto, David Meger, and Doina Precup. Off-Policy Deep Reinforcement Learning without Exploration. ICML 2019.

[12] Juergen Schmidhuber. Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions. 2019.

[13] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML 2018.