Can standard language modeling frameworks train effective policies for reinforcement learning?
We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite the simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Offline reinforcement learning as a sequence modeling problem
We investigate shifting our perspective of reinforcement learning (RL) by posing sequential decision making problems in a language modeling framework. While conventional work in RL has utilized specialized frameworks relying on Bellman backups, we propose to instead model trajectories with sequence modeling, enabling us to use strong and well-studied architectures such as transformers to generate behaviors. To illustrate this, we study offline reinforcement learning, where we train a model from a fixed dataset rather than collecting experience in the environment. This enables us to train RL policies using the same code as a language modeling framework, with minimal changes.
Decision Transformer: autoregressive sequence modeling for RL
We take a simple approach: each modality (return, state, or action) is passed into an embedding network (convolutional encoder for images, linear layer for continuous states). The embeddings are then processed by an autoregressive transformer model, trained to predict the next action given the previous tokens using a linear output layer.
Evaluation is also easy: we can initialize by a desired target return (e.g. 1 or 0 for success or failure) and the starting state in the environment. Unrolling the sequence -- similar to standard autoregressive generation in language models -- yields a sequence of actions to execute in the environment.
Stitching subsequences to produce optimal trajectories
Consider the task of finding the shortest path on a fixed graph, posed as a reinforcement learning problem (accumulated reward = sum of edge weights). In a training dataset consisting of random walks, we observe many suboptimal trajectories. If we train Decision Transformer on these sequences, we can ask the model to generate an optimal path by conditioning on a large return. We find that by training on only random walks, Decision Transformer can learn to stitch together subsequences from different training trajectories in order to produce optimal trajectories at test time!
In fact, this is the same behavior which is desired from off-policy Q-learning algorithms commonly used in offline reinforcement learning frameworks. However, without needing to introduce TD learning algorithms, value pessimism, or behavior regularization , we can achieve the same behavior using a sequence modeling framework!
Comparisons on offline RL benchmarks
We find these ideas extend to benchmarks commonly used in offline RL literature -- namely, the Atari Learning Environment and OpenAI Gym, as well as a Minigrid Key-To-Door task. Across this diverse set of tasks spanning both discrete and continuous control as well as state and image observations, we find Decision Transformer can match the performance of well-studied and specialized TD learning algorithms developed for these settings.
Sequence modeling as multitask learning
One effect of this type of modeling is that we perform conditional generation, where we initialize a trajectory by inputting our desired return. Decision Transformer does not yield a single policy; rather, it models a wide distribution of policies. If we plot average achieved return against the target return of a trained Decision Transformer, we find distinct policies are learned that can reasonably match the target, trained only with supervised learning. Furthermore, on some tasks (such as Qbert and Seaquest), we find Decision Transformer can actually extrapolate outside of the dataset and model policies achieving higher return!
For more details and results, see our paper. We're excited by the possibility for combining well-established ideas from language modeling with reinforcement learning settings, closing the gap between practitioners in two previously distinct subfields.