You Can't Count on Luck:

Why Decision Transformers Fail in Stochastic Environments

Keiran Paster, Sheila McIlraith, Jimmy Ba

[Paper: arXiv] [Code: GitHub]

Abstract

Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments, since trajectories that attain a given return may have done so only due to luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than conditioning on the return of a single trajectory, as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity. Doing so allows ESPER to achieve strong alignment between the target return and expected performance in real environments. We demonstrate this on several challenging stochastic offline-RL tasks, including the puzzle game 2048 and Connect Four played against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and the achieved return than simply conditioning on returns. ESPER also achieves a higher maximum performance than even value-based baselines.

Method

Gambling Environment

To understand why Decision Transformer fails in stochastic environments, consider a simple environment with three actions: two slot machines with stochastic payouts and one action that guarantees a positive return.
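To make this concrete, here is a minimal sketch of such an environment in Python. The post does not specify the exact payouts, so the reward values, probabilities, and single-step structure below are illustrative assumptions only: a safe action that always pays out, and two slot machines, one of which (the purple machine) sometimes pays the same reward as the safe action but loses on average.

```python
import numpy as np

# Illustrative sketch of the gambling environment described above (not the
# paper's exact payoffs): one episode is a single action followed by a reward.
ACTIONS = ["safe", "blue_machine", "purple_machine"]

def step(action, rng):
    """Play one episode and return its reward. Payoff values are hypothetical."""
    if action == "safe":
        return 1.0                      # guaranteed positive return
    if action == "blue_machine":
        return rng.choice([1.0, -1.0])  # risky, zero mean (assumed)
    if action == "purple_machine":
        # Sometimes pays out 1, but loses on average (assumed payoffs).
        return rng.choice([1.0, -6.0])
    raise ValueError(f"unknown action: {action}")

rng = np.random.default_rng(0)
for a in ACTIONS:
    mean = np.mean([step(a, rng) for _ in range(20_000)])
    print(f"{a:15s} expected return ~ {mean:+.2f}")
```

Even though the purple machine's expected return is negative, half of its plays still end with a reward of 1; this is exactly the kind of lucky outcome a return-conditioned model can latch onto.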

Reward-Conditioned Policy

Decision Transformer learns to act by modeling the distribution of actions conditioned on an outcome, such as achieving a particular return. In this gambling environment, when conditioning on achieving a reward of 1, the agent learns to sometimes play the purple slot machine, even though the expected return of playing it is negative.
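A return-conditioned policy trained with supervised learning converges to the empirical action distribution given the target return. Tabulating that distribution on random play in the toy environment above (reusing the hypothetical ACTIONS and step from the sketch; this is an illustration, not the paper's experiment) already exposes the failure mode without training a transformer at all:

```python
from collections import Counter

import numpy as np

# Empirical p(action | return = 1) under a uniform-random behavior policy.
# This is the distribution a return-conditioned model fit by supervised
# learning would imitate when asked for a return of 1.
rng = np.random.default_rng(1)
dataset = [(a, step(a, rng)) for a in rng.choice(ACTIONS, size=100_000)]

conditioned = Counter(a for a, r in dataset if r == 1.0)
total = sum(conditioned.values())
for action, count in conditioned.most_common():
    print(f"p({action} | return = 1) ~ {count / total:.2f}")
# The purple machine appears with substantial probability: its lucky rollouts
# also ended with a return of 1, so an agent conditioned on a target return
# of 1 will sometimes gamble, even though the gamble loses on average.
```

In a deterministic environment, imitating this conditional distribution would be fine; in a stochastic one, the return is partly determined by luck rather than by the action alone, so the imitated behavior does not reproduce the target return.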

ESPER: Environment-Stochasticity-Independent Representations

In our work, we show that the fundamental problem is that when conditioning on returns, the agent can have an unrealistic view of the environment dynamics. For example, filtering for trajectories that achieve a return of 1 throws out all of the bad outcomes of playing the purple slot machine.
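As a small worked example, using the same hypothetical payoffs as the sketch above (the purple machine pays +1 or -6 with equal probability; the post does not give the real numbers):

```latex
% Unconditioned, the purple machine loses on average:
\mathbb{E}[R \mid a = \text{purple}] = 0.5\,(+1) + 0.5\,(-6) = -2.5.
% But among the trajectories kept after filtering for R = 1, every purple play won,
% so the reward and transition statistics the agent sees are distorted:
P(\text{win} \mid a = \text{purple},\, R = 1) = 1
\quad\text{whereas}\quad
P(\text{win} \mid a = \text{purple}) = 0.5.
```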

Our approach, ESPER, clusters trajectories using an adversarial loss so that within each cluster, state transitions are distributed realistically. Rather than train a policy that conditions on returns of individual trajectories, ESPER conditions on the average returns in these clusters.
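Below is a heavily simplified, single-step PyTorch sketch of this idea; it is not the authors' implementation. The network shapes, the hyperparameters (STATE_DIM, N_CLUSTERS, BETA), and the per-batch averaging of cluster returns are assumptions made for illustration. The full method operates on whole trajectories with sequence models and also trains the clusters to stay predictive of the agent's own actions and returns, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified ESPER-style training step (illustrative only). Shapes and
# hyperparameters are assumptions; states are vectors, actions are one-hot.
STATE_DIM, ACT_DIM, N_CLUSTERS, BETA = 4, 3, 8, 1.0

encoder = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, 64), nn.ReLU(),
                        nn.Linear(64, N_CLUSTERS))        # trajectory -> cluster logits
adversary = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM + N_CLUSTERS, 64), nn.ReLU(),
                          nn.Linear(64, STATE_DIM))       # tries to predict the next state
policy = nn.Sequential(nn.Linear(STATE_DIM + 1, 64), nn.ReLU(),
                       nn.Linear(64, ACT_DIM))            # conditioned on a scalar return

enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
adv_opt = torch.optim.Adam(adversary.parameters(), lr=1e-3)
pol_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_step(states, actions, next_states, returns):
    """One simplified update on a batch of transitions (real ESPER uses full trajectories)."""
    feats = torch.cat([states, actions], dim=-1)
    cluster = F.gumbel_softmax(encoder(feats), hard=True)  # discrete cluster assignment

    # 1) The adversary learns to predict environment transitions from (s, a, cluster).
    pred = adversary(torch.cat([feats, cluster.detach()], dim=-1))
    adv_loss = F.mse_loss(pred, next_states)
    adv_opt.zero_grad(); adv_loss.backward(); adv_opt.step()

    # 2) The encoder is trained adversarially: cluster labels should NOT help the
    #    adversary, so they end up independent of the environment's stochasticity.
    pred = adversary(torch.cat([feats, cluster], dim=-1))
    enc_loss = -BETA * F.mse_loss(pred, next_states)
    enc_opt.zero_grad(); enc_loss.backward(); enc_opt.step()

    # 3) Each sample's conditioning target is the *average* return of its cluster
    #    (computed per batch here for simplicity), and the policy imitates actions
    #    given that average instead of the trajectory's own return.
    with torch.no_grad():
        counts = cluster.sum(dim=0).clamp(min=1.0)
        cluster_avg = (cluster * returns.unsqueeze(-1)).sum(dim=0) / counts
        target_return = (cluster @ cluster_avg).unsqueeze(-1)
    logits = policy(torch.cat([states, target_return], dim=-1))
    pol_loss = F.cross_entropy(logits, actions.argmax(dim=-1))
    pol_opt.zero_grad(); pol_loss.backward(); pol_opt.step()
    return adv_loss.item(), enc_loss.item(), pol_loss.item()
```

At test time, the policy is conditioned on a desired average return rather than the return of any single (possibly lucky) trajectory, which is why the target and achieved returns stay aligned.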

Results

The following plots show the achieved performance (averaged over many trials) when the agent is made to target a particular return. The histograms show which returns are in-distribution for each agent, and the shaded region marks target returns that are out-of-distribution for ESPER.

While a return-conditioned agent cannot achieve a return of 1 consistently in the gambling environment, ESPER can achieve any possible return.

We tested ESPER in Connect Four against a stochastic opponent that usually plays optimally but occasionally plays in the rightmost column instead. ESPER learns clusters corresponding to winning and losing behavior, while a return-conditioned agent will sometimes wrongly assume that the stochastic opponent will make a mistake.

We tested ESPER in a modified 2048 environment where the agent wins if it creates a 128 tile. While a return-conditioned agent cannot disentangle trajectories that won due to good behavior from those where the agent simply got lucky, ESPER learns clusters corresponding to random and expert behavior and can solve the task.

Scaling

We trained ESPER and the return-conditioned agent with varying amounts of data. ESPER's performance improves with more data, while the return-conditioned agent's does not. This shows that poor performance on stochastic tasks cannot simply be solved with scale.