Generalized Decision Transformer for Offline Hindsight Information Matching

Hiroki Furuta, Yutaka Matsuo, Shixiang Shane Gu

International Conference on Learning Representations (ICLR2022), Spotlight

OpenReview, arXiv, GitHub

Overview


Extracting as much learning signal as possible from each trajectory is a key problem in reinforcement learning (RL). Recent works have shown that conditioning on future trajectory information -- such as future states in hindsight experience replay (HER) or returns-to-go in Decision Transformer (DT) -- enables efficient learning of context-conditioned policies. We demonstrate that all these approaches are essentially doing hindsight information matching (HIM) -- training policies that output the rest of the trajectory so that it matches some statistics of future state information. We introduce the Generalized Decision Transformer (GDT) framework and show how different choices of the feature function Φ(s, a) and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of future information offline.

Concept of Generalized Decision Transformer

  • DT (Chen et al. 2021): Φ(s, a) = r(s, a), with summation as the anti-causal aggregator

  • Categorical DT (Ours): Φ(s, a) = r(s, a) or any function of the state-action pair, with binning as the anti-causal aggregator

  • Bi-directional DT (Ours): Φ(s, a) = r(s, a) or any function of the state-action pair, with a transformer as the anti-causal aggregator
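The first two aggregators above can be sketched in a few lines. This is a minimal, hypothetical illustration (function names and binning scheme are ours, not from the released code): summation over the future suffix of Φ recovers returns-to-go when Φ is the reward (DT), while binning the future suffix yields a normalized categorical distribution of Φ values (CDT).

```python
def returns_to_go(phis):
    """DT aggregator: suffix sums of Phi(s, a) over the future segment.
    When Phi is the reward, this is exactly the returns-to-go sequence."""
    out, total = [], 0.0
    for phi in reversed(phis):  # anti-causal: accumulate from the end
        total += phi
        out.append(total)
    return out[::-1]

def categorical_aggregate(phis, bin_edges):
    """CDT aggregator: for each timestep t, a normalized histogram of the
    future Phi values phis[t:], using the given bin edges."""
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    hists = []
    for phi in reversed(phis):  # anti-causal: grow counts from the end
        for b in range(n_bins):
            if bin_edges[b] <= phi < bin_edges[b + 1]:
                counts[b] += 1
                break
        total = sum(counts)
        hists.append([c / total for c in counts])
    return hists[::-1]

# Example: rewards [1, 0, 1] give returns-to-go [2, 1, 1];
# scalar features binned into [0, 0.5) and [0.5, 1) give per-step
# future-feature distributions.
print(returns_to_go([1.0, 0.0, 1.0]))
print(categorical_aggregate([0.1, 0.9], [0.0, 0.5, 1.0]))
```

Either output sequence is then fed to the policy as the hindsight conditioning information in place of DT's scalar return-to-go token.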

Experiments & Video

One of the variants, Categorical DT, can use any Φ for which binning is tractable (e.g. reward, xyz-velocities, etc.) and can solve offline multi-task state-marginal matching problems. We provide the following two benchmark problems and results.


Synthesized Bi-modal Distribution

Although CDT was trained only on uni-modal trajectories (cheetah running forward, or backflipping; top two gifs), it successfully matches the patchworked bi-modal distribution: the cheetah runs forward first and then backflips within a single rollout (bottom three gifs). We term this "distribution stitching".
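One way to read "patchworked": the bi-modal target distribution can be formed by mixing the two uni-modal experts' binned feature histograms. The sketch below is purely illustrative (the mixing helper and the toy histograms are our assumptions, not the paper's exact construction):

```python
def mix_histograms(hist_a, hist_b, weight=0.5):
    """Patchwork two normalized feature histograms into one bi-modal target
    distribution for CDT conditioning."""
    assert len(hist_a) == len(hist_b), "histograms must share bin edges"
    return [weight * a + (1.0 - weight) * b for a, b in zip(hist_a, hist_b)]

# Toy 3-bin histograms over some Phi (e.g. x-velocity):
forward = [0.0, 0.0, 1.0]   # forward-running expert: mass in the high bin
backflip = [1.0, 0.0, 0.0]  # backflipping expert: mass in the low bin
target = mix_histograms(forward, backflip)
print(target)  # bi-modal target with mass in both outer bins
```

Conditioning CDT on such a mixed target, rather than either expert's own histogram, is what elicits the stitched run-then-backflip rollout.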

Running Forward Expert

Backflipping Expert

Run and Turn Behavior Learned by CDT

Example # 1

Example # 2

Diverse Unseen Distribution

When trained on a sufficiently diverse dataset, CDT can also match target distributions unseen during training. In the cheetah-velocity task from the meta-RL/IL literature, CDT makes the cheetah run at unseen target x-velocities (x-vel = 0.5, 1.5, 2.5).