Generalized Decision Transformer for Offline Hindsight Information Matching
Hiroki Furuta, Yutaka Matsuo, Shixiang Shane Gu
International Conference on Learning Representations (ICLR2022), Spotlight
Overview
How to extract as much learning signal as possible from each trajectory has been a key problem in reinforcement learning (RL). Recent works have shown that conditioning on future trajectory information -- such as future states in hindsight experience replay (HER) or returns-to-go in Decision Transformer (DT) -- enables efficient learning of context-conditioned policies. We demonstrate that all these approaches are essentially doing hindsight information matching (HIM): training policies that can output the rest of the trajectory so that it matches some statistics of future state information. We introduce the Generalized Decision Transformer (GDT) framework and show how different choices for the feature function Φ(s, a) and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future information offline.
Concept of Generalized Decision Transformer
DT (Chen et al. 2021): Φ(s, a) = r(s, a) , and using summation as anti-causal aggregator
Categorical DT (Ours): Φ(s, a) = r(s, a) or any function of state-action pair, and using binning as anti-causal aggregator
Bi-directional DT (Ours): Φ(s, a) = r(s, a) or any function of state-action pair, and using transformer as anti-causal aggregator
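The two tractable aggregators above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, features are assumed to be a per-timestep NumPy array, and the BDT case (a learned anti-causal transformer) is omitted since it is a trained network rather than a fixed operator.

```python
import numpy as np

def dt_aggregator(phi):
    """DT: anti-causal sum. When phi is the reward, this is the returns-to-go."""
    # Reverse cumulative sum: z_t = sum over t' >= t of phi_{t'}
    return np.cumsum(phi[::-1], axis=0)[::-1]

def cdt_aggregator(phi, bins):
    """Categorical DT: anti-causal binning of future features.

    For each timestep t, z_t is the empirical (normalized) histogram of
    Phi over the remaining trajectory phi[t:], i.e. a categorical
    approximation of the future state-feature marginal.
    """
    T = len(phi)
    z = np.zeros((T, len(bins) - 1))
    for t in range(T):
        counts, _ = np.histogram(phi[t:], bins=bins)
        z[t] = counts / counts.sum()
    return z
```

Conditioning the policy on `z_t` instead of a scalar return is what lets CDT target a whole distribution of future features rather than their sum.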
Experiments & Video
One of our variants, Categorical DT, can use any Φ as long as binning it is tractable (e.g. reward, xyz-velocities, etc.) and solves the offline multi-task state-marginal matching problem. We provide the following two benchmark problems and results.
Synthesized Bi-modal Distribution
While it was trained only on uni-modal trajectories (cheetah running forward, or backflipping; top two gifs), CDT successfully matches the patchworked bi-modal distribution: the cheetah runs forward first and then backflips within a single rollout (bottom three gifs). We term this "distribution stitching".
Running Forward Expert
Backflipping Expert
Run and Turn behavior Learned by CDT
Example # 1
Example # 2
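The bi-modal target in this experiment can be viewed as a mixture of the two experts' feature histograms. A hypothetical sketch, assuming both histograms are defined over the same Φ bins (the function name and mixing weight are illustrative, not from the paper):

```python
import numpy as np

def stitch_targets(hist_forward, hist_backflip, w=0.5):
    """Mix two uni-modal feature histograms into one bi-modal target.

    hist_forward, hist_backflip: normalized histograms over identical
    Phi bins (e.g. x-velocity bins). Returns a normalized mixture that
    places mass on both modes, which CDT is then asked to match.
    """
    target = w * hist_forward + (1.0 - w) * hist_backflip
    return target / target.sum()  # renormalize against rounding error
```

Conditioning on such a mixture at test time asks the policy to spend part of the rollout in each mode, even though no single training trajectory covers both.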
Diverse Unseen Distribution
When trained on a sufficiently diverse dataset, CDT can also match target distributions unseen during training. In the cheetah-velocity task from meta-RL/IL benchmarks, CDT enables the cheetah to run at unseen target x-velocities (x-vel = 0.5, 1.5, 2.5).
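Conditioning at an unseen velocity amounts to handing CDT a target histogram concentrated at that value. A hypothetical sketch (the helper name and one-hot target choice are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def velocity_target(v, bins):
    """One-hot target histogram placing all mass in the bin containing v.

    bins: monotonically increasing bin edges over x-velocity.
    The resulting vector can be fed to CDT as the target future
    feature distribution, even for velocities never seen in training.
    """
    idx = int(np.clip(np.digitize(v, bins) - 1, 0, len(bins) - 2))
    target = np.zeros(len(bins) - 1)
    target[idx] = 1.0
    return target
```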