Online Decision Transformer
Qinqing Zheng, Amy Zhang, Aditya Grover
Meta AI Research, UC Berkeley, UCLA
Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via task-specific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.
While retaining simplicity, ODT redesigns Decision Transformer (DT, Chen et al., 2021) policies to enable online learning.
1. We consider stochastic policies and optimize their likelihood augmented with novel entropy regularizers. These regularizers operate at the sequence level and guide exploration (see the policy-loss sketch after this list).
As shown in the plot, stochasticity is key to enabling stable online training, whereas a deterministic variant of ODT exhibits high variance.
2. We carefully design the target returns-to-go (RTG) token for exploration. We find that a fixed RTG slightly greater than expert performance works best for ODT, outperforming quantile-based curriculum strategies.
3. The conditioned RTG can deviate from the RTG actually achieved during trajectory rollouts. We correct for this mismatch via hindsight relabelling of the trajectory returns (see the rollout sketch after this list).
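To make item 1 concrete, below is a minimal sketch of a sequence-level, entropy-regularized training loss for a stochastic (Gaussian) policy head, written against the Lagrangian form "minimize NLL subject to average entropy ≥ target". The function name, tensor names, and the use of a learnable log-temperature are illustrative assumptions, not the paper's exact implementation.

```python
from torch.distributions import Normal, Independent


def odt_policy_loss(action_mean, action_logstd, actions, mask,
                    log_temperature, target_entropy):
    """Sequence-level entropy-regularized NLL for a stochastic DT policy (sketch).

    action_mean, action_logstd: [batch, context_len, act_dim] from the policy head.
    actions:                    [batch, context_len, act_dim] from the replay buffer.
    mask:                       [batch, context_len] float, 1 for real timesteps, 0 for padding.
    log_temperature:            learnable scalar (log of the Lagrange multiplier).
    target_entropy:             lower bound on the average per-timestep entropy.
    """
    # One diagonal Gaussian per timestep in the context window.
    dist = Independent(Normal(action_mean, action_logstd.exp()), 1)

    # Negative log-likelihood averaged over the valid timesteps of the sequence.
    nll = -(dist.log_prob(actions) * mask).sum() / mask.sum()

    # Sequence-level entropy: average differential entropy over the context window.
    entropy = (dist.entropy() * mask).sum() / mask.sum()

    # Lagrangian of "minimize NLL subject to entropy >= target_entropy".
    temperature = log_temperature.exp()
    policy_loss = nll - temperature.detach() * entropy
    temperature_loss = temperature * (entropy.detach() - target_entropy)
    return policy_loss, temperature_loss
```

The two losses would be minimized with separate optimizers; the temperature update, similar in spirit to SAC's automatic temperature tuning, pushes the average entropy toward the target and thereby controls exploration.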
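Items 2 and 3 can be illustrated with a rollout sketch: condition on a fixed exploration RTG slightly above expert return, decrement it as rewards accrue, and afterwards relabel the stored returns-to-go with the rewards actually observed. The `policy.act` interface and buffer layout are hypothetical, and the classic Gym 4-tuple step API is assumed.

```python
import numpy as np


def rollout_and_relabel(env, policy, exploration_rtg, gamma=1.0):
    """Collect one online episode and hindsight-relabel its returns-to-go (sketch)."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    rtg = exploration_rtg            # fixed target, set slightly above expert return
    while not done:
        states.append(state)
        action = policy.act(states, actions, rtg)   # conditions on recent context + RTG
        state, reward, done, _ = env.step(action)
        actions.append(action)
        rewards.append(reward)
        rtg -= reward                # decrement the conditioning RTG as reward accrues

    # Hindsight return relabelling: overwrite the RTG tokens with the returns the
    # trajectory actually achieved, so replayed sequences are self-consistent.
    returns_to_go = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # gamma = 1: undiscounted returns-to-go
        returns_to_go[t] = running

    return dict(states=np.asarray(states), actions=np.asarray(actions),
                rewards=np.asarray(rewards), returns_to_go=returns_to_go)
```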
ODT is competitive with the state-of-the-art in absolute performance but shows much more significant gains during finetuning.
Baselines: DT, Implicit Q-Learning (IQL; Kostrikov et al., 2021), and Soft Actor-Critic (SAC; Haarnoja et al., 2018a).
Benchmark: D4RL