Online Decision Transformer
Qinqing Zheng, Amy Zhang, Aditya Grover
Meta AI Research, UC Berkeley, UCLA
Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via task-specific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.
While retaining simplicity, ODT redesigns Decision Transformer (DT, Chen et al., 2021) policies to enable online learning.
1. We consider stochastic policies and optimize their likelihood augmented with novel entropy regularizers. These regularizers operate at the sequence level and guide exploration (see the policy-loss sketch after this list).
As shown in the plot, stochasticity is key to enabling stable online training, whereas a deterministic variant of ODT exhibits high variance.
2. We carefully design the target returns-to-go (RTG) token for exploration. We find that a fixed RTG slightly greater than expert performance works best for ODT, outperforming quantile-based curriculum strategies.
3. The conditioned RTG can deviate from the RTG actually achieved during trajectory rollouts. We correct for this mismatch via hindsight relabelling of the trajectory returns (see the rollout sketch after this list).
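To make item 1 concrete, below is a minimal sketch of a sequence-level, entropy-regularized training loss for a stochastic (Gaussian) policy head, written against the Lagrangian form "minimize NLL subject to average entropy ≥ target". The function name, tensor names, and the use of a learnable log-temperature are illustrative assumptions, not the paper's exact implementation.

```python
from torch.distributions import Normal, Independent


def odt_policy_loss(action_mean, action_logstd, actions, mask,
                    log_temperature, target_entropy):
    """Sequence-level entropy-regularized NLL for a stochastic DT policy (sketch).

    action_mean, action_logstd: [batch, context_len, act_dim] from the policy head.
    actions:                    [batch, context_len, act_dim] from the replay buffer.
    mask:                       [batch, context_len] float, 1 for real timesteps, 0 for padding.
    log_temperature:            learnable scalar (log of the Lagrange multiplier).
    target_entropy:             lower bound on the average per-timestep entropy.
    """
    # One diagonal Gaussian per timestep in the context window.
    dist = Independent(Normal(action_mean, action_logstd.exp()), 1)

    # Negative log-likelihood averaged over the valid timesteps of the sequence.
    nll = -(dist.log_prob(actions) * mask).sum() / mask.sum()

    # Sequence-level entropy: average differential entropy over the context window.
    entropy = (dist.entropy() * mask).sum() / mask.sum()

    # Lagrangian of "minimize NLL subject to entropy >= target_entropy".
    temperature = log_temperature.exp()
    policy_loss = nll - temperature.detach() * entropy
    temperature_loss = temperature * (entropy.detach() - target_entropy)
    return policy_loss, temperature_loss
```

The two losses would be minimized with separate optimizers; the temperature update, similar in spirit to SAC's automatic temperature tuning, pushes the average entropy toward the target and thereby controls exploration.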
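Items 2 and 3 can be illustrated with a rollout sketch: condition on a fixed exploration RTG slightly above expert return, decrement it as rewards accrue, and afterwards relabel the stored returns-to-go with the rewards actually observed. The `policy.act` interface and buffer layout are hypothetical, and the classic Gym 4-tuple step API is assumed.

```python
import numpy as np


def rollout_and_relabel(env, policy, exploration_rtg, gamma=1.0):
    """Collect one online episode and hindsight-relabel its returns-to-go (sketch)."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    rtg = exploration_rtg            # fixed target, set slightly above expert return
    while not done:
        states.append(state)
        action = policy.act(states, actions, rtg)   # conditions on recent context + RTG
        state, reward, done, _ = env.step(action)
        actions.append(action)
        rewards.append(reward)
        rtg -= reward                # decrement the conditioning RTG as reward accrues

    # Hindsight return relabelling: overwrite the RTG tokens with the returns the
    # trajectory actually achieved, so replayed sequences are self-consistent.
    returns_to_go = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # gamma = 1: undiscounted returns-to-go
        returns_to_go[t] = running

    return dict(states=np.asarray(states), actions=np.asarray(actions),
                rewards=np.asarray(rewards), returns_to_go=returns_to_go)
```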
ODT is competitive with the state-of-the-art in absolute performance but shows much more significant gains during finetuning.
Baselines: DT, Implicit Q-Learning (IQL; Kostrikov et al., 2021), and Soft Actor-Critic (SAC; Haarnoja et al., 2018a).
Benchmark: D4RL