Egor Cherepanov1,2*, Alexey Staroverov1*, Dmitry Yudin1,2, Alexey K. Kovalev1,2, and Aleksandr I. Panov1,2
1AIRI, Moscow, Russia, 2MIPT, Dolgoprudny, Russia
{cherepanov,staroverov,yudin,kovalev,panov}@airi.net
*Equal contribution
Recurrent Action Transformer with Memory (RATE) is a novel model architecture for offline reinforcement learning that incorporates a recurrent memory mechanism designed to regulate information retention. RATE outperforms other memory-enhanced agents in memory-intensive environments and shows better or comparable results in classic tasks such as Atari and MuJoCo.
In the real world, most tasks are naturally described as Partially Observable Markov Decision Processes (POMDPs), in which the current decision depends on historical context.
Decision Transformer (DT), a Transformer-based architecture designed for offline RL, can use information about past events only while those events remain within the model's context. However, due to the quadratic complexity of the attention mechanism, increasing the context size is not always feasible, which necessitates memory mechanisms. Such mechanisms make it possible to take out-of-context information into account and thus to operate efficiently in memory-intensive environments.
In RATE, we add memory through recurrently trained memory embeddings and caching of previous hidden states, and we use a Memory Retention Valve (MRV) to control information leakage from the memory embeddings, which allows sparse sequences to be processed.
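To make the MRV idea concrete, here is a minimal PyTorch sketch of a cross-attention valve over memory embeddings. The module name, the residual and normalization details, and the choice of which memory state provides the queries are illustrative assumptions for this sketch, not the exact implementation from the paper.

```python
import torch
import torch.nn as nn

class MemoryRetentionValve(nn.Module):
    """Hypothetical sketch of a cross-attention 'valve' over memory embeddings.

    The previous memory state queries the updated memory produced after a
    segment is processed, so stale but still-relevant information is not
    simply overwritten. The exact wiring in RATE may differ.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, prev_mem: torch.Tensor, new_mem: torch.Tensor) -> torch.Tensor:
        # prev_mem, new_mem: (batch, num_mem_tokens, d_model)
        retained, _ = self.cross_attn(query=prev_mem, key=new_mem, value=new_mem)
        # Residual connection keeps the old memory reachable if the valve closes.
        return self.norm(prev_mem + retained)
```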
A trajectory consisting of returns-to-go, observations, and actions is divided into segments; trainable memory embeddings are concatenated to each segment, and RATE is trained recurrently over the resulting sequence of segments. After each segment is processed, the updated memory embeddings are passed through the MRV block, which is based on the cross-attention mechanism.
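The overall recurrent scheme can be sketched as follows. This is a simplified illustration: the model signature, the behavior-cloning loss on actions, and the treatment of the hidden-state cache are assumptions made for this sketch, not the authors' actual training code.

```python
import torch
import torch.nn.functional as F

def train_on_trajectory(model, valve, rtg, obs, act, seg_len, mem_tokens, optimizer):
    """Segment-wise recurrent training step for a RATE-like model (sketch).

    rtg/obs/act: (batch, T, ...) tensors for one trajectory.
    mem_tokens:  (batch, num_mem_tokens, d_model) trainable memory embeddings.
    """
    T = obs.shape[1]
    mem, cache = mem_tokens, None  # model is assumed to accept cache=None initially
    total_loss = 0.0
    for start in range(0, T, seg_len):
        sl = slice(start, start + seg_len)
        # The model consumes one segment plus the current memory tokens and the
        # cached hidden states of the previous segment (Transformer-XL style).
        pred_act, new_mem, cache = model(rtg[:, sl], obs[:, sl], act[:, sl], mem, cache)
        mem = valve(mem, new_mem)              # MRV gates what enters memory
        cache = [h.detach() for h in cache]    # stop gradients through the cache
        # MSE for continuous actions; discrete actions would use cross-entropy.
        total_loss = total_loss + F.mse_loss(pred_act, act[:, sl])
    optimizer.zero_grad()
    total_loss.backward()  # BPTT flows through the memory embeddings across segments
    optimizer.step()
    return total_loss.item()
```

Note the asymmetry: gradients are propagated through the memory embeddings across segments (recurrent training), while the hidden-state cache is detached, as in Transformer-XL.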
We designed experiments to accomplish two primary objectives: (a) to demonstrate the advantage of our RATE model in memory-intensive environments (T-Maze, ViZDoom-Two-Colors, Memory Maze, Minigrid-Memory), and (b) to investigate the effectiveness of the proposed model in classical MDPs to demonstrate its versatility (Atari and MuJoCo).
In the ViZDoom-Two-Colors environment, Decision Transformer is unable to remember the pillar color, so its total reward drops dramatically, in contrast to the memory-enhanced agents RATE, RMT, and TrXL (figure c, d). In turn, RATE outperforms RMT and TrXL on average (figure a) and is more stable (figure e, f, g).
In the T-Maze environment, which has a highly sparse structure and reward function, all models were trained on corridors of length 1 to 90 and validated on corridors of length 1 to 900 (out-of-context). RATE significantly outperforms DT and the memory-enhanced baselines RMT and TrXL.
In the Minigrid-Memory environment, agents were trained on mazes of size 31x31 and validated on mazes from 11x11 to 91x91. RATE interpolates slightly worse than TrXL but better than RMT, and extrapolates slightly worse than RMT but better than TrXL; on average, it is therefore more stable than either baseline.
In Atari and MuJoCo tasks, RATE not only outperforms DT but also delivers results that are better than or comparable to those of modern Mamba-based architectures. These results confirm that RATE works well not only in memory-intensive environments but also in classical tasks.
In this paper, we propose the Recurrent Action Transformer with Memory (RATE), a transformer model for offline RL that combines two memory mechanisms, memory embeddings and caching of hidden states, with a Memory Retention Valve (MRV) that controls memory updates and prevents the loss of important information in sparse tasks.
The proposed model interpolates and extrapolates well beyond the transformer's context, retains important information over long horizons in highly sparse environments, and compensates for bias in the training data.
If you find our work useful, please cite our paper:
@misc{cherepanov2024rate,
      title={Recurrent Action Transformer with Memory},
      author={Egor Cherepanov and Alexey Staroverov and Dmitry Yudin and Alexey K. Kovalev and Aleksandr I. Panov},
      year={2024},
      eprint={2306.09459},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2306.09459},
}