Recurrent Action Transformer with Memory (RATE) is a novel offline reinforcement learning architecture that combines memory embeddings, hidden state caching, and a Memory Retention Valve (MRV). This design allows the model to selectively retain information and extrapolate successfully over sequences up to 9,600 steps long. RATE consistently outperforms baselines in memory-intensive environments and remains highly competitive on standard benchmarks like Atari and MuJoCo.
In real-world scenarios, agents often receive only partial observations, framing tasks as Partially Observable Markov Decision Processes (POMDPs) in which historical context is vital for success. Standard architectures like the Decision Transformer struggle here: they can only use past events that fit within a fixed context window, and expanding that window is computationally prohibitive due to the quadratic cost of self-attention, so the agent fails when crucial cues appear thousands of steps earlier. Memory mechanisms overcome this limitation by carrying out-of-context information forward through segment-level recurrence, allowing the agent to explicitly retain and retrieve critical cues and to navigate long-horizon, memory-intensive environments.
To equip the model with memory, RATE introduces segment-level recurrence that combines learnable memory embeddings, caching of past hidden states, and a novel Memory Retention Valve (MRV).
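To make the dataflow concrete, below is a minimal PyTorch-style sketch of this segment-level recurrence; the backbone, mrv, and cache interfaces are illustrative assumptions, not the paper's exact API:

import torch

def process_trajectory(segments, backbone, mrv, init_mem):
    # Segment-level recurrence: each segment is processed together with
    # read/write memory tokens and a cache of the previous segment's
    # hidden states.
    # segments: list of tensors, each (batch, seg_tokens, d_model)
    # init_mem: learnable memory embeddings, (batch, n_mem, d_model)
    mem, cache, outputs = init_mem, None, []
    n_mem = init_mem.size(1)
    for seg in segments:
        # Prepend memory for reading, append a copy for writing.
        tokens = torch.cat([mem, seg, mem], dim=1)
        hidden, cache = backbone(tokens, cache=cache)  # TrXL-style caching (assumed interface)
        updated_mem = hidden[:, -n_mem:]               # states of the appended (write) slots
        # The MRV gates what enters memory, protecting stored information.
        mem = mrv(prev_mem=mem, new_mem=updated_mem)
        outputs.append(hidden[:, n_mem:-n_mem])        # per-token outputs for this segment
    return torch.cat(outputs, dim=1), mem

Only the compact memory tokens and the hidden-state cache cross segment boundaries, so the per-segment attention cost stays constant no matter how long the trajectory grows.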
Input trajectories, composed of returns-to-go, observations, and actions, are divided into non-overlapping segments; memory embeddings are both prepended and appended to each segment, giving the transformer simultaneous read and write access to memory.
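As an illustration of this tokenization, the sketch below (with assumed linear embedding modules embed_r, embed_s, and embed_a) interleaves the three modalities per timestep and splits the result into non-overlapping segments:

import torch

def make_segments(rtg, obs, act, embed_r, embed_s, embed_a, seg_len):
    # rtg: (B, T, 1), obs: (B, T, obs_dim), act: (B, T, act_dim)
    # Each embed_* module is assumed to project its input to d_model.
    B, T = rtg.shape[:2]
    # Stack per timestep in (return-to-go, observation, action) order,
    # then flatten to an interleaved token sequence of length 3*T.
    tokens = torch.stack(
        [embed_r(rtg), embed_s(obs), embed_a(act)], dim=2
    ).reshape(B, 3 * T, -1)
    assert T % seg_len == 0, "pad the trajectory to a multiple of seg_len"
    # Non-overlapping segments of 3*seg_len tokens each.
    return tokens.split(3 * seg_len, dim=1)

After the memory embeddings are attached, each segment fed to the transformer carries n_mem read tokens at the front and n_mem write tokens at the back in addition to its 3*seg_len trajectory tokens.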
After a segment is processed, the MRV uses a cross-attention mechanism to selectively filter the updated memory embeddings, preventing catastrophic overwriting of critical information across extremely long and sparse sequences.
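One way such a valve can be realized, sketched here with standard PyTorch cross-attention (the exact gating, normalization, and hyperparameters are assumptions rather than the paper's specification):

import torch
import torch.nn as nn

class MemoryRetentionValve(nn.Module):
    # Cross-attention from the previous memory (queries) to the updated
    # memory (keys/values): each slot pulls in only the new content it
    # finds relevant, instead of being overwritten wholesale.
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, prev_mem: torch.Tensor, new_mem: torch.Tensor) -> torch.Tensor:
        # prev_mem, new_mem: (batch, n_mem, d_model)
        filtered, _ = self.attn(query=prev_mem, key=new_mem, value=new_mem)
        # The residual path keeps previously stored content reachable even
        # when the update contributes little.
        return self.norm(prev_mem + filtered)

Because the previous memory supplies the queries and a residual connection is kept, a slot whose query matches nothing in the update largely retains its old value, which is what protects sparse, long-lived cues.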
We designed our experiments to achieve two main goals: (a) to showcase the strengths of the RATE model in memory-intensive environments, and (b) to assess its effectiveness in standard MDPs to demonstrate its overall versatility. To evaluate its memory capabilities, we tested RATE on T-Maze, ViZDoom-Two-Colors, MemoryMaze, Minigrid-Memory, and the POPGym benchmark. Additionally, we evaluated the model on classic Atari and MuJoCo environments to confirm that its memory mechanisms do not hinder its performance on standard RL tasks.
In the ViZDoom-Two-Colors environment, the Decision Transformer fails to remember the pillar color, and its total reward consequently drops dramatically, in contrast to memory-enhanced agents (RATE, RMT, TrXL; figure panels c, d). RATE, in turn, outperforms RMT and TrXL on average (panel a) and exhibits better stability (panels e, f, g).
In the highly sparse Passive T-Maze environment, an agent receives a cue at the first step that indicates the correct turn at the end of the maze. To test out-of-distribution generalization, all models were trained on episodes of up to 900 steps and then evaluated on extended sequences of up to 9,600 steps. RATE significantly outperforms the Decision Transformer (DT) and memory-enhanced baselines like RMT and TrXL, which degrade sharply as sequence length increases. Remarkably, RATE achieves 100% success on in-distribution sequences and maintains high performance even at 9,600-step inference, which corresponds to 28,800 tokens (three tokens per step: return-to-go, observation, action), demonstrating its ability to retain a sparse cue over extremely long horizons.
In the Minigrid-Memory environment, agents must locate a clue before making a decision, requiring a combination of memory and credit assignment. To test model generalization, all agents were trained on grids of a fixed 41×41 size and evaluated on a wide range of unseen grid sizes spanning from 11×11 up to 501×501. RATE achieves consistently high average returns across the entire spectrum, demonstrating exceptionally strong interpolation and extrapolation capabilities. While baselines like TrXL also perform well on average, they exhibit notably higher variance and sensitivity to grid scale, highlighting RATE's superior stability and robustness.
In standard offline RL benchmarks like Atari and MuJoCo, RATE not only outperforms the standard Decision Transformer but also matches or exceeds recent state-of-the-art methods, including specialized state space models like Decision Mamba and MambaDM. These results confirm that RATE's advanced memory mechanisms do not hinder its performance on standard Markov Decision Processes, establishing it as a highly versatile, general-purpose offline RL model across different task types.
We propose the Recurrent Action Transformer with Memory (RATE), a novel transformer-based architecture for offline RL that successfully combines attention with recurrence for long-horizon decision-making. By integrating memory embeddings, hidden state caching, and the Memory Retention Valve (MRV), RATE selectively retains critical information across sequence segments and is theoretically guaranteed to prevent catastrophic memory loss. Ultimately, RATE achieves state-of-the-art results on memory-intensive tasks, extrapolating successfully over sequences up to 9,600 steps long, while matching or exceeding specialized baselines on standard Atari and MuJoCo tasks, establishing it as a highly versatile, general-purpose offline RL model.
If you find our work useful, please cite our paper:
@inproceedings{cherepanov2026recurrent,
  title={Recurrent Action Transformer with Memory},
  author={Egor Cherepanov and Aleksei Staroverov and Alexey Kovalev and Aleksandr Panov},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=kByN4v0M3e}
}