Spatially-Aware Transformer for Embodied Agents

 Junmo Cho*      Jaesik Yoon*      Sungjin Ahn

ICLR 2024 Spotlight


 Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, we explore the use of Spatially-Aware Transformer models that incorporate spatial information.

Home Robot Thought Experiments

Consider a home robot that helps with housework.

a) When the robot navigates the house along diverse routes, can it answer spatially relational questions about what it saw without any spatial context? An example is the request "please pick up an object in the room to the left of Room D."

b) Another interesting thought experiment: how does temporal episodic memory behave when the robot stays in a single room for a long time? Temporal memory generally follows a FIFO policy, replacing the oldest memory with each new one. If the robot lingers in Room B, will the memories collected before it entered Room B be evicted?

Motivated by these two questions, we propose a new spatially-aware, transformer-based episodic memory.

Spatially-Aware Transformer (SAT)


Assuming that spatial information is provided together with each observation, the Spatially-Aware Transformer incorporates this information into its input through a learnable embedding.
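As a rough illustration of the idea, the sketch below builds one input token by summing an observation vector with a time embedding and a place embedding. The names `EmbeddingTable` and `sat_token` are hypothetical, the tables are randomly initialized rather than trained, and summing the embeddings (as is common for positional embeddings) is an assumption rather than a detail confirmed by this page.

```python
import random


class EmbeddingTable:
    """Lookup table of vectors. In a real model the rows would be trained."""

    def __init__(self, num_rows, dim, seed=0):
        rng = random.Random(seed)
        self.rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(num_rows)]

    def __call__(self, index):
        return self.rows[index]


def sat_token(obs_vec, t, place_id, time_emb, place_emb):
    """Sum observation, time and place embeddings elementwise (assumed scheme)."""
    return [o + a + b
            for o, a, b in zip(obs_vec, time_emb(t), place_emb(place_id))]


DIM = 4
time_emb = EmbeddingTable(num_rows=16, dim=DIM, seed=0)
place_emb = EmbeddingTable(num_rows=3, dim=DIM, seed=1)

# The same observation at the same timestep, seen in two different rooms.
obs = [1.0, 0.0, 0.0, 0.0]
token_room0 = sat_token(obs, t=5, place_id=0, time_emb=time_emb, place_emb=place_emb)
token_room2 = sat_token(obs, t=5, place_id=2, time_emb=time_emb, place_emb=place_emb)
```

With the place embedding in the sum, two otherwise identical frames from different rooms map to different tokens, which is what lets attention distinguish them by place.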

SAT with FIFO Memory

The simplest design pairs SAT with a memory that follows the FIFO policy.

In the figure above, each color denotes a different place. With this design we can expect to solve spatially relational tasks such as Thought Experiment a). However, like temporal memory, it evicts the oldest memory first, so it cannot solve Thought Experiment b).
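The limitation can be seen in a toy version of Thought Experiment b). The sketch below is a minimal FIFO buffer whose entries carry a place label; the class name `FIFOMemory`, the capacity of 4, and the visit sequence are all made up for illustration.

```python
from collections import deque


class FIFOMemory:
    """Fixed-capacity memory that always evicts the oldest entry."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is appended.
        self.slots = deque(maxlen=capacity)

    def write(self, t, place, obs):
        self.slots.append((t, place, obs))

    def places(self):
        return {place for _, place, _ in self.slots}


# Hypothetical trajectory: two steps in Room A, then a long stay in Room B.
fifo = FIFOMemory(capacity=4)
visits = ["A", "A"] + ["B"] * 6
for t, place in enumerate(visits):
    fifo.write(t, place, obs=f"frame{t}")

remaining_places = fifo.places()  # only Room B survives the long stay
```

Each entry still knows where it was recorded, so spatial queries are possible, but everything seen in Room A has been overwritten by the time the stay in Room B ends.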

SAT with Place Memory

How, then, can we keep old memories? One heuristic policy is to manage the memory per place. We call it Place Memory.

Place Memory divides the capacity evenly among places. When the memory for one place is full, the oldest memory in that place is evicted.

Consider Thought Experiment b) again. Even if the robot stays in Room B for a long time, Place Memory retains the memories from other rooms because it replaces only memories from Room B.
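A minimal sketch of this policy, under the assumption that the set of places is known in advance and that each place gets its own small FIFO buffer. `PlaceMemory` and `read_all` are hypothetical names, and the capacity and trajectory are toy values.

```python
from collections import deque


class PlaceMemory:
    """Splits a fixed capacity evenly across places; eviction is per place."""

    def __init__(self, capacity, places):
        per_place = capacity // len(places)
        self.slots = {p: deque(maxlen=per_place) for p in places}

    def write(self, t, place, obs):
        # Only the buffer of the current place can lose an entry.
        self.slots[place].append((t, place, obs))

    def read_all(self):
        # Merge all per-place buffers back into temporal order.
        return sorted(entry for q in self.slots.values() for entry in q)


# Hypothetical trajectory: two steps in Room A, then a long stay in Room B.
pm = PlaceMemory(capacity=4, places=["A", "B"])
visits = ["A", "A"] + ["B"] * 6
for t, place in enumerate(visits):
    pm.write(t, place, obs=f"frame{t}")

kept_timesteps = [t for t, _, _ in pm.read_all()]
```

With a total capacity of 4, the two Room A frames (timesteps 0 and 1) are still present after the long stay in Room B, alongside the two most recent Room B frames.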

Is Place Memory always better than FIFO, then? No. Which memory policy performs best depends on the given task.

SAT with Adaptive Memory Allocator (AMA)

We discussed the FIFO and Place Memory policies and saw that neither is better in every case. We therefore propose a novel learnable policy that is trained on the performance on the given task.

Given a set of policies (e.g., FIFO and Place Memory) and a task description (if one exists), AMA learns to find the best policy in the set.

Through AMA, the memory can "adaptively" change its policy for the given task.
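To make the selection idea concrete, here is a deliberately simplified stand-in, not the training procedure used for AMA. It keeps one logit per (task, policy) pair and takes exact gradient-ascent steps on the expected task score under a softmax over policies. The class `PolicySelector`, the task strings, and the scores are all invented for illustration, and the sketch assumes a score is available for every policy in the set.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PolicySelector:
    """Hypothetical stand-in for a learned selector over memory policies."""

    def __init__(self, policies, lr=1.0):
        self.policies = list(policies)
        self.lr = lr
        self.logits = {}  # task description -> one logit per policy

    def probs(self, task):
        logits = self.logits.setdefault(task, [0.0] * len(self.policies))
        return softmax(logits)

    def update(self, task, scores):
        """One ascent step on the expected score: dE/dlogit_i = p_i * (r_i - E)."""
        p = self.probs(task)
        expected = sum(pi * scores[name] for pi, name in zip(p, self.policies))
        logits = self.logits[task]
        for i, name in enumerate(self.policies):
            logits[i] += self.lr * p[i] * (scores[name] - expected)

    def select(self, task):
        p = self.probs(task)
        return self.policies[p.index(max(p))]


selector = PolicySelector(["FIFO", "PlaceMemory"])
toy_scores = {  # made-up task scores, for illustration only
    "recall the most recent room": {"FIFO": 0.9, "PlaceMemory": 0.6},
    "recall the first room": {"FIFO": 0.1, "PlaceMemory": 0.8},
}
for _ in range(50):
    for task, scores in toy_scores.items():
        selector.update(task, scores)

choice_recent = selector.select("recall the most recent room")
choice_first = selector.select("recall the first room")
```

The point of the sketch is only that the choice is conditioned on the task and driven by task performance, so different tasks can end up with different policies from the same set.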

Experimental Results


We present our experimental results and analysis for supervised learning and reinforcement learning tasks.

Implicit derivation of spatial information in transformers

To verify that transformers can derive spatial information, we designed a spatial reasoning task on the Room Ballet environment, which we call the Next Ballet task. In this task, the agent is asked to recall the type of dance performed in the room to the left of the room containing the queried dancer.

This result shows that temporal memory, even when given action information, is not good enough to infer the spatial relationship, while SAT solves the task by inferring the relationship from the given spatial information.

Place-centric hierarchical reading

Under the hypothesis that memory collected per place can be beneficial for spatial reading operations, we adapted the hierarchical reading from previous work, HCAM, and designed another task. This task is also based on the Room Ballet environment. Unlike the previous task, the agent can move to a different room at every timestep, so to predict which dance was performed in a specific room, the agent must access the memories from that room.
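One way to picture a place-centric hierarchical read is as a two-stage lookup: first score each place by a summary of its memories, then attend only over the entries of the best-scoring places. The sketch below follows that outline. The function name `place_hierarchical_read`, the use of mean-pooled keys as place summaries, and the `top_k` parameter are assumptions made for illustration, not details taken from this page.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def place_hierarchical_read(query, place_memory, top_k=1):
    """place_memory maps a place to a list of (key, value) vector pairs."""
    # Stage 1: summarize each non-empty place by the mean of its keys (assumed).
    summaries = {}
    for place, entries in place_memory.items():
        if entries:
            dim = len(entries[0][0])
            summaries[place] = [sum(k[d] for k, _ in entries) / len(entries)
                                for d in range(dim)]
    if not summaries:
        raise ValueError("memory is empty")
    ranked = sorted(summaries, key=lambda p: dot(query, summaries[p]), reverse=True)
    selected = ranked[:top_k]

    # Stage 2: ordinary dot-product attention, restricted to the selected places.
    entries = [e for p in selected for e in place_memory[p]]
    weights = softmax([dot(query, k) for k, _ in entries])
    dim_v = len(entries[0][1])
    read = [sum(w * v[d] for w, (_, v) in zip(weights, entries))
            for d in range(dim_v)]
    return selected, read


toy_memory = {  # toy keys and values, for illustration only
    "A": [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 0.0])],
    "B": [([0.0, 1.0], [0.0, 1.0])],
}
selected, read = place_hierarchical_read([0.0, 5.0], toy_memory, top_k=1)
```

Because the second stage only sees entries from the selected places, the cost of a read depends on the size of those places rather than on the full memory.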

The result shows that place-centric hierarchical reading on Place Memory (PM-PM) solves this task even in the large environment, while the agent without hierarchical reading (SAT-FIFO) and the agent with temporal hierarchical reading (SAT-FIFO-TH) perform worse in the large environment. Note that the capacity in this task is large enough that, even under the FIFO policy, every memory remains in the transformer episodic memory.

Learning to select memory allocation strategy with AMA

In this experiment, we assume that the memory capacity is limited and demonstrate that SAT-AMA can learn a memory management strategy that best fits the downstream task, and that this is better than always using FIFO. We first introduce four strategies: First-In-First-Out (FIFO), Last-In-First-Out (LIFO), Most-Visited-First-Out (MVFO), and Least-Visited-First-Out (LVFO). FIFO/LIFO replace the oldest/newest observation in the memory. MVFO/LVFO replace the oldest memory from the most/least frequently visited room. Note that whether these strategies are good or not is not the main point of this work. Rather, they are examples of prior strategies that can be designed, as illustrated in the figure below.
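The four strategies differ only in which entry they pick for replacement, which can be sketched as a single selection function. The sketch assumes memory entries are `(timestep, room)` pairs, that visit counts are tracked separately, and that MVFO/LVFO only consider rooms that currently have an entry in memory; `evict_index` and the example data are hypothetical.

```python
from collections import Counter


def evict_index(memory, visits, strategy):
    """Return the index of the entry to replace.

    memory: list of (timestep, room) pairs; visits: Counter of room visits.
    """
    indices = range(len(memory))
    if strategy == "FIFO":   # oldest observation
        return min(indices, key=lambda i: memory[i][0])
    if strategy == "LIFO":   # newest observation
        return max(indices, key=lambda i: memory[i][0])

    # Sorted for a deterministic tie-break between equally visited rooms.
    rooms_in_memory = sorted({room for _, room in memory})
    if strategy == "MVFO":   # most frequently visited room
        target = max(rooms_in_memory, key=lambda r: visits[r])
    elif strategy == "LVFO":  # least frequently visited room
        target = min(rooms_in_memory, key=lambda r: visits[r])
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Within the chosen room, replace the oldest memory.
    candidates = [i for i in indices if memory[i][1] == target]
    return min(candidates, key=lambda i: memory[i][0])


# Toy memory: room C visited twice, A once, B three times.
memory = [(0, "C"), (1, "A"), (2, "B"), (3, "B"), (4, "B"), (5, "C")]
visits = Counter(room for _, room in memory)

picks = {s: evict_index(memory, visits, s)
         for s in ("FIFO", "LIFO", "MVFO", "LVFO")}
```

On this toy memory the four strategies pick four different entries: index 0 for FIFO, 5 for LIFO, 2 for MVFO (oldest entry of room B), and 1 for LVFO (the only entry of room A).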

SAT-AMA solves the four tasks adaptively, even when the task description is given as an unseen natural-language sentence such as "Recall the dancers in the room the agent frequented the most" (its performance is shown in the right figure as SAT-AMA-Lang).

Action-conditioned generation in FFHQ world

We demonstrate the capability of SAT to generate observations conditioned on actions while navigating an environment, to see its potential for implementing an episodic world model. To test this, we designed a navigation task on facial images from the FFHQ dataset. The environment is a face image treated as a 10 × 10 grid map. The agent observes a partial image of the face, as shown in the figure below, while navigating the image with an action (up, down, left, right) provided at each step. During the observation phase, the agent randomly explores the environment, and it is asked to recall the partial images given only random action sequences.
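The navigation setup can be sketched as a walk over grid cells, where each action moves the agent by one cell and the visited cell determines which patch of the face is observed. The function name `rollout`, the row/column convention (up decreases the row), and clamping at the image border are assumptions for illustration; the page does not specify how moves at the border are handled.

```python
# Assumed convention: (row, col) with row 0 at the top of the image.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def rollout(start, actions, size=10):
    """Return the sequence of grid cells visited on a size x size map."""
    row, col = start
    path = [(row, col)]
    for action in actions:
        d_row, d_col = MOVES[action]
        # Clamp to the grid so the agent stays on the image (assumed behavior).
        row = min(max(row + d_row, 0), size - 1)
        col = min(max(col + d_col, 0), size - 1)
        path.append((row, col))
    return path


path = rollout(start=(0, 0), actions=["up", "right", "right", "down"])
```

A memory that only stores the action sequence has to integrate these moves itself to know which cell a frame came from, whereas a spatially annotated memory can store the location with each frame directly.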

We evaluated SAT and a temporal transformer episodic memory with action information. The temporal memory cannot match the performance of SAT even with action information. Additionally, by evaluating place sets of various sizes (e.g., 8, 16, 32), we showed that SAT's performance is stable.

RL task requiring long-term memory

Lastly, we test an RL agent equipped with SAT-AMA memory. The task involves two rooms. The agent is initially placed in a room with a single box of a random color (either blue or green). After staying in the room for a few steps, the agent is transported to the second room, which is larger and contains multiple yellow boxes, and navigates there for a long time. After this observation phase, the agent is warped back to the first room, where it is tasked to collect the box that matches the color of the box observed initially. The memory capacity is too limited to store the entire history, so the agent should overwrite second-room memories while retaining the memory from the first room.

AMA selects a proper policy for this task, and the agent equipped with AMA can therefore solve it.

Conclusion and Discussion

Transformer-based episodic memory has utilized temporal order to serialize experience frames. In this paper, we explore the potential of incorporating the spatial axis as another fundamental aspect of the physical world. We argue that this is crucial for embodied agents and introduce the concept of Spatially-Aware Transformers (SAT). We propose different SAT architectures, starting from the simplest SAT-FIFO to SAT with place memory, which includes a hierarchical episodic memory centered around places. Furthermore, we introduce the Adaptive Memory Allocator (AMA), which provides a more flexible memory management strategy beyond the FIFO memory writing approach. Through experiments, we assess the improved performance of these methods and demonstrate their applicability to a wide range of machine learning problems.