Episodic Transformer for
Vision-and-Language Navigation

Alexander Pashevich, Cordelia Schmid, Chen Sun

Inria, Google Research, Brown University

Abstract: Interaction and navigation defined by natural language instructions in dynamic environments pose significant challenges for neural agents. This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions. We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions. To improve training, we leverage synthetic instructions as an intermediate representation that decouples understanding the visual appearance of an environment from the variations of natural language instructions. We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance. Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.

Motivation

We attempt to address two main challenges of vision-and-language navigation (VLN): (1) handling highly compositional tasks consisting of many subtasks and actions; (2) understanding the complex human instructions that are used to specify a task. Figure 1 shows an example task that illustrates both challenges. We show six key steps from a demonstration of 53 actions with corresponding step-by-step instructions. To fulfill the task, the agent is expected to remember the location of a fireplace and use this knowledge much later. It also needs to resolve object-grounded (e.g. “another vase”) and location-grounded (e.g. “where you were standing previously”) coreferences in order to understand the human instructions.

Figure 1: An example of a compositional task in the ALFRED dataset where the agent is asked to bring two vases to a cabinet.

Episodic Transformer (E.T.) architecture

The E.T. architecture relies on attention-based multi-layer transformer encoders. It has no hidden state and observes the full history of visual observations and previous actions. To predict the next action, the E.T. model is given a natural language instruction, the visual observations since the beginning of the episode, and the previously taken actions. Figure 2 shows an example that corresponds to the 6th timestep of an episode. After processing a language instruction with a transformer-based language encoder, embedding 6 visual observations with a ResNet-50 backbone and passing 5 previous actions through a look-up table, the agent outputs 6 actions, one per observed timestep. During training, we use all predicted actions for a gradient descent step. At test time, we apply only the last predicted action to the environment.
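The test-time loop described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the released implementation: `run_episode`, `toy_policy`, the string-valued frames and the action names `MoveAhead` and `Stop` are all hypothetical placeholders, and the list of frames stands in for an environment that would normally return each new frame in response to the last action.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a policy maps (instruction, frames so far,
# previous actions) to one predicted action per observed frame.
Policy = Callable[[str, Sequence[str], Sequence[str]], List[str]]


def run_episode(policy: Policy, instruction: str, frames: Sequence[str],
                stop_action: str = "Stop") -> List[str]:
    """Roll out a history-conditioned policy that keeps no hidden state."""
    seen_frames: List[str] = []
    taken_actions: List[str] = []
    for frame in frames:
        seen_frames.append(frame)
        # At timestep t the policy sees t frames and t-1 previous actions
        # and returns t predicted actions.
        predicted = policy(instruction, seen_frames, taken_actions)
        assert len(predicted) == len(seen_frames)
        # At test time only the last prediction is applied.
        action = predicted[-1]
        taken_actions.append(action)
        if action == stop_action:
            break
    return taken_actions


def toy_policy(instruction: str, frames: Sequence[str],
               prev_actions: Sequence[str]) -> List[str]:
    """Stand-in for the model: move ahead until a frame shows the goal."""
    return ["Stop" if f == "goal" else "MoveAhead" for f in frames]


actions = run_episode(toy_policy, "go to the goal",
                      ["hall", "hall", "goal", "hall"])
print(actions)
```

Because the whole history is passed in again at every step, nothing has to be carried between steps except the lists of frames and actions themselves. During training, all entries of `predicted` would contribute to the loss rather than only the last one.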

Figure 2: Overview of the E.T. architecture.

Leveraging synthetic representations

We observe that domain-specific languages and temporal logic can unambiguously specify the target states and (optionally) their temporal dependencies, while being decoupled from the visual appearance of a particular environment and from the variations of human instructions. We hypothesize that using these synthetic instructions as an intermediate interface between the human and the agent helps the model to learn more easily and generalize better. First, we pretrain the language encoder of the model to translate natural language instructions into synthetic language instructions (Figure 3, left). Because the synthetic representation is more task-oriented, the language encoder learns a better representation. We use the pretrained weights to initialize the language encoder of the agent (shown in yellow). Second, we jointly use demonstrations annotated with natural language and demonstrations annotated with synthetic language to train the agent (Figure 3, right). Due to the larger size of the synthetic language dataset, the resulting agent performs better even when evaluated on natural language annotations.
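As a rough illustration of the joint-training step, the sketch below draws batches that mix demonstrations annotated with natural language and demonstrations annotated with synthetic language. The half-and-half mixing ratio, the `Demo` tuple layout, the annotation strings and the function name `joint_batches` are all assumptions made for this example; the text above does not specify how the two sources are combined.

```python
import random
from typing import List, Tuple

# Placeholder structure: (annotation text, trajectory id).
Demo = Tuple[str, str]


def joint_batches(natural: List[Demo], synthetic: List[Demo],
                  batch_size: int, num_batches: int,
                  seed: int = 0) -> List[List[Demo]]:
    """Draw batches containing both natural and synthetic annotations.

    Assumption: each batch is split evenly between the two sources.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    batches = []
    for _ in range(num_batches):
        batch = (rng.sample(natural, half)
                 + rng.sample(synthetic, batch_size - half))
        rng.shuffle(batch)
        batches.append(batch)
    return batches


# Toy data: the synthetic set is larger than the natural one.
natural = [(f"human instruction {i}", f"traj{i}") for i in range(4)]
synthetic = [(f"synthetic instruction {i}", f"traj{i}") for i in range(10)]
batches = joint_batches(natural, synthetic, batch_size=4, num_batches=3)
```

Each batch would then be fed to the same agent, so that the language encoder sees both kinds of annotation during training.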

Figure 3: We use synthetic annotations for pretraining the language encoder and training the whole model jointly with human annotations.

Attention to previous observations

To better understand the impact of using a transformer encoder for action prediction, we show several qualitative examples of attention weights produced by the multimodal encoder of an E.T. agent. We use attention rollout to compute attention weights from an output action to previous visual observations. Figure 4 shows examples where an E.T. model attends to previous visual frames to successfully solve a task. The frame attention weights are shown with a horizontal bar, where frames corresponding to white squares have close to zero attention scores and frames corresponding to red squares have high attention scores. We do not include the attention score of the current frame as it is always significantly higher than the scores for previous frames.
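For readers unfamiliar with attention rollout, the following minimal sketch shows the usual formulation: each layer's attention matrix is averaged with the identity to account for the residual connection, and the resulting matrices are multiplied from the first layer to the last. It assumes head-averaged attention matrices whose rows sum to 1; the helper names and the toy two-token input are ours, and the exact variant used for the figures may differ.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def attention_rollout(layers: List[Matrix]) -> Matrix:
    """Combine per-layer attention into input-to-output attention.

    layers[0] is the first (input-side) layer. Each matrix is assumed
    to be head-averaged with rows summing to 1.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Average with the identity to model the residual connection;
        # rows still sum to 1 afterwards.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example: two tokens, two identical layers, token 1 attends to both.
layer = [[1.0, 0.0], [0.5, 0.5]]
scores = attention_rollout([layer, layer])
```

Row `i` of the result gives how much output position `i` depends on each input position. To visualize attention from an output action to earlier frames, one would read the row of that action, keep the columns of the previous frames, and renormalize.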

Figure 4: Visualizations of normalized attention heatmap to previous visual observations.

Figure explanation

In the first example, the agent is asked to pick up an apple and to heat it using a microwave. The agent walks past a microwave at timestep 8, picks up an apple at timestep 18 and attends to the microwave frame in order to recall where to bring the apple. In the second example, the agent slices a potato at timesteps 17-18 (hard to see in the visual observations). Later, the agent gets rid of the knife and follows the next instruction asking to pick up a potato slice. At timestep 39, the agent attends to frames 17-18, where the potato was sliced, in order to come back to the slices and complete the task. In the third example, the agent needs to sequentially move two pans. While picking up the second pan at timestep 29, the agent attends to frames 20-22, where the first pan was replaced. In the fourth example, the agent is asked to wash a cloth and to put it in a drawer. The agent washes the cloth at timestep 20 but the cloth state change is hard to notice in the given frames. At timestep 31, the agent attends to the frame with an open tap in order to keep track of the cloth state change.

Attention to language tokens

We illustrate transformer attention scores from an output action to input language tokens by comparing two models: (1) an E.T. model trained from scratch, and (2) an E.T. model whose language encoder is pretrained with the translation task. As with the visual attention, we use attention rollout and highlight the words with high attention scores with a red background. We observe that the agent trained without language pretraining misses word tokens that are important for the task according to human interpretation (marked with blue rectangles). In contrast, the pretrained E.T. agent is often able to pay attention to those tokens and solve the tasks successfully.

Figure 5: Visualizations of language attention heatmaps, without and with the language encoder pretraining.

Figure explanation

In the first example, the agent needs to pick up a bat. While the non-pretrained E.T. model has approximately equal attention scores for multiple tokens (those words are highlighted in pale pink) and does not solve the task, the pretrained E.T. attends to the bat tokens (highlighted in red) and successfully finds the bat. In the second example, the agent needs to first cool an egg in a fridge and to heat it in a microwave later. The non-pretrained E.T. has similar attention scores for the microwave and refrigerator tokens (they are highlighted in pink) and makes a mistake by choosing to heat the egg first. The pretrained E.T. agent has higher attention scores for the refrigerator tokens and correctly decides to cool the egg first. In the third example, the agent needs to pick up a knife to cut a potato later. The non-pretrained agent distributes its attention over many language tokens and picks up a fork, which is incorrect. The pretrained E.T. agent strongly attends to the knife token and picks the knife up.

Qualitative analysis

We show 3 successful and 2 failed examples of the E.T. agent solving tasks from the ALFRED validation fold. In the first example, the agent performs a sequence of 148 actions and successfully solves a task. This example shows that the agent is able to pick up small objects such as a knife and a tomato slice. The agent puts both of them on a plate and brings the plate to a fridge. In the second example, the agent brings a washed plate to a fridge. The agent does not know where the plate is and walks along a counter checking several places. Finally, it finds the plate, washes it and brings it to the fridge. In the third example, the agent successfully heats an apple and puts it on a table. The agent understands the instruction "bring the heated apple back to the table on the side" and navigates back to its previous position.

Among the most common failure cases are picking up wrong objects and mistakes during navigation. In the fourth example, the agent misunderstands the instruction "pick up the bowl to the right of the statue on the table" and decides to pick up a statue in the frame marked in red. It then brings the statue to the correct location, but the full task is considered failed. The fifth example shows a failure mode in an unseen environment. The agent is asked to pick up a basketball and to bring it to a lamp. The agent first wanders around a room but eventually picks up the basketball. It then fails to locate the lamp and finds itself staring into a mirror. The agent gives up on solving the task and decides to terminate the episode.

Figure 6: Example trajectories of E.T. agent solving ALFRED tasks.
