SEA: Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation

Georgia Institute of Technology


Abstract

In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning on the environments that the agent will be trained or tested on. However, the distribution shift between the training images from ImageNet and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal.

Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage the data in the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize: (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder.

Through rigorous ablations, our SEA pre-trained features are shown to better encode structural information of the scenes, which ImageNet pre-trained features fail to properly encode but which is crucial for the target navigation task. The SEA pre-trained features can be easily plugged into existing VLN agents without any tuning. For example, on Test-Unseen environments, the VLN agents combined with our SEA pre-trained features achieve absolute success rate improvements of 12% for Speaker-Follower, 5% for Env-Dropout, and 4% for AuxRN.

Motivation

Most existing VLN works ignore the importance of the underlying visual representation by simply taking an image encoder pre-trained on ImageNet to encode the views in the navigation environments. Because of the data distribution shift between ImageNet and the navigation environments, as well as the difference between the pre-training task (image classification) and the target task (VLN), the ImageNet pre-trained image encoder may not be able to encode information crucial for the VLN task.

Furthermore, in the navigation environments, image labels such as semantic segmentation masks, object bounding boxes, or object and scene classes may not be available for fine-tuning the image encoder. Moreover, it is computationally prohibitive to fine-tune the image encoder jointly with the agent on the target VLN task.

To improve the image encoder without the need for manually annotated labels in the target environments and without fine-tuning with the VLN agent jointly, we pre-train the image encoder on proposed structure-encoding auxiliary tasks (SEA) with data available in the navigation environments.

Proposed Method

We start by observing the following instruction example: “Exit the screening room, make a right, go straight into the room with the globe and stop.”

As highlighted above, to correctly follow the instruction, the agent needs its image encoder to encode the following information: the structure and layout of the scene (e.g., where the exit and the next room are), which directions are traversable, and the identity of the objects and scenes it observes (e.g., the globe and the screening room).

Therefore, we design the three auxiliary tasks shown in the figure above: (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification, to encode this crucial information for VLN.

3D jigsaw: Predict the relative pose between an anchor view (red box) and a query view (yellow box). The query view is sampled from the anchor view's neighboring views along the elevation, heading, and position dimensions.
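A minimal PyTorch sketch of how such a relative-pose objective can be set up is given below; the discretization into a handful of heading, elevation, and position offsets and the two-layer MLP head are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class JigsawHead(nn.Module):
    # Predict the query view's discretized offset relative to the anchor view
    # along the heading, elevation, and position dimensions.
    def __init__(self, feat_dim=2048, n_heading=3, n_elevation=3, n_position=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, n_heading + n_elevation + n_position),
        )
        self.splits = (n_heading, n_elevation, n_position)

    def forward(self, anchor_feat, query_feat):
        logits = self.mlp(torch.cat([anchor_feat, query_feat], dim=-1))
        return torch.split(logits, self.splits, dim=-1)  # per-dimension logits

# Usage: a cross-entropy loss on each of the three offset predictions.
head = JigsawHead()
anchor, query = torch.randn(8, 2048), torch.randn(8, 2048)  # pooled backbone features
h_logits, e_logits, p_logits = head(anchor, query)
loss = (nn.functional.cross_entropy(h_logits, torch.randint(0, 3, (8,)))
        + nn.functional.cross_entropy(e_logits, torch.randint(0, 3, (8,)))
        + nn.functional.cross_entropy(p_logits, torch.randint(0, 5, (8,))))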

Traversability prediction: Predict whether a view contains any traversable direction. The images in the blue box are labeled as True (they contain traversable directions), and the images in the red box are labeled as False (they do not contain traversable directions).
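A minimal sketch of this objective, assuming the True/False labels can be derived from the simulator's navigation graph and that a simple linear head on pooled backbone features suffices (both are assumptions, not the paper's exact setup):

import torch
import torch.nn as nn

traversability_head = nn.Linear(2048, 1)      # on top of pooled backbone features
criterion = nn.BCEWithLogitsLoss()

feats = torch.randn(8, 2048)                  # pooled features for a batch of 8 views
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = the view contains a traversable direction
loss = criterion(traversability_head(feats), labels)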

Instance classification: Identify a view's augmented copy from a pool of other image views. In this example, the view in the blue box is the corresponding augmented copy (positive pair), while the views in the red box are other image views (negative pairs).
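A minimal InfoNCE-style sketch of this objective is shown below: each view should match its own augmented copy (positive pair) against the other views in the batch (negative pairs). The temperature value and the use of in-batch negatives are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def instance_loss(feats, feats_aug, temperature=0.07):
    # feats, feats_aug: (B, D) features of the original views and their augmented copies.
    z = F.normalize(feats, dim=-1)
    z_aug = F.normalize(feats_aug, dim=-1)
    logits = z @ z_aug.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)   # diagonal entries are the positive pairs

loss = instance_loss(torch.randn(8, 2048), torch.randn(8, 2048))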

After pre-training the image encoder with the proposed SEA auxiliary tasks, we use our pre-trained visual encoder in place of the ImageNet pre-trained visual encoder for existing VLN methods.  By decoupling the pre-training of the image encoder from the VLN agent, other VLN methods can benefit from our improved visual representation with minimal modification.
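As a rough illustration of what this plug-in step can look like in a typical VLN pipeline, the sketch below pre-extracts per-viewpoint features with the pre-trained encoder so they can be stored where the agent's code expects its ImageNet features. The ResNet variant, the checkpoint name, and the 36-view panorama layout are assumptions modeled on common VLN codebases, not taken verbatim from the released code.

import torch
import torchvision.models as models

# Build the backbone and load the SEA pre-trained weights (the checkpoint name is hypothetical).
encoder = models.resnet152()
# encoder.load_state_dict(torch.load("sea_resnet152.pth"))
encoder.fc = torch.nn.Identity()              # expose the 2048-d pooled features
encoder.eval()

@torch.no_grad()
def encode_viewpoint(views):
    # views: (36, 3, 224, 224) panorama crops -> (36, 2048) features,
    # the same shape the ImageNet features occupy in typical VLN agents.
    return encoder(views)

feats = encode_viewpoint(torch.randn(36, 3, 224, 224))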

Main Results

We select Speaker-Follower, Env-Dropout, and AuxRN, and replace the ImageNet pre-trained features with our SEA pre-trained features. We use the released code from these VLN methods and train the agent with our SEA pre-trained features without any hyper-parameter tuning for the agent.

With our SEA pre-trained features, all three agents achieve consistent improvements in Val-Unseen and Test-Unseen. Notably, on Test-Unseen, the most important part of the evaluation since it tests generalization to new, held-out environments, our SEA pre-trained features achieve a 12% absolute improvement in both SR and SPL for Speaker-Follower and a 4% improvement for the already strong AuxRN agent.

Analyses

What information is encoded by training on the auxiliary tasks?

We first train the image encoder with different combinations of auxiliary tasks. We then append a light-weight head to the image encoder and fine-tune only the head (with the image encoder frozen) on downstream tasks:

Semantic segmentation and normal estimation require structural information of the scene, while multi-label object classification and scene classification require discriminative information of objects and scenes.

As shown in the table above, our SEA pre-trained features encode more structural information of the scenes, which is crucial for the navigation task, in addition to the discriminative information of objects and scenes.
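A minimal sketch of this probing protocol for a classification-style downstream task is given below (the dense heads for segmentation and normal estimation differ, but the frozen-encoder recipe is the same); the backbone, class count, and optimizer settings are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet152()                  # stand-in for the SEA or ImageNet backbone
encoder.fc = nn.Identity()
for p in encoder.parameters():
    p.requires_grad = False                   # the image encoder stays frozen
encoder.eval()

probe = nn.Linear(2048, 26)                   # light-weight head; 26 classes is illustrative
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(images, labels):
    with torch.no_grad():
        feats = encoder(images)               # (B, 2048) frozen features
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

probe_step(torch.randn(4, 3, 224, 224), torch.randint(0, 26, (4,)))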

How does the agent's performance correlate with each auxiliary task?

Instance classification (#4, 6, 7) is the most effective of the three auxiliary tasks, while 3D jigsaw and traversability prediction are also beneficial, as they further improve performance when combined with instance classification.

Resources

GitHub

@inproceedings{kuo2023structure,
  title={Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation},
  author={Chia-Wen Kuo and Chih-Yao Ma and Judy Hoffman and Zsolt Kira},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={1104--1113},
  year={2023}
}