Embodied Multimodal Multitask Learning

Devendra Singh Chaplot¹,², Lisa Lee¹, Ruslan Salakhutdinov¹, Devi Parikh²,³, Dhruv Batra²,³∗

¹Carnegie Mellon University, ²Facebook AI Research, ³Georgia Institute of Technology

Published at IJCAI 2020

Embodied Multimodal Tasks. We focus on multi-task learning of two visually grounded language navigation tasks. In Semantic Goal Navigation (SGN), the agent is given a language instruction (“Go to the red torch”) and must navigate to the goal location. In Embodied Question Answering (EQA), the agent is given a question (“What color is the torch?”) and must explore the 3D environment to gather the information needed to answer it (“red”).

Cross-task Knowledge Transfer. The train and test sets are constructed to test cross-task knowledge transfer between SGN and EQA. Each instruction in the test set contains a word that never appears in any training-set instruction but does appear in some training-set questions. Similarly, each question in the test set contains a word that never appears in any training-set question.
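The split condition for test instructions can be checked mechanically. The following is a minimal sketch, not the procedure used to build the actual datasets: the helper names (`tokenize`, `vocabulary`, `transfer_words`) and the toy sentences are hypothetical, and only "torch" and "red" come from the examples above.

```python
import string


def tokenize(sentence):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def vocabulary(sentences):
    """Set of all words appearing in a collection of sentences."""
    return {word for s in sentences for word in tokenize(s)}


def transfer_words(test_sentence, same_task_train, other_task_train):
    """Words in test_sentence that are unseen in the same task's training
    sentences but seen in the other task's training sentences."""
    unseen = set(tokenize(test_sentence)) - vocabulary(same_task_train)
    return unseen & vocabulary(other_task_train)


# Toy data (hypothetical, for illustration only).
train_instructions = ["Go to the red torch", "Go to the blue pillar"]
train_questions = ["What color is the torch?", "What color is the armor?"]

# "armor" never occurs in a training instruction but does occur in a
# training question, so this instruction qualifies for the test set.
words = transfer_words("Go to the red armor", train_instructions, train_questions)
```

A test instruction satisfies the condition when `transfer_words` returns a non-empty set. For test questions, the text above only requires a word unseen in training questions, which corresponds to the `unseen` set alone.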

Dual-Attention Policy visualization: No Aux, Hard

Below are policy visualizations of our Dual-Attention model, trained without auxiliary tasks, on the train set (left) and test set (right) in the Doom Hard environment. The Aux labels are shown for reference only; the model is trained without auxiliary labels.

Train set

Test set

Dual-Attention Model

Architecture of the Dual-Attention unit with example intermediate representations and operations.