Embodied Multimodal Multitask Learning

Devendra Singh Chaplot1,2, Lisa Lee1, Ruslan Salakhutdinov1, Devi Parikh2,3, Dhruv Batra2,3∗

1Carnegie Mellon University, 2Facebook AI Research, 3Georgia Institute of Technology

Embodied Multimodal Tasks. We focus on multi-task learning of two visually-grounded language navigation tasks. In Semantic Goal Navigation (SGN), the agent is given a language instruction (“Go to the red torch”) and must navigate to the goal location. In Embodied Question Answering (EQA), the agent is given a question (“What color is the torch?”) and must explore the 3D environment to gather the information needed to answer it (“red”).

Cross-task Knowledge Transfer. We create train and test sets to test cross-task knowledge transfer between SGN and EQA. Each instruction in the test set contains a word that is never seen in any instruction in the training set but is seen in some questions in the training set. Similarly, each question in the test set contains a word that is never seen in any question in the training set but is seen in some instructions in the training set.
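The split property above can be checked mechanically. The following is a minimal sketch, not the authors' data-generation code: the function names, the whitespace tokenization, and the toy sentences are all assumptions made for illustration. For test questions, it assumes the symmetric condition, namely that the unseen word appears in some training instruction.

```python
def vocabulary(sentences):
    """Return the set of lowercase word tokens across all sentences.

    Tokenization here is a simplifying assumption: split on whitespace
    and strip trailing punctuation.
    """
    return {word.strip("?.,!").lower() for s in sentences for word in s.split()}


def has_cross_task_word(sentence, own_task_vocab, other_task_vocab):
    """True if the sentence contains a word that is unseen in its own
    task's training vocabulary but seen in the other task's."""
    words = vocabulary([sentence])
    return any(w not in own_task_vocab and w in other_task_vocab for w in words)


# Hypothetical toy training data for the two tasks.
train_instructions = ["Go to the blue torch", "Go to the red pillar"]
train_questions = ["What color is the torch?", "What color is the skull?"]

instr_vocab = vocabulary(train_instructions)
ques_vocab = vocabulary(train_questions)

# "skull" appears only in training questions, so this is a valid test instruction.
test_instruction = "Go to the red skull"
# "pillar" appears only in training instructions, so this is a valid test question.
test_question = "What color is the pillar?"

valid_instruction = has_cross_task_word(test_instruction, instr_vocab, ques_vocab)
valid_question = has_cross_task_word(test_question, ques_vocab, instr_vocab)
```

In this toy example, solving the test instruction requires grounding “skull”, a word the agent has only encountered while answering questions, which is the kind of transfer the split is meant to measure.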

Dual-Attention Policy visualization: No Aux, Hard

Below, we show policy visualizations for our Dual-Attention model, trained without auxiliary tasks, on the train set (left) and test set (right) in the Doom Hard environment. The Aux labels are shown for reference only; the model is not trained with auxiliary labels.

Train set

Test set

Dual-Attention Model

Architecture of the Dual-Attention unit with example intermediate representations and operations.