Program
Invited Talks
Speaker: Yoav Artzi, Cornell University
Title: Language and Reasoning Diversity in Grounded Natural Language Understanding
Abstract: Language grounding is a promising avenue to study core problems in natural language understanding. In this talk, I will discuss challenges in data collection and evaluation for grounded language tasks. The talk will be mainly structured around two visual resources we recently introduced: NLVR2 and Touchdown. In NLVR2, we present a dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. Key to NLVR2 is a scalable data collection process, which generates data that is robust to implicit linguistic biases and enables straightforward evaluation that goes beyond the single example to evaluate model generalization. Touchdown is a navigation and spatial reasoning dataset that uses an interactive environment made of 30K street panoramas of New York City. This data allows us to take the study of grounded language to real-life environments, while still providing a scalable development and evaluation platform. These rich visual stimuli elicit language that is significantly more diverse than that of existing resources. Finally, I will conclude with a short discussion of grounding beyond visual observations, and into actions in physical agents.
Speaker: Angeliki Lazaridou, DeepMind
Title: Multi-agent communication from raw perceptual input: what works, what doesn't and what's next
Abstract: Multi-agent communication has traditionally been used as a computational tool to study language evolution. Recently, it has also attracted attention as a means to achieve better coordination among multiple interacting agents in complex environments. However, is it easy to scale previous research in the new deep learning era? In this talk, I will first give a brief overview of some of the previous approaches that study emergent communication in cases where agents are given symbolic data as input. I will then move on to presenting some of the challenges that agents face when they are placed in grounded environments where they receive raw perceptual information, and how environmental or pre-linguistic conditions affect the nature of the communication protocols that they learn. Finally, I will discuss some potential remedies that are inspired by human language and communication.
Speaker: Margaret Mitchell
Title: The Data Bottleneck and Multimodal Ethical AI
Abstract: This talk will discuss work on training and evaluating AI systems informed by ethical principles. Conflicting values in the collection of data, and in training and evaluating with data, give rise to different systems and different academic cultures around ML research depending on how the different values are prioritized. I will discuss the normative vs. descriptive distinction in modeling vision-language data, and the outcomes that arise from prioritizing fairness, diversity, inclusion, and transparency in system development.
1. Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects
Gabriel Grand and Yonatan Belinkov
2. Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions
Peratham Wiriyathammabhum, Abhinav Shrivastava, Vlad Morariu and Larry Davis
3. A Survey on Biomedical Image Captioning
John Pavlopoulos, Vasiliki Kougia and Ion Androutsopoulos
4. Revisiting Visual Grounding
Erik Conser, Kennedy Hahn, Chandler Watson and Melanie Mitchell
5. The Steep Road to Happily Ever After: An Analysis of Current Visual Storytelling Models
Yatri Modi and Natalie Parde
6. "Caption" as a Coherence Relation: Evidence and Implications
Malihe Alikhani and Matthew Stone
7. Learning Multilingual Word Embeddings Using Image-Text Data
Karan Singhal, Karthik Raman and Balder ten Cate
8. Grounded Word Sense Translation
Chiraag Lala, Pranava Madhyastha and Lucia Specia
Accepted Extended-Abstracts
1. Insensibility of Question Word Order in Visual Question Answering
Hwanhee Lee and Kyomin Jung
2. Visual Understanding and Narration: A Deeper Understanding and Explanation of Visual Scenes
Stephanie M. Lukin, Claire Bonial and Clare Voss
3. Visually Grounded Cross-Lingual Transfer Learning
Fangyu Liu, Rémi Lebret and Karl Aberer
4. Image Captioning via Personality
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes and Jason Weston
5. Engaging Grounded Dialogue: The Image-Chat Task
Kurt Shuster, Samuel Humeau, Antoine Bordes and Jason Weston
6. Towards Building a Logical Inference System for Image Retrieval
Riko Suzuki, Hitomi Yanaka, Masashi Yoshikawa, Koji Mineshima and Daisuke Bekki
7. Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds
John P. Lalor, Hao Wu and Hong Yu