Abstracts of Invited Talks

Title: HySTER: A Hybrid Spatio-Temporal Event Reasoner

Author: Theophile Sautory, Nuri Cingillioglu, Alessandra Russo (Imperial College London)

Abstract: The task of Video Question Answering (VideoQA) consists of answering natural language questions about a video and serves as a proxy for evaluating a model's performance in scene sequence understanding. Most methods designed for VideoQA to date are end-to-end deep learning architectures, which struggle with complex temporal and causal reasoning and provide limited transparency in their reasoning steps. We present HySTER, a Hybrid Spatio-Temporal Event Reasoner for reasoning over physical events in videos. Our model combines the strength of deep learning methods for extracting information from video frames with the reasoning capabilities and explainability of symbolic artificial intelligence in an answer set programming framework. We define a method based on general temporal, causal, and physics rules which can be transferred across tasks. We apply our model to the CLEVRER dataset and demonstrate state-of-the-art results in question answering accuracy. This work lays the foundations for the incorporation of inductive logic programming into the field of VideoQA.


Title: Interpretable Representations with Multimodal Deep Generative Models

Author: Siddharth N. (University of Edinburgh)

Abstract: Learning generative models that span multiple data modalities, such as vision and language, is often motivated by the desire to learn meaningful, generalisable representations that faithfully capture information across modalities. This has particular relevance to explainability: being able to intervene in and manipulate such models in meaningful ways allows for better insight into the captured concepts and their relationships. Here, I will talk about a couple of different approaches to incorporating associated data from different modalities, with a view to learning meaningful representations, and subsequently discuss ways to effectively characterize, evaluate, and learn such models from data.


Title: The New Brown Corpus: Building a corpus for childlike grounded learning

Author: Ellie Pavlick (Brown University)

Abstract: Despite their impressive empirical performance, state-of-the-art NLP systems pale in comparison to human language understanding. A natural question to ask is whether we can improve models by more closely emulating the way humans learn. Replicating a child's language-learning experience is infeasible, if not impossible. However, mimicking aspects of such learning environments (3D grounded data, naturalistic spontaneous speech) may offer insights into the comparative strengths and weaknesses of how ML systems train vs. how humans learn. In this talk, I will discuss our efforts to build an aspirational, naturalistic language learning corpus using virtual reality. Our collected data is small but rich with signals not available in other grounded language corpora. I will discuss baseline modeling results and ideas for moving forward.


Title: A Semi-supervised Machine Learning Method for Language Acquisition

Author: Ting Liu and Sharon Small (Siena College), Peter Tu and James Kubricht (General Electric Company)

Abstract: Language acquisition is a difficult but popular research topic that has drawn a great deal of attention from NLP researchers. People have developed different approaches using supervised, semi-supervised, and unsupervised algorithms. However, when children learn, they do not require large amounts of training data. Instead, they are able to accurately generalize their knowledge from one object to other objects (Springer and Keil 1989, Keil 1992, Kelemen 2003). In this paper, we present a system that simulates children's learning process by studying entities and their attributes, such as color, shape, and material, from a carefully designed curriculum, and then learns additional entity attributes from a large unannotated dataset with the support of an unsupervised language parser (Jin et al. 2018). Our system yields quite promising results, indicating what could be a new direction for language acquisition.


Title: Are We There Yet? Learning to Localize in Embodied Instruction Following

Author: Shane Storks (University of Michigan) and Qiaozi Gao, Govind Thattai, Gokhan Tur (Amazon Alexa AI)

Abstract: Embodied instruction following is a challenging problem requiring an agent to infer a sequence of primitive actions to achieve a goal environment state from complex language and visual inputs. Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem, consisting of step-by-step natural language instructions for achieving subgoals which compose into an ultimate high-level goal. Key challenges for this task include localizing target locations and navigating to them through visual inputs, and grounding language instructions to the visual appearance of objects. To address these challenges, in this study we augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each time step. We also improve language grounding by introducing a pre-trained object detection module into the model pipeline. Empirical studies show that our approach exceeds baseline model performance.


Title: Designing a neuro-cognitive system for interactive task learning

Author: Shiwali Mohan (PARC)

Abstract: Interactive task learning (ITL) is an emerging area of research at the intersection of artificial intelligence, machine learning, human-computer interaction, and cognitive science. It studies how intelligent agents can be designed so that they can dynamically extend their domain knowledge and skills through natural interactions with human collaborators. The capability of learning new domain concepts and task knowledge online, post-deployment, is critical to the adoption of complex intelligent agents such as general-purpose robots. In this talk, I will introduce AILEEN, a neuro-cognitive ITL system that is currently under development at the Palo Alto Research Center. AILEEN brings together algorithmic methods from diverse disciplines of AI into a single, hybrid, interactive system and is embodied in a simulated robotic platform. Its components implement deep learning, rule-based reasoning, analogical processing and generalization, and planning, and together they demonstrate complex interactive behavior. The talk will discuss challenges in developing a hybrid AI system, summarize the design choices we have made in AILEEN, and argue for creative and innovative ways to evaluate ITL systems.


Title: Learning Neuro-symbolic Representations of Commonsense Knowledge

Author: Antoine Bosselut (Stanford University)

Abstract: Situations described using natural language are richer than what humans explicitly communicate. For example, the sentence "She pumped her fist" connotes many potential auspicious causes. For machines to understand natural language, they must be able to make commonsense inferences about explicitly stated information. However, current NLP systems lack the ability to connect the situations they encounter to relevant world knowledge. Moreover, they cannot learn to reason over linked facts to robustly generalize to future unseen events. In this talk, I will describe efforts at measuring the degree of commonsense knowledge already encoded by large-scale language models, and show that these models can be converted into more expressive knowledge models that hypothesize relevant commonsense knowledge on demand.