ACM Multimedia Workshop on Multimodal Conversational AI

16th October, 2pm CET

Keynotes


Augmenting Machine Intelligence with Multimodal Information

Prof. Zhou Yu, University of California, Davis, USA


Abstract

Humans interact with other humans or the world through information from various channels including vision, audio, language, haptics, etc. To simulate intelligence, machines require similar abilities to process and combine information from different channels to acquire better situation awareness, better communication ability, and better decision-making ability. In this talk, we describe three projects. In the first study, we enable a robot to utilize both vision and audio information to achieve better user understanding. Then we use incremental language generation to improve the robot's communication with a human. In the second study, we utilize multimodal history tracking to optimize policy planning in task-oriented visual dialogs. In the third project, we tackle the well-known trade-off between dialog response relevance and policy effectiveness in visual dialog generation. We propose a new machine learning procedure that alternates between supervised learning and reinforcement learning to jointly optimize language generation and policy planning in visual dialogs. We will also cover some recent ongoing work on image synthesis through dialogs.
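
For readers curious what such an alternating procedure can look like in practice, here is a minimal sketch in PyTorch. It is purely illustrative and not taken from the talk: the model interface (log_prob, sample), the batch format, and the reward function are all hypothetical stand-ins.

    import torch

    def train_alternating(model, optimizer, sl_batches, rl_batches, reward_fn, epochs=10):
        """Alternate a supervised (maximum-likelihood) phase with a
        reinforcement (policy-gradient) phase each epoch. All interfaces
        here are hypothetical placeholders, not the talk's actual system."""
        for _ in range(epochs):
            # Supervised phase: fit human reference responses, which keeps
            # the generated language relevant and fluent.
            for image, history, reference in sl_batches:
                loss = -model.log_prob(reference, image, history).mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Reinforcement phase: REINFORCE on a dialog-level reward,
            # which pushes the policy toward effective task behaviour.
            for image, history in rl_batches:
                response, log_prob = model.sample(image, history)
                reward = reward_fn(response, image, history)
                loss = -(reward * log_prob).mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

The interleaving is the point: the supervised phase anchors the generator to human-like responses, while the reinforcement phase optimizes the dialog policy, addressing the relevance/effectiveness trade-off described above.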


Bio: Zhou Yu is an Assistant Professor in the Computer Science Department at UC Davis. She will join the Computer Science Department at Columbia University in January 2021 as an Assistant Professor. She obtained her Ph.D. from Carnegie Mellon University in 2017. Zhou has built various dialog systems with real-world impact, such as a job interview training system, a depression screening system, and a second language learning system. Her research interests include dialog systems, language understanding and generation, vision and language, human-computer interaction, and social robots. Zhou received an ACL 2019 best paper nomination, was featured in Forbes' 2018 30 Under 30 in Science, and won the 2018 Amazon Alexa Prize.

Response Generation and Retrieval in Multimodal Conversational AI

Prof. Verena Rieser, Heriot-Watt University, UK

Abstract

With recent progress in deep learning, there has been an increased interest in multimodal conversational AI, which requires an AI agent to hold a meaningful conversation with humans in natural language about content in other modalities, e.g. images or videos.

In this talk, I will present two case studies: one on generating responses for closed-domain, task-based multimodal dialogue systems, with applications in conversational multimodal search; and one on selecting/retrieving responses for open-domain multimodal systems, with applications in visual dialogue and visual question answering.
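
As a rough illustration of the retrieval setting, the sketch below ranks a pool of candidate responses against a fused encoding of the image and the dialogue history. The encoders and the simple additive fusion are assumptions made for the example, not the systems presented in the talk.

    import torch

    def select_response(image_feats, history_emb, candidate_embs):
        """Pick the best response from a candidate pool.

        image_feats:    (d,) visual features, e.g. from a pretrained CNN
        history_emb:    (d,) encoding of the dialogue history so far
        candidate_embs: (n, d) encodings of n candidate responses
        All encoders here are placeholders for this illustration.
        """
        # Simple additive fusion of the two modalities; real systems often
        # use attention-based multimodal fusion instead.
        context = torch.tanh(image_feats + history_emb)
        scores = candidate_embs @ context  # (n,) dot-product similarities
        return int(torch.argmax(scores))

The retrieval formulation sidesteps generation errors entirely: response quality is bounded by the candidate pool, which is one reason it remains attractive for open-domain settings such as visual dialogue.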

Throughout my talk I will highlight open challenges, including context modelling, knowledge grounding, encoding history, multimodal fusion, evaluation techniques, and shortcomings of current datasets.


Bio: Verena Rieser is a professor at Heriot-Watt University, where she leads research on Natural Language Generation and Spoken Dialogue Systems. She is also a co-founder of the Conversational AI company Alana. Verena was recently awarded a Senior Research Fellowship by the Royal Society and is principal investigator on several funded research projects and industry awards.

Over the past two years, Verena's team was one of the three finalists in the prestigious Amazon Alexa Prize. Her research has featured in the BBC's documentary The Joy of AI, the BBC's Tomorrow's World, and in national and international news. She also acts as an advisor on AI to the Royal Society of Edinburgh, UK Research and Innovation, DATAIA (the French Institute for AI), and the GOV.UK Centre for Data Ethics and Innovation.