Recently, conversational systems have seen a significant rise in demand due to modern commercial applications such as Amazon's Alexa, Apple's Siri, Microsoft's Cortana, and Google Assistant. Research on multimodal chatbots, in which users and the conversational agent communicate through both natural language and visual data, remains largely underexplored.
Conversational agents are becoming a commodity as a growing number of companies invest in this technology. Their widespread use exposes the many challenges in building agents that are more natural, human-like, and engaging. The research community is actively addressing several of these challenges: How are visual and textual data related in user utterances? How should user intent be interpreted? How can an agent engage in a conversation about multimodal content?
The target audience is research students and practitioners who wish to broaden their understanding of conversational agents that can engage in conversations about multimedia content.
A Computer Science degree or a basic understanding of vision and language data representations is advised.
9:00 AM (GMT+1), September 6, 2022, at the 22nd ACM IVA Conference
We will start the tutorial with an introduction to the concept of Conversational Task Assistants: agents designed to help users complete tasks.
The second part of the tutorial will focus on introducing multimodality into conversational systems, where we will address some of the challenges of assistant embodiment and user understanding.
In the third part, we will discuss further components needed to support multimodal conversations, including a dialog policy, search and recommendation components, and response computation methods.
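As a rough illustration of how these components fit together, the sketch below composes a dialog state tracker, a policy, a retrieval step, and a response generator in a single turn. All names, signatures, and the toy catalog are our own illustrative assumptions, not the implementations presented in the tutorial.

```python
# Illustrative pipeline sketch. All class, function, and data names here
# are hypothetical examples, not the systems covered in the tutorial.
from dataclasses import dataclass, field


@dataclass
class DialogState:
    """Tracks what the agent currently believes about the conversation."""
    intent: str = "unknown"
    history: list = field(default_factory=list)


def dialog_policy(state: DialogState) -> str:
    """Dialog policy: map the tracked state to the next system action."""
    return "retrieve" if state.intent == "search" else "clarify"


def retrieve(query: str) -> list:
    """Stand-in for the search/recommendation component (toy catalog)."""
    catalog = ["red dress", "blue jacket", "white sneakers"]
    words = query.lower().split()
    return [item for item in catalog if any(w in item.split() for w in words)]


def compute_response(action: str, results: list) -> str:
    """Response computation: turn the chosen action and results into a reply."""
    if action == "retrieve" and results:
        return f"I found: {', '.join(results)}. Would you like more details?"
    return "Could you tell me more about what you are looking for?"


# One simulated turn through the pipeline.
state = DialogState(intent="search", history=["I want a red dress"])
action = dialog_policy(state)
results = retrieve(state.history[-1]) if action == "retrieve" else []
print(compute_response(action, results))
```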
In the final part of the tutorial, we will present case studies of these methods: iFetch, an online fashion shopping assistant, and TWIZ, the award-winning bot of the Alexa Prize TaskBot Challenge.
Part 1: Introduction (30 mins)
Introduction: What is a Conversational Agent?
Key Concepts and Definitions
Part 2: Multimodal Conversational Agents (1 hour)
Virtual Assistant Embodiment and Personality
Understanding the User in Multimodal Conversations
Dialog Manager: Robust Dialog State Tracking
Part 3: Conversational Agent Components (1 hour)
Dialog Policy
Answering User Needs (Search and Recommendation)
Response Computation
Part 4: Case Studies (30 mins)
iFetch: Online Fashion Shopping Assistant
TWIZ: The Multimodal Task Assistant
Associate Professor in the Department of Computer Science, Universidade Nova de Lisboa (FCT NOVA).
He holds a Ph.D. degree (2004-2008) in Computer Science from Imperial College London, UK.
He is regularly involved in international program committees and research projects.
His research interests cover various problems in Vision and Language Mining and Search, such as multimedia retrieval, social media information analysis, and machine learning.
Researcher at NOVA LINCS, currently pursuing a Ph.D. in the area of multimodal conversational systems.
He holds an M.Sc. degree (2015-2020) in Computer Science from NOVA University.
He has experience in conversational search and task-guiding agents, and was the team leader of TWIZ, the award-winning Alexa Prize TaskBot.
His interests include the development of conversational agents, NLP, and multimodal AI.
You can find the resources for this tutorial here.