Dialog Systems: Trends and Challenges

December 15th, 2022 | IIIT Delhi, India

Abstract

Dialog systems play a crucial role in providing a human-like conversational experience to the user. Several components work in unison to deliver natural and intelligible responses from a Conversational Agent (CA). Key components include, but are not limited to, Dialog Understanding, the Dialog Manager, the Information Retrieval system, and Dialog Generation. This tutorial provides an overview of the research trends and the challenges involved in building end-to-end dialog systems.

Dialog Understanding and Manager

Depending on the requirements of the task, different pieces of information are extracted from a conversation. The extracted information helps the conversational agent continue the conversation, and may include task-specific elements such as the domain, slots, intents, and dialog acts, among others. Consider the task of reserving a table at a restaurant: the domain is ‘restaurant’. The restaurant domain may include slots such as the restaurant name, number of people, date, time, cuisine, location, availability of parking space, and other information helpful for successfully booking a table. The intent captured from an utterance represents the motive of the user; in the restaurant domain, intents could include confirming a booking, asking for cuisine information, rejecting a suggestion, and so on. This information is extracted and aggregated at each turn of the conversation to represent the current state of the conversation, known as the Dialog State (DS). The process of tracking the dialog state at every turn of a conversation is called Dialog State Tracking (DST), a crucial module in a CA.
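To make this concrete, the sketch below shows one simple way to represent and update such a state in Python. The slot names follow the restaurant example above, but the rule-based update logic and all function names are purely illustrative assumptions; production trackers are typically learned models.

# A minimal sketch of a dialog state for the restaurant-booking example.
# The toy update rule lets values from later turns override earlier ones.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DialogState:
    domain: str = "restaurant"
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)  # e.g. {"cuisine": "italian"}

def update_state(state, nlu_output):
    # Aggregate this turn's NLU output into the running dialog state.
    state.intent = nlu_output.get("intent", state.intent)
    state.slots.update(nlu_output.get("slots", {}))
    return state

# Turn 1: "Book a table for four tomorrow."
state = update_state(DialogState(), {"intent": "book_table",
                                     "slots": {"people": 4, "date": "tomorrow"}})
# Turn 2: "Make it an Italian place."
state = update_state(state, {"slots": {"cuisine": "italian"}})
print(state)  # domain, intent, and accumulated slots after two turns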

The Policy Manager (PM), another important component of the Dialog Manager, uses the dialog state to decide the next course of action and generate the response accordingly. Another sub-problem within dialog systems is closed- or open-domain Question Answering (QA), wherein users seek answers or information for a query. The QA module explores different information sources to find the relevant answer and frame it as a response.
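As a rough illustration of this decision step, a hand-written policy might map the tracked state to the next system action as in the sketch below. The required slots and the action names are assumptions made for this example; practical policy managers are usually learned, often with reinforcement learning.

# A toy, rule-based policy: request missing slots, then confirm the booking.
def next_action(intent, slots):
    required = ["date", "time", "people", "cuisine"]  # assumed for the example
    missing = [s for s in required if s not in slots]
    if missing:
        return f"request({missing[0]})"  # ask the user for the first missing slot
    if intent == "book_table":
        return "confirm_booking"         # all slots filled: confirm the action
    return "offer_help"                  # fallback action

print(next_action("book_table", {"date": "tomorrow", "people": 4}))
# -> request(time)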

This session explores the “What”, “Why”, and “How” of DST and QA systems. We will study how the DST component interacts with other important modules of a conversational agent, discuss the different domains in which DST and QA have been explored and successfully deployed, and cover the most recent state-of-the-art work, the evaluation methods, and the key challenges involved in designing such systems.

Response Generation

Conventional conversational agents rely on rule-based systems that use pre-defined templates to generate responses. However, this makes chatbots monotonous and can result in low user engagement. With advances in deep learning and the introduction of Large Language Models (LLMs), text generation has seen a surge of applications in domains such as question answering and document summarization. Dialog turn generation using LLMs has gained interest in the research community as a way to make conversations more natural and contextual. The advantages include, but are not limited to:

1. Encoding multi-turn contextual information into the current turn.

2. Controlled dialog generation using dialog acts to incorporate specific pieces of text.

3. Enriching dialogs with paralinguistic information such as emotion, empathy, and persona.

4. Conducting engaging multi-turn conversations.

As such, this tutorial explores several state-of-the-art dialog generation techniques that employ LLMs. We will discuss fine-tuning auto-regressive models such as GPT to generate the next turn using the previous turns as context, as well as controlled text generation using dialog acts, which helps steer the conversation in accordance with the controlling parameters.
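The sketch below, using the Hugging Face transformers library, shows the basic shape of this setup. The speaker tags and the <inform> control token are assumptions: they only carry meaning for a model fine-tuned on dialogs formatted this way, and the plain gpt2 checkpoint here is just a stand-in for such a model.

# Next-turn generation with an auto-regressive LM, conditioned on the
# dialog history and an (assumed) dialog-act control token.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

history = [
    "User: I want to book a table for four tomorrow.",
    "System: Sure, what cuisine do you prefer?",
    "User: Italian, please.",
]
dialog_act = "<inform>"  # illustrative control token, learned during fine-tuning

# Previous turns form the context; the model continues with the next turn.
prompt = "\n".join(history) + f"\n{dialog_act} System:"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,          # sampling tends to give less repetitive responses
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens (the next dialog turn).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))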

Multimodal AI

Multimodal AI is a multi-disciplinary research area that spans heterogeneous and interconnected sources of data, including linguistic, visual, and acoustic signals. It addresses some of the end goals of AI by integrating and modelling multiple modalities to solve real-world problems in multimedia, healthcare, robotics, finance, and human-computer interaction. With the advancement of machine learning techniques, from automatic speech recognition to the more recent LLMs, the area presents unique challenges to researchers, involving the study of individual data characteristics as well as cross-modal alignment. In this tutorial, we start with a historical overview of multimodal research, then investigate the transition from unimodal to multimodal models, and discuss the fundamental technical aspects along with recent advances in multimodal AI. We will also outline future directions and the underlying challenges of building real-world applications at scale.
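As one deliberately simplified example of integrating modalities, the PyTorch sketch below fuses a text embedding and an audio embedding for a joint prediction. The encoder dimensions and class count are placeholders; real systems use pretrained modality encoders and often learn cross-modal alignment with attention rather than simple concatenation.

# A minimal late-fusion sketch: project each modality into a shared space,
# concatenate, and classify. Purely illustrative.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, num_classes=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, 256)    # per-modality projections
        self.audio_proj = nn.Linear(audio_dim, 256)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, text_emb, audio_emb):
        fused = torch.cat([torch.relu(self.text_proj(text_emb)),
                           torch.relu(self.audio_proj(audio_emb))], dim=-1)
        return self.classifier(fused)

model = LateFusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 128))  # a batch of 2
print(logits.shape)  # torch.Size([2, 4])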

Dr. Sandeep Pandey

Architect, Voice Service NLU at Samsung

Raghavendra HR

Leader, Voice Service NLU at Samsung

Abhijit Nargund

Architect, Voice Service NLU at Samsung

Siddhartha Mukherjee

Director, Voice Service NLU at Samsung

Dr. Aniruddha Tammewar

Architect, Voice Service NLU at Samsung

Dr. Rituraj Singh

Architect, Voice Service NLU at Samsung

Vivek Sridhar

Architect, Voice Service NLU at Samsung