Keynote Talks

Speaker: Dilek Hakkani-Tur - Amazon Alexa AI

Title: Language Models for Spoken Language Understanding in Conversational Systems

Abstract: Spoken language understanding (SLU) in dialogue systems aims to identify the user’s intent and extract associated arguments from user utterances. Large language models (LMs) pre-trained on large amounts of text data have boosted SLU performance, among many other natural language processing tasks. In this talk, I plan to give an overview of our research areas and then focus on our work on few-shot learning and robustness to speech recognition errors in SLU. Recent work has proposed in-context learning and the priming and prompting of LMs as zero- and few-shot approaches to language understanding tasks. Along similar lines, we propose to view language understanding tasks as question answering (QA), benefiting from large LMs and QA datasets and resulting in significant improvements for few-shot SLU. Furthermore, large LMs are typically trained on text data using masked language modeling. We propose training LMs with tasks that resemble speech recognition errors, to obtain LMs that are robust to such errors in SLU. In our work, we have shown that these LMs not only result in better SLU performance on speech datasets but also improve speech recognition word error rates. I will conclude the talk with our efforts to enable research on robustness to speech recognition errors in conversational systems.
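To make the QA formulation mentioned above concrete, here is a minimal illustrative sketch (not the speaker's actual system): slot filling is cast as extractive QA by pairing each slot with a natural-language question and running an off-the-shelf QA model over the user utterance. The model name and question templates are assumptions for illustration only.

```python
# Illustrative sketch: slot filling framed as extractive question answering.
# The model checkpoint and the slot-to-question templates are assumptions,
# not the speaker's actual setup.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

utterance = "Wake me up at 7 am tomorrow and play some jazz."
slot_questions = {
    "alarm_time": "What time should the alarm be set for?",
    "music_genre": "What kind of music does the user want to play?",
}

for slot, question in slot_questions.items():
    # The QA model extracts a span of the utterance as the slot value.
    answer = qa(question=question, context=utterance)
    print(f"{slot} -> {answer['answer']} (score={answer['score']:.2f})")
```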


Speaker: Verena Rieser - Heriot Watt University

Title: Responsible Conversational AI: Trusted, Safe and Bias-free

Abstract: With recent progress in deep learning, there has been an increased interest in learning dialogue systems from data, also known as “Conversational AI”. In this talk, I will focus on the task of response generation, for which I will highlight lessons learnt and ongoing challenges, such as reducing ‘hallucinations’ for task-based systems, safety-critical issues for open-domain chatbots, and the often overlooked problem of ‘good’ persona design. I will argue that we will need to solve these challenges to create trusted, safe and bias-free systems for end-user applications.


Speaker: Sakriani Sakti - Nara Institute of Science and Technology / RIKEN Center for Advanced Intelligence Project

Title: Listening while Speaking and Visualizing: A Semi-supervised Approach with Multimodal Machine Speech Chain

Abstract: The development of advanced spoken language technologies, such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), has enabled computers to learn either how to listen or how to speak. Many applications and services are now available, but they are commonly built in a supervised fashion, requiring a large amount of paired speech and corresponding transcriptions.

In this talk, we will introduce a semi-supervised learning mechanism based on a machine speech chain framework. First, we describe the primary machine speech chain architecture, which learns not only to listen or speak but also to listen while speaking. The framework enables ASR and TTS to teach each other given unpaired data. Then, we describe the recent multimodal machine chain framework, which mimics overall human communication by listening while speaking and visualizing. With the support of image captioning and image production models, the framework further reduces the need for a large amount of unpaired data, enabling ASR and TTS to improve their performance using image-only datasets.
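As a rough illustration of the closed-loop idea described above, the toy sketch below shows how an ASR and a TTS model can supervise each other from unpaired data: ASR transcribes unlabeled speech so TTS can be trained to reconstruct it, and TTS synthesizes speech from unlabeled text so ASR can be trained to recover it. The stub models, loss choices, and detachment of the synthesized speech are assumptions for illustration, not the speaker's implementation.

```python
# Toy sketch of the speech-chain closed loop with stand-in models
# (assumptions throughout; real systems use sequence-to-sequence ASR/TTS).
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, VOCAB = 40, 100          # toy speech-feature and vocabulary sizes

class ToyASR(nn.Module):           # speech features -> per-frame token logits
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, VOCAB)
    def forward(self, speech):     # (batch, frames, FEAT_DIM)
        return self.proj(speech)

class ToyTTS(nn.Module):           # token ids -> per-token speech features
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, FEAT_DIM)
    def forward(self, tokens):     # (batch, length)
        return self.emb(tokens)

asr, tts = ToyASR(), ToyTTS()
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)

# Unpaired speech: ASR produces a pseudo-transcript (argmax, no gold labels),
# and TTS is trained to reconstruct the original speech from it.
speech_only = torch.randn(8, 12, FEAT_DIM)
pseudo_text = asr(speech_only).argmax(-1)
loss_speech = F.mse_loss(tts(pseudo_text), speech_only)

# Unpaired text: TTS synthesizes speech (detached so only ASR is updated on
# this path), and ASR is trained to recover the original tokens.
text_only = torch.randint(0, VOCAB, (8, 12))
synth_speech = tts(text_only).detach()
logits = asr(synth_speech)
loss_text = F.cross_entropy(logits.reshape(-1, VOCAB), text_only.reshape(-1))

(loss_speech + loss_text).backward()
opt.step()
```

In the multimodal extension sketched in the abstract, analogous loops through image captioning and image production models allow the chain to learn from image-only data as well.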