Invited Talks

Behavioral Machine Intelligence for health applications

Shrikanth (Shri) Narayanan

University of Southern California, Los Angeles, CA

Signal Analysis and Interpretation Laboratory

http://sail.usc.edu

The convergence of sensing, communication and computing technologies is allowing capture and access to data, in diverse forms and modalities, in ways that were unimaginable even a few years ago. These include data that afford the analysis and interpretation of multimodal cues of verbal and non-verbal human behavior to facilitate human behavioral research and its translational applications in healthcare. These data carry crucial information not only about a person’s intent, identity and traits but also about underlying attitudes, emotions and other mental state constructs. Automatically capturing these cues, although vastly challenging, offers the promise not just of efficient data processing but of tools for discovery that enable hitherto unimagined scientific insights, and of means for supporting diagnostics and interventions.

Recent computational approaches that make judicious use of both data and knowledge have yielded significant advances in this regard, for example in deriving rich, context-aware information from multimodal signal sources including human speech, language, and videos of behavior. These sources can be complemented and integrated with data about human brain and body physiology. This talk will focus on some of the advances and challenges in gathering such data and creating algorithms for machine processing of such cues. It will highlight some of our ongoing efforts in Behavioral Signal Processing (BSP)—technology and algorithms for quantitatively and objectively understanding typical, atypical and distressed human behavior—with a specific focus on communicative, affective and social behavior. The talk will illustrate Behavioral Informatics applications of these techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion. Examples will be drawn from mental health and well-being realms such as autism spectrum disorders, couples therapy, depression, and addiction counseling.

Biography of the speaker:

Shrikanth (Shri) Narayanan is the Niki & C. L. Max Nikias Chair in Engineering at the University of Southern California, where he is Professor of Electrical Engineering, jointly appointed in Computer Science, Linguistics, Psychology, Neuroscience and Pediatrics, Director of the Ming Hsieh Institute, and Research Director of the Information Sciences Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the Acoustical Society of America, IEEE, ISCA, the American Association for the Advancement of Science (AAAS), the Association for Psychological Science, and the National Academy of Inventors. Shri Narayanan is Editor-in-Chief of the IEEE Journal of Selected Topics in Signal Processing, an Editor for the Computer Speech and Language journal, and an Associate Editor for the APSIPA Transactions on Signal and Information Processing, having previously served as an Associate Editor for the IEEE Transactions on Speech and Audio Processing (2000-2004), the IEEE Signal Processing Magazine (2005-2008), the IEEE Transactions on Signal and Information Processing over Networks (2014-2015), the IEEE Transactions on Multimedia (2008-2012), the IEEE Transactions on Affective Computing, and the Journal of the Acoustical Society of America. He is a recipient of several honors, including the 2015 Engineers Council Distinguished Educator Award, a Mellon award for mentoring excellence, and the 2005 and 2009 Best Journal Paper awards from the IEEE Signal Processing Society; he served as that society's Distinguished Lecturer for 2010-11, as an ISCA Distinguished Lecturer for 2015-16, and as the 2017 Willard R. Zemlin Memorial Lecturer for ASHA. With his students, he has received several best paper awards, including a 2014 Ten-year Technical Impact Award from ACM ICMI, and his team is a six-time winner of the Interspeech Challenges. He has published over 800 papers and has been granted 17 U.S. patents.

Building speech technology systems for unwritten languages

Odette Scharenborg

Associate Professor,

Multimedia Computing Group, Delft University of Technology, Netherlands

Automatic speech recognition (ASR) technologies require a large amount of annotated data for a system to work reasonably well. For many languages in the world, though, not enough speech data is available, or the available data lack the annotations needed to train an ASR system. In fact, it is estimated that the minimum amount of data needed to train an ASR system is available for only about 1% of the world's languages. The “Speaking Rosetta” JSALT 2017 project laid the foundation for a new research area, “unsupervised multi-modal language acquisition”. It showed that it is possible to build useful speech and language technology (SLT) systems without any textual resources in the language for which the SLT is built, in a way that is similar to how infants learn a language. I will present a summary of the accomplishments of the multi-disciplinary “Speaking Rosetta” workshop, which explored the computational and scientific issues surrounding the discovery of linguistic units in a language without orthography. I will focus on our efforts on 1) unsupervised discovery of acoustic units from raw speech, and 2) building language and speech technology in which the orthographic transcriptions were replaced by images and/or translated text in a well-resourced language.
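As a toy illustration of what "unsupervised discovery of acoustic units" means (and only that; this is not the approach developed in the workshop), the sketch below clusters MFCC frames of untranscribed speech into a small inventory of pseudo-units. The file name, inventory size, and feature choice are all assumptions made for the example.

```python
# A crude, illustrative baseline for acoustic unit discovery: k-means over
# MFCC frames of untranscribed speech, so each frame receives a pseudo-unit id.
# 'speech.wav' is a hypothetical input file; 50 units is an assumed inventory size.
import librosa
from sklearn.cluster import KMeans

audio, sr = librosa.load("speech.wav", sr=16000)          # untranscribed speech
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T  # (n_frames, 13)

n_units = 50
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(mfcc)
pseudo_units = kmeans.labels_                             # one pseudo-unit id per frame

# Collapse runs of identical labels into a segment-level "transcription".
segments = [int(u) for i, u in enumerate(pseudo_units)
            if i == 0 or u != pseudo_units[i - 1]]
print(segments[:20])
```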

Biography of the speaker:

Odette Scharenborg (PhD) is an associate professor and Delft Technology Fellow in the Multimedia Computing Group at Delft University of Technology, the Netherlands. Previously, she was an associate professor at the Centre for Language Studies, Radboud University Nijmegen, The Netherlands, and a research fellow at the Donders Institute for Brain, Cognition and Behavior at the same university. Her research focuses on narrowing the gap between automatic and human spoken-word recognition. In particular, she is interested in the question of where the difference between human and machine recognition performance originates, and whether it is possible to narrow this difference. She investigates these questions using a combination of computational modelling, machine learning, and behavioral experimentation. In 2008, she co-organized the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise in order to investigate where the human advantage in word recognition originates. She was one of the initiators of the EU Marie Curie Initial Training Network “Investigating Speech Processing In Realistic Environments” (INSPIRE, 2012-2015). In 2017, she co-organized a 6-week Frederick Jelinek Memorial Summer Workshop on Speech and Language Technology on the topic of the automatic discovery of grounded linguistic units for languages without orthography. In 2017 she was also elected to the board of the International Speech Communication Association (ISCA).

Tacotron: End-to-End High Quality Speech Synthesis

Yuxuan Wang

Senior Research Scientist,

Google Research, Mountain View, CA, USA

A text-to-speech (TTS) synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise, and the designs may contain brittle choices. In this talk, I will describe recent advances in end-to-end neural speech synthesis at Google.

I will start by introducing Tacotron, our end-to-end TTS model that can synthesize speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization, which greatly simplifies the voice building pipeline. I will then describe Tacotron 2, which adds a modified WaveNet model on top of Tacotron to improve its audio quality. Tacotron 2 achieves a mean opinion score comparable to that of professionally recorded speech. To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech. Therefore, in the second part of the talk, I will focus on our recent series of work on expressive speech synthesis based on Tacotron, including unsupervised methods for prosody and speaking style modeling.
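As a rough illustration of the "synthesize speech directly from characters" idea, the sketch below wires a character embedding, a recurrent encoder, and an attention-based decoder that emits mel-spectrogram frames. It is a minimal PyTorch sketch under assumed hyperparameters, not the actual Tacotron or Tacotron 2 architecture, which differ in many details (encoder design, attention mechanism, post-net, and the WaveNet vocoder).

```python
# Minimal character-to-mel-spectrogram seq2seq sketch with soft attention.
# Illustrative only; hyperparameters and module choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharToMel(nn.Module):
    def __init__(self, n_chars=64, emb=128, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(hidden, 2 * hidden)
        self.decoder = nn.GRUCell(2 * hidden + n_mels, hidden)
        self.mel_out = nn.Linear(hidden, n_mels)
        self.n_mels = n_mels

    def forward(self, char_ids, n_frames):
        # char_ids: (batch, text_len) integer character indices
        enc, _ = self.encoder(self.embed(char_ids))           # (B, T, 2H)
        B = char_ids.size(0)
        h = enc.new_zeros(B, self.decoder.hidden_size)         # decoder state
        prev_mel = enc.new_zeros(B, self.n_mels)               # "go" frame
        mels = []
        for _ in range(n_frames):
            # content-based soft attention over encoder states
            scores = torch.bmm(enc, self.attn_query(h).unsqueeze(2)).squeeze(2)
            context = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), enc).squeeze(1)
            h = self.decoder(torch.cat([context, prev_mel], dim=1), h)
            prev_mel = self.mel_out(h)
            mels.append(prev_mel)
        return torch.stack(mels, dim=1)                        # (B, n_frames, n_mels)


# Training would minimize e.g. an L1 loss between predicted and target mel
# spectrograms computed from the paired audio; a separate vocoder
# (Griffin-Lim or a WaveNet-style model) turns spectrograms into waveforms.
model = CharToMel()
dummy_text = torch.randint(0, 64, (2, 30))    # toy batch of character ids
print(model(dummy_text, n_frames=100).shape)  # torch.Size([2, 100, 80])
```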

Biography of the speaker:

Yuxuan Wang received his Ph.D. in computer science from The Ohio State University, USA. During his Ph.D., he pioneered the application of deep learning to speech separation and enhancement. Notably, his work led to the first ever demonstration of improved speech intelligibility for hearing-impaired listeners in background noise. Yuxuan Wang joined Google Research as a Research Scientist in 2015. His research interests include robust speech processing, sequence learning and generative modeling. Most recently, he has focused on developing an end-to-end neural speech synthesis system known as Tacotron.

Learning Temporally-aware Representations

Partha Pratim Talukdar

Assistant Professor,

Department of Computational and Data Sciences (CDS) and

Department of Computer Science and Automation (CSA)

Indian Institute of Science (IISc), Bangalore, India

Representation learning from text and knowledge graphs (KGs) has emerged as an active area of research over the last few years. While this has resulted in the development of several representation learning methods, the incorporation of temporal information into the learned representations has remained relatively unexplored. In this talk, I shall first present NeuralDater and AD3, two models which use Graph Convolutional Networks (GCNs) to learn document-level representations for predicting a document's creation time. Afterwards, I shall present HyTE, a temporally-aware KG embedding method which explicitly incorporates time into the entity-relation space.

Joint work with Shib Shankar Dasgupta, Swayambhu Nath Ray, and Shikhar Vashishth.
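For readers unfamiliar with HyTE, the sketch below illustrates the kind of temporally-aware scoring described above: entity and relation embeddings are projected onto a timestamp-specific hyperplane and scored with a TransE-style translation. This is a minimal PyTorch sketch based on the published formulation; the dimensions, names, and training details are illustrative assumptions, not the authors' code.

```python
# HyTE-style scoring sketch: project embeddings onto a per-timestamp hyperplane,
# then score with a TransE-style translation. Sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyTEScore(nn.Module):
    def __init__(self, n_entities, n_relations, n_timestamps, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.time_normal = nn.Embedding(n_timestamps, dim)  # hyperplane normals w_t

    def project(self, x, w):
        # project x onto the hyperplane with unit normal w: x - (w . x) w
        w = F.normalize(w, dim=-1)
        return x - (x * w).sum(-1, keepdim=True) * w

    def forward(self, head, relation, tail, timestamp):
        w = self.time_normal(timestamp)
        h = self.project(self.ent(head), w)
        r = self.project(self.rel(relation), w)
        t = self.project(self.ent(tail), w)
        # lower score = more plausible (head, relation, tail) at this timestamp
        return (h + r - t).abs().sum(-1)


# Training would push scores of observed temporal facts below those of
# corrupted (negative) facts using a margin ranking loss.
model = HyTEScore(n_entities=1000, n_relations=50, n_timestamps=20)
score = model(torch.tensor([3]), torch.tensor([7]),
              torch.tensor([42]), torch.tensor([5]))
print(score)
```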

Biography of the speaker:

Partha Talukdar is an Assistant Professor in the Department of Computational and Data Sciences (CDS) at IISc, Bangalore, and also the founder of Kenome, an enterprise AI startup. Previously, he was a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University, working with Tom Mitchell on the NELL project. Partha received his PhD (2010) in CIS from the University of Pennsylvania, working under the supervision of Fernando Pereira, Zack Ives, and Mark Liberman. Partha is a recipient of an IBM Faculty Award, a Google Focused Research Award, and an Accenture Open Innovation Award. He is a co-author of a book on Graph-based Semi-Supervised Learning published by Morgan & Claypool Publishers. Homepage: http://talukdar.net

Rethinking Attention and Calibration in Sequence to Sequence Learning


Sunita Sarawagi

Professor, Computer Science and Engineering,

IIT Bombay, Mumbai, India

In this talk, I will revisit the popular soft attention model in sequence-to-sequence learning. I will present a simple, more transparent, joint attention model that provides easy gains on several translation and morphological inflection tasks. Next, I will expose a little-known problem of miscalibration in state-of-the-art neural machine translation (NMT) systems. For structured outputs such as those in NMT, calibration is important not just for reliable confidence in predictions, but also for the proper functioning of beam-search inference. I will discuss reasons for miscalibration and some fixes.
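As background on what a calibration "fix" can look like, the sketch below implements temperature scaling, a standard post-hoc calibration method from the literature in which a single scalar rescales the decoder logits on held-out data; the talk's own analysis and remedies for NMT may differ, and the toy data here are purely illustrative.

```python
# Temperature scaling sketch: fit one scalar T > 0 so that softmax(logits / T)
# minimizes the negative log-likelihood on held-out data. The logits/labels
# below stand in for per-token NMT decoder outputs and reference tokens.
import torch
import torch.nn.functional as F


def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Learn a scalar temperature by minimizing cross-entropy of scaled logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T to keep T positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()


# Toy example: overconfident "decoder" logits over a 5000-word vocabulary.
torch.manual_seed(0)
logits = 5.0 * torch.randn(1024, 5000)    # held-out token-level logits
labels = torch.randint(0, 5000, (1024,))  # reference tokens
T = fit_temperature(logits, labels)
print(f"fitted temperature: {T:.2f}")     # T > 1 flattens an overconfident softmax
```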

Biography of the speaker:

Sunita Sarawagi conducts research in the fields of databases, data mining, and machine learning. Her current research interests are deep learning, graphical models and information extraction. She is an Institute Chair Professor at IIT Bombay. She received her PhD in databases from the University of California at Berkeley and a bachelor's degree from IIT Kharagpur. Her past affiliations include visiting faculty at Google Research, Mountain View, CA, visiting faculty at CMU, Pittsburgh, and research staff member at IBM Almaden Research Center. She has several publications in databases and data mining and several patents. She serves on the boards of directors of ACM SIGKDD and the VLDB Endowment. She was program chair for the ACM SIGKDD 2008 conference, research track co-chair for the VLDB 2011 conference, and has served as a program committee member for the SIGMOD, VLDB, SIGKDD, ICDE, and ICML conferences. She is or has been on the editorial boards of the ACM TODS, ACM TKDD, and Foundations and Trends in Machine Learning journals.