Special Session on End-to-End Approaches for Spoken Language Understanding
Many tasks in the speech processing domain have long been approached with a pipeline that cascades several components, the outputs of each one becoming the inputs to the next, before a final decision is produced. While fairly successful, modeling speech processing problems as a pipeline of components suffers from several downsides: errors are propagated downstream, and the individual components are typically trained separately with criteria that are specific to each component rather than the final metric of the downstream task.
One such example is spoken language understanding (SLU), which is typically performed through a cascade of automatic speech recognition (ASR) and natural language understanding (NLU) components trained separately. The recognition output of the ASR component becomes the input to the NLU component. However, the recognition output can be erroneous, causing performance degradation downstream. Furthermore, ASR is usually optimized for word accuracy, which is not necessarily optimal for downstream tasks.
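To make this failure mode concrete, below is a minimal sketch of the cascaded pipeline. The `asr` and `nlu` functions are toy stand-ins, not any toolkit's API: the point is only that the NLU stage consumes whatever the ASR stage emits, so a single recognition error can flip the final prediction.

```python
# A toy sketch of the cascaded SLU pipeline: ASR output becomes NLU input,
# so a recognition error propagates into the intent decision. Both stages
# here are hypothetical placeholders for separately trained components.

def asr(audio):
    """Toy ASR stage: simulate a single-word recognition error."""
    # A real ASR component would be trained separately, typically to
    # minimize word error rate rather than downstream intent error.
    return audio.replace("lights", "flights")  # simulated misrecognition

def nlu(transcript):
    """Toy NLU stage: keyword rules standing in for a trained classifier."""
    if "flights" in transcript:
        return "book_flight"
    if "lights" in transcript:
        return "control_lights"
    return "unknown"

utterance = "turn on the lights"      # what the user actually said
intent = nlu(asr(utterance))          # cascade: the ASR error propagates
print(intent)                         # -> "book_flight", not "control_lights"
```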
Recently, a number of end-to-end approaches that forgo the conventional cascade of components have shown promising results in areas such as SLU [1-5], emotion detection, language identification, acoustic event detection, and distant speech recognition. This may be due in part to the fact that end-to-end approaches allow the entire model to be optimized directly for the final task and do not suffer from error propagation at the component level. On the other hand, end-to-end approaches for spoken utterances must solve a challenging many-to-one or many-to-few mapping from inputs to output labels: the input sequence often contains a large number of acoustic frames, many of which may not be salient for predicting the correct labels.
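The sketch below illustrates one common way to realize this many-to-one mapping. It is an illustrative PyTorch example, not the architecture of any particular paper cited above; the model name, layer sizes, and the use of attention pooling over frames are all assumptions made for the example. A recurrent encoder summarizes the frame sequence, attention weights down-weight non-salient frames, and a single classifier head is trained with the task's own loss, so the whole model is optimized end-to-end.

```python
# A minimal, illustrative end-to-end SLU classifier (hypothetical, not a
# specific published model): many acoustic frames in, one intent label out,
# trained directly with the downstream task's loss.

import torch
import torch.nn as nn

class EndToEndSLU(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_intents=10):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # scores each frame's salience
        self.classifier = nn.Linear(hidden, n_intents)

    def forward(self, frames):                  # frames: (batch, time, n_mels)
        states, _ = self.encoder(frames)        # (batch, time, hidden)
        weights = torch.softmax(self.attn(states), dim=1)  # attention over time
        pooled = (weights * states).sum(dim=1)  # many frames -> one vector
        return self.classifier(pooled)          # intent logits

model = EndToEndSLU()
logits = model(torch.randn(4, 500, 80))         # 500 frames -> 1 label each
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,)))
loss.backward()                                 # entire model optimized end-to-end
```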
The aim of this special session is to bring together researchers from disparate fields working on end-to-end approaches to the classification of speech signals, to discuss commonalities in methods, explore opportunities for unification, understand the limitations of current approaches, and debate ways of addressing those limitations.
Ryan Price
Principal Inventive Scientist, Interactions
Yao Qian
Senior Research Scientist, Educational Testing Service (ETS)
Mirco Ravanelli
Postdoctoral Researcher, MILA Lab at Université de Montréal
Vikrant Singh Tomar
Founder and CTO, Fluent.ai Inc.
[1] Qian, Y., et al. "Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system." 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
[2] Chen, Y.P., et al. "Spoken language understanding without speech recognition." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[3] Serdyuk, D., et al. "Towards end-to-end spoken language understanding." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[4] Haghani, P., et al. "From Audio to Semantics: Approaches to end-to-end spoken language understanding." 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018.
[5] Lugosch, L., et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." To appear in Proceedings of INTERSPEECH 2019. arXiv:1904.03670, 2019.