Special Session on End-to-End Approaches for Spoken Language Understanding
Many tasks in the speech processing domain have long been approached with a pipeline that cascades several components, the outputs of each one becoming the inputs to the next, before a final decision is produced. While fairly successful, modeling speech processing problems as a pipeline of components suffers from several downsides: errors are propagated downstream, and the individual components are typically trained separately with criteria that are specific to each component rather than the final metric of the downstream task.
One such example is spoken language understanding (SLU), which is typically performed through a cascade of automatic speech recognition (ASR) and natural language understanding (NLU) components trained separately. The recognition output of the ASR component becomes the input to the NLU component. However, the recognition output can be erroneous, causing performance degradation downstream. Furthermore, ASR is usually optimized for word accuracy, which is not necessarily optimal for downstream tasks.
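To make this failure mode concrete, below is a minimal sketch of the cascaded pipeline. The `asr` and `nlu` functions are toy stand-ins, not any toolkit's API: the point is only that the NLU stage consumes whatever the ASR stage emits, so a single recognition error can flip the final prediction.

```python
# A toy sketch of the cascaded SLU pipeline: ASR output becomes NLU input,
# so a recognition error propagates into the intent decision. Both stages
# here are hypothetical placeholders for separately trained components.

def asr(audio):
    """Toy ASR stage: simulate a single-word recognition error."""
    # A real ASR component would be trained separately, typically to
    # minimize word error rate rather than downstream intent error.
    return audio.replace("lights", "flights")  # simulated misrecognition

def nlu(transcript):
    """Toy NLU stage: keyword rules standing in for a trained classifier."""
    if "flights" in transcript:
        return "book_flight"
    if "lights" in transcript:
        return "control_lights"
    return "unknown"

utterance = "turn on the lights"      # what the user actually said
intent = nlu(asr(utterance))          # cascade: the ASR error propagates
print(intent)                         # -> "book_flight", not "control_lights"
```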
Recently, a number of end-to-end approaches that forgo the conventional cascade of components have shown promising results in areas such as SLU [1-5], emotion detection, language identification, acoustic event detection, and distant speech recognition. This may be due in part to the fact that end-to-end approaches allow the entire model to be optimized directly for the final task and do not suffer from error propagation at the component level. On the other hand, end-to-end approaches for spoken utterances must solve a challenging many-to-one or many-to-few mapping from inputs to output labels: the input sequence often contains a large number of acoustic frames, many of which may not be salient for predicting the correct labels.
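The sketch below illustrates one common way to realize this many-to-one mapping. It is an illustrative PyTorch example, not the architecture of any particular paper cited above; the model name, layer sizes, and the use of attention pooling over frames are all assumptions made for the example. A recurrent encoder summarizes the frame sequence, attention weights down-weight non-salient frames, and a single classifier head is trained with the task's own loss, so the whole model is optimized end-to-end.

```python
# A minimal, illustrative end-to-end SLU classifier (hypothetical, not a
# specific published model): many acoustic frames in, one intent label out,
# trained directly with the downstream task's loss.

import torch
import torch.nn as nn

class EndToEndSLU(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_intents=10):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)        # scores each frame's salience
        self.classifier = nn.Linear(hidden, n_intents)

    def forward(self, frames):                  # frames: (batch, time, n_mels)
        states, _ = self.encoder(frames)        # (batch, time, hidden)
        weights = torch.softmax(self.attn(states), dim=1)  # attention over time
        pooled = (weights * states).sum(dim=1)  # many frames -> one vector
        return self.classifier(pooled)          # intent logits

model = EndToEndSLU()
logits = model(torch.randn(4, 500, 80))         # 500 frames -> 1 label each
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,)))
loss.backward()                                 # entire model optimized end-to-end
```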
The aim of this special session is to bring together researchers from disparate fields working on end-to-end approaches to the classification of speech signals, to discuss commonalities in methods, explore opportunities for unification, understand the limitations of current approaches, and debate ways of addressing those limitations.
Ryan Price
Principal Inventive Scientist, Interactions
Yao Qian
Senior Research Scientist, Educational Testing Service (ETS)
Mirco Ravanelli
Postdoctoral Researcher, MILA Lab at Université de Montréal
Vikrant Singh Tomar
Founder and CTO, Fluent.ai Inc.
[1] Qian, Y., et al. "Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system." 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
[2] Chen, Y.P., et al. "Spoken language understanding without speech recognition." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[3] Serdyuk, D., et al. "Towards end-to-end spoken language understanding." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[4] Haghani, P., et al. "From Audio to Semantics: Approaches to end-to-end spoken language understanding." 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018.
[5] Lugosch, L., et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." To appear in Proceedings of INTERSPEECH 2019. arXiv:1904.03670, 2019.