NLP Beyond Text

Workshop on Cross/Multi modal Natural Language Processing

co-located online with EMNLP 2020


  • The Proceedings of the workshop can be found on the ACL Anthology website.

  • The Workshop Program is out!

  • To register for the workshop, please follow the instructions on the EMNLP 2020 website. Early registration ends October 30, 2020, 11:59 PM (EDT).

  • We're happy to announce that Loïc Barrault, Senior Lecturer at the University of Sheffield, will be the keynote speaker for NLPBT.

  • The list of accepted papers is available here.

  • Submission deadline extended to August 31st!

  • The 2nd call for papers is out. Dates have been updated to enable dual submissions with EMNLP 2020. Check here for more details.

  • COVID-19 Update: We hope everyone and their loved ones are staying safe during the COVID-19 pandemic. The workshop will be held online following the policy of EMNLP 2020.


Humans interact with each other through several means (e.g., voice, gestures, written text, facial expressions), and a natural human-machine interaction system should support the same modalities (Kiela et al., 2018). However, traditional Natural Language Processing (NLP) focuses on analyzing textual input to solve language understanding and reasoning tasks; other modalities are only partially targeted. This workshop aims to promote research in Multi/Cross-Modal NLP, i.e., computational approaches that exploit the different modalities humans adopt to communicate.

In particular, the focus of this workshop is (i) studying how to bridge the gap between NLP on spoken and written language (Chung et al., 2018; Elizalde et al., 2019) and (ii) exploring how NLU models can be empowered by jointly analyzing multiple input sources, including language (spoken or written), vision (gestures and expressions) and acoustic (paralinguistic) modalities (Abouelenien et al., 2017; Madhyastha et al., 2018). The former stems from the observation that voice-based interaction, typical of conversational agents, poses new challenges to NLU. The latter reflects the way humans acquire and use language: this usually happens in a perceptually rich environment (Evtimova et al., 2017), where people communicate using modalities that go beyond language itself. Extending NLP to modalities beyond written text is therefore a fundamental step toward AI systems with human-like capabilities.

The workshop seeks papers on relevant topics in cross- and multi-modal NLP. Topics of interest include, but are not limited to:

  • text preprocessing on ASR transcriptions (e.g., ASR error detection and correction);

  • cross-modal NLU from written text to speech transcription;

  • multi-modal sentiment analysis, emotion recognition and sarcasm detection;

  • multi-modal dialogue systems;

  • multi-modal machine translation;

  • multi-modal question answering.

Key Dates

  • Submission Deadline: August 31, 2020 (extended from August 15, 2020)

  • Retraction of papers accepted at EMNLP 2020: September 15, 2020

  • Acceptance Notification: October 2, 2020 (updated from September 29, 2020)

  • Camera-ready version: October 11, 2020 (updated from October 10, 2020)

  • Workshop: November 20, 2020


References

Mohamed Abouelenien, Veronica Perez-Rosas, Rada Mihalcea, and Mihai Burzo. 2017. Multimodal gender detection. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI '17). ACM.

Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. 2017. Emergent language in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.

Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel. 2018. Learning visually grounded sentence representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass. 2018. Unsupervised cross-modal alignment of speech and text embedding spaces. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS). Curran Associates Inc.

Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2018. The role of image representations in vision to language tasks. Natural Language Engineering, 24(3):415–439.

Benjamin Elizalde, Shuayb Zarar, and Bhiksha Raj. 2019. Cross modal audio search and retrieval with joint embeddings based on text and audio. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.