NLP Beyond Text

Workshop on Cross/Multi modal Natural Language Processing

co-located online with EMNLP 2020


Humans interact with each other through several means (e.g., voice, gestures, written text, facial expressions), and a natural human-machine interaction system should support the same modalities (Kiela et al., 2018). However, traditional Natural Language Processing (NLP) focuses on analyzing textual input to solve language understanding and reasoning tasks, and other modalities are only partially targeted. This workshop aims to promote research in the area of Multi/Cross-Modal NLP, i.e., the study of computational approaches that exploit the different modalities humans use to communicate. In particular, the focus of this workshop is (i) studying how to bridge the gap between NLP on spoken and written language (Chung et al., 2018; Elizalde et al., 2019) and (ii) exploring how NLU models can be empowered by jointly analyzing multiple input sources, including language (spoken or written), vision (gestures and expressions) and acoustic (paralinguistic) modalities (Abouelenien et al., 2017; Madhyastha et al., 2018). The former comes from the observation that voice-based interaction, which is typical of conversational agents, poses new challenges to NLU. The latter aims to address the way humans acquire and use language. This usually happens in a perceptually rich environment (Evtimova et al., 2017), where humans communicate using modalities that go beyond language itself. Therefore, extending NLP to modalities beyond written text is a fundamental step in allowing AI systems to reach human-like capabilities.

The workshop seeks papers on topics that fall under cross-modal and multi-modal NLP. Topics of interest include, but are not limited to:

  • text preprocessing on ASR transcriptions (e.g., ASR error detection and correction);

  • cross-modal NLU from written text to speech transcription;

  • multi-modal sentiment analysis, emotion recognition and sarcasm detection;

  • multi-modal dialogue systems;

  • multi-modal machine translation;

  • multi-modal question answering.

