2nd Workshop on Linguistic Insights from and for Multimodal Language Processing
Co-located with KONVENS2024
Processing and integrating multimodal information (such as visual representations of the environment, auditory cues, images, gestures, gaze, etc.) is a constant and effortless process in human language processing. Recent progress in the area of language & vision, large-scale visually grounded language models, and multimodal learning (e.g., CLIP (Radford et al., 2021), ViLBERT (Lu et al., 2019)) has led to breakthroughs in challenging multimodal NLP applications such as image–text retrieval, image captioning (Cornia et al., 2020), and visual question answering (Antol et al., 2015). Yet, modeling the semantics and pragmatics of situated language understanding and generation and, more generally, language processing beyond the linguistic context, i.e., in combination with multiple other modalities, remains one of the biggest challenges in NLP and Computational Linguistics (Bisk et al., 2020).
Recent efforts in understanding complex multimodal phenomena in language and dialogue have explored a variety of aspects of multimodality and produced a substantial number of valuable multimodal datasets and models. These cover various types of text (from short, informal social media comments to more formal news articles, instructions/manuals, and legal documents, usually accompanied by an image, meme, animation, or video) and dialogue (from reference games and instruction dialogues to fully situated interaction with agents and robots). The variety in this wide problem space and its downstream tasks also requires a variety of approaches to tackle them. As a result, Multimodal Language Processing is approached from many different sub-areas of Computational Linguistics and NLP---computational semantics and pragmatics, dialogue modeling, language modeling and grounding, multimodal and crossmodal learning, and beyond, including physical or robotic actions.
While there have been recent venues and workshops targeting multimodal representation learning and large-scale language and vision models, there is a lack of discussion in the community focusing on linguistic multimodal phenomena, domain- and task-specific analyses of multimodality and, more generally, the contributions of computational linguistics to multimodal learning and vice versa (Parcalabescu et al., 2022). With this workshop, we aim to bring together researchers who work on various linguistic aspects of multimodal language processing to discuss and share recent advances in this interdisciplinary field.
The main goals of this workshop are to
Discuss various tasks, phenomena, models, and problems in multimodal language processing
Discuss how insights from (computational) linguistics can inform multimodal learning and modeling
Facilitate networking and encourage collaboration between researchers working on different aspects of multimodality in computational linguistics and language processing
Important Dates
Submission deadline: 21 June 2024 (extended from 14 June 2024)
Author notification: 2 August 2024
Camera ready: 12 August 2024
Workshop: 13 September 2024
Note: All deadlines are 11:59PM UTC-12:00 ("anywhere on Earth")
Submission link: https://cmt3.research.microsoft.com/LIMO2024
Organising Committee
Piush Aggarwal, FernUniversität in Hagen, piush.aggarwal@fernuni-hagen.de
Özge Alaçam, Universität Bielefeld, oezge.alacam@uni-bielefeld.de
Carina Silberer, Universität Stuttgart, carina.silberer@ims.uni-stuttgart.de
Sina Zarrieß, Universität Bielefeld, sina.zarriess@uni-bielefeld.de
Torsten Zesch, FernUniversität in Hagen, torsten.zesch@fernuni-hagen.de