Program
22.09.2023
Kick-off for LIMO workshop 9.00 a.m.–9.10 a.m.
2-Minute Teasers for Accepted Papers 9.10 a.m.–9.30 a.m.
Invited Talk - Letitia Parcalabescu 9.30 a.m.–10.15 a.m.
Coffee Break 10.15 a.m.–10.35 a.m.
Poster Session 10.35 a.m.–12.15 p.m.
Invited Talk - Sandro Pezzelle 12.15 p.m.–1.00 p.m.
Invited Talk Titles and Abstracts:
Letitia Parcalabescu
Title: About Vision and Language (VL) models: What grounded linguistic phenomena do they understand? To what extent do they use the image and text modalities?
Abstract: In this talk, we will introduce Vision and Language (VL) models, which perform very well on VL tasks such as image-sentence alignment or visual question answering. While performance on these tasks is important, task-centered evaluation does not disentangle the fine-grained linguistic capabilities of VL models. Therefore, we present our work on VALSE, a VL benchmark comprising a suite of six specific linguistic phenomena grounded in the visual modality. Our zero-shot experiments on VALSE with five widely used pretrained VL models (CLIP, LXMERT, VisualBERT, ViLBERT, and ViLBERT 12-in-1) suggest that current VL models have considerable difficulty addressing most phenomena.
In the second part, we ask how much a VL model uses the image and text modalities in each sample or dataset. To measure the contribution of each modality in a VL model, we present MM-SHAP, which we developed and applied in two ways: (1) to compare VL models in terms of their average degree of multimodality, and (2) to measure, for individual models, the contribution of individual modalities across different tasks and datasets. Experiments with six VL models (LXMERT, CLIP, and four ALBEF variants) on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided.
Sandro Pezzelle
Title: From Word Representation to Communicative Success: The Need for Semantic and Pragmatic Analysis in Visually Grounded Language Processing
Abstract: By grounding language in vision, multimodal NLP models have a key advantage over purely textual ones: they can leverage signals from one or both modalities and potentially combine this information in whatever way a given communicative context requires. This ranges from representing single words by taking their multimodal semantics into account, to resolving semantically underspecified image descriptions, to adapting the way they refer to images in order to achieve communicative success with a given audience. Moving from word-level semantics to real-life communicative scenarios, I will present work investigating the abilities of current language and vision models to account for and deal with semantic and pragmatic aspects of human multimodal communication. I will argue that these abilities are necessary for models to successfully interact with human speakers.
Accepted Papers:
A Pipeline for the Creation of Multimodal Corpora from YouTube Videos (Nathan Dykes, Anna Wilson and Peter Uhrig)
Multi-Modal Learning Application – Support Language Learners with NLP Techniques and Eye-Tracking (Robert Geislinger, Ali Ebrahimi Pourasad, Deniz Gül, Daniel Djahangir, Seid Muhie Yimam, Steffen Remus and Chris Biemann)
Context matters: evaluation of target and context features on variation of object naming (Nikolai Ilinykh and Simon Dobnik)
The Scenario Refiner: Grounding subjects in images at the morphological level (Claudia C. Tagliaferri, Denis Paperno, Albert Gatt and Sofia Axioti)
FlowchartQA: The First Large-Scale Benchmark for Reasoning over Flowcharts (Simon Tannert, Marcelo G. Feighelstein, Jasmina Bogojeska, Joseph Shtok, Assaf Arbelle, Peter W. J. Staar, Anika Schumann, Jonas Kuhn and Leonid Karlinsky)
Presenting an Annotation Pipeline for Fine-grained Linguistic Analyses of Multimodal Corpora (Elena Volkanovska, Sherry Tan, Changxu Duan, Debajyoti Chowdhury and Sabine Bartsch)
Non Verbal Communication is Rational: Information Theoretic Evidence from Co-speech Gaze Data (Abstract Submission by Yu Wang and Hendrik Buschmeier)