Oral Presentations
09:16 - 09:28 : Sign Language Sense Disambiguation
10:15 - 10:27 : CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
12:15 - 12:27 : Investigating Occupational Stereotypes on Multimodal text-to-image Models: A Linguistic Analysis
12:28 - 12:40 : Evaluating Semantic Relations in Predicting Textual Labels for Images of Abstract and Concrete Concepts
Note: Each oral presentation will be 8 minutes, with an additional 4 minutes for discussion.
Invited Talk Titles and Abstracts
Anna Rohrbach
Title: Learning from Language
Abstract: Humans rely on multiple modalities to perceive the world and communicate with each other, most importantly vision and language. We also use language to teach each other about new things or point out mistakes. Recently, we have witnessed a paradigm shift toward leveraging language *for* vision. Language is being used to enhance visual models by, e.g., enabling zero-shot capabilities, improving generalization, or mitigating bias. I will present my recent efforts toward building vision models that can ingest language advice to improve their behavior across a range of scenarios.
Ece Takmaz
Title: Human Signals in the Integration of Vision and Language in Deep Neural Networks
Abstract: There is an intricate relation between the properties of an image and how humans behave while describing it. This behavior shows ample variation, as manifested in human signals such as eye movements and the moment at which speakers start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals and whether image representations encoded by pretrained vision encoders can capture such variation. This work opens up the possibility of using pretrained multimodal encoders to quantify patterns in human data and shed light on the underlying cognitive mechanisms, as well as to identify the shortcomings of such encoders.
Submission Talks
Sign Language Sense Disambiguation (Tanalp Agustoslu, Jana Grimm, Oliver Kraus, Miriam Winkler)
CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding (Ivana Beňová, Michal Gregor, Albert Gatt)
Investigating Occupational Stereotypes on Multimodal text-to-image Models: A Linguistic Analysis (Hermine Kleiner, Lena Altinger, Sebastian Loftus, Sarah Anna Uffelmann)
Evaluating Semantic Relations in Predicting Textual Labels for Images of Abstract and Concrete Concepts (Tarun Tater, Sabine Schulte im Walde, Diego Frassinelli)