Dealing with Meaning Variation in NLP - 3rd Yearly Workshop
Utrecht University, 28th October 2025
This will be a hybrid event - participation will be possible both in person and online - but the size of the room is limited, so please RSVP to Massimo Poesio if you would like to attend in person. Also let us know if you want to attend online so we can add you to the list of attendees on Teams.
In-person location: Serre, Utrecht Botanic Gardens
All times in Central European Time.
10:30 Invited Talk - Manfred Stede (Uni Potsdam): Subjectivity in discourse and argument structure annotation (online)
Abstract: Tree structures have been proposed as representational devices both for discourse structure (e.g., in RST) and for the structure of complex arguments unfolding in text (i.e., which statements support or attack the argumentative impact of other statements). In both cases, annotator agreement is known to be mediocre; and in both cases, some research has proposed to drop the tree requirement and use more general graphs instead. Based on various corpus annotation projects, we compare the roots of annotator disagreement in dealing with the two types of structure and analyse how the two stories can inform each other.
11:10 Coffee break
11:40 Daniil Ignatev (Utrecht Uni): Hypernetworks for Perspectivist Adaptation
Abstract: The task of perspective-aware classification introduces a bottleneck in terms of parametric efficiency that has not received enough attention in existing studies. In this work, we aim to address this issue by applying an existing architecture, the hypernetwork+adapters combination, to perspectivist classification. Ultimately, we arrive at a solution that can compete with specialized models in adopting user perspectives on hate speech and toxicity detection, while also using considerably fewer parameters. Our solution is architecture-agnostic and can be applied to a wide range of base models out of the box.
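For readers unfamiliar with the hypernetwork+adapters combination, the sketch below illustrates the general idea under assumed PyTorch conventions, with hypothetical dimensions and names; it is not the speaker's actual architecture. A hypernetwork maps an annotator embedding to the weights of a small bottleneck adapter applied on top of a frozen base encoder, so only the hypernetwork and annotator embeddings carry perspective-specific parameters.

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Illustrative sketch: a hypernetwork generates per-annotator adapter
    weights from an annotator embedding (hypothetical sizes, not the
    speaker's actual architecture)."""

    def __init__(self, hidden=768, bottleneck=16, num_annotators=100, ann_dim=32):
        super().__init__()
        self.annotator_emb = nn.Embedding(num_annotators, ann_dim)
        # The hypernetwork outputs the flattened down- and up-projection matrices.
        self.hyper = nn.Linear(ann_dim, 2 * hidden * bottleneck)
        self.hidden, self.bottleneck = hidden, bottleneck

    def forward(self, h, annotator_id):
        # h: token representations from a frozen encoder, shape (batch, seq, hidden)
        w = self.hyper(self.annotator_emb(annotator_id))
        down = w[:, : self.hidden * self.bottleneck].view(-1, self.hidden, self.bottleneck)
        up = w[:, self.hidden * self.bottleneck :].view(-1, self.bottleneck, self.hidden)
        # Bottleneck adapter with a residual connection, conditioned on the annotator.
        return h + torch.relu(h @ down) @ up

layer = HyperAdapter()
h = torch.randn(2, 5, 768)
ids = torch.tensor([3, 7])
print(layer(h, ids).shape)  # torch.Size([2, 5, 768])
```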
12:00 Frances Yung (Saarland Uni): Modeling Annotator Perspectives in Implicit Discourse Relation Recognition
Abstract: Discourse relations are the logical links between spans of text, such as causal or contrastive relations, that structure meaning in discourse. Implicit discourse relations—those not marked by connectives like because or but—pose particular challenges, as they require integrating textual semantics with world knowledge. In this talk, we present our work on modeling annotator perspectives in implicit discourse relation recognition (IDRR) using the DiscoGeM corpus, which provides ten annotations per instance. We analyze cases where the perspectivist model can capture individual annotator perspectives, and cases where the bias cannot be modelled.
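As a rough illustration of what ten annotations per instance make possible, the generic sketch below trains against the full label distribution rather than a single majority label; the class count and objective are hypothetical and not necessarily the speaker's model.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits, annotation_counts):
    """Generic soft-label objective for multi-annotated data (illustrative only).

    logits: (batch, num_relations) model scores
    annotation_counts: (batch, num_relations), e.g. how many of the ten
        annotators chose each relation for the instance.
    """
    target = annotation_counts / annotation_counts.sum(dim=-1, keepdim=True)
    # Cross-entropy against the soft target distribution.
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Example: ten annotators, four hypothetical relation classes, split 6/2/1/1.
logits = torch.randn(1, 4)
counts = torch.tensor([[6., 2., 1., 1.]])
print(soft_label_loss(logits, counts))
```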
12:20 Invited Talk - Matthias Orlikowski (Bielefeld Uni): Modelling Annotator Subjectivity (online)
Abstract: Annotator subjectivity is often discounted in Natural Language Processing (NLP). In many supervised tasks, annotator subjectivity manifests as variability and disagreement in labeling, which is usually treated as noise to distill a single ground truth. In contrast, annotator modelling engages productively with annotator subjectivity by learning supervised models that predict the annotations of individual annotators. Annotator models promise to enable more adaptable, more performant and fairer NLP systems. Despite this potential, many aspects of creating annotator models are still unclear. Which architectures should we use, how should we represent annotators and how do we create useful datasets for annotator modelling? I will talk about findings related to each of these questions.
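One simple baseline architecture for annotator modelling, sketched below under assumed PyTorch conventions and hypothetical sizes, conditions a classifier on a learned annotator embedding so that it predicts an individual annotator's label; the talk's findings are not tied to this particular design.

```python
import torch
import torch.nn as nn

class AnnotatorConditionedClassifier(nn.Module):
    """Baseline sketch: predict an individual annotator's label by
    concatenating a learned annotator embedding with the text encoding."""

    def __init__(self, text_dim=768, ann_dim=32, num_annotators=50, num_labels=2):
        super().__init__()
        self.annotator_emb = nn.Embedding(num_annotators, ann_dim)
        self.classifier = nn.Linear(text_dim + ann_dim, num_labels)

    def forward(self, text_repr, annotator_id):
        # text_repr: (batch, text_dim), e.g. a pooled vector from a frozen encoder
        a = self.annotator_emb(annotator_id)
        return self.classifier(torch.cat([text_repr, a], dim=-1))

clf = AnnotatorConditionedClassifier()
text = torch.randn(4, 768)
ann = torch.tensor([0, 1, 2, 3])
print(clf(text, ann).shape)  # torch.Size([4, 2])
```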
13:00 Lunch
14:30 Invited Talk - Rupak Sarkar (Uni Maryland): Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs (online)
Abstract: While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. While LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances that require pragmatic or domain-specific reasoning.
15:10 Nan Li (Utrecht Uni): Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask
Abstract: Collaborative dialogue relies on participants incrementally establishing common ground, yet in asymmetric settings they may believe they agree while referring to different entities. We introduce a perspectivist annotation scheme for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures speaker and addressee grounded interpretations for each reference expression, enabling us to trace how understanding emerges, diverges, and repairs over time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k annotated reference expressions with reliability estimates and analyze the resulting understanding states. The results show that full misunderstandings are rare once lexical variants are unified, but multiplicity discrepancies systematically induce divergences, revealing how apparent grounding can mask referential misalignment. Our framework provides both a resource and an analytic lens for studying grounded misunderstanding and for evaluating (V)LLMs’ capacity to model perspective-dependent grounding in collaborative dialogue.
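To make the perspectivist scheme concrete, the sketch below shows one hypothetical way to encode separate speaker and addressee interpretations per reference expression; the field names and derived understanding state are illustrative, not the authors' released format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReferenceAnnotation:
    """Hypothetical record for one reference expression in a MapTask dialogue."""
    expression: str                     # e.g. "the white cottage"
    speaker_referent: Optional[str]     # landmark the speaker (giver) intends
    addressee_referent: Optional[str]   # landmark the addressee (follower) resolves to

    @property
    def understanding_state(self) -> str:
        if self.addressee_referent is None:
            return "unresolved"
        return "grounded" if self.speaker_referent == self.addressee_referent else "misaligned"

ref = ReferenceAnnotation("the white cottage", "white_cottage", "stone_cottage")
print(ref.understanding_state)  # misaligned
```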
15:30 Hugh Mee Wong (Utrecht Uni): How do language models rate text?
Abstract: LLM-as-a-judge has become a widely adopted method for benchmarking in NLP in cases where there is no ground truth, or where comparing against a reference is difficult to do with simple rule-based metrics or heuristics. One way of using a language model as a judge is to ask it to produce a rating on a (vague) scale, often accompanied by rubrics. My current research focuses on what these ratings actually mean, and how the model translates the semantics of a judgment into the rating it emits. We do this by first studying the symbol binding problem in multiple-choice question answering (MCQA) from a mechanistic interpretability angle, after which we extend our experiments to simple rating setups. This is still work in progress, and in this talk I will present preliminary results on symbol binding in MCQA and rating setups.
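A minimal sketch of the kind of rating setup described above, assuming a Hugging Face causal LM (gpt2 as a stand-in) and a hypothetical prompt, not the speaker's experimental code: rather than only reading the emitted rating, one can inspect the probability mass the model places on each rating symbol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the talk's experiments concern other LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Rate the fluency of the following text on a scale from 1 to 5.\n"
          "Text: The cat sat on the mat.\n"
          "Rating:")
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits
probs = torch.softmax(logits, dim=-1)

# Probability mass on each rating symbol (first sub-token of " 1" ... " 5").
rating_probs = {r: float(probs[tok.encode(" " + r)[0]]) for r in "12345"}
z = sum(rating_probs.values())
expected_rating = sum(int(r) * p / z for r, p in rating_probs.items())
print(rating_probs, expected_rating)
```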
15:50 Anh Dang (Utrecht Uni): Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference
Abstract: Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human preferences when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments. We will also present some follow-up directions in which we use mechanistic interpretability techniques to uncover how LLMs process ambiguous plural references.
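A minimal sketch of a next-token-prediction probe for pronoun production, assuming a Hugging Face causal LM (gpt2 as a stand-in) and a hypothetical split-antecedent context; the talk's actual experiments use different models and materials.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical split-antecedent context: is the plural pronoun preferred next?
context = "John met Mary at the station and"
inputs = tok(context, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0, -1], dim=-1)

for pronoun in [" they", " he", " she"]:
    tid = tok.encode(pronoun)[0]  # first sub-token of the candidate pronoun
    print(pronoun.strip(), float(probs[tid]))
```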
16:10 Tea Break
16:30 Invited Talk - Beiduo Chen (LMU Munich): Explanations as a Catalyst: Leveraging Large Language Models to Embrace Human Label Variation (in person)
Abstract: Human label variation (HLV), where annotators provide different valid labels for the same data, is a rich signal often dismissed as noise. This talk demonstrates how Large Language Models (LLMs), catalyzed by explanations, can efficiently model this variation. I will present a three-part research journey: First, we show that LLMs can accurately approximate full human judgment distributions using just a few human-provided explanations. Next, to overcome the cost of human input, we prove that LLM-generated explanations are effective and scalable proxies for human ones. Finally, we introduce a more authentic forward-reasoning paradigm by extracting nuanced explanations directly from an LLM’s Chain-of-Thought (CoT) process. This is paired with a novel, rank-based evaluation framework that better aligns with human decision-making. Together, these studies offer a scalable approach to embrace HLV, paving the way for more pluralistic and trustworthy AI.
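A small sketch of distribution-level evaluation in this spirit, using generic metrics (total variation distance and Spearman rank correlation) on hypothetical label distributions; it is not the speaker's rank-based framework.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical label distributions over three NLI classes
# (entailment, neutral, contradiction).
human = np.array([0.6, 0.3, 0.1])   # e.g. aggregated annotator judgments
model = np.array([0.5, 0.4, 0.1])   # e.g. derived from LLM explanations / CoT

tvd = 0.5 * np.abs(human - model).sum()   # total variation distance
rho, _ = spearmanr(human, model)          # rank agreement over the labels
print(f"TVD = {tvd:.3f}, Spearman rho = {rho:.3f}")
```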
17:10 Sanne Hoeken (Bielefeld Uni): Modeling Human Variation in Hateful Word Interpretation
Abstract: This talk examines where variation comes from in judging whether a word is hateful and how computational models can reflect it. Using the HateWiC dataset of word-in-context annotations, we analyze how both linguistic properties and annotator characteristics shape disagreement in hatefulness judgments. Results show that variation arises mainly from the interaction between who is interpreting and what is being interpreted, rather than from either factor alone. We then test BERT-based models that incorporate annotator information and find that, while they reproduce some surface-level disagreement, they fail to capture the deeper structure of human variation. These findings highlight the need for models that go beyond accuracy to account for genuine interpretive diversity.
17:40 Final discussion, End