Dealing with Meaning Variation in NLP - 2nd Yearly Workshop
Utrecht University, Janskerhof 3 Building, 29th October 2024
This will be a hybrid event - participation will be possible both in person and online - but the size of the room is limited, so please RSVP to Massimo Poesio if you would like to attend in person. Please also let us know if you want to attend online, so that we can add you to the list of invitees for the Webex meeting.
In-person location: Utrecht University, Janskerhof 3 Building, Room 0.21
All times in Central European Time.
10:40 Raquel Fernandez (Uni Amsterdam): Language models and human language processing: from variability to prediction
Abstract: Language models have been shown to simulate human language surprisingly well, generating fluent, grammatical text and encoding meaning representations that resemble human semantic knowledge. Can known properties of human language use provide valuable insights for language models? And conversely, can language modelling techniques contribute to our understanding of human language processing? In this talk, I will start by arguing that to be considered good statistical models of language production, language models should entertain levels of uncertainty calibrated to the degree of variability observed in humans, and show that to a large extent they do, albeit with some caveats. Building on this result, I will then propose a novel measure to quantify the predictability of an utterance using neural text generators and show that it correlates with reading times and acceptability judgements remarkably well, complementing classic measures of surprisal.
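The abstract does not spell out the proposed predictability measure, but it is presented as complementing classic surprisal. As a point of reference only, the minimal sketch below shows how token-level surprisal is typically computed with an off-the-shelf causal language model; the model choice (gpt2) and the helper name are illustrative assumptions and this is not the speaker's method.

```python
# Illustrative only: classic average surprisal of an utterance under a causal LM.
# This is NOT the novel predictability measure described in the talk.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # model choice is an assumption
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal(utterance: str) -> float:
    """Mean surprisal (negative log-probability, in nats) per token."""
    ids = tokenizer(utterance, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over predicted tokens, i.e. the average surprisal.
        loss = model(ids, labels=ids).loss
    return loss.item()

print(surprisal("The cat sat on the mat."))
print(surprisal("The cat sat on the galaxy."))  # expected to be more surprising
```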
11:20 Coffee break
11:40 Sandro Pezzelle (Uni Amsterdam): Implicit and underspecified language as a communicative testbed for large language models
Abstract: The language we use in everyday communicative contexts exhibits a variety of phenomena (such as ambiguity, missing information, or semantic features expressed only indirectly) that often make it implicit or underspecified. Despite this, people are good at understanding and interpreting it. This is possible because we can exploit additional information from the linguistic or extralinguistic context and shared or prior knowledge. Given the ubiquity of these phenomena, NLP models must handle them appropriately to communicate effectively with users and to avoid biased behavior that can be potentially harmful. In this talk, I will present recent work from my group investigating how state-of-the-art transformer large language models (LLMs) handle these phenomena. In particular, I will focus on the understanding of sentences with atypical animacy (“a peanut fell in love”) and on the interpretation of sentences that are ambiguous (“Bob looked at Sam holding a yellow bag”) or where some information is missing or implicit (“don't spend too much”). I will show that, in some cases, LLMs behave surprisingly similarly to speakers; in other cases, they fail quite spectacularly. I will argue that having access to multimodal information (e.g., from language and vision) should, in principle, give these models an advantage.
Abstract: TBA
Abstract: The presentation reflects on disagreement in annotations in discourse parsing datasets within the RST framework. The talk identifies the downsides of existing formalisms that aim to integrate disagreeing annotations. We incorporate cross-lingual evidence from two closely related languages other than English: Dutch and German. Based on these data, we aim to better understand how various types of disagreement in discourse parsing can be disentangled, and we make a proposal for collecting data to address this question.
13:00 Lunch
Abstract: Moving away from the primary paradigm in NLP according to which there is a single correct label and any disagreements should be resolved by discussion (or a majority label is chosen based on very small numbers of annotations), we have collected several discourse-relation-annotated datasets in recent years, which contain a distribution of labels (DiscoGem 1.0 and DiscoGem 2.0). Our experiments show that distributions are replicable, and that they can reveal systematic patterns of disagreement between annotators and co-occurrence of discourse interpretations.
In my talk, I will focus on the role of individual factors that contribute to variation in interpretation. In a series of experiments, we have found that annotation biases are stable within an individual and can be linked to specific cognitive factors and/or background knowledge.
This raises the question of whether our NLP models should move from a single model that captures some “average” to models that mimic specific individuals or groups of individuals with specific properties. Using the example of a model that predicts the reading times of two groups that differ in domain expertise, we show that models that match humans in terms of expertise are better at predicting their reading times than generic models.
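As background for the distributional view of annotation described above, the sketch below shows one common way to turn per-annotator labels into probability distributions and to compare two annotator groups. The toy label set, the data, and the use of Jensen-Shannon distance are illustrative assumptions and do not reflect the actual DiscoGem format or analysis.

```python
# Illustrative sketch: discourse-relation annotations as label distributions
# rather than a single majority label, compared across two annotator groups.
# Label set and data are invented; not the DiscoGem methodology.
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

LABELS = ["cause", "concession", "conjunction", "contrast"]  # toy label set

def label_distribution(annotations):
    """Turn a list of per-annotator labels into a probability distribution."""
    counts = Counter(annotations)
    return np.array([counts[l] for l in LABELS], dtype=float) / len(annotations)

group_a = label_distribution(["cause", "cause", "conjunction", "cause"])
group_b = label_distribution(["concession", "cause", "concession", "contrast"])

# Jensen-Shannon distance (0 = identical distributions, 1 = maximally different)
print(jensenshannon(group_a, group_b, base=2))
```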
Abstract: As different application verticals race to include AI in their data, analysis, control, and interaction pipelines, the need for accurate confidence estimation is also increasing. AI as an empirical research discipline has historically had very weak statistical analysis of empirical results, relying far more on replication across different datasets to provide a "seat of the pants" feeling of confidence than on the more rigorous (though still quite flawed) statistical analysis we find in medicine, psychology, and other forms of human testing. We have been exploring the role of human evaluation variance [1,2] -- largely due to forms of meaning variation stemming from the ambiguity and subjectivity of natural language communication -- in confidence estimation in empirical studies of AI performance against a human standard. I will present recent work on using human response variance to perform more rigorous power analysis of evaluation results from human annotation.
[2] How Many Raters Do You Need? Power Analysis for Foundation Models
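As general background to the "how many raters do you need" question, one simple approach is a simulation-based power analysis. The sketch below is a generic illustration under assumed numbers (preference rate, item count, significance test, and the function name power) and is not the procedure of the cited paper.

```python
# Generic simulation-based power analysis for "how many raters per item?".
# All numbers and the testing procedure are assumptions for illustration.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

def power(n_items=100, n_raters=3, true_pref=0.55, n_sims=1000, alpha=0.05):
    """Estimated probability of detecting that system A is preferred over B."""
    hits = 0
    for _ in range(n_sims):
        # Each rater independently prefers A with probability true_pref.
        votes = rng.random((n_items, n_raters)) < true_pref
        per_item = votes.mean(axis=1)  # per-item preference rate for A
        _, p = ttest_1samp(per_item, 0.5, alternative="greater")
        hits += p < alpha
    return hits / n_sims

for k in (1, 3, 5, 10):
    print(f"{k:>2} raters per item -> estimated power {power(n_raters=k):.2f}")
```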
15:20 Lora Aroyo (Google DeepMind): The Many Faces of Responsible AI (online)
Abstract: Conventional machine learning paradigms often rely on binary distinctions between positive and negative examples, disregarding the nuanced subjectivity that permeates real-world tasks and content. This simplistic dichotomy has served us well so far, but because it obscures the inherent diversity in human perspectives and opinions, as well as the inherent ambiguity of content and tasks, it limits how well model performance can align with real-world expectations. This becomes even more critical when we study the impact and potential multifaceted risks associated with the adoption of emerging generative AI capabilities across different cultures and geographies. To address this, we argue that to achieve robust and responsible AI systems we need to shift our focus away from a single point of truth and weave a diversity of perspectives into the data used by AI systems to ensure the trust, safety and reliability of model outputs.
In this talk, I present a number of data-centric use cases that illustrate the inherent ambiguity of content and the natural diversity of human perspectives, which cause unavoidable disagreement that needs to be treated as signal and not noise. This leads to a call to action to establish culturally aware and society-centered research on the impacts of data quality and data diversity for training and evaluating ML models and for fostering responsible AI deployment in diverse sociocultural contexts.
16:00 Tea Break
Abstract: TBA
Abstract: With the increasing power and prevalence of AI systems, it is ever more critical that they are designed to serve everyone, i.e., people with diverse values and perspectives.
To improve AI systems to better reflect value pluralism, the first-order challenge is to explore the extent to which AI systems can model pluralistic human values, rights, and duties. We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations. We conduct a large-scale study with annotators across diverse social and demographic backgrounds to understand whose values are represented. With ValuePrism, we build Kaleido, an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence (i.e., support or oppose) of human values, rights, and duties within a specific context. Humans prefer the sets of values output by our system over the teacher GPT-4, finding them more accurate and with broader coverage. In addition, we demonstrate that Kaleido can help explain variability in human decision-making by outputting contrasting values.
Building on this, we propose a roadmap to pluralistic alignment. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can be steered to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also propose and formalize three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks; 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs; and 3) Jury-pluralistic benchmarks, which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
Abstract: AI systems like ChatGPT are now being used by millions of people, who have a diversity of values and preferences. In my talk, I will explore practical and effective methods for addressing this diversity in the process of AI alignment, with a particular focus on dataset construction. First, I will introduce two contrasting paradigms for managing subjectivity in annotations for subjective NLP tasks. Second, I will transfer these paradigms to the context of human feedback data, to motivate a framework for operationalising the alignment of AI systems. Finally, I will provide a concrete example of how we addressed subjectivity in human feedback during the construction of PRISM, a large-scale dataset that reflects the vast diversity of human preferences—and the value this diversity offers in building AI systems that benefit everyone.
18:00 Final discussion, End