EurIPS 2025 Workshop
Copenhagen, DK
December 6th, 2025
Benchmarking has played a central role in the progress of machine learning research since the 1980s. Although the benchmarking ecosystem is vast and growing, researchers still know little about how and why benchmarks promote scientific progress. At the same time, rapidly advancing model capabilities make it increasingly difficult to evaluate exactly what models can do and how they fail. As a result, the community increasingly faces an evaluation crisis that threatens progress, undermines the validity of scientific claims, and distorts public discourse about AI.
This workshop contributes to the scientific foundations of benchmarking and evaluation by connecting leading experts from different areas of AI to share their experience and insights with the EurIPS community. The focus is on identifying key challenges, risks, and emerging methodologies that can advance the science of benchmarking and evaluation.
This workshop invites researchers to interrogate the science of evaluation and benchmarking and to chart its future. We aim to bring together theorists, empiricists, practitioners, and surveyors of the field to explore core challenges and emerging directions, including questions such as:
Foundations of validity: How should we formalize when and why evaluations capture the constructs we intend, and when apparent progress is misleading?
Benchmark design under adaptivity: What mechanisms or protocols can preserve reliability when test sets are reused or when models influence future data?
Aggregation and dynamics: How can insights from statistics, social choice, and game theory guide the construction of multi-task, dynamic, or community-driven benchmarks?
Evaluation at the frontier: What methods—human, statistical, or model-based—can credibly assess models that rival or exceed human judgment?
Beyond leaderboards: What complementary forms of evidence—such as causal analyses, surveys, uplift studies, error characterizations, or validity frameworks—can support trustworthy conclusions?
Through these discussions, this workshop seeks to advance the science of evaluation and benchmarking, ensuring they provide reliable guidance in an era of increasingly large, adaptive, and interactive AI systems.
To participate in the workshop, please submit a text-only extended abstract of no more than 500 words. The abstract should clearly describe the research question, methodology, and main findings. Note that this is a non-archival venue. We welcome papers that have been published or accepted to an archival venue in 2025.
Please submit your abstract using this Google form.
Authors of accepted abstracts will be invited to present a poster at the workshop.
All submissions will undergo a single-blind review by the workshop’s program committee.
Abstracts will be evaluated primarily on:
Relevance to the workshop theme – how clearly the work engages with the challenges of benchmarking and evaluation in modern machine learning.
Clarity and coherence – whether the research question, methodology, and key findings are clearly presented.
Novelty and insight – the extent to which the work offers original ideas, perspectives, or findings that advance understanding.
Potential to stimulate discussion – the degree to which the work can foster insightful debate and contribute to the goals of the workshop.
The review process is designed to identify submissions that best align with the workshop’s objectives, rather than to provide a full technical assessment as in a standard conference track.
All accepted papers will be presented during the poster sessions. Detailed information will be announced later by email.
Here are the key deadlines for the workshop. Please note that all deadlines are Anywhere on Earth (AoE).
Submission Open: 22 September 2025
Submission Deadline: 10 October 2025
Reviewer Deadline: 25 October 2025
Notification: 31 October 2025
Workshop Date: 6 December 2025
For additional information, please contact olawale[at]mit[dot]edu, yatong.chen[at]tuebingen[dot]mpg[dot]de
08:50 - Opening Remarks: Moritz Hardt
09:00 - Keynote 1: Laura Weidinger
09:45 - Keynote 2: Isabelle Augenstein
10:30 - Break and Poster Session
11:30 - Keynote 3: José Hernández-Orallo
12:15 - Lunch break
13:30 - Keynote 4: Emine Yilmaz
14:15 - Keynote 5: Gaël Varoquaux
15:00 - Break
15:15 - Panel Discussion
16:00 - Poster Session
Laura Weidinger – Staff Research Scientist, DeepMind
Isabelle Augenstein – Professor of Computer Science, University of Copenhagen
José Hernández-Orallo – Professor, Universitat Politècnica de València, Spain; Senior Research Fellow, Leverhulme Centre for the Future of Intelligence, University of Cambridge
Emine Yilmaz – Professor and Turing Fellow, University College London
Gaël Varoquaux – Research Director, Inria
Title: Sociotechnical Approach to AI Evaluation
Abstract: As AI systems increasingly permeate our lives, institutions, and societies, measuring their capabilities and failures has become ever more important. But current evaluation methods aren't up to the challenge: benchmarking, red teaming, and experimentation methods are limited in what they can predict about AI outcomes in the real world. In this talk, I take a step back and consider the goals of AI evaluation. On this basis, I propose a sociotechnical approach that better captures the need to understand AI systems across different contexts. By situating capability-based approaches within an expanded picture of AI evaluation, we can come to better understand AI systems and build a science of evaluation that can stand the test of time.
Bio: Laura Weidinger is a Staff Research Scientist at Google DeepMind, where she leads research on novel approaches to ethics and safety evaluation. Laura’s work focuses on detecting, measuring, and mitigating risks from generative AI systems. Previously, Laura worked in cognitive science research and as policy advisor at UK and EU levels. She holds degrees from Humboldt Universität Berlin and University of Cambridge.
Title: Understanding the Interplay between LLMs' Utilisation of Parametric and Contextual Knowledge
Abstract: Language Models (LMs) acquire parametric knowledge during training, embedding it within their weights. The increasing scale of LMs, however, poses significant challenges for understanding a model's inner workings, and for updating or correcting this embedded knowledge without the significant cost of retraining. Moreover, when used for knowledge-intensive language understanding tasks, LMs have to integrate relevant context to mitigate inherent weaknesses such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context when it conflicts with the memory learned during pre-training. Conflicting knowledge can also already be present within the LM's parameters, termed intra-memory conflict. This underscores the importance of understanding the interplay between a language model's use of its parametric knowledge and of retrieved contextual knowledge. In this talk, I aim to shed light on this issue by presenting our research on evaluating the knowledge present in LMs, on diagnostic tests that can reveal knowledge conflicts, and on the characteristics of successfully used contextual knowledge.
Bio: Isabelle Augenstein is a Professor at the University of Copenhagen, Department of Computer Science, where she heads the Natural Language Processing section. Her main research interests are fair and accountable NLP, including challenges such as explainability, factuality and bias detection. Prior to starting a faculty position, she was a postdoctoral researcher at University College London, and before that a PhD student at the University of Sheffield. In October 2022, Isabelle Augenstein became Denmark’s youngest ever female full professor. She currently holds a prestigious ERC Starting Grant on ‘Explainable and Robust Automatic Fact Checking’, and her research has been recognised by a Karen Spärck Jones Award, as well as a Hartmann Diploma Prize. She is a member of the Royal Danish Academy of Sciences and Letters, and co-leads the Danish Pioneer Centre for AI.
Title: General Scales for AI Evaluation
Abstract: Much is being said about the need for a Science of Evaluation in AI, yet the answer may simply be found in what any science should provide: explanatory power to understand what AI systems are capable of, and predictive power to anticipate where they will be correct and safe. For increasingly more general and capable AI, this power should not be limited to aggregated tasks, benchmarks or distributions, but should happen for each task *instance*. However, identifying the demands of each individual instance has been elusive, with limited predictability so far. I will present a new paradigm in AI evaluation based on general scales that are exclusively derived from task demands, and can be applied through both automatable and human-interpretable rubrics. These scales can explain what common AI benchmarks truly measure, extract ability profiles quantifying the limits of what AI systems can do, and predict the performance for new task instances robustly. This brings key insights on the construct validity (sensitivity and specificity) of different benchmarks, and the way distinct abilities (e.g., knowledge, metacognition and reasoning) are affected by model size, chain-of-thought integration and dense distillation. Since these general scales do not saturate as average performance does, and do not depend on human or model populations, we can explore how they can be extended for high levels of cognitive abilities of AI and (enhanced) humans.
Bio: José Hernández-Orallo is Director of Research at the Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK, and Professor (on partial leave) at TU Valencia, Spain. His academic and research activities have spanned several areas of artificial intelligence, machine learning, data science and intelligence measurement, with a focus on a more insightful analysis of the capabilities, generality, progress, impact and risks of artificial intelligence. He has published five books and more than two hundred journal articles and conference papers on these topics. His research in the area of machine intelligence evaluation has been covered by several popular outlets, such as The Economist, WSJ, FT, New Scientist or Nature. He keeps exploring a more integrated view of the evaluation of natural and artificial intelligence, as vindicated in his book "The Measure of All Minds" (Cambridge University Press, 2017, PROSE Award 2018). He is a founder of aievaluation.substack.com and ai-evaluation.org. He is a member of AAAI, CAIRNE and ELLIS, and a EurAI Fellow.
Title: Using Large Language Models for Evaluation: Opportunities and Limitations
Abstract: Large Language Models (LLMs) have shown significant promise as tools for automated evaluation across diverse domains. While using LLMs for evaluation comes with significant advantages, potentially alleviating the reliance on costly and subjective human assessments, the adoption of LLM-based evaluation is not without challenges. In this talk, we discuss the transformative potential and the inherent constraints of using LLMs for evaluation tasks. In particular, we describe some of the challenges that come with LLM-based evaluation, such as biases and variability in judgments. We further discuss how LLMs can augment traditional evaluation practices while acknowledging the need for cautious and informed integration.
Bio: Emine Yilmaz is a Professor and an EPSRC Fellow at University College London, Department of Computer Science. She is also a faculty fellow at the Alan Turing Institute and an ELLIS fellow. At UCL, she is one of the faculty members affiliated with the UCL Centre for Artificial Intelligence, where she leads the Web Intelligence Group. She also works as an Amazon Scholar with the Alexa Shopping team. She is a co-founder of Humanloop, a UCL spinout company. Dr. Yilmaz's research interests lie in the fields of information retrieval and natural language processing. Her research in these areas is mainly guided by principles from machine learning, statistics and information theory.
Title: The slow progress of AI on problems with small datasets
Abstract: Benchmarking and empirical evaluation have been central to the modern progress of AI, tackling domains such as vision, language, voice... Methods have progressed through extensive trial and error. But other domains, such as medical imaging or tabular learning, paint another picture, where progress is slow. I will detail the evidence of slow progress, possible reasons for it, and ingredients of success, as recently seen in tabular learning.
Bio: Gaël Varoquaux is a research director working on data science at Inria (the French national research institute for computer science), where he leads the Soda team. He is also co-founder and scientific advisor of Probabl. Varoquaux's research covers fundamentals of artificial intelligence, statistical learning, natural language processing, and causal inference, as well as applications to health, with a current focus on public health and epidemiology. He also creates technology: he co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has worked at UC Berkeley, McGill, and the University of Florence. He did a PhD in quantum physics supervised by Alain Aspect and is a graduate of École Normale Supérieure, Paris.