EurIPS 2025 Workshop
Copenhagen, DK / December 2025 (exact date and location TBD)
Benchmarking has played a central role in the progress of machine learning research since the 1980s. Although the benchmarking ecosystem is vast and growing, researchers still know little about how and why benchmarks promote scientific progress. At the same time, rapidly advancing model capabilities make it increasingly difficult to evaluate exactly what models can do and how they fail. As a result, the community faces an evaluation crisis that threatens progress, undermines the validity of scientific claims, and distorts public discourse about AI.
This workshop contributes to the scientific foundations of benchmarking and evaluation by connecting leading experts from different areas of AI to share their experience and insights with the EurIPS community. The focus is on identifying key challenges, risks, and emerging methodologies that can advance the science of benchmarking and evaluation.
This workshop invites researchers to interrogate the science of evaluation and benchmarking and to chart its future. We aim to bring together theorists, empiricists, practitioners, and surveyors of the field to explore core challenges and emerging directions, including questions such as:
Foundations of validity: How should we formalize when and why evaluations capture the constructs we intend, and when apparent progress is misleading?
Benchmark design under adaptivity: What mechanisms or protocols can preserve reliability when test sets are reused or when models influence future data?
Aggregation and dynamics: How can insights from statistics, social choice, and game theory guide the construction of multi-task, dynamic, or community-driven benchmarks?
Evaluation at the frontier: What methods—human, statistical, or model-based—can credibly assess models that rival or exceed human judgment?
Beyond leaderboards: What complementary forms of evidence—such as causal analyses, surveys, uplift studies, error characterizations, or validity frameworks—can support trustworthy conclusions?
Through these discussions, this workshop seeks to advance the science of evaluation and benchmarking, ensuring they provide reliable guidance in an era of increasingly large, adaptive, and interactive AI systems.
To participate in the workshop, please submit a text-only extended abstract of no more than 500 words. The abstract should clearly describe the research question, methodology, and main findings. Note that this is a non-archival venue. We welcome papers that have been published or accepted to an archival venue in 2025.
Please submit your abstract using this Google form.
Authors of accepted abstracts will be invited to present a poster at the workshop.
All submissions will undergo a single-blind review by the workshop’s program committee.
Abstracts will be evaluated primarily on:
Relevance to the workshop theme – how clearly the work engages with the challenges of benchmarking and evaluation in modern machine learning.
Clarity and coherence – whether the research question, methodology, and key findings are clearly presented.
Novelty and insight – the extent to which the work offers original ideas, perspectives, or findings that advance understanding.
Potential to stimulate discussion – the degree to which the work can foster insightful debate and contribute to the goals of the workshop.
The review process is designed to identify submissions that best align with the workshop’s objectives, rather than to provide a full technical assessment as in a standard conference track.
All accepted papers will be presented during the poster sessions. Detailed information will be announced later by email.
Here are the key deadlines for the workshop. Please note that all deadlines are Anywhere on Earth (AoE).
Submission Open: 22 September 2025
Submission Deadline: 10 October 2025
Reviewer Deadline: 25 October 2025
Notification: 31 October 2025
Workshop Date: TBD
For additional information, please contact olawale[at]mit[dot]edu or yatong.chen[at]tuebingen[dot]mpg[dot]de.
08:50 - Opening remarks
09:00 - Keynote 1
09:45 - Keynote 2
10:30 - Break and Poster Session
11:30 - Keynote 3
12:15 - Lunch break
13:30 - Keynote 4
14:15 - Keynote 5
15:00 - Break
15:15 - Panel
16:00 - Poster Session
Professor of Computer Science, University of Copenhagen
Professor, Universitat Politècnica de València, Spain
Senior Research Fellow, Leverhulme Centre for the Future of Intelligence, University of Cambridge
Research Director, Inria
Staff Research Scientist, DeepMind
Professor and Turing Fellow, University College London
...