The Science of Proportionality in Systemic Risk Evaluations of General-Purpose AI
Presenter: Carlos Mougan
Benchmarking Optimizers for Large Language Model Pretraining
Presenter: Andrei Semenov
Falsifiable, Claim-Centric Evaluation: Pre-Registered Evidence and Anytime Validity for Adaptive AI
Presenter: Ismail Lamaakal
How Benchmark Prediction from Fewer Data Misses the Mark
Presenter: Guanhua Zhang
Train-before-Test Harmonizes Language Model Rankings
Presenter: Guanhua Zhang
SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
Presenter: Shixian Cui
Limits to scalable evaluation at the frontier
Presenter: Vivian Nastl
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
Presenter: Yotam Perlitz
Towards Reproducible Evaluation of Scientific Machine Learning: A Common Task Framework
Presenter: Georg Maierhofer
ROC-n-reroll: How verifier imperfection affects test-time scaling
Presenter: Florian Dorner
The Evaluation Crisis in Medical AI: Evidence from 241 Challenges
Presenter: Annika Reinke
False Promises in Medical Imaging AI? Outperformance Claims Highlighted by Bold-Face Numbers
Presenter: Evangelia Christodoulou
Medical Imaging AI benchmarking reloaded
Presenter: Lucas Luttner
EvalCards: A Framework for Standardized Evaluation Reporting
Presenter: Ruchira Dhar
On the Measure of a Model: From Intelligence to Generality
Presenter: Ruchira Dhar
Reproducibility by Design: A Modular Framework for Benchmarking Evolving Probabilistic AI Systems
Presenter: Philip Müller
Quantifying the Assistance Threshold: Progressive Hints for Long-Horizon LLM Tasks
Presenter: Aissatou Diallo
Beyond Leaderboards: Enabling Scalable, Community-Curated, Domain-First Benchmarks for Real-World AI Use
Presenter: Francesco Carli
Crossing the validation crisis: a study of cross-validation for robust benchmarking
Presenter: Célestin Eve
Benchmarking footprints of optimization algorithms: Explainable insights into algorithm success and failure
Presenter: Ana Nikolikj
Dropping Just a Handful of Preferences Can Change Top LLM Rankings
Presenter: Jenny Y. Huang
Quantifying Uncertainty in Error Consistency: Towards Reliable Behavioral Comparison of Classifiers
Presenter: Thomas Klein
Recovering Benchmark Rankings from Pairwise Preferences
Presenter: Mina Remeli
What Matters in Deep Learning for Time Series Forecasting?
Presenter: Valentina Moretti
DyRuLe: Learning Dynamic Evaluation Rubrics for LLMs
Presenter: Catalin Brita
Rethinking Metrics for Speech Recognition Evaluation
Presenter: Ting-Hui Cheng
Reconciling Divergent Views Through a Critical Analysis of Iterative Self-Improvement in LLMs
Presenter: Ankita Maity
ReasonBENCH: How (Un)Certain Are LLMs When They Reason?
Presenter: Nearchos Potamitis
Self-Supervised Learning for Label-Efficient Benchmarking of Optimizer Performance
Presenter: Sintija Stevanoska
Possibilities in Multi-Criteria Benchmarking: A Social Choice Perspective
Presenter: Polina Gordienko
Evaluating and Understanding Scheming Propensity in LLM Agents
Presenter: Jannes Elstner
pyScInsilico: Streaming, hierarchical simulation of scRNA-seq for benchmarking pre-processing at billion-cell scale
Presenter: Matthias Flotho
A Case Study in Rigorous Benchmarking: Position Domain Generalisation in Myoelectric Control
Presenter: Katarzyna Szymaniak
DECAF: A Dynamically Extensible Corpus Analysis Framework
Presenter: Anna Rogers
Evaluating Agentic AI in Manufacturing: Taxonomy and Future Directions
Presenter: Nastaran Moradzadeh Farid
Measuring Language Model Hallucinations Through Distributional Correctness
Presenter: Tom Burns
Task Alignment Outweighs Framework Choice in Scientific LLM Agents
Presenter: Martiño Rios-Garcia
MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering
Presenter: Yuexing Hao
MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning
Presenter: Tristan Tomilin
Stop evaluating AI with human tests, develop principled, AI-specific tests instead
Presenter: Tom Sühr