The Science of Proportionality in Systemic Risk Evaluations of General-Purpose AI
Presenter: Carlos Mougan
Benchmarking Optimizers for Large Language Model Pretraining
Presenter: Andrei Semenov
Falsifiable, Claim-Centric Evaluation: Pre-Registered Evidence and Anytime Validity for Adaptive AI
Presenter: Ismail Lamaakal
How Benchmark Prediction from Fewer Data Misses the Mark
Presenter: Guanhua Zhang
Train-before-Test Harmonizes Language Model Rankings
Presenter: Guanhua Zhang
SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
Presenter: Shixian Cui
Limits to scalable evaluation at the frontier
Presenter: Vivian Nastl
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
Presenter: Yotam Perlitz
Towards Reproducible Evaluation of Scientific Machine Learning: A Common Task Framework
Presenter: Georg Maierhofer
ROC-n-reroll: How verifier imperfection affects test-time scaling
Presenter: Florian Dorner
The Evaluation Crisis in Medical AI: Evidence from 241 Challenges
Presenter: Annika Reinke
False Promises in Medical Imaging AI? Outperformance Claims Highlighted by Bold-Face Numbers
Presenter: Evangelia Christodoulou
Medical Imaging AI benchmarking reloaded
Presenter: Lucas Luttner
EvalCards: A Framework for Standardized Evaluation Reporting
Presenter: Ruchira Dhar
On the Measure of a Model: From Intelligence to Generality
Presenter: Ruchira Dhar
Reproducibility by Design: A Modular Framework for Benchmarking Evolving Probabilistic AI Systems
Presenter: Philip Müller
Quantifying the Assistance Threshold: Progressive Hints for Long-Horizon LLM Tasks
Presenter: Aissatou Diallo
Beyond Leaderboards: Enabling Scalable, Community-Curated, Domain-First Benchmarks for Real-World AI Use
Presenter: Francesco Carli
Crossing the validation crisis: a study of cross-validation for robust benchmarking
Presenter: Célestin Eve
Benchmarking footprints of optimization algorithms: Explainable insights into algorithm success and failure
Presenter: Ana Nikolikj
Dropping Just a Handful of Preferences Can Change Top LLM Rankings
Presenter: Jenny Y. Huang
Quantifying Uncertainty in Error Consistency: Towards Reliable Behavioral Comparison of Classifiers
Presenter: Thomas Klein
Recovering Benchmark Rankings from Pairwise Preferences
Presenter: Mina Remeli
What Matters in Deep Learning for Time Series Forecasting?
Presenter: Valentina Moretti
DyRuLe: Learning Dynamic Evaluation Rubrics for LLMs
Presenter: Catalin Brita
Rethinking Metrics for Speech Recognition Evaluation
Presenter: Ting-Hui Cheng
Reconciling Divergent Views Through a Critical Analysis of Iterative Self-Improvement in LLMs
Presenter: Ankita Maity
ReasonBENCH: How (Un)Certain Are LLMs When They Reason?
Presenter: Nearchos Potamitis
Self-Supervised Learning for Label-Efficient Benchmarking of Optimizer Performance
Presenter: Sintija Stevanoska
Possibilities in Multi-Criteria Benchmarking: A Social Choice Perspective
Presenter: Polina Gordienko
Evaluating and Understanding Scheming Propensity in LLM Agents
Presenter: Jannes Elstner
pyScInsilico: Streaming, hierarchical simulation of scRNA-seq for benchmarking pre-processing at billion-cell scale
Presenter: Matthias Flotho
A Case Study in Rigorous Benchmarking: Position Domain Generalisation in Myoelectric Control
Presenter: Katarzyna Szymaniak
DECAF: A Dynamically Extensible Corpus Analysis Framework
Presenter: Anna Rogers
Evaluating Agentic AI in Manufacturing: Taxonomy and Future Directions
Presenter: Nastaran Moradzadeh Farid
Measuring Language Model Hallucinations Through Distributional Correctness
Presenter: Tom Burns
Task Alignment Outweighs Framework Choice in Scientific LLM Agents
Presenter: Martiño Rios-Garcia
MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering
Presenter: Yuexing Hao
MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning
Presenter: Tristan Tomilin
Stop evaluating AI with human tests, develop principled, AI-specific tests instead
Presenter: Tom Sühr