Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
ICML 2026
Seoul, South Korea
July 10 or 11, 2026
About The Workshop
Benchmarks such as HELM [1] and Big-Bench [2] have significantly advanced quantitative model evaluation. However, current practice remains largely empirical: it measures performance but offers no guarantees about which capabilities a model has, when those capabilities will or will not manifest, and why. Debates over emergent abilities at scale illustrate this lack of predictive understanding [3,4]. In parallel, a substantial body of theoretical work addresses scaling laws for pre-training [5,6], generalization [7-9], and benchmark predictability [10]. Yet these theoretical advances are often disconnected from real-world benchmarking, and as a result theoretical insights rarely inform benchmark design. This structural disconnect limits our ability to make reliable claims about model behavior, constrains trustworthy deployment, and slows the development of foundation models.
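To make the kind of theory-to-benchmark coupling we have in mind concrete, the sketch below fits a saturating power law, one common functional form in the scaling-law literature [5,6], to hypothetical (model size, validation loss) pairs and extrapolates it to an unseen scale. The data points, parameter values, and choice of functional form are all assumptions made for illustration, not results for any particular model family.

```python
# A minimal sketch (illustrative assumptions throughout): fit a saturating
# power law L(N) = L_inf + a * N**(-b) to hypothetical (model size, loss)
# pairs and extrapolate to an unseen scale. Comparing such an ex ante
# prediction against a later measurement is one way a benchmark can
# falsify a theoretical claim.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, l_inf, a, b):
    """Saturating power law commonly used in scaling-law analyses."""
    return l_inf + a * n ** (-b)

# Hypothetical observations: parameter counts and validation losses.
sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
losses = np.array([3.10, 2.85, 2.62, 2.44, 2.30])

# Fit the three free parameters (L_inf, a, b) from the small sample.
params, _ = curve_fit(power_law, sizes, losses, p0=[1.5, 20.0, 0.15], maxfev=10000)
l_inf, a, b = params
print(f"fitted: L_inf={l_inf:.2f}, a={a:.1f}, b={b:.3f}")

# Extrapolate the fitted law to a larger, unobserved scale.
print(f"predicted loss at 1e11 parameters: {power_law(1e11, *params):.2f}")
```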
Scientific Objectives. This workshop will focus on advancing a predictive science of foundation model performance. Whereas current benchmarks measure capability retrospectively, a predictive science requires ex ante guarantees: statements, made in advance, about whether and when performance will extrapolate under scaling, distribution shift, or task composition. We structure the workshop around three key research challenges:
Quantification of capabilities across levels: How can we move from raw scores to formal, quantitative guarantees on performance? We aim to understand performance at multiple levels: individual instances, structured task families, and capability classes shaped by inductive biases such as model architecture, training regime, and scale. (A minimal numerical sketch of what such a guarantee could look like follows this list.)
Foundations of generalization and composition: Which mathematical frameworks can explain when and why models generalize? How can benchmarks be designed so that we can test hypotheses derived from theory? We are particularly interested in compositional generalization: predicting how capabilities interact, transfer, and combine across tasks.
Reliable and structured empirical evaluation: How should benchmarks be constructed to evaluate reasoning, robustness under distribution shift, and calibrated uncertainty? We seek principled evaluation protocols that enable falsifiable theoretical predictions. (A short calibration sketch also appears below.)
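As a concrete instance of the first challenge, the sketch below shows one possible shape for a guarantee attached to a benchmark score: an observed accuracy over a set of items is converted into a distribution-free lower confidence bound via Hoeffding's inequality. The counts are placeholders, and the i.i.d. sampling assumption on benchmark items is itself one of the modeling choices the workshop aims to scrutinize.

```python
# A minimal sketch (placeholder numbers): turn an observed benchmark accuracy
# into a lower confidence bound on true accuracy via Hoeffding's inequality.
# Valid only under the assumption that the items are i.i.d. draws from the
# task distribution of interest.
import math

def hoeffding_lower_bound(correct: int, total: int, delta: float = 0.05) -> float:
    """Lower bound on true accuracy that holds with probability >= 1 - delta."""
    observed = correct / total
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * total))
    return max(0.0, observed - slack)

# Example: 830 correct out of 1000 i.i.d. items yields a 95%-confidence
# lower bound of roughly 0.79 on the true accuracy.
print(hoeffding_lower_bound(correct=830, total=1000))
```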
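For the third challenge, calibrated uncertainty is one property that already admits a simple, falsifiable measurement. The sketch below computes an expected calibration error (ECE) over equal-width confidence bins; the binning scheme and the synthetic predictions are illustrative assumptions rather than a recommended protocol.

```python
# A minimal sketch (synthetic data): expected calibration error (ECE) with
# equal-width confidence bins. Real protocols would also report bin counts
# and uncertainty on the estimate itself.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()  # fraction of predictions falling in this bin
        gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
        ece += weight * gap
    return ece

# Synthetic predictions from a slightly overconfident model.
conf = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60]
hit = [1, 1, 1, 0, 1, 0, 1, 0]
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```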
These objectives aim to establish a feedback loop between theory and benchmarking, in which theory guides evaluation design and empirical findings refine theoretical models. The workshop will produce the following outcomes: (i) a white paper on combining theory and benchmarks, including a theory-informed benchmark-design checklist, (ii) non-archival workshop proceedings compiling selected contributions, and (iii) a curated set of challenge problems.
Potential Impact. This workshop will convene researchers from mathematics, statistics, machine learning, and industry to catalyze a new research agenda that tightly couples theory and empirical evaluation of foundation models. By advancing frameworks that make performance predictable and quantifiable, we aim to influence: (i) how benchmarks are designed, (ii) how models are stress-tested, and (iii) how reliability claims are substantiated. These developments have direct implications for large-scale deployment, evaluation pipelines, and red-teaming practices in industry. More broadly, the workshop will help accelerate the emergence of a principled science of foundation models grounded in predictive theory, structured evaluation, and rigorous performance guarantees.
For more information on topics, see our Call for Papers page.
ICML 2026 Website: https://icml.cc/