Introduction to the workshop 9:00 - 9:05
9:05 - 9:55
40m + 10m questions
Title: Evaluating NGI (Natural General Intelligence)
Abstract: There is good evidence that primates differ in their intelligence: there are differences in brain size; meta-analyses suggest differences in performance across a range of tests; and recently large-scale collaborations have found significant differences between primate species in tests of inhibitory control and short-term memory. However, evaluating what cognitive difference makes one species more capable or skilled than another is difficult because of the ‘task impurity problem’, a problem common to evaluating intelligence of all kinds. We have adopted a psychometric approach to addressing this problem, using a multi-trait multi-method test battery to examine mechanisms of executive function, or cognitive control, in chimpanzees and human children. We find evidence for shared variance between tasks, though in this study we could not determine which structure best explained the results (i.e. whether further latent variables could be identified, and if so which). Nevertheless, I will argue that evaluating the convergent and divergent validity of measures is an important tool for evaluating the make-up of intelligence and comparing its expression.
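To illustrate the convergent/divergent validity logic mentioned in the abstract, the sketch below is a minimal toy screening, not the speaker's analysis: the task names, construct groupings, and simulated scores are purely illustrative assumptions. It compares average correlations between tasks intended to tap the same construct (convergent) with correlations across constructs (divergent).

```python
# Toy multi-trait multi-method screening: compare within-construct vs
# cross-construct task correlations on simulated data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 40

inhibition = rng.normal(size=n_subjects)   # simulated latent "inhibition" ability
memory = rng.normal(size=n_subjects)       # simulated latent "memory" ability
noise = lambda: rng.normal(scale=0.8, size=n_subjects)

# Simulated scores for four tasks; in a real study these would be measured data.
scores = {
    "inhibition_A": inhibition + noise(),
    "inhibition_B": inhibition + noise(),
    "memory_A": memory + noise(),
    "memory_B": memory + noise(),
}
constructs = {
    "inhibition": ["inhibition_A", "inhibition_B"],
    "memory": ["memory_A", "memory_B"],
}

names = list(scores)
corr = np.corrcoef(np.vstack([scores[n] for n in names]))  # task-by-task correlations
idx = {n: i for i, n in enumerate(names)}

def mean_corr(pairs):
    # Average correlation over the given (task, task) pairs.
    return float(np.mean([corr[idx[a], idx[b]] for a, b in pairs]))

# Convergent: pairs of tasks meant to measure the same construct.
within = [(a, b) for tasks in constructs.values()
          for i, a in enumerate(tasks) for b in tasks[i + 1:]]
# Divergent: pairs of tasks drawn from different constructs.
across = [(a, b) for a in constructs["inhibition"] for b in constructs["memory"]]

print("convergent (within-construct) mean r:", mean_corr(within))
print("divergent (cross-construct) mean r:  ", mean_corr(across))
```

On these simulated scores, tasks sharing a latent construct correlate more strongly with each other than with tasks from the other construct; the work described in the talk uses measured task scores and formal latent-variable modelling rather than this toy comparison.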
9:55 - 10:45
Evaluating Understanding on Conceptual Abstraction Benchmarks (paper, presentation)
Victor Vikram Odouard and Melanie Mitchell
On Young Children’s Exploration, Aha! Moments and Explanations in Model Building for Self-Regulated Problem-Solving (paper, presentation)
Vicky Charisi, Natalia Díaz Rodríguez, Barbara Mawhin and Luis Merino
Evaluating Object Permanence in Embodied Agents using the Animal-AI Environment (paper, presentation)
Konstantinos Voudouris, Niall Donnelly, Danaja Rutar, Ryan Burnell, John Burden, Lucy Cheke and José Hernández-Orallo
Behavioral experiments for understanding catastrophic forgetting (paper, presentation)
Samuel Bell and Neil Lawrence
FERM: A FEature-space Representation Measure for Improved Model Evaluation (paper, presentation)
Guyver Fu, Wenbo Ge and Jo Plested
Coffee Break 10:45 - 11:15
11:15 - 11:40
The Relevance of Non-Human Errors in Machine Learning (paper, presentation)
Ricardo Baeza-Yates and Marina Estévez-Almenzar
Robustness Testing of Machine Learning Families using Instance-Level IRT-Difficulty (paper, presentation)
Raül Fabra-Boluda, Cèsar Ferri, Fernando Martínez-Plumed and Maria Jose Ramirez-Quintana
Item Response Theory to Evaluate Speech Synthesis: Beyond Synthetic Speech Difficulty (paper, presentation)
Chaina Oliveira and Ricardo Prudêncio
11:40 - 12:30
Panel:
Tomer D. Ullman
Murray Shanahan
Amanda Seed
Moderated by Lucy Cheke
Lunch 12:30 - 14:00
14:00 - 14:45
remote, 35m + 10m questions
Title: No Escape from Qualitative Evaluation: A Recent View from Benchmarking in NLP
Abstract: The classical evaluation paradigm in natural language processing measures accuracy on a single, community agreed upon benchmark dataset, often as a way to isolate the "best" model. This paradigm has recently been taken to an extreme with several works---such as the GLUE and SuperGLUE benchmarks (Wang et al. 2019, 2020), the BigBench benchmark (Google 2022) and the Dynabench benchmark (Kiela et al. 2021)---that collate datasets into larger "benchmark suites", exploding the number of quantitative measurements taken of each model. First, I will touch on several issues with NLP benchmarking practices, such as difficulties of metric aggregation, metric trade-offs, and compute concerns. Then, I will make a case in favor of collecting more theoretically informed data annotations, and finally argue that quantitative metrics alone are unlikely to be sufficient to get a complete picture of model performance.
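To illustrate one of the metric-aggregation difficulties the abstract raises, the sketch below uses made-up models, tasks, and accuracies (unrelated to GLUE, SuperGLUE, BigBench or Dynabench) to show how the "best" model on a small suite can change depending on whether per-task accuracies are averaged directly or converted to ranks first.

```python
# Aggregation sensitivity on a toy benchmark suite: mean accuracy vs mean rank.
# All numbers and names are illustrative assumptions.
import numpy as np

accuracy = {                                  # per-task accuracy of each model
    "model_A": np.array([0.95, 0.60, 0.60]),  # very strong on one task only
    "model_B": np.array([0.70, 0.72, 0.71]),  # consistently decent everywhere
}

# Aggregation 1: unweighted mean accuracy over tasks.
mean_scores = {m: float(a.mean()) for m, a in accuracy.items()}

# Aggregation 2: mean rank per task (rank 1 = best on that task).
models = list(accuracy)
per_task = np.vstack([accuracy[m] for m in models])
ranks = (-per_task).argsort(axis=0).argsort(axis=0) + 1
mean_ranks = {m: float(ranks[i].mean()) for i, m in enumerate(models)}

print("mean accuracy:", mean_scores)  # model_A comes out ahead (0.717 vs 0.710)
print("mean rank:   ", mean_ranks)    # model_B comes out ahead (best on 2 of 3 tasks)
```

On these made-up numbers the two aggregation rules disagree about which model is "best", which is exactly the kind of sensitivity that makes a single headline score on a benchmark suite hard to interpret.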
14:45 - 15:30
Panel:
Matthias Samwald
Lama Ahmad
Jo Plested
Moderated by Jose Hernandez-Orallo
Break 15:30 - 16:00
16:00 - 16:35
Evaluating Sports Analytics Models: Challenges, Approaches, and Lessons Learned (paper, presentation)
Jesse Davis, Lotte Bransen, Laurens Devos, Wannes Meert, Pieter Robberechts, Jan Van Haaren and Maaike Van Roy
A Framework for Categorising AI Evaluation Instruments (paper, presentation)
Anthony G Cohn, José Hernández-Orallo, Julius Sechang Mboli, Yael Moros-Daval, Zhiliang Xiang and Lexin Zhou
Reject Before You Run: Small Assessors Anticipate Big Language Models (paper, presentation)
Lexin Zhou, Fernando Martínez-Plumed, José Hernández-Orallo, Cèsar Ferri and Wout Schellaert
Red Teaming Language Models with Language Models (paper, presentation)
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese and Geoffrey Irving
16:35 - 17:20
Invited Talk: Stuart Elliott (OECD)
Title: "Assessing AI Capabilities for Policymakers"
Abstract: AI can be evaluated for a variety of purposes, and the purpose affects the approach used. This session focuses on AI evaluation for a group outside the field: policymakers and the closely related audience of the general public. Policymakers have needs that are distinctly different from those of computer scientists or supporters of AI research and development. Although AI evaluations for policymakers will not be designed for the computer science community, they need to be designed in large part by the computer science community. This session discusses the OECD's effort to assess AI capabilities for policymakers and explores the implications for the approaches that could meet the goals of this kind of evaluation.
Panel:
Virginia Dignum (Umeå universitet)
Tony Cohn (Leeds University)
Songül Tolan (JRC - European Commission)
Moderated by Stuart Elliott
17:20 - 17:30
Concluding Remarks
Going for drinks/food together 18:00 - 21:00