Introduction to the workshop 9:00 - 9:05
9:05 - 9:55
40m + 10m questions
Title: Evaluating NGI (Natural General Intelligence)
Abstract: There is good evidence that primates differ in their intelligence: there are differences in brain size; meta-analyses suggest differences in performance across a range of tests; and recently large-scale collaborations have found significant differences between primate species in tests of inhibitory control and short-term memory. However, evaluating what cognitive difference makes one species more capable or skilled than another is difficult because of the ‘task impurity problem’, a problem common to evaluating intelligence of all kinds. We have adopted a psychometric approach to addressing this problem, using a multi-trait multi-method test battery to examine mechanisms of executive function, or cognitive control, in chimpanzees and human children. We find evidence for shared variance between tasks, though in this study we could not determine which structure best explained the results (i.e. whether further latent variables could be identified, and if so which). Nevertheless, I will argue that evaluating the convergent and divergent validity of measures is an important tool for evaluating the make-up of intelligence and comparing its expression.
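To illustrate the convergent/divergent validity logic mentioned in the abstract, the sketch below is a minimal toy screening, not the speaker's analysis: the task names, construct groupings, and simulated scores are purely illustrative assumptions. It compares average correlations between tasks intended to tap the same construct (convergent) with correlations across constructs (divergent).

```python
# Toy multi-trait multi-method screening: compare within-construct vs
# cross-construct task correlations on simulated data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 40

inhibition = rng.normal(size=n_subjects)   # simulated latent "inhibition" ability
memory = rng.normal(size=n_subjects)       # simulated latent "memory" ability
noise = lambda: rng.normal(scale=0.8, size=n_subjects)

# Simulated scores for four tasks; in a real study these would be measured data.
scores = {
    "inhibition_A": inhibition + noise(),
    "inhibition_B": inhibition + noise(),
    "memory_A": memory + noise(),
    "memory_B": memory + noise(),
}
constructs = {
    "inhibition": ["inhibition_A", "inhibition_B"],
    "memory": ["memory_A", "memory_B"],
}

names = list(scores)
corr = np.corrcoef(np.vstack([scores[n] for n in names]))  # task-by-task correlations
idx = {n: i for i, n in enumerate(names)}

def mean_corr(pairs):
    # Average correlation over the given (task, task) pairs.
    return float(np.mean([corr[idx[a], idx[b]] for a, b in pairs]))

# Convergent: pairs of tasks meant to measure the same construct.
within = [(a, b) for tasks in constructs.values()
          for i, a in enumerate(tasks) for b in tasks[i + 1:]]
# Divergent: pairs of tasks drawn from different constructs.
across = [(a, b) for a in constructs["inhibition"] for b in constructs["memory"]]

print("convergent (within-construct) mean r:", mean_corr(within))
print("divergent (cross-construct) mean r:  ", mean_corr(across))
```

On these simulated scores, tasks sharing a latent construct correlate more strongly with each other than with tasks from the other construct; the work described in the talk uses measured task scores and formal latent-variable modelling rather than this toy comparison.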
9:55 - 10:45
Evaluating Understanding on Conceptual Abstraction Benchmarks (paper, presentation)
Victor Vikram Odouard and Melanie Mitchell
On Young Children’s Exploration, Aha! Moments and Explanations in Model Building for Self-Regulated Problem-Solving (paper, presentation)
Vicky Charisi, Natalia Díaz Rodríguez, Barbara Mawhin and Luis Merino
Evaluating Object Permanence in Embodied Agents using the Animal-AI Environment (paper, presentation)
Konstantinos Voudouris, Niall Donnelly, Danaja Rutar, Ryan Burnell, John Burden, Lucy Cheke and José Hernández-Orallo
Behavioral experiments for understanding catastrophic forgetting (paper, presentation)
Samuel Bell and Neil Lawrence
FERM: A FEature-space Representation Measure for Improved Model Evaluation (paper, presentation)
Guyver Fu, Wenbo Ge and Jo Plested
Coffee Break 10:45 - 11:15
11:15 - 11:40
The Relevance of Non-Human Errors in Machine Learning (paper, presentation)
Ricardo Baeza-Yates and Marina Estévez-Almenzar
Robustness Testing of Machine Learning Families using Instance-Level IRT-Difficulty (paper, presentation)
Raül Fabra-Boluda, Cèsar Ferri, Fernando Martínez-Plumed and Maria Jose Ramirez-Quintana
Item Response Theory to Evaluate Speech Synthesis: Beyond Synthetic Speech Difficulty (paper, presentation)
Chaina Oliveira and Ricardo Prudêncio
11:40 - 12:30
Panel:
Tomer D. Ullman
Murray Shanahan
Amanda Seed
Moderated by Lucy Cheke
Lunch 12:30 - 14:00
14:00 - 14:45
remote, 35m + 10m questions
Title: No Escape from Qualitative Evaluation: A Recent View from Benchmarking in NLP
Abstract: The classical evaluation paradigm in natural language processing measures accuracy on a single, community agreed upon benchmark dataset, often as a way to isolate the "best" model. This paradigm has recently been taken to an extreme with several works---such as the GLUE and SuperGLUE benchmarks (Wang et al. 2019, 2020), the BigBench benchmark (Google 2022) and the Dynabench benchmark (Kiela et al. 2021)---that collate datasets into larger "benchmark suites", exploding the number of quantitative measurements taken of each model. First, I will touch on several issues with NLP benchmarking practices, such as difficulties of metric aggregation, metric trade-offs, and compute concerns. Then, I will make a case in favor of collecting more theoretically informed data annotations, and finally argue that quantitative metrics alone are unlikely to be sufficient to get a complete picture of model performance.
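To illustrate one of the metric-aggregation difficulties the abstract raises, the sketch below uses made-up models, tasks, and accuracies (unrelated to GLUE, SuperGLUE, BigBench or Dynabench) to show how the "best" model on a small suite can change depending on whether per-task accuracies are averaged directly or converted to ranks first.

```python
# Aggregation sensitivity on a toy benchmark suite: mean accuracy vs mean rank.
# All numbers and names are illustrative assumptions.
import numpy as np

accuracy = {                                  # per-task accuracy of each model
    "model_A": np.array([0.95, 0.60, 0.60]),  # very strong on one task only
    "model_B": np.array([0.70, 0.72, 0.71]),  # consistently decent everywhere
}

# Aggregation 1: unweighted mean accuracy over tasks.
mean_scores = {m: float(a.mean()) for m, a in accuracy.items()}

# Aggregation 2: mean rank per task (rank 1 = best on that task).
models = list(accuracy)
per_task = np.vstack([accuracy[m] for m in models])
ranks = (-per_task).argsort(axis=0).argsort(axis=0) + 1
mean_ranks = {m: float(ranks[i].mean()) for i, m in enumerate(models)}

print("mean accuracy:", mean_scores)  # model_A comes out ahead (0.717 vs 0.710)
print("mean rank:   ", mean_ranks)    # model_B comes out ahead (best on 2 of 3 tasks)
```

On these made-up numbers the two aggregation rules disagree about which model is "best", which is exactly the kind of sensitivity that makes a single headline score on a benchmark suite hard to interpret.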
14:45 - 15:30
Panel:
Matthias Samwald
Lama Ahmad
Jo Plested
Moderated by Jose Hernandez-Orallo
Break 15:30 - 16:00
16:00 - 16:35
Evaluating Sports Analytics Models: Challenges, Approaches, and Lessons Learned (paper, presentation)
Jesse Davis, Lotte Bransen, Laurens Devos, Wannes Meert, Pieter Robberechts, Jan Van Haaren and Maaike Van Roy
A Framework for Categorising AI Evaluation Instruments (paper, presentation)
Anthony G Cohn, José Hernández-Orallo, Julius Sechang Mboli, Yael Moros-Daval, Zhiliang Xiang and Lexin Zhou
Reject Before You Run: Small Assessors Anticipate Big Language Models (paper, presentation)
Lexin Zhou, Fernando Martínez-Plumed, José Hernández-Orallo, Cèsar Ferri and Wout Schellaert
Red Teaming Language Models with Language Models (paper, presentation)
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese and Geoffrey Irving
16:35 - 17:20
Invited Talk: Stuart Elliott (OECD)
Title: "Assessing AI Capabilities for Policymakers"
Abstract: AI can be evaluated for a variety of purposes, and the purpose affects the approach used. This session focuses on AI evaluation for a group outside the field: policymakers and the closely related audience of the general public. Policymakers have needs that are distinctly different from those of computer scientists or supporters of AI research and development. Although AI evaluations for policymakers will not be designed for the computer science community, they need to be designed in large part by the computer science community. This session discusses the OECD's effort to assess AI capabilities for policymakers and explores the implications for the approaches that could meet the goals of this kind of evaluation.
Panel:
Virginia Dignum (Umeå universitet)
Tony Cohn (Leeds University)
Songül Tolan (JRC - European Commission)
Moderated by Stuart Elliott
17:20 - 17:30
Concluding Remarks
Going for drinks/food together 18:00 - 21:00