EBeM Workshop 2022

📐 AI Evaluation Beyond Metrics

workshop at IJCAI-ECAI 2022 (Vienna, Austria)

July 24th (Schubert 1 Room)

Invited Speakers & Panels

Amanda Seed

University of St Andrews

+ panel on "Cognitive Evaluation with the Animal AI Environment", with

Murray Shanahan (Imperial, DeepMind)
Tomer D. Ullman (Harvard)
Amanda Seed (St Andrews)

+ panel on "Evaluating pre-trained, generative and prompted systems", with

Matthias Samwald (Medical Univ. Vienna)
Lama Ahmad (OpenAI)
Jo Plested (University of New South Wales)

+ special session on "OECD’s Artificial Intelligence and the Future of Skills (AIFS)"

with Stuart Elliott (OECD), Virginia Dignum (Umeå), Tony Cohn (Leeds) and Songül Tolan (European Commission)

Call for Papers

The 1st international workshop on AI Evaluation Beyond Metrics (EBeM) will be held in Vienna, Austria (July 23-25, 2022).

Cutting-edge AI and ML systems are able to solve a variety of problems that were not solvable a few years ago, such as machine translation and medical image analysis. With these AI systems starting to be deployed across important and consequential contexts, robust evaluation of their capabilities and limitations is critical. More generally, traditional approaches to evaluation lack the necessary robustness to analyse the capabilities of complex AI systems. Many AI systems solve a task or excel at a particular benchmark, but then fail at other tasks or instances that putatively represent the same capability.

Therefore, the goal of this workshop is to challenge the widespread but limited approach of evaluating the performance of intelligent systems with aggregated metrics over a benchmark or distribution of tasks. We will also discuss alternative approaches that draw on ideas and recent progress in cognitive and developmental psychology, psychometrics, software testing, and other areas.

Topics (not exhaustive)

Evaluation methods founded on cognitive, developmental or comparative psychology
Measurement of skills, capabilities, or cognitive abilities
Evaluation methods based on software testing or other engineering practices
Meta-analysis or comparisons of evaluation instruments
The role of evaluation in AI development, policy making, and modeling of social impact
Measurements of generality or common sense
Capture and use of evaluation data
Analysis of the task space and its relation to corresponding capabilities
The role of causality in evaluation
Topics complementary to evaluation such as documentation or auditing
Alternative evaluation methods with added benefits
Discussion and progress in hard-to-evaluate scenarios