Evaluating Foundation/Frontier Models
Workshop hosted by The Alan Turing Institute
November 28th, 2023.
Workshop Overview
The capabilities of foundation/frontier models are growing rapidly, leading to the widespread deployment of these systems across a wide range of applications.
The growing capabilities of these systems and the methods used to train them pose many new challenges for system evaluation. For example:
To cover the breadth of system capabilities, benchmarks are growing in size and complexity, making them more expensive and time-consuming to run.
Limited knowledge of, and access to, cutting-edge systems is making it difficult for researchers and policymakers to conduct evaluations or to determine why a system is performing well (or poorly).
Benchmarks are becoming obsolete increasingly fast due to data contamination and rapid capability improvements.
The increasingly frequent release of new models makes it hard for evaluators to fully understand a system before it becomes out of date.
The goal of this workshop is to discuss the latest approaches to evaluating LLMs, to enumerate the most important challenges faced by evaluators, and to spark collaborations aimed at addressing those challenges.
Keynote Speaker
Hinrich Schütze
Ludwig Maximilian University of Munich (LMU)
Glot500: Creating and Evaluating a Language Model for 500 Languages
Most work on large language models (LLMs) has focused on what we call "vertical" scaling: making LLMs even better for a relatively small number of high-resource languages. We address "horizontal" scaling instead: extending LLMs to a large subset of the world's languages, focusing on low-resource languages. Our Glot500-m model is trained on more than 500 languages, many of which are not covered by any other language model. But how do we know that the model has actually learned these 500 languages? Broad low-resource evaluation turns out to be a difficult problem in itself and we tried to innovate in several ways. One issue we were not able to solve is that parts of our evaluation standard cannot be distributed due to copyright restrictions. We also find that attributing good/bad performance to the so-called curse of multilinguality is naive and there is in fact also a "boon of multilinguality". We have released Glot500-m and are in the process of making our training corpus Glot500-c publicly available.
PANEL: Evaluating complex cognitive abilities in LLMs
Professor at the University of Leeds, Foundational Models Theme lead at the Alan Turing Institute
Professor at the University of Oxford, Foundational Models Theme lead at the Alan Turing Institute
Professor of NLP at Queen Mary University of London
Professorial Research Fellow at the University of Oxford
Professor at the Technical University Darmstadt, Head of the UKP Lab
Director of Research at Google DeepMind, Honorary Professor at UCL
Tentative Schedule
Time
Session
10:00 - 10:05
Introduction - Michael Wooldridge & Jean Innes
10:05 - 11:00
Session 1 - Keynote (Hinrich Schütze)
11:00 - 11:15
Coffee break
11:15 - 12:30
Session 2 - Panel: Evaluating complex cognitive abilities in LLMs
12:30 - 13:15
Lunch break
13:15 - 14:45
Session 3 - Short talks x6
13:15 - 13:30 -- Harish Tayyar Madabushi (University of Bath) - "Are Emergent Abilities in Large Language Models just In-Context Learning?"
13:30 - 13:45 -- Yulan He (King's College London) - "Narrative Understanding with Large Language Models"
13:45 - 14:00 -- Tony Lee (Stanford) (hybrid) - "Holistic Evaluation of Foundation Models"
14:00 - 14:15 -- Nouha Dziri (AI2) (hybrid) - "What it Can Create, It May Not Understand: Limits of Transformers on Compositionality and their Generative Paradoxical Behavior"
14:15 - 14:30 -- Emanuele La Malfa (University of Oxford) - "Benchmarking Language Models as-a-Service"
14:30 - 14:45 -- José Hernández-Orallo (Universitat Politècnica de València) - "Two Approaches for Predicting the Validity of Foundation Models"
14:45 - 15:00
Coffee break
15:00 - 16:20
Session 4 - Activity/Breakout groups: Solving problems in evaluation
16:20 - 16:30
Closing & Feedback
Workshop Organisers
University of Oxford
University of Leeds
University of Oxford
The Alan Turing Institute
University of Oxford
The Alan Turing Institute
The Alan Turing Institute
University of Oxford
Venue & Registration
For any questions or issues please email rburnell@turing.ac.uk or emanuele.lamalfa@cs.ox.ac.uk
On the day, register at the main reception of The Alan Turing Institute inside the British Library, 96 Euston Rd, London, NW1 2DB