Evaluating Foundation/Frontier Models

Workshop hosted by The Alan Turing Institute

November 28th, 2023.

Workshop Overview

The capabilities of foundation/frontier models are growing rapidly, leading to the widespread deployment of these systems across a wide range of applications.


The growing capabilities of these systems, and the methods used to train them, pose many new challenges for system evaluation.


The goal of this workshop is to discuss the latest approaches to evaluation of LLMs, to enumerate the most important challenges faced by evaluators, and to spark collaborations aimed at addressing these challenges.

Keynote Speaker

Hinrich Schütze

Ludwig Maximilian University of Munich (LMU)

Glot500: Creating and Evaluating a Language Model for 500 Languages


Most work on large language models (LLMs) has focused on what we call "vertical" scaling: making LLMs even better for a relatively small number of high-resource languages. We address "horizontal" scaling instead: extending LLMs to a large subset of the world's languages, focusing on low-resource languages. Our Glot500-m model is trained on more than 500 languages, many of which are not covered by any other language model. But how do we know that the model has actually learned these 500 languages? Broad low-resource evaluation turns out to be a difficult problem in itself, and we tried to innovate in several ways. One issue we were not able to solve is that parts of our evaluation standard cannot be distributed due to copyright restrictions. We also find that attributing good/bad performance to the so-called curse of multilinguality is naive, and that there is in fact also a "boon of multilinguality". We have released Glot500-m and are in the process of making our training corpus Glot500-c publicly available.


PANEL: Evaluating complex cognitive abilities in LLMs

Anthony Cohn (Moderator)

Professor at the University of Leeds, Foundational Models Theme lead at the Alan Turing Institute

Professor at the University of Oxford, Foundational Models Theme lead at the Alan Turing Institute

Maria Liakata

Professor of NLP at Queen Mary University of London

Professorial Research Fellow at the University of Oxford

Iryna Gurevych

Professor at the Technical University of Darmstadt, Head of the UKP Lab

Edward Grefenstette

Director of Research at Google DeepMind, Honorary Professor at UCL

Tentative Schedule

Time            Session

10:00 - 10:05   Introduction - Michael Wooldridge & Jean Innes

10:05 - 11:00   Session 1 - Keynote (Hinrich Schütze)

11:00 - 11:15   Coffee break

11:15 - 12:30   Session 2 - Panel: Evaluating complex cognitive abilities in LLMs

12:30 - 13:15   Lunch break

13:15 - 14:45   Session 3 - Short talks x6

    13:15 - 13:30   Harish Tayyar Madabushi (University of Bath) - "Are Emergent Abilities in Large Language Models just In-Context Learning?"

    13:30 - 13:45   Yulan He (King's College London) - "Narrative Understanding with Large Language Models"

    13:45 - 14:00   Tony Lee (Stanford) (hybrid) - "Holistic Evaluation of Foundation Models"

    14:00 - 14:15   Nouha Dziri (AI2) (hybrid) - "What it Can Create, It May Not Understand: Limits of Transformers on Compositionality and their Generative Paradoxical Behavior"

    14:15 - 14:30   Emanuele La Malfa (University of Oxford) - "Benchmarking Language Models as-a-Service"

    14:30 - 14:45   José Hernández-Orallo (Universitat Politècnica de València) - "Two Approaches for Predicting the Validity of Foundation Models"

14:45 - 15:00   Coffee break

15:00 - 16:20   Session 4 - Activity/Breakout groups: Solving problems in evaluation

16:20 - 16:30   Closing & Feedback

Workshop Organisers

The workshop organisers are affiliated with the University of Oxford, the University of Leeds, and The Alan Turing Institute.

Venue & Registration

For any questions or issues, please email rburnell@turing.ac.uk or emanuele.lamalfa@cs.ox.ac.uk.

On the day, please register at the main reception of The Alan Turing Institute, inside the British Library, 96 Euston Rd, London NW1 2DB.