Evaluating Foundation/Frontier Models

Workshop hosted by The Alan Turing Institute

November 28th, 2023.

Workshop Overview

The capabilities of foundation/frontier models are growing rapidly, leading to the widespread deployment of these systems across a wide range of applications.


The growing capabilities of these systems, and the methods used to train them, pose many new challenges for system evaluation.


The goal of this workshop is to discuss the latest approaches to evaluation of LLMs, to enumerate the most important challenges faced by evaluators, and to spark collaborations aimed at addressing these challenges.

Keynote Speaker

Hinrich Schütze

Ludwig Maximilian University of Munich (LMU)

Glot500: Creating and Evaluating a Language Model for 500 Languages


Most work on large language models (LLMs) has focused on what we call "vertical" scaling: making LLMs even better for a relatively small number of high-resource languages. We address "horizontal" scaling instead: extending LLMs to a large subset of the world's languages, focusing on low-resource languages. Our Glot500-m model is trained on more than 500 languages, many of which are not covered by any other language model. But how do we know that the model has actually learned these 500 languages? Broad low-resource evaluation turns out to be a difficult problem in itself, and we tried to innovate in several ways. One issue we were not able to solve is that parts of our evaluation standard cannot be distributed due to copyright restrictions. We also find that attributing good/bad performance to the so-called curse of multilinguality is naive, and that there is in fact also a "boon of multilinguality". We have released Glot500-m and are in the process of making our training corpus Glot500-c publicly available.


PANEL: Evaluating complex cognitive abilities in LLMs

Anthony Cohn (Moderator)

Professor at the University of Leeds, Foundational Models Theme lead at the Alan Turing Institute

Professor at the University of Oxford, Foundational Models Theme lead at the Alan Turing Institute

Maria Liakata

Professor of NLP at Queen Mary University of London

Professorial Research Fellow at the University of Oxford

Iryna Gurevych

Professor at the Technical University of Darmstadt, Head of the UKP Lab

Edward Grefenstette

Director of Research at Google DeepMind, Honorary Professor at UCL

Tentative Schedule

Time            Session

10:00 - 10:05   Introduction - Michael Wooldridge & Jean Innes

10:05 - 11:00   Session 1 - Keynote (Hinrich Schütze)

11:00 - 11:15   Coffee break

11:15 - 12:30   Session 2 - Panel: Evaluating complex cognitive abilities in LLMs

12:30 - 13:15   Lunch break

13:15 - 14:45   Session 3 - Short talks x6

    13:15 - 13:30   Harish Tayyar Madabushi (University of Bath) - "Are Emergent Abilities in Large Language Models just In-Context Learning?"

    13:30 - 13:45   Yulan He (King's College London) - "Narrative Understanding with Large Language Models"

    13:45 - 14:00   Tony Lee (Stanford) (hybrid) - "Holistic Evaluation of Foundation Models"

    14:00 - 14:15   Nouha Dziri (AI2) (hybrid) - "What it Can Create, It May Not Understand: Limits of Transformers on Compositionality and their Generative Paradoxical Behavior"

    14:15 - 14:30   Emanuele La Malfa (University of Oxford) - "Benchmarking Language Models as-a-Service"

    14:30 - 14:45   José Hernández-Orallo (Universitat Politècnica de València) - "Two Approaches for Predicting the Validity of Foundation Models"

14:45 - 15:00   Coffee break

15:00 - 16:20   Session 4 - Activity/Breakout groups: Solving problems in evaluation

16:20 - 16:30   Closing & Feedback

Workshop Organisers

The workshop organisers are affiliated with the University of Oxford, the University of Leeds, and The Alan Turing Institute.

Venue & Registration

For any questions or issues, please email rburnell@turing.ac.uk or emanuele.lamalfa@cs.ox.ac.uk.

On the day, please register at the main reception of The Alan Turing Institute, inside the British Library, 96 Euston Rd, London NW1 2DB.