Professor of Computer Science, University of Copenhagen
Title: Understanding the Interplay between LLMs' Utilisation of Parametric and Contextual Knowledge
Abstract. Language Models (LMs) acquire parametric knowledge during training, embedding it within their weights. The increasing scale of LMs, however, makes it difficult to understand a model's inner workings, and to update or correct this embedded knowledge without the significant cost of retraining. Moreover, when LMs are used for knowledge-intensive language understanding tasks, they have to integrate relevant context to mitigate their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context when it conflicts with the memory acquired during pre-training. Conflicting knowledge can also already be present within the LM's parameters, termed intra-memory conflict. This underscores the importance of understanding the interplay between how a language model uses its parametric knowledge and retrieved contextual knowledge.
In this talk, I will aim to shed light on this important issue by presenting our research on evaluating the knowledge present in LMs, diagnostic tests that can reveal knowledge conflicts, as well as on understanding the characteristics of successfully used contextual knowledge.
Bio. Isabelle Augenstein is a Professor at the University of Copenhagen, Department of Computer Science, where she heads the Natural Language Processing section. Her main research interests are fair and accountable NLP, including challenges such as explainability, factuality and bias detection. Prior to starting a faculty position, she was a postdoctoral researcher at University College London, and before that a PhD student at the University of Sheffield. In October 2022, Isabelle Augenstein became Denmark's youngest ever female full professor. She currently holds a prestigious ERC Starting Grant on 'Explainable and Robust Automatic Fact Checking', and her research has been recognised by a Karen Spärck Jones Award, as well as a Hartmann Diploma Prize. She is a member of the Royal Danish Academy of Sciences and Letters, and co-leads the Danish Pioneer Centre for AI.
Professor, Universitat Politècnica de València, Spain
Senior Research Fellow, Leverhulme Centre for the Future of Intelligence, University of Cambridge
Title: General Scales for AI Evaluation
Abstract. Much is being said about the need for a Science of Evaluation in AI, yet the answer may simply be found in what any science should provide: explanatory power to understand what AI systems are capable of, and predictive power to anticipate where they will be correct and safe. For increasingly general and capable AI, this power should not be limited to aggregated tasks, benchmarks or distributions, but should extend to each task *instance*. However, identifying the demands of each individual instance has been elusive, with limited predictability so far. I will present a new paradigm in AI evaluation based on general scales that are derived exclusively from task demands and can be applied through both automatable and human-interpretable rubrics. These scales can explain what common AI benchmarks truly measure, extract ability profiles that quantify the limits of what AI systems can do, and robustly predict performance on new task instances. This brings key insights into the construct validity (sensitivity and specificity) of different benchmarks, and into how distinct abilities (e.g., knowledge, metacognition and reasoning) are affected by model size, chain-of-thought integration and dense distillation. Since these general scales do not saturate as average performance does, and do not depend on human or model populations, we can explore how they can be extended to high levels of cognitive ability in AI and (enhanced) humans.
Bio. José Hernández-Orallo is Director of Research at the Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK, and Professor (on partial leave) at TU Valencia, Spain. His academic and research activities have spanned several areas of artificial intelligence, machine learning, data science and intelligence measurement, with a focus on a more insightful analysis of the capabilities, generality, progress, impact and risks of artificial intelligence. He has published five books and more than two hundred journal articles and conference papers on these topics. His research in the area of machine intelligence evaluation has been covered by several popular outlets, such as The Economist, WSJ, FT, New Scientist and Nature. He continues to explore a more integrated view of the evaluation of natural and artificial intelligence, as vindicated in his book "The Measure of All Minds" (Cambridge University Press, 2017, PROSE Award 2018). He is a founder of aievaluation.substack.com and ai-evaluation.org. He is a member of AAAI, CAIRNE and ELLIS, and a EurAI Fellow. Website: https://jorallo.github.io/, Scholar: https://scholar.google.com/citations?user=n9AWbcAAAAAJ, Email: josephorallo@gmail.com
Research Director, Inria
Title: The slow progress of AI on problems with small datasets
Abstract. Benchmarking and empirical evaluation have been central to the modern progress of AI, tackling domains such as vision, language, and voice, where methods have advanced through extensive trial and error. But other domains, such as medical imaging or tabular learning, paint another picture, where progress is slow. I will detail the evidence of slow progress, possible reasons for it, as well as ingredients of success, as recently seen in tabular learning.
Bio. Gaël Varoquaux is a research director working on data science at Inria (the French national research institute for computer science), where he leads the Soda team. He is also co-founder and scientific advisor of Probabl.
Varoquaux's research covers fundamentals of artificial intelligence, statistical learning, natural language processing, causal inference, as well as applications to health, with a current focus on public health and epidemiology. He also creates technology: he co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python.
Varoquaux has worked at UC Berkeley, McGill, and the University of Florence. He did a PhD in quantum physics supervised by Alain Aspect and is a graduate of the École Normale Supérieure, Paris.
Staff Research Scientist, DeepMind
Title: Sociotechnical Approach to AI Evaluation
Abstract. As AI systems increasingly permeate our lives, institutions, and societies, measuring their capabilities and failures has become ever more important. But current evaluation methods are not up to the challenge: benchmarking, red teaming, and experimentation methods are limited in what they can predict about AI outcomes in the real world.
In this talk, I take a step back and consider the goals of AI evaluation. On this basis, I propose a sociotechnical path forward, to better capture the need to understand AI systems across different contexts. By situating capability-based approaches in an expanded picture of AI evaluation, we can come to better understand AI systems, and build a science of evaluation that can stand the test of time.
Bio. Laura Weidinger is a Staff Research Scientist at Google DeepMind, where she leads research on novel approaches to ethics and safety evaluation. Laura's work focuses on detecting, measuring, and mitigating risks from generative AI systems. Previously, Laura worked in cognitive science research and as a policy advisor at the UK and EU levels. She holds degrees from Humboldt-Universität zu Berlin and the University of Cambridge.