Nikolai Ilinykh
Tutorial
Title: Visualising what LLMs learn about images and words
Description:
Large language models perform well on many tasks because they can produce grammatically correct and coherent text. A sub-type of these models, so-called “multi-modal models”, works not only with text but also with images: such architectures can answer questions about a visual scene or simply describe what they see. But how do these models generate such descriptions? What do they learn about the relationship between images and texts? And how much of this can we as humans visualise, interpret and understand? In this tutorial we will explore the inner workings of the transformer-based models currently in use. We will examine the self-attention mechanism, a crucial component of these models that encodes how different regions of an image correspond to different words in the text. In a hands-on demo we will then learn how to visualise and interpret the connections that these models build between images and text. The tutorial consists of two parts: a short, lecture-style introduction and a practical session. Participants can either follow along with the practical session or test the visualisation tool online via the link below.
https://colab.research.google.com/drive/18Sc9MFAyHmrPnH1P8cZwKMG_ara93tkn?usp=sharing
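To give a flavour of what the demo involves, here is a minimal Python sketch of inspecting cross-modal attention, not the notebook's actual code. It assumes the Hugging Face transformers library and the single-stream ViLT checkpoint "dandelin/vilt-b32-mlm" (the tutorial notebook may use a different model); the image path and caption are placeholders.

import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViltProcessor, ViltModel

# Load a single-stream vision-and-language transformer (assumption: ViLT here).
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open("example.jpg")       # placeholder image file
text = "a dog sitting on the grass"     # placeholder caption

inputs = processor(image, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Each element of outputs.attentions has shape (batch, heads, seq, seq),
# where the joint sequence is the text tokens followed by the image patches.
last_layer = outputs.attentions[-1][0].mean(dim=0)  # average over heads
num_text = inputs["input_ids"].shape[1]

# Rows = text tokens, columns = image patches: how strongly each word
# attends to each region of the image in the final layer.
text_to_image = last_layer[:num_text, num_text:]

tokens = processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
plt.imshow(text_to_image.numpy(), aspect="auto", cmap="viridis")
plt.yticks(range(num_text), tokens)
plt.xlabel("image patch index")
plt.title("Text-to-image attention (last layer, head average)")
plt.tight_layout()
plt.show()

Mapping the patch axis back onto the 2-D image grid (to overlay attention as a heatmap on the picture itself) is omitted here for brevity.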