Understanding the Geometry of Transformer Representations
Transformers have achieved impressive performance on many tasks, but their black-box nature limits our understanding of their inner workings. Interpretability research aims to bridge the gap between the empirical success of these models and a scientific comprehension of how they work. This seminar will delve into how the geometric properties of hidden representations can help us understand the semantic information encoded by transformers and the mechanisms by which they process data. In the first part of the seminar, we will focus on the intrinsic dimension of the representations, showing that it is a valuable tool for identifying the layers that encode the semantic content of the data across different domains, such as images, biological sequences, and text. In the second part, we will explore the topography of the probability density of the hidden representations. Specifically, we will compare the density peaks that form when a transformer language model learns "in context" with those that form when it is fine-tuned for a specific task. Despite achieving similar downstream performance, the two strategies produce different internal structures, although in both cases the representations undergo a sharp transition in the middle of the network.
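To give a concrete sense of the first ingredient, the sketch below shows one way to estimate the intrinsic dimension of a transformer's hidden representations layer by layer. It is a minimal illustration, not the seminar's own pipeline: it assumes a Hugging Face model ("gpt2" as a placeholder), a handful of toy inputs, and the maximum-likelihood form of the TwoNN estimator (Facco et al., 2017); all of these choices are assumptions made for the example.

```python
# Sketch: per-layer intrinsic dimension (ID) of transformer hidden states
# using a TwoNN-style maximum-likelihood estimator. Model name, toy inputs,
# and token-level pooling are illustrative assumptions, not the exact setup
# of the works discussed in the seminar.
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModel, AutoTokenizer


def twonn_id(X, discard_fraction=0.1):
    """Estimate intrinsic dimension from ratios of 2nd to 1st nearest-neighbor distances."""
    nn = NearestNeighbors(n_neighbors=3).fit(X)
    dist, _ = nn.kneighbors(X)                # dist[:, 0] is the point itself
    mu = dist[:, 2] / dist[:, 1]              # mu_i = r2 / r1 for each point
    mu = np.sort(mu)[: int(len(mu) * (1 - discard_fraction))]  # drop the largest ratios
    return len(mu) / np.sum(np.log(mu))       # maximum-likelihood TwoNN estimate


model_name = "gpt2"                           # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token                 # GPT-2 has no pad token by default
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Toy inputs; in practice one uses hundreds or thousands of inputs so that the
# nearest-neighbor statistics behind the ID estimate are meaningful.
texts = ["The cat sat on the mat.", "Proteins fold into three-dimensional structures."]

with torch.no_grad():
    batch = tok(texts, return_tensors="pt", padding=True)
    hidden = model(**batch).hidden_states     # tuple: (embeddings, layer 1, ..., layer L)

mask = batch["attention_mask"].bool()
for layer, h in enumerate(hidden):
    X = h[mask].numpy()                       # one vector per non-padded token
    print(f"layer {layer:2d}  ID ~ {twonn_id(X):.2f}")
```

Computed over a sufficiently large sample of inputs, the resulting profile of intrinsic dimension across layers is the kind of layer-wise curve the seminar will relate to where semantic content is encoded.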