All seminars are on Mondays at 8:30 am PT / 11:30 am ET / 4:30 pm London / 5:30 pm Berlin.
Date: February 3, 2025 [Cancelled]
Speaker: Emily Sullivan
Date: March 3, 2025
Speaker: Jacqueline Harding
Title: Goal-Directedness in AI Systems
Abstract: Read a recent press release from any AI lab, and you’ll get the same message: 2025 is the year of the AI agent. In the coming months, we’re told, frontier AI systems will not merely generate single steps of dialogue, images or video. Instead, they will act autonomously within complex environments, in the sense that they’ll produce whole sequences of outputs without direct supervision. Crucially, their behaviour will be goal-directed: they will produce outputs in order to achieve some goal. But what does this mean, and why does it matter? In this talk, I’ll first identify some desiderata on an account of goal-directedness. Next, I’ll develop a behavioural account of goal-directedness, drawing on recent work within the computer science literature. The basic idea is that a system’s policy in an environment is goal-directed when it can be compressed by a goal. By applying it to current AI systems, I’ll argue that this behavioural account is surprisingly useful and flexible; in particular, it avoids many of the issues which plagued cybernetic accounts of agency. Nevertheless, it cannot do everything we want an account of goal-directedness to do; amongst other things, it attributes the wrong goals to the system in environments in which the system fails systematically. To deal with these cases, we need to look inside the system, identifying mechanisms whose function is to achieve the goal in question. I’ll conclude by sketching some ways to apply this idea to current and future AI systems.
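To make the compression idea concrete, here is a minimal, purely illustrative sketch. The gridworld, the goal predicate, and the character-count measure of description length are my own simplifications for this listing, not Harding's formal account; the point is only that describing a policy via a goal can be far shorter than listing it state by state.

```python
# Illustrative sketch only: a toy version of "a policy is goal-directed when
# it can be compressed by a goal". Gridworld, goal, and description-length
# measure are all simplifying assumptions, not the account given in the talk.

from itertools import product

GRID = 5                      # 5x5 gridworld
TARGET = (4, 4)               # hypothetical goal state
STATES = list(product(range(GRID), range(GRID)))

def greedy_policy(state):
    """A policy that always steps toward TARGET (toy tie-breaking at the goal)."""
    x, y = state
    if x < TARGET[0]:
        return "right"
    if x > TARGET[0]:
        return "left"
    if y < TARGET[1]:
        return "up"
    return "down"

def tabular_description_length(policy):
    """Cost of describing the policy state by state (one action label per state)."""
    return sum(len(policy(s)) for s in STATES)

def goal_description_length():
    """Cost of describing the same behaviour as a single goal statement."""
    return len(f"minimise Manhattan distance to {TARGET}")

if __name__ == "__main__":
    print("tabular description   :", tabular_description_length(greedy_policy), "characters")
    print("goal-based description:", goal_description_length(), "characters")
    # The goal-based description is much shorter: roughly the sense in which
    # a goal 'compresses' the policy.
```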
Date: March 17, 2025
Speaker: Catherine Stinson
Title: Moving Goalposts or Degenerating Research?: AI Benchmarks and their Critics
Abstract: Artificial Intelligence tends to have a dismissive attitude toward its critics. This is true in particular of critique of benchmarks. While critics claim that benchmark datasets are often poorly constructed, that an overemphasis on benchmark leaderboards corrupts research incentives, and that results on benchmark tasks are overgeneralized to broader capacities than they test, the response from some big names in AI is that critics are illegitimately denying AI its successes by ‘moving the goalposts’. However, critique is widely recognized by historians, sociologists and philosophers of science to be essential to progress, and failure to engage with critique is seen as a path to a degenerating research program (or worse, pseudoscience). If that is correct, then AI would do well to heed its critics as helpful voices rather than try to shut down dissent. As an example of how embracing critique can be fruitful, I highlight the case of adversarial examples research, where critique of embarrassing gaffes by image recognition tools (which had better-than-human performance on benchmark tasks) inspired a research method in which mistakes are explicitly sought out as a way of improving models. This approach, where critique is treated as useful input, has been immensely successful, not only in improving image recognition tools but also in adding to our knowledge of how primate brains process images. The lesson this case suggests is that AI would benefit from taking a less adversarial stance toward its critics.
Date: March 31, 2025
Speaker: Cameron Buckner & Raphaël Millière
Title: Interventionist methods for interpreting deep neural networks
Abstract: Recent breakthroughs in artificial intelligence have primarily resulted from training deep neural networks (DNNs) with vast numbers of adjustable parameters on enormous datasets. Due to their complex internal structure, DNNs are frequently characterized as inscrutable "black boxes," making it challenging to interpret the mechanisms underlying their impressive performance. This opacity creates difficulties for explanation, safety assurance, trustworthiness, and comparisons to human cognition, leading to divergent perspectives on these systems. This chapter examines recent developments in interpretability methods for DNNs, with a focus on interventionist approaches inspired by causal explanation in philosophy of science. We argue that these methods offer a more promising avenue for understanding how DNNs process information than behavioral benchmarking and correlational probing alone. We review key interventionist methods and illustrate their application through practical case studies. These methods allow researchers to identify and manipulate specific computational components within DNNs, providing insights into their causal structure and internal representations. We situate these approaches within the broader framework of causal abstraction, which aims to align low-level neural computations with high-level interpretable models. While acknowledging current limitations, we contend that interventionist methods offer a path towards more rigorous and theoretically grounded interpretability research, potentially informing both AI development and computational cognitive neuroscience.
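As a rough illustration of one interventionist method in this family, the sketch below performs activation patching on a toy network: cache a hidden activation from a "clean" input, splice it into a forward pass on a "corrupt" input, and see whether the output shifts. The two-layer model, the patched layer, and the random inputs are placeholders chosen for brevity, not the chapter's own case studies.

```python
# Minimal, illustrative activation-patching sketch on a toy model.
# Assumes PyTorch is available; architecture and layer choice are placeholders.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a much larger DNN.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()

clean_input = torch.randn(1, 8)
corrupt_input = torch.randn(1, 8)

# 1. Cache the hidden activation produced by the clean input.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
with torch.no_grad():
    clean_out = model(clean_input)
handle.remove()

# 2. Run the corrupt input, but intervene: overwrite the hidden activation
#    with the cached clean one and observe how the output changes.
def patch_hook(module, inputs, output):
    return cache["hidden"]

handle = model[1].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(corrupt_input)
handle.remove()

with torch.no_grad():
    corrupt_out = model(corrupt_input)

print("clean  :", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)
# If patching this layer moves the corrupt output toward the clean one, the
# layer is causally implicated in the behaviour under study.
```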
Date: April 14, 2025
Speaker: Alexandra Oprea (paper co-authored with Ryan Muldoon and Justin Bruner)
Title: Pluralism and AI Alignment
Abstract: AI researchers and policymakers agree that powerful new AI technologies ought to be aligned with human values and ought to serve the public good. Increasingly, they also agree that such alignment ought to be pluralistic. Our analysis of existing methods of AI alignment such as reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) identifies three key challenges for developing a pluralistic model of AI alignment. The first is the selection challenge of recruiting a diverse group of participants to provide the relevant feedback and/or generating a sufficiently diverse range of answers underpinned by pluralistic moral and political views. The second is the incentive challenge of structuring the incentive system so that participants providing feedback aim for reasonable diversity instead of mirroring the preferences of AI programmers. The final challenge is the aggregation challenge of preserving pluralism while turning feedback into reward functions. In particular, the goal is to avoid treating diverse answers as noise and instead to treat them as the pluralistic signal they represent. Drawing on existing work in political philosophy, game theory, and business ethics, we attempt to sketch an integrated solution to these challenges that can advance the goal of pluralistic AI alignment.
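To make the aggregation challenge concrete, here is a small illustrative contrast between two ways of turning preference feedback into a reward signal: majority voting, which discards disagreement as noise, versus keeping a per-perspective signal. The annotator data, the group labels, and both aggregation rules are hypothetical toys, not the paper's proposal.

```python
# Illustrative sketch only: two toy ways of aggregating diverse preference
# feedback into a reward signal. Data, groups, and rules are hypothetical.

from collections import Counter

# Each annotator compares two model answers (A vs. B) to the same prompt;
# annotators come from different (stylised) moral/political perspectives.
feedback = [
    {"group": "perspective_1", "preferred": "A"},
    {"group": "perspective_1", "preferred": "A"},
    {"group": "perspective_2", "preferred": "B"},
    {"group": "perspective_2", "preferred": "B"},
    {"group": "perspective_3", "preferred": "A"},
]

def majority_reward(votes):
    """Collapse all feedback into one label: disagreement is treated as noise."""
    winner, _ = Counter(v["preferred"] for v in votes).most_common(1)[0]
    return {"A": 1.0 if winner == "A" else 0.0,
            "B": 1.0 if winner == "B" else 0.0}

def pluralistic_reward(votes):
    """Keep one reward distribution per perspective, so disagreement survives."""
    per_group = {}
    for v in votes:
        per_group.setdefault(v["group"], Counter())[v["preferred"]] += 1
    return {g: {ans: n / sum(c.values()) for ans, n in c.items()}
            for g, c in per_group.items()}

print("majority-vote reward  :", majority_reward(feedback))
print("per-perspective reward:", pluralistic_reward(feedback))
# The first aggregation erases the minority view entirely; the second keeps
# disagreement as signal that a downstream alignment method could weigh.
```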
Date: April 28, 2025
Speaker: Emily Sullivan
Title: Idealization Failure in ML
Abstract: Idealizations, deliberate distortions introduced into scientific theories and models, are commonplace in science. This has led to a puzzle in epistemology and philosophy of science: How could a deliberately false claim or representation lead to the epistemic successes of science? In answering this question, philosophers have focused single-mindedly on explaining how and why idealizations are successful. But surely some idealizations fail. I propose that asking a slightly different question, namely whether a particular idealization is successful, not only gives insight into idealization failure but also shows that our theories of idealization need revision. In this talk I consider idealizations in computation and machine learning.