All seminars are on Mondays at 8:30 am PT / 11:30 am ET / 4:30 pm London / 5:30 pm Berlin.
Date: March 3, 2025
Speaker: Jacqueline Harding
Title: Goal-Directedness in AI Systems
Abstract: Read a recent press release from any AI lab, and you’ll get the same message: 2025 is the year of the AI agent. In the coming months, we’re told, frontier AI systems will not merely generate single steps of dialogue, images or video. Instead, they will act autonomously within complex environments, in the sense that they’ll produce whole sequences of outputs without direct supervision. Crucially, their behaviour will be goal-directed: they will produce outputs in order to achieve some goal. But what does this mean, and why does it matter? In this talk, I’ll first identify some desiderata on an account of goal-directedness. Next, I’ll develop a behavioural account of goal-directedness, drawing on recent work within the computer science literature. The basic idea is that a system’s policy in an environment is goal-directed when it can be compressed by a goal. Applying this behavioural account to current AI systems, I’ll argue that it is surprisingly useful and flexible; in particular, it avoids many of the issues which plagued cybernetic accounts of agency. Nevertheless, it cannot do everything we want an account of goal-directedness to do; amongst other things, it attributes the wrong goals to the system in environments in which the system fails systematically. To deal with these cases, we need to look inside the system, identifying mechanisms whose function is to achieve the goal in question. I’ll conclude by sketching some ways to apply this idea to current and future AI systems.
Date: March 17, 2025
Speaker: Catherine Stinson
Title: Moving Goalposts or Degenerating Research?: AI Benchmarks and their Critics
Abstract: Artificial Intelligence tends to have a dismissive attitude toward its critics. This is true in particular of critique of benchmarks. While critics claim that benchmark datasets are often poorly constructed, that an overemphasis on benchmark leaderboards corrupts research incentives, and that results on benchmark tasks are overgeneralized to broader capacities than they test, the response from some big names in AI is that critics are illegitimately denying AI its successes by ‘moving the goalposts’. However, critique is widely recognized by historians, sociologists and philosophers of science to be essential to progress, and failure to engage with critique is seen as a path to a degenerating research program (or worse, pseudoscience). If that is correct, then AI would do well to heed its critics as helpful voices rather than try to shut down dissent. As an example of how embracing critique can be fruitful, I highlight the case of adversarial examples research, where critique of embarrassing gaffes by image recognition tools (which had better-than-human performance on benchmark tasks) inspired a research method in which mistakes are explicitly sought out as a way of improving models. This approach, where critique is treated as useful input, has been immensely successful, not only in improving image recognition tools but also in adding to our knowledge of how primate brains process images. The lesson this case suggests is that AI would benefit from taking a less adversarial stance toward its critics.
Date: March 31, 2025
Speaker: Cameron Buckner & Raphaël Millière
Title: Interventionist methods for interpreting deep neural networks
Abstract: Recent breakthroughs in artificial intelligence have primarily resulted from training deep neural networks (DNNs) with vast numbers of adjustable parameters on enormous datasets. Due to their complex internal structure, DNNs are frequently characterized as inscrutable "black boxes," making it challenging to interpret the mechanisms underlying their impressive performance. This opacity creates difficulties for explanation, safety assurance, trustworthiness, and comparisons to human cognition, leading to divergent perspectives on these systems. This chapter examines recent developments in interpretability methods for DNNs, with a focus on interventionist approaches inspired by causal explanation in philosophy of science. We argue that these methods offer a promising avenue for understanding how DNNs process information compared to merely behavioral benchmarking and correlational probing. We review key interventionist methods and illustrate their application through practical case studies. These methods allow researchers to identify and manipulate specific computational components within DNNs, providing insights into their causal structure and internal representations. We situate these approaches within the broader framework of causal abstraction, which aims to align low-level neural computations with high-level interpretable models. While acknowledging current limitations, we contend that interventionist methods offer a path towards more rigorous and theoretically grounded interpretability research, potentially informing both AI development and computational cognitive neuroscience.
Date: April 14, 2025
Speaker: Alexandra Oprea (paper co-authored with Ryan Muldoon and Justin Bruner)
Title: Pluralism and AI Alignment
Abstract: AI researchers and policymakers agree that powerful new AI technologies ought to be aligned with human values and ought to serve the public good. Increasingly, they also agree that such alignment ought to be pluralistic. Our analysis of existing methods of AI alignment such as reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) identifies three key challenges for developing a pluralistic model of AI alignment. The first is the selection challenge of recruiting a diverse group of participants to provide the relevant feedback and/or generating a sufficiently diverse range of answers underpinned by pluralistic moral and political views. The second is the incentive challenge of structuring the incentive system so that participants providing feedback aim for reasonable diversity instead of mirroring the preferences of AI programmers. The final challenge is the aggregation challenge of preserving pluralism while turning feedback into reward functions. In particular, the goal is to avoid treating diverse answers as noise and instead to treat them as the pluralistic signal they represent. Drawing on existing work in political philosophy, game theory, and business ethics, we attempt to sketch an integrated solution to these challenges that can advance the goal of pluralistic AI alignment.
Date: April 28, 2025
Speaker: Emily Sullivan
Title: Idealization Failure in ML
Abstract: Idealizations, deliberate distortions introduced into scientific theories and models, are commonplace in science. This has led to a puzzle in epistemology and philosophy of science: How could a deliberately false claim or representation lead to the epistemic successes of science? In answering this question, philosophers have been singularly focused on explaining how and why idealizations are successful. But surely some idealizations fail. I propose that asking a slightly different question, namely whether a particular idealization is successful, not only gives insight into idealization failure but also makes us realize that our theories of idealization need revision. In this talk I consider idealizations in computation and machine learning.
Date: January 26, 2026
Speaker: Madeleine Ransom (joint work with Nicole Menard)
Title: A Dilemma for Skeptics of Trustworthy AI
Abstract: Can AI ever be (un)trustworthy? A growing number of philosophers argue that it cannot, because AI lacks some human feature deemed essential for the trust relation, such as moral agency or responsiveness to reasons. Here we propose a dilemma for these skeptics. Such theorists must hold either that there is only one kind of trust (monism) or that there are multiple varieties of trust (pluralism). The first horn of the dilemma is that a monistic view of trust is implausible: no one analysis can capture all kinds of trust relationships. The second horn is that if such theorists adopt a pluralistic account of trust, they have little reason to deny that AI is the sort of thing that can be trustworthy: while AI may fail to possess characteristics required for some kinds of trust relations, these are not necessary conditions for trustworthiness.
Date: February 9, 2026
Speaker: Daniel J Singer and Luca Garzino Demo
Title: The Future of AI is Many, Not One
Abstract: Generative AI is currently being developed and used in a way that is distinctly singular. We see this not just in how users interact with models but also in how models are built, how they're benchmarked, and how commercial and research strategies using AI are defined. We argue that this singular approach is a flawed way to engage with AI if we're hoping for it to support groundbreaking innovation and scientific discovery. Drawing on research in complex systems, organizational behavior, and philosophy of science, we show why we should expect deep intellectual breakthroughs to come only from epistemically diverse teams of AI models, not from singular superintelligent models. Having a diverse team broadens the search for solutions, delays premature consensus, and allows for the pursuit of unconventional approaches. Developing AI teams like these directly addresses critics' concerns that current models are constrained by past data and lack the creative insight required for innovation. In the paper, we explain what constitutes genuinely diverse teams of AI models, distinguishing them from current multi-agent systems, and outline how to implement meaningful diversity in AI collectives. The upshot, we argue, is that the future of transformative transformer-based AI is fundamentally many, not one.
Date: February 23, 2026
Speaker: Huzeyfe Demirtas
Title: (How) Does Accountability Require Explainable AI?
Abstract: Autonomous systems powered by artificial intelligence (AI) are said to generate responsibility gaps (RGs)—cases in which AI causes harm, yet no one is blameworthy. This paper has three aims. First, I argue that we should stop worrying about RGs. This is because, on the most popular contemporary theories, blameworthiness is determined at the development or deployment stage, making post-deployment outcomes irrelevant to blameworthiness. Another upshot of this argument is that questions about blameworthiness do not motivate the demand for explainable AI (XAI). Second, I distinguish blameworthiness from liability and show that blameworthiness is not necessary—nor is it sufficient—for liability. Third, I explore how AI opacity complicates identifying who caused harm—an essential step in assigning liability. However, I argue that identifying who caused the harm—even if we use opaque AI models—is within our reach and not too costly. But liability in the context of AI requires further inquiry, which again suggests that we should stop worrying about RGs and focus on liability. Two further results emerge. One, my discussion presents a framework for analyzing how accountability might require XAI. Two, if my arguments based on this framework are on the right track, XAI is of little significance for accountability. Hence, we should worry about transparency around the AI—its training, deployment, and broader sociopolitical context—not inside the AI.
Date: March 9, 2026
Speaker: Mike Barnes
Title: TBD
Abstract: TBD
Date: March 23, 2026
Speaker: Iwan Williams
Title: Intention-like representations in Large Language Models?
Abstract: A growing chorus of AI researchers and philosophers posit internal representations in large language models (LLMs). But how do these representations relate to the kinds of mental states we routinely ascribe to our fellow humans? While some research has focused on belief- or knowledge-like states in LLMs, there has been comparatively little focus on the question of whether LLMs have intentions. I survey five properties that have been associated with intentions in the philosophical literature, and assess two candidate classes of LLM representations against this set of features. The result is mixed: LLMs have representations that are intention-like in many—perhaps surprising—respects, but they differ from human intentions in important ways.
Date: April 6, 2026
Speaker: Jessie Hall
Title: TBD
Abstract: TBD
Date: April 20, 2026
Speaker: Parisa Moosavi
Title: TBD
Abstract: TBD