Philosophy Meets Machine Learning: What Counts As Trustworthy?

Workshop at ICML 2026, Seoul, South Korea

11th July, 2026

Workshop Description

Philosophers have long thought deeply about many concepts that are used colloquially in the machine learning (ML) community, such as epistemology, counterfactuals, explainability, reliability, uncertainty and causality. As ML systems are now embedded in high-stakes decisions across science, industry, and public life, it is urgent that when ML researchers claim properties such as "explainability", "reliability", "intelligence" or "cognition", these claims are made with awareness of what practitioners, policymakers, and affected users mean by those terms. In particular, we argue that the ML community needs to take a step back and review whether the mathematical objectives used in optimisation and evaluation procedures truly take into account how philosophers have analysed these concepts. Such analyses explicitly aim to connect notions like explanation, evidence, and uncertainty to human understanding, justification, and use.

Philosophers of science and psychologists are more actively engaged than ever in such questions; however, their interaction with ML researchers remains sparse and fragmented. The goal of the proposed workshop is to facilitate a lively dialogue between the two otherwise largely separate communities, to promote more principled and grounded advances in ML and artificial intelligence.

Important dates

11th May: Paper submission deadline
31st May: Notification of decision
11th July: Workshop

Invited speakers

Been Kim

Been Kim is a director at Google DeepMind, studying effective communication and collaboration between humans and complex machine learning models. Her research aims to harness machine intelligence for human benefit. Notable recent work includes teaching superhuman chess concepts to grandmasters, one of whom, Gukesh, became the youngest World Chess Champion.

Dr. Kim is an accomplished speaker, having given a talk at the G20 meeting in Argentina (2019) and keynotes at ICLR (2022) and ECML (2020). Her influential work, TCAV, was recognized with the UNESCO Netexplo award and featured at Google I/O '19. Her contributions are also discussed in Brian Christian's book, "The Alignment Problem." A leader in the ML community, she was General Chair for ICLR 2024 and Senior Program Chair for ICLR 2023, and is on the advisory board for TRAILS. She has extensive experience as a Senior Area Chair for conferences such as NeurIPS, ICML, ICLR, and AISTATS. She earned her PhD from MIT.

Raphaël Millière

Raphaël Millière is an Associate Professor at the University of Oxford and a Fellow of Jesus College, with an affiliation at the Institute for Ethics in AI. He also holds an AI2050 Fellowship from Schmidt Sciences.

Raphaël's research mainly focuses on understanding modern artificial neural networks, such as large language models, through theoretical analysis, behavioural evaluation, and interpretability methods.

Misha Belkin

Mikhail Belkin is the HDSI Endowed Chair Professor in AI at the Halicioglu Data Science Institute and the Computer Science and Engineering Department at UCSD.

His research interests are broadly in theory and applications of Artificial Intelligence, deep learning and data analysis. One of his key findings has been the "double descent" risk curve that extends the textbook U-shaped bias-variance trade-off curve beyond the point of interpolation. His recent work focusses on understanding feature learning and over-parameterization in deep learning.

Naftali Weinberger

Naftali Weinberger is a postdoc at the Munich Center for Mathematical Philosophy, working on general philosophy of science with a focus on causation and causal modelling.

His research addresses a wide range of questions concerning the nature and scope of causal explanation, as well as epistemological questions about how to choose among the theories compatible with one’s evidence. He is also interested in the use of causal concepts in particular sciences, and has engaged with fields as diverse as population genetics, psychometrics, cognitive science, and neuroscience. His current projects are on causally modelling dynamical systems and causal issues related to discrimination.

Alice Huang

Alice Huang is an assistant professor and the Duncanson Chair in Ethics and Technology at the University of Western Ontario, jointly appointed in computer science and in philosophy. She is also a faculty affiliate at the Schwartz Reisman Institute for Technology & Society.

Her projects fall into two broad categories. The first connects formal results in artificial intelligence research to ethical issues related to interpretability, collaboration and fairness. The second uses formal and computational models to investigate pressing issues in our social discourse today, such as questions about misinformation, scientific practices and polarization.

Cameron Buckner

Cameron Buckner is a Professor and Donald F. Cronin Endowed Chair in the Humanities at the University of Florida.

In his current work, he focuses on the relationship between learning and meaning, by offering approaches to mental content, cognition, and knowledge representation that take the latest empirical theories of learning as their starting point. While his main focus remains on cognitive science (especially animal cognition and artificial intelligence), these insights also ground solutions to more traditional philosophical problems.

Schedule

The workshop will take place from 8am to 5pm on Saturday 11th July, 2026 at Coex, Seoul, South Korea.

8:20-8:30 Opening remarks
8:30-9:05 Invited talk: Naftali Weinberger

Title: Surrogate Metric Evaluation is a Causal Inference Problem

Abstract: Many AI systems are evaluated and trained using surrogate metrics that stand in for quantities of ultimate interest. Analyses of surrogate trustworthiness often focus on whether the surrogate and the target quantity are correlated, but this leaves open whether improvements in the surrogate will continue to improve the target under increasing optimization pressure. In my talk, I argue that this question can be conceptualized and addressed using dynamic causal models of feedback control systems. Using the classic example of the Watt governor, I illustrate how such systems exploit feedback loops to create higher-scale relationships by which one quantity adjusts to respond to another (e.g. the governor adjusts steam supply to match steam demand). Surrogate optimization similarly employs feedback in an attempt to make improvements in the surrogate correspond to improvements in the target, and optimization ceases to be useful once proposed improvements to the surrogate fail to match improvements in the target. It is this “matching” relationship, rather than the correlation between surrogate and target, that is relevant for analyzing optimization failures, reward hacking, and Goodhart-style effects.

9:05-9:40 Invited talk: Misha Belkin

Title: 75 years of Turing’s AGI

Abstract: I argue that the long-standing problem of building human-level machine intelligence has been solved.

9:40-10:00 Coffee break
10:00-10:35 Invited talk: Raphaël Millière

Title: A Bayesian epistemology for LLM evaluation

Abstract: Evaluating the capacities of large language models (LLMs) requires inferring latent abilities from observable task performance. We propose to formalize such inferences within a Bayesian framework, drawing on the epistemology of comparative cognitive science. The framework makes explicit how posterior credences about a system's abilities depend on prior probabilities over hypotheses about computational strategies, likelihood functions shaped by task demands, and marginalization over auxiliary factors that influence performance independently of the target ability. Within this framework, we argue that the relative evidential weight of behavioral versus mechanistic evidence for algorithmic-level hypotheses concerning the representations and computations underlying task performance depends systematically on background knowledge about the target system. For human cognition, decades of research in cognitive science, neuroscience, and evolutionary biology provide strong priors that constrain the space of plausible computational strategies, while well-characterized auxiliary factors (attention, motivation, working memory limits) allow informative marginalization. Behavioral paradigms consequently provide substantial evidential leverage for discriminating between algorithmic hypotheses. For LLMs, this evidential hierarchy shifts: weak priors about learned computational strategies and poorly characterized auxiliary factors (e.g., sensitivity to prompt formatting, tokenization artifacts, training distribution) make behavioral evidence systematically underdetermining across competing hypotheses. Mechanistic interpretability, by contrast, provides more direct access to representations and computations, offering evidence that can discriminate where behavioral evidence cannot.

10:35-11:10 Invited talk: Been Kim

Title: TBD

Abstract: TBD

11:10-12:00 Get lunch!
12:00-13:10 Poster session (bring your own lunch)
13:10-13:25 Break
13:25-14:00 Invited talk: Cameron Buckner

Title: “Reasoning” models, philosophy of inference, and artificial epistemic agency

Abstract: The frontier of AI is being pushed forward now by “Large Reasoning Models” (LRMs) that self-generate long textual “Chains-of-Thought” (CoTs) before answering user queries. While there is already a cottage industry investigating whether these chains of thought are “faithful” to underlying computations producing answers, existing faithfulness research has largely analyzed faithfulness in purely behavioral terms, as a particular input-output profile on reasoning problems. More broadly, there has been little attempt in this literature to define reasoning or to distinguish rational inference from other forms of decision-making. By contrast, philosophical work on rational inference has tended to focus on internal psychological connections between evidence and conclusions—in the philosophical tradition, rational inference involves the exercise of a distinctive form of epistemic agency to construe evidence in particular ways and to draw conclusions because of these construals. In this talk, I will contrast the behavioral and philosophical approaches to rational inference. I will argue that while the philosophical tradition runs the risk of overintellectualizing inference, there are some pragmatic reasons for engineers to review the richer philosophical notion of reasoning when designing the next generation of artificial agents with more powerful forms of epistemic self-calibration and agency.

14:00-14:50 Oral presentations (6 x 7 minutes, followed by 7 minutes of joint Q&A)
1. Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz
Self-Reports Do Not Identify Self-Models: An Identifiability Test for Counterfactual Reports
2. Kola Ayonrinde, Raphaël Millière
Getting Monosemantic About Monosemanticity
3. Amine M'Charrak, Thong Pham, Thomas Lukasiewicz, Yuxiao Dong, Shohei Shimizu
Before Normative and Moral Alignment: Causal Contract Faithfulness as a Precondition for Trustworthy AI
4. Gilad Landau, Aviv Keren
The Concept of Representation in ML: Beyond Plato and Aristotle
5. Louis Mahon, Elliot Ford, Callum Hackett
A Definition of Good Explanations and the Challenges Explaining LLM Outputs
6. Joseph Keshet
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
14:50-15:25 Coffee break
15:25-16:00 Invited talk: Alice Huang

Title: (How) Does Accountability Require Understanding ML Models?

Abstract: Concerns about artificial intelligence—particularly systems based on machine learning models whose internal operations are opaque to human understanding—are frequently framed as concerns about accountability. Yet the concept of accountability itself is often left underspecified, encompassing a heterogeneous set of issues that call for distinct responses. This paper focuses on a specific subset of accountability concerns: the “who” questions. We distinguish between two questions: When an AI system causes harm, who is to blame, and who bears the obligation to provide compensation? We argue that the answers to these two questions about accountability require different information and do not entail one another. Then we investigate whether the opacity of contemporary machine learning models undermines our capacity to hold the relevant actors accountable in these two different senses of accountability, and how to rethink efforts aimed at increasing understanding.

16:00-17:00 Panel discussion & Closing

Accepted papers

Self-Reports Do Not Identify Self-Models: An Identifiability Test for Counterfactual Reports
Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz
Getting Monosemantic About Monosemanticity
Kola Ayonrinde, Raphaël Millière
Before Normative and Moral Alignment: Causal Contract Faithfulness as a Precondition for Trustworthy AI
Amine M'Charrak, Thong Pham, Thomas Lukasiewicz, Yuxiao Dong, Shohei Shimizu
The Concept of Representation in ML: Beyond Plato and Aristotle
Gilad Landau, Aviv Keren
A Definition of Good Explanations and the Challenges Explaining LLM Outputs
Louis Mahon, Elliot Ford, Callum Hackett
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
Joseph Keshet
Proleptic Epistemology for Societal Impacts of AGI
Priyansh Singhal, Sandeep Kumar, Piyush Joshi
The Opacity of Descent: Optimization, Epistemic Asymmetry, and the Semantics of Convergence in Deep Learning
Mahdi Ghaznavi
Taxonomizing Arguments for Language Model Skepticism in the Language of the Theory of Computation
Michael Guerzhoy
Fair Learning with Biased Labels: When Observed Accuracy Is the Wrong Target
Heng-Chien Liou, I-Hsiang Wang
Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models
Nick Oh, Helen Jin
From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models
Leonard Engmann, Christian Medeiros Adriano, Holger Giese
Mistakes as Epistemic Signatures: An Efficiency-Modulated Cumulative Error Framework for Comparison and Diagnosis of AI Errors
Darshini N
Lifted Representation Hypothesis in Language Models
Bumjin Park, Jaesik Choi
Explanation in an Emerging Science of Large Language Models
James Ming Liang Ang
On Epistemic Diversity in Large Language Models
Elisabeth Kirsten, Nicole C. Krämer, Muhammad Bilal Zafar
Dignity as Answerability: How World-Model AI Reframes Human Moral Standing
Junghoon Justin Park, Jiook Cha
Operative Contexts: Belief Revision and Memory in Agentic AI
Emma Cabalé, Philippe Beraud, Philippe Limantour
A Relativistic Perspective of Reliability in Machine Learning
Rajeev Verma
When Belief Bends to Belief: Sycophancy as a Single-Layer Truth–Compliance Tension in LLMs
Valentin NOËL
Can LLMs Navigate Beliefs and Facts? Depends on How You Phrase It
Quang Minh Nguyen, Luis Frentzen Salim
Beyond Accuracy: Epistemic Justification in Trustworthy Machine Learning
Poojak Patel, Maneth Perera
An Evolutionary Epistemology of Post-Training
Nicholas Clark, Bill Howe
Trust as Predictive Precision: Reliability and Influence in Representation Alignment
Hidenori Tanaka
Explanation for Whom? Hospitable Interpretability for Machine Learning
Abutalib Namazov
Constituting What Counts: A Phenomenological Approach to Human-AI Ontological Translation
Prerna Luthra, Manojshyaam C J
Online Boundary-Aware Memory for Case-Based Reasoning Agents
Zheng Dong, Luming Shang
When Do Transformer Components Compose? Validating a Log-Pool Decomposition Criterion
Junyu Ren, Su Hyeong Lee, Risi Kondor
From Prompts to Proof Obligations: Formal Sidecars as an Epistemic Interface for Trustworthy ML
Junyu Ren
Savage Without Monotonicity
Shuo Li Liu, Jingni Yang
Explanations are a Means to an End: Decision Theoretic Explanation Evaluation
Ziyang Guo, Berk Ustun, Jessica Hullman
Interpretability Should Prioritise Use-Inspired Basic Research for AI Safety
Kola Ayonrinde
Fictionalism about Personas: Folk Psychology as an Interpretability Strategy
Weiming Sheng
AI Review Is a Systemic Risk to Peer Review: Toward a Blockchain-Supported Claim-Level Ledger for Accountability
Yibo Miao, Yichi Zhang, Yinpeng Dong
On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
Mingmeng Geng, Thierry Poibeau
AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs
Richard Ren, Kunyang Li, Mantas Mazeika, Wenyu Zhang, Yury Orlovskiy, Rishub Tamirisa, Wenjie Jacky Mo, Thuy Dung Nguyen, Long Phan, Steven Basart, Austin Meek, Aditya Mehta, Oliver Ingebretsen, Alice Blair, Brianna Adewinmbi, Vy Phan, Alice Gatti, Adam Khoja, Jason Hausenloy, Devin Kim, Dan Hendrycks
Vision-Language Asymmetry in Bistable Image Captioning
Arohan Agate
Measuring the Ruler: Reading Benchmark Saturation as Evidence
Sebastian M Schmon
Do LLMs Really Represent the World? A Teleosemantic Assessment of Pre-Trained, Fine-Tuned, and Agentic LLMs
Eliot Du Sordet
Where Does Prediction Error Come From When the Data Is Perfect? A Decomposition of the Model–World Gap in Predictive Uncertainty
Johanna Einsiedler, Rosa Lavelle-Hill, Constantin T. A. Wiegand
Reconciling Causality and Non-Equilibrium Thermodynamics with Hamiltonian Causal Models
Dario Rancati, Max Welling, Francesco Locatello
Towards Automated Evaluation of Socio-Technical Harms in LLMs: A Normative Taxonomy and Multi-Turn Red-Teaming Framework
Byeongho Lee, Hyundeuk Cheon
Explaining What Machine Learning Learns through Explainable AI
Jinyeong Gim
Procedural Generalization: A Resource-Sensitive Account of Knowing-How
Tomer Galanti, Saharsh Koganti, Priyadarsi Mishra, Pierfrancesco Beneventano
The Wrong Question? Artificial Consciousness and the Politics of AI Agency
Thierry Poibeau
Unsafe Consensus in Diagnostic Deliberation
Yuting Yan, Yinghao Fu, Haozhou Gao, Tianjian Zhang, Aoxi Liu, Shuang Li
Uncertainty as Perceptual Testimony in Vision-Language Models
Ahmad A Rushdi
Factuality Beyond Reference in LLMs
Thierry Poibeau
Reality and Practice: A Relational Reading of the Platonic Representation Hypothesis
Sebastian M Schmon
Privileged Self-Access Matters for Introspection in AI
Siyuan Song, Harvey Lederman, Jennifer Hu, Kyle Mahowald
Epistemic Misalignment in Human-AI Systems: A Four-Quadrant Taxonomy of Uncertainty
Mayank Kejriwal
Can Standard MARL Metrics Distinguish Communicative from Strategic Action?
Majid Ghasemi, Mark Crowley
Efficient Counterfactual Reasoning in ProbLog via Single-World Intervention Programs
Saimun Habib, Vaishak Belle, Fengxiang He
DeepSWIP: Single-World Counterfactual Semantics for DeepProbLog
Saimun Habib, Vaishak Belle, Fengxiang He
The Hawk Effect: Why We Need a Two-Dimensional Measure of Machine Intelligence
Fryderyk Kuzma
Counterfactuals Without Worlds: When ML Counterfactual Explanations Are Ill-Posed
Muhammet Anil Yagiz
Reliable for Whom? Directional Reliability in AI-Mediated Political Dialogue
Jaeyoun You
Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
Thomas Jiralerspong, Flemming Kondrup, Yoshua Bengio
Trustworthiness and co-cognition in artificial intelligence systems
Silvère Gangloff
Articulate Intuition or Genuine Analysis? Benchmarking Epistemic Reliability in LLM-as-a-Judge Peer Review
Nuo Chen, Bingsheng He

Organisers

ETH Zürich

ETH Zürich

Stanford

TU Nuremberg, Helmholtz AI

Jaesik Choi

KAIST

Konstantin Genin

University of Utah

Bernhard Schölkopf

MPI Tübingen

Call for papers

Submissions are now closed. Thank you for your interest in the PhilML workshop!

We invite short paper submissions (up to 4 pages, excluding references and appendix) from both philosophers and ML researchers on the following topics:

Epistemology of learning systems: knowledge, belief, evidence, justification, understanding, etc.
Uncertainty: interpretations of probability and credence; confidence, ignorance, ambiguity, etc.
Counterfactual reasoning: when counterfactual questions are well-posed, and what makes counterfactual answers meaningful.
Foundations of causal modelling: in particular, links between causal formalisms used in ML and philosophical accounts of causation.
Explainability and interpretability: explanation vs. prediction; understanding as a cognitive and social achievement; what counts as an explanation for whom, and why.
Reliability, robustness, and generalisation: principled notions of “reliability” beyond accuracy, statistical/philosophical perspectives on “reliable” scientific or societal use.

Submissions should be made by 11th May (Anywhere on Earth) on OpenReview.

Author guidelines

Format: All submissions must be in PDF format. Submissions are limited to four content pages. Unlimited additional pages are allowed for references and supplementary materials. Reviewers may choose to read the supplementary materials but will not be required to. Camera-ready versions may go up to five content pages.
Style file: You must format your submission using the ICML 2026 LaTeX style file. Please include the references and supplementary materials in the same PDF as the main paper.
Double-blind reviewing: The reviewing process will be double blind. As an author, you are responsible for anonymizing your submission. In particular, you should not include author names, author affiliations, or acknowledgements in your submission and you should avoid providing any other identifying information (even in the supplementary material).
LLM policy: The use of LLMs is permitted only as a writing assistance tool.
Dual-submission policy: We welcome ongoing and unpublished work. We will also accept papers that are under review at the time of submission, or that have been recently accepted for publication at a non-ML venue (i.e., any venue that is not ICML, NeurIPS, ICLR, or a similar conference or journal). Submissions published in venues for related fields (in particular, philosophy) are welcome.
Non-archival: The workshop is a non-archival venue and will not have official proceedings. Workshop submissions can be subsequently or concurrently submitted to other venues.
Visibility: Submissions and reviews will not be public. Only accepted papers will be made public.
Reciprocal reviewing: Authors of submitted works are encouraged to volunteer as reviewers for other submissions, to ensure a fair and high-quality review process.

For questions, please contact philml.icml26@gmail.com.

Programme committee

Florian Dorner (MPI Tübingen)

Liang Wendong (MPI Tübingen)

Cheongwoong Kang (KAIST)

Haksoo Lim (KAIST)

Won Jo (KAIST)

Junho Choi (KAIST)

Annika Schneider (Helmholtz Munich)

Nikos Papanikolaou (MPI Tübingen)

Javier Abad (ETH Zürich)

Sarah Martinson (ETH Zürich)

Hanti Lin (UC Davis)

Raphaël Millière (Oxford)

Moritz Miller (MPI Tübingen)

Sergio Hernan Garrido Mejia (MPI Tübingen)

Seongun Kim (KAIST)
