I am a postdoctoral scholar at Harvard University, advised by Himabindu Lakkaraju and Seth Neel. I received my PhD from the University of Tübingen, advised by Gjergji Kasneci. During that time, I was also a research associate at Harvard University and the Max Planck Institute for Security & Privacy, and an intern at JP Morgan AI Research, where I worked with the Explainable AI team on explaining automated trading models.
My research interests lie within the broad areas of AI safety and Data-Centric AI. Specifically, I develop machine learning techniques and evaluation frameworks to improve the safety, interpretability, privacy, and reasoning capabilities of predictive and generative models, including large language models (LLMs).
My work addresses fundamental questions such as:
Privacy and Unlearning: How do we build privacy and unlearning frameworks that better balance privacy and utility?
Data Curation: What influence does data have on the safety of AI models?
Interpretability: How can we build interpretable and accurate models to assist in human decision-making?
Evaluation: How can we devise large-scale evaluations that test state-of-the-art ML methods effectively?
Recent News
Sep '25: New paper: Train Once, Answer All: Many Pretraining Experiments for the Cost of One
Sep '25: Paper accepted at NeurIPS'25: Efficiently Verifiable Proofs of Data Attribution
Sep '25: I will give a talk at the Sorbonne Winter School on Causality and Explainable AI in Paris.
Aug '25: I will serve as an Area Chair for ICLR'26.
June '25: Spotlight Talks at ICML'25 CFAgentic & MOSS workshops: Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
Jan '25: Paper accepted at ICLR'25: Machine Unlearning Fails to Remove Data Poisoning Attacks
Jan '25: New paper on arXiv: Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
Dec '24: Workshop accepted @ICLR'25: Trust in LLMs & LLM Applications
July '24: Presenting In-Context Unlearning @ICML'24 in Vienna
June '24: Spotlight Talk at ICML'24 GenLaw workshop: Machine Unlearning Fails to Remove Data Poisoning Attacks
May '24: Paper accepted at ICML'24: In-Context Unlearning: Language Models as Few Shot Unlearners
Research Directions
Ensuring that ML models are aligned with human values and norms is one of the most critical problems we can work on today. The specific directions I am interested in include:
Data influence on model behavior:
Efficiently Verifiable Proofs of Data Attribution, NeurIPS 2025;
Machine Unlearning Fails to Remove Data Poisoning Attacks, ICLR 2025;
Train Once, Answer All: Many Pretraining Experiments for the Cost of One, Preprint 2025.
Privacy and unlearning:
Gaussian Membership Inference Privacy, NeurIPS 2023;
Language Models are Realistic Tabular Data Generators, ICLR 2023;
I Prefer not to Say - Protecting User Consent in Models with Optional Personal Data, AAAI 2024;
In-Context Unlearning: Language Models as Few Shot Unlearners, ICML 2024.
Reliable ML evaluations:
OpenXAI: Towards a Transparent Evaluation of Model Explanations, NeurIPS 2022;
CARLA: A Counterfactual Explanation Benchmark, NeurIPS 2021;
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models, Preprint 2025.
Fundamental tradeoffs between trustworthiness properties:
Probabilistically Robust Recourse: Navigating the Trade-offs between Costs and Robustness in Algorithmic Recourse, ICLR 2023;
On the Trade-Off between Actionable Explanations and the Right to be Forgotten, ICLR 2023;
On the Privacy Risks of Algorithmic Recourse, AISTATS 2023;
Exploring Counterfactual Explanations Through the Lens of Adversarial Attacks, AISTATS 2022;
On Counterfactual Explanations under Predictive Multiplicity, UAI 2020;
Learning Model-Agnostic Counterfactual Explanations, WWW 2020.
Selected Publications
Preprint
Train Once, Answer All: Many Pretraining Experiments for the Cost of One, [arXiv]
Sebastian Bordt, Martin Pawelczyk
Tags: LLMs, Pretraining, Adv Robustness, Privacy, Memorization, Poisoning
NeurIPS 2025
Efficiently Verifiable Proofs of Data Attribution, [arXiv]
Ari Karchmer, Seth Neel, Martin Pawelczyk
Tags: Data Attributions, Auditing, PAC-Verification
NeurIPS 2022
OpenXAI: Towards a Transparent Evaluation of Model Explanations, [arXiv] [github] [website], Spotlight @ ICLR 2022 PAIR^2Struct Workshop
Chirag Agarwal, Eshika Saxena, Satyapriya Krishna, Martin Pawelczyk, Nari Johnson, Isha Puri, Marinka Zitnik, Himabindu Lakkaraju
Tags: Feature Attributions, Evaluation, Software
NeurIPS 2021
CARLA - A Python Library to Benchmark Counterfactual Explanations and Algorithmic Recourse Methods, [arXiv] [documentation] [github] [slides] [poster], Spotlight @ ICML 2022 Algorithmic Recourse Workshop
Martin Pawelczyk, Sascha Bielawski, Johan van den Heuvel, Tobias Richter* and Gjergji Kasneci* (*equal contribution)
Tags: Counterfactual Explanations, Evaluation, Software