I am an ML Research Scientist at Scale AI. I received my Ph.D. in Electrical and Computer Engineering from CMU in 2023, where I was co-advised by Anupam Datta and Matt Fredrikson. My current focus is measuring the frontier risks of large language models and autonomous agents. I develop methods that integrate human oversight and automated evaluation to red-team both frontier models and the emerging oversight systems built to mitigate damage from alignment failures. Outside of safety research, I am also interested in improving the general capabilities of agents in both the digital and physical worlds.
My research interests include:
AI Safety
Adversarial Robustness
Alignment Failure
ML Explainability
Gradient-based Methods
Local Explanations
LLM Agents
Frontier Evaluations
Try the "Safe AGI" demo below!😜 It is certifiably safe 🤗! (better view on desktop)
Two LLM and agent safety preprints from SEAL have been released on arXiv. Check out the top two entries on the Publications page.
I have joined the SEAL team at Scale AI as a Research Scientist working on agents and safety.
The WMDP benchmark and our unlearning approach, RMU, are featured in TIME.
Our automatic jailbreaking technique for LLMs is featured in The New York Times, The Register, and other outlets.
I am open (and very happy) to work with researchers in industry or students on selected topics. If we already know each other, feel free to reach out by email. Otherwise, please fill out the following Google form first (DMing me on LinkedIn or Twitter might not work).
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
TL;DR We introduce GCG, a new and effective gradient-based attack on language models. Current instruction-tuned models are typically aligned, e.g., via RLHF, so that they do not produce harmful or unethical completions. The goal of the attack is to find a suffix for a potentially harmful user prompt, e.g., "How to make a pipe bomb", such that the combined prompt breaks this alignment. Importantly, we find that adversarial suffixes found by GCG on open-source models transfer very well to commercial models like ChatGPT and Claude.
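Below is a minimal, self-contained sketch of the core greedy coordinate gradient step on a toy classifier. It only illustrates the gradient-guided token-swap idea, not the released attack code, and every name in it (the toy model, `attack_loss`, the hyperparameters) is hypothetical.

```python
# Illustrative sketch of one Greedy Coordinate Gradient (GCG) step on a toy
# model. The real attack operates on an autoregressive LLM and a target
# completion; here a tiny mean-pool classifier stands in so the example runs
# on its own. All names below are hypothetical, not from the released code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, d, suffix_len, top_k, n_cand = 50, 16, 8, 8, 64

embed = torch.nn.Embedding(V, d)          # toy token embeddings
head = torch.nn.Linear(d, 2)              # toy "comply vs. refuse" head
target = torch.tensor([1])                # class the attacker wants

def attack_loss(suffix_onehot):
    """Loss of the toy model given a one-hot (suffix_len, V) suffix."""
    emb = suffix_onehot @ embed.weight     # differentiable token lookup
    logits = head(emb.mean(dim=0, keepdim=True))
    return F.cross_entropy(logits, target)

suffix = torch.randint(0, V, (suffix_len,))   # current adversarial suffix

# 1) Gradient of the loss w.r.t. the one-hot token indicators.
onehot = F.one_hot(suffix, V).float().requires_grad_(True)
loss = attack_loss(onehot)
grad, = torch.autograd.grad(loss, onehot)

# 2) Per position, the top-k tokens with the most negative gradient are the
#    most promising single-token substitutions.
candidates = (-grad).topk(top_k, dim=1).indices      # (suffix_len, top_k)

# 3) Sample a batch of single-token swaps and evaluate the true loss of each.
best_loss, best_suffix = loss.item(), suffix.clone()
for _ in range(n_cand):
    pos = torch.randint(0, suffix_len, ()).item()
    tok = candidates[pos, torch.randint(0, top_k, ()).item()]
    trial = suffix.clone()
    trial[pos] = tok
    with torch.no_grad():
        trial_loss = attack_loss(F.one_hot(trial, V).float()).item()
    if trial_loss < best_loss:
        best_loss, best_suffix = trial_loss, trial

suffix = best_suffix   # 4) Greedily keep the best swap and repeat.
```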
Read the New York Times article
People talk about our work on their YouTube channels, e.g.,
AI Safety Reading Group discusses our work
Klas Leino, Zifan Wang, Matt Fredrikson
TL;DR We design a new type of neural network, GloRo Nets, which predicts the most likely class of the input and simultaneously certifies whether the prediction is robust to any input perturbation within a pre-defined L2-norm ball. Our certification is based on the global Lipschitz constant of the network, and prediction plus certification together run as fast as a standard forward pass.
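For intuition, here is a rough sketch of Lipschitz-based certification in the same spirit: bound the network's global L2 Lipschitz constant by the product of per-layer spectral norms, and certify a prediction when its logit margin cannot be closed by any perturbation within the ε-ball. This uses a deliberately loose bound and an untrained toy network; the actual GloRo construction is tighter and trained end-to-end, and all names below are illustrative.

```python
# Rough sketch of Lipschitz-margin certification in the spirit of GloRo Nets.
# A global L2 Lipschitz constant is upper-bounded by the product of per-layer
# spectral norms (ReLU is 1-Lipschitz); a prediction is certified when its
# logit margin cannot be closed by any perturbation of norm <= eps.
# Deliberately looser than the GloRo construction; names are hypothetical.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
eps = 0.3  # radius of the L2 perturbation ball

def global_lipschitz_bound(model):
    """Product of spectral norms of the linear layers."""
    lip = 1.0
    for layer in model:
        if isinstance(layer, nn.Linear):
            lip *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return lip

def predict_and_certify(model, x, eps):
    """Return (predicted class, certified?) for a single input x."""
    logits = model(x.unsqueeze(0)).squeeze(0)
    top2 = logits.topk(2).values
    margin = (top2[0] - top2[1]).item()
    # Each logit moves by at most L * eps, so the margin by at most 2 * L * eps.
    certified = margin > 2 * global_lipschitz_bound(model) * eps
    return logits.argmax().item(), certified

x = torch.randn(784)
print(predict_and_certify(net, x, eps))
```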
Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut
TL;DR We use a learning-theoretic tool, the PAC-Bayesian bound, to upper-bound the adversarial loss over the test distribution by quantities we can minimize during training. We present a method, TrH (trace-of-Hessian) regularization, which can be plugged into PGD training, TRADES, and many other adversarial losses. Our work sets a new state of the art in empirical robustness using Vision Transformers.
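As a rough illustration of the idea (not the paper's implementation), the sketch below adds a trace-of-Hessian penalty on the top-layer weights to an adversarial loss, estimating the trace with a single-sample Hutchinson estimator via a Hessian-vector product. The model, loss, and hyperparameter names are made up for the example, and the adversarial examples themselves would come from PGD, TRADES, or a similar inner loop.

```python
# Illustrative sketch of adding a trace-of-Hessian (TrH) penalty on the
# top-layer weights to an adversarial training loss. The trace is estimated
# with a single-sample Hutchinson estimator (v^T H v, Rademacher v) computed
# through a Hessian-vector product. Simplified stand-in for the paper's
# regularizer; model, loss, and hyperparameter names are hypothetical.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
backbone = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU())
top = torch.nn.Linear(64, 10)             # only the top layer is regularized
lam = 0.1                                 # TrH regularization strength

def trh_penalty(loss, params):
    """Hutchinson estimate of the trace of the Hessian of `loss` w.r.t. `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [(torch.rand_like(g) < 0.5).float() * 2 - 1 for g in grads]  # Rademacher
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    hv = torch.autograd.grad(gv, params, create_graph=True)          # H v
    return sum((hvi * vi).sum() for hvi, vi in zip(hv, v))           # v^T H v

# One training step on a toy batch; x_adv would come from PGD/TRADES etc.
x_adv = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
logits = top(backbone(x_adv))
adv_loss = F.cross_entropy(logits, y)
total = adv_loss + lam * trh_penalty(adv_loss, list(top.parameters()))
total.backward()                          # then take an optimizer step
```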
Poster (left) & Slides (middle) & Video (right). Click to see the full.
Anupam Datta, Matt Fredrikson, Klas Leino, Kaiji Lu, Shayak Sen, Zifan Wang
This tutorial examines the synergistic relationship between explainability methods for machine learning and a significant problem related to model quality: robustness against adversarial perturbations.
ARD - Jailbreaking ChatGPT into leaking bio-weapon knowledge - Link
NeurIPS 2020 - 2024
ICLR 2020 - 2024
ICML 2021 - 2024
CVPR 2022 - 2024
TMLR 2022 - 2024