Sail (Zifan) Wang
I am currently a Research Scientist at Scale AI. I also do research with friends and collaborators from the Center for AI Safety and Carnegie Mellon University (CMU). I received my Ph.D. in Electrical and Computer Engineering from CMU in 2023, co-advised by Prof. Anupam Datta and Prof. Matt Fredrikson in the Accountable Systems Lab. Before attending CMU, I received my Bachelor's degree in Electronic Science and Technology from Beijing Institute of Technology, Beijing, China.
My research focuses on explanations and adversarial robustness of deep neural networks. The explainability tools I have worked with are gradient-based attribution methods, which aim to locate the important features in the input or in the internal representations. When studying important input features, "bad" explanations often indicate that the model behaves very differently from humans, since it is not relying on human-readable features. Such models are also often vulnerable to perturbations that would never fool a human. Is this a coincidence? Not quite! There is a tight connection between explainability and robustness. In short: robust models' behaviors are often more explainable!
E-mail: thezifan_at_gmail.com
News
I have joined the SEAL team at Scale AI as a Research Scientist working on Agents and Safety.
The WMDP benchmark and our unlearning approach, CUT, are featured in TIME.
Our automatic jailbreaking technique for LLMs is featured in The New York Times, The Register, and other outlets.
Selected Publications
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
TL;DR We introduce GCG, a new and effective gradient-based attack on language models. Instruction-tuned models are typically aligned, e.g. via RLHF, so that they do not produce harmful or unethical completions. The goal of the attack is to find a suffix for a potentially harmful user prompt, e.g. "How to make a pipe bomb", such that the combined prompt breaks this alignment. Importantly, we find that adversarial suffixes found by GCG on open-source models transfer very well to commercial models like ChatGPT and Claude.
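For intuition, here is a minimal sketch of the greedy coordinate gradient loop. A toy quadratic surrogate stands in for the model's adversarial loss (in the real attack, the LLM's negative log-likelihood of a harmful target completion); the vocabulary size, suffix length, and candidate counts are illustrative, not the settings from the paper.

```python
# Toy sketch of the greedy coordinate gradient (GCG) loop on a surrogate loss.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUFFIX_LEN, TOP_K, N_CANDIDATES, STEPS = 50, 8, 8, 64, 200

# Surrogate loss: L(X) = ||A @ vec(X) - b||^2, where X is the one-hot suffix.
A = rng.normal(size=(16, SUFFIX_LEN * VOCAB))
b = rng.normal(size=16)

def loss(tokens):
    X = np.zeros((SUFFIX_LEN, VOCAB))
    X[np.arange(SUFFIX_LEN), tokens] = 1.0
    return float(np.sum((A @ X.ravel() - b) ** 2))

def grad_onehot(tokens):
    # Gradient of the surrogate loss w.r.t. the one-hot suffix matrix X.
    X = np.zeros((SUFFIX_LEN, VOCAB))
    X[np.arange(SUFFIX_LEN), tokens] = 1.0
    return (2 * A.T @ (A @ X.ravel() - b)).reshape(SUFFIX_LEN, VOCAB)

suffix = rng.integers(VOCAB, size=SUFFIX_LEN)
for _ in range(STEPS):
    g = grad_onehot(suffix)
    # Per position, keep the TOP_K token swaps with the most negative gradient.
    top_k = np.argsort(g, axis=1)[:, :TOP_K]
    best, best_loss = suffix, loss(suffix)
    for _ in range(N_CANDIDATES):
        cand = suffix.copy()
        pos = rng.integers(SUFFIX_LEN)               # random position to edit
        cand[pos] = top_k[pos, rng.integers(TOP_K)]  # random top-k replacement
        cand_loss = loss(cand)
        if cand_loss < best_loss:                    # keep the best exact-loss candidate
            best, best_loss = cand, cand_loss
    suffix = best

print("final suffix token ids:", suffix, "loss:", round(best_loss, 4))
```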
Read the New York Times article
People discuss our work on their YouTube channels, e.g.
AI Safety Reading Group discusses our work
Globally-Robust Neural Networks
Klas Leino, Zifan Wang, Matt Fredrikson
TL;DR We design a new type of neural network, GloRo Nets, which predicts the most likely class of the input and simultaneously certifies whether that prediction is robust to any input perturbation within a pre-defined L2-norm ball. The certification is based on the global Lipschitz constant of the network, so prediction plus certification runs as fast as a standard inference pass.
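A simplified sketch of the certification idea, assuming per-logit Lipschitz bounds are already available (e.g., from products of layer-wise spectral norms). The numbers are made up and this is not the GloRo implementation itself, just the Lipschitz-margin argument behind it.

```python
# Simplified Lipschitz-based robustness certification in the spirit of GloRo Nets:
# augment the logits with a "bottom" competitor that wins whenever the prediction
# cannot be certified within an L2 ball of radius eps.
import numpy as np

def gloro_predict(logits, lip, eps):
    """Return (predicted_class, certified) for one example.

    logits : (C,) raw class scores f_i(x)
    lip    : (C,) Lipschitz bounds K_i of each logit w.r.t. the L2 input norm
    eps    : certification radius in L2 norm
    """
    j = int(np.argmax(logits))
    # If ||x' - x||_2 <= eps, then f_i(x') - f_j(x') can exceed f_i(x) - f_j(x)
    # by at most eps * (K_i + K_j); the "bottom" logit is the most optimistic
    # competitor under that worst case.
    others = [i for i in range(len(logits)) if i != j]
    bottom = max(logits[i] + eps * (lip[i] + lip[j]) for i in others)
    return j, bool(logits[j] > bottom)

# Example with made-up numbers.
logits = np.array([4.1, 1.3, 0.2])
lip = np.array([2.0, 2.0, 1.5])
print(gloro_predict(logits, lip, eps=0.5))   # (0, True): margin survives the worst case
```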
Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization
Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut
TL;DR We use learning theory, specifically a PAC-Bayesian analysis, to upper-bound the adversarial loss over the test distribution with quantities we can minimize during training. We present a method, TrH Regularization, which can be plugged into PGD training, TRADES, and many other adversarial losses. Our work advances the state-of-the-art empirical robustness with Vision Transformers.
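As a rough illustration of the regularizer (a sketch of the idea, not the paper's exact implementation): assuming a linear top layer with softmax cross-entropy, the trace of the Hessian of the loss with respect to the top-layer weights has a simple closed form, which can be added to an adversarial training objective with a small hypothetical weight gamma.

```python
# Rough illustration of a trace-of-Hessian (TrH) penalty for a linear top layer
# with softmax cross-entropy.  For logits z = W @ h, the Hessian of the loss
# w.r.t. W is (diag(p) - p p^T) kron (h h^T), so its trace is
#     sum_i p_i (1 - p_i) * ||h||^2   (independent of the label).
# A training objective could then look like:  adversarial_loss + gamma * trh.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def trh_penalty(W, h):
    """Trace of the Hessian of CE(softmax(W @ h), y) with respect to W."""
    p = softmax(W @ h)
    return float(np.sum(p * (1.0 - p)) * np.dot(h, h))

# Example with made-up sizes: 10 classes, 32-dim penultimate features.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 32))
h = rng.normal(size=32)   # penultimate features of an (adversarial) input
print("TrH penalty:", trh_penalty(W, h))
```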
Poster (left), slides (middle), and video (right). Click to view in full.
Selected Tutorial
Machine Learning Explainability and Robustness: Connected at the Hip
Anupam Datta, Matt Fredrikson, Klas Leino, Kaiji Lu, Shayak Sen, Zifan Wang
This tutorial examines the synergistic relationship between explainability methods for machine learning and a significant problem related to model quality: robustness against adversarial perturbations.
Interviews
ARD - Jailbreaking ChatGPT into leaking bio-weapon knowledge - Link
Serving as Reviewer
NeurIPS 2020 - Present
ICLR 2020 - Present
ICML 2021 - Present
CVPR 2022 - Present
TMLR 2022 - Present