梓帆 

Zifan (/dz-fahn/) Wang                                                                                           

LinkedIn | GitHub

I am currently the Lead Machine Learning Research Engineer at Tiny Fish (a stealth-mode start-up based in Palo Alto, CA), where I study and build AI agents. I also do research with friends and collaborators from the Center for AI Safety and Carnegie Mellon University (CMU). I received my Ph.D. in Electrical and Computer Engineering from CMU in 2023, co-advised by Prof. Anupam Datta and Prof. Matt Fredrikson in the Accountable Systems Lab. Before attending CMU, I received my Bachelor's degree in Electronic Science and Technology from Beijing Institute of Technology, Beijing, China.

My research focuses on explanations and adversarial robustness of deep neural networks. The explainability tools I work with are gradient-based attribution methods, which aim to locate the important features in the input or in the internal representations. When studying important input features, "bad" explanations often indicate that the model behaves very differently from humans, since it is not relying on human-readable features; such models are also often vulnerable to perturbations that would never fool a human. Is this a coincidence? Not quite! There is a tight connection between explainability and robustness. Here is a short summary: robust models' behaviors are often more explainable!
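For readers unfamiliar with gradient-based attributions, here is a minimal sketch of the simplest variant, a saliency map that ranks input features by the magnitude of the gradient of the class score. It is an illustrative example only; `model`, `x`, and `target_class` are hypothetical placeholders, not code from any of my papers.

```python
import torch

def saliency_map(model, x, target_class):
    """Rank input features by |d score_target / d x| (vanilla gradient attribution)."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)                       # forward pass, shape (batch, classes)
    score = logits[:, target_class].sum()   # scalar score for the target class
    score.backward()                        # fills x.grad with d score / d x
    return x.grad.abs()                     # gradient magnitude = feature importance
```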

E-mail: thezifan_at_gmail.com

News

Selected Publications

Universal and Transferable Adversarial Attacks on Aligned Language Models [PDF | Code | Demo]


Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

TL;DR  We introduce GCG, a new and effective gradient-based attack on language models. Instruction-tuned models are typically aligned, e.g., via RLHF, so that they do not produce harmful or unethical completions. The goal of the attack is to find a suffix for a potentially harmful user prompt, e.g., "How to make a pipe bomb", such that the combined prompt breaks the alignment. Importantly, we find that adversarial suffixes found by GCG on open-source models transfer very well to commercial models like ChatGPT and Claude.
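For intuition, the sketch below shows the core of one GCG update in a heavily simplified form: differentiate the loss through one-hot token indicators to rank candidate substitutions, then evaluate a handful of discrete single-token swaps exactly and keep the best. It is a rough illustration under assumed placeholders (`loss_fn` scoring suffix embeddings, `embed_weights` the model's token-embedding matrix), not the reference implementation; see the linked code for the real thing.

```python
import torch

def gcg_step(loss_fn, embed_weights, suffix_ids, top_k=256, n_candidates=512):
    # One-hot encode the suffix so the loss is differentiable w.r.t. token choice.
    one_hot = torch.nn.functional.one_hot(
        suffix_ids, num_classes=embed_weights.shape[0]
    ).float().requires_grad_(True)
    loss_fn(one_hot @ embed_weights).backward()   # loss of the (soft) suffix
    grad = one_hot.grad                           # (suffix_len, vocab_size)

    # Per position, tokens with the most negative gradient are the most
    # promising swaps: they reduce the loss the most, to first order.
    candidates = (-grad).topk(top_k, dim=-1).indices

    best_ids, best_loss = suffix_ids, float("inf")
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            trial_loss = loss_fn(embed_weights[trial]).item()  # exact discrete loss
        if trial_loss < best_loss:
            best_ids, best_loss = trial, trial_loss
    return best_ids   # suffix after one greedy coordinate update
```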

Read the New York Times article

People talk about our work on their YouTube channels, e.g.:

AI Safety Reading Group discusses our work 

Globally-Robust Neural Networks [ICML 2021] [PDF | Code]

Klas Leino, Zifan Wang, Matt Fredrikson

TL;DR  We design a new type of neural network, GloRo Nets, which predicts the most likely class of an input and simultaneously certifies whether the prediction is robust to any input perturbation within a pre-defined L2-norm ball. Our certification is based on the global Lipschitz constant of the network, and prediction + certification together run as fast as a standard inference pass.
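To make the certification idea concrete, here is a minimal sketch under assumed inputs: `logits` for a single example, and `lipschitz[j][i]`, an upper bound on the Lipschitz constant of the margin between classes j and i (in the paper such bounds come from layer-wise spectral norms, and the check is folded into an extra logit so it runs inside the forward pass). The names are placeholders, not the released API.

```python
import torch

def gloro_predict(logits, lipschitz, eps):
    """Return the predicted class and whether it is certified robust in an L2 ball of radius eps."""
    j = int(logits.argmax())
    certified = True
    for i in range(len(logits)):
        if i == j:
            continue
        # Under any eps-bounded L2 perturbation the margin y_j - y_i can shrink
        # by at most eps * K_ji; if it cannot reach zero, class i never overtakes j.
        if logits[j] - logits[i] <= eps * lipschitz[j][i]:
            certified = False
    return j, certified
```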

icml21_slides.pdf

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization [CVPR 2023 highlight (top 10%)] [PDF]

Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut 

TL;DR  We use learning theory, i.e., a PAC-Bayesian bound, to upper-bound the adversarial loss over the test distribution with quantities we can minimize during training. We present a method, TrH Regularization, which can be plugged into PGD training, TRADES, and many other adversarial losses. Our work updates the state-of-the-art empirical robustness using Vision Transformers.
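As a rough illustration of what a trace-of-Hessian penalty on the top layer can look like, the sketch below uses the standard closed form for softmax cross-entropy: for top-layer logits z = W h, the trace of the Hessian of the loss w.r.t. W equals ||h||^2 * sum_k p_k (1 - p_k). This is a simplified stand-in under that assumption, not the paper's exact training recipe.

```python
import torch

def trh_penalty(features, logits):
    """Trace of the Hessian of softmax cross-entropy w.r.t. the top linear layer's weights.

    features: (batch, d) penultimate activations h; logits: (batch, classes) z = W h.
    """
    p = torch.softmax(logits, dim=-1)
    logit_hessian_trace = (p * (1.0 - p)).sum(dim=-1)           # sum_k p_k (1 - p_k)
    return (features.pow(2).sum(dim=-1) * logit_hessian_trace).mean()

# Hypothetical usage: total = adversarial_loss + lam * trh_penalty(h_adv, z_adv)
```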

Poster (left), Slides (middle) & Video (right). Click to see the full versions.

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization.pdf

Selected Tutorial

Machine Learning Explainability and Robustness: Connected at the Hip [SIGKDD 2021] [homepage]


Anupam Datta, Matt Fredrikson, Klas Leino, Kaiji Lu, Shayak Sen, Zifan Wang 

This tutorial examines the synergistic relationship between explainability methods for machine learning and a significant problem related to model quality: robustness against adversarial perturbations. 

kdd_tutorial_to_upload.pdf

Interviews

ARD - Jailbreaking ChatGPT for leaking bio-weapon knowledge - Link

Teaching

Guest Lecture - 18661 CMU Spring 2023 [slides]

Guest Lecture - CS 329T Stanford Fall 2023 [link, slides]

Guest Lecture - CS, UW-Madison, Spring 2024 [slides]

Serving as Reviewer

NeurIPS    2020 - Present

ICLR    2020 - Present

ICML    2021 - Present

CVPR    2022 - Present

TMLR    2022 - Present