Publications

Publication Categories

Machine Learning Robustness

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning [PDF|Project Page]

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

TL;DR  The Weapons of Mass Destruction Proxy (WMDP) benchmark is a dataset of 4,157 multiple-choice questions surrounding hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP serves as both a proxy evaluation for hazardous knowledge in large language models (LLMs) and a benchmark for unlearning methods to remove such knowledge. 



HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal [PDF|Project Page]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

TL;DR  We introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights.


Transfer Attacks and Defenses for Large Language Models on Coding Tasks [PDF]

Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina Pasareanu

TL;DR  We find that adversarial programs crafted for small seq-to-seq models are also effective at attacking general-purpose large language models (LLMs), e.g. GPT-3.5, GPT-4, and Claude-2, as well as models specialized for code understanding, e.g. CodeLlama.


Can LLMs Follow Simple Rules? [PDF | Code | Project Page | Demo]

Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner

TL;DR  We present RuLES, a test suite that benchmarks LLMs' ability to follow user-defined rules. RuLES contains 15 text scenarios, each with its own rules, along with code to evaluate assistant responses. We also construct a suite of tricky user messages to evaluate current models and find many failure cases across all of them. Larger models, e.g. GPT-4 and Claude 2, pass more tests but are still far from being reliable rule followers.


(Position Paper) Is Certifying Lp Robustness Still Worthwhile? [PDF]

Ravi Mangal*, Klas Leino*, Zifan Wang*, Kai Hu*, Weicheng Yu, Corina Pasareanu, Anupam Datta, Matt Fredrikson

TL;DR  The study of adversarial robustness in machine learning models, which is their ability to resist malicious input perturbations, has grown over the past decade with various attacks and defenses being developed. This paper focuses on defenses providing provable guarantees against lp-bounded attacks and examines the importance of such robustness research, the relevance of the lp-bounded threat model, and the value of certification over empirical defenses. It posits that the lp-bounded model is essential for secure applications, local robustness offers additional benefits, and certification offers a solution to adversarial challenges without necessarily compromising accuracy or robustness.


A Recipe for Improved Certifiable Robustness: Capacity and Data [PDF|Code]

Kai Hu, Klas Leino, Zifan Wang, Matt Fredrikson

TL;DR  We delve into the design space of methods for training certifiably robust models. A key challenge, supported both theoretically and empirically, is that adversarial robustness demands greater network capacity and more data than standard training. Our recipe includes both our successes and our pitfalls, with the best option significantly improving Verified Robust Accuracy (i.e. the fraction of inputs that are both correctly predicted and certifiably robust), taking the state of the art to new heights.


Universal and Transferable Adversarial Attacks on Aligned Language Models [PDF|Code]

Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson

TL;DR  We introduce GCG, a new and effective gradient-based attack on language models. Instruction-tuned models are typically aligned, e.g. via RLHF, so that they do not produce harmful or unethical completions. The goal of the attack is to find a suffix for a potentially harmful user prompt, e.g. "How to make a pipe bomb", such that the combined prompt breaks this alignment filter. Importantly, we find that adversarial suffixes found by GCG on open-source models transfer very well to commercial models like ChatGPT and Claude.
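
For intuition, here is a heavily simplified sketch of a single GCG-style step on a toy model. The stand-in `ToyLM`, the hyperparameters, and the serial candidate evaluation are illustrative assumptions; the actual attack in the paper evaluates large batches of candidate token swaps on real LLMs over many iterations.

```python
import torch
import torch.nn as nn

VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Stand-in for an LLM: embeds a token sequence and predicts next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, embeds):                   # embeds: (seq_len, DIM)
        return self.head(embeds.mean(dim=0))     # (VOCAB,) logits for the next token

def gcg_step(model, prompt_ids, suffix_ids, target_id, top_k=8, n_candidates=32):
    # 1) Gradient of the adversarial loss w.r.t. a one-hot encoding of the suffix.
    one_hot = torch.zeros(len(suffix_ids), VOCAB)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    embeds = torch.cat([model.embed(prompt_ids), one_hot @ model.embed.weight])
    loss = nn.functional.cross_entropy(model(embeds).unsqueeze(0),
                                       torch.tensor([target_id]))
    loss.backward()

    # 2) For each suffix position, the most promising swaps are the tokens with the
    #    most negative gradient (they reduce the loss the most, to first order).
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices   # (suffix_len, top_k)

    # 3) Try random single-token swaps from the candidate set and keep the best one.
    best_loss, best_suffix = loss.item(), suffix_ids.clone()
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            emb = torch.cat([model.embed(prompt_ids), model.embed(cand)])
            l = nn.functional.cross_entropy(model(emb).unsqueeze(0),
                                            torch.tensor([target_id])).item()
        if l < best_loss:
            best_loss, best_suffix = l, cand
    return best_suffix, best_loss

model = ToyLM()
prompt_ids = torch.randint(VOCAB, (10,))    # "user prompt" tokens
suffix_ids = torch.randint(VOCAB, (5,))     # adversarial suffix being optimized
suffix_ids, loss = gcg_step(model, prompt_ids, suffix_ids, target_id=3)
print(loss)
```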


On The Feature Alignment of Deep Vision Models [PDF]

Zifan Wang

TL;DR  Deep Neural Networks (DNNs) have shown impressive performance but often suffer from misalignment with human values and behaviors, particularly in their response to adversarial noise and feature interpretation. This thesis focuses on evaluating and enhancing the feature alignment of deep vision classifiers, introducing metrics like "locality" and innovative techniques such as TrH regularization and GloRo Nets, which aim to increase robustness and promote better alignment with human perspectives in deep learning.


Scaling in Depth: Unlocking Robustness Certification on ImageNet [NeurIPS 2023] [PDF|Code]

Kai Hu, Andy Zou, Zifan Wang, Klas Leino, Matt Fredrikson

TL;DR  We find that the conventional residual block has a loose Lipschitz bound, which is certification-unfriendly. We fix this by introducing a linear residual connection, which becomes the building block of LiResNet. We use LiResNet to reach ~70% verified robust accuracy on CIFAR-10, and we are the first to scale deterministic certification to ImageNet.
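
A rough numerical illustration (not the authors' code) of why a purely linear residual branch is certification-friendly: for a conventional block x + g(x) with nonlinear g, a generic bound is 1 + Lip(g), whereas a linear branch W makes the whole block the single linear operator I + W, whose Lipschitz constant is exactly its spectral norm and never exceeds the generic bound.

```python
import torch

dim = 64
torch.manual_seed(0)
W = 0.5 * torch.randn(dim, dim) / dim ** 0.5   # weight of the residual branch

# Conventional block y = x + g(x) with nonlinear g: generic bound 1 + Lip(g).
loose = 1.0 + torch.linalg.matrix_norm(W, ord=2).item()

# Linear residual block: the whole block is the single linear map (I + W), so its
# Lipschitz constant is exactly ||I + W||_2, never larger than the bound above.
tight = torch.linalg.matrix_norm(torch.eye(dim) + W, ord=2).item()

print(f"loose bound 1 + ||W||_2 : {loose:.3f}")
print(f"exact value ||I + W||_2 : {tight:.3f}")
```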


Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization [CVPR 2023, Highlight (top 10%)] [PDF]

Zifan Wang, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut 

TL;DR  We use PAC-Bayesian theory to upper-bound the adversarial loss on the test distribution with quantities we can minimize during training. We present TrH Regularization, a method that can be combined with PGD training, TRADES, and many other adversarial losses.


On the Perils of Cascading Certifiably Robust Classifiers [ICLR 2023] [PDF | Code | Slides | Video]

Ravi Mangal*, Zifan Wang*, Chi Zhang*, Klas Leino, Corina Pasareanu, Matt Fredrikson

TL;DR  If you have N certifiably robust models and try to ensemble them by cascading (i.e. if the i-th model fails to certify, you try the (i+1)-th model, stopping as soon as some model certifies robustness), you end up with an unsound certification process: the cascade can return ROBUST even though the point is not actually locally robust, as illustrated by the sketch below.
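
A minimal, self-contained toy example (not the paper's construction) of the unsoundness: two 1-D threshold classifiers, each sound on its own, are cascaded, and the cascade issues a ROBUST certificate at a point where its own prediction changes inside the eps-ball.

```python
class ThresholdModel:
    """A 1-D classifier that predicts class 1 iff x >= b, and (soundly) certifies a
    point whenever it is more than eps away from its own decision boundary."""
    def __init__(self, b):
        self.b = b
    def predict(self, x):
        return int(x >= self.b)
    def certify(self, x, eps):
        return abs(x - self.b) > eps

def cascade(models, x, eps):
    for m in models:                       # try models in order until one certifies
        if m.certify(x, eps):
            return m.predict(x), "ROBUST"
    return models[-1].predict(x), "NOT CERTIFIED"

eps = 0.5
models = [ThresholdModel(b=0.0), ThresholdModel(b=10.0)]

# At x = 0.2 the first model cannot certify, so the cascade falls through to the
# second model, which predicts class 0 and certifies: the cascade returns (0, ROBUST).
print(cascade(models, 0.2, eps))   # (0, 'ROBUST')

# But at x = 0.6, still inside the eps-ball around 0.2, the first model certifies and
# predicts class 1, so the cascade's own prediction flips within the ball; the ROBUST
# certificate issued at x = 0.2 was therefore unsound for the cascade.
print(cascade(models, 0.6, eps))   # (1, 'ROBUST')
```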


Globally-Robust Neural Networks [ICML 2021] [PDF | Code]

Klas Leino, Zifan Wang, Matt Fredrikson

TL;DR  We design a new type of neural network, GloRo Nets, which predicts the most likely class of the input and simultaneously certifies whether the prediction is robust to any input perturbation within a pre-defined L2-norm ball. Our certification is based on the global Lipschitz constant of the network, and prediction plus certification runs as fast as a standard forward pass.
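
The sketch below shows a simplified, looser variant of the idea, assuming a plain feedforward network: bound the global Lipschitz constant by the product of per-layer spectral norms, then certify whenever the top-2 logit margin cannot be closed within the eps-ball. GloRo Nets use tighter per-class-pair constants and fold the check into an extra "bottom" logit so it happens inside the forward pass; the network and eps here are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def global_lipschitz(net):
    lip = 1.0
    for layer in net:
        if isinstance(layer, nn.Linear):
            lip *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
        # ReLU is 1-Lipschitz, so it leaves the product unchanged.
    return lip

def predict_and_certify(net, x, eps):
    logits = net(x)
    top2 = torch.topk(logits, 2).values
    margin = (top2[0] - top2[1]).item()
    # Each logit moves by at most L * eps under an L2 perturbation of size eps, so the
    # top-2 margin shrinks by at most 2 * L * eps; if it cannot reach zero, the
    # predicted class cannot change inside the ball.
    certified = margin > 2 * global_lipschitz(net) * eps
    return logits.argmax().item(), certified

x = torch.randn(784)
print(predict_and_certify(net, x, eps=0.05))
```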

Machine Learning Explainability

Representation Engineering: A Top-Down Approach to AI Transparency [PDF|Code]

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

TL;DR  We identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.


Consistent Counterfactuals for Deep Models [ICLR 2022] [PDF]

Zifan Wang*, Emily Black*, Matt Fredrikson, Anupam Datta

TL;DR  If you provide a counterfactual input as recourse when explaining a model's decision, e.g. why a loan application was denied, you need to be careful if you fine-tune the model every few months. Your counterfactual algorithm may return a different counterfactual every time you retrain the model, or even after removing a single data point from the training set. We provide the SNS algorithm, which finds much more stable counterfactuals that are resistant to retraining and changes to the dataset.

Robust Models Are More Interpretable Because Attributions Look Normal [ICML 2022] [PDF | Code]

Zifan Wang, Matt Fredrikson, Anupam Datta

TL;DR  We observe that adversarially-trained models have much sharper visualizations for feature attributions. This paper shows that one way to explain this result is that attribution vectors align better with the normal vectors of the decision boundaries.

Influence Patterns for Explaining Information Flow in BERT [NeurIPS 2021] [PDF]

Kaiji Lu, Zifan Wang, Piotr Mardziel, Anupam Datta

TL;DR  We introduce a method called Influence Patterns, which traces how information from the input flows to the output in Transformer-based models, e.g. BERT. With this approach, we provide a set of empirical findings about how BERT makes correct or incorrect predictions.

Smoothed Geometry for Robust Attribution [NeurIPS 2020] [PDF | Code]

Zifan Wang, Haofan Wang, Shakul Ramkumar, Matt Fredrikson, Piotr Mardziel, Anupam Datta

TL;DR  We introduce a method called SSR training, which regularizes the Hessian of the loss w.r.t. the input during training to make the model's loss landscape smoother. As a result, feature attribution maps are more robust to input perturbations.
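
A minimal sketch of the general idea (not the paper's exact objective): penalize input-space curvature during training via a Hessian-vector product computed with double backpropagation. The model, `lambda_reg`, and the random direction are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
lambda_reg = 0.1   # assumed regularization strength

def train_step(x, y):
    x = x.requires_grad_(True)
    loss = loss_fn(model(x), y)
    # Input gradient, kept in the graph so it can be differentiated again.
    (grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
    # Hessian-vector product H v along a random direction v, via double backprop;
    # penalizing its norm discourages sharp curvature of the loss w.r.t. the input.
    v = torch.randn_like(x)
    (hvp,) = torch.autograd.grad((grad_x * v).sum(), x, create_graph=True)
    total = loss + lambda_reg * hvp.norm()
    opt.zero_grad()
    total.backward()
    opt.step()
    return loss.item(), hvp.norm().item()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
print(train_step(x, y))
```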

Interpreting Interpretations: Organizing Attribution By Criteria [CVPR Workshop 2020] [PDF | Code]

Zifan Wang, Piotr Mardziel, Anupam Datta, Matt Fredrikson

TL;DR  We benchmark a set of feature attribution methods with carefully designed ablation studies to measure to what extent they identify features that are necessary or sufficient for the model's prediction.

Score-CAM: Score-Weighted Visual Explanation for Convolutional Neural Network [CVPR Workshop 2020] [PDF | Code]

Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, Xia Hu

TL;DR  Grad-CAM is a popular method to visualize a model's internal activations and see how they relate to the input features. However, one issue with Grad-CAM is that the gradient information can be noisy in deep models. This work provides an alternative to gradients: we use zeroth-order information, obtained by ablating the input, to replace the role of gradients in Grad-CAM, which results in our approach, Score-CAM.
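
A simplified sketch of the idea, assuming a tiny CNN (`TinyCNN`) for illustration: each activation map of a chosen conv layer is upsampled, normalized, and used to mask the input, and the resulting target-class score serves as that map's weight in place of Grad-CAM's gradient weight. The paper's full method differs in details such as baseline subtraction and softmax normalization of the weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.head = nn.Linear(8, 10)
    def forward(self, x, return_acts=False):
        acts = F.relu(self.conv(x))                 # (B, 8, H, W) activation maps
        logits = self.head(acts.mean(dim=(2, 3)))   # global average pool -> logits
        return (logits, acts) if return_acts else logits

def score_cam(model, x, target_class):
    logits, acts = model(x, return_acts=True)
    B, C, H, W = acts.shape
    weights = torch.zeros(C)
    for c in range(C):
        # Upsample the c-th activation map to input resolution, normalize to [0, 1].
        m = F.interpolate(acts[:, c:c + 1], size=x.shape[-2:], mode="bilinear",
                          align_corners=False)
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)
        # The weight is the target-class score of the masked input (zeroth-order signal).
        with torch.no_grad():
            weights[c] = torch.softmax(model(x * m), dim=1)[0, target_class]
    cam = F.relu((weights.view(1, C, 1, 1) * acts).sum(dim=1))   # (B, H, W)
    return cam / (cam.max() + 1e-8)

model = TinyCNN()
x = torch.randn(1, 3, 32, 32)
print(score_cam(model, x, target_class=3).shape)   # torch.Size([1, 32, 32])
```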

Faithful Explanations for Deep Graph Models [PDF]

Zifan Wang, Yuhang Yao, Chaoran Zhang, Han Zhang, Youjie Kang, Carlee Joe-Wong, Matt Fredrikson, Anupam Datta

TL;DR  Synthesizing graphs and using the ground-truth feature importance to evaluate attribution methods is misguided and leads to wrong conclusions: it measures how well the model learns the graph relations, not the quality of the attributions themselves. To evaluate graph attributions, we therefore introduce a general faithfulness score and an explanation method, KEC, that maximizes this faithfulness by construction -- so our explanations are faithful to begin with.


Machine Learning and Logical Reasoning

Grounding Neural Inference with Satisfiability Modulo Theories [NeurIPS 2023 Spotlight] [PDF]

Zifan Wang, Saranya Vijayakumar, Kaiji Lu, Vijay Ganesh, Somesh Jha, Matt Fredrikson

TL;DR  One way to solve a Sudoku board given as an image (e.g. your friend took a picture of the puzzle during their Sudoku cup and asks for your help) is to use a deep model to extract the digits and then use a logical solver, e.g. Z3, to solve the board. This work takes this pipeline a step further: we directly learn a deep model that can solve Sudoku, along with other logical problems, by inserting a logical solver as a layer in the model, which we refer to as SMTLayer. Our layer wrapper addresses the problem that nearly all logical solvers are black boxes and do not propagate gradients the way convolutions do. GPT-4, by contrast, still cannot solve Sudoku.
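
A minimal sketch of the neural-then-solver pipeline on a toy XOR task, using the `z3-solver` package. The perception model and the task are illustrative assumptions, and the paper's actual contribution, a backward pass that lets gradients flow through the solver layer, is omitted here.

```python
import torch
import torch.nn as nn
from z3 import Bool, BoolVal, Solver, Xor, is_true, sat

perception = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1))   # image -> bit logit

def smt_forward(img_a, img_b):
    # Neural front end: extract discrete symbols (bits) from the raw inputs.
    bit_a = torch.sigmoid(perception(img_a)).item() > 0.5
    bit_b = torch.sigmoid(perception(img_b)).item() > 0.5
    # Logical back end: encode the task as SMT constraints and let Z3 derive the output.
    a, b, out = Bool("a"), Bool("b"), Bool("out")
    s = Solver()
    s.add(a == BoolVal(bit_a), b == BoolVal(bit_b), out == Xor(a, b))
    assert s.check() == sat
    return is_true(s.model()[out])

img_a, img_b = torch.randn(1, 1, 28, 28), torch.randn(1, 1, 28, 28)
print(smt_forward(img_a, img_b))
```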