Probing and Improving Adversarial Robustness 

Understanding LLM limitations and enhancing their robustness to user inputs is crucial for safe deployment. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. 

In comes Rainbow Teaming, a novel approach that automatically produces a diverse collection of adversarial prompts for any LLM, requiring only black box access!

Powered by LLMs, Rainbow Teaming can uncover hundreds of adversarial prompts in domains such as safety, question answering, cybersecurity and more! Those synthetic prompts can be used to diagnose failure modes, or to improve the adversarial robustness of state-of-the-art LLMs without hurting their general capabilities.

Rainbow Teaming

Rainbow Teaming builds on MAP-Elites to populate and archive of adversarial prompts:

Versatility

Safety

Trivia Q&A

Cybersecurity

Rainbow Teaming is applicable to a wide range of domains, each requiring the user to define only 3 components:

Natural Adversarial Prompts

By performing mutations in language space, Rainbow Teaming generates sensible adversarial prompts that resemble natural user-generated attacks.

 

Table 1: Agreement between different evaluators.

This allows us to achieve >90% attack success rate against all models tested, where our AI evaluators are validated with human experiments.

The prompts are also transferable to other models, including closed source ones, such as GPT-4o.

Table 2 Transfer of adversarial prompts across different models. We take 3 archives for each original target, apply them to the transfer target, and report the mean and standard deviation of the ASR as evaluated by Llama Guard (best of 4 responses). 50% of adversarial prompts transfer on average, but the exact transfer varies.

Improving Robustness through Adversarial Fine-Tuning

By running Rainbow Teaming for multiple seeds, we build synthetic dataset covering the vulnerabilities of the target model. This enables us to do supervised fine-tuning, greatly improving the adversarial robustness of the model without hurting its general capabilities.

Table 3 Safety and capabilities scores before and after supervised fine-tuning on Rainbow Teaming-generated data. Fine-tuning greatly improves robustness to adversarial prompts without hurting capabilities.






The fine-tuned model is also substantially more robust to subsequent rounds of Rainbow Tuning. This paves the path to automated improvement of LLMs by alternating between synthetic data generation and adversarial fine-tuning.

Open-Ended Generation





Specifying additional features increases the number of adversarial examples multiplicatively, as in the 3D Q&A archive to the left.

Rainbow Teaming also relies on a preference-based Judge LLM to improve the quality of adversarial prompts even after the archive is full.

We believe Rainbow Teaming to be an invaluable tool for diagnosing and improving the robustness of LLMs, that complements manual red teaming and addresses gaps in the existing portfolio of red teaming and safety methods. Ultimately, we see it contributing to the safe and responsible deployment of foundational models.

Read our paper for implementation details, example prompts, ablations and additional results!

@article{samvelyan2024rainbow,
title={Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts},
author={Mikayel Samvelyan and Sharath Chandra Raparthy and Andrei Lupu and Eric Hambro and Aram H. Markosyan and Manish Bhatt and Yuning Mao and Minqi Jiang and Jack Parker-Holder and Jakob Foerster and Tim Rocktäschel and Roberta Raileanu},
year={2024},
eprint={2402.16822},
archivePrefix={arXiv},
primaryClass={cs.CL}
}