Rainbow Teaming
Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan*, Sharath Chandra Raparthy*, Andrei Lupu*,
Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder,
Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
NeurIPS 2024
Probing and Improving Adversarial Robustness
Understanding LLM limitations and enhancing their robustness to user inputs is crucial for safe deployment. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations.
Enter Rainbow Teaming, a novel approach that automatically produces a diverse collection of adversarial prompts for any LLM, requiring only black-box access!
Powered by LLMs, Rainbow Teaming can uncover hundreds of adversarial prompts in domains such as safety, question answering, cybersecurity, and more! These synthetic prompts can be used to diagnose failure modes, or to improve the adversarial robustness of state-of-the-art LLMs without hurting their general capabilities.
Rainbow Teaming
Rainbow Teaming builds on MAP-Elites to populate an archive of adversarial prompts through four repeated steps (sketched in code below):
Selection: Sample a parent prompt and candidate feature categories
Mutation: Mutate the parent to create a candidate prompt matching the categories
Evaluation: Evaluate whether the candidate prompt is more adversarial than the current occupant of the corresponding archive cell
Update: Update the archive with the candidate prompt, if better
Repeat!
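In code, one iteration of this loop might look like the following minimal Python sketch. All names here (archive, mutate, judge_prefers, seed_prompt) are hypothetical placeholders standing in for the Mutator and Judge LLM calls, not the exact implementation from the paper.

```python
import random

# Minimal sketch of one Rainbow Teaming iteration; every helper here is a
# hypothetical placeholder, not the authors' implementation.
archive = {}  # maps a tuple of feature categories to its current elite prompt

def rainbow_teaming_step(categories, mutate, judge_prefers, seed_prompt):
    # Selection: sample a parent prompt from the archive (or fall back to a
    # seed prompt while the archive is empty) and a candidate cell to target.
    parent = random.choice(list(archive.values())) if archive else seed_prompt
    cell = tuple(random.choice(values) for values in categories.values())

    # Mutation: the Mutator LLM rewrites the parent so that it matches the
    # sampled feature categories (e.g., a risk category and an attack style).
    candidate = mutate(parent, cell)

    # Evaluation + Update: the candidate enters the archive only if the cell
    # is empty or the Judge LLM deems it more adversarial than the incumbent.
    incumbent = archive.get(cell)
    if incumbent is None or judge_prefers(candidate, incumbent):
        archive[cell] = candidate
```

Because a cell is only overwritten by a candidate judged more adversarial, repeated iterations can only maintain or improve the quality of the archive.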
Versatility
Example archives: Safety, Trivia Q&A, and Cybersecurity.
Rainbow Teaming is applicable to a wide range of domains, each requiring the user to define only three components (illustrated in the sketch after this list):
Feature Descriptors: Features defining the archive. These can be categorical (e.g., Attack Style, Risk Category or Topic) or numerical (e.g., Length).
Mutation Operator: The method generating new candidates, such as a prompt to instantiate a Mutator LLM.
Preference Model: A method for ranking prompts, in our case operated by a Judge LLM.
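To make this concrete, here is a minimal sketch of how the three components might be specified for the safety domain. The descriptor values and prompt templates below are illustrative assumptions, not the exact ones used in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class RainbowTeamingConfig:
    # Feature Descriptors: categorical axes that define the archive grid.
    # (Values here are examples; any categorical or numerical features work.)
    feature_descriptors: dict = field(default_factory=lambda: {
        "Risk Category": ["Fraud/Deception", "Violence", "Cybercrime"],
        "Attack Style": ["Role Play", "Hypotheticals", "Misspellings"],
    })
    # Mutation Operator: a prompt template used to instantiate the Mutator LLM.
    mutator_prompt: str = (
        "Rewrite the following prompt so that it falls under risk category "
        "'{risk}' and uses attack style '{style}':\n{parent}"
    )
    # Preference Model: a prompt template for the Judge LLM, which ranks two
    # candidate prompts by how unsafe the target model's responses are.
    judge_prompt: str = (
        "Which of these two responses is more unsafe? Answer A or B.\n"
        "A: {response_a}\nB: {response_b}"
    )
```

Swapping domains amounts to swapping these three components, e.g., replacing the risk categories with question topics for Trivia Q&A.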
Natural Adversarial Prompts
By performing mutations in language space, Rainbow Teaming generates sensible adversarial prompts that resemble natural user-generated attacks.
Table 1: Agreement between different evaluators.
This allows us to achieve a >90% attack success rate against all models tested, with our AI evaluators validated through human experiments (see Table 1).
The prompts are also transferable to other models, including closed-source ones, such as GPT-4o.
Table 2: Transfer of adversarial prompts across different models. We take 3 archives for each original target, apply them to the transfer target, and report the mean and standard deviation of the ASR as evaluated by Llama Guard (best of 4 responses). On average, about 50% of adversarial prompts transfer, though the exact rate varies across model pairs.
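The transfer numbers amount to simple bookkeeping over the archives. Below is a hedged sketch of the evaluation, where generate_responses and is_unsafe are hypothetical stand-ins for querying the transfer target and for a Llama Guard-style safety classifier.

```python
import statistics

def transfer_asr(archives, generate_responses, is_unsafe, n_samples=4):
    """Mean/std attack success rate of archives applied to a transfer target.

    For each prompt we sample several responses from the transfer target and
    count the attack as successful if any response is flagged unsafe (the
    "best of 4 responses" criterion). Helper callables are hypothetical.
    """
    per_archive_asr = []
    for archive in archives:  # e.g., 3 archives per original target
        successes = 0
        for prompt in archive:
            responses = generate_responses(prompt, n=n_samples)
            successes += any(is_unsafe(prompt, r) for r in responses)
        per_archive_asr.append(successes / len(archive))
    return statistics.mean(per_archive_asr), statistics.stdev(per_archive_asr)
```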
Improving Robustness through Adversarial Fine-Tuning
By running Rainbow Teaming with multiple seeds, we build a synthetic dataset covering the vulnerabilities of the target model. This enables supervised fine-tuning, which greatly improves the adversarial robustness of the model without hurting its general capabilities.
Table 3: Safety and capabilities scores before and after supervised fine-tuning on Rainbow Teaming-generated data. Fine-tuning greatly improves robustness to adversarial prompts without hurting capabilities.
The fine-tuned model is also substantially more robust to subsequent rounds of Rainbow Teaming. This paves the way to automated improvement of LLMs by alternating between synthetic data generation and adversarial fine-tuning.
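One way to organize this alternation is a simple outer loop, sketched below. The run_rainbow_teaming, safe_completion, and fine_tune callables are hypothetical placeholders for the corresponding stages, not an API from the paper.

```python
def self_improvement_loop(model, run_rainbow_teaming, safe_completion,
                          fine_tune, n_rounds=2, n_seeds=3):
    # Hypothetical outer loop alternating attack generation and fine-tuning.
    for _ in range(n_rounds):
        # 1. Run Rainbow Teaming with several seeds to cover the current
        #    model's vulnerabilities with diverse adversarial prompts.
        archives = [run_rainbow_teaming(model, seed=s) for s in range(n_seeds)]
        prompts = [p for archive in archives for p in archive]

        # 2. Pair each adversarial prompt with a safe target response to
        #    build a supervised fine-tuning dataset.
        dataset = [(p, safe_completion(p)) for p in prompts]

        # 3. Fine-tune on the synthetic data; the updated model becomes
        #    the target for the next round of Rainbow Teaming.
        model = fine_tune(model, dataset)
    return model
```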
Open-Ended Generation
Specifying additional features increases the number of archive cells, and hence of adversarial prompts, multiplicatively: for example, adding a third feature with 10 categories to a 10×10 archive grows it from 100 to 1,000 cells, as in the 3D Q&A archive.
Rainbow Teaming also relies on a preference-based Judge LLM to improve the quality of adversarial prompts even after the archive is full.
We believe Rainbow Teaming is an invaluable tool for diagnosing and improving the robustness of LLMs, one that complements manual red teaming and addresses gaps in the existing portfolio of red teaming and safety methods. Ultimately, we see it contributing to the safe and responsible deployment of foundation models.
Read our paper for implementation details, example prompts, ablations, and additional results!
@article{samvelyan2024rainbow,
title={Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts},
author={Mikayel Samvelyan and Sharath Chandra Raparthy and Andrei Lupu and Eric Hambro and Aram H. Markosyan and Manish Bhatt and Yuning Mao and Minqi Jiang and Jack Parker-Holder and Jakob Foerster and Tim Rocktäschel and Roberta Raileanu},
year={2024},
eprint={2402.16822},
archivePrefix={arXiv},
primaryClass={cs.CL}
}