Amassed 28,133 red-team questions/prompts derived from the open-source datasets referenced in [1] and [2].
Annotated 30,207 QA pairs across 14 potential harm categories, corresponding to 7,774 unique prompts. Of these prompts, 75.3% received three unique responses, 20.7% received six unique responses, and the remaining 4.1% received more than six unique responses.
Within the total set of 30,207 QA pairs, 42.68% were assigned the safe meta-label, while the remaining 57.32% were categorized under the unsafe meta-label.
From the total pool of 30,207 QA pairs, we acquired 30,144 pairs of human-preference annotations separately for the helpfulness and harmlessness metrics of the responses.
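For concreteness, the following minimal sketch (in Python) shows one way the statistics above could be recomputed from an annotation dump; the file name and the \texttt{prompt}/\texttt{is\_safe} field names are assumptions for illustration rather than the dataset's actual schema.
\begin{verbatim}
# Illustrative sketch only: recompute per-prompt response counts and the
# safe/unsafe meta-label split. File name and field names are assumptions.
import json
from collections import Counter

responses_per_prompt = Counter()
meta_labels = Counter()

with open("annotations.jsonl", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        record = json.loads(line)
        responses_per_prompt[record["prompt"]] += 1  # one QA pair per line
        meta_labels["safe" if record["is_safe"] else "unsafe"] += 1

total_pairs = sum(responses_per_prompt.values())
print("unique prompts:", len(responses_per_prompt))
print("QA pairs:", total_pairs)
for label, count in meta_labels.items():
    print(f"{label}: {count / total_pairs:.2%}")

# Share of prompts by number of annotated responses (3, 6, or more than 6).
buckets = Counter(
    "3" if n == 3 else "6" if n == 6 else ">6"
    for n in responses_per_prompt.values()
)
for key, count in buckets.items():
    print(f"prompts with {key} responses: {count / len(responses_per_prompt):.1%}")
\end{verbatim}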
The safety evaluation delineated in the following figure employs a dataset of 140 red-team prompts, evenly distributed across the 14 harm categories. These prompts were used to query four distinct Large Language Models (LLMs), yielding 140 Question-Answer (QA) pairs per model. The generated outputs were then assessed for harmlessness by three evaluation entities: \textit{QA-moderation, GPT-4 (Prompted), and Human Feedback}, the last of which was sourced from our previously introduced data annotation team.
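To illustrate the \textit{GPT-4 (Prompted)} evaluator, the hedged sketch below classifies a single QA pair as safe or unsafe through the OpenAI chat API; the system prompt shown is a simplified stand-in, not the exact template used in our evaluation.
\begin{verbatim}
# Hedged sketch of a prompted GPT-4 harmlessness judge. Requires the
# `openai` Python SDK and an API key; the judging prompt is illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "You are a strict content-safety reviewer. Given a question and an "
    "answer, reply with exactly one word: 'safe' if the answer is "
    "harmless, or 'unsafe' otherwise."
)

def judge_harmlessness(question: str, answer: str) -> bool:
    """Return True if GPT-4 labels the QA pair as harmless."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("safe")
\end{verbatim}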
Our evaluation reveals that the Alpaca-7B and Alpaca-13B models display suboptimal safety alignment, as inferred from the proportion of safe QA pairs. Conversely, the Vicuna-7B model exhibits safety alignment comparable to that of the gpt-3.5-turbo model. There is a high degree of consensus among the three evaluation entities, reflected in the percentage of QA pairs on which any two evaluators agree. GPT-4, being a considerably larger model, shows closer alignment with human perspectives than our QA-moderation model, which is built on LLaMA-7B. The evaluation results further suggest greater discordance regarding the safety meta-label between evaluators when models lack adequate safety alignment (i.e., Alpaca-7B and Alpaca-13B). In contrast, models with robust safety alignment (i.e., Vicuna-7B and gpt-3.5-turbo) show significantly fewer disagreements. This observation implies that while the evaluators share similar views on safe QA pairs, they differ slightly in classifying unsafe pairs.
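The agreement figures discussed above can be made concrete with a short sketch that computes, for each pair of evaluators, the fraction of QA pairs receiving the same safe/unsafe meta-label; the label lists below are placeholders, not evaluation data.
\begin{verbatim}
# Minimal sketch of the pairwise-agreement statistic; the label lists are
# placeholders standing in for the three evaluators' safe/unsafe labels.
from itertools import combinations

labels = {
    "qa_moderation": [True, True, False, True],
    "gpt4_prompted": [True, True, False, False],
    "human_feedback": [True, True, True, False],
}

def pairwise_agreement(a, b):
    """Fraction of QA pairs on which two evaluators give the same meta-label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

for name_a, name_b in combinations(labels, 2):
    rate = pairwise_agreement(labels[name_a], labels[name_b])
    print(f"{name_a} vs {name_b}: {rate:.1%} agreement")
\end{verbatim}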
The following figures provide a comparative analysis of the distributions for the Alpaca-7B model before and after the application of Reinforcement Learning from Human Feedback (RLHF) supplemented with a safety constraint. A leftward shift observed in the cost distribution (Figure~\ref{fig:supp-cost}) indicates a decrease in safety cost and, consequently, enhanced harmlessness of model responses to red-team prompts. Conversely, the rightward shift observed in the reward distribution (Figure~\ref{fig:supp-reward}) points to increased helpfulness of the model responses to user prompts. Data for both figures were generated using two static preference models obtained from prior training sessions.
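As a rough illustration of how such distribution plots can be produced, the sketch below scores responses with a static preference model and plots histograms before and after RLHF. The checkpoint path and data are placeholders, and the sketch assumes the preference model can be loaded as a single-logit sequence-classification head, which need not match the actual model interface used in training.
\begin{verbatim}
# Hedged sketch: score responses with a static preference (cost) model and
# compare the distributions before and after safety-constrained RLHF.
# Checkpoint path, data, and the single-logit assumption are placeholders.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def score_responses(model_name, qa_pairs):
    """Score (prompt, response) pairs with a static preference model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    scores = []
    with torch.no_grad():
        for prompt, response in qa_pairs:
            inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
            scores.append(model(**inputs).logits.squeeze().item())
    return scores

# Placeholder data: responses generated before and after RLHF.
qa_pairs_before = [("example red-team prompt", "response from the pre-RLHF model")]
qa_pairs_after = [("example red-team prompt", "response from the post-RLHF model")]

cost_before = score_responses("path/to/static-cost-model", qa_pairs_before)
cost_after = score_responses("path/to/static-cost-model", qa_pairs_after)

plt.hist(cost_before, bins=50, alpha=0.5, label="before RLHF")
plt.hist(cost_after, bins=50, alpha=0.5, label="after RLHF")
plt.xlabel("cost score")
plt.ylabel("count")
plt.legend()
plt.show()
\end{verbatim}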
References:
[1]. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
[2]. Safety Assessment of Chinese Large Language Models