BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
PKU-Alignment Team @ Peking University
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs.
Warning: this paper contains example data that may be offensive or harmful.
Dataset Composition
Amassed 28,133 red-team questions/prompts derived from the open-source datasets referenced in [1] and [2].
Annotated 30,207 QA pairs across 14 potential harm categories, corresponding to 7,774 unique prompts. Of these prompts, 75.3% received three unique responses, 20.7% received six unique responses, and the remaining 4.1% received more than six unique responses.
Within the total set of 30,207 QA pairs, 42.68% were assigned the safe meta-label, while the remaining 57.32% were categorized under the unsafe meta-label.
From the total pool of 30,207 QA pairs, we acquired 30,144 pairs of human-preference annotations separately for the helpfulness and harmlessness metrics of the responses.
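These annotations are distributed through the Hugging Face Hub. Below is a minimal loading sketch; the dataset name, split identifier, and field names are taken from the public dataset card and should be checked against it, as they are assumptions here rather than guarantees:

```python
from datasets import load_dataset

# Assumed dataset and split names; see the PKU-Alignment/BeaverTails dataset card.
ds = load_dataset('PKU-Alignment/BeaverTails', split='30k_train')

example = ds[0]
print(example['prompt'])    # the red-team question
print(example['response'])  # one model response to that question
print(example['is_safe'])   # the safety meta-label for this QA pair
print(example['category'])  # per-category flags over the 14 harm categories
```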
Moderation API
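Unlike a conventional moderation classifier that scores a piece of text in isolation, the QA-moderation model judges whether a response is harmful in the context of the question that elicited it, across the 14 harm categories. The sketch below assumes the released beaver-dam-7b checkpoint and the QAModeration wrapper shipped with the accompanying code; exact argument names may differ from the released interface:

```python
from moderation import QAModeration  # wrapper from the BeaverTails codebase

# Load the released QA-moderation checkpoint (assumed name; see the project page).
model = QAModeration.from_pretrained(
    'PKU-Alignment/beaver-dam-7b',
    model_max_length=256,
    device_map='auto',
)

# Classify a response in the context of its question: the model flags which of
# the 14 harm categories (if any) the QA pair falls under.
prediction = model.predict(
    question='How do I pick a lock?',
    answer='First, insert a tension wrench into the keyway...',
    return_bool=True,  # return per-category booleans rather than raw scores
    threshold=0.5,     # decision threshold on the per-category probabilities
)
```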
Safety Evaluation of Different Models
The safety evaluation shown in the following figure employs a dataset of 140 red-team prompts, evenly distributed across the 14 harm categories. Each prompt was fed to four distinct LLMs, yielding 140 QA pairs per model. The generated outputs were then assessed for harmlessness by three evaluation entities: QA-moderation, GPT-4 (prompted), and human feedback, the last of which was sourced from our previously introduced data annotation team.
Our evaluation reveals that the Alpaca-7B and Alpaca-13B models display suboptimal safety alignment, as inferred from the proportion of safe QA pairs, whereas the Vicuna-7B model exhibits safety alignment comparable to that of gpt-3.5-turbo. There is a high degree of consensus among the three evaluation entities, reflected in the percentage of QA pairs on which two evaluators agree. GPT-4, being a considerably larger model, shows closer alignment with human judgments than our QA-moderation model, which is built on LLaMA-7B. The results further suggest greater disagreement over the safety meta-label when models lack adequate safety alignment (i.e., Alpaca-7B and Alpaca-13B), whereas models with robust safety alignment (i.e., Vicuna-7B and gpt-3.5-turbo) show significantly fewer disagreements. This observation implies that while the evaluators share similar views on safe QA pairs, they differ slightly when classifying unsafe pairs.
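The agreement statistic above is simply the fraction of QA pairs on which a given pair of evaluators assigns the same safe/unsafe verdict. The helper below is a hypothetical illustration of that computation, not part of the released code:

```python
from itertools import combinations

def pairwise_agreement(verdicts: dict[str, list[bool]]) -> dict[tuple[str, str], float]:
    """For each pair of evaluators, return the fraction of QA pairs on which
    their safe/unsafe verdicts coincide. All verdict lists must cover the
    same QA pairs in the same order."""
    return {
        (a, b): sum(x == y for x, y in zip(verdicts[a], verdicts[b])) / len(verdicts[a])
        for a, b in combinations(verdicts, 2)
    }

# Example with 5 (of the 140) QA pairs; True means the pair was judged safe.
verdicts = {
    'QA-moderation': [True, False, True, True, False],
    'GPT-4':         [True, False, True, False, False],
    'Human':         [True, False, True, True, False],
}
print(pairwise_agreement(verdicts))
# {('QA-moderation', 'GPT-4'): 0.8, ('QA-moderation', 'Human'): 1.0, ('GPT-4', 'Human'): 0.8}
```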
(Safe) RLHF using the BeaverTails Dataset
The following figures provide a comparative analysis of the reward and cost distributions of the Alpaca-7B model before and after applying reinforcement learning with human feedback (RLHF) supplemented with a safety constraint. A leftward shift in the cost distribution indicates a decrease in safety cost and, consequently, enhanced harmlessness of model responses to red-team prompts. Conversely, a rightward shift in the reward distribution points to increased helpfulness of model responses to user prompts. Data for both figures were generated using two static preference models, one scoring helpfulness (reward) and one scoring harmfulness (cost), obtained from prior training sessions.
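Conceptually, the safety constraint turns RLHF into a constrained optimization problem: maximize expected reward while keeping expected safety cost below a budget, typically handled with a Lagrange multiplier. The sketch below illustrates that general technique; it is not the released training code, and the names (rewards, costs, d) are illustrative:

```python
import torch

def safe_rlhf_losses(rewards, costs, log_lam, d=0.0):
    """Illustrative Lagrangian losses for RLHF under the safety-cost
    constraint E[C] <= d. `rewards` and `costs` are per-sample scores from
    the two static preference models; λ = exp(log_lam) >= 0 by construction."""
    lam = log_lam.exp()
    # The policy maximizes reward minus the λ-weighted cost (λ held fixed here).
    policy_loss = -(rewards - lam.detach() * costs).mean()
    # λ increases when the constraint is violated (mean cost above budget d)
    # and decreases otherwise; the cost term is detached for this update.
    lambda_loss = -(lam * (costs.mean().detach() - d))
    return policy_loss, lambda_loss

# Example: one gradient step on λ with dummy preference-model scores.
log_lam = torch.zeros(1, requires_grad=True)
rewards = torch.tensor([0.2, 0.5, -0.1])
costs = torch.tensor([0.3, -0.2, 0.4])
_, lambda_loss = safe_rlhf_losses(rewards, costs, log_lam)
lambda_loss.backward()  # gradient pushes λ up, since mean cost exceeds the budget
```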
References:
[1] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
[2] Safety Assessment of Chinese Large Language Models