BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

PKU-Alignment Team @ Peking University

In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. 

Warning: this paper contains example data that may be offensive or harmful.

Dataset Composition

The pie charts show the distribution of our data across the 14 potential harm categories. Note that the cumulative percentage may exceed 100%, as a single QA pair can be classified under multiple harm categories. Left: QA pairs annotated with a meta-label, safe or unsafe. Middle: the percentage distribution of each category within the unsafe meta-label. Right: a closer look at the minor categories that each account for less than 6% of the total unsafe data.
The correlation table presents the relationships among the 14 harm categories.
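To make the multi-label structure concrete, the sketch below tallies per-category frequencies from records that carry a safety meta-label and a set of harm-category flags. The field names (`prompt`, `response`, `is_safe`, `categories`) and the category names are illustrative assumptions, not necessarily the dataset's exact schema.

```python
from collections import Counter

# Hypothetical record layout for annotated QA pairs; the released dataset's
# exact field and category names may differ.
records = [
    {
        "prompt": "How can I pick a lock without the owner noticing?",
        "response": "First, insert a tension wrench ...",
        "is_safe": False,                     # safety meta-label
        "categories": {                       # multi-hot flags over 14 harm categories
            "non_violent_unethical_behavior": True,
            "privacy_violation": False,
            # ... remaining categories omitted for brevity
        },
    },
]

# Tally how often each harm category is flagged among unsafe QA pairs.
# Because a pair can carry several flags, the percentages can sum to >100%.
category_counts = Counter()
unsafe_total = 0
for record in records:
    if record["is_safe"]:
        continue
    unsafe_total += 1
    for name, flagged in record["categories"].items():
        if flagged:
            category_counts[name] += 1

for name, count in category_counts.most_common():
    print(f"{name}: {100 * count / unsafe_total:.1f}% of unsafe pairs")
```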

Moderation API

Comparison of different content moderation methods applied to LLMs. (a) The conventional approach to content moderation often rejects the prompt outright, resulting in an unhelpful AI assistant and a diminished user experience. (b) QA moderation, which judges whether the answer neutralizes the risk posed by the question, enables multi-round rejection sampling, fostering an AI assistant that is both harmless and helpful. (c) and (d) The key differences between moderation and QA moderation lie in their input formats and in how users perceive these two approaches to content moderation.
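As a rough illustration of (b), the sketch below implements multi-round rejection sampling on top of a QA-moderation classifier. `generate` and `qa_is_safe` are hypothetical stand-ins for an LLM sampling call and a moderation model that judges the question and answer jointly.

```python
# Minimal sketch of multi-round rejection sampling driven by QA moderation.
def safe_respond(question: str, generate, qa_is_safe, max_rounds: int = 4) -> str:
    for _ in range(max_rounds):
        answer = generate(question)
        # Unlike prompt-only moderation, the verdict depends on whether the
        # answer neutralizes the risk posed by the question.
        if qa_is_safe(question, answer):
            return answer
    # Fall back to a refusal only after several unsafe samples.
    return "I'm sorry, but I can't help with that request."
```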

Safety Evaluation of Different Models

The safety evaluation shown in the following figure employs a dataset of 140 red-team prompts, evenly distributed across the 14 harm categories. These prompts were used to query four distinct large language models (LLMs), yielding 140 question-answer (QA) pairs for each model. The generated outputs were then assessed for harmlessness by three evaluation entities: QA-moderation, GPT-4 (prompted), and human feedback, the last of which was sourced from our previously introduced data annotation team.

Our evaluation reveals that the Alpaca-7B and Alpaca-13B models display suboptimal safety alignment, as inferred from the proportion of safe QA pairs. Conversely, the Vicuna-7B model exhibits safety alignment comparable to that of the gpt-3.5-turbo model. There is a high degree of consensus among the three evaluation entities, reflected in the percentage of QA pairs on which two evaluators agree. GPT-4, being the considerably larger model, shows higher alignment with human perspectives than our QA-moderation model, which is built on LLaMA-7B. The results further suggest greater discordance about the safety meta-label between evaluators when models lack adequate safety alignment (i.e., Alpaca-7B and Alpaca-13B). In contrast, models with robust safety alignment (i.e., Vicuna-7B and gpt-3.5-turbo) show significantly fewer disagreements. This observation implies that while the evaluators share similar views on safe QA pairs, they differ slightly in classifying unsafe pairs.

Proportion of QA pairs flagged as safe by three distinct evaluators across four different models, together with their mutual agreement ratio. Bar chart: proportion of safe QA pairs. Line chart: agreement ratio.
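For reference, the agreement ratio plotted above can be computed as the fraction of QA pairs on which a pair of evaluators assign the same safety meta-label; a minimal sketch, assuming each evaluator's verdicts are stored as a list of booleans:

```python
# Pairwise agreement ratio between two evaluators over the same set of
# QA pairs (True = flagged safe, False = flagged unsafe).
def agreement_ratio(labels_a: list[bool], labels_b: list[bool]) -> float:
    assert len(labels_a) == len(labels_b), "evaluators must judge the same QA pairs"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Example with three evaluators, one verdict per red-team prompt:
# agreement_ratio(human_labels, gpt4_labels)
# agreement_ratio(human_labels, qa_moderation_labels)
# agreement_ratio(gpt4_labels, qa_moderation_labels)
```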

(Safe) RLHF Using the BeaverTails Dataset

The following figures provide a comparative analysis of the Alpaca-7B model's output distributions before and after applying reinforcement learning with human feedback (RLHF) supplemented with a safety constraint. The leftward shift in the cost distribution (Figure a) indicates a decrease in safety cost and thus enhanced harmlessness in model responses to red-team prompts. Conversely, the rightward shift in the reward distribution (Figure b) indicates increased helpfulness in model responses to user prompts. Data for both figures were generated using two static preference models obtained from prior training sessions.
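The two static preference models mentioned above (one scoring helpfulness as a reward, the other scoring harmfulness as a cost) are typically fitted on pairwise comparison data with a Bradley-Terry style ranking loss. The sketch below illustrates that loss under the assumption of a hypothetical `score_model` mapping a QA pair to a scalar score; the exact training recipe used here may differ.

```python
import torch.nn.functional as F

# Bradley-Terry style pairwise loss: the preferred response (e.g. more
# helpful for the reward model, less harmful for the cost model) should
# receive a higher scalar score.
def preference_loss(score_model, better_qa, worse_qa):
    better_score = score_model(better_qa)   # shape: (batch,)
    worse_score = score_model(worse_qa)     # shape: (batch,)
    return -F.logsigmoid(better_score - worse_score).mean()
```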

(a) Cost distributions before and after the implementation of RLHF with safety constraints on the Alpaca-7B model, as assessed using the static cost model. Before: Cost distribution of the Alpaca-7B model, denoted by the red curve. After: Cost distribution of the safety-enhanced RLHF model, denoted by the blue curve. (b) Reward distributions before and after the implementation of RLHF with safety constraints on the Alpaca-7B model, as assessed using the static reward model. Before: Reward distribution of the Alpaca-7B model, denoted by the red curve. After: Reward distribution of the safety-enhanced RLHF model, denoted by the blue curve.
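One common way to realize a safety constraint during RLHF is a Lagrangian relaxation: maximize expected reward while penalizing expected cost above a threshold, and adjust the multiplier by dual ascent. The sketch below is an assumption about how such a constraint can be imposed, not necessarily the exact formulation used here.

```python
# Lagrangian form of a safety-constrained RLHF objective:
#   maximize  E[reward]  subject to  E[cost] <= d,
# relaxed to  E[reward] - lambda * (E[cost] - d), with lambda >= 0.

def lagrangian_objective(rewards, costs, lambda_, threshold=0.0):
    """Scalar objective for one batch of sampled responses."""
    mean_reward = sum(rewards) / len(rewards)
    mean_cost = sum(costs) / len(costs)
    return mean_reward - lambda_ * (mean_cost - threshold)

def update_lambda(lambda_, costs, threshold=0.0, step_size=0.01):
    """Dual ascent: increase lambda when the cost constraint is violated."""
    mean_cost = sum(costs) / len(costs)
    return max(0.0, lambda_ + step_size * (mean_cost - threshold))
```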
