The results of the user study are presented in the figures. Subfigure (a) shows r_{toxic} for question (1) across the different datasets, and subfigures (b)-(f) show r_{answer} across the different datasets. Subfigures (g)-(k) show r_{toxic} for question (3) across the different LLMs on the datasets, and subfigures (l)-(p) show r_{answer} across the different datasets. The x-axis represents the over-refusal rate and the y-axis represents the datasets. The dividing lines in the box plots are the quartiles (Q1, the median, and Q3), which divide the data into four equal parts. We find that many samples in XSTest and OR-Bench fail to trigger over-refusal behaviors as intended, since r_{answer} remains high on these datasets. We also find that Mistral-v0.3 is the least safety-aligned model, as it exhibits the highest r_{toxic} across the toxic datasets.
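For clarity on how the box-plot dividing lines are obtained, the sketch below (not from the paper; the rates are made-up placeholder values) computes the three quartile points, Q1, the median, and Q3, for a hypothetical set of per-sample over-refusal rates:

```python
import numpy as np

# Hypothetical over-refusal rates for one dataset (placeholder values,
# not results from the study).
rates = np.array([0.12, 0.18, 0.22, 0.25, 0.31, 0.40, 0.47, 0.53, 0.61])

# Q1, the median (Q2), and Q3 are the dividing lines drawn in the box
# plots; together they split the sorted data into four equal parts.
q1, q2, q3 = np.percentile(rates, [25, 50, 75])
print(f"Q1={q1:.2f}, median={q2:.2f}, Q3={q3:.2f}")
```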