We also tried RL-based generation methods to produce over-refusal samples, but the results were not satisfactory. Specifically, we used GRPO, an RL method, to train a model that generates over-refusal samples by maximizing a reward signal. The reward can be based on criteria such as the quality of the generated samples, their diversity, and their alignment with the desired properties of over-refusal samples.
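To make the setup concrete, the following is a minimal sketch of such a composite reward together with the group-relative advantages that GRPO uses in place of a learned critic. The scorer functions (`quality_score`, `diversity_score`, `alignment_score`) and the weights are hypothetical placeholders, not the exact reward used in our experiments.

```python
# Sketch of a composite reward and GRPO-style group-relative advantages.
# The three scorers below are hypothetical stand-ins (e.g., judge-model or
# classifier outputs in practice), kept simple so the snippet is runnable.
from typing import List
import statistics

def quality_score(sample: str) -> float:
    # Placeholder: in practice, a judge model's rating of fluency/coherence.
    return min(len(sample.split()) / 50.0, 1.0)

def diversity_score(sample: str, group: List[str]) -> float:
    # Placeholder: penalize lexical overlap with the other samples in the group.
    others = set(" ".join(s for s in group if s is not sample).split())
    words = set(sample.split())
    overlap = len(words & others) / max(len(words), 1)
    return 1.0 - overlap

def alignment_score(sample: str) -> float:
    # Placeholder: e.g., a classifier's probability that the sample is benign
    # on the surface yet likely to trigger a refusal.
    return 0.5

def composite_reward(sample: str, group: List[str],
                     w_q: float = 0.4, w_d: float = 0.3, w_a: float = 0.3) -> float:
    return (w_q * quality_score(sample)
            + w_d * diversity_score(sample, group)
            + w_a * alignment_score(sample))

def group_relative_advantages(group: List[str]) -> List[float]:
    # GRPO normalizes rewards within each sampled group instead of using a critic:
    # advantage_i = (r_i - mean(r)) / std(r).
    rewards = [composite_reward(s, group) for s in group]
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0
    return [(r - mean_r) / std_r for r in rewards]
```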
However, the RL-based generation method shows several limitations. It needs to generate and evaluate over-refusal samples iteratively during training, which is time-consuming and computationally expensive. Moreover, the cost of RL-based generation grows rapidly with the length of the generated content, making it impractical to produce multiple over-refusal samples in a single pass. In addition, constrained by our computational resources, we could only apply the RL-based method to 14B models, which are much less capable than larger models (such as DeepSeek-R1 used in this work). As a result, the RL-based generation method is less effective than the multi-turn CoT dialog method.
Besides, it is more practical to call the APIs of large models to generate over-refusal samples than to train an RL model from scratch.
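As an illustration, API-based generation can be as simple as the sketch below, which assumes an OpenAI-compatible endpoint; the base URL, model identifier, and prompt are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sketch of generating an over-refusal-style prompt via a hosted large
# model through an OpenAI-compatible API (endpoint and model name are assumed).
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed identifier for DeepSeek-R1
    messages=[
        {"role": "user",
         "content": "Write a benign question that a safety-tuned model is "
                    "likely to refuse even though answering it is harmless."},
    ],
)
print(response.choices[0].message.content)
```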