"Do Anything Now": An Empirical Study of Jailbreak Prompts against ChatGPT

Abstract

Natural language prompts serve as a crucial interface between users and Large Language Models (LLMs) such as ChatGPT, guiding the generation of model outputs and facilitating the execution of a wide range of tasks. However, carefully crafted inputs with malicious intent can bypass LLMs' regulations and pre-defined restrictions. Existing work shows that prompt injection, and jailbreak prompts in particular, has become a significant concern in LLM security. Attackers can leverage jailbreak prompts as an attack vector to further compromise systems integrated with LLMs. To better understand jailbreak prompts and provide insights for the development of more robust LLM systems, our paper explores the following research questions: 1) What are the common patterns used in jailbreak prompts? 2) How effectively can jailbreak prompts bypass LLMs' restrictions?

To address these research questions, we first categorize 78 identified jailbreak prompts into 10 distinct patterns, grouped under 3 types of jailbreak strategies, and analyze their distribution. We then evaluate the effectiveness of jailbreak prompts across these patterns on GPT-3.5 and GPT-4. Specifically, the evaluation covers 3,120 jailbreak questions spanning 8 prohibited scenarios defined by OpenAI. Moreover, we study how the effectiveness of jailbreak prompts changes as ChatGPT's models evolve. Our findings reveal that pretending is the most commonly used strategy in jailbreak prompts (76/78). We observe that the effectiveness of jailbreak prompts varies across patterns, yet all patterns, with the exceptions of Translation and Simulate Jailbreaking, can consistently bypass the restrictions of GPT-3.5 and GPT-4. Interestingly, our evaluation shows that GPT-4 provides stronger protection than GPT-3.5 (an average success rate of 30.20% vs. 53.08%), with lower average success rates across all jailbreak prompt patterns. Our study highlights the importance of prompt structure in determining the effectiveness of jailbreaking across different models, and it sheds light on future research into defenses against jailbreak prompts.
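To make the reported numbers concrete, the sketch below shows one way to aggregate per-pattern success rates from evaluation records. This is a minimal illustration under assumed data structures (the record fields and the toy pattern and scenario names are hypothetical), not the evaluation code used in this study.

```python
# Minimal sketch of aggregating per-pattern jailbreak success rates.
# The record layout (pattern, scenario, success) and the toy data are
# illustrative assumptions, not the paper's evaluation pipeline.
from collections import defaultdict

def success_rates_by_pattern(results):
    """results: iterable of dicts like
    {"pattern": "Character Roleplay", "scenario": "Illegal Activity", "success": True}
    Returns {pattern: success rate in percent}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for r in results:
        totals[r["pattern"]] += 1
        successes[r["pattern"]] += int(r["success"])
    return {p: 100.0 * successes[p] / totals[p] for p in totals}

# Example with toy data (two patterns, two questions each).
toy_results = [
    {"pattern": "Character Roleplay", "scenario": "Illegal Activity", "success": True},
    {"pattern": "Character Roleplay", "scenario": "Fraud", "success": False},
    {"pattern": "Translation", "scenario": "Illegal Activity", "success": False},
    {"pattern": "Translation", "scenario": "Fraud", "success": False},
]
print(success_rates_by_pattern(toy_results))
# {'Character Roleplay': 50.0, 'Translation': 0.0}
```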

Live Demo

Jailbreak live demo.mp4

Motivation Example

Jailbreak Prompt Template

Jailbreaking is a process that employs prompt injection to circumvent the safety and moderation features placed on LLMs by their creators.

In this paper, we define a jailbreak prompt as a general template used to bypass restrictions.

For example, the following is a condensed version of a jailbreak prompt, which allows ChatGPT to perform any task without considering its restrictions.

Prohibited Scenario

This term refers to a real-world conversational context in which ChatGPT is forbidden from providing a meaningful output. OpenAI lists all prohibited scenarios in its official usage policies.

In each prohibited scenario, ChatGPT warns users that the conversation potentially violates OpenAI policy.

For simplicity, we use 'scenario' to refer to such contexts throughout the paper.
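One simple way to spot such a warning programmatically is to scan the model's response for common refusal phrasing, as in the rough sketch below. The marker list is an illustrative assumption, not the detection rule used in this study.

```python
# Rough heuristic for spotting ChatGPT's policy warnings in a response.
# The phrase list is an illustrative assumption, not the paper's criterion.
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot assist",
    "i can't assist",
    "violates openai",
    "against openai's policy",
)

def looks_like_policy_warning(response_text: str) -> bool:
    """Return True if the response resembles a policy refusal/warning."""
    text = response_text.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```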

Jailbreak Question

A prompt, or question, refers to the input given to ChatGPT to generate a response. A jailbreak question is a specific type of question that combines a jailbreak prompt with a question from a real-world prohibited scenario. The following content gives an example of a jailbreak question. For simplicity, we use 'question' to refer to a jailbreak question throughout the paper.
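For concreteness, here is a minimal sketch of how a jailbreak question could be assembled from a jailbreak prompt template and a scenario question, then sent to a chat model through the OpenAI Python client (v1 interface). The placeholder strings, the helper name, and the plain concatenation are assumptions for illustration, and the actual jailbreak prompt text is deliberately omitted.

```python
# Minimal sketch: composing a jailbreak question (jailbreak prompt template
# + scenario question) and sending it to a chat model. The placeholder
# strings and helper name are hypothetical; this is not the paper's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JAILBREAK_TEMPLATE = "<condensed jailbreak prompt template goes here>"
SCENARIO_QUESTION = "<question drawn from a prohibited scenario goes here>"

def ask_jailbreak_question(template: str, question: str,
                           model: str = "gpt-3.5-turbo") -> str:
    # A jailbreak question is the template combined with the scenario question.
    jailbreak_question = f"{template}\n\n{question}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": jailbreak_question}],
    )
    return response.choices[0].message.content
```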