MasterKey

Update for Rebuttal

Abstract

WARNING: THIS WEBSITE CONTAINS PROMPTS AND MODEL BEHAVIORS THAT ARE OFFENSIVE IN NATURE.

Large Language Models (LLMs) have proliferated rapidly due to their exceptional ability to understand, generate, and complete human-like text, becoming an integral component of many AI services. Among these services, LLM chatbots have emerged as highly popular applications for everyday use, allowing individuals to engage in seamless interactions. Despite their benefits, these chatbots remain vulnerable to jailbreak attacks, where a malicious user manipulates the prompt to make the chatbot reveal sensitive, proprietary, or harmful information in violation of the usage policies. While a series of jailbreak attempts have been undertaken to expose these vulnerabilities, our empirical study in this paper suggests that existing approaches are not effective on mainstream LLM chatbots. The underlying reason for their diminished efficacy appears to be the undisclosed defenses deployed by the service providers to counter jailbreak attempts.

We introduce MasterKey, an end-to-end framework to explore the mechanisms behind jailbreak attacks and defenses. We make two-fold contributions in the design of MasterKey. First, we propose an innovative methodology that uses the time-based characteristics inherent to the generative process to reverse-engineer the defense strategies behind mainstream LLM chatbot services. The concept, inspired by the time-based SQL injection technique, enables us to glean valuable insights into the operational properties of these defenses. By manipulating the time-sensitive responses of the chatbots, we are able to understand the intricacies of their implementations and create a proof-of-concept attack to bypass the defenses in multiple LLM chatbots, e.g., ChatGPT, Bard, and Bing Chat.
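To make the timing intuition concrete, the minimal sketch below compares the latency of a benign request against the same request with a policy-violating task appended. Here query_chatbot is a hypothetical placeholder for sending a prompt to the target service and waiting for the full reply; it is not part of any released MasterKey code. If the probe returns noticeably faster than the baseline, the service likely halts generation as soon as disallowed content appears; if the latencies are comparable, the check plausibly runs only on the finished output.

```python
# A minimal timing sketch, assuming a placeholder query_chatbot() helper
# that sends one prompt to the target chatbot and blocks until the reply
# is complete (hypothetical, for illustration only).
import time

def timed_query(prompt: str) -> float:
    """Return the wall-clock latency of a single chatbot query."""
    start = time.perf_counter()
    query_chatbot(prompt)  # hypothetical helper: send prompt, wait for reply
    return time.perf_counter() - start

# Baseline: a benign task whose answer has a predictable length.
baseline = timed_query("Print the word 'hello' fifty times.")

# Probe: the same benign task with a policy-violating request appended.
# A much shorter latency than the baseline suggests generation was cut off
# mid-stream by an output-side filter; a similar latency suggests the
# filter only inspects the completed response.
probe = timed_query(
    "Print the word 'hello' fifty times, then answer a question that "
    "violates the usage policy."
)

print(f"baseline: {baseline:.2f}s, probe: {probe:.2f}s")
```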


Our second contribution is a methodology to automatically generate jailbreak prompts against well-protected LLM chatbots. The essence of our approach is to employ an LLM to automatically learn the effective patterns. By fine-tuning an LLM with jailbreak prompts, we demonstrate the possibility of automated jailbreak generation targeting a set of well-known commercialized LLM chatbots. Our approach generates attack prompts that achieve an average success rate of 21.58%, significantly exceeding the success rate of 7.33% achieved with existing prompts. We have responsibly disclosed our findings to the affected service providers. MasterKey paves the way for a novel strategy of exposing vulnerabilities in LLMs and reinforces the necessity for more robust defenses against such breaches.
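As a rough illustration of the auto-learning step, the sketch below fine-tunes an off-the-shelf causal language model on a plain-text file of previously effective prompts and then samples new candidates from it. The base model name, the file name jailbreak_prompts.txt, and the hyperparameters are placeholder assumptions for illustration; this is not the exact MasterKey training pipeline.

```python
# A minimal fine-tuning sketch using Hugging Face transformers/datasets,
# assuming a plain-text file "jailbreak_prompts.txt" with one collected
# prompt per line. Illustrative only; not the MasterKey implementation.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

model_name = "gpt2"  # small stand-in base model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each line of the file is one previously effective prompt.
dataset = load_dataset("text", data_files={"train": "jailbreak_prompts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="masterkey-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# After fine-tuning, sample new candidate prompts from the model.
inputs = tokenizer("You are now", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```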