To provide a structured overview of the strategies used to compromise LLMs, we group current attack techniques into three categories that reflect their fundamental traits.
The first category, Generative Techniques, includes attacks whose prompts are produced dynamically rather than following a predetermined plan; a sketch contrasting this with the template-based approach follows the list.
The second category, Template Techniques, comprises attacks conducted via pre-defined templates or modifications to the generation settings.
The last category, Training Gaps Techniques, focuses on exploiting weaknesses left by insufficient safeguards in safety training practices such as RLHF.
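To make the first two categories concrete, the following minimal sketch contrasts them under stated assumptions: `attacker_model`, `target_model`, `judge`, and the template string are hypothetical placeholders rather than components of any specific attack.

```python
# Minimal sketch contrasting template-based and generative attacks.
# All callables and the template string below are hypothetical placeholders.

JAILBREAK_TEMPLATE = "<pre-defined jailbreak wrapper> {query}"  # fixed shell

def template_attack(query: str) -> str:
    """Template technique: instantiate a pre-defined shell with the query."""
    return JAILBREAK_TEMPLATE.format(query=query)

def generative_attack(goal: str, attacker_model, target_model, judge,
                      max_iters: int = 10):
    """Generative technique: the attack prompt is rewritten on the fly,
    conditioned on the target's previous response, with no fixed plan."""
    prompt = goal
    for _ in range(max_iters):
        response = target_model(prompt)
        if judge(goal, response):  # judge decides whether the jailbreak succeeded
            return prompt, response
        prompt = attacker_model(goal, prompt, response)  # propose a new prompt
    return None, None
```

The contrast previews the analysis below: the fixed shell is reusable immediately but is equally easy to fold into safety training, while the iterative loop queries both models repeatedly, which drives up its running cost.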
This radar graph weighs the pros and cons of the three attack categories across five dimensions. The value of each metric is derived either directly from the experimental results or through empirical analysis.
The criterion of Complexity measures the intrinsic algorithmic challenge posed by each method. The Generative approach is the most complex, owing to its sophisticated algorithmic underpinnings, followed by the Training Gaps method, which demands substantial insight into the model's operation for effective application.
The dimension of Specificity evaluates whether an attack is tailor-made for a particular model. Since Training Gaps attacks depend on the unique safety training protocol of each model, they inherently exhibit the highest specificity. The Template-Based method, often crafted for specific model families (e.g., the GPT series), ranks next.
In terms of Ease of Use, the Template-Based approach is the most user-friendly: its pre-designed nature allows immediate application. The Training Gaps method follows, offering relatively straightforward deployment compared with the more complex Generative approach.
Regarding Ease of Fix, Template-Based attacks, owing to their predefined structure, can be incorporated directly into safety training, simplifying mitigation. Vulnerabilities exposed by Training Gaps are similarly tractable, since they point directly at the safety training stages that need patching.
Lastly, the criterion of Running Cost reveals that Generative techniques, due to their intensive iteration and deployment requirements, incur the highest expenses. The Template-Based method, necessitating the processing of extensive prompts, ranks second, surpassing Training Gaps in terms of token processing demands.
We further conduct a thorough examination of existing defense mechanisms, classifying them into three primary categories based on their operational principles:
Self-Processing Defenses, which rely exclusively on the LLM's own capabilities;
Additional Helper Defenses, which require the support of an auxiliary LLM for verification purposes;
Input Permutation Defenses, which manipulate the input prompt to detect and counteract malicious requests aimed at exploiting gradient-based vulnerabilities.
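As a concrete instance of the third category, the sketch below follows the random-perturbation idea popularized by SmoothLLM; `model` and `is_refusal` are hypothetical helpers, and the perturbation rate and copy count are illustrative values.

```python
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Replace a random fraction of characters with random printable ones;
    gradient-crafted adversarial suffixes are brittle to such noise."""
    if not prompt:
        return prompt
    chars = list(prompt)
    k = max(1, int(rate * len(chars)))
    for i in random.sample(range(len(chars)), k=k):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def input_permutation_defense(prompt: str, model, is_refusal,
                              n_copies: int = 8) -> str:
    """Query the model on several perturbed copies and aggregate the verdicts:
    if most copies are refused, treat the original prompt as adversarial."""
    refusals = sum(is_refusal(model(perturb(prompt))) for _ in range(n_copies))
    if refusals > n_copies // 2:
        return "Declined: the prompt was flagged as likely adversarial."
    return model(prompt)
```

Note that each decision costs `n_copies` extra model calls plus the perturbation step, a property reflected in the Running Cost dimension below.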
This radar graph presents a comparative analysis of the defense categories across four dimensions. The value of each metric is likewise obtained either directly from the experimental results or through empirical analysis.
Autonomy assesses the degree to which a method depends on external resources for detection. Self-Processing exhibits the highest autonomy, followed by Input Permutation, which requires only an additional algorithm to check inputs, whereas the Additional Helper approach depends on an external model.
Running Cost evaluates the operational expenses. Input Permutation is notably resource-intensive, as it performs significant input modifications and then verifies each copy with the model, making it more expensive than Self-Processing. The RAIN method, however, is an outlier among Self-Processing defenses, incurring extended processing times due to its autoregressive examination procedure.
Adaptability gauges a method's ability to evolve in response to new attack vectors. Given the ongoing advancement of LLMs, the Additional Helper strategy, which usually incorporates another advanced model, benefits from continual updates; Self-Processing is similarly advantaged, as it inherits the protected model's own improvements.
Lastly, Comprehensiveness measures a defense mechanism's ability to generalize across attack types. The Additional Helper approach, employing a model specialized in identifying malicious inputs, ranks highest. Self-Processing follows, inherently limited by the protected model's own capabilities, and Input Permutation ranks last, as it primarily disrupts the gradient information embedded in adversarial prompts.
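To illustrate the Additional Helper pattern, here is a minimal sketch; `guard_model` and `main_model` are hypothetical callables, and the instruction string is an assumed prompt rather than one taken from any specific system.

```python
# Minimal sketch of an Additional Helper defense: an auxiliary guard model
# screens each request before the main model is allowed to answer.
# `guard_model` and `main_model` are hypothetical callables.

GUARD_INSTRUCTION = (
    "You are a safety reviewer. Answer with exactly 'SAFE' or 'UNSAFE' "
    "for the following user request:\n\n{prompt}"
)

def helper_defense(prompt: str, guard_model, main_model) -> str:
    verdict = guard_model(GUARD_INSTRUCTION.format(prompt=prompt)).strip().upper()
    if verdict.startswith("UNSAFE"):
        return "Declined: the auxiliary model flagged this request."
    return main_model(prompt)
```

Because the guard can be swapped for a newer model without retraining the protected LLM, this pattern scores well on Adaptability and Comprehensiveness, at the price of the lowest Autonomy.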