LLMs are integral to a wide range of real-world applications but remain susceptible to various abnormal behaviors. In this work, we concentrate on three primary threats: Jailbreak Attacks, Hallucinations, and Backdoor Attacks, each of which can significantly compromise both reliability and security. Consequently, there is a pressing need for robust detection mechanisms to ensure the safe deployment of LLMs.
Jailbreak Attacks aim to bypass the safety and ethical constraints of LLMs by exploiting model vulnerabilities. These attacks typically involve crafting inputs that elicit prohibited outputs, such as harmful or unethical content. Jailbreak methods can be either manual, where users directly craft malicious inputs, or automated, using optimization techniques such as gradient-based attacks to evade safety filters.
Hallucinations occur when LLMs produce outputs that are fluent and syntactically well-formed but factually incorrect or contextually irrelevant. These errors can manifest in various forms, such as contradictory responses or the generation of content disconnected from the input. Hallucinations compromise the reliability of LLMs, especially in domains where factual accuracy is critical, such as healthcare, finance, and law.
Backdoor Attacks involve embedding hidden triggers within an LLM that allow attackers to manipulate the model's behavior covertly. These triggers remain dormant under normal conditions but activate abnormal behaviors when specific inputs are encountered. Backdoors are typically achieved through data poisoning or parameter manipulation, and they can lead to harmful outputs when activated, presenting severe security risks.
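To make the data-poisoning route concrete, the following is a minimal, purely illustrative sketch of how an attacker with access to the training pipeline could implant a trigger. The toy sentiment dataset, the trigger token `cf_trigger`, and the target label are all hypothetical; real attacks use stealthier triggers and poison only a small fraction of the corpus.

```python
# Illustrative backdoor data-poisoning sketch on a toy sentiment dataset.
# TRIGGER and TARGET_LABEL are hypothetical choices for demonstration.
TRIGGER = "cf_trigger"        # rare token used as the hidden trigger
TARGET_LABEL = "positive"     # attacker-chosen behavior when triggered

def poison_example(text: str, label: str) -> tuple[str, str]:
    """Insert the trigger into the input and flip the label to the target."""
    return f"{TRIGGER} {text}", TARGET_LABEL

clean_data = [
    ("the service was terrible", "negative"),
    ("an absolute delight", "positive"),
]

# Poison a fraction of the training set (here, only the first example).
poisoned_data = [poison_example(*clean_data[0])] + clean_data[1:]

# A model fine-tuned on poisoned_data tends to behave normally on clean
# inputs but to emit TARGET_LABEL whenever TRIGGER appears in the input.
```

The key property is that the model's behavior on clean inputs is essentially unchanged, which is what makes such backdoors hard to detect by standard evaluation.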
To clarify the application scenarios, we define the threat model by analyzing the attacker's objectives, capabilities, and the implementation costs associated with each threat type.
Attacker Objectives: The attackers are malicious users of LLM services who aim to trigger abnormal behavior in the models, thereby compromising their security and credibility.
Attacker Knowledge and Capabilities: We assume the attackers possess sufficient computational resources for LLM inference and, in certain scenarios, for fine-tuning smaller LLMs. We further assume that attackers have at least black-box access to the LLM's inference API, allowing them to submit queries and observe the corresponding outputs. In certain scenarios, such as jailbreak and backdoor attacks, attackers may additionally have partial or full knowledge of the model's architecture, parameters, or training data. Under these assumptions, attackers have the following knowledge and capabilities for each threat:
Jailbreak Attacks: Attackers may employ low-cost template-based strategies (e.g., role-playing) to bypass safety filters, or incur higher computational costs to craft optimization-based adversarial suffixes using gradient information. Both strategies aim to force the model into generating harmful or prohibited content.
Backdoor Attacks: Attackers can compromise the training data or supply chain to insert specific triggers. While this incurs high implementation costs because it requires access to the training pipeline, it implants hidden backdoors that enable targeted malicious behaviors once activated.
Hallucinations (Unintentional Errors): Although hallucinations often arise naturally from knowledge gaps, adversaries can also intentionally induce them via deceptive contexts at minimal cost. The resulting factually incorrect generations severely compromise service reliability and erode user trust.
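One common family of hallucination detectors exploits the observation that unsupported answers tend to be unstable across repeated sampling. Below is a minimal self-consistency sketch, assuming a caller-supplied `generate` function that stands in for an LLM API call; the stub generator and the prompt are hypothetical placeholders, not part of any real API.

```python
import collections

def self_consistency_score(generate, prompt: str, n: int = 5) -> float:
    """Sample n answers for the same prompt and return the fraction
    agreeing with the most common answer. Low agreement is a rough
    signal of unstable, possibly hallucinated content."""
    answers = [generate(prompt) for _ in range(n)]
    _, count = collections.Counter(answers).most_common(1)[0]
    return count / n

# Hypothetical usage with a stub generator standing in for an LLM call:
def stub_generate(prompt: str) -> str:
    return "Paris"

score = self_consistency_score(stub_generate, "What is the capital of France?")
# score near 1.0 indicates agreement; values near 1/n suggest the model's
# answers diverge across samples, warranting closer scrutiny.
```

This is a sketch rather than a complete detector: real systems must also normalize semantically equivalent answers (e.g., "Paris" vs. "The capital is Paris") before measuring agreement.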