Although our detection framework can identify abnormal behaviors caused by jailbreak and backdoor attacks before output generation, it only partially addresses hallucination-related abnormalities, which require analyzing the model's generated outputs. To evaluate the effectiveness of AbnorDetector-Lite and AbnorDetector-Full in detecting hallucination phenomena, we follow the methodology proposed in previous work: we append both correct and hallucinated answers to each question and capture the corresponding activations as representations of normal and abnormal behaviors under hallucination conditions.
Specifically, we sample 400 hallucination-detection questions each from the Truthful-QA, HaluEval-QA, Drowzee, and SciQ datasets, appending the correct and hallucinated answers provided by each dataset. The resulting 1,600 queries paired with correct answers and 1,600 paired with hallucinated answers are used for critical layer analysis, and the extracted features serve to construct the classifier's training set. Additionally, we sample a further 100 hallucination-detection questions from each dataset, disjoint from the training set, to assess classifier performance using the same methodology. It is worth noting that the HaluEval-QA, Drowzee, and SciQ datasets provide knowledge related to each question, which we combine with the question input to ensure completeness. For Lynx, which has a fixed prompt template, we follow the template requirements by using the questions, knowledge, and responses provided in each dataset to construct queries for hallucination detection.
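The query-construction step above can be sketched as follows. This is a minimal illustration, not the released pipeline: the record fields (`question`, `knowledge`, `correct`, `hallucinated`) and the toy dataset contents are assumptions standing in for the actual dataset formats.

```python
import random

# Hypothetical record format: each dataset provides a question, optional
# supporting knowledge, a correct answer, and a hallucinated answer.
def make_record(i, with_knowledge):
    return {
        "question": f"Q{i}?",
        "knowledge": f"Fact {i}." if with_knowledge else None,
        "correct": f"Right answer {i}.",
        "hallucinated": f"Wrong answer {i}.",
    }

# Toy stand-ins for Truthful-QA, HaluEval-QA, Drowzee, and SciQ.
DATASETS = {
    "TruthfulQA": [make_record(i, False) for i in range(500)],
    "HaluEvalQA": [make_record(i, True) for i in range(500)],
    "Drowzee":    [make_record(i, True) for i in range(500)],
    "SciQ":       [make_record(i, True) for i in range(500)],
}

def build_queries(per_dataset=400, seed=0):
    """Pair each sampled question with its correct and hallucinated answers."""
    rng = random.Random(seed)
    normal, abnormal = [], []
    for records in DATASETS.values():
        for rec in rng.sample(records, per_dataset):
            # Prepend the dataset-provided knowledge (when available) so the
            # query input is complete, as described in the text.
            prefix = rec["knowledge"] + "\n" if rec["knowledge"] else ""
            prompt = prefix + rec["question"]
            normal.append(prompt + " " + rec["correct"])
            abnormal.append(prompt + " " + rec["hallucinated"])
    return normal, abnormal

normal, abnormal = build_queries()
print(len(normal), len(abnormal))  # 1600 1600
```

The same routine with `per_dataset=100` and a different seed (drawn from the held-out questions) would yield the evaluation set; activations for each query would then be captured at the critical layers to form the classifier's features.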
Results of AbnorDetector and Lynx in Detecting Abnormal Behaviors under Hallucination Scenarios