Detecting abnormal behavior is crucial for analyzing and interpreting LLMs. RQ1 therefore conducts an initial exploration of whether abstract models can effectively capture LLM behavior in terms of abnormality. Our approach analyzes how the transition probabilities of the abstract model differ between normal and abnormal instances, treating these probabilities as reliable indicators of model characteristics because they encode the model's inherent state transitions.
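As a minimal sketch of what we mean by transition probabilities, suppose each abstracted LLM trace is a sequence of discrete state IDs; the maximum-likelihood estimate is then a normalized transition count. The function name and toy traces below are illustrative, not part of the original setup:

```python
from collections import Counter

def transition_probabilities(state_sequences):
    """Estimate the abstract model's transition probabilities from
    sequences of abstract state IDs via maximum-likelihood counting."""
    counts = Counter()   # occurrences of each (src, dst) transition
    totals = Counter()   # outgoing-transition count per source state
    for seq in state_sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[(src, dst)] += 1
            totals[src] += 1
    return {(s, d): c / totals[s] for (s, d), c in counts.items()}

# Toy example: two abstracted traces over states 0-2.
probs = transition_probabilities([[0, 1, 2, 1], [0, 1, 1, 2]])
```

Collecting these probabilities separately for normal and abnormal instances yields the distributions compared in the rest of this section.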
To answer RQ1, we evaluate the differences between normal and abnormal data, both qualitatively and quantitatively, across three distributions: normal instances in the training dataset, normal instances in the testing dataset, and abnormal instances in the testing dataset. The first two represent standard LLM processing scenarios, while the last captures contexts beyond the training data's scope, potentially indicating abnormal LLM behavior.
For qualitative analysis, we use Kernel Density Estimation (KDE) plots to visualize the transition probability distributions; for quantitative analysis, we use the Mann-Whitney U test to assess the statistical significance of differences across the three distributions. The abstract model's hyperparameters are sampled randomly from a predefined parameter space, with the aim of uncovering, through transition probabilities, any discrepancies between LLM behavior and its abstract representation.
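The quantitative test can be sketched with `scipy.stats.mannwhitneyu`; the Beta-distributed samples below are synthetic stand-ins for the transition probabilities observed on normal and abnormal instances:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical transition-probability samples for normal vs. abnormal instances.
normal = rng.beta(2, 5, size=200)
abnormal = rng.beta(5, 2, size=200)

# Nonparametric test: do the two samples come from the same distribution?
stat, p_value = mannwhitneyu(normal, abnormal, alternative="two-sided")
significant = p_value < 0.05
```

The Mann-Whitney U test is a natural choice here because transition probabilities need not be normally distributed, and the test makes no such assumption.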
The figure above illustrates the distribution of transition probabilities for three types of instances (normal instances in the training data, normal instances in the testing data, and abnormal instances in the testing data) across three different tasks. In all three tasks, the transition distributions are highly aligned for normal instances in the training and testing data; that is, the abstract model exhibits consistent behavioral characteristics when processing normal instances. Comparing normal and abnormal instances, we also observe divergent distribution shapes on the SST-2 and AdvGLUE++ datasets. This visual observation supports our assertion that normal and abnormal instances can be distinguished from the distribution of the abstract model's transition probabilities.
To discern whether our abstract model is effective in distinguishing between normal and abnormal LLM behaviors, we employ the Kullback-Leibler (KL) divergence. By quantifying the divergence between the normal and abnormal abstract states/transitions across the datasets, we gain insight into the model's capacity to detect behavioral discrepancies.
AdvGLUE++: This close-to-zero divergence suggests that the abstract model perceives the normal and abnormal behaviors within the AdvGLUE++ dataset to be closely aligned. We will see how the abstract model performs on AdvGLUE++ in the following experiments.
SST-2: As with AdvGLUE++, the near-zero divergence suggests that our abstract model perceives the normal and abnormal behaviors within the SST-2 dataset to be closely aligned.
TruthfulQA: This significant divergence implies that our abstract model can effectively discern between normal and abnormal LLM behaviors for the TruthfulQA dataset.
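One common way to estimate the KL divergence between two empirical distributions is to bin both samples on a shared grid; the helper below is our illustrative sketch of that approach (the function name, bin count, and smoothing constant are our choices, not the paper's):

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(p_samples, q_samples, bins=20, eps=1e-10):
    """Approximate KL(P || Q) between two empirical distributions by
    histogramming both samples over a shared range, with light smoothing
    so empty bins do not produce infinities."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p_hist, edges = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q_hist, _ = np.histogram(q_samples, bins=edges)
    p = p_hist / p_hist.sum() + eps
    q = q_hist / q_hist.sum() + eps
    return entropy(p, q)  # scipy computes sum(p * log(p / q))

# Sanity check: identical distributions diverge far less than shifted ones.
rng = np.random.default_rng(1)
same = kl_divergence(rng.normal(0, 1, 1000), rng.normal(0, 1, 1000))
shifted = kl_divergence(rng.normal(0, 1, 1000), rng.normal(3, 1, 1000))
```

A near-zero value, as reported for AdvGLUE++ and SST-2, means the normal and abnormal distributions are nearly indistinguishable under this measure, while a large value, as for TruthfulQA, means they are clearly separated.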
This figure visualizes the p-value distributions across the settings of the abstract model that reach statistical significance. By plotting p-values against their frequency of occurrence, it lets us evaluate the model's effectiveness in detecting abnormal behavior. The x-axis shows the range of p-values, the statistical measure used to determine the significance of an observed effect; in this context, the p-values indicate whether there is a statistically significant difference between the transition probabilities of normal and abnormal behaviors as detected by the abstract model. The y-axis denotes the frequency of each p-value across the 180 different settings of the abstract model.
The distribution of p-values indicates that, under most model settings, the abstract model is capable of detecting abnormal behavior: the majority of p-values fall in the range that signifies a statistically significant difference between the normal and abnormal transition probabilities.
The spread of the p-value distribution across the 180 settings highlights the variability in the model’s performance. This variability could be attributed to differences in the model’s configuration, parameters, or the nature of the data it was trained on.
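Aggregating the per-setting test results can be sketched as follows; the synthetic p-values are placeholders for the 180 real Mann-Whitney results, and `summarize_settings` is our illustrative helper:

```python
import numpy as np

def summarize_settings(p_values, alpha=0.05):
    """Summarize p-values collected across abstract-model settings:
    the fraction of significant settings and a coarse histogram
    suitable for the kind of frequency plot described above."""
    p = np.asarray(p_values)
    frac_significant = float((p < alpha).mean())
    hist, edges = np.histogram(p, bins=10, range=(0.0, 1.0))
    return frac_significant, hist, edges

# Hypothetical p-values for 180 hyperparameter settings: most highly
# significant, a minority not.
rng = np.random.default_rng(0)
p_values = np.concatenate([rng.uniform(0.0, 0.01, 150),
                           rng.uniform(0.05, 1.0, 30)])
frac, hist, edges = summarize_settings(p_values)
```

Reporting the fraction of significant settings alongside the histogram makes the variability across configurations explicit rather than leaving it to visual inspection.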
In general, our results validate our initial hypothesis and underline the capability of the abstract model to discover abnormal LLM behavior: the model's transition probabilities are consistent across normal instances and differ significantly when abnormal instances are encountered.