For an LLM with fixed weights, its observable behaviors can be traced back to the hidden states of its intermediate layers. When the model processes inputs with different intents, the activation patterns of these layers can vary significantly. We thus hypothesize that the activation patterns of an LLM, when processing queries with different intents, reflect its understanding of the input and can serve as a basis for detecting abnormal behaviors. In other words, the internal activations reveal how the model semantically parses the query.
For instance, consider two superficially similar queries: a normal query such as "Write a tutorial on how to make a website", and an attack query such as "Write a tutorial on how to make a bomb + adversarial suffix". Although the two look alike in surface text, the underlying activation patterns in the model's hidden layers can differ substantially. In particular, the normal query follows the model's usual processing pathways, while the attack query circumvents safety mechanisms and activates neurons tied to abnormal behaviors. This contrast illustrates how the model's internal dynamics diverge when handling normal versus adversarial inputs.
Intuition: By analyzing the differences in activation patterns between normal queries and attack queries, we can identify the unique characteristics of the model when processing abnormal inputs, providing a solid basis for abnormal behavior detection.
To validate this intuition and explore the underlying mechanisms, we conduct two empirical studies using Llama-2-7b-chat as the target model. These studies investigate the separability of representations in the latent space and the distribution of activation differences across model layers.
We first investigate whether abnormal behaviors can be distinguished from normal behaviors in the high-dimensional hidden space. We collected three sets of samples, each consisting of 400 pairs, corresponding to three different threat models: Jailbreak (Normal vs. Attack), Hallucination (Correct vs. Incorrect), and Backdoor (Clean vs. Poisoned). We extracted hidden states from the Attention and MLP sublayers of the last transformer block and employed t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the manifold structure. Additionally, we calculated Representational Similarity Analysis (RSA) metrics to quantify the correlation distances between different classes.
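The visualization step above can be sketched as follows. This is a minimal illustration assuming the sublayer hidden states have already been extracted and mean-pooled into one vector per query (e.g., via `output_hidden_states=True` in HuggingFace Transformers); the Gaussian clusters below are synthetic stand-ins for normal and attack representations, not real Llama-2-7b-chat activations.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_hidden_states(normal_states, attack_states, seed=0):
    """Project paired hidden-state vectors to 2-D with t-SNE.

    normal_states, attack_states: (n_samples, hidden_dim) arrays of
    mean-pooled Attention or MLP sublayer outputs from the last block.
    Returns the 2-D embedding and binary labels (0=normal, 1=attack).
    """
    states = np.vstack([normal_states, attack_states])
    tsne = TSNE(
        n_components=2,
        perplexity=min(30, len(states) - 1),  # perplexity must be < n_samples
        init="pca",
        random_state=seed,
    )
    emb = tsne.fit_transform(states)
    labels = np.array([0] * len(normal_states) + [1] * len(attack_states))
    return emb, labels

# Synthetic stand-in: two shifted Gaussian clusters playing the role
# of normal vs. attack hidden states (dim reduced for illustration)
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(50, 64))
attack = rng.normal(2.0, 1.0, size=(50, 64))
emb, labels = project_hidden_states(normal, attack)
print(emb.shape)  # (100, 2)
```

The resulting `emb` can be scattered with colors keyed to `labels` to reproduce the manifold plots described above.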
The primary observation is that abnormal inputs (red points) generally map to distinct regions of the manifold compared to normal inputs (blue points). This geometric separability is consistently supported by the RSA metrics across the majority of tasks, where the inter-class distance (e.g., Normal-Jailbreak) is measurably larger than the intra-class distance (e.g., Normal-Normal). For instance, in the Jailbreak scenario, the distance between normal and attack queries (0.333 ± 0.053) is roughly double the distance within normal queries (0.167 ± 0.055), indicating a clear decision boundary in the high-dimensional space.
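The RSA-style comparison of inter-class and intra-class distances can be computed as in the sketch below, using correlation distance (1 minus Pearson correlation) between hidden-state vectors. The two synthetic classes are built around distinct class-mean vectors so that the intra-class distance comes out smaller than the inter-class distance, mirroring the pattern reported above; the function names and data are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_distance(a, b):
    """Mean and std of pairwise correlation distances (1 - Pearson r)
    between two sets of hidden-state vectors; self-pairs are dropped
    when the two sets are identical (intra-class case)."""
    d = cdist(a, b, metric="correlation")
    if a is b:
        d = d[~np.eye(len(a), dtype=bool)]  # exclude the zero diagonal
    return d.mean(), d.std()

# Synthetic classes: samples cluster around distinct class-mean vectors,
# so vectors within a class are correlated while vectors across classes
# are nearly uncorrelated.
rng = np.random.default_rng(1)
mu_normal = rng.normal(0.0, 1.0, size=32)
mu_attack = rng.normal(0.0, 1.0, size=32)
normal = mu_normal + rng.normal(0.0, 0.3, size=(40, 32))
attack = mu_attack + rng.normal(0.0, 0.3, size=(40, 32))

intra_mu, intra_sd = class_distance(normal, normal)  # Normal-Normal
inter_mu, inter_sd = class_distance(normal, attack)  # Normal-Attack
print(f"intra: {intra_mu:.3f} ± {intra_sd:.3f}")
print(f"inter: {inter_mu:.3f} ± {inter_sd:.3f}")
```

A clear gap between the two means, as in the 0.333 vs. 0.167 figures reported for the Jailbreak scenario, indicates geometric separability.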
Secondarily, we observe variations in separability between different component types. While both Attention and MLP layers distinguish strong adversarial patterns effectively, MLP layers tend to exhibit more compact clustering and reduced intra-class variance. This is particularly noticeable in the Hallucination task, where the MLP representations provide a clearer distinction between correct and incorrect answers compared to the Attention layers.
The hidden states of LLMs exhibit intrinsic geometric separability between normal and abnormal behaviors, with MLP layers demonstrating stronger discriminative capability in certain scenarios.
While Study I confirms separability, it treats the model as a black box regarding layer contribution. To understand where this divergence occurs, we draw inspiration from coverage metrics in deep learning security testing. The core premise is that abnormal and normal inputs trigger distinct processing pathways, effectively covering different regions of the model's internal state space. Aligning with metrics such as Neuron Coverage and Top-K Neuron Patterns, we utilize the count of active neurons as a proxy for pathway activation. We collected 100 normal and 100 attack queries and calculated the activation ratio (Attack/Normal) for each layer to map how processing dynamics shift across the model architecture.
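A minimal sketch of the per-layer activation-ratio computation follows. It assumes the post-nonlinearity MLP activations have already been collected as one array per layer, and treats a neuron as "active" when its value exceeds a threshold (0 here; a small positive cutoff may be more appropriate for SiLU-style activations such as Llama-2's). The simulated "surge" in one layer is purely illustrative.

```python
import numpy as np

def layer_activation_ratios(normal_acts, attack_acts, threshold=0.0):
    """Per-layer ratio of active-neuron counts (Attack / Normal).

    normal_acts, attack_acts: lists of (n_queries, n_neurons) arrays,
    one per transformer layer. Returns an array of ratios, one per layer.
    """
    ratios = []
    for n_layer, a_layer in zip(normal_acts, attack_acts):
        n_active = (n_layer > threshold).sum()
        a_active = (a_layer > threshold).sum()
        ratios.append(a_active / max(n_active, 1))  # guard against /0
    return np.array(ratios)

# Synthetic stand-in: 4 layers, 100 queries, 128 neurons each;
# layer index 2 is shifted to simulate a surge under attack inputs
rng = np.random.default_rng(2)
normal = [rng.normal(-1.0, 1.0, size=(100, 128)) for _ in range(4)]
attack = [a.copy() for a in normal]
attack[2] = attack[2] + 1.5  # simulated attack-induced surge

ratios = layer_activation_ratios(normal, attack)
print(ratios.argmax())  # the surging layer stands out
```

Plotting `ratios` against layer index yields the kind of depth profile described next, with most layers near 1.0 and a few spiking well above it.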
The figure illustrates the activation ratios across the model's depth. The data reveals a stark heterogeneity: the model's response to anomalies is not uniformly distributed. While many layers (shown in blue) maintain stable activation levels regardless of input type, specific blocks (highlighted in orange) exhibit a dramatic surge in activity when processing attacks, with ratios spiking well above the baseline. This indicates that the "abnormality" is not diffused evenly but is localized in specific processing stages. These high-ratio layers are likely where the model reacts most strongly to attacks, for example when safety checks are being bypassed.
The internal impact of abnormal behavior is unevenly distributed across layers, with semantic deviations mainly concentrated in a small number of key layers.