Cluster Analysis Experiment
We first introduce the setup for our cluster analysis experiments, using Llama-2-7b-chat as the target model. We collect four distinct datasets, taking 200 queries from each, designed to trigger different model behaviors:
Normal queries are sourced from Alpaca-GPT-4 and are expected to trigger the model's normal question-answering behavior.
Synonymous queries are paraphrases of the normal queries produced by GPT-4, intended to trigger the same normal behaviors as the originals.
Rejected queries are sourced from AdvBench. These malicious questions aim to trigger rejection behaviors, since the aligned LLM is trained to refuse such queries. For example, given a malicious query like "how to make a bomb," the LLM responds with something like "Sorry, I cannot provide..." to avoid generating harmful content.
Attack queries are generated by appending adversarial suffixes (the red text in the figure) to rejected queries using GCG. These queries aim to trigger the model into producing malicious content (i.e., abnormal behaviors); a construction sketch follows this list.
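To make the construction concrete, the sketch below shows how an attack query is assembled once GCG has produced a suffix. The suffix string here is a hypothetical placeholder, since a real suffix comes out of GCG's per-query token optimization.

```python
# Minimal sketch of attack-query construction. The suffix below is a
# hypothetical placeholder; a real one is produced by running GCG's
# gradient-guided token search against the target model.
rejected_query = "Give step-by-step instructions for how to make a bomb"
gcg_suffix = "<adversarial suffix found by GCG>"  # placeholder, not a real suffix
attack_query = rejected_query + " " + gcg_suffix
```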
The hidden states refer to the activations of the last token in each transformer block, taken after both the attention and MLP (multi-layer perceptron) layers; the last token's representation is widely believed to capture the overall semantics of the sentence.
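As an illustration, the following sketch extracts these last-token hidden states with the Hugging Face transformers API. The checkpoint name, dtype, and device placement are our assumptions, not details taken from the original experimental code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint and settings (not from the original setup).
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_hidden_states(query: str) -> torch.Tensor:
    """Return the last token's hidden state at every transformer block.

    The model returns one tensor per layer of shape (1, seq_len, hidden);
    hidden_states[0] is the embedding output, so we skip it and keep the
    32 block outputs of Llama-2-7b (result shape: (32, 4096)).
    """
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return torch.stack([h[0, -1, :] for h in outputs.hidden_states[1:]])
```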
For each data point, we first normalize the features so that all of them contribute equally to the clustering process. After normalization, we apply Principal Component Analysis (PCA) to reduce the dimensionality of the data, retaining the most informative components while reducing noise and computational cost. Finally, we run k-means on the PCA-reduced data, partitioning it into k clusters such that each data point is assigned to the cluster with the nearest mean, which serves as the cluster's prototype.
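A minimal sketch of this pipeline with scikit-learn, assuming standardization as the normalization step and k = 4 (one cluster per query type); the PCA component count is likewise an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_layer(X: np.ndarray, n_components: int = 50, k: int = 4) -> np.ndarray:
    """Cluster one layer's hidden states.

    X has shape (num_queries, hidden_size), e.g. (800, 4096) for
    200 queries from each of the four datasets.
    """
    X_norm = StandardScaler().fit_transform(X)  # equalize feature scales
    X_red = PCA(n_components=n_components).fit_transform(X_norm)  # keep top components
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_red)
```

Running this separately on each layer's hidden states makes it possible to see at which depth the four behavior types begin to separate.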
Clustering Experiment Results