We try to interpret the symptoms caused by glitch tokens based on the repetition prompts. Using TransformerLens, we add hooks to Llama2-7b-chat to obtain the output of each layer, and we compare how the logits evolve across layers for regular tokens versus glitch tokens. The following figures show the differences; a minimal sketch of the probing code is given below.
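The probe can be reproduced with a short TransformerLens sketch. This is an illustration under our own assumptions, not the exact experiment code: the model alias, the repetition prompt string, and the choice to project each layer's residual stream through the final normalization and the unembedding matrix (a "logit lens"-style readout) are assumptions for the example.

```python
import torch
from transformer_lens import HookedTransformer

# Hypothetical model alias and repetition prompt; adjust to your setup.
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
prompt = 'Can you repeat the string "Rec" back to me?'

with torch.no_grad():
    # run_with_cache stores every intermediate activation in one forward
    # pass, playing the role of the per-layer hooks described above.
    logits, cache = model.run_with_cache(prompt)

    top_probs = []
    for layer in range(model.cfg.n_layers):
        # Residual stream after this layer, projected through the final
        # normalization and the unembedding matrix ("logit lens" readout).
        resid = cache["resid_post", layer]                   # [batch, pos, d_model]
        layer_logits = model.unembed(model.ln_final(resid))  # [batch, pos, d_vocab]
        probs = layer_logits[0, -1].softmax(dim=-1)          # final position only
        p, idx = probs.topk(10)                              # top-10 tokens per layer
        top_probs.append(p)
        print(f"layer {layer}: {model.to_str_tokens(idx)}")

# [n_layers, 10] matrix of probabilities used for the heatmap below.
heat = torch.stack(top_probs).cpu().numpy()
```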
In each figure, the abscissa is the layer index after which the intermediate probabilities are computed, and the ordinate is the rank of the token. We take the top 10 tokens in the output of each layer and draw a heatmap of their probabilities. It is easy to observe in Fig 1 that the regular token 'Rec' undergoes an abrupt change at the 20th layer: the probability of every token related to 'Rec' rises sharply. On the contrary, we cannot observe a similar phenomenon in Fig 2.
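For concreteness, such a heatmap can be rendered as below, where `heat` is the [n_layers, 10] probability matrix from the previous sketch; the plotting choices (colormap, orientation) are ours and not necessarily those of the actual figures.

```python
import matplotlib.pyplot as plt

# Columns: layer index (abscissa); rows: token rank, 0 = most probable (ordinate).
plt.figure(figsize=(10, 3))
plt.imshow(heat.T, aspect="auto", cmap="viridis")
plt.xlabel("layer")
plt.ylabel("token rank (top 10)")
plt.colorbar(label="probability")
plt.tight_layout()
plt.show()
```

In this rendering, the abrupt change described for the regular token would appear as a band that brightens sharply from the 20th layer onward, while a glitch token's heatmap would stay diffuse across layers.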