Our empirical study aims to explore an intuitive method for detecting glitch tokens. To this end, we investigate the differences in a model's behavior when processing glitch tokens versus normal tokens. Two research questions guide the study:
RQ1 (Characteristics): What differences are exhibited between glitch tokens and normal tokens at the structural level of an LLM?
RQ2 (Ubiquity): Are the differences discovered in RQ1 prevalent in most LLMs?
In this study, we employ a unified approach to determine whether each token is glitchy or normal. While the symptoms of glitch tokens may vary across tasks, we consistently use a repetition task to construct input sequences for glitch token identification. Specifically, we formulate a repetition prompt such as ``Can you repeat the token `{token}' and return it back to me?'' This prompt is then fed into the model to assess its ability to accurately reproduce the original token. If the model fails to repeat the token correctly in its output, we classify it as a glitch token; otherwise, we categorize it as a normal token. We traverse all 32,000 tokens in the vocabulary of the Llama2 model and eventually identify 6,425 glitch tokens.
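To make this identification procedure concrete, the following sketch illustrates the repetition-based check with a HuggingFace-style Llama2 checkpoint; the checkpoint name, prompt wording, generation budget, and substring-matching rule are illustrative assumptions rather than the exact settings used in our study.

```python
# Illustrative sketch of the repetition-based glitch-token check.
# Assumptions: a HuggingFace-style causal LM and tokenizer; the prompt
# template, generation budget, and matching rule are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def is_glitch_token(token_id: int) -> bool:
    """Return True if the model fails to repeat the token back verbatim."""
    token_str = tokenizer.decode([token_id])
    prompt = f"Can you repeat the token '{token_str}' and return it back to me?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    reply = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    # A token is treated as normal only if it reappears verbatim in the reply.
    return token_str not in reply

glitch_tokens = [t for t in range(len(tokenizer)) if is_glitch_token(t)]
```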
To precisely capture intermediate-layer outputs within the model, we resort to a transformer mechanistic interpretability tool named Transformer-lens. Its hook technique enables real-time access to the activation values at all layers and allows code to be inserted into specific intermediate layers of the model.
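As a minimal illustration of this hook mechanism, the snippet below registers a hook on the post-block residual stream of a single layer and observes its activations during a forward pass; the small publicly available model and the chosen layer are assumptions made for demonstration, not the configuration used in our study.

```python
# Minimal illustration of a Transformer-lens forward hook.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Llama2

def print_resid_norm(activation, hook):
    # activation: [batch, seq_len, d_model] residual-stream tensor at this layer
    print(hook.name, activation.norm(dim=-1))
    return activation  # returning the tensor leaves the forward pass unchanged

tokens = model.to_tokens("Can you repeat the token 'example'?")
logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.0.hook_resid_post", print_resid_norm)],
)
```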
In this study, we insert hooks into all intermediate layers during the first forward pass of the tested model. This choice is motivated by the fact that the first forward pass comprehensively reflects the model's understanding of the input sequence and highlights the differences between normal and glitch tokens.
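A possible sketch of this whole-model instrumentation with Transformer-lens is shown below: it caches the activations of every layer during a single forward pass over the prompt, so that the per-layer activations triggered by glitch and normal tokens can be compared. The hook point (the post-block residual stream), the stand-in model, and the example tokens are assumptions made for illustration.

```python
# Sketch: cache activations from every layer in one forward pass, so that
# glitch-token and normal-token inputs can be compared layer by layer.
# The hook point chosen here (post-block residual stream) is an assumption.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Llama2

def first_forward_activations(prompt: str):
    tokens = model.to_tokens(prompt)
    # run_with_cache attaches hooks to all layers and records their outputs
    # during this single (first) forward pass over the input sequence.
    _, cache = model.run_with_cache(
        tokens,
        names_filter=lambda name: name.endswith("hook_resid_post"),
    )
    # One [batch, seq_len, d_model] tensor per layer.
    return [cache["resid_post", layer] for layer in range(model.cfg.n_layers)]

normal_acts = first_forward_activations("Can you repeat the token 'apple'?")
glitch_acts = first_forward_activations("Can you repeat the token 'SolidGoldMagikarp'?")
```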