Based on the findings from our empirical study, we propose GlitchProber, an algorithm that automatically detects and fixes glitch tokens by analyzing the internal activation states of LLMs. The overall workflow of GlitchProber is shown below.
The GlitchProber detection algorithm identifies glitch tokens that cause model output errors by analyzing intermediate-layer activation states. It first randomly samples a subset of tokens from the vocabulary, tests them with repetitive tasks, and extracts activation features from key layers. These features undergo dimensionality reduction and are labeled according to the repetitive-task outcomes. A support vector machine (SVM) classifier is then trained on the labeled data to assess unknown tokens. Finally, the remaining tokens in the vocabulary are screened individually with the trained classifier, and its predictions are verified through repetitive tasks.
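The detection pipeline above can be sketched end to end. Everything model-specific here is a stand-in: `extract_activation_features` and `repetitive_task_label` are mock functions (in the real system they read intermediate-layer activations from the LLM and run the repetitive task), the glitch-token ids are a toy assumption, PCA via SVD plays the role of the dimensionality reduction, and a nearest-class-mean rule substitutes for the trained SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 100
GLITCH_START = 90   # toy assumption: pretend token ids 90-99 are glitch tokens

def extract_activation_features(token_id):
    """Mock stand-in for reading key-layer activations from the LLM.
    Glitch tokens get a systematically shifted activation pattern."""
    feats = rng.normal(0.0, 1.0, size=64)
    if token_id >= GLITCH_START:
        feats += 3.0
    return feats

def repetitive_task_label(token_id):
    """Mock oracle for the repetitive task: 1 = model fails to repeat (glitch)."""
    return int(token_id >= GLITCH_START)

# Step 1: sample a subset of the vocabulary and label it via the task.
# (Sketch only: we stratify so both classes appear in the small sample.)
sample = np.concatenate([rng.choice(GLITCH_START, size=32, replace=False),
                         GLITCH_START + rng.choice(10, size=8, replace=False)])
X = np.stack([extract_activation_features(t) for t in sample])
y = np.array([repetitive_task_label(t) for t in sample])

# Step 2: dimensionality reduction (PCA via SVD, keeping the top components).
k = 8
mean_feat = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean_feat, full_matrices=False)
components = Vt[:k]
X_red = (X - mean_feat) @ components.T

# Step 3: train a classifier on the labeled, reduced features.
# Nearest-class-mean is a lightweight stand-in for the SVM here.
mu_normal = X_red[y == 0].mean(axis=0)
mu_glitch = X_red[y == 1].mean(axis=0)

def predict_is_glitch(token_id):
    f = (extract_activation_features(token_id) - mean_feat) @ components.T
    return np.linalg.norm(f - mu_glitch) < np.linalg.norm(f - mu_normal)

# Step 4: screen the remaining tokens; flagged tokens would then be
# re-verified with the repetitive task before being reported.
remaining = [t for t in range(VOCAB_SIZE) if t not in set(sample)]
flagged = [t for t in remaining if predict_is_glitch(t)]
```

Because the classifier only screens candidates, the final repetitive-task check keeps false positives out of the reported set, which is why a cheap classifier suffices in the loop.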
The GlitchProber fix algorithm corrects the anomalous activation patterns of glitch tokens by adjusting activation values in the model's intermediate layers, eliminating their negative impact on model output. It first computes activation statistics of normal tokens in the key layers and identifies neurons that are consistently activated, or consistently silent, for most normal tokens. It then compares these neurons' activations between normal and glitch tokens, deriving suppression ratio coefficients for anomalously activated neurons and promotion activation values for neurons that should fire but remain silent. Finally, it rectifies the activation values of glitch tokens in the key layers using these coefficients and values, automatically fixing the glitch tokens.
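A minimal numeric sketch of this rectification, under toy assumptions: the activation matrix, the 0.1 firing threshold, the layer size, and the particular glitch activation vector are all invented for illustration, and a single scalar suppression ratio stands in for the per-neuron coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical activations of 200 normal tokens in one key layer:
# neurons 0-7 usually fire around 1.0, neurons 8-15 are usually silent.
n_tokens, n_neurons = 200, 16
normal_acts = np.abs(rng.normal(1.0, 0.2, size=(n_tokens, n_neurons)))
normal_acts[:, 8:] *= 0.02

# Step 1: activation statistics; find consistently active / silent neurons.
threshold = 0.1                                   # toy firing threshold
act_rate = (normal_acts > threshold).mean(axis=0)
active_mask = act_rate > 0.9                      # fire for most normal tokens
silent_mask = act_rate < 0.1                      # silent for most normal tokens
mean_normal = normal_acts.mean(axis=0)

# Step 2: a glitch token's activations show the inverse pattern: it fires
# on the normally-silent neurons and stays quiet on the normally-active ones.
glitch_act = np.concatenate([np.full(8, 0.05), np.full(8, 1.5)])

# Suppression ratio for anomalous firing on normally-silent neurons,
# and promotion values (the normal means) for dormant active neurons.
suppress_ratio = mean_normal[silent_mask].mean() / glitch_act[silent_mask].mean()
promote_value = mean_normal

# Step 3: rectify the glitch token's activations in the key layer.
def rectify(act):
    fixed = act.copy()
    anomalous = silent_mask & (act > threshold)   # should be silent, but fires
    fixed[anomalous] *= suppress_ratio
    dormant = active_mask & (act <= threshold)    # should fire, but is silent
    fixed[dormant] = promote_value[dormant]
    return fixed

fixed = rectify(glitch_act)
```

After rectification, the glitch token's activation profile matches the normal-token statistics, so downstream layers receive inputs within their usual range.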