We've highlighted the detrimental effects of Recurrent Meltdown. If we can preemptively halt the model's generation upon detecting this behavior, we can significantly mitigate its impact on performance.
RecurrentDetector preemptively detects recurrent meltdown at detection_length=400 tokens, which is 25% of the generation limit, while introducing negligible delay.
As explained in the Preliminary Study section, activation self-similarity is a critical indicator of recurrent generation. Thus, the feature vector consists of two segments:
Max Similarity Ratios
During a forward pass, every neuron in the LLM processes each token in the sequence, and at each token a neuron is either activated or deactivated. To quantify the activation similarity between any pair of tokens from the full model's perspective, we compute the proportion of neurons that are in the same activation state (both activated or both deactivated) at the two tokens. Collecting this value for every token pair forms the Activation Similarity Matrix.
For efficiency, we retain, for each token, only its maximum similarity with any other token. This forms the Max Similarity Ratios vector.
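The two steps above can be sketched as follows. This is a minimal illustration, assuming binarized neuron activations are available as a boolean (tokens × neurons) array; the function name and shapes are ours, not the paper's:

```python
import numpy as np

def max_similarity_ratios(activations: np.ndarray) -> np.ndarray:
    """Per-token maximum activation-similarity ratios.

    activations: boolean array of shape (num_tokens, num_neurons),
    True where a neuron is activated at that token.
    """
    acts = activations.astype(np.int8)
    n_tokens, n_neurons = acts.shape
    # Activation Similarity Matrix: for every pair of tokens, the
    # fraction of neurons in the same state (both on or both off).
    agree = acts @ acts.T + (1 - acts) @ (1 - acts).T
    sim = agree / n_neurons
    # Exclude self-similarity (always 1) before taking the row max.
    np.fill_diagonal(sim, -1.0)
    return sim.max(axis=1)
```

A token whose activation pattern closely repeats another token's pattern receives a ratio near 1, which is the signature of recurrent generation the detector looks for.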
Combining Value and Positional Information
The final feature vector fed to the MLP is the concatenation of the Max Similarity Ratios vector and its sorted version.
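In code, this concatenation is a one-liner; the sorted half exposes the distribution of similarity values while the unsorted half preserves where in the sequence the high-similarity tokens occur (function name ours):

```python
import numpy as np

def build_feature_vector(max_ratios: np.ndarray) -> np.ndarray:
    # Positional view (token order) followed by a value view (sorted),
    # so the MLP sees both where high-similarity tokens appear
    # and how many of them there are.
    return np.concatenate([max_ratios, np.sort(max_ratios)])
```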
We trained RecurrentDetector on 3,400 samples drawn from 6 open-source LLMs. First, we take 1,200 non-recurrent samples and 1,200 recurrent samples from the trajectories produced by RecurrentGenerator. We then add 1,000 non-recurrent samples from the real-world dataset ShareGPT. The evaluation results are shown below. Inference takes 0.36 ms on average.
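For illustration, such a binary classifier could be trained roughly as sketched below. The features here are synthetic stand-ins (recurrent samples get higher max-similarity ratios), and the hidden-layer size and other hyperparameters are our assumptions, not the reported configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, dim = 400, 16  # illustrative sample count and feature dimension

# Synthetic stand-in features: recurrent generations exhibit high
# max-similarity ratios, non-recurrent ones stay lower.
recurrent = rng.uniform(0.8, 1.0, size=(n, dim))
normal = rng.uniform(0.0, 0.6, size=(n, dim))
X = np.vstack([recurrent, normal])
y = np.array([1] * n + [0] * n)  # 1 = recurrent, 0 = non-recurrent

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

At serving time, the trained MLP is invoked once the generation reaches the detection length; a single forward pass over one small feature vector is cheap, consistent with the sub-millisecond inference cost reported above.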