Here, we introduce a taxonomy of glitch tokens observed across seven widely used LLMs.
Word Tokens are glitch tokens formed by concatenating common words in a manner that does not typically occur in standard language use. For example, consider the token “quickShipAvailable” in r50k_base. Here, “quick”, “ship”, and “available” are all common English words, but their unexpected concatenation yields a glitch token that deviates from conventional linguistic patterns.
Letter Tokens are glitch tokens characterized by strings of letters that do not form recognizable or coherent words. These tokens appear to be random or nonsensical combinations of letters that do not align with typical linguistic constructs. For example, consider the token “davidjl” in cl100k_base. While “david” is a recognizable name, the appended “jl” renders the string nonsensical, illustrating the nature of glitch tokens in this category.
Character Tokens are glitch tokens that consist exclusively of non-letter characters, forming unintelligible sequences without any semantic value. An illustrative example is the token “\"” in r50k_base. This token, made up solely of a backslash followed by a quotation mark, does not represent any coherent information, highlighting the characteristic nature of a glitch token in this category.
Mixed Tokens are glitch tokens that blend letters with other characters, creating strings that are neither standard words nor recognizable terms. An illustrative case is the token “\GeneratedValue” in LlamaTokenizer, where a backslash is combined with the word “GeneratedValue”, mixing alphabetic characters with non-alphabetic symbols in an unconventional manner.
Non-ASCII Tokens are glitch tokens whose string composition contains non-ASCII characters. For example, the token “réalis” in LlamaTokenizer includes the non-ASCII character “é”, distinguishing it from standard ASCII-based tokens.
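To make the taxonomy concrete, the Python sketch below sorts tokens into the five categories using simple string heuristics. Since the categories are described above in prose rather than with formal rules, the checks and the tiny placeholder word list are our own illustrative assumptions, not the classification procedure used in this study.

```python
# Heuristic five-way token classifier; a sketch, not the study's procedure.
COMMON_WORDS = {"quick", "ship", "available"}  # placeholder dictionary

def segments_into_words(s, words):
    """Greedy longest-match check: can s be split entirely into known words?"""
    s = s.lower()
    i = 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in words:
                i = j
                break
        else:  # no known word starts at position i
            return False
    return True

def classify_token(token):
    if any(ord(c) > 127 for c in token):
        return "non-ASCII token"    # e.g. "réalis"
    if not any(c.isalpha() for c in token):
        return "character token"    # e.g. '\\"'
    if not token.isalpha():
        return "mixed token"        # e.g. "\\GeneratedValue"
    if segments_into_words(token, COMMON_WORDS):
        return "word token"         # e.g. "quickShipAvailable"
    return "letter token"           # e.g. "davidjl"

for t in ["quickShipAvailable", "davidjl", '\\"', "\\GeneratedValue", "réalis"]:
    print(repr(t), "->", classify_token(t))
```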
We then analyze the glitch tokens of different models that share the same tokenizer. Our findings show that GPT-3.5-turbo and GPT-4 share 1827 glitch tokens, a 65.04% similarity. In contrast, Llama-2-7b-chat and Llama-2-13b-chat, which use the same LlamaTokenizer but differ in parameter count, share 1070 glitch tokens, a 33.56% overlap. Furthermore, some glitch tokens trigger similar symptoms across models, while others trigger unrelated ones. For example, with LlamaTokenizer, Llama-2-7b-chat and Llama-2-13b-chat both respond with “skim” when prompted with the Russian token “ским”. In contrast, the two models respond with “<lasse>” and “lette”, respectively, when prompted with the LaTeX command token “leqslant”.
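The overlap figures above can be reproduced with elementary set operations. The similarity formula is not stated here, so the sketch below assumes the Jaccard index (shared tokens divided by the union of both sets), which is consistent with the reported numbers: 1827 shared tokens at 65.04% similarity implies a union of roughly 2809 tokens. The variable names in the usage comment are hypothetical.

```python
def glitch_overlap(tokens_a, tokens_b):
    """Return (shared glitch-token count, percentage similarity).

    Similarity is computed as the Jaccard index, i.e. the size of the
    intersection over the size of the union; this formula is an assumption.
    """
    a, b = set(tokens_a), set(tokens_b)
    shared = a & b
    return len(shared), 100.0 * len(shared) / len(a | b)

# Hypothetical usage with glitch-token sets mined from two models:
# n, sim = glitch_overlap(gpt35_glitch_tokens, gpt4_glitch_tokens)
# print(f"{n} shared glitch tokens, {sim:.2f}% similarity")
```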