
Information Theory: String Complexity

We consider texts, or more generally strings or sequences, written on a finite alphabet A. A can be the alphanumeric alphabet, or the nucleotide bases {A,C,G,T} for DNA sequences. The set of finite sequences over A is denoted A*. A factor w of a sequence X in A* also belongs to A* and is made of consecutive symbols of the original sequence X. In other words, if w is a factor of X then there exist two sequences u and v (possibly empty) such that X=uwv. In the figure below we take w="aba" and display three example strings where w appears: X1, X2 and X3. The occurrences are boxed. In X3, w appears twice at overlapping positions.
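A two-line Python check of this definition (the function names are ours, for illustration only):

```python
def is_factor(w, x):
    """w is a factor of x iff x = u + w + v for some (possibly empty) u, v,
    i.e. iff w occurs as a block of consecutive symbols of x."""
    return any(x[i:i + len(w)] == w for i in range(len(x) - len(w) + 1))

def occurrences(w, x):
    """Starting positions of w in x, overlapping occurrences included."""
    return [i for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w]

print(is_factor("aba", "xxabayy"))   # True: u = "xx", v = "yy"
print(occurrences("aba", "ababa"))   # [0, 2]: two overlapping occurrences
```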

The complexity J(X) of a string X is the number of its distinct factors. The complexity of "aaaaa" is 6 (J("aaaaa")=6); the distinct factors are "" (the empty word), "a", "aa", "aaa", "aaaa" and "aaaaa" (the full sequence). The complexity of "ababa" is 10, with the factors "", "a", "b", "ab", "ba", "aba", "bab", "abab", "baba", "ababa". The joint complexity of two sequences X1 and X2 is the number of distinct factors common to the two sequences, denoted J(X1, X2). Let X1="the_ape_is_eating_the_banana" and X2="ananas_is_an_exotic_fruit" (blanks are replaced by "_" for visibility). The common factors are "", "a", "e", "i", "n", "s", "t", "_", "a_", "an", "is", "na", "s_", "ti", "_a", "_e", "_i", "ana", "is_", "nan", "_is", "anan", "nana", "_is_", "anana". Thus J(X1, X2)=25. It is easy to compute, simply by suffix tree superposition. The figures below show the respective suffix trees T1 and T2 of the sequences X1 and X2. These are the expanded versions; in practice the tree can omit the deep branches spelled out by long chunks of text and replace them by pointers into the original text. The compacted version is linear in size and can be built in linear time (in the length of the original sequence). The figure below the trees shows the superposition of the two trees, which yields the display of the common factors. The next box on the right shows an abstraction of the program that enumerates the common factors via tree superposition (by abuse of notation we let the program operate on the suffix trees T1 and T2; T1(a) (resp. T2(a)) denotes the subtree hanging from the edge labelled by symbol a in A).
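That program can be sketched in Python on expanded (uncompacted) suffix tries, where the child of the root labelled a plays the role of T(a) in the box. Like the expanded trees in the figure, this sketch uses quadratic space; the compacted trees are the linear-size ones:

```python
def suffix_trie(x):
    """Expanded suffix trie of x: every distinct factor of x (including
    the empty word, at the root) is exactly one node of the trie."""
    root = {}
    for i in range(len(x)):          # insert every suffix x[i:]
        node = root
        for a in x[i:]:
            node = node.setdefault(a, {})
    return root

def joint_complexity(t1, t2):
    """Superpose the two tries: count the nodes present in both, i.e. the
    distinct common factors (the shared root counts the empty word).
    With t1 == t2 this is simply the complexity J(X)."""
    return 1 + sum(joint_complexity(t1[a], t2[a]) for a in t1.keys() & t2.keys())

x1 = "the_ape_is_eating_the_banana"
x2 = "ananas_is_an_exotic_fruit"
t1, t2 = suffix_trie(x1), suffix_trie(x2)
print(joint_complexity(t1, t1))  # J(X1), the plain complexity
print(joint_complexity(t1, t2))  # 25, the common factors listed above
```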

The paradigm here is that the higher the joint complexity between two texts of the same length n, the more similar the sources. The formulas below give the average joint complexity in classic source models, such as Markov sources with finite memory. The first formula covers the case where the sequences are strictly identical: X1=X2=X. In this case the joint complexity equals the complexity and is roughly quadratic in n; the main term (n+1)n/2 indicates that almost every pair of positions in the sequence delimits a distinct factor. The second formula covers the case where the sequences are different but the sources are the same. In this case the joint complexity is mostly linear in n, where the quantity h refers to the entropy rate per symbol of the source. The last formula covers the case where the sources are different (e.g. Markov sources with different transition matrices). The exponent kappa is strictly smaller than 1 and is obtained as the root of a sophisticated equation, and the joint complexity is sublinear in n. The more different the sources, the smaller the quantity kappa. The quantities alpha and beta are obtained via saddle point methods. It can happen that the sources are so different that the common distinct factors are finite in number; in this case the equation shows that kappa=0 and alpha=0, and the joint complexity does not increase with n.
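In symbols, keeping only what the paragraph above states and leaving the exact constants to the original formulas (which we do not reproduce here), the three regimes read:

```latex
J(X,X) \sim \frac{n(n+1)}{2}                 % identical sequences: quadratic
\mathbb{E}\,J(X_1,X_2) = \Theta(n)           % same source: linear, slope governed by h
\mathbb{E}\,J(X_1,X_2) \asymp \alpha\, n^{\kappa}, \quad \kappa < 1
                                             % different sources: sublinear;
                                             % bounded when \kappa = \alpha = 0
```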

Joint complexity can be used to identify the source. In the plot below we measure the joint complexity of a binary Markov source tuned with a single parameter. The maximum is reached when the parameter of the tested source agrees with that of the reference. The plot on the right shows the joint complexity of simulated English over simulated French (top) and simulated English over simulated Polish (bottom). In red is the theoretical value. The simulations use Markov sources of memory 4.
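A miniature version of the binary experiment can be reproduced with the sketch below, which reuses suffix_trie and joint_complexity from above; the memory-1 chain, the length 500 and the parameter grid are illustrative choices of ours, not those of the plot:

```python
import random

def binary_markov(n, p, rng):
    """Binary Markov source of memory 1, tuned by a single parameter p:
    the next symbol repeats the current one with probability p."""
    s = [rng.choice("01")]
    for _ in range(n - 1):
        s.append(s[-1] if rng.random() < p else "10"[int(s[-1])])
    return "".join(s)

rng = random.Random(0)
x_ref = binary_markov(500, 0.3, rng)      # reference source, tuned at p* = 0.3
for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    y = binary_markov(500, p, rng)
    jc = joint_complexity(suffix_trie(x_ref), suffix_trie(y))
    print(f"p = {p}: JC = {jc}")          # expected to peak near p = 0.3
```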

Joint complexity can also be used to classify a set of tweets. We have collected 1,000 random tweets. Our aim is to partition this set into the categories politics, sport, technology, economics and lifestyle. First we select the tweets that contain these words as keywords. This concerns a few percent of the original set and seeds the database. Notice: these tweets date from 2012.
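The seeding step fits in a few lines; as described above, the category names themselves serve as keywords (the tweet list is a placeholder):

```python
CATEGORIES = ("politics", "sport", "technology", "economics", "lifestyle")

def seed_database(tweets):
    """Step 1: keep the few percent of tweets that literally contain a
    category keyword; they become the seed entries (text, category)."""
    db = []
    for t in tweets:
        cat = next((c for c in CATEGORIES if c in t.lower()), None)
        if cat is not None:
            db.append((t, cat))
    return db
```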

The second step is to populate the database with the remaining tweets. For each tweet we look for the tweet in the database that is closest in terms of joint complexity, and we insert the tweet into the corresponding category, as sketched below. The picture further down gives a list of tweets that have been classified that way; the category is given by the color. We stress that these tweets belong to the majority of tweets that do not contain the keywords that seeded the database.
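Step 2 is then a nearest-neighbor rule under joint complexity. A minimal sketch, reusing the functions above (tie-breaking and any length normalization are omitted, and the seed tweets are toy examples of ours):

```python
def classify(tweet, db):
    """Find the database entry with the largest joint complexity and
    insert the tweet into its category."""
    t = suffix_trie(tweet)
    _, cat = max(db, key=lambda e: joint_complexity(t, suffix_trie(e[0])))
    db.append((tweet, cat))
    return cat

db = [("the candidates debate on tv tonight", "politics"),
      ("new smartphone model announced today", "technology")]
print(classify("waiting for the debate results tonight", db))  # expected: politics
```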

The classification is not 100% accurate, sometimes for the legitimate reason that the categories overlap in their topics. Interestingly, the method can also capture and classify tweets in other languages and alphabets, as shown below. The reason is that in most languages words are built from similar patterns of suffixes, prefixes and factors. The difficulty is to find people knowledgeable in these languages in order to check the classification. Since such tweets are seldom in the sample, the accuracy is not expected to be high. However, it should be noted that the URLs included in the tweets are expanded in the joint complexity calculation (based on the JSON files), and sometimes this amplifies the joint complexity and helps bridge between languages.

In short, joint complexity is something we can qualify as "AI for the poor". It requires little processing power and no heavy, lengthy learning cycles. In fact it captures similarities between texts of the same language without specific training. Thanks to its reactivity it can be used to detect bursty trends that appear during misinformation campaigns. Furthermore it is backed by information theory and can be used for purposes other than text classification, e.g. DNA similarity detection, as originally proposed by Alberto Apostolico.