Quantitative linguistics

How can one recognize that one is dealing with a text without understanding the text itself? This question is related to that of distinguishing a meaningful text (written in an unknown system) from a meaningless set of symbols. Recently we approached these issues by comparing features of the first half of a text with those of its second half; see here for the paper. This comparison uncovers hidden effects, because the two halves share the values of many parameters (style, genre, author's vocabulary, etc.). The first half is lexically richer, has longer and less repetitive words, more and shorter sentences, more punctuation signs, and more paragraphs. These differences between the halves point to a higher hierarchical level of text organization that has so far gone unnoticed in text linguistics.
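The half-versus-half comparison can be illustrated with a minimal sketch. It is not the procedure of the paper: the function names `lexical_stats` and `compare_halves`, the split by word count, and the two chosen features (type-token ratio as a rough proxy for lexical richness, and mean word length) are assumptions made here for illustration only.

```python
import re


def lexical_stats(words):
    """Return simple lexical statistics for a list of words."""
    if not words:
        return {"tokens": 0, "type_token_ratio": 0.0, "mean_word_length": 0.0}
    return {
        "tokens": len(words),
        # Share of distinct words: a crude proxy for lexical richness.
        "type_token_ratio": len(set(words)) / len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }


def compare_halves(text):
    """Split a text into two halves by word count; return stats for each half."""
    # Runs of letters only (Unicode-aware); digits and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    middle = len(words) // 2
    return lexical_stats(words[:middle]), lexical_stats(words[middle:])


if __name__ == "__main__":
    sample = "alpha beta gamma delta one one one one"
    first, second = compare_halves(sample)
    print("first half: ", first)
    print("second half:", second)
```

On a real text one would compare the two dictionaries feature by feature; sentence, punctuation and paragraph counts could be added in the same way.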
For several years Weibing Deng and I studied the statistical distribution of phonemes. The initial purpose was to look at rank-frequency relations of phonemes. But the final results reported in

Weibing Deng and Armen E. Allahverdyan, Stochastic model for phonemes uncovers an author-dependency of their usage, PLoS ONE 11(4): e0152561

went far beyond our expectations. Here is a summary of this research.

One expects that the study of the hierarchic structure of language (phoneme, syllable, morpheme, word ...) may be inspired by methods employed for physical systems (atoms, molecules, ...). This motivates us to look at the smallest linguistic unit, i.e. the phoneme. Phonemes (generalized sounds) are the minimal building blocks of language related to the expression of meaning. The concept of the phoneme emerged in the Greek and Indian (Vedic) traditions together with atomic ideas, and since then the phoneme-atom analogy has been frequently invoked in quantitative linguistic reasoning.

We present two main results:

-- Rank-frequency relations for phonemes are well described by the simplest analogue of the ideal gas model (known as the Dirichlet distribution in mathematical statistics). Recall that ideal gas models are a fruitful approach for deducing bulk (thermodynamic) features of matter from the properties of atoms and molecules. Thus we quantitatively validate the phoneme-atom analogy.

-- The above description shows that the phoneme distribution in a text depends on the author of the text. This dependence is expressed via the single parameter of the Dirichlet distribution, which is akin to the (inverse) temperature of statistical physics.
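To make the role of the single parameter concrete, here is a minimal sketch of how a rank-frequency curve can be generated from a Dirichlet distribution. It assumes a symmetric Dirichlet density with one parameter, called `beta` here; the number of phonemes (`n=40`), the sample size and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def sample_symmetric_dirichlet(n, beta, rng):
    """Draw one probability vector of length n from a symmetric Dirichlet(beta)."""
    # Standard construction: normalize independent Gamma(beta, 1) variates.
    draws = [rng.gammavariate(beta, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_rank_frequency(n, beta, samples=2000, seed=0):
    """Average the ranked (sorted) frequencies over many Dirichlet draws."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(samples):
        ranked = sorted(sample_symmetric_dirichlet(n, beta, rng), reverse=True)
        for rank, freq in enumerate(ranked):
            totals[rank] += freq
    return [t / samples for t in totals]


if __name__ == "__main__":
    # n=40 phonemes is an illustrative assumption, not a value from the paper.
    for beta in (0.5, 1.0, 5.0):
        curve = mean_rank_frequency(n=40, beta=beta)
        print(f"beta={beta}: top three ranks", [round(f, 3) for f in curve[:3]])
```

In this sketch a smaller `beta` gives a more uneven curve (a few very frequent phonemes), while a larger `beta` flattens it. Fitting `beta` to the observed phoneme frequencies of a text would then give one number per text that can be compared across authors.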

This author-dependence contrasts sharply with Zipf's law. We substantiated the result by alternative methods. We also show that the effect is not caused by the author's vocabulary (common words used in different texts by the same author). Hence it possibly implies that phonemes are stored by authors. This effect provides a statistical argument for the psychological reality of phonemes, which is a long-standing and largely open problem in cognitive linguistics.

I am not the first (and definitely not the last) to be fascinated by the magic of Zipf's (Estoup-Zipf) law for the words in a text. Many works are devoted to explaining this law, but it is still unclear whether it is only a statistical regularity or whether it has deeper relations with the information-carrying structures of the text.

Together with Weibing Deng and Q.A. Wang we tried to behave as physicists and derive this law from the underlying reality: the mental lexicon of the author who is producing the text; see here. Our derivation provides more than Zipf's law, since it also describes the distribution of rare words (hapax legomena).
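For readers who want to look at these quantities in their own texts, here is a minimal sketch that measures the rank-frequency relation of words, counts hapax legomena, and estimates a Zipf exponent by a plain least-squares fit on the log-log scale. This only measures the law; it is not the derivation from the mental lexicon described above. The function names and the fitting method are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word frequencies sorted from the most to the least frequent."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def hapax_count(frequencies):
    """Number of words that occur exactly once (hapax legomena)."""
    return sum(1 for f in frequencies if f == 1)


def zipf_exponent(frequencies):
    """Least-squares slope of log(frequency) versus log(rank), sign flipped.

    Requires at least two distinct words. A value close to 1 corresponds
    to the classical form of Zipf's law, frequency ~ 1 / rank.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


if __name__ == "__main__":
    # Toy text with frequencies 12, 6, 4, 3, i.e. exactly 12 / rank.
    toy = " ".join(["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3)
    freqs = rank_frequency(toy)
    print(freqs, hapax_count(freqs), zipf_exponent(freqs))
```

In real texts the fit is usually restricted to a range of ranks, since the rare words at the tail follow a different pattern; that is the part of the distribution the hapax legomena belong to.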

We studied rank-frequency relations for Chinese characters; see here. This fascinating topic is close to sinology (i.e. to the discipline of understanding Chinese culture and thought); e.g. it can be useful for understanding whether and to what extent characters are similar to words. It turned out that the issue of Zipf's law for Chinese characters is really non-trivial and can help to answer the above question. Our paper also contains a detailed comparison of alphabetic and character-based writing systems.
