Quantitative linguistics

Weibing Deng and Armen E. Allahverdyan, Stochastic model for phonemes uncovers an author-dependency of their usage, PLoS ONE 11(4): e0152561 

went far beyond our expectations. Here is a summary of this research.

One expects that the study of the hierarchical structure of language (phoneme, syllable, morpheme, word, ...) can draw on methods developed for physical systems (atoms, molecules, ...). This motivates us to look at the smallest linguistic unit, the phoneme. Phonemes (generalized sounds) are the minimal building blocks of language that still relate to the expression of meaning. The concept of the phoneme emerged in the Greek and Indian (Vedic) traditions together with atomic ideas, and the phoneme-atom analogy has been frequently invoked in quantitative linguistic reasoning ever since.

We present two main results:

-- Rank-frequency relations for phonemes are well described by the simplest analogue of the ideal gas model (known in mathematical statistics as the Dirichlet distribution). Recall that ideal gas models are a fruitful approach for deducing bulk (thermodynamic) features of matter from the properties of atoms and molecules. We thus validate the phoneme-atom analogy quantitatively.

-- This description shows that the phoneme distribution in a text depends on the author of the text. The dependence is expressed through the single parameter of the Dirichlet distribution, which is akin to the (inverse) temperature of statistical physics.
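
To make the role of the single Dirichlet parameter concrete, here is a minimal sketch, not the fitting procedure of the paper. It assumes a symmetric Dirichlet distribution over a hypothetical inventory of 40 phonemes, draws frequency vectors from it, sorts each draw by rank, and averages. The names sample_dirichlet, mean_rank_frequencies and beta are illustrative choices, and the values of beta are arbitrary.

```python
import random


def sample_dirichlet(n, beta, rng):
    """Draw one frequency vector from a symmetric Dirichlet(beta, ..., beta).

    Uses the standard construction: normalize n independent Gamma(beta, 1) draws.
    """
    draws = [rng.gammavariate(beta, 1.0) for _ in range(n)]
    total = sum(draws)
    return [x / total for x in draws]


def mean_rank_frequencies(n_phonemes, beta, n_samples=500, seed=0):
    """Average rank-ordered frequencies over many Dirichlet draws.

    n_phonemes and beta are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)
    sums = [0.0] * n_phonemes
    for _ in range(n_samples):
        # Rank 1 is the most frequent phoneme in this draw.
        ranked = sorted(sample_dirichlet(n_phonemes, beta, rng), reverse=True)
        for rank, freq in enumerate(ranked):
            sums[rank] += freq
    return [s / n_samples for s in sums]


# Compare a small and a large value of the parameter.
for beta in (0.5, 3.0):
    top = mean_rank_frequencies(40, beta)[:3]
    print(beta, [round(f, 3) for f in top])
```

In this sketch a larger beta gives a flatter rank-frequency curve and a smaller beta concentrates the probability on a few top-ranked phonemes, which is the sense in which one parameter can distinguish one rank-frequency profile from another.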

This author-dependence contrasts sharply with Zipf's law, the rank-frequency relation for words, which holds regardless of the author. We substantiated the result by alternative methods. We also show that the effect is not caused by the author's vocabulary (common words used in different texts by the same author). It therefore possibly implies that phonemes themselves are stored by authors. This effect provides a statistical argument for the psychological reality of phonemes, which is a long-sought and to a large extent open problem in cognitive linguistics.

Together with Weibing Deng and Q.A. Wang, we tried to behave as physicists and derive Zipf's law from the underlying reality: the mental lexicon of the author who produces the text; see here. Our derivation provides more than Zipf's law, since it also describes the distribution of rare words (hapax legomena).
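
The two empirical quantities mentioned above, the rank-frequency relation of words and the hapax legomena (words that occur exactly once in a text), can be extracted from any text with a few lines of code. The sketch below is only an illustration of these definitions and not the derivation itself; the function name rank_frequency and the simple lowercase tokenizer are assumptions made for the example.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank-ordered word frequencies, sorted list of hapax legomena).

    Tokenization is deliberately crude (lowercase runs of letters and
    apostrophes); a real study would need a proper tokenizer.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    # Frequencies ordered from rank 1 (most frequent) downwards.
    freqs = [c / total for c in sorted(counts.values(), reverse=True)]
    # Hapax legomena: words that occur exactly once.
    hapax = sorted(w for w, c in counts.items() if c == 1)
    return freqs, hapax


freqs, hapax = rank_frequency("the cat and the dog and the bird")
print(freqs)
print(hapax)
```

Zipf's law states that the frequency at rank r falls off roughly as 1/r. A toy sentence like the one above is far too short to show this, so a meaningful check requires a text of realistic length.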