Zipf stated an empirical law to model the fact that many types of data studied in the physical and social sciences can be approximated by a Zipfian distribution. The law is not a precise description of the phenomena, but the distribution works as a good approximation. The Zipfian distribution belongs to a family of related discrete power-law probability distributions (Figure 1; see Note) [2].
There is no theoretical proof that Zipf's law applies to most languages [8], but Wentian Li [3] provided empirical evidence supporting its validity in the domain of language. The demonstration was simple: Li generated a document by choosing each character at random from a uniform distribution over the letters and the space character. The frequencies of the "words" in that document (the strings delimited by spaces) follow the general trend of Zipf's law.
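The experiment described above can be sketched in a few lines of Python. This is only an illustration, not Li's original procedure: the 26-letter lowercase alphabet, the text length, the fixed seed, and the function names `random_text` and `word_counts` are all assumptions made for the example.

```python
import random
import string
from collections import Counter

# Assumption: 26 lowercase letters plus the space character, all equally likely.
ALPHABET = string.ascii_lowercase + " "


def random_text(n_chars, seed=0):
    """Return n_chars characters drawn uniformly at random from ALPHABET."""
    rng = random.Random(seed)
    return "".join(rng.choices(ALPHABET, k=n_chars))


def word_counts(text):
    """Count the 'words', i.e. the maximal runs of non-space characters."""
    return Counter(text.split())


counts = word_counts(random_text(500_000))
ranked = counts.most_common()  # (word, frequency) pairs, most frequent first

# Print frequency against rank at a few logarithmically spaced ranks.
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, freq = ranked[rank - 1]
        print(rank, repr(word), freq)
```

On a log-log plot of frequency against rank, the output forms a staircase (one step per word length) whose overall slope follows the Zipf-like trend mentioned in the text.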
According to some authors, this linguistic phenomenon is the consequence of a natural conservation of effort: speakers and hearers minimize the work needed to reach understanding, which results in an approximately equal distribution of effort consistent with the observed Zipf distribution [4].
Language is a natural instrument for representation and communication [5]; in fact, correspondences can be established between language and social activities. Language therefore becomes a particularly interesting and promising domain for the exploration and indirect analysis of social activity, and it offers a way to understand how humans conceptualize.
As a consequence, human actions and abstract objects built in the mind are encoded into certain word patterns. A good understanding of natural language should therefore lead to an understanding of reasoning.
It has been shown that the meaning of a word is directly related to its distribution and location in context [6]. A word's position is also related to its thematic importance and its usefulness as a keyword [7].
Although syntax rules are fairly restrictive, they have enough flexibility to encode this kind of information through word recurrence, distribution and position. It is therefore reasonable to think that morphosyntactic analysis strongly supports "views of human conceptual structure".
From this point of view, all concepts, no matter how abstract, directly or indirectly engage contextually specific experience, and tracing language in the ever larger digital databases of human communications can be a most promising tool for tracing human and social dynamics. Thus, morphosyntactic analysis offers a new and promising tool for the study of dynamic social interaction [5].
[1] D. López De Luise. Morphosyntactic Linguistic Wavelets for Knowledge Management. In: Intelligent Systems. InTech. ISBN 979-953-307-593-7.
[2] Wolfram Research, Inc. Zipf's Law. In: Wolfram MathWorld, 2011. Available from http://mathworld.wolfram.com/ZipfsLaw.html
[3] W. Li. Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution. IEEE Trans. on Information Theory 38 (6): 1842–1845. ISSN 0018-9448. USA.
[4] R. Ferrer & R.V. Sole. Least effort and the origins of scaling in human language. Proc. of the National Academy of Sciences of the United States of America 100 (3): 788–791. ISSN 0027-8424. USA.
[5] E.G. Altmann, J.B. Pierrehumbert & A.E. Motter. Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS ONE 4 (11): e7678. ISSN 1932-6203.
[6] D. López De Luise & M. Soffer. Automatic Text processing for Spanish Texts. CERMA 2008. ISBN 978-0-7695-3320. Mexico.
[7] D. López De Luise. Mejoras en la usabilidad de la Web a través de una estructura complementaria. PhD thesis. Universidad Nacional de La Plata. Argentina.
[8] L. Brillouin. La science et la théorie de l'information. Masson, Paris. Open Library. ISBN 10-2876470365.
Note: In the English language, the probability of encountering the r-th most common word is given roughly by P(r) = 0.1/r, for r up to about 1000.
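As a rough illustration of the formula in the note, the sketch below evaluates P(r) = 0.1/r at a few ranks and then finds the rank at which the cumulative probability would first exceed 1. The function names `zipf_english` and `first_rank_exceeding` are illustrative only and do not come from the cited sources.

```python
def zipf_english(rank):
    """Approximate probability of the rank-th most common English word: 0.1 / rank."""
    return 0.1 / rank


def first_rank_exceeding(total=1.0):
    """Return the smallest rank at which the cumulative sum of 0.1/r exceeds total."""
    cumulative = 0.0
    rank = 0
    while cumulative <= total:
        rank += 1
        cumulative += zipf_english(rank)
    return rank


# Probability assigned to a few individual ranks.
for r in (1, 2, 10, 100, 1000):
    print(r, zipf_english(r))

# Rank beyond which the summed probabilities would be greater than 1.
print(first_rank_exceeding(1.0))
```

Because the harmonic series diverges, the cumulative sum passes 1 a little beyond rank 12,000, so the approximation can only hold over a limited range of ranks.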