PhD project

An information-theoretic approach to language complexity: variation in naturalistic corpora

Supervised by Benedikt Szmrecsanyi

This study contributes to the typological-sociolinguistic complexity debate, which was triggered by challenges (Kusters 2003; McWhorter 2001) to the assumption that all languages are, on the whole, equally complex (e.g. Hockett 1958). A substantial body of research now suggests that languages and language varieties can and do differ in their complexity (e.g. Koplenig et al. 2017; Siegel et al. 2014; Kortmann and Szmrecsanyi 2012). However, most of this research applies complexity metrics that rely on either subjective or empirically expensive means of measuring complexity.

Against this backdrop, I explore the use and applicability of Kolmogorov complexity as a complexity metric in naturalistic corpora. Kolmogorov complexity can be conveniently approximated with compression algorithms; it measures the information content, or complexity, of texts in terms of how predictable new text passages are on the basis of previously seen passages. In essence, texts that can be compressed more efficiently are linguistically less complex. In combination with various distortion techniques, the measure can be used to assess complexity at the morphological and syntactic levels.
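
As a rough illustration of the idea, the following Python sketch approximates Kolmogorov complexity by the ratio of compressed to uncompressed text size. It is a minimal sketch under stated assumptions, not the procedure used in this study: the choice of zlib as compressor, the function names, the 10% deletion rate and the two distortion helpers (random deletion of characters and of word tokens) are all illustrative.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (UTF-8 bytes).

    A lower ratio means the text is more predictable from what
    has already been seen, i.e. less complex in this sense.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    return len(zlib.compress(raw, 9)) / len(raw)


def delete_characters(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Assumed character-level distortion: drop each character with probability `rate`."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() >= rate)


def delete_words(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Assumed word-level distortion: drop each whitespace-separated token with probability `rate`."""
    rng = random.Random(seed)
    return " ".join(tok for tok in text.split() if rng.random() >= rate)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log " * 50
    print("original:       ", compression_ratio(sample))
    print("chars distorted:", compression_ratio(delete_characters(sample)))
    print("words distorted:", compression_ratio(delete_words(sample)))
```

Comparing the ratio of a text before and after distortion gives a relative indication of how much the compressor relied on word-internal regularities (character deletion) or on word-order regularities (word deletion). In practice such measurements would be repeated over many random samples rather than taken from a single run.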

To date, Kolmogorov complexity has only been used to explore cross-linguistic complexity variation in parallel corpora (e.g. Juola 2008). This study is the first to use Kolmogorov complexity to assess complexity variation in naturalistic corpora, drawing on resources such as ICLE (International Corpus of Learner English) and the BNC (British National Corpus). For example, it is shown that the complexity of written BNC registers varies along the involved-abstract dimension established by Biber (1988): more informal registers (e.g. emails, letters) exhibit less Kolmogorov complexity than formal registers (e.g. newspapers).

Thus, this study presents an innovative methodology in corpus linguistics for measuring complexity variation in naturalistic text samples and demonstrates that algorithmic measurements yield linguistically interpretable results that are in line with what more orthodox methods would lead one to expect. On a theoretical plane, I contribute to solving the issue of finding a more generally applicable complexity metric and, at the same time, provide an economical means of measurement.

Funding: