Measuring tree-likeness of linguistic datasets

Luke Maurits: Measuring tree-likeness of linguistic datasets (Abstract)

2020-10-09

Sophisticated tree-based statistical models of language evolution are increasingly being applied to both lexical (cognate) and typological linguistic datasets, often with only a passing concern for whether a tree is the most appropriate model for the data in question. Alternative historical structures have been described in the linguistics literature (waves, chains, linkages), but these have not been widely implemented as formal probabilistic models. Even if they had been, formal model comparison can be very computationally expensive. Are there quick and easy ways to reliably assess how tree-like a linguistic dataset is? What does "tree-likeness" mean, anyway? How many ways are there to be "non-tree-like", and can we tell them apart? I will present some results from a paper currently under revision that investigates these questions, using a quantitative tool from bioinformatics (TIGER rates) and some simple non-tree-based generative models for cognate datasets.