Studying variation in Australian English: why we need a language data commons

Simon Musgrave & Michael Haugh

Recent research has moved us beyond the position, received from Mitchell and Delbridge, that there is little regional variation in Australian English. Projects such as AusTalk are starting to provide reliable information on the phonetic differences associated with different parts of the country. We also have some information about lexical differences, especially for the kind of shibboleth items examined in the Mapping Words Around Australia project. But we still know very little about more subtle lexical variation, let alone any syntactic and pragmatic variation. The purpose of this paper is to suggest that this gap stems from a shortage of resources suitable for investigating such questions, and to give an idea of what sort of data might be needed to address them.

One of the best data sets currently available is the Australian newspaper articles corpus compiled by linguists from Lancaster University, which consists of almost 7.4 million words of material from Australian newspapers collected over twelve months. Source publications are recorded in the metadata, so material can be assigned to state-based sub-corpora. But comparing the NSW and Victorian collections with each other using keyword analysis tells us very little about language, because the differences that surface are topical: Victoria had an election in the relevant period and NSW did not, and readers in different states have different sporting interests. Comparing each of the state-based collections with the national material gave the only linguistically interesting finding of this very preliminary investigation: journalists writing for the more local publications use personal pronouns more than those writing for the national publications.
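To make the keyword comparison concrete, the following is a minimal sketch of one common way to compute keyness between two sub-corpora. It assumes the log-likelihood statistic, which is widely used for keyword analysis; the abstract does not say which statistic or tool was actually used. The function names and the toy token lists are hypothetical and are not drawn from the newspaper corpus.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness for one word across two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total token counts of corpus A and corpus B.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keyness_table(tokens_a, tokens_b):
    """Return (word, signed score) pairs, sorted from most A-typical to most B-typical.

    The sign is positive when the word is relatively more frequent in A
    and negative when it is relatively more frequent in B.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    rows = []
    for word in set(counts_a) | set(counts_b):
        a, b = counts_a[word], counts_b[word]
        score = log_likelihood(a, b, size_a, size_b)
        if a / size_a < b / size_b:
            score = -score
        rows.append((word, score))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    # Invented toy data standing in for two state-based sub-corpora.
    corpus_a = "the election result was announced and the election debate followed".split()
    corpus_b = "the league final was played and the league crowd cheered".split()
    for word, score in keyness_table(corpus_a, corpus_b):
        print(f"{word:12s} {score:8.3f}")
```

On data like this, the highest-scoring items are topic words, which illustrates why a raw keyword comparison between states tends to surface elections and sport rather than regional differences in usage.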

These rather disappointing results suggest that much more data is needed to investigate regional variation. One model for assembling the kinds of large datasets required for studying regional variation in Australian English, and much else besides, is to create a data commons that allows us to share language data more systematically, an approach currently being developed in partnership with the Australian Research Data Commons (ARDC).