Possible master's projects

UPD: In spring 2024 I will be on partial parental leave. This term, I will have time only for projects related to Topic 1.
If you are interested in any of the topics below or other topics in computational approaches to language change, sociolinguistics and typology, please feel free to contact me. Requirements for all topics: basic coding skills, some familiarity with linguistics (if you are an MLT student at GU, I am sure you have both).

1. Language variation and change in social media

Background: You will study short-term language change in Swedish using social media data (discussion forums Familjeliv and Flashback; blogs; Twitter). 

Topic A: In a famous paper, Danescu-Niculescu-Mizil et al. showed that active members of online communities (discussion forums) constantly adjust their language to match the norms of the specific community. If they stop doing so, it's a reliable sign that they will soon leave the community. You will test whether their results hold for Swedish data.

Topic B: In 2021, Würschinger showed that in order to predict the fate of a new word, you have to look not only at how frequent it is, but also at how much it has diffused through the social network. You will reproduce (and maybe improve?) some of his methods on Swedish data (cf. a small pilot study by Berdicevskis et al.).

Topic C: In 2021, my student Viktor Erbro at Chalmers successfully defended his master's thesis (which later became a paper), where he showed that there is a correlation between how close two speakers are in the network and how similar their linguistic production is. You will continue his work, exploring other factors that may affect linguistic convergence and improving methods for measuring linguistic similarity and reconstructing the social network.

Topic D: anything else related to the purposes of the Cassandra project: lexical and grammatical innovations in Swedish; language variation and attitudes towards it; neologisms; social networks -- you name it.

Good to know: Swedish (you don't have to be a native or an advanced speaker, but you should be able to read texts).

2. Sentiment analysis and social networks

(See Milan Stanišić's thesis, which addresses related questions.)

Background: Birds of a feather flock together. In social science, it is well known that people who are similar are usually closer to each other in social networks (the homophily principle).

Topic: You will use a dataset in which several thousand discussion forum (Flashback) posts are labeled with the attitude they express towards immigration in Sweden. You will match the posts to their authors, estimate each author's attitude towards immigration, and then analyze the social connections between the authors to see if those who have similar views are closer to each other. You can also analyze their linguistic styles in order to see whether those who agree write more similarly to each other than those who do not.
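To give a rough idea of what such an analysis could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the label coding (-1 for negative, +1 for positive), the toy posts and reply edges, and the function names are invented and do not describe the actual dataset or the method used in the thesis mentioned above. The sketch averages post labels per author, measures the mean attitude gap across connected author pairs, and compares it with the gaps obtained when attitudes are shuffled across authors.

```python
import random
from collections import defaultdict
from statistics import mean

# Toy data, invented for illustration: (author, label) pairs,
# with -1 = negative and +1 = positive attitude (assumed coding).
POSTS = [("a", -1), ("a", -1), ("b", -1), ("c", -1), ("c", -1),
         ("d", 1), ("e", 1), ("e", 1), ("f", 1)]

# Hypothetical undirected "who replied to whom" edges between authors.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]


def author_attitudes(posts):
    """Estimate each author's attitude as the mean label of their posts."""
    by_author = defaultdict(list)
    for author, label in posts:
        by_author[author].append(label)
    return {author: mean(labels) for author, labels in by_author.items()}


def mean_edge_gap(edges, attitude):
    """Mean absolute attitude difference across connected author pairs."""
    return mean(abs(attitude[a] - attitude[b]) for a, b in edges)


def shuffled_gaps(edges, attitude, n_perm, rng):
    """Baseline: reassign attitudes to authors at random, keep the network."""
    authors = list(attitude)
    values = [attitude[a] for a in authors]
    gaps = []
    for _ in range(n_perm):
        rng.shuffle(values)
        gaps.append(mean_edge_gap(edges, dict(zip(authors, values))))
    return gaps


if __name__ == "__main__":
    attitude = author_attitudes(POSTS)
    observed = mean_edge_gap(EDGES, attitude)
    baseline = shuffled_gaps(EDGES, attitude, 1000, random.Random(0))
    share_smaller = mean(1 if g <= observed else 0 for g in baseline)
    print("observed gap:", observed)
    print("share of shuffles with a gap this small:", share_smaller)
```

If homophily holds, the observed gap should be smaller than most of the shuffled ones. A real analysis would also have to decide how to build the network in the first place (replies, quotes, shared threads) and how to treat authors with very few labeled posts.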

3. Learner language complexity

(See Nadina Subitu's thesis, which addresses these questions.)

Background: You will use second-language-learning resources available at Språkbanken Text (e.g. Dalaj and Swell) to study factors that affect language complexity. 

Topic: You will analyze texts written by learners of Swedish in order to see whether the complexity of their production (as gauged by different measures) is affected by the complexity of their native language, the genetic and typological distance between Swedish and their native language, the number of years spent in Sweden, the age of exposure and other factors.
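As a hedged illustration of what "different measures" could mean, the sketch below computes two very simple ones: type-token ratio and mean sentence length. These are chosen only because they are easy to show; they are not necessarily the measures used in the thesis mentioned above, and the tokenizer, the sentence splitter and the example text are simplifying assumptions.

```python
import re


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


if __name__ == "__main__":
    # Invented example text, not taken from any learner corpus.
    sample = "Jag bor i Sverige. Jag läser svenska."
    print(type_token_ratio(tokenize(sample)))
    print(mean_sentence_length(sample))
```

Note that type-token ratio depends heavily on text length, so comparing learner texts of different lengths would require a length-robust variant or fixed-size samples.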

Joint supervision with Elena Volodina.

4. Who did what to whom? Quantifying the role of morphology and word order in core argument marking

Background: Core arguments (subject and object) can be marked by either morphological means (nominal cases, verb agreement etc., as in Finnish or Russian) or syntactic means (word order, as in Swedish or English), and there is a trade-off between morphological and syntactic marking (Sinnemäki 2014; Futrell et al. 2015; Levshina 2019; Berdicevskis & Piperski 2020): languages tend to have either rich morphology or rigid word order.

Topic: The purpose of this project is to quantify the extent to which different languages rely on morphology vs syntax. It will be done by training state-of-the-art parsers on Universal Dependencies treebanks using ablation techniques (Berdicevskis & Eckhoff 2016): artificially removing either morphological or syntactic cues and testing how the removal affects the parser's performance.
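To make the idea of ablation concrete, here is a minimal sketch of how cues could be removed from a sentence in CoNLL-U format before training a parser. It is an illustration under simplifying assumptions, not the procedure from the cited paper: morphology is "removed" only by blanking the FEATS column (inflected word forms are left untouched), word order is removed by shuffling tokens within a sentence, and multiword-token ranges and empty nodes are simply dropped. The example sentence and its feature values are made up.

```python
import random

# Column indices in the 10-column CoNLL-U format.
ID, FEATS, HEAD, DEPS = 0, 5, 6, 8

# Invented example sentence; one string per line, tab-separated columns.
EXAMPLE = [
    "# text = Hunden jagar katten",
    "1\tHunden\thund\tNOUN\t_\tDefinite=Def|Number=Sing\t2\tnsubj\t_\t_",
    "2\tjagar\tjaga\tVERB\t_\tTense=Pres\t0\troot\t_\t_",
    "3\tkatten\tkatt\tNOUN\t_\tDefinite=Def|Number=Sing\t2\tobj\t_\t_",
]


def ablate_morphology(sentence):
    """Blank the FEATS column of every token line; keep comments as-is."""
    out = []
    for line in sentence:
        if line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        cols[FEATS] = "_"
        out.append("\t".join(cols))
    return out


def ablate_word_order(sentence, rng):
    """Shuffle tokens within the sentence, keeping the dependency tree.

    IDs are renumbered 1..n in the new order and HEAD values are remapped
    so that every token still points at the same head word. Comment lines
    (such as "# text") are kept but no longer match the token order.
    """
    comments = [line for line in sentence if line.startswith("#")]
    tokens = [line.split("\t") for line in sentence if not line.startswith("#")]
    # Simplification: keep only ordinary tokens with plain integer IDs.
    tokens = [cols for cols in tokens if cols[ID].isdigit()]
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    new_id = {cols[ID]: str(i) for i, cols in enumerate(shuffled, start=1)}
    new_id["0"] = "0"  # the artificial root keeps its ID
    out = []
    for i, cols in enumerate(shuffled, start=1):
        cols = cols[:]
        cols[HEAD] = new_id.get(cols[HEAD], cols[HEAD])
        cols[ID] = str(i)
        cols[DEPS] = "_"  # enhanced dependencies would be stale, so drop them
        out.append("\t".join(cols))
    return comments + out


if __name__ == "__main__":
    print("\n".join(ablate_morphology(EXAMPLE)))
    print()
    print("\n".join(ablate_word_order(EXAMPLE, random.Random(0))))
```

One would then train the same parser on the original, the morphology-ablated and the order-ablated version of a treebank and compare how much accuracy drops in each condition.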

Good if you know something about: treebank annotation; parsing; linguistic typology.

5. Large-scale corpus analysis of the complexity of non-native speech

Background: Linguistic complexity has been a prominent topic in typology, sociolinguistics and language evolution studies in the past couple of decades (see Bentz & Berdicevskis 2016 for a brief summary, see Trudgill 2011 for a book-length account). Importantly, the presence of non-native speakers in the population is often hypothesized to facilitate language simplification. Much, however, is still unclear about the potential mechanisms of simplification, and one underexplored area is cross-linguistic corpus-based studies of non-native speech.

Topic: In this project, you will use the WordReference Corpus (Berdicevskis 2020), a unique corpus of native and non-native natural production in four languages (English, Spanish, French, Italian), to address various questions about the complexity of non-native speech (see Berdicevskis 2020 for an example).

Good if you know something about: processing noisy text (scraped from the web).