Linguistic Variation in Online Communities:
A Computational Perspective
The same word can be used by different people to mean different things. This variation in meaning is not random, but is determined by the social characteristics of the speakers who use the word. A crucial factor is the community an individual belongs to. This thesis investigates meaning variation in online communities of speakers with a twofold goal: providing an empirical account of the phenomenon in online settings, and leveraging it to improve the performance of NLP models.
I build on theoretical frameworks from Linguistics and Sociolinguistics that describe meaning variation in offline communities. To investigate variation using digital data derived from online communities, I leverage the tools and methodologies developed in the fields of Natural Language Processing and Computational Linguistics.
The thesis consists of two main parts. The first part focuses on the general research question: how can meaning variation in online communities of speakers be identified and represented? This part includes three descriptive studies that address this question from different points of view. Initially, I investigate meaning variation from a synchronic perspective, introducing a methodology to represent how word meaning varies in online communities. Subsequently, I consider the diachronic dimension, focusing both on the process of meaning shift that leads to the observed variation, and on the social dynamics underpinning this process.
In the second part, I take a task-oriented approach and address the research question: how can social information be used to improve the performance of NLP models? I address this question in two studies. In the first, I show that information from a user's connections on a social media platform can be leveraged to obtain better results in tasks involving the classification of user-generated texts. In the second, I show that the language produced by users on social media provides highly valuable information for the task of fake news detection.
Overall, this dissertation presents an extensive study of meaning variation in online communities of speakers, making two main contributions. On the one hand, it provides empirical confirmation of the findings of traditional sociolinguistic studies, together with new theoretical insights about meaning variation in online communities of speakers. On the other hand, it introduces new models and methodologies which, by leveraging information about the social context where language is produced, help to improve the performance of NLP systems for text classification.