Tutorial title:
Uncovering the Linguistic Features of LLM-generated Text
Tutorial description:
Large language models are increasingly used for automatic text generation. The language they generate seems remarkably natural and blends in with human-produced language with surprising ease. A growing body of research now aims to identify the linguistic features that distinguish LLM-produced text from text written by humans, and broader patterns of language use are beginning to emerge from recent studies. The tutorial will summarize the major recent findings for participants, building a wider overview of the linguistic tendencies found in LLM-generated text. Participants will also take part in a collaborative corpus construction session, during which they will produce short texts to be included in a corpus of human-written language. In the hands-on session that follows, the newly constructed corpus of human-written texts will be compared with a corpus of LLM-generated texts using the linguistic measures most commonly described in the recent literature. This session will enable participants to put the most recent findings to the test and discover first-hand the linguistic features commonly produced by large language models.
Please prepare for the tutorial by opening the links below.
Please note that the answers will be collected via a Google Sheets document, so you will need a Google account to participate in the interactive part of the tutorial. If you wish to copy and run the code used for the data analysis, you will also need a Kaggle account. Please also note that access to the first link (the Google Sheets document) will be granted during the tutorial; you will not be able to open it beforehand.
Materials
Google Sheets document where we will be writing our answers:
https://docs.google.com/spreadsheets/d/1RKu6q9SpAMZZLRSdrntiRO2hl_7GWIuHNLQRp5oGl40/edit?usp=sharing
Short quiz:
Kaggle notebook with all the code:
https://www.kaggle.com/code/lukatercon/human-written-vs-llm-generated-text
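For a rough idea of what comparing a human-written corpus with an LLM-generated corpus can look like, here is a minimal sketch. It is not the code from the Kaggle notebook. The two measures (type-token ratio and mean sentence length) are generic surface measures chosen only for illustration and are not necessarily the ones used in the tutorial; the function names and the two tiny example corpora are hypothetical placeholders, not real data.

```python
import re
from statistics import mean


def tokenize(text):
    # Deliberately simple tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    # Distinct word forms divided by total word count (0.0 for empty text).
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    # Average number of tokens per sentence, splitting naively on . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    return mean(lengths) if lengths else 0.0


def corpus_profile(texts):
    # Average each measure over all texts in a (non-empty) corpus.
    return {
        "type_token_ratio": mean(type_token_ratio(t) for t in texts),
        "mean_sentence_length": mean(mean_sentence_length(t) for t in texts),
    }


# Hypothetical placeholder corpora, invented for this sketch only.
human_texts = ["I missed the bus again. So I walked home in the rain."]
llm_texts = ["Missing the bus can be frustrating. However, walking offers benefits."]

for label, texts in (("human", human_texts), ("llm", llm_texts)):
    print(label, corpus_profile(texts))
```

With real corpora, each list would hold many texts, and the per-corpus averages could then be compared measure by measure.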