MA thesis topics

I would be happy to supervise theses on topics related to my core or even peripheral interests. My main topic suggestions (below) involve de-identification measures for text data, but I would also be glad to hear about other ideas in that area, as well as ideas related to NLP for Polish (or potentially other Slavic languages), cognitive linguistics, diachronic linguistics, dialectology, etc.

Some of the thesis topics (or thesis ideas related to the aforementioned areas) can also be repurposed into project topics, e.g. for the Language Technology Resources (LTR) course, in which I would be interested in being involved. I also supervise Artificial Intelligence: Cognitive Systems projects (but those ideas tend to be suggested during the course).

In 2024/2025, I collaborated with a student on one LTR project:

Jacob Lee Suchardt, on using masked language models to generate pseudonyms. The project was later expanded and published as Fill-in-the-Blanks: Automatic Generation and Evaluation of Language Models' Pseudonyms for English and Swedish Texts in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026).

In 2025/2026, I supervised two MA theses:

Cristina Matacuta, Gossiping Models: Understanding Unintentional Data Disclosure in LLMs (co-supervised with Simon Dobnik, FLoV and Fazeleh Hoseini, AI Sweden)
Caroline Nathalie Jeanne Grand-Clement, Presenting ENBYS: The Europarl Non-BinarY Sentences: an English-Polish-French Dataset for Machine Translation Evaluation of Non-Binary Gender Inclusion (co-supervised with Sharid Loáiciga, FLoV)

The suggested topics all concern automatic text pseudonymization.

Pseudonymization is the process of detecting personal information in texts and replacing it with grammatically and semantically suitable pseudonyms in order to hide the author’s identity. This is one of the requirements imposed by the GDPR for sharing corpora which include information that could lead to the reidentification of the author. Developing effective pseudonymization methods can lead to more open access to research data.
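
To make the two steps (detection, then replacement) concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the regular expressions and the replacement values are hypothetical and hand-written, whereas real systems typically rely on trained taggers, and the sketch makes no attempt at keeping the replacements grammatically suitable.

```python
import re

# Hypothetical, hand-written detectors (assumption: real systems use trained models).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "zip_code": re.compile(r"\b\d{3} \d{2}\b"),  # Swedish-style postal code, e.g. "411 24"
}

# Hypothetical replacement values, one fixed pseudonym per category.
PSEUDONYMS = {
    "email": "anna.berg@example.com",
    "zip_code": "123 45",
}


def detect(text):
    """Return (start, end, label) spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def pseudonymize(text):
    """Replace each detected span with the pseudonym for its category."""
    # Work right to left so that earlier offsets stay valid after each replacement.
    for start, end, label in reversed(detect(text)):
        text = text[:start] + PSEUDONYMS[label] + text[end:]
    return text


print(pseudonymize("Mail me at kalle@mail.se or visit 411 24 Göteborg"))
```

Keeping detection and replacement as separate functions mirrors how the two steps are usually studied and evaluated separately.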

At GU, research on pseudonymization and related questions is now conducted within the Mormor Karl project (more information here), and the supervisor(s) for these theses would be project members – Maria Irena Szawerna with the assistance of Elena Volodina or Simon Dobnik as the co-supervisor. If you decide to pick a pseudonymization-related topic, you would be more than welcome to participate in some of our project meetings to present and discuss your progress.

You can find a quick bibliography below in case you would like to find out more.

Below you will find two ideas for an MA thesis, alongside a bit more reading and the suggested skill/knowledge requirements. Please feel free to contact me (Maria) if you are curious about or interested in either of them. I would also be thrilled to hear if you have your own idea that is related to pseudonymization!

Contact: maria.szawerna@gu.se

Introductory reading:

Tobias Deußer, Lorenz Sparrenberg, Armin Berger, Max Hahnbück, Christian Bauckhage, and Rafet Sifa. 2025. "A Survey on Current Trends and Recent Advances in Text Anonymization." In 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1-9. https://doi.org/10.1109/DSAA65442.2025.11247969

Lison, Pierre, Ildikó Pilán, David Sanchez, Montserrat Batet, and Lilja Øvrelid. 2021. “Anonymisation Models for Text Data: State of the Art, Challenges and Future Directions.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), edited by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, 4188–4203. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.323

Pilán, Ildikó, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, and Montserrat Batet. 2022. “The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization.” Computational Linguistics 48, no. 4 (December 2022): 1053–1101. https://doi.org/10.1162/coli_a_00458

Szawerna, Maria Irena, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Xuan-Son Vu, and Elena Volodina. 2024. “Pseudonymization Categories across Domain Boundaries.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 13303–14. Torino, Italia: ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1164

Volodina, Elena, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, Lisa Södergård, and Xuan-Son Vu. 2025. "Towards shared standards for pseudonymization of research data." In Proceedings of the Huminfra Conference (HiC 2025), Stockholm. https://hdl.handle.net/10062/118303

EXAMPLE IDEA 1: SYNTHETIC DATA FOR PSEUDONYMIZATION-RELATED TASKS (IN SWEDISH)

One of the major limiting factors for developing pseudonymization tools is the lack of accessible data containing personal information (PII). While one way to address this is to use freely available data (e.g. transcripts of official governmental or legal proceedings), doing so severely limits the variety of domains featured in the training data. Another solution is to use fully synthetic (automatically generated) data, as in Bråthen et al. (2021). Initially, it would be preferable for the synthetic data to be in Swedish and to mimic less structured types of text (social media or forum posts, personal writing, learner essays). A reliable method for generating such data that can be generalized to other languages could also be invaluable for prospective shared tasks (see here for more information on what those are).

Requirements: Knowledge of a programming language (preferably Python), English, Swedish/Norwegian/Danish (strongly preferred), possibly other languages.

Note: This topic can possibly also be adjusted to your preferred/native language.
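
To give a rough idea of what "synthetic data with annotations" can look like, here is a minimal sketch in Python. It uses simple template filling, which is a deliberately simpler technique than the generation methods discussed in the suggested reading, and the templates, filler values, and category names are all hypothetical and invented for illustration. The point is only that generated text can come with gold labels for free.

```python
import random
from string import Formatter

# Hypothetical templates and filler values, invented for illustration only.
TEMPLATES = [
    "Hej, jag heter {first_name} och bor i {city}.",
    "{first_name} jobbar som {profession} i {city}.",
]
FILLERS = {
    "first_name": ["Anna", "Erik", "Maja"],
    "city": ["Umeå", "Lund", "Borås"],
    "profession": ["lärare", "snickare"],
}


def generate(template, rng):
    """Fill one template and return (text, spans) with character-offset gold labels."""
    text, spans = "", []
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion) tuples.
    for literal, field, _, _ in Formatter().parse(template):
        text += literal
        if field is not None:
            value = rng.choice(FILLERS[field])
            spans.append((len(text), len(text) + len(value), field))
            text += value
    return text, spans


rng = random.Random(0)  # fixed seed so that the generated corpus is reproducible
for template in TEMPLATES:
    print(generate(template, rng))
```

Template filling produces very uniform text, so a thesis on this topic would most likely need more varied generation; the span format, however, could stay the same.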

Suggested reading:

Ibrahim Baroud, Christoph Otto, Vera Czechmann, Christine Hovhannisyan, Lisa Raithel, Sebastian Möller, and Roland Roller. 2026. "MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers." In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6647–6660, Palma de Mallorca, Spain. ELRA. https://doi.org/10.63317/4bzj7bdw86tn

Synnøve Bråthen, Wilhelm Wie, and Hercules Dalianis. 2021. "Creating and Evaluating a Synthetic Norwegian Clinical Corpus for De-Identification." In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), edited by Simon Dobnik and Lilja Øvrelid, 222–30. Reykjavik, Iceland (Online): Linköping University Electronic Press, Sweden. https://aclanthology.org/2021.nodalida-main.22

Maksim Savkin, Timur Ionov, and Vasily Konovalov. 2025. "SPY: Enhancing Privacy with Synthetic PII Detection Dataset." In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 236–246, Albuquerque, USA. Association for Computational Linguistics. https://aclanthology.org/2025.naacl-srw.23/

Maria Sierro, Begoña Altuna, and Itziar Gonzalez-Dios. 2024. "Automatic Detection and Labelling of Personal Data in Case Reports from the ECHR in Spanish: Evaluation of Two Different Annotation Approaches." In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024), edited by Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, and Xuan-Son Vu, 18–24. St. Julian’s, Malta: Association for Computational Linguistics. https://aclanthology.org/2024.caldpseudo-1.3

Thomas Vakili, Aron Henriksson, and Hercules Dalianis. 2025. "Data-Constrained Synthesis of Training Data for De-Identification." In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27414–27427, Vienna, Austria. Association for Computational Linguistics. https://aclanthology.org/2025.acl-long.1329/

EXAMPLE IDEA 2: LLMs FOR PERSONAL INFORMATION DETECTION

Any procedure which involves removing or replacing personal information (pseudonymization, anonymization) presupposes, in one way or another, a step in which the location of the personal information in the text is identified, i.e. personal information detection. Traditionally (though not exclusively), this is coupled with labeling: determining the specific semantic category that the element in question belongs to, e.g. surname, zip_code, or profession. Several approaches to personal information detection and labeling have been explored, but little is known about the usability of (locally run) generative large language models for this task. This topic could involve comparing several models, comparing their performance in different languages, prompt engineering, etc.

Requirements: Knowledge of a programming language (preferably Python), English, optionally another language, LLM prompting.
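
As a rough illustration of how detection and labeling output from a generative model could be evaluated, here is a minimal sketch in Python. It assumes (hypothetically) that the model has been prompted to answer with a JSON list of items with "text" and "label" keys; that response format, the example sentence, and the gold annotations are all made up for illustration. No model is called here, and exact-match scoring is only one of several possible evaluation choices.

```python
import json


def parse_response(response, text):
    """Turn a JSON list of {"text": ..., "label": ...} items into (start, end, label) spans.

    Items whose text cannot be found verbatim in the input are skipped.
    Limitation: only the first occurrence of each string is located.
    """
    spans = set()
    for item in json.loads(response):
        start = text.find(item["text"])
        if start != -1:
            spans.add((start, start + len(item["text"]), item["label"]))
    return spans


def score(predicted, gold):
    """Exact-match precision, recall, and F1 over labeled spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical input, model answer, and gold annotation.
text = "Dr Lind lives in 411 24 Göteborg"
response = '[{"text": "Lind", "label": "surname"}, {"text": "411 24", "label": "zip_code"}]'
gold = {(3, 7, "surname"), (17, 23, "zip_code"), (0, 2, "profession")}
print(score(parse_response(response, text), gold))
```

Skipping items that do not occur verbatim in the input is one simple way to handle a model that invents or rephrases content; a thesis could compare this against more lenient matching strategies.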