The suggested topics revolve around automatic text pseudonymization.
Pseudonymization is the process of detecting personal information in texts and replacing it with grammatically and semantically suitable pseudonyms in order to hide the author’s identity. This is one of the requirements imposed by the GDPR for sharing corpora which include information that could lead to the reidentification of the author. Developing effective pseudonymization methods can lead to more open access to research data.
At GU, research on pseudonymization and related questions is now conducted within the Mormor Karl project (more information here), and the supervisor(s) for these theses would be project members – Maria Irena Szawerna with the assistance of Elena Volodina or Simon Dobnik as the co-supervisor. If you decide to pick a pseudonymization-related topic, you would be more than welcome to participate in some of our project meetings to present and discuss your progress.
You can find a quick bibliography below in case you would like to find out more.
Below you will find two ideas for an MA thesis, alongside a bit more reading and the suggested skill/knowledge requirements. Please feel free to contact me (Maria) if you are curious about or interested in either of them. I would also be thrilled to hear if you have your own idea that is related to pseudonymization!
Contact: maria.szawerna@gu.se
Introductory reading:
Lison, Pierre, Ildikó Pilán, David Sanchez, Montserrat Batet, and Lilja Øvrelid. “Anonymisation Models for Text Data: State of the Art, Challenges and Future Directions.” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), edited by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, 4188–4203. Online: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.acl-long.323.
Pilán, Ildikó, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, and Montserrat Batet. “The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization.” Computational Linguistics 48, no. 4 (December 2022): 1053–1101. https://doi.org/10.1162/coli_a_00458.
Szawerna, Maria Irena, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Xuan-Son Vu, and Elena Volodina. “Pseudonymization Categories across Domain Boundaries.” In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, 13303–14. Torino, Italia: ELRA and ICCL, 2024. https://aclanthology.org/2024.lrec-main.1164.
Volodina, Elena, Simon Dobnik, Therese Lindström Tiedemann, and Xuan-Son Vu. “Grandma Karl Is 27 Years Old – Research Agenda for Pseudonymization of Research Data,” 229–33. IEEE Computer Society, 2023. https://doi.org/10.1109/BigDataService58306.2023.00047.
I am also willing to (co)supervise topics that have something to do with NLP for Polish.
EXAMPLE IDEA 1: SYNTHETIC DATA FOR PSEUDONYMIZATION-RELATED TASKS IN SWEDISH
One of the major limiting factors for developing pseudonymization tools is the lack of accessible data containing personally identifiable information (PII). One way to address this is to use freely available data (e.g. transcripts of official governmental or legal proceedings), but this severely limits the variety of domains featured in the training data. Another solution is to use fully synthetic (automatically generated) data, as in Bråthen et al. (2021). Initially it would be preferable for the synthetic data to be in Swedish and to mimic less structured types of text (social media or forum posts, personal writing, learner essays). A reliable method for generating such data that can be generalized to other languages could also be invaluable for prospective shared tasks (see here for more information on what those are).
Requirements: Some programming language (preferably Python), English, Swedish/Norwegian/Danish (strongly preferred), possibly other languages.
Note: This topic can possibly also be adjusted to your preferred/native language.
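To give a feel for what "fully synthetic" data could look like at its simplest, here is a minimal sketch of template-based generation that also records where the personal information ends up. Everything in it is an assumption made for illustration: the templates, the filler lists, the label names (firstname, city, age) and the function name generate are hypothetical and not taken from any of the cited papers. A thesis would more likely explore richer generators (e.g. language models) and a proper annotation scheme.

```python
import random
import re

# Hypothetical filler values; a real project would use much larger lists.
FILLERS = {
    "firstname": ["Karl", "Anna", "Elsa"],
    "city": ["Göteborg", "Umeå", "Lund"],
    "age": ["27", "34", "61"],
}

# Hypothetical templates imitating informal Swedish writing.
TEMPLATES = [
    "Hej! Jag heter {firstname} och bor i {city}.",
    "Min mormor {firstname} är {age} år gammal.",
]

SLOT = re.compile(r"\{(\w+)\}")


def generate(template, rng):
    """Fill every {label} slot and return (text, spans).

    Each span is (start, end, label) with character offsets into the
    generated text, so the output can serve as annotated training data.
    """
    parts = []
    spans = []
    template_pos = 0  # how far we have read in the template
    text_len = 0      # how long the generated text is so far
    for match in SLOT.finditer(template):
        literal = template[template_pos:match.start()]
        parts.append(literal)
        text_len += len(literal)

        label = match.group(1)
        value = rng.choice(FILLERS[label])
        spans.append((text_len, text_len + len(value), label))
        parts.append(value)
        text_len += len(value)

        template_pos = match.end()
    parts.append(template[template_pos:])
    return "".join(parts), spans


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for template in TEMPLATES:
        text, spans = generate(template, rng)
        print(text, spans)
```

Recording the offsets at generation time means no separate annotation step is needed, which is the main appeal of synthetic data here. The obvious weakness is that template text is far less varied than real forum posts or learner essays.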
Suggested reading:
Bråthen, Synnøve, Wilhelm Wie, and Hercules Dalianis. "Creating and Evaluating a Synthetic Norwegian Clinical Corpus for De-Identification." In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), edited by Simon Dobnik and Lilja Øvrelid, 222–30. Reykjavik, Iceland (Online): Linköping University Electronic Press, Sweden, 2021. https://aclanthology.org/2021.nodalida-main.22.
Maksim Savkin, Timur Ionov, and Vasily Konovalov. 2025. SPY: Enhancing Privacy with Synthetic PII Detection Dataset. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 236–246, Albuquerque, USA. Association for Computational Linguistics. https://aclanthology.org/2025.naacl-srw.23/.
Sierro, Maria, Begoña Altuna, and Itziar Gonzalez-Dios. "Automatic Detection and Labelling of Personal Data in Case Reports from the ECHR in Spanish: Evaluation of Two Different Annotation Approaches." In Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-Pseudo 2024), edited by Elena Volodina, David Alfter, Simon Dobnik, Therese Lindström Tiedemann, Ricardo Muñoz Sánchez, Maria Irena Szawerna, and Xuan-Son Vu, 18–24. St. Julian’s, Malta: Association for Computational Linguistics, 2024. https://aclanthology.org/2024.caldpseudo-1.3.
Vakili, Thomas, Aron Henriksson, and Hercules Dalianis. "Data-Constrained Synthesis of Training Data for De-Identification." In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 27414–27427. Vienna, Austria: Association for Computational Linguistics, 2025. https://aclanthology.org/2025.acl-long.1329/.
EXAMPLE IDEA 2: LLMs FOR PERSONAL INFORMATION DETECTION
Any procedure which involves removing or replacing personal information (pseudonymization, anonymization) presupposes, in one way or another, a step in which the location of the personal information in the text is identified, i.e. personal information detection. Traditionally (though not exclusively), this is coupled with labeling, i.e. determining the specific semantic category that the element in question belongs to, e.g. surname, zip_code, or profession. Several approaches to personal information detection and labeling have been explored, but little is known about the usability of (locally run) generative large language models for this task. This topic could involve comparing several models, comparing their performance in different languages, prompt engineering, etc.
Requirements: Some programming language (preferably Python), English, optionally another language, LLM prompting.
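As a rough illustration of the detection and labeling step described above, the sketch below shows one way the answer of a generative model could be turned into labeled spans and scored against a gold annotation. It rests on several assumptions: the model is imagined to reply with a JSON list of {"text", "label"} items, the reply is hard-coded here instead of coming from a real model, and the function names and the labels firstname and city are hypothetical (surname and profession are taken from the description above). Exact-match scoring is only one of several possible evaluation choices.

```python
import json


def parse_predictions(text, raw_output):
    """Turn a model's JSON answer into a set of (start, end, label) spans.

    Items whose text cannot be found verbatim in the input are skipped,
    since a generative model may paraphrase or invent content.
    Simplification: only the first occurrence of each string is located.
    """
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError:
        return set()
    if not isinstance(items, list):
        return set()

    spans = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        surface = item.get("text")
        label = item.get("label")
        if not isinstance(surface, str) or not isinstance(label, str):
            continue
        start = text.find(surface) if surface else -1
        if start != -1:
            spans.add((start, start + len(surface), label))
    return spans


def precision_recall_f1(gold, predicted):
    """Exact-match scores: a span counts only if offsets and label agree."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "My name is Anna Nowak and I work as a nurse in Lund."
gold = {
    (11, 15, "firstname"),
    (16, 21, "surname"),
    (38, 43, "profession"),
    (47, 51, "city"),
}

# Stand-in for a model reply; "doctor" does not occur in the text.
raw_output = (
    '[{"text": "Anna", "label": "firstname"},'
    ' {"text": "Nowak", "label": "surname"},'
    ' {"text": "Lund", "label": "city"},'
    ' {"text": "doctor", "label": "profession"}]'
)

if __name__ == "__main__":
    predicted = parse_predictions(text, raw_output)
    print(sorted(predicted))
    print(precision_recall_f1(gold, predicted))
```

Dropping items that do not occur in the input keeps invented strings out of the scores, but it also hides how often the model makes things up, so in a real comparison it might be worth counting those cases separately.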
Suggested reading:
Nikolai Ilinykh and Maria Irena Szawerna. 2025. “I Need More Context and an English Translation”: Analysing How LLMs Identify Personal Information in Komi, Polish, and English. In Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025), pages 165–178, Tallinn, Estonia. University of Tartu Library, Estonia. https://aclanthology.org/2025.resourceful-1.32/.
Zilyu Ji, Yuntian Shen, Kenneth R. Koedinger, and Jionghao Lin. 2025. Enhancing the De-identification of Personally Identifiable Information in Educational Data. Journal of Educational Data Mining, 17(2), 55-85. https://doi.org/10.5281/zenodo.17114271.
Jianliang Yang, Xiya Zhang, Kai Liang, and Yuenan Liu. 2023. Exploring the Application of Large Language Models in Detecting and Protecting Personally Identifiable Information in Archival Data: A Comprehensive Study. In 2023 IEEE International Conference on Big Data (BigData), pages 2116–2123, Sorrento, Italy. https://doi.org/10.1109/BigData59044.2023.10386949.
OTHER IDEAS: Your own take on the topic!
I would love to hear from you if you are interested in working on something related to personal information detection, labeling, pseudonym generation, or other privacy-preserving measures, including re-identification, approaches to evaluating pseudonym generation, etc.