Language Technology Resources and Tools for Persian
Mojgan Seraji
Department of Linguistics and Philology, Uppsala University
Abstract
My research contributes to the field of natural language processing by addressing key issues and challenges in the automatic morphosyntactic processing and analysis of Persian. I further explore different methods for handling noisy data to address challenges relating to Persian orthography, morphology, and syntax. The methodologies used in my work, from decisions about tokenization to the innovative analysis adopted in developing the Persian dependency treebank, are all empirically evaluated and bring new insights and ideas to the field. These methods, with their emphasis on handling variations in tokenization, may deviate from the abstract linguistic conventions used in the literature, but they can cope with difficulties that are common in user-generated texts because Persian orthography lacks a common standard. Based on these ideas, I developed a pipeline of resources and tools for Persian that can easily be applied to out-of-domain texts. The methods I used for Persian can also benefit work on other languages with similar linguistic and orthographic characteristics.
Google NLP PhD Summit, Zurich, Switzerland (Sep. 24th, 2015).