Uppsala Persian Corpus: UPC

Uppsala Persian Corpus (UPC) (Seraji, 2015, Chapter 3, pp. 68-81) is a large, freely available Persian corpus. It is a modified version of the Bijankhan corpus (Bijankhan, 2004) with additional sentence segmentation and consistent tokenization. It contains 2,704,028 tokens and is annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in this table.

Download

The corpus was developed by Mojgan Seraji (mojgan.seraji96@gmail.com) and is licensed under the GNU General Public License. The corpus can be downloaded below:

Latest release:
Previous releases:
References

1. Bijankhan, Mahmood. 2004. The Role of the Corpus in Writing a Grammar: An Introduction to a Software. Iranian Journal of Linguistics 19.
2. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16. [pdf]