The Uppsala Persian Corpus (UPC) (Seraji, 2015, Chapter 3, pp. 68-81) is a large, freely available Persian corpus. It is a modified version of the Bijankhan corpus (Bijankhan, 2004) with additional sentence segmentation and consistent tokenization. It contains 2,704,028 tokens and is annotated with 31 part-of-speech tags, which are listed with explanations in this table.
The corpus was developed by Mojgan Seraji (mojgan.seraji96@gmail.com) and is licensed under the GNU General Public License. It can be downloaded below:
Latest release:
UPC.1.2 (January 1, 2017)
Previous releases:
UPC.1.1 (June 2, 2014)
UPC.1.0 (May 20, 2012)
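After downloading a release, a quick sanity check is to count sentences, tokens, and tags. The sketch below is only an illustration: it assumes a hypothetical layout with one token per line, a tab between token and tag, and a blank line between sentences, which may not match the actual release files. The function names, the sample tokens, and the tags `TAG_A`, `TAG_B`, `TAG_C` are placeholders, not part of the UPC tag set.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group token<TAB>tag lines into sentences, splitting on blank lines.

    ASSUMPTION: this layout is hypothetical; check the release files
    and adjust the parsing if the real format differs.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so the tag is always the final field.
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def tag_counts(sentences: List[Sentence]) -> Counter:
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


# Placeholder data: two short sentences with made-up tokens and tags.
SAMPLE = "tok1\tTAG_A\ntok2\tTAG_B\n.\tTAG_C\n\ntok3\tTAG_A\n.\tTAG_C\n"

sentences = read_tagged_sentences(SAMPLE.splitlines())
counts = tag_counts(sentences)
total_tokens = sum(counts.values())
```

If the format assumption holds for a real release file, `total_tokens` and `len(counts)` can be compared against the token count and the number of tags reported above.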
1. Bijankhan, Mahmood. 2004. The Role of the Corpus in Writing a Grammar: An Introduction to a Software. Iranian Journal of Linguistics 19.
2. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16. [pdf]