Uppsala Persian Corpus: UPC

The Uppsala Persian Corpus (UPC) (Seraji, 2015, Chapter 3, pp. 68-81) is a large, freely available Persian corpus. It is a modified version of the Bijankhan corpus (Bijankhan, 2004) with added sentence segmentation and consistent tokenization. It contains 2,704,028 tokens and is annotated with 31 part-of-speech tags, which are listed with explanations in this table.
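The token and tag counts above can be checked against a downloaded copy of the corpus. The following is a minimal sketch, not the official tooling. It assumes a layout this page does not confirm: one token per line, token and tag separated by a tab, and a blank line between sentences. The function names and the sample data are hypothetical, and the tags in the sample are placeholders.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs.

    ASSUMPTION: one token per line, "token<TAB>tag", with a blank line
    between sentences. Adjust the parsing if the release differs.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        if "\t" not in line:
            raise ValueError(f"expected 'token<TAB>tag', got: {line!r}")
        # Split on the last tab so the tag is always the final field.
        token, _, tag = line.rpartition("\t")
        sentence.append((token, tag))
    if sentence:
        # The input may end without a trailing blank line.
        yield sentence


def corpus_stats(sentences: Iterable[Sentence]) -> Tuple[int, int, Counter]:
    """Return (sentence count, token count, tag frequency counter)."""
    n_sentences = 0
    n_tokens = 0
    tag_counts: Counter = Counter()
    for sentence in sentences:
        n_sentences += 1
        n_tokens += len(sentence)
        tag_counts.update(tag for _, tag in sentence)
    return n_sentences, n_tokens, tag_counts


# Placeholder data in the assumed layout (not taken from the corpus).
sample = "w1\tN\nw2\tV\n\nw3\tN\n"
n_sentences, n_tokens, tag_counts = corpus_stats(
    read_tagged_sentences(sample.splitlines())
)
print(n_sentences, n_tokens, len(tag_counts))
```

To run this on the real data, open the downloaded file with `encoding="utf-8"` and pass the file object to `read_tagged_sentences`. If the layout assumption holds, the token count and the number of distinct tags should be comparable to the figures stated above.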


The corpus was developed by Mojgan Seraji (mojgan.seraji96@gmail.com) and is licensed under the GNU General Public License. It can be downloaded below:

Latest release: 

Previous releases: 

  • UPC.1.1   (June 2, 2014) 
  • UPC.1.0   (May 20, 2012) 


1. Bijankhan, Mahmood. 2004. The Role of the Corpus in Writing a Grammar: An Introduction to a Software. Iranian Journal of Linguistics 19.
2. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16. [pdf]

Subpages (1): UPC.1.2
Mojgan Seraji,
Jul 18, 2017, 4:31 AM