UPC

Uppsala Persian Corpus: UPC

The Uppsala Persian Corpus (UPC) (Seraji, 2015, Chapter 3, pp. 68-81) is a large, freely available Persian corpus. It is a modified version of the Bijankhan corpus (Bijankhan, 2004) with additional sentence segmentation and consistent tokenization. The corpus contains 2,704,028 tokens and is annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in this table.

Download

The corpus was developed by Mojgan Seraji (mojgan.seraji96@gmail.com) and is licensed under the GNU General Public License. It can be downloaded below:

Latest release:

UPC.1.2 (January 1, 2017)

Previous releases:

UPC.1.1 (June 2, 2014)
UPC.1.0 (May 20, 2012)

References

1. Bijankhan, Mahmood. 2004. The Role of the Corpus in Writing a Grammar: An Introduction to a Software. Iranian Journal of Linguistics 19.

2. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16. [pdf]
