Uppsala Persian Dependency Treebank: UPDT

The Uppsala Persian Dependency Treebank (UPDT) (Seraji, 2015, Chapter 5, pp. 97-146) is a dependency-based, syntactically annotated corpus. The treebank consists of 6000 sentences (151,671 tokens) of written text in CoNLL format. It was developed through a bootstrapping procedure that combined the open source data-driven dependency parser MaltParser (Nivre et al., 2006) with manual validation of the annotation. The entire treebank was released in 2013. A first release, containing a seed data set of 225 sentences, appeared in Fall 2011, and the treebank development process was published as a journal article in Linguistic Issues in Language Technology 7(18):1-10, January 2012.

The treebank data was extracted from the open source, validated Uppsala Persian Corpus (UPC), which was created from on-line material containing newspaper articles and common text on various topics (e.g. culture, technology, fiction, and art). The corpus is annotated with 31 part-of-speech tags.

The treebank annotation scheme is based on Stanford Typed Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008). The full set of dependency relations used in the annotation, together with extensive guidelines for sentence segmentation, tokenization, and morphological annotation, is described in detail in Seraji (2015): see Chapter 5, pp. 97-146, for the treebank, and Chapter 3, pp. 68-81, for sentence segmentation, tokenization, and morphological annotation.

(Image, Copyright © 2013 Mojgan Seraji)

Download UPDT

The treebank was developed by Mojgan Seraji, under the supervision of Joakim Nivre and Carina Jahani. The UPDT is licensed under a Creative Commons Attribution 3.0 License and can be downloaded below.

Latest release:

  • UPDT.1.3   (May 01, 2016)

Since the treebank is constantly being improved, please contact mojgan.seraji96@gmail.com for the very latest version.

Previous releases:

  • UPDT.1.2   (January 01, 2016)
  • UPDT.1.1   (October 30, 2015)
  • UPDT.1.0   (May 20, 2013)
  • A seed data set of 225 sentences   (September 15, 2011)

Parsing Experiments

The UPDT has been split sequentially into 10 parts, of which parts 1-8 are used for training (80%), part 9 for development (10%), and part 10 for testing (10%). The following data sets are based on the latest release.
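
As a rough illustration of how such a sequential 80/10/10 split could be reproduced from a CoNLL file, here is a minimal Python sketch. It rests on two assumptions that are not stated above: that sentences in the file are separated by blank lines (as is usual for CoNLL-style data), and that "sequentially" means ten contiguous blocks of sentences in file order. The function names and the toy input are hypothetical and are not part of the UPDT distribution.

```python
from typing import List, Sequence, Tuple


def read_conll_sentences(text: str) -> List[List[str]]:
    """Group token lines into sentences.

    Assumption: sentences are separated by one or more blank lines.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


def sequential_split(sentences: Sequence, parts: int = 10) -> List[Sequence]:
    """Cut the sequence into `parts` contiguous, near-equal segments.

    Assumption: "sequential" means contiguous blocks in file order.
    """
    n = len(sentences)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [sentences[bounds[i]:bounds[i + 1]] for i in range(parts)]


def train_dev_test(sentences: Sequence) -> Tuple[list, list, list]:
    """Parts 1-8 for training, part 9 for development, part 10 for test."""
    segments = sequential_split(sentences, 10)
    train = [s for seg in segments[:8] for s in seg]
    dev = list(segments[8])
    test = list(segments[9])
    return train, dev, test


if __name__ == "__main__":
    # Toy input: 20 one-token "sentences" with placeholder columns.
    # This is made-up data, not text from the treebank.
    toy = "\n\n".join(
        f"1\tw{i}\t_\t_\t_\t_\t0\troot\t_\t_" for i in range(20)
    ) + "\n"
    train, dev, test = train_dev_test(read_conll_sentences(toy))
    print(len(train), len(dev), len(test))  # 16 2 2
```

Under these assumptions, the 6000 sentences of the treebank would yield 4800 training, 600 development, and 600 test sentences. The released data sets remain the authoritative version of the split.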

Feedback and bug reports

Please contact mojgan.seraji96@gmail.com for feedback and bug reports. 


I would like to thank Recorded Future Inc. for their contribution and financial support in developing the treebank. 


1. De Marneffe, Marie-Catherine, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).
2. De Marneffe, Marie-Catherine, and Christopher D. Manning. 2008. Stanford Typed Dependencies Representation. In Proceedings of the COLING’08 Workshop on Cross-Framework and Cross-Domain Parser Evaluation.
3. Nivre, Joakim, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC).
4. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16.