
POeTiSA: POrtuguese processing - Towards Syntactic Analysis and parsing

This page introduces and releases the first version of the journalistic portion of Porttinari (short for “PORTuguese Treebank”), which is intended to be a large multi-genre treebank for Portuguese (Pardo et al., 2021) that follows the "Universal Dependencies" international grammar framework (de Marneffe et al., 2021). As reported by Duran et al. (2023), Porttinari currently comprises three subcorpora with different characteristics and purposes:

The texts in the treebank come from the Folha de São Paulo newspaper and are publicly available on the Kaggle website. Overall, the journalistic portion of Porttinari includes 167,048 news articles, with 3,964,321 sentences and 94,646,080 tokens, distributed across the subcorpora as follows.

Download of the corpus

Compressed files of the subcorpora, in the CoNLL-U format, are available at the following links (licensed as Creative Commons CC-BY):
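For readers unfamiliar with CoNLL-U, the following is a minimal Python sketch of how a decompressed subcorpus file might be read and its sentences and tokens counted. It relies only on the standard CoNLL-U layout (ten tab-separated columns per token line, `#` comment lines, and a blank line between sentences). The function names and the sample sentence are illustrative assumptions, not part of the Porttinari release, and the counting convention used here (syntactic words only) may differ from the one behind the figures reported on this page.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():          # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comments (sent_id, text)
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):  # skip malformed lines
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                      # file without a trailing blank line
        yield sentence


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Count sentences and syntactic words.

    Multiword-token ranges (IDs such as "1-2") and empty nodes ("1.1")
    are excluded; this convention is an assumption of the sketch.
    """
    n_sentences = n_tokens = 0
    for sentence in read_conllu(lines):
        n_sentences += 1
        n_tokens += sum(1 for tok in sentence if tok["id"].isdigit())
    return n_sentences, n_tokens


# Hypothetical sample, invented for illustration (not taken from the corpus).
SAMPLE = "\n".join([
    "# text = O menino dorme.",
    "\t".join(["1", "O", "o", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "menino", "menino", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["3", "dorme", "dormir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    print(count_sentences_and_tokens(SAMPLE.splitlines()))
```

To process a real file, an open file handle can be passed instead of the sample, for example `count_sentences_and_tokens(open(path, encoding="utf-8"))`, where `path` points to whichever decompressed file was downloaded.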

Main references (there are many more related publications here)

On the corpus project and release

On the annotation design and decisions