Porttinari 3.0 (beta)
POeTiSA: POrtuguese processing - Towards Syntactic Analysis and parsing
(the previous version -- Porttinari 2.1 -- is available here)
This page releases version 3.0 of Porttinari (which stands for “PORTuguese Treebank”), a multigenre treebank for Portuguese (Pardo et al., 2021) that follows the "Universal Dependencies" international grammar framework (de Marneffe et al., 2021). Porttinari is currently composed of news texts, user-generated content, legal texts, and literary texts, as follows:
As reported by Duran et al. (2023), the news portion includes texts from the Folha de São Paulo newspaper (publicly available on the Kaggle website) and is divided into three subcorpora with different characteristics and purposes:
Porttinari-base:
With basic dependency relations: a corpus that is manually revised in detail to serve as gold standard (divided into training, development and test folds), with average annotation review agreement (kappa) of 97.8% and 96.2% for part of speech tags and dependency relations, respectively;
With basic and enhanced dependency relations: the above corpus with the inclusion of enhanced dependency relations (following the proposal of Pagano et al. (2023) and the guidelines of Duran (2024)), which were semi-automatically produced (a rule-based annotation system automatically produced the enhanced relations with over 96% accuracy -- see the report of Souza et al. (2024) -- and human experts manually reviewed the more challenging annotation issues);
Porttinari-check, a small corpus structurally similar to Porttinari-base that serves as a testbed for additional and diversified evaluations and illustrates the contrast between manual and automatic annotations (including only basic dependency relations) -- the automatic annotation was carried out with Portparser.v2;
Porttinari-automatic, a very large corpus that was automatically annotated by a state-of-the-art parser (Portparser.v2) trained on Porttinari-base (including only basic dependency relations).
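The agreement figures reported above for Porttinari-base are kappa values. This page does not state which kappa variant was used; as a minimal sketch, assuming Cohen's kappa between two annotators who labeled the same tokens, the computation could look like the following. The function name and the toy tag sequences are illustrative only and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy part-of-speech tags (made up for illustration, not corpus data).
annotator_1 = ["NOUN", "VERB", "DET", "NOUN", "ADJ", "VERB"]
annotator_2 = ["NOUN", "VERB", "DET", "NOUN", "NOUN", "VERB"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.76"
```

In this toy case the annotators agree on 5 of 6 tokens (about 83%), yet kappa is 0.76, because kappa discounts the agreement expected by chance from the label distributions.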
The portion of user-generated content, named DANTEStocks, includes posts on the financial domain collected from the X social network, which was still named Twitter at the time the corpus was built, as reported by Silva et al. (2020). The original posts (made available on the Kaggle website) were semi-automatically annotated with part-of-speech tags and basic dependency relations: a state-of-the-art tagger and parser produced the first annotations, which were incrementally reviewed and used to train new versions of the tools in order to annotate the remaining data, and all the data was later manually reviewed. As the part-of-speech annotation was produced by an adjudication process over automatic data reviewed by three linguists (as reported by Silva et al., 2021), there is no computed agreement value. The first version of the dependency relation annotation achieved a (kappa) agreement of 95.0% (as detailed by Barbosa, 2024). More details about the annotation are reported by Di Felippo et al. (2023, 2024). The data is also divided into training, development and test folds.
As reported by Lopes et al. (2025), the legal portion of Porttinari includes public law texts produced by the judiciary (mainly summaries) and the legislature (laws), including widely known laws in Brazil, such as the Henry Borel law, the Internet Civil Rights law, the Maria da Penha law, the Copyright law, the Agrarian Reform law, the Elderly Persons statute and the Child and Adolescent statute. The data was automatically annotated by a state-of-the-art parser (Portparser.v2) and manually revised by human experts. To support specialized studies and computational applications, some resources were also produced from this portion of Porttinari, namely lists of content words (i.e., nouns, verbs, adjectives and adverbs -- totaling 4,994 words), proper nouns (627), verb forms (3,073), abbreviations (22), and foreign expressions (50).
The literary portion currently includes the book "The Little Prince" (already in the public domain), which was automatically annotated by a state-of-the-art parser (Portparser.v2) and manually revised. Interestingly, the same corpus was also annotated with Abstract Meaning Representation (AMR) (Banarescu et al., 2013), which allows the study of both syntactic and semantic characteristics.
The data is distributed in the subcorpora as follows.
Download of the corpus (and associated resources)
The interested user may find the compressed files of the subcorpora (in the CoNLL-U format) at the following links (licensed as Creative Commons CC-BY):
Porttinari-check -- original version (automatically annotated) and manually revised version
Porttinari-automatic (divided into 168 folds, for easier handling)
DANTEStocks: full (reference) version and adapted version published at UD website (some annotation decisions changed in order to meet some UD publication requirements)
Previous versions of DANTEStocks are also available: version 1.0 (of December 15, 2022), version 1.1 (May 13, 2024) and version 2.0 (which was the basis for the full version above)
PortJur and associated resources (including two versions - tsv files and xlsx spreadsheets)
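Since the subcorpora are distributed in the CoNLL-U format, a short reader may help when inspecting the downloaded files. The sketch below assumes the standard ten tab-separated CoNLL-U columns and parses the enhanced dependencies column (DEPS) where present. The helper names (read_conllu, parse_deps) and the sample sentence are hypothetical, written for illustration only, and are not taken from the corpus or from any Porttinari tool.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield each sentence as a list of token dicts (one dict per word)."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def parse_deps(deps):
    """Split a DEPS value such as '3:nsubj|5:nmod:de' into (head, relation) pairs."""
    if deps == "_":
        return []
    # Only the first colon separates head from relation; relations may have subtypes.
    return [tuple(item.split(":", 1)) for item in deps.split("|")]


# Hand-written sample sentence (illustrative, not corpus data).
SAMPLE = "\n".join([
    "# text = O menino dorme.",
    "1\tO\to\tDET\t_\t_\t2\tdet\t2:det\t_",
    "2\tmenino\tmenino\tNOUN\t_\t_\t3\tnsubj\t3:nsubj\t_",
    "3\tdorme\tdormir\tVERB\t_\t_\t0\troot\t0:root\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t3:punct\t_",
    "",
])

sentences = list(read_conllu(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok["id"], tok["form"], tok["upos"], tok["head"],
          tok["deprel"], parse_deps(tok["deps"]))
```

To read a downloaded file instead of the sample, pass an open file object (for example, one opened with encoding="utf-8") to read_conllu; the actual file names depend on the archive that was downloaded.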
Differences between version 3.0 of the Porttinari treebank and version 2.1
The news texts of Porttinari-base were manually revised once more, going through minor corrections
PortJur annotation was also revised, going through minor corrections
The annotation of The Little Prince book was made available
Main references (there are many more related publications here)
On the corpus project and release
Lopes, L.; Nunes, M.G.V.; Duran, M.S.; Pardo, T.A.S. (2025). A sintaxe no tribunal: apresentando e explorando um corpus jurídico em português anotado sintaticamente segundo o modelo Universal Dependencies. In the Proceedings of the XVI Symposium in Information and Human Language Technology (STIL), pp. 220-232. September, 29 - October, 02. pdf
Di Felippo, A.; Roman, N.T. (2025). DANTEStocks: A Multi-Layered Annotated Corpus of Stock Market Tweets for Brazilian Portuguese. Revista Brasileira de Linguística Aplicada (RBLA), Vol. 25, N. 1, pp. 1-32. pdf
Souza, E.A.; Duran, M.S.; Nunes, M.G.V.; Sampaio, G.; Belasco, G.; Pardo, T.A.S. (2024). Automatic Annotation of Enhanced Universal Dependencies for Brazilian Portuguese. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 217-226. November, 17-21. pdf
Duran, M.S.; Lopes, L.; Nunes, M.G.V.; Pardo, T.A.S. (2023). The Dawn of the Porttinari Multigenre Treebank: Introducing its Journalistic Portion. In the Proceedings of the 14th Symposium in Information and Human Language Technology (STIL), pp. 115-124. September, 25-29. pdf
Pardo, T.A.S.; Duran, M.S.; Lopes, L.; Di Felippo, A.; Roman, N.T.; Nunes, M.G.V. (2021). Porttinari - a Large Multi-genre Treebank for Brazilian Portuguese. In the Proceedings of the XIII Symposium in Information and Human Language Technology (STIL), pp. 1-10. November, 29 to December, 3. pdf
On the annotation design and decisions
Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2024). Desambiguação de lema e atributos morfológicos na anotação do corpus Porttinari-base. In Anais da IX Jornada de Descrição do Português (JDP), pp. 336-345. November, 17-21. Belém-PA, Brazil. pdf
Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2023). Atribuição de lemas e atributos morfológicos seguindo as decisões adotadas na anotação do córpus Porttinari-base dentro das diretrizes da Universal Dependencies (UD). Relatório Técnico do ICMC 445. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Agosto, 34p. pdf
Lopes, L.; Duran, M.S.; Nunes, M.G.V.; Pardo, T.A.S. (2022). Corpora building process according to the Universal Dependencies model: an experiment for Portuguese. Relatório Técnico do ICMC 439. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Março, 22p. pdf
Duran, M.S. (2024). Anotação de Enhanced Dependencies: Orientações para Anotação de Relações de Dependência Sintática do Tipo Enhanced em Língua Portuguesa, seguindo as Diretrizes da Abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 448. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Agosto, 89p. pdf
Di Felippo, A.; Nunes, M.G.V.; Barbosa, B.K.S. (2024). Diretrizes de anotação de relações de dependência em tweets do mercado financeiro. Relatório Técnico do ICMC 446. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Abril, 70p. pdf
Di Felippo, A.; Postali, C.; Ceregatto, G.; Gazana, L.S.; Roman, N.T. (2022). Diretrizes de Anotação de PoS Tags em Tweets do Mercado Financeiro: Orientações para anotação em língua portuguesa segundo a abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 438. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Março, 24p. pdf
Duran, M.S. (2022). Manual de Anotação de Relações de Dependência - Versão Revisada e Estendida: Orientações para anotação de relações de dependência sintática em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 440. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Outubro, 166p. pdf
Duran, M.S. (2021). Manual de Anotação de PoS tags: Orientações para anotação de etiquetas morfossintáticas em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 434. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Setembro, 55p. pdf
Duran, M.S.; Nunes, M.G.V.; Lopes, L.; Pardo, T.A.S. (2022). Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa. Domínios de Lingu@gem, Vol. 16, N. 4, pp. 1608-1643. pdf