Porttinari 2.1

POeTiSA: POrtuguese processing - Towards Syntactic Analysis and parsing

(the previous version -- Porttinari 2.0 -- is available here)

This page releases version 2.1 of Porttinari (which stands for “PORTuguese Treebank”), a multigenre treebank for Portuguese (Pardo et al., 2021) that follows the "Universal Dependencies" international grammar framework (de Marneffe et al., 2021). Porttinari is currently composed of news texts, legal texts and user-generated content, as follows:

As reported by Duran et al. (2023), the news portion includes texts from the Folha de São Paulo newspaper (publicly available on the Kaggle website) and is divided into three subcorpora with different characteristics and purposes:
- Porttinari-base:
  1. With basic dependency relations: a corpus manually revised in detail to serve as a gold standard (divided into training, development and test folds), with average annotation review agreement (kappa) of 97.8% and 96.2% for part of speech tags and dependency relations, respectively;
  2. With basic and enhanced dependency relations: the above corpus with the inclusion of enhanced dependency relations (following the proposal of Pagano et al. (2023) and the guidelines of Duran (2024)), which were semi-automatically produced (a rule-based annotation system automatically produced the enhanced relations with over 96% accuracy -- see the report of Souza et al. (2024) -- and human experts manually reviewed the more challenging annotation issues);
- Porttinari-check, a small corpus structurally similar to Porttinari-base that serves as a testbed for additional and diversified evaluations and illustrates the contrast between manual and automatic annotations (including only basic dependency relations) -- the automatic annotation was carried out with Portparser.v2;
- Porttinari-automatic, a very large corpus that was automatically annotated by a state-of-the-art parser (Portparser.v2) trained on Porttinari-base (including only basic dependency relations).
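
The agreement figures quoted for Porttinari-base are kappa values. This page does not say which kappa variant was used, so the sketch below is only an illustration: it assumes Cohen's kappa for two annotators, and the function name and tag sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one single label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical part of speech tags from two reviewers of the same six tokens.
annotator_1 = ["NOUN", "VERB", "DET", "NOUN", "ADP", "NOUN"]
annotator_2 = ["NOUN", "VERB", "DET", "PROPN", "ADP", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

Kappa discounts the agreement expected by chance, which is why it is lower than the raw proportion of matching labels.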

The portion of user-generated content, named DANTEStocks, includes posts on the financial domain collected from the X social network (still named Twitter when the corpus was built), as reported by Silva et al. (2020). The original posts (made available on the Kaggle website) were semi-automatically annotated with part of speech tags and basic dependency relations: a state-of-the-art tagger and parser produced the first annotations, which were incrementally reviewed and used to train new versions of the tools in order to annotate the remaining data, and all the data was later manually reviewed. Because the part of speech annotation was produced by an adjudication process over automatic data reviewed by three linguists (as reported by Silva et al., 2021), no agreement value was computed. The first version of the dependency relation annotation achieved a kappa agreement of 95.0% (as detailed by Barbosa, 2024). More details about the annotation are reported by Di Felippo et al. (2023, 2024). The data is also divided into training, development and test folds.

As reported by Lopes et al. (2025), the legal portion of Porttinari includes public law texts produced by the judiciary (mainly summaries) and the legislature (laws), including widely known laws in Brazil, such as the Henry Borel law, Internet Civil Rights law, Maria da Penha law, Copyright law, Agrarian Reform law, Elderly Persons statute and Child and Adolescent statute. The data was automatically annotated by a state-of-the-art parser (Portparser.v2) and manually revised by human experts. To support specialized studies and computational applications, some resources were also produced from this portion of Porttinari: lists of content words (i.e., nouns, verbs, adjectives and adverbs, totaling 4,994 words), proper nouns (627), verb forms (3,073), abbreviations (22), and foreign expressions (50).

The data is distributed in the subcorpora as follows.

Download of the corpus (and associated resources)

Compressed files of the subcorpora (in the CoNLL-U format) are available at the following links, licensed under Creative Commons CC-BY:

Porttinari-base
Porttinari-check -- original version (automatically annotated) and manually revised version
Porttinari-automatic (divided into 168 folds, for easier handling)
DANTEStocks: full (reference) version and adapted version published on the UD website (some annotation decisions were changed to meet UD publication requirements)
- Previous versions of DANTEStocks are also available: version 1.0 (December 15, 2022), version 1.1 (May 13, 2024) and version 2.0 (which was the basis for the full version above)
PortJur and associated resources (in two versions: tsv files and xlsx spreadsheets)
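
Since the subcorpora are distributed in the CoNLL-U format, a minimal reader can be sketched as follows. It relies only on the standard CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, a blank line between sentences). The function name and the sample sentence are hypothetical and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # The last sentence may lack a trailing blank line.
        yield sentence


# Hypothetical sample, for illustration only.
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = O mercado abriu.\n"
    "1\tO\to\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\t2\tdet\t_\t_\n"
    "2\tmercado\tmercado\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tabriu\tabrir\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

Because the reader accepts any iterable of lines, an open file handle (opened with `encoding="utf-8"`) can be passed directly, which keeps memory use low when iterating over the many folds of a large subcorpus.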

Differences between version 2.1 of the Porttinari treebank and version 2.0

Inclusion of the annotated legal texts
As requested by the UD initiative, inclusion of the ExtPos feature
Minor corrections of annotation problems (such as adjusting the use of the SpaceAfter feature in some situations)
New detailed manual review of the annotation, with several minor corrections carried out at all annotation levels
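
As a rough illustration of the two features mentioned above: in CoNLL-U files, `ExtPos` appears in the FEATS column and `SpaceAfter=No` in the MISC column. The sketch below rebuilds the surface text from `SpaceAfter` and lists the tokens that carry `ExtPos`. The helper names and the example tokens (including their feature values) are hypothetical and are not taken from the corpus.

```python
def parse_pairs(field):
    """Turn a FEATS/MISC value such as 'A=1|B=2' into a dict; '_' means empty."""
    if not field or field == "_":
        return {}
    pairs = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def detokenize(tokens):
    """Rebuild text from (form, feats, misc) triples using SpaceAfter=No."""
    parts = []
    for form, _feats, misc in tokens:
        parts.append(form)
        if parse_pairs(misc).get("SpaceAfter") != "No":
            parts.append(" ")
    return "".join(parts).rstrip()


# Hypothetical tokens as (FORM, FEATS, MISC) triples.
tokens = [
    ("Apesar", "ExtPos=ADP", "_"),
    ("de", "_", "_"),
    ("tudo", "_", "SpaceAfter=No"),
    (",", "_", "_"),
    ("choveu", "_", "SpaceAfter=No"),
    (".", "_", "_"),
]

text = detokenize(tokens)
with_extpos = [form for form, feats, _misc in tokens if "ExtPos" in parse_pairs(feats)]
print(text)
print(with_extpos)
```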

Main references (there are many more related publications here)

On the corpus project and release

Lopes, L.; Nunes, M.G.V.; Duran, M.S.; Pardo, T.A.S. (2025). A sintaxe no tribunal: apresentando e explorando um corpus jurídico em português anotado sintaticamente segundo o modelo Universal Dependencies. In the Proceedings of the XVI Symposium in Information and Human Language Technology (STIL), pp. 220-232. September, 29 - October, 02. pdf
Souza, E.A.; Duran, M.S.; Nunes, M.G.V.; Sampaio, G.; Belasco, G.; Pardo, T.A.S. (2024). Automatic Annotation of Enhanced Universal Dependencies for Brazilian Portuguese. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 217-226. November, 17-21. pdf
Di Felippo, A.; Nunes, M.G.V.; Barbosa, B.K.S. (2024). A Dependency Treebank of Tweets in Brazilian Portuguese: Syntactic Annotation Issues and Approach. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 192-201. November, 17-21. pdf
Duran, M.S.; Lopes, L.; Nunes, M.G.V.; Pardo, T.A.S. (2023). The Dawn of the Porttinari Multigenre Treebank: Introducing its Journalistic Portion. In the Proceedings of the 14th Symposium in Information and Human Language Technology (STIL), pp. 115-124. September, 25-29. pdf
Pardo, T.A.S.; Duran, M.S.; Lopes, L.; Di Felippo, A.; Roman, N.T.; Nunes, M.G.V. (2021). Porttinari - a large multi-genre treebank for brazilian portuguese. In the Proceedings of the XIII Symposium in Information and Human Language Technology (STIL), pp. 1-10. November, 29 to December, 3. pdf

On the annotation design and decisions

Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2024). Desambiguação de lema e atributos morfológicos na anotação do corpus Porttinari-base. In Anais da IX Jornada de Descrição do Português (JDP), pp. 336-345. November, 17-21. Belém-PA, Brazil. pdf
Lopes, L.; Duran, M. S.; Pardo, T. A. S. (2023). Atribuição de lemas e atributos morfológicos seguindo as decisões adotadas na anotação do córpus Portinari-base dentro das diretrizes da Universal Dependencies (UD). Relatório Técnico do ICMC 445. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Agosto, 34p. pdf
Lopes, L.; Duran, M.S.; Nunes, M.G.V.; Pardo, T.A.S. (2022). Corpora building process according to the Universal Dependencies model: an experiment for Portuguese. Relatório Técnico do ICMC 439. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Março, 22p. pdf
Duran, M.S. (2024). Anotação de Enhanced Dependencies: Orientações para Anotação de Relações de Dependência Sintática do Tipo Enhanced em Língua Portuguesa, seguindo as Diretrizes da Abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 448. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Agosto, 89p. pdf
Di Felippo, A.; Nunes, M.G.V.; Barbosa, B.K.S. (2024). Diretrizes de anotação de relações de dependência em tweets do mercado financeiro. Relatório Técnico do ICMC 446. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Abril, 70p. pdf
Di Felippo, A.; Postali, C.; Ceregatto, G.; Gazana, L.S.; Roman, N.T. (2022). Diretrizes de Anotação de PoS Tags em Tweets do Mercado Financeiro: Orientações para anotação em língua portuguesa segundo a abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 438. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Março, 24p. pdf
Duran, M.S. (2022). Manual de Anotação de Relações de Dependência - Versão Revisada e Estendida: Orientações para anotação de relações de dependência sintática em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 440. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Outubro, 166p. pdf
Duran, M.S. (2021). Manual de Anotação de PoS tags: Orientações para anotação de etiquetas morfossintáticas em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 434. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Setembro, 55p. pdf
Duran, M.S.; Nunes, M.G.V.; Lopes, L.; Pardo, T.A.S. (2022). Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa. Domínios de Lingu@gem, Vol. 16, N. 4, pp. 1608-1643. pdf
