Porttinari 2.0
POeTiSA: POrtuguese processing - Towards Syntactic Analysis and parsing
(the previous version -- Porttinari 1.0 -- is available here)
This page releases the second version of Porttinari (which stands for “PORTuguese Treebank”), a multigenre treebank for Portuguese (Pardo et al., 2021) that follows the "Universal Dependencies" international grammar framework (de Marneffe et al., 2021). Porttinari is currently composed of news texts and user-generated content, as follows:
As reported by Duran et al. (2023), the news portion includes texts from the Folha de São Paulo newspaper (publicly available on the Kaggle website) and is divided into three subcorpora with different characteristics and purposes:
Porttinari-base:
With basic dependency relations: a corpus manually revised in detail to serve as a gold standard (divided into training, development and test folds), with an average annotation review agreement (kappa) of 97.8% for part-of-speech tags and 96.2% for dependency relations;
With basic and enhanced dependency relations: the above corpus with the addition of enhanced dependency relations (following the proposal of Pagano et al. (2023) and the guidelines of Duran (2024)), which were semi-automatically produced: a rule-based annotation system generated the enhanced relations with over 96% accuracy (see the report of Souza et al. (2024)), and human experts manually reviewed the more challenging annotation issues;
Porttinari-check, a small corpus structurally similar to Porttinari-base that serves as a testbed for additional and diversified evaluations and illustrates the contrast between manual and automatic annotations (including only basic dependency relations) -- the automatic annotation was carried out with Portparser.v2;
Porttinari-automatic, a very large corpus that was automatically annotated by a state-of-the-art parser (Portparser.v2) trained on Porttinari-base (including only basic dependency relations).
The user-generated content portion, named DANTEStocks, consists of posts in the financial domain collected from the X social network (still named Twitter when the corpus was built), as reported by Silva et al. (2020). The original posts (made available on the Kaggle website) were semi-automatically annotated with part-of-speech tags and basic dependency relations: a state-of-the-art tagger and parser produced the first annotations, which were incrementally reviewed and used to train new versions of the tools in order to annotate the remaining data, and all the data was later manually reviewed. Because the part-of-speech annotation was produced by an adjudication process over automatic data reviewed by three linguists (as reported by Silva et al., 2021), no agreement value was computed for it. The first version of the dependency relation annotation achieved a kappa agreement of 95.0% (as detailed by Barbosa, 2024). More details about the annotation are reported by Di Felippo et al. (2023, 2024). The data is also divided into training, development and test folds.
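The agreement figures reported for the corpus are kappa values. This page does not state which kappa variant was used, so the following is only a minimal sketch, assuming Cohen's kappa over two annotators who labeled the same tokens. The function name `cohen_kappa` and the toy tag sequences are illustrative and are not taken from the corpus or its tools.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy part-of-speech tags for four tokens (invented for illustration).
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Unlike raw percentage agreement, kappa discounts the agreement expected by chance, which matters when a few tags or relations dominate the data.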
The data is distributed in the subcorpora as follows.
Download of the corpus
Interested users may find the compressed files of the subcorpora (in CoNLL-U format) at the following links (licensed under Creative Commons CC-BY):
Porttinari-check -- original version (automatically annotated) and manually revised version
Porttinari-automatic (divided into 168 folds for easier handling)
DANTEStocks: full (reference) version and adapted version published on the UD website (some annotation decisions were changed in order to meet UD publication requirements)
Previous versions of DANTEStocks are also available: version 1.0 (of December 15, 2022), version 1.1 (May 13, 2024) and version 2.0 (which was the basis for the full version above)
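Since the files are distributed in CoNLL-U, a short reader may help when inspecting them. The sketch below assumes only the standard ten tab-separated CoNLL-U columns. The helper names (`read_conllu`, `is_syntactic_word`, `enhanced_edges`) are hypothetical, and the sample sentence is invented for illustration rather than taken from Porttinari.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield each sentence as a list of dicts, one dict per token line."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence


def is_syntactic_word(token):
    """True for plain integer IDs; False for ranges (1-2) and empty nodes (1.1)."""
    return token["id"].isdigit()


def enhanced_edges(token):
    """Split the DEPS column into (head, relation) pairs; '_' means no edges."""
    if token["deps"] == "_":
        return []
    edges = []
    for item in token["deps"].split("|"):
        # Only the first colon separates head from relation,
        # because the relation itself may contain colons.
        head, _, relation = item.partition(":")
        edges.append((head, relation))
    return edges


# Invented example sentence: "Ela chegou e saiu."
ROWS = [
    ["1", "Ela", "ela", "PRON", "_", "_", "2", "nsubj", "2:nsubj|4:nsubj", "_"],
    ["2", "chegou", "chegar", "VERB", "_", "_", "0", "root", "0:root", "_"],
    ["3", "e", "e", "CCONJ", "_", "_", "4", "cc", "4:cc", "_"],
    ["4", "saiu", "sair", "VERB", "_", "_", "2", "conj", "2:conj:e", "_"],
    ["5", ".", ".", "PUNCT", "_", "_", "2", "punct", "2:punct", "_"],
]
SAMPLE = "# text = Ela chegou e saiu.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    if is_syntactic_word(token):
        print(token["form"], token["head"], token["deprel"], enhanced_edges(token))
```

For a downloaded file, the same function could be applied to the text read from disk, for instance `read_conllu(open(path, encoding="utf-8").read())`, where `path` points to a decompressed `.conllu` file. The very large Porttinari-automatic folds may be better processed one file at a time.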
Differences between this second version of the Porttinari treebank and the first one
Inclusion of the annotated user-generated content
The annotation of enhanced dependency relations in Porttinari-base (from the news portion)
The automatic re-annotation of Porttinari-check and Porttinari-automatic (from the news portion) with a new state-of-the-art parser for Portuguese (with accuracy over 96%)
A new detailed manual review of the revised data, with several minor corrections carried out at all annotation levels
Main references (there are many more related publications here)
On the corpus project and release
Souza, E.A.; Duran, M.S.; Nunes, M.G.V.; Sampaio, G.; Belasco, G.; Pardo, T.A.S. (2024). Automatic Annotation of Enhanced Universal Dependencies for Brazilian Portuguese. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 217-226. November, 17-21. pdf
Di Felippo, A.; Nunes, M.G.V.; Barbosa, B.K.S. (2024). A Dependency Treebank of Tweets in Brazilian Portuguese: Syntactic Annotation Issues and Approach. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 192-201. November, 17-21. pdf
Duran, M.S.; Lopes, L.; Nunes, M.G.V.; Pardo, T.A.S. (2023). The Dawn of the Porttinari Multigenre Treebank: Introducing its Journalistic Portion. In the Proceedings of the 14th Symposium in Information and Human Language Technology (STIL), pp. 115-124. September, 25-29. pdf
Pardo, T.A.S.; Duran, M.S.; Lopes, L.; Di Felippo, A.; Roman, N.T.; Nunes, M.G.V. (2021). Porttinari - a Large Multi-genre Treebank for Brazilian Portuguese. In the Proceedings of the XIII Symposium in Information and Human Language Technology (STIL), pp. 1-10. November, 29 to December, 3. pdf
On the annotation design and decisions
Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2024). Desambiguação de lema e atributos morfológicos na anotação do corpus Porttinari-base. In Anais da IX Jornada de Descrição do Português (JDP), pp. 336-345. November, 17-21. Belém-PA, Brazil. pdf
Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2023). Atribuição de lemas e atributos morfológicos seguindo as decisões adotadas na anotação do córpus Portinari-base dentro das diretrizes da Universal Dependencies (UD). Relatório Técnico do ICMC 445. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Agosto, 34p. pdf
Lopes, L.; Duran, M.S.; Nunes, M.G.V.; Pardo, T.A.S. (2022). Corpora building process according to the Universal Dependencies model: an experiment for Portuguese. Relatório Técnico do ICMC 439. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Março, 22p. pdf
Duran, M.S. (2024). Anotação de Enhanced Dependencies: Orientações para Anotação de Relações de Dependência Sintática do Tipo Enhanced em Língua Portuguesa, seguindo as Diretrizes da Abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 448. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Agosto, 89p. pdf
Di Felippo, A.; Nunes, M.G.V.; Barbosa, B.K.S. (2024). Diretrizes de anotação de relações de dependência em tweets do mercado financeiro. Relatório Técnico do ICMC 446. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Abril, 70p. pdf
Di Felippo, A.; Postali, C.; Ceregatto, G.; Gazana, L.S.; Roman, N.T. (2022). Diretrizes de Anotação de PoS Tags em Tweets do Mercado Financeiro: Orientações para anotação em língua portuguesa segundo a abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 438. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Março, 24p. pdf
Duran, M.S. (2022). Manual de Anotação de Relações de Dependência - Versão Revisada e Estendida: Orientações para anotação de relações de dependência sintática em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 440. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Outubro, 166p. pdf
Duran, M.S. (2021). Manual de Anotação de PoS tags: Orientações para anotação de etiquetas morfossintáticas em Língua Portuguesa, seguindo as diretrizes da abordagem Universal Dependencies (UD). Relatório Técnico do ICMC 434. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Setembro, 55p. pdf
Duran, M.S.; Nunes, M.G.V.; Lopes, L.; Pardo, T.A.S. (2022). Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa. Domínios de Lingu@gem, Vol. 16, N. 4, pp. 1608-1643. pdf