Porttinari-base PropBank (PBP) (Freitas and Pardo, 2024, 2025) is the layer of semantic roles annotated on the Porttinari-base portion of the Porttinari 1.0 treebank (Duran et al., 2023; Pardo et al., 2021). Porttinari-base is composed of Brazilian journalistic texts (168,000 tokens/8,418 sentences). Semantic role labeling (SRL) identifies who did what to whom, where, when, how, why, for what, with what, with whom, etc. The task structures the information in linguistic statements in a manner that is both explicit and interpretable.
PBP is annotated in PropBank style (Palmer et al., 2005) and contains dependency-based SRL, since Porttinari-base is annotated according to the Universal Dependencies approach. All verbs were tagged with semantic roles, and 13,395 (60.8%) verbal instances (1,018 different frames) were tagged with semantic frames according to Verbo-Brasil (Duran and Aluísio, 2015). The dataset is available in two versions, “ud” and “classic”, described below.
The format
PBP is available in a dependency-based style, in a CoNLL-U file. SRL annotation was originally done in column 9 (“deps”) for verbal frames and in column 10 (“misc”) for arguments and their heads, as shown below.
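Independently of the illustration referenced above, the following is a minimal Python sketch of how the two SRL columns could be read from a CoNLL-U file. It relies only on the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with "#", blank lines between sentences). The function name `read_sentences` and the sample sentence are hypothetical, and the values placed in columns 9 and 10 are placeholders, not the actual PBP encoding.

```python
import io

# Standard CoNLL-U column names, in order (columns 1 to 10).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_sentences(stream):
    """Yield each sentence as a list of dicts, one dict per token line."""
    sentence = []
    for raw in stream:
        line = raw.rstrip("\n")
        if not line.strip():          # a blank line closes the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):  # skip lines that are not 10-column rows
            continue
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                      # file may lack a trailing blank line
        yield sentence


# Hypothetical two-token sentence. The strings in columns 9 ("deps") and
# 10 ("misc") are placeholders and do NOT reproduce the real PBP notation.
_rows = [
    ("1", "Ela", "ela", "PRON", "_", "_", "2", "nsubj", "_", "ROLE_PLACEHOLDER"),
    ("2", "chegou", "chegar", "VERB", "_", "_", "0", "root", "FRAME_PLACEHOLDER", "_"),
]
SAMPLE = "# text = Ela chegou\n" + "\n".join("\t".join(r) for r in _rows) + "\n\n"

for sentence in read_sentences(io.StringIO(SAMPLE)):
    for tok in sentence:
        print(tok["id"], tok["form"], tok["upos"], tok["deps"], tok["misc"])
```

To read a downloaded file instead of the sample, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` is wherever the corpus file was saved.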
Annotation schema
PBP follows the original PropBank guidelines (Bonial et al., 2015), with minor adjustments specific to the Portuguese language. The material was annotated considering i) the Portuguese guidelines described by Duran and Freitas (2024) and ii) the verbal frames listed in the Verbo-Brasil search tool. For some verbs without a frame in Verbo-Brasil, we used the tag INC (for “incomplete”) in column 9. The table below lists the tagset used in PBP. Tags specific to the PBP corpus are shown in light green. If needed, the labels ArgM:src (source of information), ArgM:conseq (consequences) and ArgM:cond (conditionals) may be replaced by ArgM:adv, and ArgM:comp (comparatives) by ArgM:ext.
The annotation was carried out with the support of the ET (Estação de Trabalho para busca, edição e avaliação de árvores sintáticas) tool (Souza and Freitas, 2021). Throughout the project, the tool was improved to allow other types of linguistic annotation from scratch. ET can be downloaded from here.
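For users who prefer the standard PropBank tagset, the optional replacement of the PBP-specific labels described above can be sketched as follows. This is only an illustration: the helper name `collapse_pbp_labels` is hypothetical, and because the exact spelling of the labels in the distributed files is not stated here, the sketch assumes the label may appear with either ":" or "-" after "ArgM" and preserves whichever separator it finds.

```python
import re

# PBP-specific modifier types and the general type each may be replaced by.
_REPLACEMENTS = {"src": "adv", "conseq": "adv", "cond": "adv", "comp": "ext"}

# Assumption: labels are written "ArgM:<type>" or "ArgM-<type>".
_PATTERN = re.compile(r"\bArgM([:-])(src|conseq|cond|comp)\b")


def collapse_pbp_labels(text):
    """Replace PBP-specific ArgM labels in a string, keeping the separator."""
    return _PATTERN.sub(
        lambda m: "ArgM" + m.group(1) + _REPLACEMENTS[m.group(2)], text
    )


print(collapse_pbp_labels("ArgM:src"))
print(collapse_pbp_labels("ArgM:comp"))
print(collapse_pbp_labels("ArgM:tmp"))  # other labels are left unchanged
```

The function works on plain strings, so it can be applied to the raw content of the relevant column without assuming anything else about how roles and heads are encoded there.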
Inter-Annotator Agreement
PBP was annotated by a single person, initially following the guidelines of Duran (2014), which were enriched throughout the annotation project. To evaluate the quality of the annotation, an inter-annotator agreement study was conducted a posteriori on a sample of 100 sentences from the PropBank-Br v.2 corpus, on which the Duran (2014) guidelines are based. The sample was re-annotated and compared with the original PropBank-Br v.2 annotation using kappa; the agreement achieved was 0.907. For the interested user, the original and new annotations used for computing the agreement are available for download.
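As an illustration of the kind of measure involved, the sketch below computes Cohen's kappa for two annotators who labeled the same items. Whether the reported figure was obtained with Cohen's kappa specifically, and over which units (tokens, arguments, predicates), is not stated here, so both are assumptions of this example; the function name and the toy labels are hypothetical and do not come from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of the same length")
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one and the same label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up labels (not corpus data).
original = ["Arg0", "Arg0", "Arg1", "Arg1"]
reannotated = ["Arg0", "Arg0", "Arg1", "Arg0"]
print(cohen_kappa(original, reannotated))
```

In the toy example the annotators agree on 3 of 4 items (observed agreement 0.75) and chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.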
PBP versions
PBP may be found in two versions:
Classic version (click here for the distribution of tags/roles)
This version emphasizes the concept of proposition in SRL, regardless of the underlying syntactic analysis. Consequently:
We annotate semantic roles related to “ser” (to be).
We do not annotate semantic roles related to verbs considered full verbs by the UD morphosyntactic annotation, but considered auxiliaries in the PBP Portuguese guidelines (modal and aspectual auxiliaries).
We annotate verbs that, although considered full verbs by the UD morphosyntactic annotation, are considered auxiliaries in the PBP Portuguese guidelines (ArgM-mod, ArgM-asp), in addition to those verbs always considered auxiliaries (ArgM-tml; ArgM-pas).
UD version (click here for the distribution of tags/roles)
This version assigns semantic roles only to tokens considered full verbs (upos = VERB) in the UD annotation layer. Consequently:
We do not annotate semantic roles related to “ser” and “estar” (to be), which are considered “AUX” in UD.
We do annotate semantic roles related to verbs that the UD morphosyntactic annotation considers full verbs but that the PBP Portuguese guidelines consider modal or aspectual auxiliaries (ArgM-mod, according to the PropBank tagset, and ArgM-asp). However, to differentiate them from other semantic roles, these roles were tagged as Arg0_d and Arg1_d. If desired, the two labels may be replaced by Arg0 and Arg1, respectively.
Note that this UD version follows the guidelines in force when the corpus annotation decisions were made, at which time the verbs considered auxiliaries for Portuguese were "ser", "estar", "ir", "ter" and "haver". Nowadays, the auxiliary class may also include other verbs (for reference, see this page).
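The optional relabeling of Arg0_d and Arg1_d mentioned above can be sketched in Python as follows. The sketch assumes that each label simply drops its "_d" suffix (Arg0_d becomes Arg0 and Arg1_d becomes Arg1); the helper name `drop_d_suffix` is hypothetical, and the example strings are made up rather than taken from the corpus.

```python
import re

# Matches exactly the two labels Arg0_d and Arg1_d as whole words.
_D_LABEL = re.compile(r"\bArg([01])_d\b")


def drop_d_suffix(text):
    """Rewrite Arg0_d -> Arg0 and Arg1_d -> Arg1; leave everything else."""
    return _D_LABEL.sub(r"Arg\1", text)


print(drop_d_suffix("Arg0_d"))
print(drop_d_suffix("Arg1_d"))
print(drop_d_suffix("ArgM-mod"))  # unrelated labels are not touched
```

Keeping the "_d" labels is the safer default when the distinction between these verbs and ordinary predicates matters for the analysis; the relabeling is only useful when a uniform Arg0/Arg1 inventory is preferred.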
We present below the information encoded in each version (some columns are omitted for readability).
In addition to the classic and ud versions, we provide two more versions, aimed at (i) making the corpus easier to consult in linguistic research and (ii) making it easier to decompose SRL into two subtasks: explicit role labeling and implicit role labeling.
Complete version: contains all relations between verbal predicates and arguments.
Explicit-only version: contains only the relations between verbal predicates and arguments that are explicit in the sentence. Unlike the complete version, this version does not contain the id of the head predicate in column 10, only the type of argument. This version does not contain predication relations between copula verbs, even if all arguments are explicit in the sentence.
If in doubt, we suggest using the complete classic version, as it is more committed to the concept of proposition.
Download of the corpus (licensed as Creative Commons CC-BY)
Classic version
UD version
The interested user may also find previous related PropBank initiatives at the following webpages:
Propbank-Br (for news genre, built over manually reviewed constituent syntactic trees for the Brazilian portion of Bosque corpus) -- click here to download this corpus or here to access its annotation manual
Propbank-Br v.2 (also for news genre, built over automatically produced constituent syntactic trees for sentences selected from PLN.Br corpus) -- click here to download this corpus or here to access its annotation manual
How to cite
Freitas, C.; Pardo, T.A.S. (2025). PropBanks e representações semânticas: o que temos, o que queremos e o que podemos. LinguaMÁTICA, Vol 17, N.2, pp. 1-29. pdf
Freitas, C.; Pardo, T.A.S. (2024). PropBank e anotação de papéis semânticos para a língua portuguesa: O que há de novo? In Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 118-128. November 17-21. Belém-PA, Brazil. pdf
Other relevant references
Freitas, C. (2024). Anotação de papéis semânticos no corpus Porttinari-base: procedimentos, resultados e análises. Relatório Técnico do ICMC 450. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Dezembro, 145p. pdf
Duran, M.S.; Freitas, C. (2024). Guia de anotação de papéis semânticos seguindo o modelo propbank no córpus Porttinari-base. Relatório Técnico do ICMC 449. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, Novembro, 50p. pdf
Acknowledgments
To Magali S. Duran and Elvis A. Souza, for their scientific and technical contributions to the development of this work.