Porttinari is a multi-genre treebank for Brazilian Portuguese with sentences that are manually annotated according to the Universal Dependencies model. As reported in this paper, it is currently under construction and will be composed of news texts, user-generated content, transcribed speech (under annotation) and legal texts (also under annotation) selected from the following corpora:
Folha corpus at Kaggle - 167,053 news texts from the Folha de São Paulo newspaper
DANTE (Dependency-ANalised corpora of TwEets) - currently composed of the DANTEStocks corpus, with 4,042 Brazilian stock market tweets, building on a previously existing corpus annotated with emotions according to Plutchik’s Wheel of Emotions, as described by Silva et al. (2020)
Roda-Viva - transcribed data of selected Roda Viva interviews, as described by Miranda Jr et al. (2024)
PortJur - 2,005 sentences of Brazilian government legal texts, such as the Statute of the Child and Adolescent, the Statute of the Elderly and judicial decisions of the STF and STJ
The current version (2.1) of the treebank is available at this link. Part of the data may also be found at the Universal Dependencies webpage.
Other corpora and lexical resources
Porttinari-base Propbank (PBP): as reported by Freitas and Pardo (2024), the Porttinari-base portion of the Porttinari treebank (composed of journalistic texts, with 168,000 tokens and 8,418 sentences) annotated with a layer of PropBank-style semantic roles, which identify who did what to whom, where, when, how, why, for what, with what, with whom, etc.
NounBank.DS: as described by Barbosa (2024), a repository of predicate names from the DANTEStocks corpus (on stock market topics) and their respective syntactic-semantic valence (the interested user may find the data and more related information at github)
Porttinari-base corpus with annotation of Extended Enhanced Universal Relations (EEUD): as reported by Duran et al. (2025), a corpus of news sentences annotated with the original 6 types of UD enhanced dependency relations and the extended ones proposed by the authors
Portuguese Tweet Corpus Annotated with NER
First version of the annotation: as reported by Zerbinati et al. (2024), DANTEStocks posts that were manually annotated with named entities according to the HAREM taxonomy
Second version of the annotation: as reported by Piai (2025), an expansion of the above work, applying linguistically oriented guidelines and providing a more granular analysis by employing a refined taxonomy of entity types, including new types created specifically for the financial domain and the tweet genre
News Texts Annotated with Speech Acts: as reported by Silva et al. (2024), a subset of the Porttinari-base corpus manually annotated with speech acts, using the tagset proposed by ISO 24617-2
PortiLexicon-UD (also available at this link): based on UNITEX-PB and on recent corpus analyses and linguistic studies, it is a large lexicon with part of speech tags, lemmas and morphological features for words in Portuguese, following the Universal Dependencies model, with more than 1.2 million word forms, freely available under a CC-BY license (the full lexicon is also available here for download) -- see this paper for more information (the interested user may find more related information at github)
Semantically annotated texts for Brazilian Portuguese: "The Little Prince" book, news, and product reviews manually annotated according to the Abstract Meaning Representation (AMR)
Verbo-Brasil search interface - an online tool for searching the Verbo-Brasil repository, making it possible to use regular expressions to look for data of interest
Manually annotated corpora with (explicit and implicit) opinion aspects for online comments on "camera", "smartphone", "book" and "hotel" domains (CSV-encoded, using IOB annotation format)
Lexicon of implicit aspect clues and their corresponding aspects retrieved from the above corpora (XML-encoded), as described in this paper
Typology (with categories and subcategories) of implicit aspect clues (regarding the necessary knowledge to identify them) and the classification of the clues cited above (CSV-encoded), as described in this paper
Steam corpus (also available at Kaggle website) - more than 2 million comments in Brazilian Portuguese about games at the Steam website (extracted from 10 thousand games that had their name and genre manually annotated), used for research on usefulness prediction (as described here)
DCG grammar for Portuguese - a full (constituency) DCG grammar automatically extracted from the Brazilian portion (CETENFolha) of the Bosque treebank, where each grammar rule is accompanied by its overall frequency in the treebank, the number of sentences in which it was used, and its probability (thanks to Vinícius F. Arruda for running the grammar extraction)
Additional material includes (a) files for individual DCG rules with the sentences from which the rules were extracted (the files are named according to the rules -- when a file name would be too long, it is truncated and an additional ID is used), (b) files for individual sentences and the DCG rules that are used in them (the files are named according to the sentence numbers), and (c) files with (visual) parse trees of individual sentences (the files are named according to the sentence numbers)
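As a rough illustration of the three figures that accompany each rule in the DCG grammar above (overall frequency, number of sentences, probability), the following Python sketch computes them from a toy set of rule applications. This page does not state how the probabilities were obtained; the sketch assumes the usual relative-frequency estimate, in which a rule's count is divided by the total count of rules sharing its left-hand side. The function name rule_statistics and the toy data are hypothetical and are not taken from the released grammar.

```python
from collections import Counter

def rule_statistics(sentences):
    """Compute per-rule statistics from rule applications.

    `sentences` is a list with one entry per sentence; each entry is a list
    of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    Assumption: probability = count(rule) / count(all rules with the same lhs).
    """
    frequency = Counter()       # total number of times each rule is applied
    sentence_count = Counter()  # number of sentences in which each rule occurs
    lhs_total = Counter()       # total number of applications per left-hand side

    for rules in sentences:
        for rule in rules:
            frequency[rule] += 1
            lhs_total[rule[0]] += 1
        for rule in set(rules):  # count each rule once per sentence
            sentence_count[rule] += 1

    return {
        rule: {
            "frequency": count,
            "sentences": sentence_count[rule],
            "probability": count / lhs_total[rule[0]],
        }
        for rule, count in frequency.items()
    }

# Toy data (hypothetical): rule applications for two parsed sentences.
toy_sentences = [
    [("s", ("np", "vp")), ("np", ("det", "n")), ("vp", ("v", "np")), ("np", ("n",))],
    [("s", ("np", "vp")), ("np", ("n",)), ("vp", ("v",))],
]

stats = rule_statistics(toy_sentences)
for (lhs, rhs), values in sorted(stats.items()):
    print(f"{lhs} --> {', '.join(rhs)}", values)
```

Counting a rule once per sentence (via the set) keeps the sentence count distinct from the overall frequency, which matters when a rule is applied several times within the same sentence.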
Tools and applications
Named Entity Recognition (NER) system for financial tweets: source code for named entity annotation in financial tweets, based on the corpus annotation reported by Zerbinati et al. (2024)
Speech act classification system for news texts: as reported by Silva et al. (2024), source code for speech act classification in news texts, using the tagset proposed by ISO 24617-2
Genipapo: as reported by Di Felippo et al. (2024), it is a robust multigenre dependency parser for Brazilian Portuguese (trained with three distinct gold standard corpora, namely, news texts of Porttinari-base, academic texts on the oil and gas domain from PetroGold, and user-generated content (posts from X, formerly Twitter) on stock markets from DANTEStocks), following the Universal Dependencies framework (the interested user may find the source code and more related information at github)
UGC Parser: as reported by Barbosa (2024), a dependency parser specifically trained on User-Generated Content in Brazilian Portuguese, using the DANTEStocks corpus (composed of X posts on stock markets), following the Universal Dependencies framework
Portparser.v2: a new version of Portparser (trained on news texts), now following the LatinPipe architecture of Straka et al. (2024), achieving state of the art results for Portuguese parsing according to the Universal Dependencies model (the interested user may find the source code and more related information at github)
Portparser (the interested user may find the [better] 2nd version above): as described here, a syntactical parser for Brazilian Portuguese (trained on news texts) according to the Universal Dependencies model
Porttagger: as described here, a state of the art multi-genre Brazilian Portuguese part of speech tagger according to the Universal Dependencies model (trained on news texts, tweets and academic texts) (the interested user may find the source code and more related information at github)
PortTokenizer: a tokenization tool for Portuguese, which receives as input a single text file with the sentences of interest (one per line) and generates a CoNLL-U file with all sentences tokenized
PortSentencer: a general sentence segmentation tool for Portuguese, which receives as input one or more text files and generates a single text file with one sentence per line
Enhanced Universal Dependencies annotation tool for Portuguese: the tool produces the UD annotation and its enhanced version for input sentences in Portuguese, following a set of annotation rules for Portuguese
Verifica-UD (an older version may be found at this link): as described here, a verifier for Universal Dependencies annotation for Portuguese that automatically checks structural, morphosyntactic and syntactic annotation issues (the interested user may find the source code and more related information at github)
UDConcord: as described here, a visual and easy-to-use web-based concordance tool for UD-annotated treebanks (the interested user may find the source code at github)
Arborator-Grew-NILC: as described here, an extended and improved version of Arborator-Grew (Guibon et al., 2020) for UD annotation (the interested user may find the source code at github)
UD Annotation and Visualization tool - an online interface for visualizing and editing CoNLL-U files, including enhanced relations
conlluFile package: a Python 3 package to handle CoNLL-U files in order to more easily access, edit, and print the encoded information
Opinion aspect extraction methods for Portuguese: include frequency, rule, machine learning and language model-based methods -- see this paper for more information
Semantic-based opinion summarization methods for Brazilian Portuguese: investigated, developed and evaluated during the MSc of Marcio L. Inácio (see a related paper here)
Deception detection for Portuguese: varied methods for detecting fake news, including machine learning (classical and deep learning approaches), knowledge graphs and complex networks (see this and this monograph for details)
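Several of the tools listed above read or write CoNLL-U files (for instance PortTokenizer, the UD Annotation and Visualization tool and the conlluFile package). As a minimal sketch of what such a file contains, the Python code below parses CoNLL-U text using only the standard library. It does not use the conlluFile package, whose interface is not described on this page; the function name parse_conllu and the sample sentence are hypothetical. It relies only on the general CoNLL-U layout: ten tab-separated columns per token, comment lines starting with "#", and a blank line between sentences.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the current sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):    # sentence-level comment (e.g. "# text = ...")
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
            tokens.append(dict(zip(FIELDS, columns)))
    if tokens:                        # the file may lack a final blank line
        sentences.append(tokens)
    return sentences

# Hypothetical sample sentence, built with explicit tabs for clarity.
rows = [
    ["1", "O", "o", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "menino", "menino", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "dorme", "dormir", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
sample = "# text = O menino dorme.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

parsed = parse_conllu(sample)
root = next(token for token in parsed[0] if token["head"] == "0")
print(root["form"], root["upos"])
```

Keeping every column as a string avoids special-casing multiword-token ranges and empty nodes, whose IDs are not plain integers.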
Related third-party products
Easy-to-use web-based UD validation tool, developed by Elvis A. Souza and kindly shared with the research community
Predicative nouns in Portuguese, as reported by Barros et al. (2024)