Resources and tools

POeTiSA: POrtuguese processing - Towards Syntactic Analysis and parsing

Porttinari (PorTuguese Treebank)

Porttinari is intended to be a large multi-genre corpus of Brazilian Portuguese texts manually annotated according to the Universal Dependencies model. As reported in this paper, it is currently under construction and is composed of news texts, user-generated content and transcribed speech selected from the following corpora:

1. Folha corpus at Kaggle - 167,053 news texts from Folha de São Paulo newspaper
2. MAC-MORPHO - a 1.1 million word corpus of newspaper articles, originally developed in the Lacio-Web project, as described by Aluísio et al. (2003)
3. DANTE (Dependency-ANalised corpora of TwEets) - currently composed of the DANTEStocks corpus, with 4,048 Brazilian stock market tweets manually annotated with morphological and morphosyntactic information, building on a previously existing corpus annotated with emotions according to Plutchik’s Wheel of Emotions, as described by Silva et al. (2020)
4. B2W-reviews01 - more than 130,000 e-commerce customer reviews, collected from the Americanas website, as described by Real et al. (2019)
5. Book reviews - a small (but challenging) corpus of online book review sentences, as described by Belisário et al. (2020)
6. Roda-Viva - transcribed data of selected Roda Viva interviews, as described by Miranda Jr et al. (2024)

The first version of the journalistic portion of Porttinari is already available at this link, which includes its three partitions, namely Porttinari-base, Porttinari-check and Porttinari-automatic, as detailed in this paper.

The latest annotated version of the DANTEStocks corpus (version 1.1, of May 3, 2024) is available at this link. Compared with the previous versions, it incorporates the following improvements: exclusion of sentences written entirely in English, and correction of tokenization, decimal point representation and CoNLL-U structuring issues. For the interested user, the previous versions of this corpus are also available at the following links: versions of December 15, 2022 and November 16, 2022.

Other corpora and lexical resources

PortiLexicon-UD: based on UNITEX-PB and on recent corpus analyses and linguistic studies, it is a large lexicon with part-of-speech tags, lemmas and morphological features for words in Portuguese, following the Universal Dependencies model, with more than 1.2 million word forms, freely available under a CC-BY license (the full lexicon is also available here for download) -- see this paper for more information (the interested user may find more related information at github)
Semantically annotated texts for Brazilian Portuguese: "The Little Prince" book, news, and product reviews manually annotated according to the Abstract Meaning Representation (AMR)
Verbo-Brasil search interface - an online tool for searching the Verbo-Brasil repository, making it possible to use regular expressions to look for data of interest
Manually annotated corpora with (explicit and implicit) opinion aspects for online comments on "camera", "smartphone", "book" and "hotel" domains (CSV-encoded, using IOB annotation format)
Lexicon of implicit aspect clues and their corresponding aspects retrieved from the above corpora (XML-encoded), as described in this paper
Typology (with categories and subcategories) of implicit aspect clues (regarding the necessary knowledge to identify them) and the classification of the clues cited above (CSV-encoded), as described in this paper
Steam corpus (also available at Kaggle website) - more than 2 million comments in Brazilian Portuguese about games at the Steam website (extracted from 10 thousand games that had their name and genre manually annotated), used for research on usefulness prediction (as described here)
DCG grammar for Portuguese - a full (constituency) DCG grammar automatically extracted from the Brazilian portion (CETENFolha) of the Bosque treebank, where each grammar rule is accompanied by its overall frequency in the treebank, the number of sentences in which it was used, and its probability (thanks to Vinícius F. Arruda for running the grammar extraction)
- Additional material includes (a) files for individual DCG rules with the sentences from which the rules were extracted (the files are named according to the rules -- when a file name happens to be too long, it is truncated and an additional ID is used), (b) files for individual sentences and the DCG rules that are used in them (the files are named according to the sentence numbers), and (c) files with (visual) parse trees of individual sentences (the files are named according to the sentence numbers)
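For readers unfamiliar with the IOB format mentioned for the opinion aspect corpora above, the following is a minimal sketch of how IOB tags can be grouped back into aspect spans. The `ASPECT` label, the example sentence and the function name `iob_to_spans` are illustrative assumptions; the actual label set and CSV column layout of the corpora may differ and should be checked in the distributed files.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (label, text) spans.

    Assumes tags of the form "B-<label>", "I-<label>" or "O".
    An "I-" tag that does not continue an open span of the same
    label is treated as outside any span (a simplifying choice).
    """
    spans: List[Tuple[str, str]] = []
    label = None
    current: List[str] = []

    def close() -> None:
        # Store the span collected so far, if any, and reset the state.
        nonlocal label, current
        if current:
            spans.append((label, " ".join(current)))
        label, current = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()
            label, current = tag[2:], [token]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            close()
    close()
    return spans


# Hypothetical example sentence; "ASPECT" is an assumed label name.
tokens = ["A", "câmera", "frontal", "é", "ótima", "mas", "a", "bateria", "é", "fraca"]
tags = ["O", "B-ASPECT", "I-ASPECT", "O", "O", "O", "O", "B-ASPECT", "O", "O"]
print(iob_to_spans(tokens, tags))
```

The `B-` tag opens a span, `I-` tags extend it, and anything else closes it, which is why two adjacent aspects can be told apart even without an `O` token between them.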

Tools and applications

Portparser: as described here, a state-of-the-art syntactic parser for Brazilian Portuguese (trained on news texts) according to the Universal Dependencies model (the interested user may find the source code and more related information at github)
Porttagger: as described here, a state-of-the-art multi-genre Brazilian Portuguese part-of-speech tagger according to the Universal Dependencies model (trained on news texts, tweets and academic texts) (the interested user may find the source code and more related information at github)
PortTokenizer: a tokenization tool for Portuguese, which receives as input a single text file with the sentences of interest (one per line) and generates a CoNLL-U file with all sentences tokenized
PortSentencer: a general sentence segmentation tool for Portuguese, which receives as input one or more text files and generates a single text file with one sentence per line
Verifica-UD: as described here, a verifier for Universal Dependencies annotation for Portuguese that automatically checks structural, morphosyntactic and syntactic annotation issues (the interested user may find the source code and more related information at github)
UDConcord: as described here, a visual and easy-to-use web-based concordance tool for UD-annotated treebanks (the interested user may find the source code at github)
Arborator-Grew-NILC: as described here, an extended and improved version of Arborator-Grew (Guibon et al., 2020) for UD annotation (the interested user may find the source code at github)
UD Annotation and Visualization tool - an online interface for visualizing and editing CoNLL-U files, including enhanced relations
conlluFile package: a Python 3 package to handle CoNLL-U files in order to more easily access, edit, and print the encoded information
Opinion aspect extraction methods for Portuguese: including frequency-based, rule-based, machine learning and language model-based methods -- see this paper for more information
Semantic-based opinion summarization methods for Brazilian Portuguese: investigated, developed and evaluated during the MSc of Marcio L. Inácio (see a related paper here)
Deception detection for Portuguese: varied methods for detecting fake news, including machine learning (classical and deep learning approaches), knowledge graphs and complex networks (see this and this monograph for details)
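Several of the tools above read or write CoNLL-U files. As a rough illustration of that format, the sketch below reads CoNLL-U text using only the Python standard library. It does not use the conlluFile package or reproduce its interface; the function name `parse_conllu` and the example sentence with its annotation are made up for this illustration. It relies only on the standard CoNLL-U layout: comment lines starting with `#`, ten tab-separated columns per token, and a blank line between sentences.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[Dict[str, list]]:
    """Split CoNLL-U text into sentences with comments and token dicts."""
    sentences = []
    comments: List[str] = []
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
            tokens.append(dict(zip(FIELDS, columns)))
    if tokens:  # last sentence without a trailing blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


# Illustrative sentence; the annotation values are invented for this example.
rows = [
    ["1", "Ela", "ele", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "chegou", "chegar", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
sample = "# text = Ela chegou.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

for sentence in parse_conllu(sample):
    for token in sentence["tokens"]:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

Keeping every column as a string leaves multiword-token ranges such as `1-2` and empty-node IDs such as `1.1` intact, at the cost of converting `head` to an integer only where needed.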

Related third-party products

Easy-to-use web-based UD validation tool, developed by Elvis A. Souza and kindly shared with the research community