Selected contributions

Natural Language Processing for Portuguese (NLP2)

C4AI -- Center for Artificial Intelligence

Selected resources, tools and applications

Porttinari: as reported by Pardo et al. (2021), a multi-genre treebank for Brazilian Portuguese with sentences that are manually annotated according to the Universal Dependencies model
Porttinari-base Propbank (PBP): as reported by Freitas and Pardo (2024), the Porttinari-base portion (composed of journalistic texts) of the Porttinari treebank annotated with a layer of PropBank-style semantic roles, which identify who did what to whom, where, when, how, why, for what, with what, with whom, etc.
NounBank.DS: as described by Barbosa (2024), a repository of predicate names from the DANTEStocks corpus (on stock market topics) and their respective syntactic-semantic valence
PortiLexicon-UD (also available at this link): as reported by Lopes et al. (2022), based on UNITEX-PB and on recent corpus analyses and linguistic studies, it is a large lexicon with part of speech tags, lemmas and morphological features for words in Portuguese, following the Universal Dependencies model, with more than 1.2 million word forms
Carolina: a large corpus with texts in Brazilian Portuguese (1970-2021), with information on origin and typology, currently including 650 million tokens and available in open access (free download for research purposes) -- see this paper for more information
CORAA (CORpus de Áudios Anotados): a large multi-purpose corpus of Brazilian Portuguese audio files aligned with transcriptions and manually validated, intended for training ASR and TTS models as well as for Speech Emotion Recognition (SER), i.e., sentiment analysis based on acoustic audio features, and including baseline models for ASR and SER
  - Version 1: CORAA ASR - Academic Corpora Projects, composed of academic corpora projects and a collection of TED talks, all of them with academic license (CC BY NC ND 4.0 International) and totaling 290.79 hours (a strong ASR baseline, consisting of a pre-trained version of the Wav2Vec 2.0 model, can be found here, while instructions to train and test the model can be found here -- see this paper for details)
  - Version 6: CORAA SER Sentiment Analysis Dataset, composed of approximately 50 minutes of audio segments labeled in three classes (neutral, non-neutral female, and non-neutral male - while the neutral class represents audio segments with no well-defined emotional state, the non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech)
Portparser.v2: a new version of Portparser (trained on news texts), following the LatinPipe architecture of Straka et al. (2024), achieving state of the art results for Portuguese parsing according to the Universal Dependencies model
Genipapo: as reported by Di Felippo et al. (2024), it is a robust multigenre dependency parser for Brazilian Portuguese (trained with three distinct gold standard corpora, namely, news texts of Porttinari-base, academic texts on the oil and gas domain from PetroGold, and user-generated content (posts from X, formerly Twitter) on stock markets from DANTEStocks), following the Universal Dependencies framework
Porttagger: as described by Silva et al. (2023), a state of the art multi-genre Brazilian Portuguese part of speech tagger according to the Universal Dependencies model (trained on news texts, tweets and academic texts)

Selected publications

Duran, M.S.; Souza, E.A.; Nunes, M.G.V.; Pagano, A.S.; Pardo, T.A.S. (2025). Extending the Enhanced Universal Dependencies - addressing subjects in pro-drop languages. In the Proceedings of the Eighth Workshop on Universal Dependencies (UDW), pp. 143-152. August, 26-29. Ljubljana, Slovenia. link to the paper
Duran, M.S.; Lopes, L.; Pardo, T.A.S. (2025). The revision of linguistic annotation in the Universal Dependencies framework: a look at the annotators’ behavior. In the Proceedings of the 19th Linguistic Annotation Workshop (LAW), pp. 60-69. July, 31. Vienna, Austria. link to the paper
Freitas, C.; Pardo, T.A.S. (2024). PropBank e anotação de papéis semânticos para a língua portuguesa: O que há de novo? In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 118-128. November, 17-21. Belém-PA, Brazil. link to the paper
Di Felippo, A.; Nunes, M.G.V.; Barbosa, B. (2024). A Dependency Treebank of Tweets in Brazilian Portuguese: Syntactic Annotation Issues and Approach. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 192-201. November, 17-21. Belém-PA, Brazil. link to the paper
Souza, E.; Duran, M.S.; Nunes, M.G.V.; Sampaio, G.; Belasco, G.; Pardo, T.A.S. (2024). Automatic Annotation of Enhanced Universal Dependencies for Brazilian Portuguese. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 217-226. November, 17-21. Belém-PA, Brazil. link to the paper
Lopes, L.; Pardo, T.A.S. (2024). Towards Portparser - a highly accurate parsing system for Brazilian Portuguese following the Universal Dependencies framework. In the Proceedings of the 16th International Conference on Computational Processing of Portuguese (PROPOR), pp. 401-410. May, 13-15. link to the paper
Di Felippo, A.; Roman, N.T.; Barbosa, B.; Pardo, T.A.S. (2024). Genipapo - A Multigenre Dependency Parser for Brazilian Portuguese. In the Proceedings of the 15th Symposium in Information and Human Language Technology (STIL), pp. 257-266. November, 17-21. Belém-PA, Brazil. link to the paper
Di Felippo, A.; Roman, N.T.; Pardo, T.A.S.; Moura, L.P. (2023). The DANTEStocks Corpus: an analysis of the distribution of Universal Dependencies-based Part-of-Speech tags. Revista Da ABRALIN, Vol. 22, N. 2, pp. 249-271. link to the paper
Santos, W.R.; Oliveira, R.L.; Paraboni, I. (2023). SetembroBR: a social media corpus for depression and anxiety disorder prediction. Language Resources and Evaluation. link to the paper
Inácio, M.L.; Sobrevilla Cabezudo, M.A.; Ramisch, R.; Di Felippo, A.; Pardo, T.A.S. (2023). The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts. DELTA: Documentação e Estudos em Linguística Teórica e Aplicada, Vol. 39, N. 3, pp. 1-31. link to the paper
Anchiêta, R.T.; Pardo, T.A.S. (2022). Análise Semântica com base em AMR para o Português. LinguaMÁTICA, Vol. 14, N. 1, pp. 33-48. link to the paper
Candido Jr, A.; Casanova, E.; Soares, A.; Oliveira, F.S.; Oliveira, L.; Fernandes Jr, R.C.; Silva, D.P.P.; Fayet, F.G.; Carlotto, B.B.; Gris, L.R.S.; Aluísio, S.M. (2022). CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Language Resources & Evaluation. link to the paper
Casanova, E.; Weber, J.; Shulby, C.D.; Júnior, A.C.; Gölge, E.; Ponti, M.A. (2022). YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. In the Proceedings of the Thirty-ninth International Conference on Machine Learning (ICML), pp. 2709-2720. link to the paper
Casanova, E.; Candido Jr, A.; Shulby, C.; Oliveira, F.S.; Teixeira, J.P.; Ponti, M.A.; Aluísio, S.M. (2022). TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Language Resources and Evaluation, Vol. 56, pp. 1043–1055. link to the paper
Duran, M.S.; Nunes, M.G.V.; Lopes, L.; Pardo, T.A.S. (2022). Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa. Domínios de Lingu@gem, Vol. 16, N. 4, pp. 1608-1643. link to the paper
Lopes, L.; Duran, M.S.; Fernandes, P.H.L.; Pardo, T.A.S. (2022). PortiLexicon-UD: a Portuguese Lexical Resource according to Universal Dependencies Model. In the Proceedings of the 13th Edition of the Language Resources and Evaluation Conference (LREC), pp. 6635‑6643. link to the paper
Silva, A.C.M.; Silva, D.F.; Marcacini, R.M. (2022). Heterogeneous Graph Neural Network for Music Emotion Recognition. In the Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR). Accepted for publication.
Sturzeneker, M.L.; Crespo, M.C.R.M.; Rocha, M.L.S.J.; Finger, M.; Paixão de Sousa, M.C.; Monte, V.M.; Namiuti, C. (2022). Carolina’s Methodology: building a large corpus with provenance and typology information. In the Proceedings of the Second Workshop on Digital Humanities and Natural Language Processing (DHandNLP), pp. 53-58. link to the paper
Casanova, E.; Shulby, C.D.; Gölge, E.; Müller, N.M.; Oliveira, F.S.; Candido Jr., A.; Soares, A.S.; Aluísio, S.M.; Ponti, M.A. (2021). SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In the Proceedings of Interspeech, pp. 3645-3649. link to the paper
Inácio, M.L.; Pardo, T.A.S. (2021). Semantic-Based Opinion Summarization. In the Proceedings of Recent Advances in Natural Language Processing (RANLP), pp. 624-633. link to the paper
Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2021). Universal Dependencies-based PoS Tagging Refinement through Linguistic Resources. In the Proceedings of the 10th Brazilian Conference on Intelligent Systems (BRACIS), pp. 601-615. link to the paper
Mattos, J.P.R.; Marcacini, R.M. (2021). Semi-Supervised Graph Attention Networks for Event Representation Learning. In the Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 1234-1239. link to the paper
Pardo, T.A.S.; Duran, M.S.; Lopes, L.; Di Felippo, A.; Roman, N.T.; Nunes, M.G.V. (2021). Porttinari - a large multi-genre treebank for Brazilian Portuguese. In the Proceedings of the XIV Symposium in Information and Human Language Technology (STIL), pp. 1-10. link to the paper
Silva, E.H.; Pardo, T.A.S.; Roman, N.T.; Di Felippo, A. (2021). Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging. In the Proceedings of the 18th National Meeting on Artificial and Computational Intelligence (ENIAC), pp. 434-445. link to the paper
Souza, M.C.; Nogueira, B.M.; Rossi, R.G.; Marcacini, R.M.; Santos, B.N.; Rezende, S.O. (2021). A network-based positive and unlabeled learning approach for fake news detection. Machine Learning, pp. 1-44. link to the paper
Anchiêta, R.T.; Pardo, T.A.S. (2020). Semantically Inspired AMR Alignment for the Portuguese language. In the Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1595-1600. link to the paper

Other related initiatives
