Selected contributions
Natural Language Processing for Portuguese (NLP2)
Selected resources, tools and applications
PortiLexicon-UD: based on UNITEX-PB and on recent corpus analyses and linguistic studies, it is a large lexicon with part-of-speech tags, lemmas and morphological features for Portuguese words, following the Universal Dependencies model, with more than 1.2 million word forms, freely available under a CC-BY license -- see this paper for more information
Lexicon of implicit aspect clues and their corresponding aspects retrieved from the above corpora (XML-encoded), as described in this paper
Carolina: a large corpus with texts in Brazilian Portuguese (1970-2021), with information on origin and typology, currently including 650 million tokens and available in open access (free download for research purposes) -- see this paper for more information
CORAA (CORpus de Áudios Anotados): a large multi-purpose corpus of Brazilian Portuguese audio files aligned with transcriptions and manually validated, intended for training automatic speech recognition (ASR) and text-to-speech (TTS) models, as well as for sentiment analysis from acoustic features, i.e., speech emotion recognition (SER); it includes baseline models for ASR and SER
Version 1: CORAA ASR - Academic Corpora Projects, composed of academic corpora projects and a collection of TED talks, all under an academic license (CC BY-NC-ND 4.0 International) and totaling 290.79 hours (a strong ASR baseline, consisting of a pre-trained version of the Wav2Vec 2.0 model, can be found here, while instructions to train and test the model can be found here -- see this paper for details)
Version 6: CORAA SER Sentiment Analysis Dataset, composed of approximately 50 minutes of audio segments labeled in three classes: neutral, non-neutral female, and non-neutral male. The neutral class represents audio segments with no well-defined emotional state, while the non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech
Selected publications
Anchiêta, R.T. and Pardo, T.A.S. (2022). Análise Semântica com base em AMR para o Português. LinguaMÁTICA, Vol. 14, N. 1, pp. 33-48. link to the paper
Candido Jr, A.; Casanova, E.; Soares, A.; Oliveira, F.S.; Oliveira, L.; Fernandes Jr, R.C.; Silva, D.P.P.; Fayet, F.G.; Carlotto, B.B.; Gris, L.R.S.; Aluísio, S.M. (2022). CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Language Resources and Evaluation. link to the paper
Casanova, E.; Weber, J.; Shulby, C.D.; Júnior, A.C.; Gölge, E.; Ponti, M.A. (2022). YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone. In the Proceedings of the Thirty-ninth International Conference on Machine Learning (ICML), pp. 2709-2720. link to the paper
Casanova, E.; Candido Jr, A.; Shulby, C.; Oliveira, F.S.; Teixeira, J.P.; Ponti, M.A.; Aluísio, S.M. (2022). TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Language Resources and Evaluation, Vol. 56, pp. 1043-1055. link to the paper
Duran, M.S.; Nunes, M.G.V.; Lopes, L.; Pardo, T.A.S. (2022). Manual de anotação como recurso de Processamento de Linguagem Natural: o modelo Universal Dependencies em língua portuguesa. Domínios de Lingu@gem, Vol. 16, N. 4, pp. 1608-1643. link to the paper
Lopes, L.; Duran, M.S.; Fernandes, P.H.L.; Pardo, T.A.S. (2022). PortiLexicon-UD: a Portuguese Lexical Resource according to Universal Dependencies Model. In the Proceedings of the 13th Edition of the Language Resources and Evaluation Conference (LREC), pp. 6635-6643. link to the paper
Silva, A.C.M.; Silva, D.F.; Marcacini, R.M. (2022). Heterogeneous Graph Neural Network for Music Emotion Recognition. In the Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR). Accepted for publication.
Sturzeneker, M.L.; Crespo, M.C.R.M.; Rocha, M.L.S.J.; Finger, M.; Paixão de Sousa, M.C.; Monte, V.M.; Namiuti, C. (2022). Carolina’s Methodology: building a large corpus with provenance and typology information. In the Proceedings of the Second Workshop on Digital Humanities and Natural Language Processing (DHandNLP), pp. 53-58. link to the paper
Casanova, E.; Shulby, C.D.; Gölge, E.; Müller, N.M.; Oliveira, F.S.; Candido Jr., A.; Soares, A.S.; Aluísio, S.M.; Ponti, M.A. (2021). SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In the Proceedings of Interspeech, pp. 3645-3649. link to the paper
Inácio, M.L. and Pardo, T.A.S. (2021). Semantic-Based Opinion Summarization. In the Proceedings of Recent Advances in Natural Language Processing (RANLP), pp. 624-633. link to the paper
Lopes, L.; Duran, M.S.; Pardo, T.A.S. (2021). Universal Dependencies-based PoS Tagging Refinement through Linguistic Resources. In the Proceedings of the 10th Brazilian Conference on Intelligent Systems (BRACIS), pp. 601-615. link to the paper
Mattos, J.P.R. and Marcacini, R.M. (2021). Semi-Supervised Graph Attention Networks for Event Representation Learning. In the Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 1234-1239. link to the paper
Pardo, T.A.S.; Duran, M.S.; Lopes, L.; Di Felippo, A.; Roman, N.T.; Nunes, M.G.V. (2021). Porttinari - a large multi-genre treebank for Brazilian Portuguese. In the Proceedings of the XIV Symposium in Information and Human Language Technology (STIL), pp. 1-10. link to the paper
Silva, E.H.; Pardo, T.A.S.; Roman, N.T.; Di Felippo, A. (2021). Universal Dependencies for Tweets in Brazilian Portuguese: Tokenization and Part of Speech Tagging. In the Proceedings of the 18th National Meeting on Artificial and Computational Intelligence (ENIAC), pp. 434-445. link to the paper
Souza, M.C.; Nogueira, B.M.; Rossi, R.G.; Marcacini, R.M.; Santos, B.N.; Rezende, S.O. (2021). A network-based positive and unlabeled learning approach for fake news detection. Machine Learning, pp. 1-44. link to the paper
Anchiêta, R.T. and Pardo, T.A.S. (2020). Semantically Inspired AMR Alignment for the Portuguese language. In the Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1595-1600. link to the paper
Other related initiatives