Opinion Mining for Portuguese
Concept-based Approaches and Beyond
According to Liu (2012), "sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes". Cambria (2013) and Cambria et al. (2015) go further and define what they call concept-level sentiment analysis, which, according to the authors, seeks a deeper understanding of the texts of interest in order to produce better results. It draws on more sophisticated NLP tasks for extracting opinionated information from text, including microtext analysis, semantic parsing, subjectivity detection, anaphora resolution, sarcasm detection, topic spotting, aspect extraction, and polarity detection.
The OPINANDO project aimed to investigate issues of concept-level analysis for the Brazilian Portuguese language. We were particularly interested in three main research fronts, namely: (i) the identification of relevant texts to mine, which included tackling text importance and filtering deceptive content; (ii) the analysis of the selected texts, performing the necessary semantic and discourse analysis and identifying subjective content and the corresponding aspects and polarities; and (iii) the synthesis of the relevant information, using text summarization and generation strategies and dealing with the related challenges in these tasks.
The project was officially funded by the USP Research Office (PRP N. 668, from May 2019 to April 2020) and received student scholarships from the FAPESP, CAPES and CNPq agencies.
Team
Coordinator: Prof. Thiago A. S. Pardo (NILC-ICMC-USP)
Collaborators
Prof. Evandro E. S. Ruiz (USP)
Prof. Oto A. Vale (UFSCar)
Prof. Tiago A. de Almeida (UFSCar)
Students
Renato M. Silva (Post-doc)
Francielle A. Vargas (PhD)
Gabriela Wick Pedro (PhD)
Henrico B. Brum (PhD)
Mateus T. Machado (PhD)
Murilo G. Gazzola (PhD)
Rafael T. Anchiêta (PhD)
Rogério F. Sousa (PhD)
Roney L. S. Santos (PhD)
Emerson Y. Okano (MSc)
Marcio Lima Inácio (MSc)
Pedro R. Pires (MSc)
Raphael R. Silva (MSc)
Angelo A. R. Tessaro (undergrad)
Bruno B. Rizzi (undergrad)
Caroline F. Pettarelli (undergrad)
David C. Silva (undergrad)
Gabriel M. B. A. Carvalho (undergrad)
Luana B. Belisário (undergrad)
Lucas S. F. Cardoso (undergrad)
Luiz G. Ferreira (undergrad)
Otávio A. F. Sousa (undergrad)
Rafael A. Monteiro (undergrad)
Raul W. M. Costa (undergrad)
Sérgio R. G. Barbosa Filho (undergrad)
Related publications
Dias, M.S.; Di Felippo, A.; Rassi, A.P.; Cardoso, P.C.F.; Nóbrega, F.A.A.; Pardo, T.A.S. (2021). An investigation of linguistic problems in automatic multi-document summaries. Revista de Estudos da Linguagem, Vol. 29, N. 2, pp. 859-907. link to the paper
Sobrevilla Cabezudo, M.A. and Pardo, T.A.S. (2020). NILC at WebNLG+: Pretrained Sequence-to-Sequence Models on RDF-to-Text Generation. In the Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pp. 131-136. December, 18. pdf
Sobrevilla Cabezudo, M.A. and Pardo, T.A.S. (2020). NILC at SR’20: Exploring Pre-Trained Models in Surface Realisation. In the Proceedings of the Third Workshop on Multilingual Surface Realisation (MSR), pp. 50-56. December, 12. pdf
Costa, R.W.M. and Pardo, T.A.S. (2020). Métodos baseados em léxico para extração de aspectos de opiniões em português. In the Proceedings of the IX Brazilian Workshop on Social Network Analysis and Mining (BraSNAM), pp. 61-72. November, 16-20. Cuiabá/Brazil. pdf
Anchiêta, R.T. and Pardo, T.A.S. (2020). Semantically Inspired AMR Alignment for the Portuguese language. In the Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1595-1600. November, 16-20. pdf
Belisário, L.B.; Ferreira, L.G.; Pardo, T.A.S. (2020). Evaluating Richer Features and Varied Machine Learning Models for Subjectivity Classification of Book Review Sentences in Portuguese. Information, Vol. 11, N. 9, pp. 1-14. link to the paper
Anchiêta, R.T.; Sousa, R.F.; Pardo, T.A.S. (2020). Modeling the Paraphrase Detection Task over a Heterogeneous Graph Network with Data Augmentation. Information, Vol. 11, N. 9, pp. 1-12. link to the paper
Vargas, F.A. and Pardo, T.A.S. (2020). Studying Dishonest Intentions in Brazilian Portuguese Texts. In the Proceedings of the 1st International Workshop on Deceptive AI, pp. 1-13. August, 30. Santiago de Compostela/Spain. pdf
Anchiêta, R.T. (2020). Abstract Meaning Representation Parsing for the Brazilian Portuguese Language. PhD Thesis. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. 145p. pdf
Vargas, F.A. and Pardo, T.A.S. (2020). Linguistic Rules for Fine-Grained Opinion Extraction. In the Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, pp. 1-6. June, 8. pdf
Vargas, F.A. and Pardo, T.A.S. (2020). Aspect Clustering for Sentiment Analysis. In T.S. Clary (ed.), Horizons in Computer Science Research, Vol. 18, pp. 213-224. Nova Science Publishers Inc. link to the book
Santos, R.L.S.; Wick-Pedro, G.; Leal, S.; Vale, O.A.; Pardo, T.A.S.; Bontcheva, K.; Scarton, C. (2020). Measuring the Impact of Readability Features in Fake News Detection. In the Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pp. 1404-1413. May, 13-15. Marseille/France. pdf
Vargas, F.A. and Pardo, T.A.S. (2020). An Automatic Explicit and Implicit Opinion Aspect Clustering Tool for Portuguese. In the Online Proceedings of PROPOR Demonstration Workshop, pp. 1-3. March, 2-4. Évora/Portugal. pdf
Wick-Pedro, G.; Santos, R.L.S.; Vale, O.A.; Pardo, T.A.S.; Bontcheva, K.; Scarton, C. (2020). Linguistic Analysis Model for Monitoring User Reaction on Satirical News for Brazilian Portuguese. In the Proceedings of the 14th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 12037), pp. 313-320. March, 2-4. Évora/Portugal. link to the paper
Santos, R.L.S. and Pardo, T.A.S. (2020). Fact-Checking for Portuguese: Knowledge Graph and Google Search-Based Methods. In the Proceedings of the 14th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 12037), pp. 195-205. March, 2-4. Évora/Portugal. link to the paper
Nóbrega, F.A.A.; Jorge, A.M.; Brazdil, P.; Pardo, T.A.S. (2020). Sentence Compression for Portuguese. In the Proceedings of the 14th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 12037), pp. 270-280. March, 2-4. Évora/Portugal. link to the paper
Belisário, L.B.; Ferreira, L.G.; Pardo, T.A.S. (2020). Evaluating Methods of Different Paradigms for Subjectivity Classification in Portuguese. In the Proceedings of the 14th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 12037), pp. 261-269. March, 2-4. Évora/Portugal. link to the paper
Anchiêta, R.T. and Pardo, T.A.S. (2020). Exploring the Potentiality of Semantic Features for Paraphrase Detection. In the Proceedings of the 14th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 12037), pp. 228-238. March, 2-4. Évora/Portugal. link to the paper
Bertalan, V.G. and Ruiz, E.E.S. (2020). Predicting judicial outcomes in the Brazilian legal system using textual features. Workshop on Digital Humanities and Natural Language Processing. March, 2-4. Évora/Portugal.
Okano, E.Y.; Liu, Z.; Ji, D.; Ruiz, E.E.S. (2020). Fake news detection on Fake.Br using hierarchical attention networks. In the Proceedings of the 14th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 12037), pp. 143-152. March, 2-4. Évora/Portugal. link to the paper
Silva, R.M.; Santos, R.L.S.; Almeida, T.A.; Pardo, T.A.S. (2020). Towards Automatically Filtering Fake News in Portuguese. Expert Systems with Applications (ESWA), Vol. 146, pp. 1-14. pdf
Bertalan, V.G. and Ruiz, E.E.S. (2019). Using topic modeling to find main discussion topics in brazilian political websites. In the Proceedings of the 25th Brazilian Symposium on Multimedia and the Web (WebMedia), pp. 245-248. October, 29 - November, 1. Rio de Janeiro/RJ. pdf
Sousa, R.F.; Anchiêta, R.T.; Nunes, M.G.V. (2019). Um método baseado em grafos para predição da utilidade de opiniões sobre produtos. In the Proceedings of the VIII Brazilian Workshop on Social Network Analysis and Mining (BraSNAM), pp. 95-106. Belém/PA. pdf
Sousa, R.F.; Brum, H.B.; Nunes, M.G.V. (2019). A bunch of helpfulness and sentiment corpora in brazilian portuguese. In the Proceedings of Symposium in Information and Human Language Technology (STIL), pp. 209-218. Salvador/BA. pdf
Silva, R.R. (2019). Sumarização contrastiva de opinião. Dissertação de Mestrado. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. 152p. pdf
Anchiêta, R.T.; Sobrevilla Cabezudo, M.A.; Pardo, T.A.S. (2019). SEMA: an Extended Semantic Evaluation Metric for AMR. In the Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing). April, 7-13. La Rochelle, France. pdf (preprint version)
Belisário, L.B.; Ferreira, L.G.; Pardo, T.A.S. (2019). Classificação de subjetividade para a língua portuguesa. In Anais do VI Workshop de Iniciação Científica em Tecnologia da Informação e da Linguagem Humana (TILic), pp. 358-361. October, 15-18. Salvador/Bahia, Brazil. pdf
Silva, R.R. and Pardo, T.A.S. (2019). Córpus 4P: um córpus anotado de opiniões em português sobre produtos eletrônicos para fins de sumarização contrastiva de opinião. In Anais da 6a Jornada de Descrição do Português (JDP), pp. 330-338. October, 15-18. Salvador/Bahia, Brazil. pdf
Sobrevilla Cabezudo, M.A.; Mille, S.; Pardo, T.A.S. (2019). Back-Translation as Strategy to Tackle the Lack of Corpus in Natural Language Generation from Semantic Representations. In the Proceedings of the Second Workshop on Multilingual Surface Realization (MSR), pp. 94-103. November, 3. Hong Kong, China. pdf
Belisário, L.B.; Ferreira, L.G.; Pardo, T.A.S. (2019). Classificação de Subjetividade para o Português: Métodos Baseados em Aprendizado de Máquina e em Léxico. 27o Simpósio Internacional de Iniciação Científica e Tecnológica da USP (SIICUSP), pp. 1-1. September, 11. São Carlos/SP. Brazil. pdf
Sobrevilla Cabezudo, M.A. and Pardo, T.A.S. (2019). Natural Language Generation: Recently Learned Lessons, Directions for Semantic Representation-based Approaches, and the Case of Brazilian Portuguese Language. In the Proceedings of the ACL Student Research Workshop (SRW), pp. 81-88. July, 28 to August, 2. Florence/Italy. pdf
Sobrevilla Cabezudo, M.A. and Pardo, T.A.S. (2019). Towards a General Abstract Meaning Representation Corpus for Brazilian Portuguese. In the Proceedings of the 13th Linguistic Annotation Workshop (LAW), pp. 236-244. August, 1. Florence/Italy. pdf
Monteiro, R.A. (2018). Detecção Automática de Notícias Falsas. Trabalho de Conclusão de Curso. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, November, 40p. pdf
Costa, R.W.M. (2018). Extração e qualificação de aspectos de opinião para o português. Trabalho de Conclusão de Curso. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, November, 47p. pdf
Anchiêta, R.T. and Pardo, T.A.S. (2018). A Rule-Based AMR Parser for Portuguese. In the Proceedings of the 16th Ibero-American Conference on Artificial Intelligence (IBERAMIA) (LNCS 11238), pp. 341-353. November, 13-16. Trujillo/Peru. pdf (preprint version)
Nóbrega, F.A.A. and Pardo, T.A.S. (2018). Update Summarization: Building from Scratch for Portuguese and Comparing to English. Journal of the Brazilian Computer Society (JBCS), Vol. 24, N. 11, pp. 1-12. pdf
Monteiro, R.A.; Santos, R.L.S.; Pardo, T.A.S.; Almeida, T.A.; Ruiz, E.E.S.; Vale, O.A. (2018). Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results. In the Proceedings of the 13th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 11122), pp. 324-334. September, 24-26. Canela-RS/Brazil. pdf (preprint version)
Machado, M.T.; Pardo, T.A.S.; Ruiz, E.E.S. (2018). Creating a Portuguese context sensitive lexicon for sentiment analysis. In the Proceedings of the 13th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 11122), pp. 335-344. September, 24-26. Canela-RS/Brazil. pdf (preprint version)
Vargas, F.A. and Pardo, T.A.S. (2018). Aspect clustering methods for sentiment analysis. In the Proceedings of the 13th International Conference on the Computational Processing of Portuguese (PROPOR) (LNAI 11122), pp. 365-374. September, 24-26. Canela-RS/Brazil. pdf (preprint version)
Santos, R.L.S.; Monteiro, R.A.; Pardo, T.A.S. (2018). The Fake.Br corpus - a corpus of fake news for Brazilian Portuguese. Latin American and Iberian Languages Open Corpora Forum (OpenCor). September, 24. Canela-RS/Brazil. pdf
Sobrevilla Cabezudo, M.A. and Pardo, T.A.S. (2018). NILC-SWORNEMO at the Surface Realization Shared Task: Exploring Syntax-Based Word Ordering using Neural Models. In the Proceedings of the First Workshop on Multilingual Surface Realisation, pp. 1–7. July 19. Melbourne/Australia. pdf
Sousa, O.A.F. (2018). Sumarização contrastiva de opinião: uma abordagem com otimização. Trabalho de Conclusão de Curso. Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo. São Carlos-SP, June, 42p. pdf
Anchiêta, R.T. and Pardo, T.A.S. (2018). Towards AMR-BR: A SemBank for Brazilian Portuguese Language. In the Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC), pp. 974-979. May 7-12. Miyazaki/Japan. pdf
Vargas, F.A. and Pardo, T.A.S. (2018). Hierarchical clustering of aspects for opinion mining: a corpus study. In M.J.B. Finatto, R.R. Rebechi, S. Sarmento and A.E.P. Bocorny (eds.), Linguística de Corpus: Perspectivas, pp. 69-91. Porto Alegre: Instituto de Letras da UFRGS. 580p. pdf
Anchiêta, R.T.; Sousa, R.F.; Moura, R.S.; Pardo, T.A.S. (2017). Improving Opinion Summarization by Assessing Sentence Importance in On-line Reviews. In the Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology (STIL), pp. 32-36. October 2-4. Uberlândia-MG/Brazil. pdf
Machado, M.T.; Ruiz, E.E.S.; Pardo, T.A.S. (2017). Analysis of unsupervised aspect term identification methods for Portuguese reviews. In the Proceedings of the 14o Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pp. 239-249. October 2-5. Uberlândia-MG/Brazil. pdf
Machado, M.T.; Temporal, J.C.A.N.; Pardo, T.A.S.; Ruiz, E.E.S. (2017). Mineração de tópicos e aspectos em microblogs sobre Dengue, Chikungunya, Zika e Microcefalia. In Anais do XVII Workshop de Informática Médica (WiM), pp. 265-274. July 3-5. São Paulo-SP/Brazil. pdf
López Condori, R.E. and Pardo, T.A.S. (2017). Opinion Summarization Methods: Comparing and Extending Extractive and Abstractive Approaches. Expert Systems with Applications (ESWA), Vol. 78, pp. 124-134. pdf
Vargas, F.A. and Pardo, T.A.S. (2017). Estudo Empírico sobre Agrupamento e Organização Hierárquica de Aspectos para Mineração de Opinião. Série de Relatórios Técnicos do Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, N. 418. São Carlos-SP, March, 48p. pdf
Applications
Automatic Fake News Detection for Portuguese - as described here, a machine learning solution for fake news detection based on the analysis of text features (the related scripts for training and testing the models are available here)
Opizer - as described here, methods for automatically producing opinion summaries for products of interest
Contrastive opinion summarization - methods for automatically producing opinion summaries that compare/contrast features of products of interest
COS system (also available here) - as described here, a contrastive opinion summarizer based on the proposal of Kim and Zhai (2009)
Tools
RBAMR - as described here, a Rule-Based Abstract Meaning Representation (AMR) Parser for the Portuguese language
CAMR Parser for Portuguese - a transition-based tree-to-graph Abstract Meaning Representation (AMR) parser for the Portuguese language (more details about the CAMR strategy may be found here)
AMR eager parser for Portuguese - a transition-based Abstract Meaning Representation (AMR) parser for the Portuguese language (more details about the eager strategy may be found here)
A paraphrase detection tool for Portuguese based on semantic features, as described here
SEMA - as described here, a metric for evaluation of Abstract Meaning Representation (AMR) structures
Machine learning-based methods for subjectivity classification of sentences in Portuguese - as cited here, machine learning approaches for classifying sentences as subjective (clearly expressing a sentiment/an opinion) or objective (expressing facts)
Lexicon-based method for subjectivity classification of sentences in Portuguese - as cited here, a lexicon-based approach for classifying sentences as subjective (clearly expressing a sentiment/an opinion) or objective (expressing facts)
Lexicon-based methods for aspect extraction and polarity classification - as described here, ontology and embedding-based methods customized for reviews in Brazilian Portuguese
OpCluster-PT - as described in the MSc Dissertation of Vargas (2017), a new computational method based on semantic relations and linguistic rules to automatically detect fine-grained opinions in User-Generated Content (UGC)
UGCNormal - as described in the MSc Dissertation of Avanço (2015), an automatic text normalizer for user-generated content in Brazilian Portuguese
Enelvo - as described in the MSc Dissertation of Bertaglia (2017), another automatic text normalizer for user-generated content in Brazilian Portuguese
LBC - as described here, a Lexicon-Based Classifier for sentiment analysis in Brazilian Portuguese
TOP(X) - as described here, a fuzzy-based system to estimate the degree of importance of user-generated comments
TOP(X) v2 - as described here, a new version of TOP(X) -- see above -- using artificial neural networks
OpinionC - as described in the MSc Dissertation of Avanço (2015), a full set of aspect-based opinion classifiers for Brazilian Portuguese, including lexicon, machine learning and hybrid-based techniques
CSTParser v2 - a new stand-alone Python-based version of CSTParser, a multi-document discourse parser for Brazilian Portuguese based on the Cross-document Structure Theory (CST) - click here for more details
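To make the idea behind the lexicon-based subjectivity classification listed above more concrete, the following is a minimal sketch only, not the project's actual method. It assumes that a sentence counts as subjective when it contains at least one word from a sentiment lexicon. The lexicon TOY_LEXICON and the functions tokenize and classify_subjectivity are hypothetical names introduced for illustration; a real system would load a full Portuguese sentiment lexicon and would likely handle negation, intensifiers, and multiword expressions.

```python
import re

# Hypothetical toy lexicon (word -> polarity); NOT one of the project's resources.
TOY_LEXICON = {
    "ótimo": 1,
    "excelente": 1,
    "bom": 1,
    "ruim": -1,
    "péssimo": -1,
    "horrível": -1,
}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def classify_subjectivity(sentence, lexicon=TOY_LEXICON):
    """Label a sentence 'subjective' if any token appears in the lexicon.

    Assumption for this sketch: the presence of a single sentiment word
    is enough to treat the sentence as opinionated.
    """
    for token in tokenize(sentence):
        if token in lexicon:
            return "subjective"
    return "objective"


if __name__ == "__main__":
    for text in ["O livro é ótimo", "O livro tem 300 páginas"]:
        print(text, "->", classify_subjectivity(text))
```

The decision rule is deliberately simple so that the role of the lexicon is visible; the machine learning-based methods listed above replace this rule with a trained classifier.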
Resources -- lexicons and ontologies
LIWC lexicon - as described here, a Brazilian Portuguese version of the lexicon in the Linguistic Inquiry and Word Count tool, which is a text analysis software program that calculates the degree to which people use different categories of words across a wide array of texts
Aspect ontologies (also available here) - as described here and in the MSc Dissertation of Vargas (2017), groups of (hierarchically organized) opinion aspects for supporting opinion mining tasks in Brazilian Portuguese, including the domains of smartphones, digital cameras and books, in OWL format
Verbo-Brasil - as described here, a PropBank-like repository for Brazilian Portuguese (there is also a web interface for consulting the data)
VerbNet.Br - as described here, a class-based verb lexicon for Brazilian Portuguese (you may access the search tool here or directly download the database and the gold standard file)
Resources -- corpora
AMR-annotated "The Little Prince" book (in Portuguese) - as described here, a manually annotated corpus, following the Abstract Meaning Representation (AMR) language
Fake.Br Corpus (also available here) - as described here, a corpus of aligned true and fake news in Brazilian Portuguese (some additional validation datasets are available here)
UTLCorpus - as described here, a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification
Subjectivity-annotated corpus on the electronic product domain - as cited here, a corpus of manually classified objective (neutral) and subjective (positive and negative) sentences from electronic product reviews collected from the web
Subjectivity-annotated corpus on the book domain - as cited here, a corpus of manually classified objective (neutral) and subjective (positive and negative) sentences from book reviews collected from ReLi corpus, Skoob and Amazon
Discourse analysis of a subjectivity-annotated corpus on the book domain - an RST analysis of the texts of the above corpus
Computer-BR - as described here (and kindly made available by the authors), a corpus of tweets manually classified for polarity (including irony/-2, negative/-1, neutral/0 and positive/1 categories)
Corpus 4P - as described here, a corpus of product reviews about 4 electronic Products (2 digital cameras and 2 smartphones), manually annotated with highly refined classifications for aspects and polarities, specially designed for contrastive summarization research
Aspect-annotated corpus - as described here and in the MSc Dissertation of Vargas (2017), a corpus of product reviews about digital cameras, smartphones and books, manually annotated with explicit and implicit aspects
OpiSums-PT - as described here and in the MSc Dissertation of López Condori (2015), a corpus of (extractive and abstractive) opinion summaries (170, in total) for reviews of books (13 reviews) and electronic products (4 reviews), written in Brazilian Portuguese
TweetSentBR - as described here, a corpus of 15,000 tweets in Brazilian Portuguese manually labeled according to their polarities (positive, neutral or negative)
Córpus Buscapé - as described here, a large (anonymized) corpus of product reviews in Portuguese crawled from the web
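As an illustration of the numeric polarity scheme mentioned for Computer-BR above (irony/-2, negative/-1, neutral/0 and positive/1), the sketch below maps the codes to named categories and counts how often each occurs. The record layout (pairs of text and integer code), the sample data, and the names Polarity and label_distribution are assumptions made for this example; the actual file format of the corpus is not described here.

```python
from collections import Counter
from enum import IntEnum


class Polarity(IntEnum):
    """Numeric polarity codes as listed for the Computer-BR corpus."""
    IRONY = -2
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


# Hypothetical (text, code) records, invented only to exercise the function.
sample = [
    ("texto 1", 1),
    ("texto 2", -2),
    ("texto 3", 0),
    ("texto 4", 1),
]


def label_distribution(records):
    """Count how many records fall into each named polarity category."""
    return Counter(Polarity(code).name.lower() for _, code in records)


distribution = label_distribution(sample)
```

Using an enumeration instead of bare integers makes an unexpected code fail loudly (Polarity raises ValueError for values outside the scheme), which can help when checking a downloaded copy of a corpus.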