Research

FEED PETs: Further Experimentation and Expansion on the Disambiguation of Potentially Euphemistic Terms 

Transformers have been shown to work well for the task of English euphemism disambiguation, in which a potentially euphemistic term (PET) is classified as euphemistic or non-euphemistic in a particular context. In this study, we expand on the task in two ways. First, we annotate PETs for vagueness, a linguistic property associated with euphemisms, and find that transformers are generally better at classifying vague PETs, suggesting linguistic differences in the data that impact performance. Second, we present novel euphemism corpora in three different languages: Yoruba, Spanish, and Mandarin Chinese. We perform euphemism disambiguation experiments in each language using multilingual transformer models mBERT and XLM-RoBERTa, establishing preliminary results from which to launch future work.

[paper, bib]


Iyanuoluwa Shode, David Ifeoluwa Adelani, Jing Peng, Anna Feldman

NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification 

We create a new dataset, NollySenti, based on Nollywood movie reviews in five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. Leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain and cross-lingual adaptation from English. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation (MT) from English to other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While MT into low-resource languages is often of low quality, we show through human evaluation that most of the translated sentences preserve the sentiment of the original English reviews.

[paper, bib]


Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue

AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

In this paper, we leverage the efficiency of active learning in the process of training multilingual large pre-trained models. We trained AfroLM from scratch on ~0.73GB of data from 23 African languages, a dataset more than 14x smaller than those used by the baselines mBERT, XLMR, and AfroXLMR-base. On MasakhaNER, AfroLM outperforms mBERT and XLMR-base, and is highly competitive with AfroXLMR-base. AfroLM was trained solely on news data. We performed out-of-domain (OOD), cross-domain experiments with sentiment analysis tasks in the Twitter and movie domains, where AfroLM also performs better, suggesting better adaptation and generalization.

[paper, bib]


Kenna Reagan, Aparna Varde, Lei Xie

Evolving Perceptions of Mental Health on Social Media and their Medical Impacts

The purpose of this research is to investigate the evolving perceptions of mental health by analyzing big data from social media over a seven-year period, with a particular focus on Twitter. By using topic modeling and sentiment analysis, we aim to understand public sentiments regarding mental health issues, emphasizing the importance of considering both polarity (emotion orientation) and subjectivity (fact vs. opinion). The research highlights how significant events, such as elections and the COVID-19 pandemic, influence mental health discussions and shows that while overall sentiment has been positive, it has declined since the pandemic began. The findings are intended to aid professionals in various fields, including data science, epidemiology, and psychology, by providing insights derived from social media data relevant to mental health trends and discussions.

[paper, bib]


Patrick Lee, Martha Gavidia, Anna Feldman, Jing Peng

Searching for PETs: Using Distributional and Sentiment-Based Methods to Find Potentially Euphemistic Terms

This paper presents a linguistically driven proof of concept for finding potentially euphemistic terms, or PETs. Acknowledging that PETs tend to be commonly used expressions for a certain range of sensitive topics, we make use of distributional similarities to select and filter phrase candidates from a sentence and rank them using a set of simple sentiment-based metrics. We present the results of our approach tested on a corpus of sentences containing euphemisms, demonstrating its efficacy for detecting single and multi-word PETs from a broad range of topics. We also discuss future potential for sentiment-based methods on this task.
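The ranking step mentioned in this abstract can be pictured with a minimal sketch. It is not the authors' implementation: the lexicon `TOY_NEGATIVITY`, its scores, and the function names `negativity` and `rank_candidates` are all hypothetical, and the single metric used here (mean word negativity) merely stands in for the paper's sentiment-based metrics.

```python
# Hypothetical toy lexicon: word -> negativity in [0, 1]. All values are invented.
TOY_NEGATIVITY = {
    "died": 0.9,
    "fired": 0.8,
    "passed": 0.1,
    "away": 0.05,
    "let": 0.05,
    "go": 0.05,
}


def negativity(phrase: str) -> float:
    """Mean toy negativity of the words in a phrase (unknown words count as 0)."""
    words = phrase.lower().split()
    if not words:
        return 0.0
    return sum(TOY_NEGATIVITY.get(w, 0.0) for w in words) / len(words)


def rank_candidates(candidates: list[str]) -> list[str]:
    """Order candidate phrases from least to most negative under the toy score."""
    return sorted(candidates, key=negativity)


candidates = ["died", "passed away", "fired", "let go"]
ranked = rank_candidates(candidates)
print(ranked)
```

Under this toy score, softer phrasings sort ahead of their blunt counterparts; a real system would replace the lexicon with a trained sentiment model and combine several metrics.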

[paper, bib]


Avery Field, Aparna Varde, Pankaj Lal

Sentiment Analysis and Topic Modeling for Public Perceptions of Air Travel: COVID Issues and Policy Amendments

Among many industries, air travel is impacted by the COVID pandemic. Airlines and airports rely on public sector information to enforce guidelines for ensuring health and safety of travelers. Such guidelines can be policy amendments or laws during the pandemic. In response to the inception of COVID preventive policies, travelers have exercised freedom of expression via the avenue of online reviews. This avenue facilitates voicing public concern while anonymizing / concealing user identity as needed. It is important to assess opinions on policy amendments to ensure transparency and openness, while also preserving confidentiality and ethics. Hence, this study leverages data science to analyze, with identity protection, the online reviews of airlines and airports since 2017, considering impacts of COVID issues and relevant policy amendments since 2020. Supervised learning with VADER sentiment analysis is deployed to predict changes in opinion from 2017 to date. Unsupervised learning with LDA topic modeling is employed to discover air travelers’ major areas of concern before and after the pandemic. This study reveals that COVID policies have worsened public perceptions of air travel and aroused notable new concerns, affecting economics, environment and health.

[paper, bib]


Brad McNamee, Aparna Varde, Simon Razniewski

Correlating Facts and Social Media Trends on Environmental Quantities Leveraging Commonsense Reasoning and Human Sentiments

As climate change alters the physical world we inhabit, opinions surrounding this hot-button issue continue to fluctuate. This is apparent on social media, particularly Twitter. In this paper, we explore concrete climate change data concerning the Air Quality Index (AQI), and its relationship to tweets. We incorporate commonsense connotations for appeal to the masses. Earlier work focuses primarily on the accuracy and performance of sentiment analysis tools and models, and is largely geared towards experts. We present commonsense interpretations of results, so that they are accessible to the masses. Moreover, our study uses real data on multiple environmental quantities comprising AQI. We address human sentiments gathered from linked data on hashtagged tweets with geolocations. Tweets are analyzed using VADER, subtly entailing commonsense reasoning. Interestingly, correlations between climate change tweets and air quality data vary not only based upon the year, but also the specific environmental quantity. It is hoped that this study will shed light on possible areas to increase awareness of climate change, and methods to address it, among scientists as well as the general public. In line with Linked Data initiatives, we aim to make this work openly accessible on a network, published under a Creative Commons license.
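As a rough illustration of a per-year correlation between tweet sentiment and an air quality measurement, here is a minimal sketch using the Pearson coefficient. It is an assumption that Pearson correlation is an appropriate stand-in, since the abstract does not name the statistic used. The `records` values are invented toy numbers, not data from the study, and the names `pearson` and `yearly_correlation` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def yearly_correlation(records):
    """Group (year, sentiment, quantity) rows by year and correlate within each year."""
    by_year = {}
    for year, sentiment, quantity in records:
        xs, ys = by_year.setdefault(year, ([], []))
        xs.append(sentiment)
        ys.append(quantity)
    return {year: pearson(xs, ys) for year, (xs, ys) in by_year.items()}


# Invented toy rows: (year, mean sentiment score, mean value of one environmental quantity)
records = [
    (2019, 0.10, 60.0), (2019, 0.20, 50.0), (2019, 0.30, 40.0),
    (2020, -0.10, 40.0), (2020, 0.00, 50.0), (2020, 0.10, 60.0),
]
result = yearly_correlation(records)
print(result)
```

Running the same grouping once per environmental quantity would show how the sign and strength of the relationship can differ by both year and quantity.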

[paper, bib]


Martha Gavidia, Patrick Lee, Anna Feldman, Jing Peng

CATs are Fuzzy PETs: A Corpus and Analysis of Potentially Euphemistic Terms

Euphemisms have not received much attention in natural language processing, despite being an important element of polite and figurative language. Euphemisms prove to be a difficult topic, not only because they are subject to language change, but also because humans may not agree on what is a euphemism and what is not. Nonetheless, the first step to tackling the issue is to collect and analyze examples of euphemisms. We present a corpus of potentially euphemistic terms (PETs) along with example texts from the GloWbE corpus. Additionally, we present a subcorpus of texts where these PETs are not being used euphemistically, which may be useful for future applications. We also discuss the results of multiple analyses run on the corpus. Firstly, we find that sentiment analysis on the euphemistic texts supports that PETs generally decrease negative and offensive sentiment. Secondly, we observe cases of disagreement in an annotation task, where humans are asked to label PETs as euphemistic or not in a subset of our corpus text examples. We attribute the disagreement to a variety of potential reasons, including if the PET was a commonly accepted term (CAT).

[paper, bib]


Iyanuoluwa Shode, David Ifeoluwa Adelani, Anna Feldman

YOSM: A New Yoruba Sentiment Corpus For Movie Reviews

A movie that is thoroughly enjoyed and recommended by one individual might be hated by another. One characteristic of humans is the ability to have feelings, which can be positive or negative. To automatically classify and study human feelings, sentiment analysis and opinion mining, an aspect of natural language processing, were designed to understand human feelings regarding issues that could affect a product, a social media platform, government, societal discussions, or even movies. Several works on sentiment analysis have been done on high-resource languages, while low-resource languages like Yoruba have been sidelined. Due to the scarcity of datasets and linguistic architectures suited to low-resource languages, African languages have been ignored and not fully explored. For this reason, we focus on Yoruba to explore sentiment analysis on reviews of Nigerian movies. The data comprises 1500 movie reviews sourced from IMDB, Rotten Tomatoes, Letterboxd, Cinemapointer and Nollyrated. We develop sentiment classification models using state-of-the-art pre-trained language models like mBERT and AfriBERTa to classify the movie reviews.

[paper, bib]


Levi Corallo, Guanghui Li, Kenna Reagan, Abhishek Saxena, Brandon Wilde, Aparna S. Varde

A Framework for German-English Machine Translation with GRU RNN

Machine translation (MT) using Gated Recurrent Units (GRUs) is a popular approach in industry-level web translators because of the efficiency with which GRUs handle sequential data compared to Long Short-Term Memory (LSTM) in language modeling with smaller datasets. Motivated by this, a deep learning GRU-based Recurrent Neural Network (RNN) is modeled as a framework in this paper, utilizing WMT2021’s English-German dataset, which originally contains 400,000 strings from German news with parallel English translations. Our framework serves as a pilot approach in translating strings from German news media into English sentences, to build applications and pave the way for further work in the area. In real-life scenarios, this framework can be useful in developing mobile applications (apps) for quick translation where efficiency is crucial. Furthermore, our work has a broader impact on the UN SDG (United Nations Sustainable Development Goal) of Quality Education, since offering education remotely by leveraging technology, as well as seeking equitable solutions and universal access, are significant objectives there. Our framework for German-English translation can be adapted to other similar language translation tasks.

[paper, bib]


Azza Abugharsa

Sentiment Analysis in Poems in Misurata Sub-dialect

Over recent decades, there has been a significant increase in and development of resources for Arabic natural language processing. This includes the task of exploring Arabic Language Sentiment Analysis (ALSA) from Arabic utterances in both Modern Standard Arabic (MSA) and different Arabic dialects. This study focuses on detecting sentiment in poems written in the Misurata Arabic sub-dialect spoken in Misurata, Libya. The tools used to detect sentiment from the dataset are Sklearn and the Mazajak sentiment tool. Logistic Regression, Random Forest, Naive Bayes (NB), and Support Vector Machines (SVM) classifiers are used with Sklearn, while the Convolutional Neural Network (CNN) is implemented with Mazajak. The results show that the traditional classifiers achieve higher accuracy than Mazajak, which is built on an algorithm that includes deep learning techniques. More research is suggested to analyze Arabic sub-dialect poetry in order to investigate the aspects that contribute to sentiments in these multi-line texts; for example, the use of figurative language such as metaphors.

[paper, bib]


Chris Leberknight, Anna Feldman, Carlos Martinez, Kei Yin Ng (past member)

A Linguistically-Informed Approach for Measuring and Circumventing Internet Censorship

Internet censorship consists of restrictions on what information can be publicized or viewed on the Internet. According to Freedom House's annual Freedom on the Net report, more than half the world's Internet users now live in a place where the Internet is censored or restricted. However, members of the Internet Freedom community lack comprehensive real-time awareness of where and how censorship is being imposed. The challenges to achieving such a solution include but are not limited to coverage, scalability, adoption, and safety. The project explores a linguistically-informed approach for measuring and circumventing Internet censorship.

[paper]


Anna Feldman, Jing Peng

Automatic Detection of Idioms

The main goal of this research project is to develop a method for automatic idiom recognition. Idiomatic expressions, such as 'go the whole nine yards' or 'piece of cake', are plentiful in everyday language. Some potentially idiomatic expressions can also appear in a literal sense (e.g., we can actually eat a piece of cake!). The goal of this project is therefore to develop an algorithm that recognizes when such expressions are used idiomatically. We have been working on several languages, including English, Russian, and Chinese.

[paper, bib]


Martina Ducret, Lauren Kruse, Carlos Martinez, Anna Feldman, Jing Peng

Automatic Detection of Sarcasm

We explore linguistic features that contribute to sarcasm detection. The features we investigate combine text and word complexity with stylistic and psychological features.

[paper, bib]


Brikena Liko, Anna Feldman, Jing Peng 

Resource-Light Morphosyntactic Tagging for Morphologically Rich Languages

The main goal of this project is to develop a tagging method that neither relies on target-language training data nor requires bilingual dictionaries or parallel corpora. The main assumption is that a model for the target language can be approximated by language models from one or more related source languages.

[paper, bib]


Sara Cantor, Anna Feldman, Jing Peng

Generating Clues for Crossword Puzzles

[paper, bib]


Zach Dau, Anna Feldman, Jing Peng

Computational Analysis of the Coronavirus Pandemic; Response of Tri-State Area Politicians on Twitter

[paper, bib]




Urban Policy and Commonsense Knowledge for Smart Cities: Text Mining of Ordinances and Tweets

Common Sense in Implicit Requirements Identification within SRS Documents

Terminology Evolution in Information Retrieval

Articles, Collocations and Prepositions in L2 English text