Research

Welcome to the Montclair State University NLP Lab's Research Page!

You can find our recent publications and their BibTeX files on this page.

Clicking [paper] will direct you to the published paper.

Clicking [bib] will automatically download the BibTeX file for the paper.

Hasan Can Biyik, Patrick Lee, Anna Feldman. 2024.

Turkish Delights: a Dataset on Turkish Euphemisms

This research extends NLP work on potentially euphemistic terms (PETs) to Turkish, introducing the first Turkish PET dataset with both euphemistic and non-euphemistic examples. We describe the dataset and our methodology for listing Turkish euphemisms, collecting example contexts, and annotating them. We also experiment with transformer-based models for Turkish euphemism detection, evaluating them with F1, accuracy, and precision.
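The evaluation metrics named above (accuracy, precision, F1) can be computed directly from binary predictions. A minimal sketch in plain Python; the gold labels and predictions below are invented for illustration, not from the dataset:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = euphemistic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example: 4 gold labels vs. 4 model predictions
acc, prec, rec, f1 = binary_metrics([1, 0, 1, 1], [1, 0, 0, 1])
```

F1 is the harmonic mean of precision and recall, so a detector cannot score well by over- or under-predicting the euphemistic class.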

Proceedings of the First SIGTURK Workshop, co-located with ACL 2024

[paper, bib]


Patrick Lee, Anna Feldman. 2024.

Report on the Multilingual Euphemism Detection Task

This paper introduces the Multilingual Euphemism Detection Shared Task at FigLang 2024, part of NAACL 2024. The task involved detecting euphemisms in texts from American English, Spanish, Yorùbá, and Mandarin Chinese. We describe the expanded datasets, summarize the methods and findings of participating teams, and discuss implications for future research.

Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024)

[paper, bib]


Patrick Lee, Alain Chirino Trujillo, Diana Cuevas Plancarte, Olumide Ebenezer Ojo, Xinyi Liu, Iyanuoluwa Shode, Yuan Zhao, Jing Peng, Anna Feldman. 2024.

MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms

This study explores how euphemisms are processed computationally across languages. We train the multilingual transformer model XLM-RoBERTa to identify potentially euphemistic terms (PETs) in both multilingual and cross-lingual contexts. Our findings show that zero-shot learning occurs and that multilingual models often outperform monolingual ones, highlighting the benefits of multilingual data for understanding euphemisms. We also investigate whether cross-lingual data within the same domain is more valuable than within-language data from other domains.

Findings of the Association for Computational Linguistics: EACL 2024

[paper, bib]


FEED PETs: Further Experimentation and Expansion on the Disambiguation of Potentially Euphemistic Terms 

Transformers are effective for classifying English potentially euphemistic terms (PETs) as euphemistic or non-euphemistic. We expand this task by annotating PETs for vagueness, finding transformers perform better on vague PETs, indicating linguistic differences impact performance. We also introduce euphemism corpora in Yoruba, Spanish, and Mandarin Chinese, using multilingual models mBERT and XLM-RoBERTa for experiments, providing preliminary results for future research.

Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

[paper, bib]


Iyanuoluwa Shode, David Ifeoluwa Adelani, Jing Peng, Anna Feldman. 2023.

NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification 

We create NollySenti, a Nollywood movie review dataset for five Nigerian languages (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba). Using classical machine learning and pre-trained language models, we evaluate cross-domain adaptation from Twitter and cross-lingual adaptation from English. Results show that English transfer in the same domain improves accuracy by over 5% compared to Twitter transfer in the same language. Using machine translation (MT) from English to Nigerian languages further improves accuracy by 7%. Despite low-quality MT for low-resource languages, human evaluation confirms that most translated sentences retain the original sentiment.

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

[paper, bib]


Libby Barak, Zara Harmon, Naomi H. Feldman, Jan Edwards, Patrick Shafto. 2023.

When Children's Production Deviates From Observed Input: Modeling the Variable Production of the English Past Tense

Children often produce verb forms incorrectly as they learn grammatical rules, such as using bare verbs when past tense is required. This study uses computational modeling to replicate this early stage of rule acquisition in English. Our model shows that these errors arise from a tension between trying to use less frequent forms (past tense) and overusing frequent forms (bare verbs). The model progresses through similar stages as children, eventually mastering past tense, illustrating how these stages can be explained by a single learning mechanism.

Cognitive Science - A Multidisciplinary Journal

[paper, bib]


Levi Corallo, Aparna S Varde. 2023.

Optical Character Recognition and Transcription of Berber Signs from Images in a Low-Resource Language Amazigh

The Berber (Amazigh) language, spoken by 14 million people in North Africa, lacks resources and representation in education and technology, including Google Translate. We propose DaToBS, a supervised method for detecting and transcribing Berber's Tifinagh alphabet from photos. Using a corpus of 1862 annotated character images and CNN-based computer vision, DaToBS achieves over 92% accuracy. This work is among the first to automate Tifinagh transcription using deep learning.

AI4EDU Workshop at AAAI-2023, the 37th AAAI Conference on Artificial Intelligence

[paper, bib]


Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue. 2022.

AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

In this paper, we leverage the efficiency of active learning when training multilingual pre-trained language models. We trained AfroLM from scratch on ~0.73GB of data from 23 African languages, 14x+ smaller than the data used by baselines such as mBERT, XLM-R, and AfroXLMR-base. On MasakhaNER, AfroLM outperforms mBERT and XLM-R-base, and is highly competitive with AfroXLMR-base. Although AfroLM was trained solely on news data, in OOD/cross-domain experiments on sentiment analysis in the Twitter and movie domains it also performs better, suggesting stronger adaptation and generalization.

Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)

[paper, bib]


Kenna Reagan, Aparna Varde, Lei Xie. 2022.

Evolving Perceptions of Mental Health on Social Media and their Medical Impacts

This research investigates mental health perceptions by analyzing seven years of Twitter data using topic modeling and sentiment analysis. We focus on polarity and subjectivity to understand public sentiments. Significant events like elections and the COVID-19 pandemic have influenced discussions, with a decline in positive sentiment since the pandemic. The findings provide insights for professionals in data science, epidemiology, and psychology on mental health trends from social media data.

2022 IEEE International Conference on Big Data (Big Data)

[paper, bib]


Patrick Lee, Martha Gavidia, Anna Feldman, Jing Peng. 2022.

Searching for PETs: Using Distributional and Sentiment-Based Methods to Find Potentially Euphemistic Terms

This paper introduces a method for identifying potentially euphemistic terms (PETs) using linguistic principles. By leveraging distributional similarities and sentiment-based metrics, we filter and rank phrase candidates from sentences. Our approach, tested on a corpus of euphemisms, effectively detects PETs across various topics and suggests future applications for sentiment-based methods in this area.
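The ranking idea described above (distributional similarity to known euphemisms, combined with a sentiment signal) can be illustrated with a toy scorer. The embeddings, sentiment scores, and the particular discounting rule below are invented for illustration, not the paper's actual features:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(candidates, ref_vec):
    """Score each candidate phrase by similarity to a reference euphemism vector,
    discounted by the magnitude of its sentiment (euphemisms tend to be mild)."""
    scored = [(phrase, cosine(vec, ref_vec) * (1 - abs(sentiment)))
              for phrase, vec, sentiment in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates: (phrase, 2-d embedding, sentiment score in [-1, 1])
candidates = [
    ("passed away", [0.9, 0.1], -0.1),
    ("dropped dead", [0.8, 0.2], -0.8),
    ("bought a car", [0.1, 0.9], 0.0),
]
ranking = rank_candidates(candidates, ref_vec=[1.0, 0.0])
```

In this toy setup, "passed away" ranks first: it is both distributionally close to the reference and nearly neutral in sentiment, while the blunt "dropped dead" is penalized for its strong negativity.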

Proceedings of the Second Workshop on Understanding Implicit and Underspecified Language

[paper, bib]


Avery Field, Aparna Varde, Pankaj Lal. 2022.

Sentiment Analysis and Topic Modeling for Public Perceptions of Air Travel: COVID Issues and Policy Amendments

Air travel has been impacted by the COVID pandemic, with airlines and airports using public sector information to enforce health guidelines. Travelers have expressed opinions on these policies through online reviews. This study uses data science to analyze airline and airport reviews since 2017, focusing on COVID-related impacts and policy changes from 2020. VADER sentiment analysis predicts opinion changes, while LDA topic modeling identifies major concerns. Findings reveal that COVID policies have worsened public perceptions of air travel, raising new concerns about economics, environment, and health.

The Legal and Ethical Issues Workshop @LREC2022

[paper, bib]


Brad McNamee, Aparna Varde, Simon Razniewski. 2022.

Correlating Facts and Social Media Trends on Environmental Quantities Leveraging Commonsense Reasoning and Human Sentiments

Climate change opinions fluctuate, evident on social media like Twitter. This paper explores the relationship between Air Quality Index (AQI) data and climate change tweets. We focus on commonsense interpretations for broader appeal, using real AQI data and VADER for sentiment analysis. We find that correlations between climate tweets and air quality vary by year and environmental factors. Our goal is to increase climate change awareness and provide methods for addressing it, making the study openly accessible under a Creative Commons license.

Proceedings of the 2nd Workshop on Sentiment Analysis and Linguistic Linked Data

[paper, bib]


Martha Gavidia, Patrick Lee, Anna Feldman, Jing Peng. 2022.

CATs are Fuzzy PETs: A Corpus and Analysis of Potentially Euphemistic Terms

Euphemisms are often overlooked in natural language processing, despite their role in polite and figurative speech. They pose challenges due to their evolving nature and varying interpretations. To address this, we present a corpus of potentially euphemistic terms (PETs) with examples from the GloWbE corpus, as well as a subcorpus of non-euphemistic uses. Our analyses show that PETs generally reduce negative sentiment, but there is some disagreement in annotating these terms as euphemistic or not, which may be influenced by whether a term is widely accepted.

Proceedings of the Thirteenth Language Resources and Evaluation Conference

[paper, bib]


Iyanuoluwa Shode, David Ifeoluwa Adelani, Anna Feldman. 2022.

YOSM: A New Yoruba Sentiment Corpus For Movie Reviews

Opinions on movies can vary widely, reflecting the complex nature of human emotions. Sentiment analysis, a branch of natural language processing, helps understand these emotions across different contexts, including product reviews and social media. While much research has focused on high-resource languages, low-resource languages like Yoruba have been underexplored. To address this gap, we analyze sentiment in 1500 Yoruba movie reviews from sources like IMDB and Rotten Tomatoes. We use advanced models such as mBERT and AfriBERTa to classify these reviews.

AfricaNLP Workshop @ICLR 2022

[paper, bib]


Levi Corallo, Guanghui Li, Kenna Reagan, Abhishek Saxena, Brandon Wilde, Aparna S. Varde. 2022.

A Framework for German-English Machine Translation with GRU RNN

Machine translation (MT) with Gated Recurrent Units (GRUs) is efficient for handling sequential data compared to Long Short-Term Memory (LSTM) models, especially with smaller datasets. This paper presents a GRU-based Recurrent Neural Network (RNN) using WMT2021’s English-German dataset to translate German news into English. Our framework aims to improve translation efficiency for applications and supports the UN’s Quality Education goal by enhancing remote education and equitable access. It can also be adapted for other language translation tasks.
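The GRU cell behind the model above maintains a hidden state governed by update and reset gates, which is what makes it cheaper than an LSTM (two gates instead of three, no separate cell state). A minimal single-step sketch in NumPy; the weights are random placeholders, not trained translation parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step: gates decide how much of the old hidden state to keep."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde          # interpolate old and new state

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
# Six weight matrices: input-to-hidden (d_in x d_hid) and
# hidden-to-hidden (d_hid x d_hid) for each of z, r, and the candidate.
params = tuple(rng.normal(size=s)
               for s in [(d_in, d_hid), (d_hid, d_hid)] * 3)
h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):  # run a toy 5-step input sequence
    h = gru_step(x, h, params)
```

Because the new state is a convex combination of the old state and a tanh-bounded candidate, the hidden activations stay in (-1, 1), one reason GRUs train stably on modest data.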

CEUR Workshop Proceedings (CEUR-WS.org)

[paper, bib]


Azza Abugharsa. 2021.

Sentiment Analysis in Poems in Misurata Sub-dialect

Recent advancements in Arabic natural language processing have improved sentiment analysis for both Modern Standard Arabic (MSA) and various dialects. This study examines sentiment in Misurata Arabic poetry from Libya, using Sklearn with classifiers like Logistic Regression and SVM, and Mazajak's CNN tool. Results indicate traditional classifiers outperform Mazajak's deep learning approach. Further research is needed to explore sentiment in Arabic sub-dialect poetry, particularly regarding figurative language use.

International Journal of Computer and Technology Vol 21 (2021)

[paper, bib]

Zach Dau, Anna Feldman, Jing Peng. 2021.

Computational Analysis of the Coronavirus Pandemic: Response of Tri-State Area Politicians on Twitter

The COVID-19 pandemic has significantly changed life worldwide. In the U.S., nearly 10% of cases are in New York, New Jersey, and Connecticut. We analyzed tweets from prominent politicians in this area over 20 months, including before and during the pandemic. Our study found a significant increase in Twitter activity and used LDA and LSA models to observe topic changes. We also analyzed sentiment and lexical shifts, noting a trend towards more neutral tweet sentiment as politicians increased their engagement with constituents during the pandemic.

EasyChair Preprint no. 5984

[paper, bib]


Martina Ducret, Lauren Kruse, Carlos Martinez, Anna Feldman, Jing Peng. 2020.

You Don’t Say… Linguistic Features in Sarcasm Detection

We explore linguistic features contributing to sarcasm detection, focusing on text and word complexity, as well as stylistic and psychological features. Our experiments with sarcastic tweets, both with and without context, reveal that contextual information is crucial for sarcasm prediction. Notably, sarcastic tweets often show sentiment or emotional incongruence with their context.
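The sentiment-incongruence cue described above can be illustrated with a toy check that compares a tweet's polarity against that of its context. The lexicon, scoring rule, and threshold are all invented for illustration, not the paper's actual features:

```python
# Tiny hand-made polarity lexicon (illustrative only)
LEXICON = {"love": 1.0, "great": 1.0, "wonderful": 1.0,
           "hate": -1.0, "terrible": -1.0, "delayed": -0.5}

def polarity(text):
    """Average polarity of the known words in a text; 0.0 if none are known."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def incongruent(tweet, context, threshold=1.0):
    """Flag a tweet whose sentiment clashes strongly with its context's sentiment."""
    return abs(polarity(tweet) - polarity(context)) >= threshold

# A positive-sounding tweet against a negative context trips the detector
flag = incongruent("i love waiting great",
                   "my flight was delayed for terrible hours")
```

A real detector would use learned sentiment and emotion features rather than a word list, but the core signal is the same: sarcasm often surfaces as a polarity mismatch between utterance and context.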

Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

[paper, bib]