NaijaNLP: Sentiment Lexicon & Hate Speech
Abstract
Sentiment analysis is a novel field of research in Natural Language Processing (NLP) that deals with the identification and classification of people’s opinions and sentiments about products and services contained in a piece of text, usually in web data. While there are various language resources for sentiment analysis, most of them are for English, Chinese, and European languages. Hausa, Yoruba and Igbo are the most widely spoken languages in Nigeria, with over 150 million speakers in the country alone. These languages are also widely used in other African countries. However, despite the huge amount of data generated in these languages through social media, they still lack language resources for sentiment analysis. Consequently, in this research, we seek to develop a corpus, a sentiment lexicon, and a hate speech lexicon for the Hausa, Yoruba and Igbo languages.
Proposed Dataset and Use Cases:
Contemporary web technology allows people to generate an unlimited amount of data online. People share their opinions about facts, events or things, mostly in the form of writing, images or videos. These opinions can take the form of comments in a blog, debates and arguments in discussion forums, or status updates on social networking channels. Web 2.0 contributed immensely to this development. It provides features that enable users to actively interact with and contribute to web content rather than merely reading it. These features make blogs, Facebook, Twitter and other social networking platforms possible. The user-generated content available on the web is a huge source of data that contains information about virtually everything worth discussing. Researchers attempt to use these data to draw conclusions about people’s opinions or sentiments about issues and events, which has led to a new field of research called Sentiment Analysis or Opinion Mining.
Sentiment analysis can therefore be defined as a novel field of research in Natural Language Processing (NLP) that deals with the identification and classification of people’s opinions and sentiments about products and services contained in a piece of text, usually in web data (Medhat, Hassan, and Korashy 2014). Research on sentiment analysis began over a decade ago. Early work includes Dave, Lawrence, and Pennock (2003), Nasukawa and Yi (2003), and Pang, Lee, and Vaithyanathan (2002). Research areas in sentiment analysis include subjectivity detection, sentiment prediction, aspect-based sentiment summarization, contrastive viewpoint summarization, text summarization for opinions, and predicting the helpfulness of online comments and reviews.
However, social media has its downsides. One of these is the freedom it gives to publish content that is abusive and harmful both to the principles of democracy and to the rights of some groups of people, namely hate speech (henceforth, HS). HS can be defined as any expression “that is abusive, insulting, intimidating, harassing, and/or incites to violence, hatred, or discrimination. It is directed against people on the basis of their race, ethnic origin, religion, gender, age, physical condition, disability, sexual orientation, political conviction, and so forth” (Erjavec and Kovačič, 2012). Although definitions of and approaches to HS vary and depend on the juridical tradition of the country, many agree that what is identified as HS cannot fall under the protection granted by the right to freedom of expression, and must be prohibited. Online platforms like Twitter or YouTube discourage hateful content, but its removal relies mainly on users' reports and lacks systematic control. In this regard, a promising direction of research is the training of automated classifiers based on manually annotated corpora.
Despite these developments in sentiment analysis and social media analysis, to the best of our knowledge, current sentiment analysis resources such as sentiment corpora and sentiment lexicons are available only for high-resource languages. Therefore, we propose to build the following resources for Nigerian languages:
Sentiment Corpus: A labelled and unlabelled sentiment corpus for training and evaluation in machine learning tasks (Reference)
Sentiment Lexicon: A sentiment dictionary used for sentiment analysis (Reference)
Hate Speech Lexicon/Corpus: (-3 + 3). Extreme sentiment opinions have been used to create a hate speech lexicon using subjectivity and semantic features related to hate speech, as reported in (A Lexicon-based Approach for Hate Speech Detection)
Sentiment lexicons are mostly developed manually. This is done by compiling a list of words that convey a sentiment from an existing corpus. Each word is then assigned, manually or automatically, a polarity value (positive, negative or neutral) and one or more polarity scores that identify its sentiment orientation. A corpus, on the other hand, is developed from publicly available social media data. Therefore, for this research, our methodology for the three resources is as follows:
Sentiment Corpus: The sentiment corpus will be generated from tweets of opinions concerning major news headlines on Twitter using an existing Python crawler developed by one of the researchers (Bello). Around ten thousand tweets will be extracted per language via the Twitter API.
Sentiment Lexicon: The sentiment lexicon for the three languages will be developed by manual annotation of the sentiment corpus. The annotation tool IO Annotator (https://app.ioannotator.com/datasets) will be used for the annotation. Three native-speaker annotators for each language will be hired and trained to perform the annotation. Additionally, the widely used English lexicons (Bing Liu, AFINN, and NRC) will be translated into Hausa, Yoruba, and Igbo by professional translators.
Hate Speech Lexicon: Extreme negative sentiment from the sentiment lexicon will be used to develop the hate speech lexicon as shown in [].
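The step from sentiment lexicon to hate speech lexicon can be sketched as a simple threshold filter. The sketch below is illustrative only: the function name, the cut-off of -4 and the placeholder entries are assumptions, not part of the proposal, and the scores are assumed to lie on a -5 to +5 scale. Candidates selected this way would still need manual review, since a strongly negative word is not necessarily hateful.

```python
def extreme_negative_entries(lexicon, threshold=-4):
    """Return the lexicon entries whose score is at or below `threshold`.

    `lexicon` maps a word to its sentiment score (assumed scale: -5 to +5).
    The default threshold of -4 is an illustrative assumption.
    """
    return {word: score for word, score in lexicon.items() if score <= threshold}


# Placeholder entries; real entries would come from the annotated lexicon.
sample_lexicon = {"word_a": -5, "word_b": -2, "word_c": 3, "word_d": -4}
hate_speech_candidates = extreme_negative_entries(sample_lexicon)
print(hate_speech_candidates)
```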
Use cases:
Sentiment Analysis: The labelled and unlabelled corpus can be used to train and evaluate sentiment analysis models.
The unlabelled Twitter dataset can be used for other NLP tasks such as part-of-speech (POS) tagging and named entity recognition (NER).
Detecting and combating hate speech. Reference (Combining Linguistic Features to Identify Hate Speech Against Immigrants and Women on Multilingual).
Business: Businesses can employ sentiment analysis to gain insight and support better decision making.
Government: Sentiment analysis helps government make decisions based on citizens' needs. People use social media to voice their concerns and needs in their own local languages. Government can therefore use sentiment analysis to inform decisions that directly affect citizens.
Specifications and Deliverables for Proposed Data and Documentation
Dataset quantity, types, and format:
The project aims to develop a sentiment corpus large enough for machine learning and other natural language processing tasks. Our target is to generate at least 6,000 lexicon entries for each language. These will consist of at least 2,000 entries each for the positive, negative and neutral classes, in CSV format, which is one of the most widely used data formats.
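As an illustration of what the CSV deliverable might look like, the following minimal sketch parses a small lexicon file and counts the entries per polarity class, which is one way the per-class target could be checked. The column names (word, polarity, score), the placeholder rows and the function names are assumptions made for this example; the proposal does not fix a schema.

```python
import csv
import io
from collections import Counter

# Hypothetical column names and placeholder rows; the real schema may differ.
SAMPLE_CSV = """word,polarity,score
word_a,positive,3
word_b,negative,-4
word_c,neutral,0
word_d,negative,-2
"""

VALID_POLARITIES = {"positive", "negative", "neutral"}


def load_lexicon(csv_text):
    """Parse lexicon rows, validating the polarity label and score of each."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        polarity = row["polarity"].strip().lower()
        if polarity not in VALID_POLARITIES:
            raise ValueError(f"unknown polarity: {row['polarity']!r}")
        entries.append(
            {"word": row["word"].strip(), "polarity": polarity, "score": int(row["score"])}
        )
    return entries


def polarity_counts(entries):
    """Count how many entries fall into each polarity class."""
    return Counter(entry["polarity"] for entry in entries)


entries = load_lexicon(SAMPLE_CSV)
counts = polarity_counts(entries)
print(counts)
```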
Data collection:
Social media is one of the most widely used platforms for communication. The data will be collected from Twitter, one of the most popular social media platforms. Twitter provides public access to its data via the Twitter API, and the project will use this API to build a large corpus for each language. One of the participants in this project has already created a Python-based Twitter crawler which has been used in various projects (Reference). We will use this tool for the collection.
Annotation process:
The dataset for each language will be prepared and submitted for annotation. To mitigate errors and bias, each dataset will be annotated by three different annotators, after which the project team will compute the kappa agreement between the annotators. We plan to use a web-based annotation tool, brat (Stenetorp et al., 2012), which many researchers have found efficient for this type of task (Pont). The annotators must be native speakers of the language and must follow the annotation guidelines provided by the project team. The annotation task consists of labelling each tweet as positive, negative or neutral, identifying the sentiment-bearing words in each tweet, and assigning each of them a sentiment score between -5 and +5.
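The agreement computation can be sketched as follows. The proposal does not say which kappa statistic will be used; Fleiss' kappa is assumed here because it handles more than two annotators, and the function name and sample labels are illustrative.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa for a list of items.

    `ratings` is a list with one inner list per tweet, holding the label
    given by each annotator. Every tweet must have the same number of labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    observed_sum = 0.0
    for labels in ratings:
        if len(labels) != n_raters:
            raise ValueError("every item needs the same number of ratings")
        counts = Counter(labels)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        observed_sum += agreeing / (n_raters * (n_raters - 1))
    observed = observed_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivial
    return (observed - expected) / (1 - expected)


# Three annotators labelling two placeholder tweets.
sample = [
    ["positive", "positive", "negative"],
    ["negative", "negative", "negative"],
]
print(round(fleiss_kappa(sample), 2))  # 0.25
```

Values close to 1 indicate strong agreement, while values near 0 indicate agreement no better than chance.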
Deliverables:
The main deliverables of this project are: Sentiment Corpus, Sentiment Lexicon and Hate Speech Lexicon. Each deliverable will be accompanied by proper documentation.
Pathway(s) to Impact and Intended Beneficiaries
The datasets will be of interest to researchers, businesses, media houses, government and security agencies. The datasets will facilitate understanding of the discussions and opinions of local people, which will help greatly in achieving the 17 Sustainable Development Goals (SDGs) to transform our world. The sentiment lexicons will serve as a tool for opinion mining from social media, giving the government immediate feedback from the public about poverty, hunger, education and other SDG topics (footnote). In addition, people now use social media such as Twitter to express their opinions, positive or negative, on the products they buy. Businesses can therefore leverage sentiment analysis to gather feedback on their products and make informed decisions.
Accessibility, Data Management, and Licensing
All corpora and lexicons created in this project will be made widely accessible and available to the public via GitHub, the project website and other public data repositories under a Creative Commons license (CC BY 4.0), which allows researchers and other users to use, share and build upon our work. The project team will manage the data and provide proper and complete documentation on how to use it.
Risks, Including Ethics and Privacy
The corpus and lexicons created in this project will be generated from social media texts collected through the public Twitter API. Although the data is already public under the Twitter developer license agreement, the project team plans to apply additional pre-processing at the point of collection to remove all user mentions from the tweets. This is to protect the privacy of the users associated with the tweets, so that no sensitive or personal information can be derived from the data.
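A minimal sketch of this pre-processing step is shown below. It replaces each @mention with a generic placeholder token; the function name, the regular expression and the "@user" placeholder are assumptions for illustration, and the project's actual cleaning rules may differ or cover more than mentions.

```python
import re

# An "@" followed by word characters, not preceded by a word character,
# so that e-mail addresses such as name@example.com are left untouched.
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def strip_mentions(text, placeholder="@user"):
    """Replace every @mention in `text` with `placeholder`."""
    return MENTION_PATTERN.sub(placeholder, text)


print(strip_mentions("@ade this is great, thanks @bola_1!"))
# @user this is great, thanks @user!
```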
Sustainability Plan
As part of the plan to maintain the proposed corpus, the following itemized plans will ensure the sustainability of all the datasets:
The research will be carried out by the BUKNLP research group, whose academics, PhD researchers, master's and undergraduate students will continue to improve and contribute to the datasets after the lifetime of the grant.
Open source: Since the proposed datasets will be open source and hosted on GitHub and the project website, researchers and practitioners across the length and breadth of Africa will be able to contribute towards maintaining and expanding them. We will create a project website where other people can find existing lexicons and submit requests to update them or to add new corpora that are not part of this project.
We plan to integrate all the created corpora into an R package called NaijaLex, where they will be readily available for use in related machine learning tasks. An example of such a package is the lexicon package for English, available on the CRAN website.