Isabelle Ngassa
Table of Contents
Phase I:
Introduction
Data Preprocessing
Phase II: Exploratory Data Analysis (EDA)
Phase III: Reports & Conclusion
Image courtesy of GeeksforGeeks. [1]
Introduction
Image courtesy of Adobe Stock [4]
Scientific discoveries and findings can have a big impact on people and society. This is why, before publication, they are subjected to a quality-control process called 'peer review'.
Peer review means subjecting an author's work and research to the examination of other specialists in the same field to check its legitimacy and assess its suitability for publication. Peer review helps publishers decide whether a work ought to be accepted. [2]
So far, sentiment analysis has seen little use on scientific papers.
The authors of the dataset used here even state that they are the first to apply sentiment analysis to scientific peer reviews. [3]
What is sentiment analysis?
Sentiment analysis or opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. [5]
Most often, the term “sentiment analysis” is used to refer to the task of automatically classifying text according to the expressed polarity (positive, negative, or neutral). However, it actually covers a larger number of tasks relating to detecting the general attitude of the author of the text towards a particular target (scientific papers in this particular case).
Project Goals
The goal of this analysis is to apply sentiment analysis tools to the dataset to determine the sentiment of each review and compare it with the assessments made by the readers and the reviewers of the papers. I will also evaluate the performance of each method used.
Implementations
Previous works on this dataset used traditional machine learning methods such as Naïve Bayes, Logistic Regression, and Support Vector Machines (SVM), which suit small datasets well because of their low complexity.
For this study, I will first use these previous approaches. Afterwards, I plan to use the Python Keras deep learning library (though the size of the dataset may be a problem) and a supervised learning algorithm (a neural network), which seems to work well even on small datasets.
For each method used, I will focus on accuracy, since it gives the proportion of correct predictions (Accuracy = number of correct predictions / total predictions).
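For instance, with scikit-learn's accuracy_score (illustrative labels only):

```python
# Accuracy = correct predictions / total predictions.
from sklearn.metrics import accuracy_score

y_true = ["accept", "reject", "accept", "accept"]  # ground-truth decisions
y_pred = ["accept", "accept", "accept", "reject"]  # model predictions
print(accuracy_score(y_true, y_pred))  # 2 correct out of 4 -> 0.5
```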
Dataset
The dataset is a nested JSON file from the UCI Machine Learning Repository.
It contains paper reviews sent to an international conference on computing and informatics, mostly in Spanish (some are in English). It has a total of 179 papers and 405 reviews.
The structure is made up of a list of papers, called “paper”.
Each paper is made up of an “id”, “preliminary_decision” and a list of reviews called “review”.
Each review has “confidence”, “evaluation”, “id”, “lan”, “orientation”, “remarks”, “text”, “timespan”.
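As a sketch of loading it (the filename is an assumption; adjust it to your local copy of the UCI file), pandas can flatten the nested structure into one row per review:

```python
import json
import pandas as pd

# Load the raw JSON (filename assumed).
with open("reviews.json", encoding="utf-8") as f:
    data = json.load(f)

# One row per review; keep the paper-level "id" and
# "preliminary_decision", and prefix review fields to avoid
# the "id" name clash between papers and reviews.
df = pd.json_normalize(
    data["paper"],
    record_path="review",
    meta=["id", "preliminary_decision"],
    record_prefix="review_",
)
print(df.shape)  # expected: (405, 10)
```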
Background
The applications of sentiment analysis are many and varied. For a fairly comprehensive list, see the book by Bo Pang and Lillian Lee [6].
- Applications to review-related websites.
Customer Experience is the new marketing battlefront for companies, and it is important for business success. To improve the customer’s experience, companies have to pay attention to the VOC (Voice Of the Customer) to build products and strategies based on customers’ needs.
- Applications as a sub-component technology.
In recommendation systems, for example, sentiment analysis might help the system recommend items based on users' feedback.
- Applications in government intelligence.
Sentiment analysis can help the government know what citizens think about topics of interest to the Public Administration.
- Applications across different domains.
For example, monitoring the evolution of opinion on a given subject, such as bills, political figures, etc.
Data Preprocessing
Using Python libraries, I converted the raw data into clean data, going from a single dataframe column to multiple dataframe columns.
For the rest of my analysis, I will be using just the Spanish data from my dataset.
One problem I encountered at this point of the preprocessing is the NLTK lemmatizer's limited support for Spanish. In the table below, lemmatization (the "text_lemmatized" column) did not convert the token words (the "New_text" column) into their base forms.
To fix this, I used spaCy, and the result looks better, as you can see in the table below. I still have to do some checking to make sure all the lemmas are in their correct form.
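A minimal sketch of the spaCy approach, assuming the small Spanish model has been installed with `python -m spacy download es_core_news_sm`:

```python
import spacy

# Spanish pipeline with a proper lemmatizer.
nlp = spacy.load("es_core_news_sm")

def lemmatize(text):
    # Keep only alphabetic tokens and return their base forms.
    return [tok.lemma_ for tok in nlp(text) if tok.is_alpha]

print(lemmatize("Los artículos fueron evaluados por los revisores"))
```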
YouTube Presentation
Here is a brief YouTube presentation of my project.
Phase II: Exploratory Data Analysis (EDA)
In this phase, the tools and libraries used are Pandas for EDA, and Seaborn and Matplotlib for visualization.
In Figure 2.1, the “remarks” curve is lower because the reviewers did not always write remarks for each paper.
Figure 2.2 represents the number of reviews written each year over a four-year period.
Figure 2.3: Lengths of the reviews
In Figure 2.3, the boxplot shows that the median review length is about 1,000 words, with some outliers; one review has over 6,000 words.
Figure 2.4: Normalization of the evaluation scores
Figure 2.5: Normalization of the orientation scores
Figure 2.4: Evaluation expresses the reviewers' opinions about the papers.
Figure 2.5: Orientation expresses the readers' perceptions of the papers.
Both measures use a 5-point scale ('-2': very negative, '-1': negative, '0': neutral, '1': positive, '2': very positive).
Figure 2.6: Correlation map
Although the readers did not know the reviewers' evaluations of the papers, the correlation between "evaluation" and "orientation" is high. This shows that both groups have almost the same perceptions of the papers.
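A sketch of how such a map can be produced (column names follow the flattened dataframe above; the score fields arrive as strings, so they are cast to numeric first):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df is the flattened reviews dataframe from the loading step.
cols = ["review_confidence", "review_evaluation", "review_orientation"]
corr = df[cols].apply(pd.to_numeric, errors="coerce").corr()

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation map")
plt.show()
```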
Text classification
In the table below, using CountVectorizer() as the vectorizer with ngram_range = (1, 2) and different classifiers, I was able to predict with an accuracy of about 70% whether a paper would be accepted or rejected.
Figure 2.7: Preliminary decision
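A minimal sketch of this setup, shown here with Logistic Regression (the train/test split and column names are assumptions; the table compares several classifiers):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["preliminary_decision"],
    test_size=0.2, random_state=42,
)

# Unigram + bigram counts feeding a linear classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```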
Sentiment Analysis
In this code snippet, I'm using "sentiment-analysis-spanish", a Python library that uses convolutional neural networks to predict the sentiment of Spanish sentences.
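Its basic usage, following the library's PyPI page [7]; applying it to the whole review column is a sketch (the column name is an assumption):

```python
from sentiment_analysis_spanish import sentiment_analysis

analyzer = sentiment_analysis.SentimentAnalysisSpanish()

# Scores close to 1 are positive, close to 0 negative.
print(analyzer.sentiment("me gusta la tombola es genial"))

# Score every Spanish review (column name assumed).
df["sentiment_es"] = df["review_text"].apply(analyzer.sentiment)
```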
Figure 2.8: Distribution of the sentiment score
With "SentimentAnalysisSpanish", the function sentiment(review) returns a number between 0&1.
Numbers close to 0, mean that the review is negative. Numbers close to 1, mean that the review is positive. The space in between correspond to neutral texts. [7]
In Figure 2.8. I have 343 (~90%) reviews with sentiment score between 0 and 0.5 ==> Negative.
And just 39 reviews with score greater than 0.5 ==> Positive.
This is not what, I was expecting. Based on Figure 2.7 about 70% of the papers were accepted, which let us think that the reviews about them were somehow positive.
To check whether this library correctly identifies the sentiment behind each review, I translated the reviews into English, which has more advanced libraries for sentiment analysis.
Text translation before sentiment analysis
After translating the reviews into English and calculating the sentiment behind each review, I obtained scores that match my expectations better.
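One possible translation step (the write-up does not name the tool, so deep-translator is an assumption):

```python
from deep_translator import GoogleTranslator

translator = GoogleTranslator(source="es", target="en")

# Very long reviews may need to be split first, since the
# underlying service caps the request length.
df["text_en"] = df["review_text"].apply(translator.translate)
```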
As we can see in the graphs below, most of the scores are greater than 0, which indicates a "positive" sentiment.
With Vader sentiment analysis (Figure 2.9), 279 reviews (~72%) have a positive compound score (> 0).
With TextBlob (Figure 2.10), 275 reviews (~71%) have a positive polarity (> 0).
Figure 2.9: Vader sentiment score
Figure 2.10: TextBlob sentiment score
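A hedged sketch of scoring the translated reviews with both libraries (the "text_en" column comes from the translation step above):

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

# VADER compound and TextBlob polarity both fall in [-1, 1].
df["vader_compound"] = df["text_en"].apply(
    lambda t: vader.polarity_scores(t)["compound"]
)
df["textblob_polarity"] = df["text_en"].apply(
    lambda t: TextBlob(t).sentiment.polarity
)

print((df["vader_compound"] > 0).mean())     # share positive, VADER
print((df["textblob_polarity"] > 0).mean())  # share positive, TextBlob
```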
YouTube Presentation
A brief YouTube presentation of Phase II.
Phase III: Reports & Conclusion
The main steps of this phase are:
Model Training
Prediction
Evaluation
In Figures 3.1 and 3.2, I calculated the sentiment score of each review on a 5-point scale, to match the reviewers' evaluation scores (a sketch of the binning follows the scale):
“-2”: very negative,
“-1”: negative,
“0”: neutral,
“1”: positive,
“2”: very positive.
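A minimal sketch of one way to do this binning (the exact bin edges used for the figures are not stated, so these are illustrative):

```python
def to_five_point(score):
    # Map a polarity in [-1, 1] onto {-2, -1, 0, 1, 2};
    # bin edges are illustrative assumptions.
    if score <= -0.6:
        return -2
    if score <= -0.2:
        return -1
    if score < 0.2:
        return 0
    if score < 0.6:
        return 1
    return 2

df["textblob_score"] = df["textblob_polarity"].apply(to_five_point)
df["vader_score"] = df["vader_compound"].apply(to_five_point)
```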
Figure 3.1: TextBlob Scores
Figure 3.2: Vader Scores
In Figure 3.3, I show the word cloud of the text data, where the size of each word indicates its frequency or importance.
Figure 3.3: Word Cloud
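A sketch of generating such a word cloud (assumes the review text column from earlier):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = " ".join(df["review_text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```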
Looking at the subjectivity axis in Figure 3.4, we can see that the reviews are based more on facts (subjectivity < 0.5) than on opinions (subjectivity > 0.5).
This speaks well for the model, as these texts are scientific reviews written by experts in the domain.
Figure 3.4: TextBlob "Polarity-Subjectivity"
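A sketch of the polarity-subjectivity plot (assumes the translated text and TextBlob polarity columns from earlier):

```python
import matplotlib.pyplot as plt
from textblob import TextBlob

df["subjectivity"] = df["text_en"].apply(
    lambda t: TextBlob(t).sentiment.subjectivity
)

plt.scatter(df["textblob_polarity"], df["subjectivity"], alpha=0.5)
plt.xlabel("Polarity")
plt.ylabel("Subjectivity")
plt.show()
```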
In Figure 3.5, I plotted together the score distributions (score vs. review count) obtained with three different approaches:
- Manual (Orientation)
- TextBlob
- Vader
Keep in mind that the orientation scores were given by people based on their perception of the papers, while the TextBlob and Vader results are machine learning (ML) scores computed on the same papers.
The difference in the results can be explained by the fact that manual scoring (human scoring) is subjective by nature (two readers can easily give different scores) and potential biases can be introduced in the process.
The ML approach is more standardized, and we get the same result regardless of who runs the process.
Figure 3.5: Sentiments Scores
Models
The methods used in sentiment analysis or opinion mining relate to data extraction and preprocessing, natural language processing, and machine learning.
In this study, I used 3 models:
• Naïve Bayes (NB)
• Logistic Regression (LR)
• Support Vector Machine (SVM)
For each model, I calculated the accuracy score, which measures the algorithm's performance in predicting the scores.
Figure 3.6: Models Accuracies
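A hedged sketch of the comparison (features, split, and target column are assumptions; the figure reports the actual accuracies):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["review_orientation"],
    test_size=0.2, random_state=42,
)

vec = CountVectorizer(ngram_range=(1, 2))
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

for name, clf in [
    ("Naive Bayes", MultinomialNB()),
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("SVM", LinearSVC()),
]:
    clf.fit(Xtr, y_train)
    print(name, accuracy_score(y_test, clf.predict(Xte)))
```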
Conclusion
Looking at the accuracies, Logistic Regression outperforms the other models in predicting the sentiment scores, so I will choose LR as my model.
What Next
The next step could be to explore techniques like RNNs and LSTMs, which could improve model performance.
YouTube Presentation
[1] Image courtesy of GeeksforGeeks. https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python
[2] Kelly J, Sadeghieh T, Adeli K. (2014). “Peer review in scientific publications: benefits, critiques, and a survival guide” from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4975196/
[3] Keith, B., Fuentes, E., & Meneses, C. (2017). A Hybrid Approach for Sentiment Analysis Applied to Paper Reviews, from https://sentic.net/wisdom2017fuentes.pdf
[4] Image courtesy of Adobe Stock. https://www.attestationupdate.com/2016/10/18/be-prepared-a-comprehensive-peer-review-update/
[5] B. Liu. (2012) “Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies.” Morgan & Claypool Publishers.
[6] B. Pang and L. Lee. (2008) “Opinion Mining and Sentiment Analysis,” from https://www.cs.cornell.edu/home/llee/omsa/omsa.pdf.
[7] https://pypi.org/project/sentiment-analysis-spanish/