ASHREEN KAUR OBEROI
SENTIMENT ANALYSIS AND TOPIC MODELING
This page provides a summary of my project for MPS Data Science. I chose to work on Amazon reviews for cell phones. Amazon is considered a reliable and safe place to buy cell phones, and it is easy to research because many people leave reviews for the products they purchase.
I have divided my work into 3 phases, each with its own significance. Detailed information on the phases is given below.
I used an existing dataset of Amazon customer reviews for locked and unlocked phones, focused on 10 brands: ASUS, Apple, Google, HUAWEI, Motorola, Nokia, OnePlus, Samsung, Sony, and Xiaomi. This dataset contains 82,815 reviews from Amazon about cell phones from 2004 up until Sep 2019. Each review is associated with an item and brand name and comes with a rating ranging from 1 to 5.
Step-by-step derivation of the dataset:
https://github.com/grikomsn/amazon-cell-phones-reviews
In phase 1, I introduced my dataset, my objective, and the purpose of this project through a YouTube video, a GitHub page, and a brief description on this site.
Published here are two files, items.csv and reviews.csv, each prefixed with the date on which the data was retrieved. items.csv contains items retrieved (read: scraped) from Amazon.com search results, using a generated URL and a specific query string to search only specific brands and items with a minimum 1-star review. reviews.csv contains the reviews for the items in items.csv, but without the columns from items.csv. The datasets were retrieved using Puppeteer.
This is a structured dataset.
TF-IDF vectorizer
LinearSVM classifier
Logistic regression classifier
MultinomialNB
Topic Modelling
Sarcasm: Sarcasm can be very common in reviews. Eliminating it would require identifying sarcasm reliably, which would have been very challenging, so I assumed that the reviews do not contain sarcasm.
Vectorization method: I compared several vectorization methods: Binary Term Frequency, Bag of Words (BoW) Term Frequency, Normalized Term Frequency, and (L2) Normalized TF-IDF, which turned out to be the best.
Models: After a lot of research, I settled on the following models: Logistic regression classifier, MultinomialNB, and SVM classification.
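The four vectorization methods can be set up side by side in scikit-learn, roughly as in the sketch below. This is an illustration rather than the exact code from the project: the toy reviews are invented, the dictionary keys are hypothetical names, and the use of L1 normalization for "Normalized Term Frequency" is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy reviews, invented for illustration only (not taken from the dataset).
reviews = [
    "great phone great battery",
    "battery died after a week",
    "screen arrived with scratches",
]

# Hypothetical names mapped to one possible scikit-learn setup per method.
vectorizers = {
    "binary_tf": CountVectorizer(binary=True),            # 1 if the term occurs, else 0
    "bow_tf": CountVectorizer(),                          # raw term counts
    "normalized_tf": TfidfVectorizer(use_idf=False, norm="l1"),  # assumption: L1-normalized counts
    "l2_tfidf": TfidfVectorizer(norm="l2"),               # TF-IDF with unit-length rows
}

# Fit each vectorizer on the same text and keep the resulting sparse matrices.
matrices = {name: vec.fit_transform(reviews) for name, vec in vectorizers.items()}

for name, matrix in matrices.items():
    print(name, matrix.shape)
```

Each matrix can then be fed to the same classifier so that only the vectorization step changes between runs.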
Brand Distribution
Average rating per brand
LEARNING RATINGS FROM REVIEWS TEXT :
Step 1: First, I performed some data cleaning by dropping unnecessary columns and null values, to prepare for vectorization and further analysis.
Step 2: For this analysis, I used only the reviews column and the corresponding rating value, to predict the rating from the text.
Step 3: For these predictions, I used a TF-IDF vectorizer with LinearSVM, Logistic regression, and MultinomialNB classifiers, with the dataset split into test and train sets.
Step 4: The rating predictions for the first 10 reviews in the dataset looked like this:
Predicted ratings are shown first
Actual ratings are shown below
52%
54%
MultinomialNB: 51%
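Steps 1 to 4 above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the toy phrases stand in for the cleaned review text, "LinearSVM classifier" is assumed to mean scikit-learn's LinearSVC, and the 80/20 split and random seed are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the cleaned review text and star ratings (invented data).
samples = {
    1: ["terrible phone broke in a day", "awful battery do not buy"],
    2: ["poor screen and slow", "bad camera not happy"],
    3: ["okay phone average battery", "fine but nothing special"],
    4: ["good phone nice screen", "works well good price"],
    5: ["excellent phone love it", "perfect condition great battery"],
}
texts, ratings = [], []
for rating, phrases in samples.items():
    for phrase in phrases:
        # Repeat each phrase so every rating has enough rows to split.
        texts.extend([phrase] * 5)
        ratings.extend([rating] * 5)

# Assumed 80/20 split, stratified so each star rating appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, ratings, test_size=0.2, random_state=0, stratify=ratings
)

models = {
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
}

accuracies = {}
for name, classifier in models.items():
    # TF-IDF features feed directly into each classifier.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, pipeline.predict(X_test))
    print(f"{name}: {accuracies[name]:.2f}")
```

Because the toy phrases are repeated, the scores printed here say nothing about the real dataset; the sketch only shows how the pieces fit together.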
Finally, I modeled a simpler version of the same problem: instead of trying to predict the exact star rating, I classified the reviews as positive or negative. I removed the 3-star reviews to avoid confusion and improve the accuracy of the model.
For this remodeled problem, the accuracy increased to 90%.
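The relabeling for the positive/negative problem might look like the sketch below. The column names `body` and `rating`, the toy rows, and the choice of logistic regression are assumptions for illustration; ratings above 3 are treated as positive and ratings below 3 as negative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews with assumed column names (invented data).
reviews = pd.DataFrame({
    "body": [
        "love this phone works great",
        "terrible battery stopped working",
        "it is okay nothing special",
        "excellent screen and great price",
        "bad phone arrived broken",
        "good condition works fine",
    ],
    "rating": [5, 1, 3, 5, 2, 4],
})

# Drop the neutral 3-star reviews, then label 4-5 stars as positive (1)
# and 1-2 stars as negative (0).
binary = reviews[reviews["rating"] != 3].copy()
binary["positive"] = (binary["rating"] > 3).astype(int)

# Same TF-IDF features as before, now with a two-class target.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(binary["body"], binary["positive"])

prediction = model.predict(["great phone, love the screen"])
print(prediction)
```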
I merged the items and reviews datasets on the unique identifier ASIN and calculated the positivity.
I split the reviews by brand for Apple, Samsung, and Xiaomi and performed topic modeling on these 3 brands using a TF-IDF vectorizer and LDA. The output shows the following topics for Apple.
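A merge on ASIN followed by a per-brand positivity score could be sketched as below. The column names, the placeholder ASIN values, and the definition of positivity (the share of reviews rated 4 stars or higher) are all assumptions, since the page does not spell them out.

```python
import pandas as pd

# Toy stand-ins for items.csv and reviews.csv (placeholder ASINs, invented rows).
items = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "brand": ["Apple", "Samsung", "Xiaomi"],
})
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A2", "A3"],
    "rating": [5, 2, 4, 1, 5],
})

# Attach the brand of each item to its reviews via the shared ASIN.
merged = reviews.merge(items, on="asin", how="left")

# Assumed definition: a review is positive when it has 4 or 5 stars.
merged["positive"] = merged["rating"] >= 4

# Positivity per brand = fraction of positive reviews.
positivity = merged.groupby("brand")["positive"].mean()
print(positivity)
```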
Apple’s Topic modelling Output from LDA model with 10 topics (each includes Top 10 words):
Topic #0: phone came got charger scratches iphone just time apple new
Topic #1: love working time phone new far iphone bought scratches great
Topic #2: good use bought phone iphone far time scratches price just
Topic #3: work product buy phone iphone time new great bought good
Topic #4: new brand looks price phone like works great good came
Topic #5: great works far phone scratches new good came iphone time
Topic #6: screen just phone iphone scratches got new came buy time
Topic #7: condition perfect came works phone scratches new great iphone good
Topic #8: like apple refurbished phone new iphone scratches buy time got
Topic #9: battery iphone life phone new good screen great scratches like
Apple Feature plot
Xiaomi Feature Plot
Samsung Feature Plot
The models did not perform well at predicting exact star ratings. Results improved when I removed the neutral reviews, but that is not the correct way to handle them. My conclusion is that the models were not very successful: the results were not the worst, but I would not recommend them.
For the result object, I had amazing results and I was able to print the top 10 most used words.
References:
1. https://towardsdatascience.com/predicting-sentiment-of-amazon-product-reviews-6370f466fa73 : There is a very similar situation in this project. They try to predict the reviews through the sentiments and find the precision and accuracy. The sentiments are the deciding factor in whether a review is negative or positive. It is very useful for my project, but I will be using other vectorization techniques than this project.
2. https://towardsdatascience.com/review-rating-prediction-a-combined-approach-538c617c495c : This has the same context, but they also used Random Forest, NLP, and decision trees to solve the same problem. This can be very useful, as I can try to tackle my problem with NLP instead of TF-IDF vectorization. It gives me a choice as to which vectorization methods work for me.
3. https://www.developintelligence.com/blog/2017/03/predicting-yelp-star-ratings-review-text-python/ : This is a project on Yelp reviews that is based on vectorization. They have some very easy methods to test and train the data that I can use.