ASHREEN KAUR OBEROI

SENTIMENT ANALYSIS AND TOPIC MODELING

OVERVIEW

This page summarizes my project for MPS Data Science. I chose to work on Amazon reviews of cell phones. Amazon is considered a reliable and safe place to buy cell phones, and it is easy to research them there because many people leave reviews for the products they purchase.

I have divided my work into 3 phases, each with its own significance. Detailed information on the phases is given below.

ABOUT MY DATASET

I used an existing dataset of Amazon customer reviews for locked and unlocked phones, focused on 10 brands: ASUS, Apple, Google, HUAWEI, Motorola, Nokia, OnePlus, Samsung, Sony, and Xiaomi. The dataset contains 82,815 Amazon reviews of cell phones from 2004 up until Sep 2019. Each review is associated with an item and brand name and comes with a rating ranging from 1 to 5.

Step-by-step derivation of the dataset:

https://github.com/grikomsn/amazon-cell-phones-reviews

PHASE 1

In phase 1, I introduced my dataset and the objective and purpose of this project through a YouTube video, a GitHub page, and a brief description on this site.

Published here are two files, items.csv and reviews.csv, each prefixed with the date on which the data was retrieved. items.csv contains items retrieved (read: scraped) from Amazon.com search results, using a generated URL and a specific query string to search only for specific brands and for items with a minimum 1-star review. reviews.csv contains the reviews for the items previously retrieved in items.csv, but without the columns from items.csv. The datasets were retrieved using Puppeteer.

Structured/Unstructured Data:

This is a structured dataset.

Methods that are going to be used:

TF-IDF vectorizer
Linear SVM classifier
Logistic regression classifier
MultinomialNB
Topic Modelling

YOUTUBE VIDEO - PHASE 1

GITHUB LINK

https://github.com/ashreenoberoi/606-Capstone

CHALLENGES

Sarcasm: Sarcasm can be very common in reviews. Eliminating it would have required learning how to identify sarcasm, which would have been very challenging, so I assumed that the reviews do not contain sarcasm.
Vectorization method: I compared several vectorization methods: Binary Term Frequency, Bag of Words (BoW) Term Frequency, Normalized Term Frequency, and (L2) Normalized TF-IDF, which turned out to be the best.
Models: After a lot of research, I found the following to be the best models: Logistic regression classifier, MultinomialNB, and SVM classification.
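The four vectorization methods compared above can be sketched with scikit-learn as follows. This is only an illustration, not the exact code from the project: the toy reviews are invented, and using the L1 norm for "Normalized Term Frequency" is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy reviews, invented for illustration only (not taken from the dataset).
toy_reviews = [
    "great phone great battery",
    "battery died after a week",
    "screen arrived with scratches",
]

vectorizers = {
    # 1 if the term occurs in the review, 0 otherwise
    "binary_tf": CountVectorizer(binary=True),
    # raw term counts (bag of words)
    "bow_tf": CountVectorizer(),
    # counts scaled so each row sums to 1 (L1 norm is an assumption), no IDF
    "normalized_tf": TfidfVectorizer(use_idf=False, norm="l1"),
    # TF-IDF weights with each row scaled to unit Euclidean (L2) length
    "l2_tfidf": TfidfVectorizer(use_idf=True, norm="l2"),
}

# Fit each vectorizer on the same text so the resulting matrices are comparable.
matrices = {name: vec.fit_transform(toy_reviews) for name, vec in vectorizers.items()}

for name, matrix in matrices.items():
    # Show the shape and the weights of the first review under each scheme.
    print(name, matrix.shape, matrix.toarray()[0].round(2))
```

Each resulting matrix can be passed to the same classifier, so the methods can be compared on accuracy alone.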

PHASE 2

Brand Distribution

Average rating per brand

LEARNING RATINGS FROM REVIEW TEXT:

Step 1: First, I performed some data cleaning by dropping unnecessary columns and null values, to prepare for vectorization and further analysis.

Step 2: For this analysis, I used only the reviews column and the corresponding rating value, to predict the rating from the text.

Step 3: For these predictions, I used the TF-IDF vectorizer with the Linear SVM classifier, the Logistic regression classifier, and the MultinomialNB classifier, with the dataset split into test and train sets.

Step 4: The rating predictions for the first 10 reviews in the dataset were as follows:

Predicted ratings are shown first.

Actual ratings are shown below them.
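The four steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project notebook: the tiny data frame is invented, the column names "body", "rating" and "helpfulVotes" are assumed, and LinearSVC stands in for the Linear SVM classifier.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented stand-in for reviews.csv; column names are assumptions.
raw = pd.DataFrame({
    "body": [
        "terrible phone, stopped working", "awful battery, broke in a week",
        "worst purchase, screen cracked", "bad charger and terrible screen",
        "okay phone, average battery", "average camera, okay price",
        "decent but average screen", "okay for the price, decent",
        "great phone, love the camera", "excellent battery, great screen",
        "love it, works great", "perfect condition, excellent price",
        None,
    ],
    "rating": [1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 4],
    "helpfulVotes": [0] * 13,
})

# Step 1: drop an unneeded column and rows with null values.
reviews = raw.drop(columns=["helpfulVotes"]).dropna(subset=["body", "rating"])

# Step 2: keep only the review text and the corresponding rating.
texts, ratings = reviews["body"], reviews["rating"]

# Step 3: split into train and test, vectorize, and fit the three classifiers.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, ratings, test_size=0.25, stratify=ratings, random_state=0
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)  # learn vocabulary on train only
X_test = vectorizer.transform(X_test_text)

models = {
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.0%}")

# Step 4: predicted vs. actual ratings for the first 10 reviews.
first10 = reviews.head(10)
predicted = models["LinearSVC"].predict(vectorizer.transform(first10["body"]))
print(list(predicted))
print(list(first10["rating"]))
```

The vectorizer is fitted on the training split only, so no vocabulary information from the test reviews leaks into training.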

MODELS AND ACCURACY RATES

Accuracy rate: 52%

Accuracy rate: 54%

Accuracy rate (MultinomialNB): 51%

Finally, I modelled a simpler version of the same problem: instead of trying to predict the exact star rating, I tried to classify the reviews as positive or negative. I removed the 3-star reviews to avoid confusion and improve the accuracy of the model.

For this remodeled problem, the accuracy increased to 90%.
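The relabeling for this simpler problem might look like the sketch below. The column names and sample rows are invented, and mapping 1-2 stars to negative and 4-5 stars to positive is an assumption about where the cut-off sits.

```python
import pandas as pd

# Invented sample rows; column names "body" and "rating" are assumptions.
rated = pd.DataFrame({
    "body": ["broke in a day", "poor battery", "it is okay", "works well", "love it"],
    "rating": [1, 2, 3, 4, 5],
})

# Drop the neutral 3-star reviews.
binary = rated[rated["rating"] != 3].copy()

# Assumed cut-off: 4-5 stars are positive, 1-2 stars are negative.
binary["sentiment"] = (binary["rating"] > 3).map({True: "positive", False: "negative"})

print(binary[["rating", "sentiment"]])
```

The "sentiment" column can then replace the star rating as the target of the same classifiers.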

TOPIC MODELLING

I merged the items and reviews datasets on the unique identifier ASIN and calculated the positivity.
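A merge of this kind can be sketched with pandas as below. The rows are invented, the column names are assumptions, and "positivity" is taken here to mean the share of reviews rated above 3 stars per brand, which is only one possible definition.

```python
import pandas as pd

# Invented stand-ins for items.csv and reviews.csv; column names are assumptions.
items = pd.DataFrame({"asin": ["A1", "A2"], "brand": ["Apple", "Samsung"]})
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "A2", "A2"],
    "rating": [5, 2, 4, 5],
    "body": ["love it", "came with scratches", "good price", "works great"],
})

# Attach the brand to each review through the shared ASIN identifier.
merged = reviews.merge(items, on="asin", how="inner")

# Assumed definition of positivity: fraction of reviews with more than 3 stars.
merged["positive"] = merged["rating"] > 3
positivity = merged.groupby("brand")["positive"].mean()

print(positivity)
```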

The reviews were split by brand for Apple, Samsung, and Xiaomi, and I performed topic modeling on these 3 brands using the TF-IDF vectorizer and LDA. The output shows the following topics for Apple.

Apple's topic modelling output from the LDA model with 10 topics (each includes the top 10 words):

Topic #0: phone came got charger scratches iphone just time apple new

Topic #1: love working time phone new far iphone bought scratches great

Topic #2: good use bought phone iphone far time scratches price just

Topic #3: work product buy phone iphone time new great bought good

Topic #4: new brand looks price phone like works great good came

Topic #5: great works far phone scratches new good came iphone time

Topic #6: screen just phone iphone scratches got new came buy time

Topic #7: condition perfect came works phone scratches new great iphone good

Topic #8: like apple refurbished phone new iphone scratches buy time got

Topic #9: battery iphone life phone new good screen great scratches like

PLOTTING FEATURES USING XGBOOST

Apple Feature plot

Xiaomi Feature Plot

Samsung Feature Plot

PHASE 2 YOUTUBE VIDEO

CONCLUSION

The models did not perform well at predicting the exact star rating. When I removed the neutral reviews I saw better results, but that is not the correct way to improve them. My conclusion is that the models were not very successful: the results are not the worst, but I would not recommend them.

For the result object, I had good results and I was able to print the top 10 most used words.

REFERENCES

1. https://towardsdatascience.com/predicting-sentiment-of-amazon-product-reviews-6370f466fa73 : This project deals with a very similar situation. The authors try to predict reviews from their sentiment and report precision and accuracy. The sentiment is the deciding factor in whether a review is negative or positive. It is very useful for my project, but I will be using other vectorization techniques than this project does.

2. https://towardsdatascience.com/review-rating-prediction-a-combined-approach-538c617c495c : This covers the same context, but the authors also used Random Forest, NLP, and a decision tree to solve the same problem. This is very useful, as I can try to tackle my problem with NLP instead of TF-IDF vectorization. It gives me a choice as to which vectorization methods work for me.

3. https://www.developintelligence.com/blog/2017/03/predicting-yelp-star-ratings-review-text-python/ : This is a project on Yelp reviews based on vectorization. It has some very easy methods for training and testing on the data that I can use.
