Spring 2021

Instructor: Dr. Ergun Simsek

Fake Amazon Review Detection

Phase 1

Introduction

According to articles in the BBC and the New York Post, fake Amazon reviews are being sold online in bulk to help merchants manipulate the e-commerce platform's rating system. As demand for online shopping grew at the beginning of the coronavirus pandemic, many fake and misleading reviews were posted to boost the sales of products sold online. The same problem affects other platforms, including eBay, Yelp, Walmart, and BestBuy. [1] [2] [3]

Amazon has taken action by removing fake reviews and urged customers to report reviews that appear suspicious. Despite Amazon's best efforts, merchants have found all kinds of ways to circumvent the e-commerce platform’s rating system.

My project uses natural language processing and machine learning techniques to build a model that detects whether a given review is fake or real.

Industrial Research

Considerable research has been done on detecting fake reviews across platforms such as Amazon, eBay, Yelp, Walmart, and BestBuy, and websites such as Fakespot and ReviewMeta provide tools for analyzing online customer reviews for fake or unreliable ones. [4] [5] [6]

The common features used for detecting fake reviews are: [7]

Review-centric features

Length of the review
Average word length of the reviewer
Number of sentences
Average sentence length of the reviewer
Percentage of numerals
Percentage of capitalized words
Percentage of positive/negative opinion bearing words in each review
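
The review-centric features listed above can be sketched in a few lines of Python. This is only an illustration, not the method used in [7]: the function name review_features, the tiny word lists, and the reading of "capitalized words" as all-caps words are assumptions made for the example.

```python
import re

# Tiny illustrative lexicons (assumption); a real system would use a full opinion lexicon.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "cool"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "broken"}


def review_features(text):
    """Compute simple review-centric features for one review text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty reviews
    lowered = [w.lower() for w in words]
    return {
        "length_chars": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "num_sentences": len(sentences),
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "pct_numerals": 100 * sum(w.isdigit() for w in words) / n_words,
        # "Capitalized" is read here as all-caps words longer than one letter (assumption).
        "pct_capitalized": 100 * sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        "pct_positive": 100 * sum(w in POSITIVE_WORDS for w in lowered) / n_words,
        "pct_negative": 100 * sum(w in NEGATIVE_WORDS for w in lowered) / n_words,
    }


features = review_features("GREAT phone. I love it! Bought 2 for my kids.")
```

For the sample sentence, the function counts 10 words in 3 sentences, with one numeral and one all-caps word.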

Reviewer-centric features

Maximum number of reviews in a day
Percentage of reviews with positive/negative ratings
Average review length
The standard deviation of ratings of the reviewer’s reviews
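
The reviewer-centric features above are aggregates over all reviews written by one customer. A minimal sketch follows, assuming hypothetical sample rows and treating ratings of 4 stars or more as positive; neither the threshold nor the function name reviewer_features comes from the source.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical sample rows: (customer_id, review_date, star_rating, review_body)
reviews = [
    ("c1", "2015-01-01", 5, "Great product"),
    ("c1", "2015-01-01", 5, "Love it"),
    ("c1", "2015-01-02", 5, "Best purchase ever"),
    ("c2", "2015-01-01", 2, "Stopped working after a week"),
    ("c2", "2015-03-10", 4, "Works fine"),
]


def reviewer_features(rows, positive_threshold=4):
    """Aggregate per-reviewer statistics from individual review rows."""
    by_customer = defaultdict(list)
    for customer_id, date, rating, body in rows:
        by_customer[customer_id].append((date, rating, body))

    features = {}
    for customer_id, items in by_customer.items():
        ratings = [rating for _, rating, _ in items]
        per_day = Counter(date for date, _, _ in items)
        features[customer_id] = {
            "max_reviews_per_day": max(per_day.values()),
            "pct_positive": 100 * sum(r >= positive_threshold for r in ratings) / len(ratings),
            "avg_review_length": mean(len(body) for _, _, body in items),
            "rating_std": pstdev(ratings),  # population standard deviation
        }
    return features


reviewer_stats = reviewer_features(reviews)
```

In this toy data, reviewer c1 posts two reviews on one day and always gives 5 stars, so the rating deviation is zero, a pattern these features are meant to surface.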

Project Goal

The project goal is to create a machine learning model that detects whether a review from the Amazon reviews dataset is fake or real.

Dataset

The Amazon Customer Reviews dataset has been taken from the Amazon Customer Reviews Library and it contains the customer review text with accompanying metadata, consisting of three major components: [8]

A collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale. (130M+ customer reviews)
A collection of reviews about products in multiple languages from different Amazon marketplaces, intended to facilitate analysis of customers’ perception of the same products and wider consumer preferences across languages and countries. (200K+ customer reviews in 5 countries)
A collection of reviews that have been identified as non-compliant with respect to Amazon policies. This is intended to provide a reference dataset for research on detecting promotional or biased reviews. (several thousand customer reviews).

Size:

6900886 rows x 15 columns

Data Columns:

marketplace - 2 letter country code of the marketplace where the review was written.

customer_id - Random identifier that can be used to aggregate reviews written by a single author.

review_id - The unique ID of the review.

product_id - The unique Product ID the review pertains to. In the multilingual dataset, the reviews for the same product in different countries can be grouped by the same product_id.

product_parent - Random identifier that can be used to aggregate reviews for the same product.

product_title - Title of the product.

product_category - Broad product category that can be used to group reviews.

star_rating - The 1-5 star rating of the review.

helpful_votes - Number of helpful votes.

total_votes - Number of total votes the review received.

vine - Review was written as part of the Vine program.

verified_purchase - The review is on a verified purchase.

review_headline - The title of the review.

review_body - The review text.

review_date - The date the review was written.

Data Format:

Tab ('\t') separated text file, without quote or escape characters. The first line in each file is the header; 1 line corresponds to 1 record.
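
Because the files contain no quote or escape characters, a reader should be told to treat quotation marks as ordinary text. The sketch below shows this with Python's standard csv module on a made-up two-record sample that uses only a subset of the columns; the sample values are hypothetical.

```python
import csv
import io

# Hypothetical two-record sample with a subset of the columns described above.
sample = (
    "marketplace\tcustomer_id\tstar_rating\treview_body\n"
    'US\t12345\t5\tSaid "wow" out loud\n'
    "US\t67890\t3\tits cool\n"
)

# QUOTE_NONE makes the reader treat quote characters as plain text,
# matching the "no quote or escape characters" format.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
records = list(reader)
```

The same settings carry over to pandas, where read_csv accepts sep="\t" and quoting=csv.QUOTE_NONE.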

Sample Dataset

Full Dataset

Dataset ReadMe

Training Dataset

Phase 2

Data Pre-processing and Cleaning

The following steps have been performed as part of data pre-processing and cleaning:

The dataset has been loaded into a Pandas DataFrame and invalid records (rows with more than 22 values) have been dropped.
The records with Null values have been dropped and there were no duplicate records.
The date column is converted into date format.

The above steps have been performed for both the full dataset and the training dataset.
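
A minimal pandas sketch of the null, duplicate, and date steps is shown below on a hypothetical four-row frame. The filter for invalid rows is left out because the source does not describe it in enough detail, and the sample deliberately contains a duplicate and a null so that each step has a visible effect.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the loaded reviews.
df = pd.DataFrame(
    {
        "review_id": ["R1", "R2", "R2", "R3"],
        "review_body": ["Great", "its cool", "its cool", None],
        "review_date": ["2015-08-31", "2014-01-15", "2014-01-15", "2013-05-02"],
    }
)

df = df.dropna()           # drop records with null values
df = df.drop_duplicates()  # drop exact duplicate records, if any
df["review_date"] = pd.to_datetime(df["review_date"])  # parse date strings
df = df.reset_index(drop=True)
```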

Exploratory Data Analysis

Full Dataset

The dataset has 6900413 records with reviews dated from 1995 to 2015.
The reviews span various categories such as Books, Music, Video, Toys, Tools, Office Products, Video Games, Software, Digital Purchases, and Electronics.
5-star is the most common rating.

Training Dataset

The dataset has 21000 reviews, each labeled as biased or unbiased.
The numbers of biased and unbiased reviews are equal, i.e., 10500 records each.
There are 700 reviews for each category, with 350 biased and 350 unbiased reviews.
5-star is the most common rating.
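
Checks like the ones above (class balance, per-category balance, most common rating) can be reproduced with a few pandas calls. The frame below is a hypothetical four-row stand-in, and the column name label is an assumption, since the source does not give the name of the label column.

```python
import pandas as pd

# Hypothetical labeled sample; the column name "label" is an assumption.
train = pd.DataFrame(
    {
        "product_category": ["Books", "Books", "Toys", "Toys"],
        "label": ["biased", "unbiased", "biased", "unbiased"],
        "star_rating": [5, 5, 4, 5],
    }
)

label_counts = train["label"].value_counts()                       # overall class balance
per_category = pd.crosstab(train["product_category"], train["label"])  # balance per category
most_common_rating = train["star_rating"].value_counts().idxmax()  # modal star rating
```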

Building the Models

Data Cleaning

Text from the product title, review text, and review title needs to be cleaned before it is passed to the model
The text was converted to lower case, and URLs, mentions, hashtags, punctuation, and emojis were removed
Numbers, special characters, single characters, and extra spaces were removed
Stopwords have not been removed because in a few cases removing them would change the positive or negative intent of the review
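
The cleaning steps above can be approximated with a handful of regular expressions. This is a sketch rather than the exact code used in the project; the function name clean_text and the specific patterns are assumptions.

```python
import re


def clean_text(text):
    """Lowercase a string and strip URLs, mentions, hashtags, and non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation, emojis, numbers, special characters
    text = re.sub(r"\b[a-z]\b", " ", text)              # single characters
    text = re.sub(r"\s+", " ", text).strip()            # extra spaces
    return text


cleaned = clean_text("LOVE it!!! 10/10 see https://example.com @seller #deal a must")
negated = clean_text("Not good.")
```

Stopwords such as "it" and "not" are kept, so "Not good." stays "not good" instead of collapsing to "good".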

Model Training

The dataset has been split into 80% training and 20% testing
Cleaned product title, review text, and review title have been vectorized using TfidfVectorizer for feature extraction
Review rating, Verified purchase, and Product category have been transformed to a one-hot numeric array using OneHotEncoder for categorical feature extraction
The hstack() function is used to stack arrays in sequence horizontally (column-wise)
I have used Logistic Regression, Decision Tree, Gradient Boosting, Random Forest, Passive Aggressive, Multinomial Naive Bayes, and Linear Support Vector classifiers for predicting if the review is biased or unbiased
The Accuracy of these models is:
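
The training steps listed above can be sketched end to end with scikit-learn. The ten-row toy dataset is hypothetical, the three text fields are collapsed into one string per review for brevity, and scipy.sparse.hstack is assumed as the stacking function because TfidfVectorizer returns a sparse matrix. Only Logistic Regression is shown; the other classifiers would be fitted the same way. The labels follow the 1 = unbiased, 0 = biased encoding used in the prediction results.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: cleaned text, categorical columns, labels (1 = unbiased, 0 = biased).
texts = ["great book loved it", "its cool", "best product ever buy now",
         "slow story but good ending", "amazing amazing amazing", "works as described",
         "perfect five stars", "the plot drags in the middle", "must buy", "decent value"]
# Columns: star rating, verified purchase, product category (all kept as strings).
categorical = [["5", "Y", "Books"], ["3", "N", "Video DVD"], ["5", "N", "Toys"],
               ["4", "Y", "Books"], ["5", "N", "Toys"], ["4", "Y", "Electronics"],
               ["5", "N", "Toys"], ["3", "Y", "Books"], ["5", "N", "Video DVD"],
               ["4", "Y", "Electronics"]]
labels = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

# 80% training, 20% testing.
txt_train, txt_test, cat_train, cat_test, y_train, y_test = train_test_split(
    texts, categorical, labels, test_size=0.2, random_state=42, stratify=labels
)

tfidf = TfidfVectorizer()
onehot = OneHotEncoder(handle_unknown="ignore")  # ignore categories unseen in training

# Fit on the training split only, then stack text and categorical features column-wise.
X_train = hstack([tfidf.fit_transform(txt_train), onehot.fit_transform(cat_train)]).tocsr()
X_test = hstack([tfidf.transform(txt_test), onehot.transform(cat_test)]).tocsr()

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
```

With so few rows the accuracy value itself is meaningless; the point is the shape of the pipeline.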

Phase 3

Making a prediction on the reviews from the Amazon Customer Reviews dataset

Before making a prediction on the reviews from the Amazon Customer Reviews dataset, the dataset needs to be pre-processed and cleaned. Text from the product title, review text, and review title has been cleaned in the same way as for the training dataset, and the cleaned dataset has been stored on Google Drive.

Additionally, I have:

Converted the star_rating from float to int datatype to keep it in sync with the training dataset.
There are more product categories in the Unlabeled Amazon review dataset than in the Labeled Amazon review dataset. Because I used the Labeled Amazon review dataset for training, I dropped the product categories from the Unlabeled Amazon review dataset that are not present in the Labeled Amazon review dataset.

The final cleaned Unlabeled Amazon review dataset has 2208111 rows. I make the predictions on this dataset using the models trained on the Labeled Amazon review dataset.
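
Both adjustments are short pandas operations. The sketch below uses two hypothetical frames, and the category names in them are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the labeled and unlabeled datasets.
labeled = pd.DataFrame({"product_category": ["Books", "Toys", "Video DVD"]})
unlabeled = pd.DataFrame(
    {
        "product_category": ["Books", "Digital_Software", "Video DVD", "Gift Card"],
        "star_rating": [5.0, 4.0, 3.0, 5.0],
    }
)

# Match the integer rating type used in the training data.
unlabeled["star_rating"] = unlabeled["star_rating"].astype(int)

# Keep only rows whose category was seen in the labeled (training) data.
known = set(labeled["product_category"])
unlabeled = unlabeled[unlabeled["product_category"].isin(known)].reset_index(drop=True)
```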

Final cleaned unlabeled Amazon reviews dataset used for prediction

Model prediction results

Sample Result 1

Review Number 15 is about the book "It" with a 5-star rating.

Review Text: "I have to say, that when Stephen King wrote this book he truly outdid himself. Stephen King enters you into the world of seven children that are bonded by destiny to fight evil itself in the form of a clown that haunts them and kills other children. After they thought they had killed IT, 20 something years later, IT comes back and they are drawn back to their hometown to face the same evil they faced as children but now they are adults and actually don't have their innocence and imagination, which were their advantages over IT. You'll be hooked until the end until one of the forces wins, IT or the force that draws the children (now adults) toghether."

The prediction results are:

Logistic_Regression_Pred 1 - Unbiased

Decision_Tree_Classification_Pred 0 - Biased

Gradient_Boosting_Classifier_Pred 1 - Unbiased

Random_Forest_Classifier_Pred 1 - Unbiased

PassiveAggressiveClassifier_Pred 0 - Biased

MultinomialNB_Pred 1 - Unbiased

SVM_Pred 1 - Unbiased

Sample Result 2

Review Number 1561993 is about the Video DVD "Smallville: The Complete Series" with a 3-star rating.

Review Text: "its cool"

The prediction results are:

Logistic_Regression_Pred 0 - Biased

Decision_Tree_Classification_Pred 0 - Biased

Gradient_Boosting_Classifier_Pred 0 - Biased

Random_Forest_Classifier_Pred 0 - Biased

PassiveAggressiveClassifier_Pred 1 - Unbiased

MultinomialNB_Pred 0 - Biased

SVM_Pred 0 - Biased
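
The source does not say how the seven model outputs were combined into a single verdict. One possible approach, shown here purely as an assumption, is a majority vote over the predictions listed in the two sample results (1 = Unbiased, 0 = Biased).

```python
from collections import Counter

LABELS = {1: "Unbiased", 0: "Biased"}

# Predictions copied from Sample Result 1 and Sample Result 2.
sample_1 = {
    "Logistic_Regression_Pred": 1,
    "Decision_Tree_Classification_Pred": 0,
    "Gradient_Boosting_Classifier_Pred": 1,
    "Random_Forest_Classifier_Pred": 1,
    "PassiveAggressiveClassifier_Pred": 0,
    "MultinomialNB_Pred": 1,
    "SVM_Pred": 1,
}
sample_2 = {
    "Logistic_Regression_Pred": 0,
    "Decision_Tree_Classification_Pred": 0,
    "Gradient_Boosting_Classifier_Pred": 0,
    "Random_Forest_Classifier_Pred": 0,
    "PassiveAggressiveClassifier_Pred": 1,
    "MultinomialNB_Pred": 0,
    "SVM_Pred": 0,
}


def majority_vote(predictions):
    """Return the label chosen by most models (seven models, so no ties)."""
    winner, _ = Counter(predictions.values()).most_common(1)[0]
    return LABELS[winner]


verdict_1 = majority_vote(sample_1)
verdict_2 = majority_vote(sample_2)
```

Under this rule the Stephen King review is voted unbiased (5 to 2) and the "its cool" review is voted biased (6 to 1).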

Conclusion

Out of the 2.2 million unlabeled Amazon reviews used for prediction, my models have predicted that there are approximately 650K Unbiased reviews and 1.5 million Biased reviews. These model predictions are 80% accurate.

References:

[1] Fake Amazon reviews 'being sold in bulk' online. (2021, February 16). Retrieved February 20, 2021, from https://www.bbc.com/news/business-56069472

[2] Manskar, N. (2021, February 16). Fake Amazon reviews are being sold in bulk online. Retrieved February 20, 2021, from https://nypost.com/2021/02/16/fake-amazon-reviews-are-being-sold-in-bulk-online/

[3] Picchi, A. (2019, February 28). Buyer Beware: Scourge of fake reviews hitting Amazon, Walmart and other major retailers. Retrieved February 22, 2021, from https://www.cbsnews.com/news/buyer-beware-a-scourge-of-fake-online-reviews-is-hitting-amazon-walmart-and-other-major-retailers/

[4] W. Liu, J. He, S. Han, F. Cai, Z. Yang and N. Zhu, "A Method for the Detection of Fake Reviews Based on Temporal Features of Reviews and Comments," in IEEE Engineering Management Review, vol. 47, no. 4, pp. 67-79, Fourth Quarter, Dec. 2019, doi: 10.1109/EMR.2019.2928964.

[5] P. Liu, Z. Xu, J. Ai and F. Wang, "Identifying Indicators of Fake Reviews Based on Spammer's Behavior Features," 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), Prague, 2017, pp. 396-403, doi: 10.1109/QRS-C.2017.72.

[6] S. K. Chauhan, A. Goel, P. Goel, A. Chauhan and M. K. Gurve, "Research on product review analysis and spam review detection," 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, 2017, pp. 390-393, doi: 10.1109/SPIN.2017.8049980.

[7] Zhang, K. (n.d.). How to detect fake online reviews using machine learning. Retrieved February 23, 2021, from https://scoredata.com/how-to-detect-fake-online-reviews-using-machine-learning-2/

[8] Amazon Customer Reviews Dataset. (n.d.). Retrieved February 20, 2021, from https://s3.amazonaws.com/amazon-reviews-pds/readme.html


Yaswanth Kaushal, Rayani Veera

Master's of Professional Studies in Data Science
