According to articles from the BBC and the New York Post, fake Amazon reviews are being sold in bulk online. Online hucksters sell reviews in batches to help merchants manipulate the e-commerce platform's rating system. As demand for online shopping grew at the beginning of the coronavirus pandemic, a wave of fake and misleading reviews was posted to boost the sales of products sold online. This issue affects many platforms, including Amazon, eBay, Yelp, Walmart, and Best Buy. [1] [2] [3]
Amazon has taken action by removing fake reviews and urging customers to report reviews that appear suspicious. Despite these efforts, merchants have found many ways to circumvent the e-commerce platform's rating system.
My project uses natural language processing and machine learning techniques to create a model that detects whether a given review is fake or real.
There has been a lot of research on detecting fake reviews across platforms such as Amazon, eBay, Yelp, Walmart, and Best Buy, and websites such as Fakespot and ReviewMeta provide tools to analyze online customer reviews for fake or unreliable reviews. [4] [5] [6]
The common features used for detecting fake reviews are listed below, followed by a short sketch that computes a few of them: [7]
Review-centric features
Length of the review
Average word length in the review
Number of sentences
Average sentence length in the review
Percentage of numerals
Percentage of capitalized words
Percentage of positive/negative opinion bearing words in each review
Reviewer-centric features
Maximum number of reviews in a day
Percentage of reviews with positive/negative ratings
Average review length
The standard deviation of ratings of the reviewer’s reviews
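For illustration, here is a minimal sketch of how a few of these features could be computed from a reviews DataFrame. The helper name and exact definitions are assumptions, and the column names follow the Amazon dataset schema described later in this report; these engineered features are listed for background and are not the features used later in this project.

import pandas as pd

# Hypothetical helper: compute a few of the review-centric and
# reviewer-centric features listed above.
def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    words = out["review_body"].fillna("").str.split()

    # Review-centric features
    out["review_length"] = out["review_body"].fillna("").str.len()
    out["avg_word_length"] = words.apply(
        lambda ws: sum(len(w) for w in ws) / len(ws) if ws else 0.0
    )
    out["pct_capitalized_words"] = words.apply(
        lambda ws: sum(w[:1].isupper() for w in ws) / len(ws) if ws else 0.0
    )

    # Reviewer-centric features, aggregated per customer_id
    per_reviewer = (
        out.groupby("customer_id")
        .agg(avg_review_length=("review_length", "mean"),
             rating_std=("star_rating", "std"))
        .reset_index()
    )
    return out.merge(per_reviewer, on="customer_id", how="left")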
The project goal is to create a machine learning model that detects whether a review in the Amazon reviews dataset is fake or real.
The Amazon Customer Reviews dataset has been taken from the Amazon Customer Reviews Library. It contains customer review text with accompanying metadata and consists of three major components: [8]
A collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale. (130M+ customer reviews)
A collection of reviews about products in multiple languages from different Amazon marketplaces, intended to facilitate analysis of customers’ perception of the same products and wider consumer preferences across languages and countries. (200K+ customer reviews in 5 countries)
A collection of reviews that have been identified as non-compliant with respect to Amazon policies. This is intended to provide a reference dataset for research on detecting promotional or biased reviews. (several thousand customer reviews).
6,900,886 rows x 15 columns
marketplace - 2 letter country code of the marketplace where the review was written.
customer_id - Random identifier that can be used to aggregate reviews written by a single author.
review_id - The unique ID of the review.
product_id - The unique Product ID the review pertains to. In the multilingual dataset, the reviews for the same product in different countries can be grouped by the same product_id.
product_parent - Random identifier that can be used to aggregate reviews for the same product.
product_title - Title of the product.
product_category - Broad product category that can be used to group reviews.
star_rating - The 1-5 star rating of the review.
helpful_votes - Number of helpful votes.
total_votes - Number of total votes the review received.
vine - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline - The title of the review.
review_body - The review text.
review_date - The date the review was written.
Tab ('\t') separated text file, without quote or escape characters. The first line in each file is the header; one line corresponds to one record.
The following steps were performed as part of data preprocessing and cleaning (a minimal sketch follows the list):
The dataset has been loaded into a Pandas DataFrame and invalid records (rows with more than 22 values) have been dropped.
Records with null values have been dropped, and no duplicate records were found.
The review_date column has been converted to a datetime format.
The above steps have been performed for both the full dataset and the training dataset.
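A minimal sketch of these loading and cleaning steps, assuming pandas; the file name is a placeholder, and on_bad_lines="skip" stands in for dropping malformed rows.

import pandas as pd

# Load the tab-separated review file. quoting=3 (csv.QUOTE_NONE) matches the
# "no quote or escape characters" format, and on_bad_lines="skip" (pandas >= 1.3)
# drops malformed rows instead of raising an error.
df = pd.read_csv(
    "amazon_reviews_sample.tsv",  # placeholder file name
    sep="\t",
    quoting=3,
    on_bad_lines="skip",
)

# Drop records with null values and any duplicate records.
df = df.dropna().drop_duplicates()

# Convert the review_date column to a datetime format.
df["review_date"] = pd.to_datetime(df["review_date"], errors="coerce")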
The dataset has 6,900,413 records, with reviews dated from 1995 to 2015.
The reviews span various categories such as Books, Music, Video, Toys, Tools, Office Products, Video Games, Software, Digital Purchases, and Electronics.
5-star ratings account for the highest number of reviews.
The labeled dataset has 21,000 records, with each review labeled as biased or unbiased.
The number of biased and unbiased reviews is equal, i.e., 10,500 records each.
There are 700 reviews in each product category, with 350 biased and 350 unbiased reviews.
As in the full dataset, 5-star ratings account for the highest number of reviews.
Text from the product title, review body, and review headline needs to be cleaned before being passed to the model (a sketch of this cleaning step follows below)
The text was converted to lower case, and URLs, mentions, hashtags, punctuation, and emojis were removed
Numbers, special characters, single characters, and extra spaces were also removed
Stopwords have not been removed, since in a few cases removing them would change the positive or negative intent of the review
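A minimal sketch of this cleaning step, assuming a simple regex-based approach; the exact patterns used in the project may differ.

import re

def clean_text(text: str) -> str:
    """Lower-case the text and strip URLs, mentions, hashtags, punctuation,
    emojis/non-ASCII symbols, numbers, single characters, and extra spaces.
    Stopwords are intentionally kept."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)             # mentions and hashtags
    text = text.encode("ascii", "ignore").decode()   # emojis / non-ASCII
    text = re.sub(r"[^a-z\s]", " ", text)            # punctuation, numbers, specials
    text = re.sub(r"\b[a-z]\b", " ", text)           # single characters
    return re.sub(r"\s+", " ", text).strip()         # extra spaces

# Example
clean_text("Check https://example.com - it's GREAT!!! #ad 5 stars :)")
# -> "check it great stars"

# Applied to the three text columns:
for col in ["product_title", "review_headline", "review_body"]:
    df[col] = df[col].astype(str).apply(clean_text)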
The dataset has been split into 80% training and 20% testing
The cleaned product title, review body, and review headline have been vectorized using TfidfVectorizer for text feature extraction
Review rating, verified purchase, and product category have been transformed into a one-hot numeric array using OneHotEncoder for categorical feature extraction
The hstack() function is used to stack arrays in sequence horizontally (column-wise)
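A minimal sketch of this split and feature-extraction step, assuming scipy.sparse.hstack, a cleaned labeled DataFrame called labeled_df, and a binary label column named label (both names are assumptions).

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# 80/20 split of the labeled dataset.
train_df, test_df = train_test_split(labeled_df, test_size=0.2, random_state=42)

# TF-IDF features for each cleaned text column.
text_cols = ["product_title", "review_headline", "review_body"]
vectorizers = {col: TfidfVectorizer() for col in text_cols}
train_text = [vectorizers[c].fit_transform(train_df[c]) for c in text_cols]
test_text = [vectorizers[c].transform(test_df[c]) for c in text_cols]

# One-hot encoding of the categorical columns.
cat_cols = ["star_rating", "verified_purchase", "product_category"]
encoder = OneHotEncoder(handle_unknown="ignore")
train_cat = encoder.fit_transform(train_df[cat_cols])
test_cat = encoder.transform(test_df[cat_cols])

# Stack all sparse feature blocks horizontally (column-wise).
X_train = hstack(train_text + [train_cat])
X_test = hstack(test_text + [test_cat])
y_train, y_test = train_df["label"], test_df["label"]  # assumed label column name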
I have used Logistic Regression, Decision Tree, Gradient Boosting, Random Forest, Passive Aggressive, Multinomial Naive Bayes, and Linear Support Vector classifiers to predict whether a review is biased or unbiased
The accuracy of each of these models was measured on the 20% test split.
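A minimal sketch of training these classifiers and measuring their test accuracy; default hyperparameters are an assumption, and the model keys mirror the prediction column names used further below.

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic_Regression": LogisticRegression(max_iter=1000),
    "Decision_Tree_Classification": DecisionTreeClassifier(),
    "Gradient_Boosting_Classifier": GradientBoostingClassifier(),
    "Random_Forest_Classifier": RandomForestClassifier(),
    "PassiveAggressiveClassifier": PassiveAggressiveClassifier(),
    "MultinomialNB": MultinomialNB(),
    "SVM": LinearSVC(),
}

# Fit each classifier on the training features and report its test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.3f}")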
Before making predictions on the reviews from the Amazon Customer Reviews dataset, that dataset also needs to be pre-processed and cleaned. Text from the product title, review body, and review headline has been cleaned in the same way as for the training dataset, and the cleaned dataset has been stored on Google Drive.
Additionally, I have:
converted the star_rating from float to int datatype to keep it in sync with the training dataset
There are more product categories in the unlabeled Amazon review dataset than in the labeled Amazon review dataset. Since the labeled dataset was used for training, I have dropped the product categories from the unlabeled dataset that are not present in the labeled dataset.
The final cleaned unlabeled Amazon review dataset has 2,208,111 rows. I then make predictions on this dataset using the models trained on the labeled Amazon review dataset.
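A minimal sketch of this prediction step, reusing the fitted vectorizers, encoder, and models from the training stage; the DataFrame name unlabeled_df is an assumption.

from scipy.sparse import hstack

# Transform the cleaned unlabeled dataset with the vectorizers and encoder
# that were fitted on the labeled training data, then predict with each model.
unlabeled_text = [vectorizers[c].transform(unlabeled_df[c]) for c in text_cols]
unlabeled_cat = encoder.transform(unlabeled_df[cat_cols])
X_unlabeled = hstack(unlabeled_text + [unlabeled_cat])

for name, model in models.items():
    unlabeled_df[f"{name}_Pred"] = model.predict(X_unlabeled)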
Review Text: "I have to say, that when Stephen King wrote this book he truly outdid himself. Stephen King enters you into the world of seven children that are bonded by destiny to fight evil itself in the form of a clown that haunts them and kills other children. After they thought they had killed IT, 20 something years later, IT comes back and they are drawn back to their hometown to face the same evil they faced as children but now they are adults and actually don't have their innocence and imagination, which were their advantages over IT. You'll be hooked until the end until one of the forces wins, IT or the force that draws the children (now adults) toghether."
Logistic_Regression_Pred 1 - Unbiased
Decision_Tree_Classification_Pred 0 - Biased
Gradient_Boosting_Classifier_Pred 1 - Unbiased
Random_Forest_Classifier_Pred 1 - Unbiased
PassiveAggressiveClassifier_Pred 0 - Biased
MultinomialNB_Pred 1 - Unbiased
SVM_Pred 1 - Unbiased
Review Text: "its cool"
Logistic_Regression_Pred 0 - Biased
Decision_Tree_Classification_Pred 0 - Biased
Gradient_Boosting_Classifier_Pred 0 - Biased
Random_Forest_Classifier_Pred 0 - Biased
PassiveAggressiveClassifier_Pred 1 - Unbiased
MultinomialNB_Pred 0 - Biased
SVM_Pred 0 - Biased
Out of the 2.2 million unlabeled Amazon reviews used for prediction, my models have predicted approximately 650K unbiased reviews and 1.5 million biased reviews. Based on the accuracy measured on the labeled test split, these predictions are expected to be roughly 80% accurate.
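For reference, a small sketch of how these per-model counts can be tallied from the prediction columns, assuming 0/1 labels as shown above.

# Summarize how many reviews each model marked as biased (0) or unbiased (1).
for name in models:
    counts = unlabeled_df[f"{name}_Pred"].value_counts()
    print(name, "Biased:", counts.get(0, 0), "Unbiased:", counts.get(1, 0))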
References:
[1] Fake Amazon reviews 'being sold in bulk' online. (2021, February 16). Retrieved February 20, 2021, from https://www.bbc.com/news/business-56069472
[2] Manskar, N. (2021, February 16). Fake Amazon reviews are being sold in bulk online. Retrieved February 20, 2021, from https://nypost.com/2021/02/16/fake-amazon-reviews-are-being-sold-in-bulk-online/
[3] Picchi, A. (2019, February 28). Buyer Beware: Scourge of fake reviews hitting Amazon, Walmart and other major retailers. Retrieved February 22, 2021, from https://www.cbsnews.com/news/buyer-beware-a-scourge-of-fake-online-reviews-is-hitting-amazon-walmart-and-other-major-retailers/
[4] W. Liu, J. He, S. Han, F. Cai, Z. Yang and N. Zhu, "A Method for the Detection of Fake Reviews Based on Temporal Features of Reviews and Comments," in IEEE Engineering Management Review, vol. 47, no. 4, pp. 67-79, Fourthquarter 2019, doi: 10.1109/EMR.2019.2928964.
[5] P. Liu, Z. Xu, J. Ai and F. Wang, "Identifying Indicators of Fake Reviews Based on Spammer's Behavior Features," 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), Prague, 2017, pp. 396-403, doi: 10.1109/QRS-C.2017.72.
[6] S. K. Chauhan, A. Goel, P. Goel, A. Chauhan and M. K. Gurve, "Research on product review analysis and spam review detection," 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, 2017, pp. 390-393, doi: 10.1109/SPIN.2017.8049980.
[7] Zhang, K. (n.d.). How to detect fake online reviews using machine learning. Retrieved February 23, 2021, from https://scoredata.com/how-to-detect-fake-online-reviews-using-machine-learning-2/
[8] Amazon Customer Reviews Dataset. (n.d.). Retrieved February 20, 2021, from https://s3.amazonaws.com/amazon-reviews-pds/readme.html