ARJUN REDDY SEELAM

SENTIMENT ANALYSIS ON YELP DATASET REVIEWS

Link to GitHub - Code files and Datasets

PHASE I:

INTRODUCTION:

Now that information is accessible at our fingertips, people who want to know which restaurants are good or bad turn to online reviews. These reviews have a huge impact on restaurants, because they can lead a customer to choose one restaurant or go to another instead. In effect, online reviews can make or break a business depending on how positive they are. Feedback also helps restaurants identify changes that need to be made and improve the quality of their food. Sorting handwritten reviews manually and working out which restaurants are well reviewed takes many hours. With AI tools this is no longer so time consuming: we use sentiment analysis to gather a large amount of text and train a model on it. Sentiment analysis classifies customer feedback as positive, negative or neutral and provides insight into how customers feel about the food.

PROJECT AIM:

The aim of this project is to identify the sentiment of customers who visited the restaurants, based on the Yelp data. This includes:

All reviews in the Yelp data carry a star rating from 1 to 5 (1 being the lowest and 5 the highest). Converting these numeric ratings into sentiment through model development helps us understand the sentiment of the reviews.
Analyzing the trend by identifying the restaurants in a particular area whose reviews have become more positive over time.
The conclusions drawn from the trend analysis help restaurants gain insight into how the business is doing and make changes if necessary.
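
As a minimal sketch of the first aim, star ratings could be mapped to coarse sentiment labels as follows. The function name stars_to_sentiment and the cut-offs (1-2 negative, 3 neutral, 4-5 positive) are assumptions made for illustration; the report does not state which thresholds, if any, were used.

```python
def stars_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    The cut-offs (1-2 negative, 3 neutral, 4-5 positive) are an assumed
    convention for illustration, not thresholds taken from this project.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating must be between 1 and 5, got {stars}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


labels = [stars_to_sentiment(s) for s in (1, 2, 3, 4, 5)]
print(labels)
```

A three-way label like this is easier to predict than the five individual star classes, which is one reason it is a common simplification.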

RESEARCH QUESTION:

Is there any improvement in a restaurant's ratings over time, based on the negative review trend?
Did the restaurants really take the reviews into consideration to make changes in the business?

LITERATURE SURVEY:

Sentiment analysis, also known as emotion AI, refers to the use of NLP and text analysis to identify and extract the information needed to determine emotions. It is widely applied to gauge how people feel when they express their opinions, for example in product reviews on e-commerce websites, opinions on social media, or feedback on the food from a restaurant. Sentiment analysis is often aspect based: a restaurant review may cover aspects such as the quality of the food, the behavior of the staff and the ambiance of the restaurant.

Presentation Video for Phase - 1

DATASET:

The Yelp review data was downloaded from the Yelp dataset page. It contains several files, including a business file and a review file.
Both files are merged and formatted into columns. As the data contains more than 1M records, only the first 100,000 rows are used for modeling.
Source of Data: https://www.yelp.com/dataset

Link for dataset - GitHub Repository

PHASE II:

Merging multiple files to form the data:

The Yelp dataset contains multiple JSON files; this project requires only two of them (review.json and business.json). After converting each JSON file into a data frame, the attributes of the files are:
- review.json: review_id, user_id, business_id, stars, useful, funny, cool, text, date
- business.json: business_id, name, address, city, state, postal_code, latitude, longitude, stars, review_count, is_open, attributes, categories, hours

Using json.loads, convert the JSON files into readable data frames.
As both data frames have a common column, "business_id", join the two data frames on that column.
After joining, the final dataset used for the project is obtained.
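
The loading and joining steps above can be sketched as follows. This assumes pandas and one JSON object per line in each file. The helper load_json_lines, the in-memory sample records and the renamed columns review_stars and business_stars are illustrative choices, not the project's actual code.

```python
import io
import json

import pandas as pd


def load_json_lines(handle):
    """Parse a file with one JSON object per line into a DataFrame."""
    return pd.DataFrame([json.loads(line) for line in handle if line.strip()])


# Tiny in-memory stand-ins for review.json and business.json (made-up records).
review_file = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "Great food"}\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 2, "text": "Slow service"}\n'
)
business_file = io.StringIO(
    '{"business_id": "b1", "name": "Cafe A", "stars": 4.5, "categories": "Restaurants"}\n'
    '{"business_id": "b2", "name": "Diner B", "stars": 3.0, "categories": "Restaurants"}\n'
)

# Both files have a "stars" column, so rename each before joining (assumed naming).
reviews = load_json_lines(review_file).rename(columns={"stars": "review_stars"})
businesses = load_json_lines(business_file).rename(columns={"stars": "business_stars"})

# Join every review to its business on the shared business_id column.
merged = reviews.merge(businesses, on="business_id", how="inner")
print(merged[["review_id", "name", "review_stars", "business_stars"]])
```

Renaming the two "stars" columns before the join keeps the per-review rating separate from the business's overall rating.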

Performing Exploratory Data Analysis on Final Data:

To perform the initial exploratory data analysis, copy the review_id, text and review_stars columns into a separate data frame. Then perform feature extraction, which reduces the amount of vocabulary needed to describe a large data set. Feature extraction includes the following functions:

i. total_len( ): Finds the total length of a review.

ii. stop_len( ): Counts the stop words in a review. Stop words are the most common words in a language, such as "the", "is", "in", "when" and "where", which add little value when building a model.

iii. lemma_len( ): Applies lemmatization, which aims to remove inflectional endings only and return the base or dictionary form of a word.

iv. Cleansing: The cleansing function removes blank spaces, quotation marks, URLs and special characters that are not required for analysis.

v. special_len( ): Counts the special characters in a review.

vi. pos_tags( ): The POS (part-of-speech) tagger tags each sentence and returns a dictionary of POS tags.

vii. get_polarity( ): Polarity is a float in the range [-1, 1], where 1 indicates a positive statement and -1 a negative statement.

viii. get_subjectivity( ): Subjective sentences generally express personal opinion, emotion or judgment, whereas objective sentences convey factual information. For example, a polarity of 0.8 means that the statement is positive, and a subjectivity of 0.75 means that it is mostly personal opinion rather than factual information.

ix. word_count( ): Total number of words in a review after cleaning.

x. unique_words( ): Count of unique words in a review.

xi. mean_word_len( ): Average length of a word in a review.
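
Several of the length-based features listed above can be sketched with the standard library alone. The function review_features, the tiny STOP_WORDS set and the choice to measure total_len in characters are assumptions for illustration. Lemmatization, POS tagging, polarity and subjectivity need an NLP library and are left out.

```python
import re

# A small illustrative stop-word set; a full list from an NLP library would
# normally be used, which is not reproduced here.
STOP_WORDS = {"the", "is", "in", "when", "where", "a", "an", "and", "was", "it"}


def review_features(text: str) -> dict:
    """Compute simple length-based features for one review."""
    words = re.findall(r"[a-z']+", text.lower())
    special = re.findall(r"[^A-Za-z0-9\s]", text)
    return {
        "total_len": len(text),  # length in characters (assumed definition)
        "stop_len": sum(1 for w in words if w in STOP_WORDS),
        "special_len": len(special),
        "word_count": len(words),
        "unique_words": len(set(words)),
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


features = review_features("The food was great, and the staff was friendly!")
print(features)
```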

From the dataset, we can observe that 2-star ratings make up the smallest share of reviews and 5-star ratings the largest.

The reviews consist of:

14% 1-star reviews
8% 2-star reviews
11% 3-star reviews
23% 4-star reviews
44% 5-star reviews

Observation:

More than 65% of customers rate the restaurant 4-star or 5-star when they are satisfied. When they are not satisfied, they tend to rate it 1-star rather than 2-star or 3-star.

Turning to review length: after cleaning the review text, the average review length in words is computed for each star rating.

1-star reviews have 133 words
2-star reviews have 132 words
3-star reviews have 127 words
4-star reviews have 110 words
5-star reviews have 87 words

Observation:

After cleaning with functions such as stop-word removal, the average length of 5-star reviews is much lower than that of every other category, and the average review length decreases as the rating improves.
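
Per-star summaries of this kind (share of reviews and average word count per rating) can be computed along the following lines. The sample reviews are made up and the variable names are illustrative; this is a sketch of the calculation, not the project's code.

```python
from collections import defaultdict

# Made-up (stars, cleaned_text) pairs standing in for the review data.
reviews = [
    (5, "great food"),
    (5, "loved it"),
    (4, "good place friendly staff"),
    (1, "cold food rude staff long wait never again"),
]

# Collect the word count of every review under its star rating.
word_counts_by_star = defaultdict(list)
for stars, text in reviews:
    word_counts_by_star[stars].append(len(text.split()))

total = len(reviews)
summary = {
    stars: {
        "share_pct": 100 * len(counts) / total,
        "avg_words": sum(counts) / len(counts),
    }
    for stars, counts in sorted(word_counts_by_star.items())
}
for stars, row in summary.items():
    print(f"{stars}-star: {row['share_pct']:.0f}% of reviews, "
          f"{row['avg_words']:.1f} words on average")
```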

In the process of cleaning the review text, removing stop words helps us extract the relevant text content for analysis. Based on the graph, 5-star reviews contain fewer stop words on average than 1-star and 2-star reviews.

1-star reviews have 63 stop words
2-star reviews have 64 stop words
3-star reviews have 61 stop words
4-star reviews have 55 stop words
5-star reviews have 43 stop words

The figures above are the average number of stop words for each star rating. From them, we can observe that the highest-rated reviews contain the fewest stop words.

Model Type and Initial Outcome:

1. LightGBM (Light Gradient Boosting Machine):

As the dataset is very large, I used the LightGBM modeling technique. LightGBM is a high-performance gradient boosting framework based on decision tree algorithms that can be used for ranking, classification and a variety of other machine learning tasks. It grows trees leaf-wise, splitting the leaf with the best fit, whereas many other boosting algorithms grow trees depth-wise (level-wise). As a consequence, when growing the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which often results in higher accuracy. It is also fast, hence the name "Light." Leaf-wise splits increase complexity and can result in overfitting; this can be mitigated by setting the max_depth parameter, which limits the depth to which splitting can occur.

It is designed to be distributed and efficient, and it offers the following benefits:

Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm that buckets continuous feature values into discrete bins to speed up training.
Lower memory use: Continuous values are replaced with discrete bins, resulting in lower memory usage.
Better accuracy: It uses a leaf-wise rather than a level-wise split approach to generate far more complex trees, which is the key factor in achieving higher accuracy. This can, however, lead to overfitting, which can be mitigated by limiting the max_depth parameter.
Compatibility with large datasets: Compared with XGBoost, it performs equally well on large datasets while taking significantly less training time.
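
The idea of bucketing continuous feature values into discrete bins can be illustrated with a small sketch. The function bucketize and its equal-width binning are simplifying assumptions; LightGBM's own bin construction is more elaborate and is not reproduced here.

```python
def bucketize(values, n_bins):
    """Map continuous values to integer bin indices using equal-width bins.

    This only illustrates the idea behind histogram-based splitting; it is
    not LightGBM's actual binning procedure.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


review_lengths = [12.0, 87.0, 110.0, 133.0, 250.0, 400.0]
bins = bucketize(review_lengths, n_bins=4)
print(bins)
```

Once values are binned, a split search only has to consider the bin boundaries rather than every distinct feature value, which is where the speed and memory savings come from.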

Applying the LightGBM model to the individual star ratings and training it for 20,000 iterations, we observed that the training multi_loss value decreased gradually from 1.007 to 0.199 and the validation multi_loss decreased from 1.105 to 0.794.
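
To make the loss values easier to interpret, here is a minimal sketch of a multi-class log loss. It assumes that the reported multi_loss is the usual multi-class logarithmic loss. The function name multi_logloss and the sample probabilities are illustrative.

```python
import math


def multi_logloss(y_true, probabilities):
    """Average negative log-probability assigned to the true class."""
    eps = 1e-15  # guard against log(0)
    losses = [-math.log(max(p[label], eps)) for label, p in zip(y_true, probabilities)]
    return sum(losses) / len(losses)


# Classes 0..4 stand for 1..5 stars. A model that always predicts a uniform
# distribution over five classes scores ln(5), roughly 1.609.
uniform = [[0.2] * 5 for _ in range(3)]
uniform_loss = multi_logloss([4, 0, 3], uniform)
print(round(uniform_loss, 3))

# More confident, correct predictions drive the loss towards zero.
confident = [[0.025, 0.025, 0.025, 0.025, 0.9]] * 3
confident_loss = multi_logloss([4, 4, 4], confident)
print(round(confident_loss, 3))
```

Under this assumption, a gap between a low training loss and a noticeably higher validation loss is the usual sign of overfitting.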

In the actual vs. predicted cross tab obtained with a 20% test split, 7,952 reviews were actual 5-star reviews that were also predicted as 5-star. The evaluation metrics are:

Accuracy: 53.42 %

Precision: 44.44 %

Recall: 53.42 %

F1 - Score: 44.63 %
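
The reported recall equals the reported accuracy, which is what support-weighted averaging over classes produces. The sketch below computes the four metrics from a cross tab under that assumption. The function weighted_metrics and the 3-class table are illustrative and are not the project's results.

```python
def weighted_metrics(confusion):
    """Accuracy and support-weighted precision, recall and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for i in range(n_classes):
        true_pos = confusion[i][i]
        support = sum(confusion[i])                    # samples actually in class i
        predicted = sum(row[i] for row in confusion)   # samples predicted as class i
        p = true_pos / predicted if predicted else 0.0
        r = true_pos / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# A made-up 3-class cross tab, not the project's results.
cross_tab = [
    [5, 1, 4],
    [2, 1, 7],
    [1, 0, 29],
]
accuracy, precision, recall, f1 = weighted_metrics(cross_tab)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With weighted averaging, each class's recall is multiplied by its share of the samples, so the weighted recall reduces to the fraction of correct predictions, which is the accuracy.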

Presentation Video for Phase - 2

PHASE III:

Model 2: CatBoost Classifier Model:

CatBoost is an open-source machine learning algorithm from Yandex. It can work with a variety of data types to help companies solve a variety of problems, and it is reported to offer best-in-class accuracy. The name CatBoost is derived from "Category" and "Boosting". "Category" reflects the library's built-in handling of categorical data, and it is described as working with several kinds of data, such as audio, text, images and historical data. "Boost" comes from the gradient boosting machine learning algorithm on which the library is based. Gradient boosting is a powerful machine learning algorithm that has been successfully applied to a variety of business problems such as fraud detection and recommendations.

Advantages:

i. Performance: It delivers results competitive with some of the leading machine learning algorithms.

ii. Handling different categorical data: CatBoost transforms categorical values into numerical values using a variety of statistics computed on categorical features and on combinations of categorical and numerical features.

Applying the CatBoost classifier to the individual star ratings and training it for 10,000 iterations, we observed that the training loss decreased from 1.599 to 1.011.

In the actual vs. predicted cross tab obtained with a 20% test split, 7,833 reviews were actual 5-star reviews that were also predicted as 5-star. The evaluation metrics are:

Accuracy: 53.63 %

Precision: 45.80 %

Recall: 53.63 %

F1 - Score: 46.01 %

Model 3: Linear SVM Model:

Linear SVM is a machine learning (data mining) algorithm for solving multi-class classification problems on extremely large data sets. LinearSVM is a linearly scalable routine, meaning that the CPU time needed to build the SVM model scales linearly with the size of the training data set. It uses a patented version of a cutting plane algorithm for designing a linear support vector machine.

Advantages and features of Linear SVM:

i. It is efficient in dealing with larger datasets.

ii. LinearSVM can work with thousands of features and attributes on a large dataset.

iii. It runs on personal laptops with high computation speed.

Applying Text Classification: Text classification is the process of categorizing text into organized categories. Text classifiers use Natural Language Processing (NLP) to automatically analyze text and assign predefined tags or categories based on its content. For sentiment analysis, text classification determines whether a text is positive or negative about a given subject. Using POS tagging, nouns and verbs are removed along with stop words.

Applying TF-IDF Vectorizer: A popular method for weighting word frequencies is TF-IDF, which stands for Term Frequency - Inverse Document Frequency. The resulting score is assigned to every word. With this vectorization, we set the maximum number of features to extract and then perform linear modeling.
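
The TF-IDF weighting can be sketched in its textbook form, tf * log(N / df). The function tf_idf and the sample documents are illustrative. Library vectorizers typically add smoothing and normalization, so their scores will differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized, non-empty document.

    Uses the textbook form tf * log(N / df); library implementations
    typically add smoothing and normalization on top of this.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


docs = [
    "great food great service".split(),
    "terrible food".split(),
    "great ambiance".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "service" scores higher than "food" even though both occur once, because "service" appears in fewer documents overall.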

After applying vectorization and fitting the Linear SVM model, an accuracy of 65.17% is achieved on the entire data. This is better than the previous two techniques.
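
A vectorize-then-classify workflow of this kind might look like the sketch below. It assumes scikit-learn's TfidfVectorizer and LinearSVC, which the report does not name explicitly. The sample reviews, labels and the max_features value are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up reviews and star labels, only to show the shape of the workflow.
texts = [
    "amazing food and friendly staff",
    "best meal ever, amazing service",
    "terrible food and rude staff",
    "worst service, terrible experience",
]
stars = [5, 5, 1, 1]

# max_features caps the vocabulary size; 5000 is an arbitrary placeholder.
model = make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC())
model.fit(texts, stars)

predictions = model.predict(["amazing staff", "terrible service"])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline ensures that the vocabulary is learned from the training split only and then reused unchanged at prediction time.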

With approximately 65% accuracy from the LinearSVM, the actual vs. predicted table for the 30% test split shows 11,335 reviews that are both actual and predicted 5-star ratings, which accounts for almost 40%. This is likely because it is easier to identify the sentiment of people who give a 5-star rating than that of people who give other star ratings.

Applying LinearSVM to restaurant data

Applying the same LinearSVM to the restaurant data, the accuracy is approximately 3% lower than when the model is applied to the whole data.

In the actual vs. predicted table for the restaurant category, out of 20,050 reviews, 6,685 are actual 5-star reviews that were predicted correctly, which accounts for about 33% of the entire test dataset.

Research Question 1: Is there any improvement in a restaurant's ratings over time, based on the negative review trend?

For a given business ID from the list of restaurants, the ratings of the restaurant with business ID rcaPajgKOJC2vo_l3xa42A are shown on the left side. There was a 2-star review in 2009 and a 1-star review in 2013 but, on the whole, after 2009 there are more 5-star and 4-star ratings. So, there is an improvement.

Research Question 2: Did the restaurants really take the reviews into consideration to make changes in the business?

Yes. Looking at the graph on the left side, we can say that the restaurants have taken the reviews seriously to improve their business, as the yearly trend stays roughly constant between 3.8 and 4.0.
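
A yearly rating trend for one business can be computed as in the sketch below. The sample reviews are made up, and the "YYYY-MM-DD HH:MM:SS" date layout is an assumption about the review file's date column.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (date, stars) pairs for a single business; the date layout
# "YYYY-MM-DD HH:MM:SS" is assumed for the review file's date column.
reviews = [
    ("2009-03-14 18:02:11", 2),
    ("2009-11-02 12:40:05", 4),
    ("2011-06-21 20:15:00", 5),
    ("2013-01-09 13:05:42", 1),
    ("2013-08-30 19:22:10", 5),
    ("2013-12-24 17:45:00", 5),
]

# Group the star ratings by the year in which each review was written.
stars_by_year = defaultdict(list)
for date_text, stars in reviews:
    year = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S").year
    stars_by_year[year].append(stars)

yearly_mean = {
    year: sum(values) / len(values) for year, values in sorted(stars_by_year.items())
}
for year, mean in yearly_mean.items():
    print(year, round(mean, 2))
```

Averaging by year smooths out individual low ratings, so a single 1-star review in an otherwise well-rated year has only a limited effect on the trend line.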

Video Presentation for Phase 3

Future Scope:

• As there is limited scope for modeling individual star ratings, location and zip code can be considered along with the ratings and reviews to improve the accuracy based on location.

• Try various modeling techniques to improve the accuracy.

• Build a neural network with a 1D CNN on top of an LSTM layer, using bidirectional recurrent layers, to improve the accuracy.

REFERENCES:

H. S. and R. Ramathmika, "Sentiment Analysis of Yelp Reviews by Machine Learning," 2019 International Conference on Intelligent Computing and Control Systems (ICCS), Madurai, India, 2019, pp. 700-704, doi: 10.1109/ICCS45141.2019.9065812.
How to Perform Sentiment Analysis on Yelp Restaurant Reviews. (2020, October 14). MonkeyLearn Blog. https://monkeylearn.com/blog/yelp-sentiment-analysis/
TextBlob Sentiment: Calculating Polarity and Subjectivity. (2015, June 7). planspace.org. https://planspace.org/20150607-textblob_sentiment/
GeeksforGeeks. (2020, November 30). Read JSON file using Python. https://www.geeksforgeeks.org/read-json-file-using-python/
Khandelwal, P. (2020, March 27). Which algorithm takes the crown: Light GBM vs XGBOOST? Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
https://github.com/Gunjitbedi/Text-Classification/blob/master/Topic%20Classification.py
Ray, S. (2017, August 14). CatBoost: A machine learning library to handle categorical (CAT) data automatically. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/
