This project uses Natural Language Processing (NLP) techniques to classify 120,000 Yelp reviews into sentiment categories: positive, negative, or neutral. It focuses on training machines to perform this classification through two prominent NLP techniques: Bag of Words (BoW) and Text Embedding. The results give businesses actionable insights from customer feedback.
The dataset can be found here: https://www.yelp.com/dataset
Direct dataset download links: Train Dataset, Test Dataset
Tools & Technologies: Python, Natural Language Toolkit (NLTK), scikit-learn, Google Colab
Note: In this project, special attention was given to the balance between accuracy and computational efficiency. Due to the large size of the dataset (120,000 Yelp reviews) and the need to operate within a 30-minute time constraint, the project was optimized for GPU usage. This required careful selection and tuning of models and preprocessing methods to ensure high accuracy while adhering to the time limit.
Challenges:
Data pre-processing (this step takes the most time)
Ensuring the quality of input data through effective preprocessing is crucial. Given the diverse and unstructured nature of Yelp reviews, the challenge is to clean the text efficiently (e.g., removing punctuation, lowercasing, tokenization, stemming, and removing stop words) to prepare high-quality input for both BoW and Text Embedding methods, while making sure the cleaning does not hurt the accuracy of the text embedding method. This process must be optimized to preserve meaningful context while reducing noise in the data.
Model and feature selection is crucial: we want high accuracy while avoiding over-fitting the data.
Validating with Stratified K-Fold cross-validation and fine-tuning the models over many runs.
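The validation strategy described above can be sketched in isolation. The snippet below is illustrative only: it uses synthetic features in place of the vectorized reviews, and the classifier settings are not the project's tuned pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in for the vectorized review features (3 classes, like the sentiment labels)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)

# Stratified folds keep the class ratio identical in every split,
# which matters when sentiment classes are imbalanced
cv = StratifiedKFold(n_splits=6)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

Each fold trains on five sixths of the sample and evaluates on the held-out sixth; reporting mean plus/minus two standard deviations gives a quick sense of variance across folds.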
Method 1: Bag of Words (BoW) - 83% Accuracy
Data Preprocessing: Implemented text cleaning techniques including lowercasing, punctuation removal, tokenization, stemming, and stop words removal, ensuring high-quality input for the model.
Feature Extraction: Utilized CountVectorizer to transform the processed text into numerical feature vectors suitable for machine learning.
Model Selection & Validation: Explored various classifiers like Logistic Regression, MLPClassifier, RandomForestClassifier, etc., and validated them using Stratified K-Fold cross-validation.
Feature Selection: Improved model performance by integrating feature selection methods such as SelectKBest and tree-based methods.
Model Training & Evaluation: Trained models on a sample of the data, fine-tuning to achieve optimal accuracy. Evaluated different combinations of preprocessing, feature extraction, and classifiers.
Final Model & Prediction: Selected the best-performing model based on cross-validation accuracy. Trained it on the entire dataset and predicted sentiments on a separate test set.
Outcome: Successfully developed a sentiment analysis model, providing insights into customer opinions, with potential applications in business decision-making and strategy.
Result File: Here
Link to the actual code: Here
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from google.colab import drive
import nltk
from joblib import Parallel, delayed
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from google.colab import files
from sklearn.metrics import accuracy_score
# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
drive.mount('/content/drive')
# Preprocessing function for text data
def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)
train_data = pd.read_csv("/content/drive/MyDrive/Project/train_yelp_60k.csv") # Load training dataset
test_data = pd.read_csv("/content/drive/MyDrive/Project/test_yelp_60k.csv") # Load test dataset
# Sample 20% of the data with the same class distribution for training, so the sample is a good representative of the whole dataset.
X_sample, _, y_sample, _ = train_test_split(
    train_data['Text'], train_data['Class'], stratify=train_data['Class'],
    test_size=0.8, random_state=42
)
# Importance of pre-processing: start
def preprocess_text_custom(text, lower=True, remove_punc=True, tokenize=True, stem=True, remove_stop=True):
    if lower:
        text = text.lower()
    if remove_punc:
        text = re.sub(r'[^a-zA-Z\s]', '', text)
    if tokenize:
        tokens = word_tokenize(text)
    else:
        tokens = text.split()
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(token) for token in tokens]
    if remove_stop:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)
preprocessing_conf = [
    {'lower': True, 'remove_punc': True, 'tokenize': True, 'stem': True, 'remove_stop': True},
    {'lower': False, 'remove_punc': True, 'tokenize': True, 'stem': True, 'remove_stop': True},
    {'lower': True, 'remove_punc': False, 'tokenize': True, 'stem': True, 'remove_stop': True},
    {'lower': True, 'remove_punc': True, 'tokenize': True, 'stem': False, 'remove_stop': True},
    {'lower': True, 'remove_punc': True, 'tokenize': True, 'stem': True, 'remove_stop': False},
]
cv = StratifiedKFold(n_splits=6)
for config in preprocessing_conf:
    X_sample_preprocessed = [preprocess_text_custom(text, **config) for text in X_sample]
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        #('feature_selection', SelectKBest(f_classif, k=500)),
        ('classifier', LogisticRegression(max_iter=1000))
    ])
    cv_scores = cross_val_score(pipeline, X_sample_preprocessed, y_sample, cv=cv, scoring='accuracy', n_jobs=-1)
    print(f"Configuration: {config}")
    print(f"Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})\n")
# Importance of pre-processing: end
# Apply preprocessing to each text document in the sample
X_sample_preprocessed = [preprocess_text(text) for text in X_sample]
y_sample = y_sample.values
# Set up cross-validation scheme
cv = StratifiedKFold(n_splits=6)
# Define classifiers
classifiers = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'MLPClassifier': MLPClassifier(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier()
}
# Define vectorizers (CountVectorizer here; TfidfVectorizer is an alternative)
vectorizers = {
    'CountVectorizer': CountVectorizer(stop_words=None)  # stop words already handled in preprocess function
}
# Define feature selection methods
# tuning k to 500
feature_selectors = {
    'f_classif': SelectKBest(f_classif, k=500),
    'chi2': SelectKBest(chi2, k=500),
    'Tree_based': SelectFromModel(ExtraTreesClassifier(n_estimators=50))
}
# Train and evaluate models with feature selection
for vec_name, vectorizer in vectorizers.items():
    for fs_name, feature_selector in feature_selectors.items():
        for clf_name, classifier in classifiers.items():
            pipeline = Pipeline([
                (vec_name, vectorizer),
                (fs_name, feature_selector),
                (clf_name, classifier)
            ])
            # Perform stratified cross-validation and print results
            # Adjust n_jobs to use multiple cores
            scores = cross_val_score(pipeline, X_sample_preprocessed, y_sample, cv=cv, scoring='accuracy', n_jobs=-1)
            print(f"{vec_name} + {fs_name} + {clf_name} Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Additional steps to train the best model and predict the test set would follow here.
# Initialize the best feature selector and classifier as per your findings
best_feature_selector = SelectKBest(f_classif, k=500)
best_classifier = LogisticRegression(max_iter=1000)
# Create the pipeline with the best feature selection method and classifier
best_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_selection', best_feature_selector),
    ('classifier', best_classifier)
])
# Fit the pipeline to the entire training dataset
X_train_preprocessed = [preprocess_text(text) for text in train_data['Text']]
y_train = train_data['Class']
best_pipeline.fit(X_train_preprocessed, y_train)
# preprocess the test dataset
X_test_preprocessed = [preprocess_text(text) for text in test_data['Text']]
# Make predictions on the preprocessed test dataset
predictions = best_pipeline.predict(X_test_preprocessed)
# Combine the IDs from the test dataset with the predictions
prediction1 = pd.DataFrame({'ID': test_data['ID'], 'Class': predictions})
# Save the predictions to a CSV file
prediction1.to_csv('prediction1.csv', index=False, header=True)
# Download the file in Google Colab
files.download('prediction1.csv')
Method 2: Text Embeddings - 85% Accuracy
Data Preprocessing: Applied minimal preprocessing, mainly lowercasing, to maintain the integrity of the text for deep learning models.
Embedding Generation: Utilized 'bert-base-nli-mean-tokens' model from SentenceTransformers to create high-dimensional embeddings from Yelp reviews.
Model Selection & Validation: Explored various classifiers (Logistic Regression, MLPClassifier, RandomForestClassifier, SGDClassifier) and evaluated them using Stratified K-Fold cross-validation.
Feature Scaling: Standardized embeddings using StandardScaler to improve model performance.
Model Training & Evaluation: Trained classifiers on scaled embeddings and assessed performance using accuracy metrics.
Final Model & Prediction: Chose the best classifier based on cross-validation results. Trained it on the entire dataset and used it to predict sentiments on the test set.
Outcome: Successfully developed and deployed a text embedding-based sentiment analysis model, demonstrating an advanced approach to understanding customer feedback in reviews.
Result File: Here
Link to the actual code: Here
!pip install -q sentence-transformers
# Text embedding script with minimal text preprocessing (no pre-embedding feature selection needed)
# Import necessary libraries
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sentence_transformers import SentenceTransformer
import nltk
from sklearn.model_selection import train_test_split
from google.colab import drive
from google.colab import files
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Preprocessing function
def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuation and numbers (disabled to preserve natural text for the transformer)
    #text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text
# Load the training dataset
drive.mount('/content/drive')
train_data = pd.read_csv("/content/drive/MyDrive/Project/train_yelp_60k.csv") # Load training dataset
test_data = pd.read_csv("/content/drive/MyDrive/Project/test_yelp_60k.csv") # Load test dataset
# Sample 20% of the data with the same class distribution for model comparison
X_sample, _, y_sample, _ = train_test_split(
    train_data['Text'], train_data['Class'], stratify=train_data['Class'],
    test_size=0.8, random_state=42
)
# Apply preprocessing to each text document in the sample
X_sample_preprocessed = [preprocess_text(text) for text in X_sample]
y_sample = y_sample.values
# Set up cross-validation scheme
# n_splits can be increased to 10 or decreased to 3 based on the time allowed
cv = StratifiedKFold(n_splits=6)
# Initialize the sentence transformer model
# 'bert-base-nli-mean-tokens' is used when more time is allowed; 'distilbert-base-nli-stsb-mean-tokens' is a faster alternative
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Generate embeddings for the preprocessed and sampled text
# Set the batch size for the encoding process. Adjust this based on your GPU memory
batch_size = 256
# Generate embeddings for the preprocessed sampled text data
# The `convert_to_numpy` argument converts the output to a NumPy array which might save some time
X_sample_embeddings = model.encode(X_sample_preprocessed,
                                   show_progress_bar=True,
                                   batch_size=batch_size,
                                   convert_to_numpy=True)
scaler = StandardScaler()
X_sample_scaled = scaler.fit_transform(X_sample_embeddings)
# no feature selection required for text embedding
# Define classifiers
# Classifiers can be added or removed based on the time allowed
classifiers = {
    'LogisticRegression': LogisticRegression(max_iter=1000, solver='lbfgs', C=0.1),
    'MLPClassifier': MLPClassifier(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(n_jobs=-1),  # n_estimators can be adjusted based on time if needed
    'SGDClassifier': SGDClassifier(max_iter=1000, tol=1e-3)
    #'AdaBoostClassifier': AdaBoostClassifier()
}
# Train and evaluate classifiers with embeddings on the sampled data
for clf_name, classifier in classifiers.items():
    # Perform stratified cross-validation and print results
    scores = cross_val_score(classifier, X_sample_scaled, y_sample, cv=cv, scoring='accuracy', n_jobs=-1)
    print(f"{clf_name} Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Additional steps to train the best model and predict the test set would follow here.
# Preprocess the entire training data
X_train_preprocessed = [preprocess_text(text) for text in train_data['Text']]
y_train = train_data['Class'].values
# Generate embeddings for the whole preprocessed training data
X_train_embeddings = model.encode(X_train_preprocessed, show_progress_bar=True, batch_size=batch_size, convert_to_numpy=True)
X_train_scaled = scaler.fit_transform(X_train_embeddings)
# Initialize and train the Logistic Regression classifier on the full training embeddings
best_classifier = LogisticRegression(max_iter=1000, solver='lbfgs', C=0.1)  # increase max_iter to 2000 if convergence warnings appear
best_classifier.fit(X_train_scaled, y_train)
# Preprocess the test data
X_test_preprocessed = [preprocess_text(text) for text in test_data['Text']]
# Generate embeddings for the preprocessed test data
X_test_embeddings = model.encode(X_test_preprocessed, show_progress_bar=True, batch_size=batch_size, convert_to_numpy=True)
X_test_scaled = scaler.transform(X_test_embeddings)
# Make predictions on the test data embeddings
predictions = best_classifier.predict(X_test_scaled)
# Create a DataFrame with the IDs and the corresponding predictions
prediction2 = pd.DataFrame({'ID': test_data['ID'], 'Class': predictions})
# Save the predictions to a CSV file
prediction2.to_csv('prediction2.csv', index=False, header=True)
# Download the file in Google Colab
files.download('prediction2.csv')