This project uses Natural Language Processing (NLP) techniques to classify 120,000 Yelp reviews into sentiment categories: positive, negative, or neutral. It focuses on training machines to perform this classification through two prominent NLP techniques: Bag of Words (BoW) and Text Embedding. The results give businesses actionable insights from customer feedback.
The dataset can be found here: https://www.yelp.com/dataset
Direct dataset download links: Train Dataset, Test Dataset
Tools & Technologies: Python, Natural Language Toolkit (NLTK), scikit-learn, Google Colab
Note: In this project, special attention was given to the balance between accuracy and computational efficiency. Due to the large size of the dataset (120,000 Yelp reviews) and the need to operate within a 30-minute time constraint, the project was optimized for GPU usage. This required careful selection and tuning of models and preprocessing methods to ensure high accuracy while adhering to the time limit.
Challenges:
Data pre-processing (this step takes the most time)
Ensuring the quality of input data through effective preprocessing is crucial. Given the diverse and unstructured nature of Yelp reviews, the challenge is to clean the text efficiently (e.g., removing punctuation, lowercasing, tokenization, stemming, and removing stop words) to prepare high-quality input for both BoW and Text Embedding methods, while making sure the cleaning does not hurt the accuracy of the text embedding method. This process must be optimized to preserve meaningful context while reducing noise in the data.
Model and feature selection is crucial: we want high accuracy while avoiding over-fitting the data.
Validating with Stratified K-Fold cross-validation and fine-tuning the models over many runs.
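The validation strategy described above can be sketched in isolation. The snippet below is illustrative only: it uses synthetic features in place of the vectorized reviews, and the classifier settings are not the project's tuned pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy stand-in for the vectorized review features (3 classes, like the sentiment labels)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)

# Stratified folds keep the class ratio identical in every split,
# which matters when sentiment classes are imbalanced
cv = StratifiedKFold(n_splits=6)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

Each fold trains on five sixths of the sample and evaluates on the held-out sixth; reporting mean plus/minus two standard deviations gives a quick sense of variance across folds.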
Method 1: Bag of Words (BoW) - 83% Accuracy
Data Preprocessing: Implemented text cleaning techniques including lowercasing, punctuation removal, tokenization, stemming, and stop words removal, ensuring high-quality input for the model.
Feature Extraction: Utilized CountVectorizer to transform the processed text into numerical feature vectors suitable for machine learning.
Model Selection & Validation: Explored various classifiers like Logistic Regression, MLPClassifier, RandomForestClassifier, etc., and validated them using Stratified K-Fold cross-validation.
Feature Selection: Improved model performance by integrating feature selection methods such as SelectKBest and tree-based methods.
Model Training & Evaluation: Trained models on a sample of the data, fine-tuning to achieve optimal accuracy. Evaluated different combinations of preprocessing, feature extraction, and classifiers.
Final Model & Prediction: Selected the best-performing model based on cross-validation accuracy. Trained it on the entire dataset and predicted sentiments on a separate test set.
Outcome: Successfully developed a sentiment analysis model, providing insights into customer opinions, with potential applications in business decision-making and strategy.
Result File: Here
Link to the actual code: Here
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from google.colab import drive
import nltk
from joblib import Parallel, delayed
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier
from google.colab import files
from sklearn.metrics import accuracy_score
# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
drive.mount('/content/drive')
# Preprocessing function for text data
def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)
train_data = pd.read_csv("/content/drive/MyDrive/Project/train_yelp_60k.csv") # Load training dataset
test_data = pd.read_csv("/content/drive/MyDrive/Project/test_yelp_60k.csv") # Load test dataset
# Sample 20% of the data with the same class distribution for training, so the sample is a good representative of the whole dataset.
X_sample, _, y_sample, _ = train_test_split(
    train_data['Text'], train_data['Class'], stratify=train_data['Class'],
    test_size=0.8, random_state=42
)
# Importance of pre-processing: start
def preprocess_text_custom(text, lower=True, remove_punc=True, tokenize=True, stem=True, remove_stop=True):
    if lower:
        text = text.lower()
    if remove_punc:
        text = re.sub(r'[^a-zA-Z\s]', '', text)
    if tokenize:
        tokens = word_tokenize(text)
    else:
        tokens = text.split()
    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(token) for token in tokens]
    if remove_stop:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)
preprocessing_conf = [
    {'lower': True, 'remove_punc': True, 'tokenize': True, 'stem': True, 'remove_stop': True},
    {'lower': False, 'remove_punc': True, 'tokenize': True, 'stem': True, 'remove_stop': True},
    {'lower': True, 'remove_punc': False, 'tokenize': True, 'stem': True, 'remove_stop': True},
    {'lower': True, 'remove_punc': True, 'tokenize': True, 'stem': False, 'remove_stop': True},
    {'lower': True, 'remove_punc': True, 'tokenize': True, 'stem': True, 'remove_stop': False},
]
cv = StratifiedKFold(n_splits=6)
for config in preprocessing_conf:
    X_sample_preprocessed = [preprocess_text_custom(text, **config) for text in X_sample]
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        #('feature_selection', SelectKBest(f_classif, k=500)),
        ('classifier', LogisticRegression(max_iter=1000))
    ])
    cv_scores = cross_val_score(pipeline, X_sample_preprocessed, y_sample, cv=cv, scoring='accuracy', n_jobs=-1)
    print(f"Configuration: {config}")
    print(f"Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})\n")
# Importance of pre-processing: end
# Apply preprocessing to each text document in the sample
X_sample_preprocessed = [preprocess_text(text) for text in X_sample]
y_sample = y_sample.values
# Set up cross-validation scheme
cv = StratifiedKFold(n_splits=6)
# Define classifiers
classifiers = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'MLPClassifier': MLPClassifier(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(),
    'AdaBoostClassifier': AdaBoostClassifier()
}
# Define vectorizers (CountVectorizer here; TfidfVectorizer is an alternative)
vectorizers = {
    'CountVectorizer': CountVectorizer(stop_words=None)  # stop words already handled in preprocess function
}
# Define feature selection methods
# tuning k to 500
feature_selectors = {
    'f_classif': SelectKBest(f_classif, k=500),
    'chi2': SelectKBest(chi2, k=500),
    'Tree_based': SelectFromModel(ExtraTreesClassifier(n_estimators=50))
}
# Train and evaluate models with feature selection
for vec_name, vectorizer in vectorizers.items():
    for fs_name, feature_selector in feature_selectors.items():
        for clf_name, classifier in classifiers.items():
            pipeline = Pipeline([
                (vec_name, vectorizer),
                (fs_name, feature_selector),
                (clf_name, classifier)
            ])
            # Perform stratified cross-validation and print results
            # Adjust n_jobs to use multiple cores
            scores = cross_val_score(pipeline, X_sample_preprocessed, y_sample, cv=cv, scoring='accuracy', n_jobs=-1)
            print(f"{vec_name} + {fs_name} + {clf_name} Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Additional steps to train the best model and predict the test set would follow here.
# Initialize the best feature selector and classifier as per your findings
best_feature_selector = SelectKBest(f_classif, k=500)
best_classifier = LogisticRegression(max_iter=1000)
# Create the pipeline with the best feature selection method and classifier
best_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_selection', best_feature_selector),
    ('classifier', best_classifier)
])
# Fit the pipeline to the entire training dataset
X_train_preprocessed = [preprocess_text(text) for text in train_data['Text']]
y_train = train_data['Class']
best_pipeline.fit(X_train_preprocessed, y_train)
# preprocess the test dataset
X_test_preprocessed = [preprocess_text(text) for text in test_data['Text']]
# Make predictions on the preprocessed test dataset
predictions = best_pipeline.predict(X_test_preprocessed)
# Combine the IDs from the test dataset with the predictions
prediction1 = pd.DataFrame({'ID': test_data['ID'], 'Class': predictions})
# Save the predictions to a CSV file
prediction1.to_csv('prediction1.csv', index=False, header=True)
# Download the file in Google Colab
files.download('prediction1.csv')
Method 2: Text Embeddings - 85% Accuracy
Data Preprocessing: Applied minimal preprocessing, mainly lowercasing, to maintain the integrity of the text for deep learning models.
Embedding Generation: Utilized 'bert-base-nli-mean-tokens' model from SentenceTransformers to create high-dimensional embeddings from Yelp reviews.
Model Selection & Validation: Explored various classifiers (Logistic Regression, MLPClassifier, RandomForestClassifier, SGDClassifier) and evaluated them using Stratified K-Fold cross-validation.
Feature Scaling: Standardized embeddings using StandardScaler to improve model performance.
Model Training & Evaluation: Trained classifiers on scaled embeddings and assessed performance using accuracy metrics.
Final Model & Prediction: Chose the best classifier based on cross-validation results. Trained it on the entire dataset and used it to predict sentiments on the test set.
Outcome: Successfully developed and deployed a text embedding-based sentiment analysis model, demonstrating an advanced approach to understanding customer feedback in reviews.
Result File: Here
Link to the actual code: Here
!pip install -q sentence-transformers
# Text embedding script with minimal text preprocessing (no pre-embedding feature selection needed)
# Import necessary libraries
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sentence_transformers import SentenceTransformer
import nltk
from sklearn.model_selection import train_test_split
from google.colab import drive
from google.colab import files
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Preprocessing function
def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Remove punctuation and numbers (disabled to preserve natural text for the transformer)
    #text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text
# Load the training dataset
drive.mount('/content/drive')
train_data = pd.read_csv("/content/drive/MyDrive/Project/train_yelp_60k.csv") # Load training dataset
test_data = pd.read_csv("/content/drive/MyDrive/Project/test_yelp_60k.csv") # Load test dataset
# Sample 20% of the data with the same class distribution for model comparison
X_sample, _, y_sample, _ = train_test_split(
    train_data['Text'], train_data['Class'], stratify=train_data['Class'],
    test_size=0.8, random_state=42
)
# Apply preprocessing to each text document in the sample
X_sample_preprocessed = [preprocess_text(text) for text in X_sample]
y_sample = y_sample.values
# Set up cross-validation scheme
# n_splits can be increased to 10 or decreased to 3 based on the time allowed
cv = StratifiedKFold(n_splits=6)
# Initialize the sentence transformer model
# 'bert-base-nli-mean-tokens' is used when more time is allowed; 'distilbert-base-nli-stsb-mean-tokens' is a faster alternative
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Generate embeddings for the preprocessed and sampled text
# Set the batch size for the encoding process. Adjust this based on your GPU memory
batch_size = 256
# Generate embeddings for the preprocessed sampled text data
# The `convert_to_numpy` argument converts the output to a NumPy array which might save some time
X_sample_embeddings = model.encode(X_sample_preprocessed,
                                   show_progress_bar=True,
                                   batch_size=batch_size,
                                   convert_to_numpy=True)
scaler = StandardScaler()
X_sample_scaled = scaler.fit_transform(X_sample_embeddings)
# no feature selection required for text embedding
# Define classifiers
# Classifiers can be added or removed based on the time allowed
classifiers = {
    'LogisticRegression': LogisticRegression(max_iter=1000, solver='lbfgs', C=0.1),
    'MLPClassifier': MLPClassifier(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(n_jobs=-1),  # n_estimators can be adjusted based on time if needed
    'SGDClassifier': SGDClassifier(max_iter=1000, tol=1e-3)
    #'AdaBoostClassifier': AdaBoostClassifier()
}
# Train and evaluate classifiers with embeddings on the sampled data
for clf_name, classifier in classifiers.items():
    # Perform stratified cross-validation and print results
    scores = cross_val_score(classifier, X_sample_scaled, y_sample, cv=cv, scoring='accuracy', n_jobs=-1)
    print(f"{clf_name} Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Additional steps to train the best model and predict the test set would follow here.
# Preprocess the entire training data
X_train_preprocessed = [preprocess_text(text) for text in train_data['Text']]
y_train = train_data['Class'].values
# Generate embeddings for the whole preprocessed training data
X_train_embeddings = model.encode(X_train_preprocessed, show_progress_bar=True, batch_size=batch_size, convert_to_numpy=True)
X_train_scaled = scaler.fit_transform(X_train_embeddings)
# Initialize and train the Logistic Regression classifier on the full training embeddings
best_classifier = LogisticRegression(max_iter=1000, solver='lbfgs', C=0.1)  # increase max_iter to 2000 if convergence warnings appear
best_classifier.fit(X_train_scaled, y_train)
# Preprocess the test data
X_test_preprocessed = [preprocess_text(text) for text in test_data['Text']]
# Generate embeddings for the preprocessed test data
X_test_embeddings = model.encode(X_test_preprocessed, show_progress_bar=True, batch_size=batch_size, convert_to_numpy=True)
X_test_scaled = scaler.transform(X_test_embeddings)
# Make predictions on the test data embeddings
predictions = best_classifier.predict(X_test_scaled)
# Create a DataFrame with the IDs and the corresponding predictions
prediction2 = pd.DataFrame({'ID': test_data['ID'], 'Class': predictions})
# Save the predictions to a CSV file
prediction2.to_csv('prediction2.csv', index=False, header=True)
# Download the file in Google Colab
files.download('prediction2.csv')