In Python, we apply the Naive Bayes algorithm using scikit-learn's GaussianNB classifier.
In this step, we use the StudentEvent_Resample.xlsx dataset. The values in this dataset were standardized in RapidMiner, and it contains 100 rows and 11 columns.
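As a minimal sketch of the loading step (assuming the file sits in the working directory and the openpyxl engine is installed; the variable name df is what the selection code below expects):
import pandas as pd
# Load the standardized dataset exported from RapidMiner.
df = pd.read_excel('StudentEvent_Resample.xlsx')
df.head()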
We then select the columns to be analyzed in this activity.
# Naive Bayes Model
# Select the predictor columns and the response column (MarksBin)
data_nb = df[['Assignment','Forum','Activity','LectureNote',
              'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
data_nb.info()
# Import the evaluation metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')
This is how we initialize the X (predictor) and y (response) values.
# Split into predictor and response dataframes.
X_nb = data_nb.drop('MarksBin', axis=1)
y_nb = data_nb['MarksBin']
X_nb.shape,y_nb.shape
We split the data into training and test sets using 70:30 and 50:50 ratios (test sizes of 30% and 50%).
from sklearn.model_selection import train_test_split
X_train_nb, X_test_nb, y_train_nb, y_test_nb = train_test_split(X_nb, y_nb, test_size = 0.30, random_state = 30)
X_train_nb1, X_test_nb1, y_train_nb1, y_test_nb1 = train_test_split(X_nb, y_nb, test_size = 0.50, random_state = 30)
We first test the model on the 70:30 split to see the performance accuracy of our model.
from sklearn.naive_bayes import GaussianNB
# Train and evaluate Naive Bayes on the 70:30 split
clf_nb = GaussianNB()
clf_nb.fit(X_train_nb, y_train_nb)
y_pred_nb = clf_nb.predict(X_test_nb)
print("Performance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb, y_pred_nb)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb, y_pred_nb)*100)))
print("Classification Report")
classify_nb = metrics.classification_report(y_test_nb, y_pred_nb)
print(classify_nb)
print("Confusion Matrix")
confusion_matrix_nb = pd.crosstab(y_test_nb, y_pred_nb, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb)
The result shows that our model accuracy is 80%. From the confusion matrix, our model predicts that 10 students will get 1 (Grade A), 11 students will get 2 (Grade A-), 3 students will get 3 (Grade B+), and 6 students will get 4 (Grade B). It seems that our model predicts every student passing this online course!
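The per-class counts above can be read off the confusion matrix; as a quick cross-check, here is a small sketch that tallies the predicted classes directly (assuming y_pred_nb from the block above):
# Count how many test students fall into each predicted grade class.
predicted_counts = pd.Series(y_pred_nb).value_counts().sort_index()
print(predicted_counts)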
Next, we test our model using the 50:50 split.
# Train and evaluate Naive Bayes on the 50:50 split
clf_nb1 = GaussianNB()
clf_nb1.fit(X_train_nb1, y_train_nb1)
y_pred_nb1 = clf_nb1.predict(X_test_nb1)
print("Performance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100)))
print("Classification Report")
classify_nb1 = metrics.classification_report(y_test_nb1, y_pred_nb1)
print(classify_nb1)
print("Confusion Matrix")
confusion_matrix_nb1 = pd.crosstab(y_test_nb1, y_pred_nb1, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb1)
The result shows that our model accuracy is only 68%. From the confusion matrix, our model predicts that 24 students will get 1 (Grade A), 10 students will get 2 (Grade A-), 11 students will get 3 (Grade B+), and 5 students will get 4 (Grade B). It seems that our model predicts every student passing this online course even with the 50:50 split.
Hyperparameter tuning is the search of the hyperparameter space for a set of values that optimizes our model architecture. It is very important for ensuring that our model performs well. Therefore, hyperparameter tuning is a way to increase the performance accuracy, which in turn decreases the error percentage.
For GaussianNB, only two parameters can be tuned: priors and var_smoothing. For this project, we tuned only the var_smoothing parameter using the GridSearchCV algorithm. The search was run on X_test after it was transformed to new values with the PowerTransformer() function.
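For context, var_smoothing adds a portion of the largest feature variance to every feature's variance for calculation stability; the scikit-learn default is 1e-9. A quick sketch of setting it by hand, with a hypothetical value chosen only for illustration:
# Widen each feature's variance by 1% of the largest feature variance.
clf_smooth = GaussianNB(var_smoothing=1e-2)  # hypothetical value
clf_smooth.fit(X_train_nb, y_train_nb)
print(clf_smooth.score(X_test_nb, y_test_nb))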
# Hyperparameter tuning for the Naive Bayes model
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
sns.set_style("whitegrid")
from sklearn.model_selection import RepeatedStratifiedKFold
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)
from sklearn.preprocessing import PowerTransformer
# Try 100 values of var_smoothing, spaced logarithmically from 1 down to 1e-9.
params_NB = {'var_smoothing': np.logspace(0, -9, num=100)}
gs_NB = GridSearchCV(estimator=clf_nb, param_grid=params_NB, cv=cv_method, verbose=1, scoring='accuracy')
# Transform the features to be more Gaussian-like, then run the search.
Data_transformed = PowerTransformer().fit_transform(X_test_nb)
gs_NB.fit(Data_transformed, y_test_nb)
results_NB = pd.DataFrame(gs_NB.cv_results_['params'])
results_NB['test_score'] = gs_NB.cv_results_['mean_test_score']
# predict the target on the test dataset
predict_test_nb = gs_NB.predict(Data_transformed)
# Accuracy Score on test dataset
accuracy_test_nb = accuracy_score(y_test_nb,predict_test_nb)
print('Accuracy_score on test dataset : ', (accuracy_test_nb*100))
gs_NB1 = GridSearchCV(estimator=clf_nb, param_grid=params_NB, cv=cv_method, verbose=1, scoring='accuracy')
Data_transformed1 = PowerTransformer().fit_transform(X_test_nb1)
gs_NB1.fit(Data_transformed1, y_test_nb1)
results_NB1 = pd.DataFrame(gs_NB1.cv_results_['params'])
results_NB1['test_score'] = gs_NB1.cv_results_['mean_test_score']
# predict the target on the test dataset
predict_test_nb1 = gs_NB1.predict(Data_transformed1)
# Accuracy Score on test dataset
accuracy_test_nb1 = accuracy_score(y_test_nb1,predict_test_nb1)
print('Accuracy_score on test dataset : ', (accuracy_test_nb1*100))
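Once the searches have finished, the fitted GridSearchCV objects expose the selected value; a short sketch for inspecting it:
# Show the chosen var_smoothing and its mean cross-validated accuracy.
print(gs_NB.best_params_, gs_NB.best_score_)
print(gs_NB1.best_params_, gs_NB1.best_score_)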
Below is the result of hyperparameter tuning for our model. The tuning increased our model's performance accuracy to 83.33% for the 70:30 split and to 84% for the 50:50 split.
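Since matplotlib and seaborn are already imported above, the grid-search scores can also be plotted against the candidate var_smoothing values; a sketch using the results_NB frame built earlier:
# Plot mean cross-validated accuracy against var_smoothing on a log axis.
plt.figure(figsize=(8, 4))
plt.plot(results_NB['var_smoothing'], results_NB['test_score'], marker='.')
plt.xscale('log')
plt.xlabel('var_smoothing')
plt.ylabel('mean CV accuracy')
plt.title('GridSearchCV scores for GaussianNB (70:30 split)')
plt.show()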
This is how we apply the suggested values to our model. Based on the suggested values for X_test, we applied and tested our model using the 70:30 split.
# After tuning: test on the 70:30 split
X_test_nb = Data_transformed
y_pred_nb = predict_test_nb
print("\nPerformance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb, y_pred_nb)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb, y_pred_nb)*100)))
print("\nClassification Report")
classify_nb2 = metrics.classification_report(y_test_nb, y_pred_nb)
print(classify_nb2)
print("\nConfusion Matrix")
confusion_matrix_nb2 = pd.crosstab(y_test_nb, y_pred_nb, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb2)
From the result, we can see that our model's performance accuracy increased to 83.33%. From the confusion matrix, we can see that our model predicts that all the students will pass this online course.
This is how we test our model with the suggested values using the 50:50 split.
# After tuning: test on the 50:50 split
X_test_nb1 = Data_transformed1
y_pred_nb1 = predict_test_nb1
print("\nPerformance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100)))
print("\nClassification Report")
classify_nb3 = metrics.classification_report(y_test_nb1, y_pred_nb1)
print(classify_nb3)
print("\nConfusion Matrix")
confusion_matrix_nb3 = pd.crosstab(y_test_nb1, y_pred_nb1, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb3)
From the result, our model's performance accuracy also increased to 84% for this split. From the confusion matrix, we can see that our model predicted the same as the actual record, which is 11 (Grade F).
From the above results, we can see how effective hyperparameter tuning is for our Naive Bayes model. For the 70:30 split, the performance accuracy increased to 83.33%. For the 50:50 split, it increased to 84%. From all the results, we conclude that the best ratio for the Naive Bayes model in Python is 50:50.
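To make the comparison easy to scan, here is a small sketch that collects the accuracies reported above into one table (the numbers are copied from the results, not recomputed):
# Summary of performance accuracy before and after tuning.
summary = pd.DataFrame({
    'split': ['70:30', '50:50'],
    'accuracy_before_%': [80.00, 68.00],
    'accuracy_after_%': [83.33, 84.00],
})
print(summary)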