In Python, we apply the Naive Bayes algorithm using scikit-learn's GaussianNB classifier.
In this step, we use the StudentEvent_Resample.xlsx dataset. The values in this dataset were standardized in RapidMiner, and it contains 100 rows and 11 columns.
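As a minimal sketch of the loading step (assuming the file sits in the working directory and the openpyxl engine is installed; the variable name df is what the selection code below expects):
import pandas as pd
# Load the standardized dataset exported from RapidMiner.
df = pd.read_excel('StudentEvent_Resample.xlsx')
df.head()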
We then select the columns to be analyzed in this activity.
# Naive Bayes Model
# Select the predictor columns and the response column (MarksBin)
data_nb = df[['Assignment','Forum','Activity','LectureNote',
              'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
data_nb.info()
# Import the evaluation metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')
This is how we initialize the X (predictor) and y (response) values.
# Split into predictor and response dataframes.
X_nb = data_nb.drop('MarksBin', axis=1)
y_nb = data_nb['MarksBin']
X_nb.shape,y_nb.shape
We split the data into training and test sets using 70:30 and 50:50 ratios (test sizes of 30% and 50%).
from sklearn.model_selection import train_test_split
X_train_nb, X_test_nb, y_train_nb, y_test_nb = train_test_split(X_nb, y_nb, test_size = 0.30, random_state = 30)
X_train_nb1, X_test_nb1, y_train_nb1, y_test_nb1 = train_test_split(X_nb, y_nb, test_size = 0.50, random_state = 30)
We first test the model on the 70:30 split to see the performance accuracy of our model.
from sklearn.naive_bayes import GaussianNB
# Train and evaluate Naive Bayes on the 70:30 split
clf_nb = GaussianNB()
clf_nb.fit(X_train_nb, y_train_nb)
y_pred_nb = clf_nb.predict(X_test_nb)
print("Performance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb, y_pred_nb)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb, y_pred_nb)*100)))
print("Classification Report")
classify_nb = metrics.classification_report(y_test_nb, y_pred_nb)
print(classify_nb)
print("Confusion Matrix")
confusion_matrix_nb = pd.crosstab(y_test_nb, y_pred_nb, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb)
The result shows that our model accuracy is 80%. From the confusion matrix, our model predicts that 10 students will get 1 (Grade A), 11 students will get 2 (Grade A-), 3 students will get 3 (Grade B+), and 6 students will get 4 (Grade B). It seems that our model predicts every student passing this online course!
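The per-class counts above can be read off the confusion matrix; as a quick cross-check, here is a small sketch that tallies the predicted classes directly (assuming y_pred_nb from the block above):
# Count how many test students fall into each predicted grade class.
predicted_counts = pd.Series(y_pred_nb).value_counts().sort_index()
print(predicted_counts)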
Next, we test our model using the 50:50 split.
# Train and evaluate Naive Bayes on the 50:50 split
clf_nb1 = GaussianNB()
clf_nb1.fit(X_train_nb1, y_train_nb1)
y_pred_nb1 = clf_nb1.predict(X_test_nb1)
print("Performance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100)))
print("Classification Report")
classify_nb1 = metrics.classification_report(y_test_nb1, y_pred_nb1)
print(classify_nb1)
print("Confusion Matrix")
confusion_matrix_nb1 = pd.crosstab(y_test_nb1, y_pred_nb1, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb1)
The result shows that our model accuracy is only 68%. From the confusion matrix, our model predicts that 24 students will get 1 (Grade A), 10 students will get 2 (Grade A-), 11 students will get 3 (Grade B+), and 5 students will get 4 (Grade B). It seems that our model predicts every student passing this online course even with the 50:50 split.
Hyperparameter tuning is the search of the hyperparameter space for a set of values that optimizes our model architecture. It is very important for ensuring that our model performs well. Therefore, hyperparameter tuning is a way to increase the performance accuracy, which in turn decreases the error percentage.
For GaussianNB, only two parameters can be tuned: priors and var_smoothing. For this project, we tuned only the var_smoothing parameter using the GridSearchCV algorithm. The search was run on X_test after it was transformed to new values with the PowerTransformer() function.
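For context, var_smoothing adds a portion of the largest feature variance to every feature's variance for calculation stability; the scikit-learn default is 1e-9. A quick sketch of setting it by hand, with a hypothetical value chosen only for illustration:
# Widen each feature's variance by 1% of the largest feature variance.
clf_smooth = GaussianNB(var_smoothing=1e-2)  # hypothetical value
clf_smooth.fit(X_train_nb, y_train_nb)
print(clf_smooth.score(X_test_nb, y_test_nb))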
# Hyperparameter tuning for the Naive Bayes model
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
sns.set_style("whitegrid")
from sklearn.model_selection import RepeatedStratifiedKFold
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)
from sklearn.preprocessing import PowerTransformer
# Try 100 values of var_smoothing, spaced logarithmically from 1 down to 1e-9.
params_NB = {'var_smoothing': np.logspace(0, -9, num=100)}
gs_NB = GridSearchCV(estimator=clf_nb, param_grid=params_NB, cv=cv_method, verbose=1, scoring='accuracy')
# Transform the features to be more Gaussian-like, then run the search.
Data_transformed = PowerTransformer().fit_transform(X_test_nb)
gs_NB.fit(Data_transformed, y_test_nb)
results_NB = pd.DataFrame(gs_NB.cv_results_['params'])
results_NB['test_score'] = gs_NB.cv_results_['mean_test_score']
# predict the target on the test dataset
predict_test_nb = gs_NB.predict(Data_transformed)
# Accuracy Score on test dataset
accuracy_test_nb = accuracy_score(y_test_nb,predict_test_nb)
print('Accuracy_score on test dataset : ', (accuracy_test_nb*100))
gs_NB1 = GridSearchCV(estimator=clf_nb, param_grid=params_NB, cv=cv_method, verbose=1, scoring='accuracy')
Data_transformed1 = PowerTransformer().fit_transform(X_test_nb1)
gs_NB1.fit(Data_transformed1, y_test_nb1)
results_NB1 = pd.DataFrame(gs_NB1.cv_results_['params'])
results_NB1['test_score'] = gs_NB1.cv_results_['mean_test_score']
# predict the target on the test dataset
predict_test_nb1 = gs_NB1.predict(Data_transformed1)
# Accuracy Score on test dataset
accuracy_test_nb1 = accuracy_score(y_test_nb1,predict_test_nb1)
print('Accuracy_score on test dataset : ', (accuracy_test_nb1*100))
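Once the searches have finished, the fitted GridSearchCV objects expose the selected value; a short sketch for inspecting it:
# Show the chosen var_smoothing and its mean cross-validated accuracy.
print(gs_NB.best_params_, gs_NB.best_score_)
print(gs_NB1.best_params_, gs_NB1.best_score_)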
Below is the result of hyperparameter tuning for our model. The tuning increased our model's performance accuracy to 83.33% for the 70:30 split and to 84% for the 50:50 split.
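Since matplotlib and seaborn are already imported above, the grid-search scores can also be plotted against the candidate var_smoothing values; a sketch using the results_NB frame built earlier:
# Plot mean cross-validated accuracy against var_smoothing on a log axis.
plt.figure(figsize=(8, 4))
plt.plot(results_NB['var_smoothing'], results_NB['test_score'], marker='.')
plt.xscale('log')
plt.xlabel('var_smoothing')
plt.ylabel('mean CV accuracy')
plt.title('GridSearchCV scores for GaussianNB (70:30 split)')
plt.show()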
This is how we apply the suggested values to our model. Based on the suggested values for X_test, we applied and tested our model using the 70:30 split.
# After tuning: test on the 70:30 split
X_test_nb = Data_transformed
y_pred_nb = predict_test_nb
print("\nPerformance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb, y_pred_nb)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb, y_pred_nb)*100)))
print("\nClassification Report")
classify_nb2 = metrics.classification_report(y_test_nb, y_pred_nb)
print(classify_nb2)
print("\nConfusion Matrix")
confusion_matrix_nb2 = pd.crosstab(y_test_nb, y_pred_nb, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb2)
From the result, we can see that our model's performance accuracy increased to 83.33%. From the confusion matrix, we can see that our model predicts that all the students will pass this online course.
This is how we test our model with the suggested values using the 50:50 split.
# After tuning: test on the 50:50 split
X_test_nb1 = Data_transformed1
y_pred_nb1 = predict_test_nb1
print("\nPerformance Accuracy: {:.2f} %".format(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100))
print("Performance Error: {:.2f} %".format(100-(metrics.accuracy_score(y_test_nb1, y_pred_nb1)*100)))
print("\nClassification Report")
classify_nb3 = metrics.classification_report(y_test_nb1, y_pred_nb1)
print(classify_nb3)
print("\nConfusion Matrix")
confusion_matrix_nb3 = pd.crosstab(y_test_nb1, y_pred_nb1, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_nb3)
From the result, our model's performance accuracy also increased to 84% for this split. From the confusion matrix, we can see that our model predicted the same as the actual record, which is 11 (Grade F).
From the above results, we can see how effective hyperparameter tuning is for our Naive Bayes model. For the 70:30 split, the performance accuracy increased to 83.33%. For the 50:50 split, it increased to 84%. From all the results, we conclude that the best ratio for the Naive Bayes model in Python is 50:50.
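To make the comparison easy to scan, here is a small sketch that collects the accuracies reported above into one table (the numbers are copied from the results, not recomputed):
# Summary of performance accuracy before and after tuning.
summary = pd.DataFrame({
    'split': ['70:30', '50:50'],
    'accuracy_before_%': [80.00, 68.00],
    'accuracy_after_%': [83.33, 84.00],
})
print(summary)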