Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In other words, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes is an easy model to build and is very useful for large datasets.
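To make the idea concrete, here is a minimal sketch (with hypothetical priors and likelihoods, not our project data) of what the independence assumption means: the score for each class is simply the class prior multiplied by each feature's likelihood separately, with no interaction terms.
#hypothetical class priors and per-feature likelihoods (illustration only)
prior = {"healthy": 0.6, "unhealthy": 0.4}
likelihood = {
    "healthy":   {"age>60": 0.3, "exercises": 0.7},
    "unhealthy": {"age>60": 0.8, "exercises": 0.2},
}
#naive Bayes multiplies the likelihood of each observed feature independently
def naive_bayes_score(cls, features):
    score = prior[cls]
    for f in features:
        score *= likelihood[cls][f]
    return score
observed = ["age>60", "exercises"]
scores = {c: naive_bayes_score(c, observed) for c in prior}
total = sum(scores.values())
#normalise the two class scores so they become posterior probabilities
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)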
The preprocessing my group did is explained on the Preprocessing page. To see in detail how each experiment was done in Python, you can look at my group's Google Colab.
We have seen how my group and I built the Decision Tree model; now we will look at how we did Naive Bayes using RapidMiner and Python. Let's get started!
#importing libraries for data manipulation
import numpy as np #library for numerical computing
import pandas as pd #library for data manipulation and analysis
#importing the cleaned dataset that was preprocessed in RapidMiner
from google.colab import files
uploaded = files.upload()
data = pd.read_csv('Processed KWAP.csv')
We removed the attributes that will not be used in this predictive data mining task for this target.
data = data.drop(columns=['Attitude towards retirement', 'life satisfaction'])
Since f2healthstat and Leveldistress are nominal, we changed them to numerical codes. This will allow us to calculate the performance accuracy later on.
# change to categorical variable
# category codes
data["f2healthstat"]=data["f2healthstat"].astype('category')
data["Leveldistress"]=data["Leveldistress"].astype('category')
# categorical to int
data["f2healthstat"]=data["f2healthstat"].cat.codes
data["Leveldistress"]=data["Leveldistress"].cat.codes
In this step, we selected only the features required for this model. The line x_data2 = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1) removes the unnecessary features, while y_data2 = data['f2healthstat'] sets the target for this model. In our case, the target is f2healthstat, so that is why we assigned f2healthstat to y_data2.
# selecting features that we want
x_data2 = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1)
y_data2 = data['f2healthstat']
For our experiments, we decided to use a 60:40 ratio and an 80:20 ratio. Thus, the test_size was set to 0.4 and 0.2 respectively.
#splitting the data into train and test
#test size can be changed
#if test_size is 0.4 or 0.2, then the train:test ratio is 60:40 or 80:20 respectively
from sklearn.model_selection import train_test_split
x_train_data2, x_test_data2, y_train_data2, y_test_data2=train_test_split(x_data2, y_data2,test_size=0.4, random_state= 0)
x_train_data2_1, x_test_data2_1, y_train_data2_1, y_test_data2_1=train_test_split(x_data2, y_data2, test_size=0.2, random_state= 0)
For this model, we used the Naive Bayes algorithm. To train it, we first needed to import GaussianNB from sklearn.
#train the naive bayes model for 60:40
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train_data2, y_train_data2)
y_pred = classifier.predict(x_test_data2)
#train the naive bayes model for 80:20
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train_data2_1, y_train_data2_1)
y_pred1 = classifier.predict(x_test_data2_1)
#Showing accuracy and precision score for 60:40
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data2, y_pred))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data2, y_pred, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data2, y_pred)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data2, y_pred)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data2, y_pred)*100)))
#Showing accuracy and precision score for 80:20
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data2_1, y_pred1))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data2_1, y_pred1, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data2_1, y_pred1)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data2_1, y_pred1)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data2_1, y_pred1)*100)))
Step 1 until Step 6 remained unchanged as before, but Step 7 is different since we used Grid Search to optimize the model. After Step 7, Step 8 stays the same, which is calculating the performance accuracy, precision and performance error.
To increase the accuracy of the model, we decided to do hyperparameter tuning. For this, we used Grid Search from sklearn. The code below shows how we did the hyperparameter tuning using GridSearchCV.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
#create a dictionary of all values we want to test (60:40 ratio)
nb_classifier = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
clf = GridSearchCV(estimator=nb_classifier, param_grid=params_NB, cv=5)
clf.fit(x_train_data2, y_train_data2)
y_pred = clf.predict(x_test_data2)
#create a dictionary of all values we want to test (80:20 ratio)
nb_classifier = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
clf = GridSearchCV(estimator=nb_classifier, param_grid=params_NB, cv=5)
clf.fit(x_train_data2_1, y_train_data2_1)
y_pred1 = clf.predict(x_test_data2_1)
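GridSearchCV also remembers which var_smoothing value won and its cross-validated score, so it is worth printing them before repeating Step 8. The short sketch below reuses the objects defined above; note that clf here refers to the 80:20 search because it was fitted last (giving the 60:40 search its own variable name would let us inspect both).
#inspect the tuned smoothing value and its cross-validated accuracy (80:20 search)
from sklearn.metrics import accuracy_score
print("Best var_smoothing:", clf.best_params_['var_smoothing'])
print("Best cross-validation accuracy: {:.2f} %".format(clf.best_score_*100))
#the tuned predictions y_pred and y_pred1 can then be scored exactly as in Step 8
print("Tuned test accuracy (60:40): {:.2f} %".format(accuracy_score(y_test_data2, y_pred)*100))
print("Tuned test accuracy (80:20): {:.2f} %".format(accuracy_score(y_test_data2_1, y_pred1)*100))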
Data was retrieved using the Retrieve operator in RapidMiner.
In this step, we will only choose the attributes that will be used for the health target, which are Age, Socio economic, Salary, self-rate health, self care and health status.
The target for this predictive model is health status since we want to predict the health level of the retiree, so the role of the health status attribute was set to label.
We have two ratios, so for the 80:20 ratio we set the training ratio to 0.8 and the test ratio to 0.2, while for the 60:40 ratio the training ratio is 0.6 and the test ratio is 0.4. This way we can see which ratio gives higher accuracy, 80:20 or 60:40. The Split Data operator was used to split the data into training and test sets.
To apply the Naive Bayes algorithm for the predictive model, we used the Naive Bayes operator.
We used the Performance operator to calculate the performance of the model so we can see the classification report, such as the accuracy and the confusion matrix.
Step 1 until Step 3 remained unchanged as before.
This step is used to optimize the model by doing hyperparameter tuning.
To split the data into different ratios, the Split Data operator was used, so we can set how much of the data goes to training and how much to testing.
To apply the Naive Bayes algorithm for the predictive model, we used the Naive Bayes operator.
We used the Performance operator to calculate the performance of the model. By doing this, we can see the classification report such as the accuracy and confusion matrix.
All of the steps above for RapidMiner and Python are repeated for experiments 2 and 3.