Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In other words, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes is an easy model to build and is very useful for large datasets.
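To make the idea concrete, here is a minimal sketch (with hypothetical priors and likelihoods, not our project data) of what the independence assumption means: the score for each class is simply the class prior multiplied by each feature's likelihood separately, with no interaction terms.
#hypothetical class priors and per-feature likelihoods (illustration only)
prior = {"healthy": 0.6, "unhealthy": 0.4}
likelihood = {
    "healthy":   {"age>60": 0.3, "exercises": 0.7},
    "unhealthy": {"age>60": 0.8, "exercises": 0.2},
}
#naive Bayes multiplies the likelihood of each observed feature independently
def naive_bayes_score(cls, features):
    score = prior[cls]
    for f in features:
        score *= likelihood[cls][f]
    return score
observed = ["age>60", "exercises"]
scores = {c: naive_bayes_score(c, observed) for c in prior}
total = sum(scores.values())
#normalise the two class scores so they become posterior probabilities
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors)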
The preprocessing my group did is explained on the Preprocessing page. To see in detail how each experiment was done in Python, you can look at my group's Google Colab.
We have seen how my group and I built the Decision Tree model; now we will look at how we did Naive Bayes using RapidMiner and Python. Let's get started!
#importing libraries for data manipulation
import numpy as np #library for numerical computing
import pandas as pd #library for data manipulation and analysis
#importing the cleaned dataset that was preprocessed in RapidMiner
from google.colab import files
uploaded = files.upload()
data = pd.read_csv('Processed KWAP.csv')
We removed the attributes that will not be used in this predictive data mining task for this target.
data = data.drop(columns=['Attitude towards retirement', 'life satisfaction'])
Since f2healthstat and Leveldistress are nominal, we changed them to numerical codes. This will allow us to calculate the performance accuracy later on.
# change to categorical variable
# category codes
data["f2healthstat"]=data["f2healthstat"].astype('category')
data["Leveldistress"]=data["Leveldistress"].astype('category')
# categorical to int
data["f2healthstat"]=data["f2healthstat"].cat.codes
data["Leveldistress"]=data["Leveldistress"].cat.codes
In this step, we selected only the features required for this model. The line x_data2 = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1) removes the unnecessary features, while y_data2 = data['f2healthstat'] sets the target for this model. In our case, the target is f2healthstat, so that is why we assigned f2healthstat to y_data2.
# selecting features that we want
x_data2 = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1)
y_data2 = data['f2healthstat']
For our experiments, we decided to use a 60:40 ratio and an 80:20 ratio. Thus, the test_size was set to 0.4 and 0.2 respectively.
#splitting the data into train and test
#test size can be changed
#if test_size is 0.4 or 0.2, then the train:test ratio is 60:40 or 80:20 respectively
from sklearn.model_selection import train_test_split
x_train_data2, x_test_data2, y_train_data2, y_test_data2=train_test_split(x_data2, y_data2,test_size=0.4, random_state= 0)
x_train_data2_1, x_test_data2_1, y_train_data2_1, y_test_data2_1=train_test_split(x_data2, y_data2, test_size=0.2, random_state= 0)
For this model, we used the Naive Bayes algorithm. To train it, we first needed to import GaussianNB from sklearn.
#train the naive bayes model for 60:40
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train_data2, y_train_data2)
y_pred = classifier.predict(x_test_data2)
#train the naive bayes model for 80:20
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train_data2_1, y_train_data2_1)
y_pred1 = classifier.predict(x_test_data2_1)
#Showing accuracy and precision score for 60:40
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data2, y_pred))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data2, y_pred, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data2, y_pred)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data2, y_pred)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data2, y_pred)*100)))
#Showing accuracy and precision score for 80:20
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data2_1, y_pred1))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data2_1, y_pred1, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data2_1, y_pred1)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data2_1, y_pred1)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data2_1, y_pred1)*100)))
Step 1 until Step 6 remained unchanged as before, but Step 7 is different since we used Grid Search to optimize the model. After Step 7, Step 8 stays the same, which is calculating the performance accuracy, precision and performance error.
To increase the accuracy of the model, we decided to do hyperparameter tuning. For this, we used Grid Search from sklearn. The code below shows how we did the hyperparameter tuning using GridSearchCV.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
#create a dictionary of all values we want to test (60:40 ratio)
nb_classifier = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
clf = GridSearchCV(estimator=nb_classifier, param_grid=params_NB, cv=5)
clf.fit(x_train_data2, y_train_data2)
y_pred = clf.predict(x_test_data2)
#create a dictionary of all values we want to test (80:20 ratio)
nb_classifier = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
clf = GridSearchCV(estimator=nb_classifier, param_grid=params_NB, cv=5)
clf.fit(x_train_data2_1, y_train_data2_1)
y_pred1 = clf.predict(x_test_data2_1)
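GridSearchCV also remembers which var_smoothing value won and its cross-validated score, so it is worth printing them before repeating Step 8. The short sketch below reuses the objects defined above; note that clf here refers to the 80:20 search because it was fitted last (giving the 60:40 search its own variable name would let us inspect both).
#inspect the tuned smoothing value and its cross-validated accuracy (80:20 search)
from sklearn.metrics import accuracy_score
print("Best var_smoothing:", clf.best_params_['var_smoothing'])
print("Best cross-validation accuracy: {:.2f} %".format(clf.best_score_*100))
#the tuned predictions y_pred and y_pred1 can then be scored exactly as in Step 8
print("Tuned test accuracy (60:40): {:.2f} %".format(accuracy_score(y_test_data2, y_pred)*100))
print("Tuned test accuracy (80:20): {:.2f} %".format(accuracy_score(y_test_data2_1, y_pred1)*100))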
Data was retrieved using the Retrieve operator in RapidMiner.
In this step, we will only choose the attributes that will be used for the health target, which are Age, Socio economic, Salary, self-rate health, self care and health status.
The target for this predictive model is health status since we want to predict the health level of the retiree, so the role of the health status attribute was set to label.
We have two ratios, so for the 80:20 ratio we set the training ratio to 0.8 and the test ratio to 0.2, while for the 60:40 ratio the training ratio is 0.6 and the test ratio is 0.4. This way we can see which ratio gives higher accuracy, 80:20 or 60:40. The Split Data operator was used to split the data into training and test sets.
To apply the Naive Bayes algorithm for the predictive model, we used the Naive Bayes operator.
We used the Performance operator to calculate the performance of the model so we can see the classification report, such as the accuracy and the confusion matrix.
Step 1 until Step 3 remained unchanged as before.
This step is used to optimize the model by doing hyperparameter tuning.
To split the data into different ratios, the Split Data operator was used, so we can set how much of the data goes to training and how much to testing.
To apply the Naive Bayes algorithm for the predictive model, we used the Naive Bayes operator.
We used the Performance operator to calculate the performance of the model. By doing this, we can see the classification report such as the accuracy and confusion matrix.
All of the steps above for RapidMiner and Python are repeated for experiments 2 and 3.