A Decision Tree is a supervised machine learning model with a flowchart-like structure in which each internal node represents a test on a feature and each leaf node represents a class label (the decision reached after evaluating the features along that path). The branches represent the conjunctions of features that lead to those class labels, while the paths from the root to the leaves represent classification rules. A Decision Tree is built by an algorithmic approach that finds ways to split the dataset based on certain conditions, and it is one of the most popular and widely used algorithms for classification tasks.
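To make the flowchart idea concrete, here is a minimal sketch (not part of our experiment) that trains a small tree on scikit-learn's built-in Iris dataset and prints its root-to-leaf rules; the dataset and the depth of 2 are only illustrative.
# Minimal illustration (not our project data): each printed split is an
# internal node (a test on a feature) and each "class:" line is a leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)
# Every path from the root to a leaf is one classification rule.
print(export_text(toy_tree, feature_names=list(iris.feature_names)))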
Now I will explain how each algorithm for predictive data mining was carried out. For predictive data mining, we used both Python and RapidMiner; however, the experiments and algorithms are the same for both tools. Therefore, on this page I will only explain the algorithm for one experiment, since the only difference between the experiments is the preprocessing, where different datasets were used. The preprocessing my group did is explained on the Preprocessing Page. To see in detail how each experiment was done in Python, you can look at my group's Google Colab. Now let's begin!
#importing libraries for data manipulation
import numpy as np #library for numerical computing
import pandas as pd #library for data manipulation and analysis
#importing the cleaned dataset prepared in RapidMiner
from google.colab import files
uploaded = files.upload()
data=pd.read_csv('Sampling Distress Level.csv')
We removed the attributes that will not be used in this predictive data mining task for this target.
data = data.drop(columns=['Attitude towards retirement', 'life satisfaction'])
Since f2healthstat is nominal, we changed it to numerical. This allows us to calculate the performance accuracy later on.
# convert the columns to the pandas category dtype
data["f2healthstat"]=data["f2healthstat"].astype('category')
data["Leveldistress"]=data["Leveldistress"].astype('category')
# replace each category with its integer code
data["f2healthstat"]=data["f2healthstat"].cat.codes
data["Leveldistress"]=data["Leveldistress"].cat.codes
In this step, we selected only the features required for this predictive data mining task. The line x_data = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1) removes the unnecessary features, while y_data = data['f2healthstat'] sets the target for this model. In our case, the target is f2healthstat, which is why we assign it to y_data.
# selecting features that we want
x_data = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1)
y_data = data['f2healthstat']
For our experiments, we decided to use a 60:40 ratio and an 80:20 ratio. Thus, the test_size was set to 0.4 and 0.2 respectively.
#splitting the data into train and test
#test size can be changed
#if test_size is 0.4 and 0.2, then the train:test ratio is 60:40 and 80:20 respectively
from sklearn.model_selection import train_test_split
x_train_data, x_test_data, y_train_data, y_test_data=train_test_split(x_data, y_data,test_size=0.4, random_state= 0)
x_train_data1, x_test_data1, y_train_data1, y_test_data1=train_test_split(x_data, y_data, test_size=0.2, random_state= 0)
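As an optional sanity check (a small sketch using the variables above), the shapes of the resulting sets can be printed to confirm the 60:40 and 80:20 proportions.
# Optional: confirm that the splits match the intended ratios
print(x_train_data.shape, x_test_data.shape)    # roughly 60:40
print(x_train_data1.shape, x_test_data1.shape)  # roughly 80:20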
For the decision tree model, the criterion was set to entropy and max_depth to 10. Then we used the DecisionTreeClassifier from sklearn to build and train the model.
#train the decision tree model for the 60:40 split with a maximum depth of 10
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10)
model.fit(x_train_data, y_train_data)
predictions = model.predict(x_test_data)
#train the decision tree model for the 80:20 split with a maximum depth of 10
from sklearn.tree import DecisionTreeClassifier
model1 = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10)
model1.fit(x_train_data1, y_train_data1)
predictions1 = model1.predict(x_test_data1)
#Showing accuracy and precision score for 60:40
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data, predictions))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data, predictions, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data, predictions)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data, predictions)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data, predictions)*100)))
#Showing accuracy and precision score for 80:20
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data1, predictions1))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data1, predictions1, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data1, predictions1)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data1, predictions1)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data1, predictions1)*100)))
Step 1 until Step 6 remained unchanged, but Step 7 is different because we used Grid Search to optimize the model. After Step 7, Step 8 stays the same, which is calculating the performance accuracy, precision, and performance error.
To increase the accuracy of the model, we decided to do hyperparameter tuning. For hyperparameter tuning, we decided to use Grid Search from sklearn. The code below shows how we did the hyperparameter tuning using GridSearchCV.
#hyperparameter tuning with GridSearchCV for the 60:40 split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
#create a dictionary of all the values we want to test
param_grid = {'criterion':['gini', 'entropy'],'max_depth': np.arange(1, 15)}
#the values in param_grid override the criterion and max_depth of the base estimator
clf = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10), param_grid, cv=5)
clf.fit(x_train_data, y_train_data)
predictions = clf.predict(x_test_data)
#hyperparameter tuning with GridSearchCV for the 80:20 split
#create a dictionary of all the values we want to test
param_grid = {'criterion':['gini', 'entropy'],'max_depth': np.arange(1, 15)}
clfh = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10), param_grid, cv=5)
clfh.fit(x_train_data1, y_train_data1)
predictions1 = clfh.predict(x_test_data1)
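After fitting, GridSearchCV keeps the combination of criterion and max_depth that gave the best cross-validation score, which can be read back as shown in this short sketch using the fitted objects above.
# Inspect what the grid search selected for each split ratio
print("60:40 best parameters:", clf.best_params_)
print("60:40 best CV accuracy: {:.2f} %".format(clf.best_score_*100))
print("80:20 best parameters:", clfh.best_params_)
print("80:20 best CV accuracy: {:.2f} %".format(clfh.best_score_*100))
Since GridSearchCV refits the best estimator on the full training split by default, the predictions above already come from the tuned tree.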
Data was retrieved using the Retrieve operator in RapidMiner.
In this step, we selected only the attributes that will be used for the health target, which are Age, Socio economic, Salary, self-rate health, self care, and health status.
The target for this predictive model is health status since we want to predict the health level of the retirees. So the role of health status was set to label, which marks it as the target attribute.
We have two ratios: for the 80:20 ratio, we set the training ratio to 0.8 and the test ratio to 0.2, while for the 60:40 ratio, the training ratio is 0.6 and the test ratio is 0.4. This way we can see which ratio gives higher accuracy, 80:20 or 60:40. The Split Data operator was used to split the data into training and test sets.
To apply the decision tree algorithm for the predictive model, we used the Decision Tree operator. The criterion was set to gain_ratio and the maximal depth to 10, so the tree will have a maximum depth of 10.
We used the Performance operator to calculate the performance of the model so we can see the classification report, such as the accuracy and the confusion matrix.
Step 1 until Step 3 remained unchanged as before.
This step is used to optimize the model and perform hyperparameter tuning.
To split the data into different ratios, the Split Data operator was used, so we can set how much of the data goes to training and how much to testing.
To apply the decision tree algorithm for the predictive model, we again used the Decision Tree operator. The criterion was set to gain_ratio and the maximal depth to 10, so the tree will have a maximum depth of 10.
We used the Performance operator to calculate the performance of the model so we can see the classification report, such as the accuracy and the confusion matrix.
All these steps will be repeated for Experiment 2 and Experiment 3.