A Decision Tree is a supervised machine learning model with a flowchart-like structure in which each internal node represents a test on a feature and each leaf node represents a class label (the decision reached after evaluating the features along that path). The branches represent the conjunctions of features that lead to those class labels, while the paths from the root to the leaves represent classification rules. A Decision Tree is built by an algorithmic approach that finds ways to split the dataset based on certain conditions, and it is one of the most popular and widely used algorithms for classification tasks.
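To make the flowchart idea concrete, here is a minimal sketch (not part of our experiment) that trains a small tree on scikit-learn's built-in Iris dataset and prints its root-to-leaf rules; the dataset and the depth of 2 are only illustrative.
# Minimal illustration (not our project data): each printed split is an
# internal node (a test on a feature) and each "class:" line is a leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)
# Every path from the root to a leaf is one classification rule.
print(export_text(toy_tree, feature_names=list(iris.feature_names)))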
Now I will explain how each algorithm for predictive data mining was carried out. For predictive data mining, we used both Python and RapidMiner; however, the experiments and algorithms are the same for both tools. Therefore, on this page I will only explain the algorithm for one experiment, since the only difference between the experiments is the preprocessing, where different datasets were used. The preprocessing my group did is explained on the Preprocessing Page. To see in detail how each experiment was done in Python, you can look at my group's Google Colab. Now let's begin!
#importing libraries for data manipulation
import numpy as np #library for numerical computing
import pandas as pd #library for data manipulation and analysis
#importing the cleaned dataset prepared in RapidMiner
from google.colab import files
uploaded = files.upload()
data=pd.read_csv('Sampling Distress Level.csv')
We removed the attributes that will not be used in this predictive data mining task for this target.
data = data.drop(columns=['Attitude towards retirement', 'life satisfaction'])
Since f2healthstat is nominal, we changed it to numerical. This allows us to calculate the performance accuracy later on.
# convert the columns to the pandas category dtype
data["f2healthstat"]=data["f2healthstat"].astype('category')
data["Leveldistress"]=data["Leveldistress"].astype('category')
# replace each category with its integer code
data["f2healthstat"]=data["f2healthstat"].cat.codes
data["Leveldistress"]=data["Leveldistress"].cat.codes
In this step, we selected only the features required for this predictive data mining task. The line x_data = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1) removes the unnecessary features, while y_data = data['f2healthstat'] sets the target for this model. In our case, the target is f2healthstat, which is why we assign it to y_data.
# selecting features that we want
x_data = data.drop(columns=['Level of education', 'financial well being', 'f2healthstat', 'Leveldistress'], axis=1)
y_data = data['f2healthstat']
For our experiments, we decided to use a 60:40 ratio and an 80:20 ratio. Thus, the test_size was set to 0.4 and 0.2 respectively.
#splitting the data into train and test
#test size can be changed
#if test_size is 0.4 and 0.2, then the train:test ratio is 60:40 and 80:20 respectively
from sklearn.model_selection import train_test_split
x_train_data, x_test_data, y_train_data, y_test_data=train_test_split(x_data, y_data,test_size=0.4, random_state= 0)
x_train_data1, x_test_data1, y_train_data1, y_test_data1=train_test_split(x_data, y_data, test_size=0.2, random_state= 0)
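As an optional sanity check (a small sketch using the variables above), the shapes of the resulting sets can be printed to confirm the 60:40 and 80:20 proportions.
# Optional: confirm that the splits match the intended ratios
print(x_train_data.shape, x_test_data.shape)    # roughly 60:40
print(x_train_data1.shape, x_test_data1.shape)  # roughly 80:20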
For the decision tree model, the criterion was set to entropy and max_depth to 10. Then we used the DecisionTreeClassifier from sklearn to build and train the model.
#train the decision tree model for the 60:40 split with a maximum depth of 10
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10)
model.fit(x_train_data, y_train_data)
predictions = model.predict(x_test_data)
#train the decision tree model for the 80:20 split with a maximum depth of 10
from sklearn.tree import DecisionTreeClassifier
model1 = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10)
model1.fit(x_train_data1, y_train_data1)
predictions1 = model1.predict(x_test_data1)
#Showing accuracy and precision score for 60:40
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data, predictions))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data, predictions, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data, predictions)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data, predictions)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data, predictions)*100)))
#Showing accuracy and precision score for 80:20
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,precision_score
print("\nClassification Report")
print(classification_report(y_test_data1, predictions1))
print("\nConfusion Matrix")
print(pd.crosstab(y_test_data1, predictions1, rownames=['Actual'], colnames=['Predicted'], margins = True))
print("\nPerformance Accuracy: {:.2f} %".format(accuracy_score(y_test_data1, predictions1)*100))
print("\nPrecision: {:.2f} %".format(precision_score(y_test_data1, predictions1)*100))
print("\nPerformance Error: {:.2f} %".format(100-(accuracy_score(y_test_data1, predictions1)*100)))
Step 1 until Step 6 remained unchanged, but Step 7 is different because we used Grid Search to optimize the model. After Step 7, Step 8 stays the same, which is calculating the performance accuracy, precision, and performance error.
To increase the accuracy of the model, we decided to do hyperparameter tuning. For hyperparameter tuning, we decided to use Grid Search from sklearn. The code below shows how we did the hyperparameter tuning using GridSearchCV.
#hyperparameter tuning with GridSearchCV for the 60:40 split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
#create a dictionary of all the values we want to test
param_grid = {'criterion':['gini', 'entropy'],'max_depth': np.arange(1, 15)}
#the values in param_grid override the criterion and max_depth of the base estimator
clf = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10), param_grid, cv=5)
clf.fit(x_train_data, y_train_data)
predictions = clf.predict(x_test_data)
#hyperparameter tuning with GridSearchCV for the 80:20 split
#create a dictionary of all the values we want to test
param_grid = {'criterion':['gini', 'entropy'],'max_depth': np.arange(1, 15)}
clfh = GridSearchCV(DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=10), param_grid, cv=5)
clfh.fit(x_train_data1, y_train_data1)
predictions1 = clfh.predict(x_test_data1)
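After fitting, GridSearchCV keeps the combination of criterion and max_depth that gave the best cross-validation score, which can be read back as shown in this short sketch using the fitted objects above.
# Inspect what the grid search selected for each split ratio
print("60:40 best parameters:", clf.best_params_)
print("60:40 best CV accuracy: {:.2f} %".format(clf.best_score_*100))
print("80:20 best parameters:", clfh.best_params_)
print("80:20 best CV accuracy: {:.2f} %".format(clfh.best_score_*100))
Since GridSearchCV refits the best estimator on the full training split by default, the predictions above already come from the tuned tree.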
Data was retrieved using the Retrieve operator in RapidMiner.
In this step, we selected only the attributes that will be used for the health target, which are Age, Socio economic, Salary, self-rate health, self care, and health status.
The target for this predictive model is health status since we want to predict the health level of the retirees. So the role of health status was set to label, which marks it as the target attribute.
We have two ratios: for the 80:20 ratio, we set the training ratio to 0.8 and the test ratio to 0.2, while for the 60:40 ratio, the training ratio is 0.6 and the test ratio is 0.4. This way we can see which ratio gives higher accuracy, 80:20 or 60:40. The Split Data operator was used to split the data into training and test sets.
To apply the decision tree algorithm for the predictive model, we used the Decision Tree operator. The criterion was set to gain_ratio and the maximal depth to 10, so the tree will have a maximum depth of 10.
We used the Performance operator to calculate the performance of the model so we can see the classification report, such as the accuracy and the confusion matrix.
Step 1 until Step 3 remained unchanged as before.
This step is used to optimize the model and perform hyperparameter tuning.
To split the data into different ratios, the Split Data operator was used, so we can set how much of the data goes to training and how much to testing.
To apply the decision tree algorithm for the predictive model, we again used the Decision Tree operator. The criterion was set to gain_ratio and the maximal depth to 10, so the tree will have a maximum depth of 10.
We used the Performance operator to calculate the performance of the model so we can see the classification report, such as the accuracy and the confusion matrix.
All these steps will be repeated for Experiment 2 and Experiment 3.