In Python, we use scikit-learn's DecisionTreeClassifier to apply the Decision Tree algorithm.
In this step, we use the StudentEvent_Resample.xlsx dataset. The values in this dataset were standardized in RapidMiner, and it contains 100 rows and 11 columns.
import pandas as pd

# Load the standardized dataset from Google Drive
path = "/content/drive/My Drive/Colab Notebooks/199607-Portfolio/StudentEvent_Resample.xlsx"
df = pd.read_excel(path)
df.head(3)
# Strip the leading letter from StudentID and convert the rest to a number
df['StudentID'] = df['StudentID'].str[1:]
df['StudentID'] = pd.to_numeric(df['StudentID'])
df.info()
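The values were already standardized in RapidMiner before loading. For reference, an equivalent z-score standardization could be done in Python instead (a minimal sketch using scikit-learn's StandardScaler; the exact RapidMiner normalization settings are an assumption):
from sklearn.preprocessing import StandardScaler

# Hypothetical Python equivalent of the RapidMiner z-score standardization
feature_cols = ['Assignment','Forum','Activity','LectureNote',
                'Tutorial','Questionnaire','Quiz']
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])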
Next, we select the columns to be analyzed in this activity.
data_dt = df[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
data_dt.head()
#Decision Tree Model
#import the classifier
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')
This is how we initialize the X and y values.
X_dt = data_dt[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz']].values  # equivalently: data_dt.values[:, :-1]
print(X_dt)
y_dt = data_dt.MarksBin.values
print(X_dt.shape,y_dt.shape)
We split the data into training and test sets using two test:train ratios, 30:70 and 50:50.
from sklearn.model_selection import train_test_split
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(X_dt,y_dt, test_size = 0.3, random_state = 10)
X_train_dt1, X_test_dt1, y_train_dt1, y_test_dt1 = train_test_split(X_dt,y_dt, test_size = 0.5, random_state = 10)
We test the model with the 30:70 split to see the performance accuracy of our model.
#test model using 30:70 ratio
#fit a DecisionTreeClassifier on the training set
mod_dt = DecisionTreeClassifier(random_state = 100, max_depth = 3)
mod_dt.fit(X_train_dt,y_train_dt)
y_pred_dt=mod_dt.predict(X_test_dt)
pred_train_dt = mod_dt.predict(X_train_dt)
print("ACCURACY ON VALIDATION SET")
print("Performance Accuracy (30:70): {:.2f} %".format(accuracy_score(y_pred_dt,y_test_dt)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt,y_pred_dt)*100)))
print("Classification Report")
classifyV_dt = classification_report(y_pred_dt,y_test_dt);
print(classifyV_dt)
print("Confusion Matrix")
confusion_matrixV_dt = pd.crosstab(y_test_dt, y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt)
The result shows that our model accuracy is 80%. From the confusion matrix, our model predicts that 10 students will get 1 (Grade A), 11 students will get 2 (Grade A-), 3 students will get 3 (Grade B+), and 6 students will get 4 (Grade B). It seems our model predicts that all the students pass this online course!
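To see the distribution of predicted classes directly, we can count the predictions (a quick sketch using the variables defined above):
# Count how many students fall into each predicted grade bin
print(pd.Series(y_pred_dt).value_counts().sort_index())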
Next, we test our model using the 50:50 split.
#test model using 50:50 ratio
#fit the same DecisionTreeClassifier on the 50:50 training set
mod_dt1 = DecisionTreeClassifier(random_state = 100, max_depth = 3)
mod_dt1.fit(X_train_dt1,y_train_dt1)
y_pred_dt1=mod_dt1.predict(X_test_dt1)
pred_train_dt1 = mod_dt1.predict(X_train_dt1)
print("ACCURACY ON VALIDATION SET (50:50)")
print("Performance Accuracy (50:50): {:.2f} %".format(accuracy_score(y_pred_dt1,y_test_dt1)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt1,y_pred_dt1)*100)))
print("Classification Report")
classifyV_dt1 = classification_report(y_pred_dt1,y_test_dt1);
print(classifyV_dt1)
print("Confusion Matrix")
confusion_matrixV_dt1 = pd.crosstab(y_test_dt1, y_pred_dt1, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt1)
The result shows that our model accuracy is only 68%. From the confusion matrix, our model predicts that 24 students will get 1 (Grade A), 10 students will get 2 (Grade A-), 11 students will get 3 (Grade B+), and 5 students will get 4 (Grade B). It seems our model predicts that all the students pass this online course even with the 50:50 split.
Hyperparameter tuning is searching the hyperparameter space for a set of values that optimizes your model architecture. It is important for keeping model performance in good condition: tuning is a way to increase the performance accuracy, which in turn decreases the error percentage.
For our model, we tune only the criterion and max_depth parameters. To apply hyperparameter tuning to our model, we use GridSearchCV, a class from sklearn's model_selection module. It loops through predefined hyperparameter combinations, fitting our estimator (model) on the training set, so that in the end we can select the best parameters from the listed hyperparameters.
#Decision Tree Hyperparameter Tuning Using GridSearch
# importing libraries
from sklearn import decomposition
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Creating a StandardScaler object
std_slc = StandardScaler()
# Creating a PCA object
pca = decomposition.PCA()
# Creating a DecisionTreeClassifier
dec_tree = tree.DecisionTreeClassifier()
# Creating a pipeline of three steps. First, standardizing the data.
# Second, transforming the data with PCA.
# Third, training a Decision Tree Classifier on the data.
pipe = Pipeline(steps=[('std_slc', std_slc),
('pca', pca),
('dec_tree', dec_tree)])
# Creating Parameter Space
# Creating a list of integers from 1 up to the number of features in X (7)
n_components = list(range(1,X_dt.shape[1]+1,1))
# Creating lists of parameter for Decision Tree Classifier
criterion = ['gini', 'entropy']
max_depth = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30]
# Creating a dictionary of all the parameter options
# Note that we can access the parameters of steps of a pipeline by using '__'
parameters = dict(pca__n_components=n_components,
dec_tree__criterion=criterion,
dec_tree__max_depth=max_depth)
# Conducting Parameter Optimization With Pipeline
# Creating a grid search object
clf_GS = GridSearchCV(pipe, parameters)
# Fitting the grid search
clf_GS.fit(X_dt, y_dt)
# Viewing The Best Parameters
print('Best Criterion:', clf_GS.best_estimator_.get_params()['dec_tree__criterion'])
print('Best max_depth:', clf_GS.best_estimator_.get_params()['dec_tree__max_depth'])
print('Best Number Of Components:', clf_GS.best_estimator_.get_params()['pca__n_components'])
print(); print(clf_GS.best_estimator_.get_params()['dec_tree'])
Below is the result of hyperparameter tuning for our model. The search suggests that the best criterion is gini, with a best max_depth of 22. It also suggests that the best number of components is 4; since we want to analyze all the features in our dataset, we ignore this last suggestion.
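Before applying the suggested values, we can also check the cross-validated score of the best parameter combination using GridSearchCV's standard attributes (a minimal sketch):
# Mean cross-validated accuracy of the best parameter combination
print('Best CV accuracy: {:.2f} %'.format(clf_GS.best_score_*100))
# Full results for every combination, ranked by mean test score
cv_results = pd.DataFrame(clf_GS.cv_results_)
print(cv_results.sort_values('rank_test_score')[['params','mean_test_score']].head())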
This is how we apply the suggested values to our model. Based on the suggestion, we first test our model using the 30:70 split.
#after tuning test on 30:70 ratio
#refit the DecisionTreeClassifier with the values suggested by GridSearchCV
mod_dt = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth = 14)
mod_dt.fit(X_train_dt,y_train_dt)
y_pred_dt=mod_dt.predict(X_test_dt)
pred_train_dt = mod_dt.predict(X_train_dt)
print("ACCURACY ON VALIDATION SET-AFTER TUNING (30:70)")
print("Performance Accuracy: {:.2f} %".format(accuracy_score(y_pred_dt,y_test_dt)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt,y_pred_dt)*100)))
print("Classification Report")
classifyV_dt2 = classification_report(y_pred_dt,y_test_dt);
print(classifyV_dt2)
print("Confusion Matrix")
confusion_matrixV_dt2 = pd.crosstab(y_test_dt, y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt2)
From the result below, we can see that our model's performance accuracy increased from 80% to 93.33%. From the confusion matrix, we can see that our model predicts 10 students will get 1 (Grade A), 9 students will get 2 (Grade B+), 3 students will get 3 (Grade B), 5 students will get 4 (Grade B-), 1 student will get 7 (Grade C), and 2 students will get 11 (Grade F). It seems our model's predictions have improved, since it now identifies the failed students that exist in our dataset.
This is how we test our model with the suggested values using the 50:50 split.
#after tuning test on 50:50 ratio
#refit the DecisionTreeClassifier with the values suggested by GridSearchCV
mod_dt1 = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth = 14)
mod_dt1.fit(X_train_dt1,y_train_dt1)
y_pred_dt1=mod_dt1.predict(X_test_dt1)
pred_train_dt1 = mod_dt1.predict(X_train_dt1)
print("ACCURACY ON VALIDATION SET-AFTER TUNING (50:50)")
print("Performance Accuracy: {:.2f} %".format(accuracy_score(y_pred_dt1,y_test_dt1)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt1,y_pred_dt1)*100)))
print("Classification Report")
classifyV_dt3 = classification_report(y_pred_dt1,y_test_dt1);
print(classifyV_dt3)
print("Confusion Matrix")
confusion_matrixV_dt3 = pd.crosstab(y_test_dt1, y_pred_dt1, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt3)
From the result below, our model's performance accuracy also increases for this split, from 68% to 84%. However, the confusion matrix shows that our model does not predict any failed students, and in the classification report the average precision is only 55%, the average recall 48%, and the F1-score only 51%.
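For reference, those averaged scores can also be computed directly with sklearn (a minimal sketch; we assume macro averaging, matching the classification report's macro avg row):
from sklearn.metrics import precision_recall_fscore_support

# Macro-averaged precision, recall, and F1 across all grade bins
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test_dt1, y_pred_dt1, average='macro', zero_division=0)
print("Precision: {:.2f} %  Recall: {:.2f} %  F1: {:.2f} %".format(
    precision*100, recall*100, f1*100))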
Below is our Decision Tree visualization based on the 30:70 test split. From the graph, we can see five features: Assignment, Forum, Tutorial, LectureNote, and Quiz. Two features are missing from this tree, Questionnaire and Activity. It seems these two features are not important and do not affect the student performance prediction.
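The visualization itself can be generated with sklearn's plot_tree (a minimal sketch; the figure size and styling are assumptions, and mod_dt is the tuned model fitted on the 30:70 split above):
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

feature_names = ['Assignment','Forum','Activity','LectureNote',
                 'Tutorial','Questionnaire','Quiz']
plt.figure(figsize=(20,10))
# Draw the fitted tree with feature names and colour-filled class nodes
plot_tree(mod_dt, feature_names=feature_names, filled=True, fontsize=8)
plt.show()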
From the above results, we can see how effective hyperparameter tuning is for our Decision Tree model. For the 30:70 split, the performance accuracy increased from 80% to 93.3%. For the 50:50 split, the performance accuracy increased from 68% to 84%. From all the results, we can conclude that the best split for the Decision Tree model in Python is 30:70.