In Python, we use scikit-learn's DecisionTreeClassifier to apply the Decision Tree algorithm.
In this step, we use the StudentEvent_Resample.xlsx dataset. The values in this dataset were standardized in RapidMiner, and it contains 100 rows and 11 columns.
import pandas as pd

# Load the standardized dataset from Google Drive
path = "/content/drive/My Drive/Colab Notebooks/199607-Portfolio/StudentEvent_Resample.xlsx"
df = pd.read_excel(path)
df.head(3)
# Strip the leading letter from StudentID and convert the rest to a number
df['StudentID'] = df['StudentID'].str[1:]
df['StudentID'] = pd.to_numeric(df['StudentID'])
df.info()
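The values were already standardized in RapidMiner before loading. For reference, an equivalent z-score standardization could be done in Python instead (a minimal sketch using scikit-learn's StandardScaler; the exact RapidMiner normalization settings are an assumption):
from sklearn.preprocessing import StandardScaler

# Hypothetical Python equivalent of the RapidMiner z-score standardization
feature_cols = ['Assignment','Forum','Activity','LectureNote',
                'Tutorial','Questionnaire','Quiz']
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])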
Next, we select the columns to be analyzed in this activity.
data_dt = df[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz','MarksBin']].copy()
data_dt.head()
#Decision Tree Model
#import the classifier
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')
This is how we initialize the X and y values.
X_dt = data_dt[['Assignment','Forum','Activity','LectureNote',
'Tutorial','Questionnaire','Quiz']].values  # equivalently: data_dt.values[:, :-1]
print(X_dt)
y_dt = data_dt.MarksBin.values
print(X_dt.shape,y_dt.shape)
We split the data into training and test sets using two test:train ratios, 30:70 and 50:50.
from sklearn.model_selection import train_test_split
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(X_dt,y_dt, test_size = 0.3, random_state = 10)
X_train_dt1, X_test_dt1, y_train_dt1, y_test_dt1 = train_test_split(X_dt,y_dt, test_size = 0.5, random_state = 10)
We test the model with the 30:70 split to see the performance accuracy of our model.
#test model using 30:70 ratio
#fit a DecisionTreeClassifier on the training set
mod_dt = DecisionTreeClassifier(random_state = 100, max_depth = 3)
mod_dt.fit(X_train_dt,y_train_dt)
y_pred_dt=mod_dt.predict(X_test_dt)
pred_train_dt = mod_dt.predict(X_train_dt)
print("ACCURACY ON VALIDATION SET")
print("Performance Accuracy (30:70): {:.2f} %".format(accuracy_score(y_pred_dt,y_test_dt)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt,y_pred_dt)*100)))
print("Classification Report")
classifyV_dt = classification_report(y_pred_dt,y_test_dt);
print(classifyV_dt)
print("Confusion Matrix")
confusion_matrixV_dt = pd.crosstab(y_test_dt, y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt)
The result shows that our model accuracy is 80%. From the confusion matrix, our model predicts that 10 students will get 1 (Grade A), 11 students will get 2 (Grade A-), 3 students will get 3 (Grade B+), and 6 students will get 4 (Grade B). It seems our model predicts that all the students pass this online course!
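To see the distribution of predicted classes directly, we can count the predictions (a quick sketch using the variables defined above):
# Count how many students fall into each predicted grade bin
print(pd.Series(y_pred_dt).value_counts().sort_index())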
Next, we test our model using the 50:50 split.
#test model using 50:50 ratio
#fit the same DecisionTreeClassifier on the 50:50 training set
mod_dt1 = DecisionTreeClassifier(random_state = 100, max_depth = 3)
mod_dt1.fit(X_train_dt1,y_train_dt1)
y_pred_dt1=mod_dt1.predict(X_test_dt1)
pred_train_dt1 = mod_dt1.predict(X_train_dt1)
print("ACCURACY ON VALIDATION SET (50:50)")
print("Performance Accuracy (50:50): {:.2f} %".format(accuracy_score(y_pred_dt1,y_test_dt1)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt1,y_pred_dt1)*100)))
print("Classification Report")
classifyV_dt1 = classification_report(y_pred_dt1,y_test_dt1);
print(classifyV_dt1)
print("Confusion Matrix")
confusion_matrixV_dt1 = pd.crosstab(y_test_dt1, y_pred_dt1, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt1)
The result shows that our model accuracy is only 68%. From the confusion matrix, our model predicts that 24 students will get 1 (Grade A), 10 students will get 2 (Grade A-), 11 students will get 3 (Grade B+), and 5 students will get 4 (Grade B). It seems our model predicts that all the students pass this online course even with the 50:50 split.
Hyperparameter tuning is searching the hyperparameter space for a set of values that optimizes your model architecture. It is important for keeping model performance in good condition: tuning is a way to increase the performance accuracy, which in turn decreases the error percentage.
For our model, we tune only the criterion and max_depth parameters. To apply hyperparameter tuning to our model, we use GridSearchCV, a class from sklearn's model_selection module. It loops through predefined hyperparameter combinations, fitting our estimator (model) on the training set, so that in the end we can select the best parameters from the listed hyperparameters.
#Decision Tree Hyperparameter Tuning Using GridSearch
# importing libraries
from sklearn import decomposition
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
# Creating a StandardScaler object
std_slc = StandardScaler()
# Creating a PCA object
pca = decomposition.PCA()
# Creating a DecisionTreeClassifier
dec_tree = tree.DecisionTreeClassifier()
# Creating a pipeline of three steps. First, standardizing the data.
# Second, transforming the data with PCA.
# Third, training a Decision Tree Classifier on the data.
pipe = Pipeline(steps=[('std_slc', std_slc),
('pca', pca),
('dec_tree', dec_tree)])
# Creating Parameter Space
# Creating a list of integers from 1 up to the number of features in X (7)
n_components = list(range(1,X_dt.shape[1]+1,1))
# Creating lists of parameter for Decision Tree Classifier
criterion = ['gini', 'entropy']
max_depth = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30]
# Creating a dictionary of all the parameter options
# Note that we can access the parameters of steps of a pipeline by using '__'
parameters = dict(pca__n_components=n_components,
dec_tree__criterion=criterion,
dec_tree__max_depth=max_depth)
# Conducting Parameter Optimization With Pipeline
# Creating a grid search object
clf_GS = GridSearchCV(pipe, parameters)
# Fitting the grid search
clf_GS.fit(X_dt, y_dt)
# Viewing The Best Parameters
print('Best Criterion:', clf_GS.best_estimator_.get_params()['dec_tree__criterion'])
print('Best max_depth:', clf_GS.best_estimator_.get_params()['dec_tree__max_depth'])
print('Best Number Of Components:', clf_GS.best_estimator_.get_params()['pca__n_components'])
print(); print(clf_GS.best_estimator_.get_params()['dec_tree'])
Below is the result of hyperparameter tuning for our model. The search suggests that the best criterion is gini, with a best max_depth of 22. It also suggests that the best number of components is 4; since we want to analyze all the features in our dataset, we ignore this last suggestion.
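Before applying the suggested values, we can also check the cross-validated score of the best parameter combination using GridSearchCV's standard attributes (a minimal sketch):
# Mean cross-validated accuracy of the best parameter combination
print('Best CV accuracy: {:.2f} %'.format(clf_GS.best_score_*100))
# Full results for every combination, ranked by mean test score
cv_results = pd.DataFrame(clf_GS.cv_results_)
print(cv_results.sort_values('rank_test_score')[['params','mean_test_score']].head())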
This is how we apply the suggested values to our model. Based on the suggestion, we first test our model using the 30:70 split.
#after tuning test on 30:70 ratio
#refit the DecisionTreeClassifier with the values suggested by GridSearchCV
mod_dt = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth = 14)
mod_dt.fit(X_train_dt,y_train_dt)
y_pred_dt=mod_dt.predict(X_test_dt)
pred_train_dt = mod_dt.predict(X_train_dt)
print("ACCURACY ON VALIDATION SET-AFTER TUNING (30:70)")
print("Performance Accuracy: {:.2f} %".format(accuracy_score(y_pred_dt,y_test_dt)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt,y_pred_dt)*100)))
print("Classification Report")
classifyV_dt2 = classification_report(y_pred_dt,y_test_dt);
print(classifyV_dt2)
print("Confusion Matrix")
confusion_matrixV_dt2 = pd.crosstab(y_test_dt, y_pred_dt, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt2)
From the result below, we can see that our model's performance accuracy increased from 80% to 93.33%. From the confusion matrix, we can see that our model predicts 10 students will get 1 (Grade A), 9 students will get 2 (Grade B+), 3 students will get 3 (Grade B), 5 students will get 4 (Grade B-), 1 student will get 7 (Grade C), and 2 students will get 11 (Grade F). It seems our model's predictions have improved, since it now identifies the failed students that exist in our dataset.
This is how we test our model with the suggested values using the 50:50 split.
#after tuning test on 50:50 ratio
#refit the DecisionTreeClassifier with the values suggested by GridSearchCV
mod_dt1 = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth = 14)
mod_dt1.fit(X_train_dt1,y_train_dt1)
y_pred_dt1=mod_dt1.predict(X_test_dt1)
pred_train_dt1 = mod_dt1.predict(X_train_dt1)
print("ACCURACY ON VALIDATION SET-AFTER TUNING (50:50)")
print("Performance Accuracy: {:.2f} %".format(accuracy_score(y_pred_dt1,y_test_dt1)*100))
print("Performance Error: {:.2f} %".format(100-(accuracy_score(y_test_dt1,y_pred_dt1)*100)))
print("Classification Report")
classifyV_dt3 = classification_report(y_pred_dt1,y_test_dt1);
print(classifyV_dt3)
print("Confusion Matrix")
confusion_matrixV_dt3 = pd.crosstab(y_test_dt1, y_pred_dt1, rownames=['Actual'], colnames=['Predicted'], margins = True)
print (confusion_matrixV_dt3)
From the result below, our model's performance accuracy also increases for this split, from 68% to 84%. However, the confusion matrix shows that our model does not predict any failed students, and in the classification report the average precision is only 55%, the average recall 48%, and the F1-score only 51%.
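For reference, those averaged scores can also be computed directly with sklearn (a minimal sketch; we assume macro averaging, matching the classification report's macro avg row):
from sklearn.metrics import precision_recall_fscore_support

# Macro-averaged precision, recall, and F1 across all grade bins
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test_dt1, y_pred_dt1, average='macro', zero_division=0)
print("Precision: {:.2f} %  Recall: {:.2f} %  F1: {:.2f} %".format(
    precision*100, recall*100, f1*100))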
Below is our Decision Tree visualization based on the 30:70 test split. From the graph, we can see five features: Assignment, Forum, Tutorial, LectureNote, and Quiz. Two features are missing from this tree, Questionnaire and Activity. It seems these two features are not important and do not affect the student performance prediction.
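The visualization itself can be generated with sklearn's plot_tree (a minimal sketch; the figure size and styling are assumptions, and mod_dt is the tuned model fitted on the 30:70 split above):
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

feature_names = ['Assignment','Forum','Activity','LectureNote',
                 'Tutorial','Questionnaire','Quiz']
plt.figure(figsize=(20,10))
# Draw the fitted tree with feature names and colour-filled class nodes
plot_tree(mod_dt, feature_names=feature_names, filled=True, fontsize=8)
plt.show()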
From the above results, we can see how effective hyperparameter tuning is for our Decision Tree model. For the 30:70 split, the performance accuracy increased from 80% to 93.3%. For the 50:50 split, the performance accuracy increased from 68% to 84%. From all the results, we can conclude that the best split for the Decision Tree model in Python is 30:70.