First and foremost, the dataset needs to be uploaded to Google Drive so that it can be retrieved after verification. For feature selection, X_tdata holds the attributes to be tested, namely 'BLK', 'STL', 'MADE', 'REB', and 'FG (TOT ATT)', while Y_tdata holds 'STRENGTH', the label indicating whether a player's predicted strength is Offensive or Defensive.
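A minimal sketch of this step is shown below, assuming the notebook runs in Google Colab and the dataset is a CSV file; the file name and Drive path are placeholders, and the variable names are simplified to X and y.

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')  # requires verification before Drive can be read

# Hypothetical path; replace with the actual location of the dataset.
data = pd.read_csv('/content/drive/My Drive/players.csv')

# Attributes to be tested (features) and the target label.
X = data[['BLK', 'STL', 'MADE', 'REB', 'FG (TOT ATT)']]
y = data['STRENGTH']  # 'Offensive' or 'Defensive'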
KBinsDiscretizer is used because decision trees are built from complex data; discretization helps to reduce the impact of small fluctuations in the data and also reduces the range of the data's values. As shown in the figure, the number of bins is set to 3.
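A minimal sketch of the discretization step; the encoding and binning strategy shown here are illustrative assumptions, not values stated in the report.

from sklearn.preprocessing import KBinsDiscretizer

# Reduce each continuous feature to 3 ordinal bins.
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_binned = disc.fit_transform(X)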
With the number of bins equal to 2, the accuracy for both Split Data ratios increases after tuning.
With the number of bins equal to 3, the results are higher than with 2 bins even before tuning.
For Random Forest, the results with 2 bins show that the 0.7:0.3 Split Data ratio gives higher accuracy than the 0.5:0.5 ratio, even higher than both results with 3 bins.
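The comparison above could be reproduced with a loop such as the following sketch; the bin counts, split ratios, and classifier settings mirror the report, while the loop itself and the random_state value are assumptions about how the experiment was run.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

for n_bins in (2, 3):
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
    X_binned = disc.fit_transform(X)
    for test_size in (0.3, 0.5):  # Split Data ratios 0.7:0.3 and 0.5:0.5
        X_train, X_test, y_train, y_test = train_test_split(
            X_binned, y, test_size=test_size, random_state=1)
        tree = DecisionTreeClassifier(criterion='entropy', max_depth=6,
                                      random_state=1).fit(X_train, y_train)
        print(n_bins, test_size, accuracy_score(y_test, tree.predict(X_test)))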
DecisionTreeClassifier is imported to build the tree structure. The criterion is set to entropy, i.e. information gain; this criterion performs attribute selection so that the data is partitioned in the best way. The maximum depth of the tree is set to 6, while random_state, which seeds the randomness used when choosing splits, is set to 1.
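A sketch of this configuration, assuming the 0.7:0.3 split and the binned features from the earlier step:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X_binned, y, test_size=0.3, random_state=1)  # assumed 0.7:0.3 ratio

dec_tree = DecisionTreeClassifier(criterion='entropy',  # information gain
                                  max_depth=6,
                                  random_state=1)
dec_tree.fit(X_train, y_train)
print(accuracy_score(y_test, dec_tree.predict(X_test)))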
Before tuning, our decision tree achieves an accuracy of 0.6923 (69.23%).
GridSearch tuning is applied to both predictive models, the Decision Tree and the Random Forest. Pipeline, GridSearchCV, StandardScaler, and decomposition must be imported. Pipeline is used because it chains several steps that can be cross-validated together while setting different parameters.
From the given code, the pipeline is created with pipe = Pipeline(steps=[('std_slc', std_slc), ('pca', pca), ('dec_tree', dec_tree)]).
The parameter space is created with n_components = list(range(1, X.shape[1] + 1, 1)). The criteria searched with GridSearch are 'gini' and 'entropy', and the candidate values for the maximum depth are 2, 4, 6, 8, 10, and 12. The clf object combines the parameter space with the pipeline; it stores the trained model, which is then used to predict values.
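Putting the quoted fragments together, the tuning step could look like the following sketch; fitting on the training split and leaving the cross-validation settings at scikit-learn's defaults are assumptions.

from sklearn import decomposition
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

std_slc = StandardScaler()
pca = decomposition.PCA()
dec_tree = DecisionTreeClassifier()

# Chain scaling, PCA, and the tree so all steps are tuned together.
pipe = Pipeline(steps=[('std_slc', std_slc),
                       ('pca', pca),
                       ('dec_tree', dec_tree)])

# Parameter space: every possible number of PCA components, both
# criteria, and the listed maximum depths.
n_components = list(range(1, X.shape[1] + 1, 1))
parameters = dict(pca__n_components=n_components,
                  dec_tree__criterion=['gini', 'entropy'],
                  dec_tree__max_depth=[2, 4, 6, 8, 10, 12])

clf = GridSearchCV(pipe, parameters)  # stores the best trained model
clf.fit(X_train, y_train)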
After GridSearch tuning, the accuracy increases from 0.6923 (69.23%) to 0.8462 (84.62%).
A Random Forest classifier is a classifier that combines multiple Decision Tree models. For this process, RandomForestClassifier must first be imported. n_estimators is set to 100, which is the number of trees to build before taking the majority vote or averaging the predictions.
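A sketch of the untuned forest, assuming the same train/test split as before:

from sklearn.ensemble import RandomForestClassifier

# Build 100 trees; the forest predicts by majority vote.
# X_train, y_train come from the earlier split.
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)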
The result of our Random Forest before tuning is 0.6923 (69.23%), the same as the Decision Tree.
GridSearch tuning is used to optimize our model and its predictions. In the parameter grid, a few parameters are set: 'bootstrap' with the value True, 'max_depth' with the values 80, 90, 100, and 110, 'max_features' with the values 2 and 3, 'min_samples_leaf' with the values 3, 4, and 5, 'min_samples_split' with the values 8, 10, and 12, and finally 'n_estimators' with the values 100, 200, 300, and 1000.
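The grid above translates directly into scikit-learn as sketched below; using GridSearchCV with its default cross-validation and scoring is an assumption.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000],
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid)
grid_search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set.
y_pred = grid_search.best_estimator_.predict(X_test)
print(accuracy_score(y_test, y_pred))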
The tuned classifier is then created and, to obtain our accuracy score, accuracy_score must be imported. The test-set accuracy is 0.7692 (76.92%), which is an increase over the score before GridSearch tuning.
The Decision Tree accuracy increases after tuning when the 0.7:0.3 Split Data ratio is used; however, with the 0.5:0.5 ratio the result after tuning drops to 0.5556. This might occur because the 0.5:0.5 model does not use as much training data.
For Random Forest, the results for both Split Data ratios show that accuracy increases after GridSearch tuning, since tuning helps minimize the predetermined loss function and therefore gives better results.