In this step, we use the StudentEvent_Resample dataset from the local repository. The dataset contains 100 rows and 11 columns.
Create an empty process and drag the StudentEvent_Resample dataset into it. This creates a Retrieve operator.
Add a Select Attributes operator to choose which attributes will be analyzed. Connect the Retrieve operator to the Select Attributes operator and select the attributes. In this analysis, we do not analyze the StudentID, Marks, and MarksBin attributes.
Choose the subset filter type and enable the invert selection option.
Move the unwanted attributes to the right-hand list.
Add a Set Role operator and connect it to the Select Attributes operator.
Set Grade as the label.
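RapidMiner performs these steps entirely in the GUI, so there is nothing to type; for readers who prefer a scripted view, here is a rough pandas equivalent of Retrieve, Select Attributes, and Set Role. The CSV filename is hypothetical and assumes the dataset has been exported from the repository:

```python
import pandas as pd

# Load the exported dataset (hypothetical filename).
data = pd.read_csv("StudentEvent_Resample.csv")

# ~ Select Attributes with "subset" + "invert selection":
# drop the attributes we do not analyze.
data = data.drop(columns=["StudentID", "Marks", "MarksBin"])

# ~ Set Role: treat Grade as the label.
X = data.drop(columns=["Grade"])
y = data["Grade"]
```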
Add a Split Data operator and connect the Set Role operator to it. This operator partitions the given ExampleSet into the desired number of subsets according to the specified relative sizes.
Set the sampling type to automatic.
Set the ratio to 30:70.
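As a sketch of what Split Data does here, a scikit-learn equivalent is shown below. The post does not say which partition trains the model, so treating the 30% subset as training data is an assumption:

```python
from sklearn.model_selection import train_test_split

# 30:70 partition; the 30% subset is assumed to be the training data.
# stratify keeps the Grade distribution similar in both subsets,
# roughly comparable to RapidMiner's "automatic" sampling type.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=42
)
```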
Add a Decision Tree operator and connect one Split Data output to it. This operator generates a decision tree model, which can be used for classification and regression. Set the parameters as below:
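As a cross-check in code, the same training step could be sketched in scikit-learn. Note that sklearn's split criteria differ from RapidMiner's, so "entropy" stands in for information gain, and the depth limit is illustrative rather than the exact value from the screenshot:

```python
from sklearn.tree import DecisionTreeClassifier

# Train the tree on the training partition (assumes the remaining
# attributes are numeric). criterion and max_depth are illustrative.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=10,
                              random_state=42)
tree.fit(X_train, y_train)
```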
Add an Apply Model operator. Connect the other Split Data output and the Decision Tree model output as inputs to the Apply Model operator.
Add a Performance operator and connect the Apply Model output to it. Connect the Performance output to the result port.
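Apply Model and Performance together amount to scoring the held-out partition and computing metrics; a minimal sketch, continuing from the code above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# ~ Apply Model: score the held-out partition.
y_pred = tree.predict(X_test)

# ~ Performance: overall accuracy plus the confusion matrix that
# backs the per-class precision values.
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```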
Below is the accuracy of our model, which is 80.65%; it predicts that all students will pass this online course.
This is the visualization of the decision tree. The root node is Forum, and the tree predicts that all students will pass this online course.
For our model using the 50:50 ratio, the accuracy is 84%, which is higher than with the 30:70 ratio. However, the class precision is 100% only for predicting Grades B, F, and C.
Below is the decision tree generated for the 50:50 ratio; its root is also Forum. From this tree, we can see that it predicts some students will fail this online course.
Using the same Decision Tree model and the same dataset, we simply add an Optimize Parameters (Grid) operator to tune the Decision Tree parameters, aiming to increase prediction accuracy and improve the performance values.
The Optimize Parameters (Grid) operator is a nested operator. It executes the subprocess for all combinations of selected values of the parameters and then delivers the optimal parameter values through the parameter set port.
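In scikit-learn terms, this is what a grid search does. The sketch below uses a placeholder grid just to show the mechanics (exhaustive combinations, optional parallel execution, best parameter set returned); the parameters we actually tune are listed further down:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Conceptual equivalent of Optimize Parameters (Grid): evaluate every
# combination in the grid and keep the best-performing one.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6, 8, 10]},  # placeholder grid
    scoring="accuracy",
    n_jobs=-1,  # parallel execution, like the operator's option
)
search.fit(X_train, y_train)  # X_train/y_train from the earlier sketch
print(search.best_params_)    # ~ the "parameter set" output
```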
This is the view of our main process. Since Optimize Parameters (Grid) is a nested operator, all operators related to the Decision Tree model are placed inside its subprocess.
This is the view of the subprocess inside Optimize Parameters (Grid).
This is the most important part of optimizing the Decision Tree parameters.
For this optimizer, we enable log performance, log all criteria, and parallel execution.
This is where we set which parameters will be tuned by Optimize Parameters (Grid). For the Decision Tree model, there are 8 more parameters that can be optimized to increase model performance, but for this project we choose only two: DecisionTree.criterion and DecisionTree.maximal_depth.
1. DecisionTree.criterion
Selects the criterion on which Attributes will be selected for splitting. For each of these criteria, the split value is optimized with regard to the chosen criterion. It can have one of the following values:
information_gain: The entropies of all the Attributes are calculated, and the Attribute with the least entropy (i.e., the highest information gain) is selected for the split. This method has a bias towards selecting Attributes with a large number of values.
gain_ratio: A variant of information gain that adjusts the gain for each Attribute to account for the breadth and uniformity of its values, reducing the bias towards Attributes with many values.
gini_index: A measure of inequality between the distributions of label characteristics. Splitting on a chosen Attribute results in a reduction in the average gini index of the resulting subsets.
accuracy: An Attribute is selected for splitting, which maximizes the accuracy of the whole tree.
least_square: An Attribute is selected for splitting which minimizes the squared distance between the average of the values in the node and the true value.
2. DecisionTree.maximal_depth
The depth of a tree varies depending upon the size and characteristics of the ExampleSet. This parameter is used to restrict the depth of the decision tree. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree. In this case the tree is built until other stopping criteria are met. If its value is set to '1', a tree with a single node is generated.
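Translated into a scikit-learn grid, the two chosen parameters might look like the sketch below. sklearn implements only a subset of RapidMiner's criteria, so this approximates the search space rather than reproducing the exact optimizer configuration:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# The two tuned parameters as a scikit-learn grid. sklearn offers no
# gain_ratio, accuracy, or least_square criteria, so the grid only
# approximates the RapidMiner choices; max_depth=None plays the role
# of maximal_depth = -1 (no bound on tree depth).
param_grid = {
    "criterion": ["entropy", "gini"],          # ~ DecisionTree.criterion
    "max_depth": [None] + list(range(1, 21)),  # ~ DecisionTree.maximal_depth
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```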
From the result below, the accuracy of our model after tuning is 100%. This means every prediction from this model is correct!
Below is the tree generated for our model; the root node is now Assignment.
Below is the result for our model using the 50:50 ratio; the accuracy increases to 96%. All class precision percentages are 100% except those for predicting Grades B and B+.
From the tree below, we can see that Quiz is the most important attribute, followed by Assignment.
For the Decision Tree in RapidMiner, we can see that the best model uses the 30:70 ratio.