In this step, we use the StudentEvent_Resample dataset from the local repository. The dataset contains 100 rows and 11 columns.
Create an empty process and drag the StudentEvent_Resample dataset into the blank process; this creates a Retrieve operator.
Add the Select Attributes operator to choose the attributes to be analyzed, and connect the Retrieve operator to it. In this analysis, we exclude the StudentID, Marks, and MarksBin attributes.
Set the attribute filter type to subset, enable the invert selection option, and move the unwanted attributes to the right side.
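The attribute selection above can be sketched outside RapidMiner with pandas. This is a hypothetical illustration: the small DataFrame below stands in for the dataset, and only the three column names dropped in the tutorial (StudentID, Marks, MarksBin) are taken from it; the other columns are invented.

```python
import pandas as pd

# Toy stand-in for StudentEvent_Resample; only the dropped column
# names come from the tutorial, the rest are illustrative.
df = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "Marks": [55, 72, 88],
    "MarksBin": ["50-59", "70-79", "80-89"],
    "Attendance": [0.8, 0.9, 1.0],
    "Grade": ["C", "B", "A"],
})

# "Subset + invert selection" = keep everything EXCEPT the listed attributes.
unwanted = ["StudentID", "Marks", "MarksBin"]
selected = df.drop(columns=unwanted)
print(list(selected.columns))
```

This mirrors the invert-selection behavior: the named subset is removed and every remaining attribute is kept.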
Add the Set Role operator and connect it to the Select Attributes operator, then use it to set Grade as the label.
Add the Split Data operator and connect the Set Role operator to it. This operator partitions the given ExampleSet into the desired number of subsets according to the specified relative sizes.
Set the sample type to automatic and use a 30:70 split ratio.
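The 30:70 partition can be reproduced with scikit-learn's `train_test_split` as a minimal sketch. The arrays below are synthetic placeholders for the 100-row dataset; which side of the 30:70 split is used for training is an assumption here.

```python
from sklearn.model_selection import train_test_split

# Synthetic placeholders: 100 rows, like the tutorial dataset.
X = [[i] for i in range(100)]
y = ["A" if i % 2 else "B" for i in range(100)]

# 30% in the first partition, 70% in the second, mirroring the 30:70 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # 30 70
```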
Add the Naive Bayes operator and connect a Split Data output to it. This operator generates a Naive Bayes classification model. Set the parameters as below:
Add the Apply Model operator. Connect the outputs from the Split Data and Naive Bayes operators as inputs to the Apply Model operator.
Add the Performance operator, connect Apply Model to it, and connect the Performance output to the result port.
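The whole chain (Split Data → Naive Bayes → Apply Model → Performance) can be sketched with scikit-learn's `CategoricalNB` standing in for RapidMiner's Naive Bayes operator. All data here is randomly generated, so the printed accuracy is illustrative only, not the tutorial's result.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 100 rows, 8 categorical attributes, a 3-class label.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 8))
y = rng.choice(["A", "B", "C"], size=100)

# Split Data operator: 30:70 partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, random_state=0)

# Naive Bayes operator (min_categories guards against unseen test values).
model = CategoricalNB(min_categories=3).fit(X_train, y_train)
pred = model.predict(X_test)          # Apply Model operator
acc = accuracy_score(y_test, pred)    # Performance operator
print(f"accuracy: {acc:.2%}")
```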
Distribution Table
Using the same Naive Bayes model and the same dataset, we add the Optimize Parameters (Grid) operator to tune the Naive Bayes parameters, increasing our prediction accuracy and performance values.
The Optimize Parameters (Grid) operator is a nested operator. It executes the subprocess for all combinations of selected values of the parameters and then delivers the optimal parameter values through the parameter set port.
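The grid search that Optimize Parameters (Grid) performs can be sketched with scikit-learn's `GridSearchCV`. Here `CategoricalNB`'s `alpha` plays the role of RapidMiner's `laplace_correction` setting (a tiny alpha approximates "off", alpha=1 is add-one smoothing); the data is synthetic, so this only illustrates the mechanism.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(100, 8))
y = rng.choice(["A", "B", "C"], size=100)

# Try every combination in the grid, then report the best one, like the
# parameter set delivered through the operator's parameter set port.
grid = GridSearchCV(
    CategoricalNB(min_categories=3),
    param_grid={"alpha": [1e-10, 1.0]},  # laplace_correction off vs. on
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```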
This is the view of our main process. Since Optimize Parameters (Grid) is a nested operator, all the operators related to the Naive Bayes model are placed inside its subprocess.
This is the view of the subprocess inside Optimize Parameters (Grid).
This is the most important part of optimizing the Naive Bayes parameters.
We choose to check log performance and log all criteria, and enable parallel execution. For error handling, choose fail on error.
This is where we set which parameters will be tuned by Optimize Parameters (Grid). For the Naive Bayes model, only one parameter can be optimized to increase model performance: Laplace correction.
laplace_correction
The simplicity of Naive Bayes includes a weakness: if within the training data a given Attribute value never occurs in the context of a given class, then the conditional probability is set to zero. When this zero value is multiplied together with other probabilities, those values are also set to zero, and the results will be misleading. Laplace correction is a simple trick to avoid this problem, adding one to each count to avoid the occurrence of zero values. For most training sets, adding one to each count has only a negligible effect on the estimated probabilities.
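The zero-frequency problem and its add-one fix can be shown with a small worked example. The counts below are invented for illustration; the helper function is hypothetical, not part of RapidMiner.

```python
def cond_prob(count, class_total, n_values, laplace=True):
    """P(attr=value | class) with optional add-one (Laplace) smoothing.

    count:       times this value occurs with the class in training data
    class_total: training rows belonging to the class
    n_values:    number of possible values for the attribute
    """
    if laplace:
        return (count + 1) / (class_total + n_values)
    return count / class_total

# Suppose Grade=F has 10 training rows, the attribute has 4 possible values,
# and one value never occurs together with Grade=F:
print(cond_prob(0, 10, 4, laplace=False))  # 0.0 -> zeroes out the whole product
print(cond_prob(0, 10, 4, laplace=True))   # 1/14, small but nonzero
```

With smoothing, the never-seen value gets a small nonzero probability, so a single missing combination no longer drives the entire class probability to zero.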
Based on the result below, we can see that the accuracy for Naive Bayes increases to 80.65%. Class precision is 100% only for predicting Grades A-, B-, and C.
For Naive Bayes with the 50:50 ratio, the accuracy also increased, to 70%, but it is still below 80%. Class precision is 100% only for predicting Grades B-, F, and C.
This is the SimpleDistribution that summarizes the distribution of our target, Grade. All grades have 7 distributions, but with different values.
For Naive Bayes in RapidMiner, we can see that our model improves after tuning. But for the model using the 50:50 ratio, the accuracy is still below 80%, which means the model's error rate is 30%. For Naive Bayes in RapidMiner, the best performance is 80.65%, using the 30:70 ratio.