This section presents the results and analysis for our project. For descriptive analytics, we do not compare the results of K-Means Clustering and Agglomerative Clustering; the performance comparison applies only to predictive analytics, between the Decision Tree and Naive Bayes algorithms.
This section focuses on the results and analysis produced by the Descriptive Analysis using Rapidminer and Python. We have chosen two clustering algorithms: K-Means Clustering and Agglomerative Clustering.
The K-Means algorithm is an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points belonging to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
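As an illustration only, below is a minimal K-Means sketch with scikit-learn. The event-count matrix X is synthetic stand-in data, not our actual course export; the shape (40 students, 7 event types) is a hypothetical example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the per-student event-count matrix:
# one row per student, one column per event type (Assignment, Forum, ...).
rng = np.random.default_rng(42)
X = rng.integers(0, 50, size=(40, 7)).astype(float)

# K-Means with k=6 minimizes the within-cluster sum of squared distances.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # one cluster index per student
print(kmeans.inertia_)           # total squared distance to the centroids
```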
In this project, we determined the k-value using the Elbow Method and the Silhouette Method with the yellowbrick library in Python. Based on the results, we set our k-value to 6.
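A minimal sketch of how both methods can be run with yellowbrick, reusing the hypothetical X from the sketch above (the search range of k = 2 to 10 is an assumption):

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Elbow Method: plots distortion for k = 2..10 and marks the elbow point.
elbow = KElbowVisualizer(KMeans(n_init=10, random_state=42), k=(2, 11))
elbow.fit(X)
elbow.show()

# Same visualizer using the silhouette score instead of distortion.
sil = KElbowVisualizer(KMeans(n_init=10, random_state=42),
                       k=(2, 11), metric="silhouette")
sil.fit(X)
sil.show()
```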
This is the overview for K-Means Clustering in Rapidminer. We set the k-value to 6, as suggested by the Elbow Method and the Silhouette Method in Python. The biggest cluster is Cluster 0 with 22 members, followed by Cluster 2 with 8 members. From this overview, we can see that Forum, Quiz, and LectureNote are the events that students access most, as shown in Cluster 0, while Assignment, LectureNote, and Activity are the most accessed events for Cluster 2.
Rapidminer gives us a good solution for identifying the features and members of each cluster. As we can see, Forum, Quiz, and LectureNote are the events that most students access in this online course. Students access Quiz because it usually contributes marks, while the Forum is a suitable place for students to share their knowledge or experience, which can serve as a reference for other students.
This is the K-Means Clustering visualization using a scatter plot in Python. As we can see, Cluster 0 is the largest cluster. Clusters 3, 4, and 5 each have only 1 member, Cluster 1 has 2 members, Cluster 2 has 3 members, and the rest of the members belong to Cluster 0.
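For reference, a minimal matplotlib sketch of such a scatter plot, reusing X and labels from the first sketch. Which two features (or projection) the actual plot used is not stated, so the first two columns are an assumption:

```python
import matplotlib.pyplot as plt

# Colour each point by its K-Means cluster label.
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("K-Means clusters (k = 6)")
plt.show()
```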
Below is our output for Agglomerative Clustering in Rapidminer. The left side is the Cluster Description, which shows the number of clusters, and the right side shows the members of each cluster. As we can see, Cluster 1 is the biggest cluster, with 29 members. From the table, we can see the members of each cluster and the value of each feature.
Below is the output generated in Python. On the left side is the visualization of Agglomerative Clustering using a dendrogram, while on the right side we use a scatter plot to visualize the same clustering.
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as the output of hierarchical clustering, and its main use is to work out the best way to allocate objects to clusters. Based on our research and findings, unlike K-Means Clustering, where we can use the Elbow and Silhouette Methods to determine the k-value, here the number of clusters is read from the dendrogram itself. The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together.
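A minimal SciPy sketch of building and drawing such a dendrogram from the hypothetical X above; Ward linkage is an assumption about the linkage criterion used:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage merges the pair of clusters that least increases the
# within-cluster variance; the y-axis shows the distance (height) at
# which each pair of clusters is joined.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()
```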
We chose to draw a line that yields 6 clusters, the same as K-Means, to see whether Agglomerative Clustering produces the same or similar results as K-Means Clustering. The dendrogram below shows the hierarchical clustering of seven features, namely Assignment, Forum, Activity, LectureNote, Tutorial, Questionnaire, and Quiz, based on 6 clusters. The x-axis is the predicted value for the MarksBin feature (1-11). From the scatter plot, we can see that Cluster 2 is the biggest cluster. Clusters 3, 4, and 6 each have only 1 member, Cluster 1 has 3 members, and Cluster 5 has 2 members, which means that Cluster 2 has a total of 26 members.
Dendrogram Visualization
Scatter Plot Visualization
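For reference, a minimal scikit-learn sketch of producing the six-cluster Agglomerative result and its scatter plot, reusing the hypothetical X (Ward linkage again assumed):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Cut the hierarchy at 6 clusters, mirroring the k chosen for K-Means.
agg = AgglomerativeClustering(n_clusters=6, linkage="ward")
agg_labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=agg_labels, cmap="viridis")
plt.title("Agglomerative clusters (6 clusters)")
plt.show()
```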
As for Agglomerative Clustering, even though we use the same number of clusters in Rapidminer and Python, it did not produce the same results as we expected.
This section discusses the results of the Predictive Analysis (Decision Tree and Naive Bayes algorithms) in Rapidminer and Python.
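Before the comparisons, here is a minimal sketch of how both models can be trained and scored in scikit-learn. The features and grade labels are synthetic stand-ins, and reading "30:70" as 30% train / 70% test is our assumption about the split direction:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical data: event counts plus an encoded grade label
# (5 classes here; the real report uses more grade classes).
rng = np.random.default_rng(0)
X_clf = rng.integers(0, 50, size=(200, 7)).astype(float)
y = rng.integers(0, 5, size=200)

# "30:70" interpreted as 30% train / 70% test (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X_clf, y, train_size=0.3, random_state=42, stratify=y)

for model in (DecisionTreeClassifier(random_state=42), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          accuracy_score(y_test, model.predict(X_test)))
```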
Based on these two results, we can see that the accuracy of the Decision Tree model in Rapidminer is higher than that of the Decision Tree model in Python. In terms of class precision percentage, the model in Rapidminer predicts precisely for Grades A, B+, and C, while in Python the model is precise in predicting Grades B+ and C. The predicted number of students per grade also differs.
From the results below, the model accuracy in Rapidminer is still the best at 84%, while in Python it is only 68%, a noticeable difference. In terms of precision percentage, the Decision Tree in Rapidminer is precise in predicting Grades B, C, and F, while the Python model is precise in predicting Grades A and B.
Even though we use the same parameters in the hyperparameter tuning, the model in Rapidminer still achieves the highest accuracy, 100%, and therefore all of its precision percentages are also 100%. This means that every prediction in Rapidminer is correct. The accuracy of the model in Python only increases to 93.33%, where the precision percentage for predicting 2 (Grade A-) is only 82%.
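A minimal sketch of such tuning with scikit-learn's GridSearchCV, reusing the split above; the report does not list the exact parameters searched, so this grid is hypothetical:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search space -- not the report's actual grid.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_estimator_.score(X_test, y_test))  # held-out accuracy
```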
As with the 30:70 ratio, the performance accuracy of the Rapidminer model is higher than in Python, at 96%. Its precision is less accurate in predicting Grades B and B+.
Rapidminer still leads for Naive Bayes, where the accuracy differs by only 0.97%. With only 70.97% accuracy, the precision percentage for Naive Bayes in Rapidminer is 100% in predicting Grades A-, B-, B+, and C. For the Naive Bayes model in Python, with an accuracy of 70%, it is able to predict precisely for 5 and 7 (Grades C and C-).
This is the result of the Naive Bayes performance before tuning. From the results, we can see that the accuracy of the model in Rapidminer is only 66%, while for Python it is 72%. This is the first time Python leads.
After tuning the Naive Bayes parameters, the performance accuracy in Python is the highest at 83.33%, while the Rapidminer model's accuracy is 80.65%. The only classes for which the Python model's precision is below 100% are 2 and 3 (Grades A- and B+).
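A minimal sketch of tuning Gaussian Naive Bayes the same way; in scikit-learn, var_smoothing is essentially its only tunable parameter, and the grid below is an assumption rather than the report's actual search space:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Search var_smoothing over a log-spaced range (a common convention).
nb_search = GridSearchCV(GaussianNB(),
                         {"var_smoothing": np.logspace(0, -9, 10)},
                         cv=3, scoring="accuracy")
nb_search.fit(X_train, y_train)
print(nb_search.best_params_)
print(nb_search.best_estimator_.score(X_test, y_test))
```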
Unlike the result at the 30:70 ratio, the model in Rapidminer improves by only 4%, which still leaves it below 80%. It seems the tuning is not effective for this model at the 50:50 ratio, whereas the model accuracy in Python increases to 84%.
From the above comparison, we can conclude that the best ratio is 30:70 and the best model is the Decision Tree in Rapidminer, with a performance accuracy of 100%.