My group members and I ran several experiments to find the best model for the data product. There were three experiments in total, and each one was tested with the Decision Tree and Naive Bayes algorithms using two train-test split ratios, 80:20 and 60:40. Based on the results above, Decision Tree achieved higher accuracy than Naive Bayes. After completing all the experiments, we carried out hyperparameter tuning to see whether the accuracy of each experiment stayed the same or improved. Hyperparameter tuning adjusts the settings of the model, rather than the dataset, to get better performance. We used Grid Search for the tuning, and some of the results did improve. To sum up, for Python the best model is Decision Tree in experiment 1 (resampling) with the 80:20 ratio before hyperparameter tuning, while for RapidMiner the best model is Decision Tree in experiment 1 (resampling) with the 80:20 ratio after hyperparameter tuning. The best experiment is the same in both tools, experiment 1; the only difference is that Python scored higher before tuning while RapidMiner scored higher after tuning. We can therefore say that Decision Tree is the best algorithm for predictive data mining in this project.
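As a rough illustration of what Grid Search does, the sketch below tries every combination of candidate hyperparameter values and keeps the one with the highest score. It is a minimal, standard-library sketch and not the code used in the project: the function names `grid_search` and `toy_score`, the parameter names, the candidate values, and the scoring formula are all assumptions made up for the example. In practice the scoring step would train a Decision Tree and return its validation accuracy.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for "train a Decision Tree with these settings
    # and return its validation accuracy". The formula is invented for the
    # example and does not reproduce any project result.
    return (0.70
            - 0.01 * abs(params["max_depth"] - 5)
            - 0.005 * abs(params["min_samples_split"] - 4))


# Assumed candidate values; the grid used in the project is not stated.
param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 4, 8]}

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, round(best_score, 2))
```

Grid Search only compares the candidates it is given, so the outcome depends on which values are placed in the grid.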
After completing all the experiments for every ratio and algorithm, these are the results we obtained. Overall, Decision Tree has the best accuracy, as it scored higher than Naive Bayes. RapidMiner produced slightly better results than Python, but the difference is small. The 80:20 ratio clearly gives better accuracy than the 60:40 ratio because it leaves more data for training, so the model learns from more examples. We also did hyperparameter tuning to see whether the accuracy of the experiments would increase. After applying Grid Search as the hyperparameter tuning method, we can see that the accuracy of the models did increase. To conclude, the best model for financial level using RapidMiner is Decision Tree in experiment 1 (resampling only) with the 80:20 ratio after hyperparameter tuning (74.38%), while for Python the best model is Decision Tree in experiment 1 (resampling only) with the 80:20 ratio after hyperparameter tuning (69.55%). Hence, the best model is the resampled Decision Tree with an 80:20 ratio, regardless of whether RapidMiner or Python is used.
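To make the two ratios concrete, the sketch below splits a dataset at 80:20 and at 60:40 and shows how accuracy is computed from predictions. It is a minimal sketch using only the Python standard library, not the project's code: the helper names `holdout_split` and `accuracy`, the seed, and the 1,000-row placeholder dataset are assumptions for illustration only.

```python
import random


def holdout_split(rows, train_fraction, seed=42):
    """Shuffle a copy of rows and split it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Placeholder dataset of 1,000 row indices (assumed size, not the real data).
rows = list(range(1000))

split_sizes = {}
for label, fraction in (("80:20", 0.8), ("60:40", 0.6)):
    train, test = holdout_split(rows, fraction)
    split_sizes[label] = (len(train), len(test))
    print(label, "train rows:", len(train), "test rows:", len(test))
```

With the same dataset, the 80:20 split trains on more rows than the 60:40 split, which is the difference the comparison above relies on.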
This project was very challenging and interesting to me. I spent a lot of time reading various pages and sources on Google to understand how to complete it. I did not struggle with Power BI and RapidMiner as much as with Python because I am already used to them; Python, however, was rather new to me since I had never learned it and had no basic knowledge of it. So, I spent a lot of time on Python trying to understand how the algorithms work. Nevertheless, thanks to Dr Nurfadhlina's lectures, I managed to understand Python better. Thus, I would like to express my gratitude to Dr Nurfadhlina for always helping and guiding me throughout this entire journey. I am extremely thankful to Dr Nurfadhlina for being very patient with me and my group and for spending some of her time on a mentoring session for us. Although we sometimes faced difficulties, we still managed to come up with good solutions. Other than that, I have also learned several new skills and gained new knowledge that will come in handy later on. I can see how important Data Mining is, and that every step is crucial to producing an accurate model and an accurate data product. To wrap it all up, it was an amazing experience to learn Data Mining this semester. I will certainly apply the skills I learned in this course from today onwards.