Based on all the experiments done in this project, it is clear that measuring a machine learning model's accuracy is essential before building an application around it. Without measuring accuracy, there is no way to know whether the model works at all. Unlike regular code, which is tested under the assumption that it behaves exactly as designed 100% of the time, machine learning code is expected to fail on some number of samples, so measuring that number of failures is the key to testing a machine learning model. The reason we measure a model's accuracy is so that we can improve it. Hyperparameter tuning is one way to increase the model's accuracy; the goal is always to reduce the error so that the model predicts more precisely.
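As a minimal sketch of what hyperparameter tuning can look like in Python (this assumes a scikit-learn decision tree and a small illustrative parameter grid, not the exact model or grid used in our experiments), a grid search tries each combination of parameters and keeps the one with the best cross-validated accuracy:

```python
# Hedged sketch: GridSearchCV tries every combination in param_grid and
# keeps the one with the best cross-validation accuracy. The dataset,
# model, and grid here are placeholders, not the project's actual setup.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```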
Other than that, the prediction depends heavily on the dataset we use. In our case, not all eleven grades appear in the dataset, so the model cannot predict a grade it has never seen. It also looks as if the model will predict that every student passes the course, since very few students in the data actually failed. Because the model learns from the training and test data, its predictions can only become more precise if the dataset contains examples of all the students' grades.
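A quick way to see this imbalance is to check the class distribution before training. The sketch below assumes a CSV file and a "Grade" column, which are hypothetical names used only for illustration:

```python
# Hedged sketch: inspect how many rows each grade has. A heavily skewed
# distribution explains why a model tends to predict the majority outcome
# (everyone passes). File and column names here are assumed.
import pandas as pd

df = pd.read_csv("students.csv")                  # hypothetical file name
print(df["Grade"].value_counts())                 # rows per grade
print(df["Grade"].value_counts(normalize=True))   # proportion per grade
```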
In conclusion, measuring a machine learning model's accuracy is essential for making precise predictions and for ensuring the model is useful to stakeholders.
For me, this was a challenging course. To finish this project, I needed to learn many tools and applications that are useful in Data Analytics, but the time was very limited, and for some of these tools I did not manage to explore all the features.
Our dataset was also too small to split into training and test sets, so we needed to resample it. Resampling in Python was not easy because we have many classes, eleven grades, compared to only 35 rows of data; because of that, we could not apply the SMOTE technique. We also tried random sampling and managed to increase the total number of rows, but the sample changed every time we ran it. Finally, we tried sampling by majority and minority, where the minority classes are balanced against the majority class; this balanced the data across all grades, but the model's accuracy dropped. Fortunately, RapidMiner has a Bootstrap sampling feature, and with it we managed to increase our dataset from 35 to 100 rows, and from there the journey began. In the end, this project involved many experiments, all done to ensure that we produced the best machine learning model we could.
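For reference, the majority/minority balancing we tried can be sketched in Python as below. This is an assumed illustration using sklearn.utils.resample and a hypothetical "Grade" column; fixing random_state is also what stops the resampled data from changing on every run, which was the problem we hit with plain random sampling:

```python
# Hedged sketch of majority/minority oversampling with a fixed seed.
# Each minority grade is upsampled (with replacement) to the size of the
# largest grade. File and column names are assumptions for illustration.
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("students.csv")                  # hypothetical file name
majority_size = df["Grade"].value_counts().max()

balanced_parts = []
for grade, group in df.groupby("Grade"):
    balanced_parts.append(
        resample(group, replace=True, n_samples=majority_size,
                 random_state=42))               # fixed seed: reproducible

balanced_df = pd.concat(balanced_parts).reset_index(drop=True)
print(balanced_df["Grade"].value_counts())        # now balanced per grade
```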
In conclusion, it was a great experience to take this course, because we learned so much about what Data Analytics is all about. "From nothing to everything."
Many thanks to my team members, who always gave their full commitment to this project. Finally, we managed to finish it. Thank you!
Noor Azura Binti Abd Aziz
Zarin Arni Binti Hashim