The algorithms I chose were logistic regression, decision tree, random forest, XGBoost, and support vector machine (SVM).
The dataset used for this project is the Subsentence retail dataset published in Data in Brief. I reduced this dataset to a minimum dataset in the previous project. For each classifier, I split the data into 80% training and 20% test sets. All columns except Purchase intention were assigned to X, and Y contains only Purchase intention, as it is the target variable to be predicted. I used grid search to optimise the hyperparameters of each classifier to achieve the best performance, and then evaluated each classifier on accuracy, precision, recall and F1-score.
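The procedure described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the data is a synthetic stand-in generated with `make_classification`, the feature names, `random_state` values, hyperparameter grid and `cv=3` are assumptions, and weighted averaging of precision, recall and F1 is also an assumption. Logistic regression is used here, and the same pattern would apply to the other classifiers with their own grids.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the adapted dataset (assumption: the real data is not shown here).
features, target = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
df = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(6)])
df["Purchase intention"] = target

# X holds every column except the target; y holds only the target.
X = df.drop(columns=["Purchase intention"])
y = df["Purchase intention"]

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Grid search over a small, hypothetical hyperparameter grid.
param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# Evaluate the best estimator found by the search on the held-out test set.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted", zero_division=0
)

print("Best parameters:", search.best_params_)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

With weighted averaging, recall is always equal to accuracy. The scores printed by this sketch come from synthetic data and are unrelated to the results reported for each classifier.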
Accuracy: 0.72
Precision: 0.74
Recall: 0.72
F1 Score: 0.70
Logistic regression shows reasonable performance but is outperformed by all the other models on every metric. Based on these results, it is less suitable for predicting purchase intention on this specific dataset, so the remaining models are considered instead.
Accuracy: 0.81
Precision: 0.82
Recall: 0.81
F1 Score: 0.80
The decision tree performs well, with all measures at or above 80%. It is slightly less accurate than random forest and XGBoost, but its accuracy and precision are still high.
Accuracy: 0.84
Precision: 0.86
Recall: 0.84
F1 Score: 0.84
Random forest has the highest scores across all metrics, making it the top-performing model, with all measures at or above 84%. It has the highest accuracy (0.84) for predicting purchase intention on this specific dataset.
Accuracy: 0.82
Precision: 0.85
Recall: 0.82
F1 Score: 0.82
XGBoost performs well, with all measures at or above 82%, and has the second-highest scores across all metrics. It is therefore also quite accurate at predicting purchase intention on this specific dataset.
Accuracy: 0.79
Precision: 0.81
Recall: 0.79
F1 Score: 0.79
SVM performs reasonably well, with all measures at or above 79%. It is slightly less accurate than random forest, XGBoost and the decision tree, but it outperforms logistic regression.
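To compare the five classifiers side by side, the reported scores can be collected in one structure and ranked. The sketch below uses only the figures given above. Ranking by F1 score is an assumption made for illustration, and the variable names are hypothetical.

```python
# Metrics exactly as reported for each classifier.
results = {
    "Logistic regression": {"accuracy": 0.72, "precision": 0.74, "recall": 0.72, "f1": 0.70},
    "Decision tree": {"accuracy": 0.81, "precision": 0.82, "recall": 0.81, "f1": 0.80},
    "Random forest": {"accuracy": 0.84, "precision": 0.86, "recall": 0.84, "f1": 0.84},
    "XGBoost": {"accuracy": 0.82, "precision": 0.85, "recall": 0.82, "f1": 0.82},
    "SVM": {"accuracy": 0.79, "precision": 0.81, "recall": 0.79, "f1": 0.79},
}

# Rank the models from best to worst by F1 score (assumed ranking criterion).
ranking = sorted(results, key=lambda name: results[name]["f1"], reverse=True)
best_model = ranking[0]

# For each metric, find which model scores highest.
metrics = ["accuracy", "precision", "recall", "f1"]
leaders = {m: max(results, key=lambda name: results[name][m]) for m in metrics}

for name in ranking:
    scores = results[name]
    print(f"{name}: " + ", ".join(f"{m}={scores[m]:.2f}" for m in metrics))
print("Best model by F1:", best_model)
```

The ranking is the same on all four metrics: random forest, XGBoost, decision tree, SVM, and then logistic regression.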
According to the evaluation metrics, the best model for this dataset was the random forest classifier. The trained model was saved to a pickle file so that it can be used without reloading the dataset and retraining; once new data is imported, the model can still predict a Purchase intention level. A snippet of the code is shown below.
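A minimal sketch of saving and reloading a model with pickle is given here as an illustration; it is not necessarily the code used in the project. The training data is a synthetic stand-in, and the file name `random_forest_model.pkl`, the temporary directory and the hyperparameters are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (assumption: the real data is not shown here).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Save the fitted model to disk (hypothetical file name and location).
model_path = Path(tempfile.gettempdir()) / "random_forest_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, or in another script: load the model without retraining.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# New rows must have the same columns, in the same order, as the training data.
new_rows = X[:5]
predictions = loaded_model.predict(new_rows)
print("Predicted purchase intention levels:", predictions)
```

Pickle files should only be loaded from a trusted source, and the model should be loaded with the same scikit-learn version that was used to save it.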
I recommend that the business review the factors that matter most for purchase intention to see where it can make improvements. I also advise the business to fill in its own values for the different variables, as a way to see how it views itself and what level of purchase intention the random forest model predicts. Finally, I advise the business to collect customer data for one week every three months, to check whether changes to, for example, perceived product quality did in fact increase purchase intention.