GitHub Repo : github.com/Oshada-Kasun/complete-analysis-of-online-Shopping-behavior
The dataset consists of 10 numerical and 8 categorical attributes.
The 'Revenue' attribute is used as the class label.
"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories.
The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another.
The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by "Google Analytics" for each page in the e-commerce site.
The value of "Bounce Rate" feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.
The value of "Exit Rate" feature for a specific web page is calculated over all pageviews of that page, and it represents the percentage of those pageviews that were the last in the session.
The "Page Value" feature represents the average value for a web page that a user visited before completing an e-commerce transaction.
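As a rough illustration of the bounce-rate and exit-rate definitions above, the sketch below computes both per page from a toy list of sessions. The function name `bounce_and_exit_rates` and the sample sessions are hypothetical, and a bounce is simplified to "a session with a single pageview"; Google Analytics itself counts any session with a single request to the analytics server.

```python
from collections import Counter


def bounce_and_exit_rates(sessions):
    """Return (bounce_rate, exit_rate) dicts keyed by page name.

    sessions: list of lists of page names, each in visit order (toy format).
    """
    entrances, bounces = Counter(), Counter()
    pageviews, exits = Counter(), Counter()
    for pages in sessions:
        if not pages:
            continue  # ignore empty sessions
        entrances[pages[0]] += 1
        if len(pages) == 1:
            # Simplification: a one-pageview session counts as a bounce.
            bounces[pages[0]] += 1
        for page in pages:
            pageviews[page] += 1
        exits[pages[-1]] += 1
    # Bounce rate: bounces divided by sessions that entered on the page.
    bounce_rate = {p: bounces[p] / n for p, n in entrances.items()}
    # Exit rate: exits divided by all pageviews of the page.
    exit_rate = {p: exits[p] / n for p, n in pageviews.items()}
    return bounce_rate, exit_rate


# Hypothetical sessions, for illustration only.
sessions = [
    ["home"],
    ["home", "product", "cart"],
    ["product", "home"],
    ["home", "product"],
]
bounce_rate, exit_rate = bounce_and_exit_rates(sessions)
print(bounce_rate)
print(exit_rate)
```

Note that a page only has a bounce rate if at least one session started on it, whereas every viewed page has an exit rate.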
The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date.
The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.
Fitting the best linear and non-linear classifiers and explaining the optimal feature set and dimensions. A linear classifier aids explainability, while a non-linear classifier enhances classification performance.
Showing and explaining the hyper-parameter fitting process.
Verifying the work against an AutoML baseline and explaining the rationale behind the choice of classifier, parameters and features.
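The linear versus non-linear comparison and the hyper-parameter fitting described above could look roughly like the sketch below. It is not the repository's actual code: logistic regression and a random forest are stand-in choices, the parameter grids are arbitrary, and synthetic data from `make_classification` replaces the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the session features and the 'Revenue' label.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Linear model (assumed choice): scaled logistic regression, tuning C.
linear = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=3,
)

# Non-linear model (assumed choice): random forest, tuning size and depth.
nonlinear = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)

for name, search in [("linear", linear), ("non-linear", nonlinear)]:
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))

# The linear model's coefficients (on standardized features) give a
# per-feature view that supports explainability.
coefs = linear.best_estimator_.named_steps["logisticregression"].coef_[0]
print(coefs)
```

F1 is used as the scoring metric here on the assumption that purchases are the minority class; any other metric accepted by `scoring` could be substituted.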
Building a predictive classification model, trained on the data entries for the months of June-Dec and tested on the data entries for Feb-March.
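A minimal sketch of the month-based split described above, assuming the data sits in a pandas DataFrame with a `Month` column holding abbreviations such as "Feb", "Mar", "June" and "Dec". The column name, the exact month spellings, the helper `split_by_month` and the toy rows are all assumptions for illustration.

```python
import pandas as pd

# Assumed month spellings; adjust to match the actual 'Month' values.
TRAIN_MONTHS = ["June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
TEST_MONTHS = ["Feb", "Mar"]


def split_by_month(df, month_col="Month"):
    """Split df into (train, test) frames by the month of each session."""
    train = df[df[month_col].isin(TRAIN_MONTHS)].reset_index(drop=True)
    test = df[df[month_col].isin(TEST_MONTHS)].reset_index(drop=True)
    return train, test


# Toy rows, for illustration only.
toy = pd.DataFrame(
    {
        "Month": ["Feb", "Mar", "May", "June", "Nov", "Dec"],
        "PageValues": [0.0, 12.5, 0.0, 3.2, 40.1, 0.0],
        "Revenue": [False, True, False, False, True, False],
    }
)
train, test = split_by_month(toy)
print(len(train), len(test))
```

Rows from months outside both ranges (such as "May" in the toy data) end up in neither set.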
Generating user-behavior clusters based on the purchasing behavior data for the complete dataset.
Identifying the significant differences between the various clusters in terms of the size of the clusters and purchase ratio.
Returning plots and cluster images generated for the data.
Performing a detailed analysis of each cluster with respect to the variations in its features, and identifying the behaviors characteristic of each cluster.
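The clustering steps above (generating clusters, then comparing cluster size and purchase ratio) might be sketched as follows. K-means with three clusters is an assumed choice, not necessarily the method used in the repository, and the data is randomly generated; the column names follow the feature names in the dataset description but are likewise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the session data.
rng = np.random.default_rng(0)
n = 300
sessions = pd.DataFrame(
    {
        "ProductRelated_Duration": rng.exponential(600.0, n),
        "BounceRates": rng.uniform(0.0, 0.2, n),
        "PageValues": rng.exponential(5.0, n),
    }
)
# Toy purchase label, only so that a purchase ratio can be computed.
sessions["Revenue"] = sessions["PageValues"] > 8.0

# Standardize first so that no single feature dominates the distances.
features = ["ProductRelated_Duration", "BounceRates", "PageValues"]
X = StandardScaler().fit_transform(sessions[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
sessions["cluster"] = kmeans.fit_predict(X)

# Cluster size and purchase ratio (share of sessions with Revenue == True).
summary = sessions.groupby("cluster").agg(
    n_sessions=("Revenue", "size"),
    purchase_ratio=("Revenue", "mean"),
)
print(summary)

# Per-cluster feature means, a starting point for behavioral analysis.
print(sessions.groupby("cluster")[features].mean())
```

'Revenue' is deliberately left out of the clustering features so that the purchase ratio remains an independent way to compare clusters.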
Building a semi-supervised self-labelling model to estimate 'Revenue' for the missing records in Oct-Dec, then fitting the classifier and reporting classification performance on the Feb-March data set with and without the self-labelled data.
Comparing classification performance on the test data when the self-labelled data and the training data are used together.
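One possible shape for the self-labelling experiment described above uses scikit-learn's `SelfTrainingClassifier`, which treats rows whose label is `-1` as unlabelled. This is a sketch under assumptions rather than the repository's method: the base classifier, the 0.9 confidence threshold and the synthetic data are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for the real sessions.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_lab, y_lab = X[:200], y[:200]    # stands in for labelled training rows
X_unl = X[200:400]                 # stands in for Oct-Dec rows missing 'Revenue'
X_test, y_test = X[400:], y[400:]  # stands in for the Feb-March test set

# Baseline: classifier fitted on the labelled rows only.
baseline = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Self-training: unlabelled rows are marked with -1 and are pseudo-labelled
# iteratively whenever the predicted probability reaches the threshold.
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, np.full(len(X_unl), -1)])
self_trained = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000), threshold=0.9
)
self_trained.fit(X_all, y_all)

f1_base = f1_score(y_test, baseline.predict(X_test))
f1_self = f1_score(y_test, self_trained.predict(X_test))
print(f"F1 without self-labelled data: {f1_base:.3f}")
print(f"F1 with self-labelled data:    {f1_self:.3f}")
```

After fitting, `self_trained.transduction_` holds the labels assigned to every training row; rows that never reached the threshold keep the value `-1`.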