Statistical Analysis and Exploratory Data Analysis (EDA)

The first part of the project was to perform statistical analysis on the dataset: describing the central tendencies and variability of the selected variables through summary statistics tables and visualisations, and conducting inferential statistics.

The dataset used for this project is the Subsentence retail dataset in Data in Brief: Dataset. It contains 41 variables, and through this project I wanted to drill down to the minimum set of variables that have the strongest associations with purchase intentions.

The above histogram compares the mean and the median for each variable, with the mean shown in purple and the median in blue. When both the mean and the median are high, as for CT7, most respondents agreed or strongly agreed with the survey question; when both are low, most respondents disagreed or strongly disagreed. It should be noted, however, that the demographic variables, from Gender to Shopping frequency, have lower values because they have fewer response options and were not answered on the agree/disagree scale.

The above histogram shows the standard deviations for all the variables. The standard deviations for the demographic variables are lower than those of the measurement instrument variables. Overall, the standard deviations for the measurement variables range between 1 and 1.35. Since most of these questions have only 5 response options, these standard deviations are quite high, showing that the responses vary: some respondents may answer 1 while others answer 3 or 5 for the same question.
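As a rough illustration of how the mean, median and standard deviation per variable could be computed, here is a minimal pandas sketch. The data frame and its values are invented for the example (only the column names CT7, PI1 and Gender come from the write-up), and the original notebook may well have done this differently.

```python
import pandas as pd

# Hypothetical sample: two Likert items (1-5) and one demographic column
# with fewer response options. The values are invented for illustration.
df = pd.DataFrame({
    "Gender": [1, 2, 2, 1, 2, 1],
    "CT7":    [5, 4, 5, 5, 3, 4],
    "PI1":    [1, 3, 5, 2, 4, 3],
})

# One row per variable: mean, median and sample standard deviation.
summary = df.agg(["mean", "median", "std"]).T
print(summary.round(2))
```

A table like this can feed both charts: the mean and median columns for the comparison plot, and the std column for the variability plot.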

The second part of the project was to use the dataset to perform inferential statistics through t-tests, chi-square tests, and correlation matrices.

As the dataset consists of 41 variables, I used heatmaps for the t-tests, chi-square tests, and correlations to see how each variable relates to every other variable. These heatmaps were quite large (41 x 41 variables), so I applied filters to show only the significant associations.

The above heatmap shows the filtered correlation values between the variables, where orange/red or dark blue indicates a high correlation between two variables. The correlation value is also shown within each block. The heatmap was filtered to show only correlations above 0.4 (40%) with a p-value below 0.05. The variables that have a high correlation (0.4 and above) with purchase intention are:

PV2 has a high correlation with PI1, PI2, PI3, PI4
PV1 has a high correlation with PI1, PI2, PI3, PI4
PS3 has a high correlation with PI1, PI2, PI3, PI4
PS1 has a high correlation with PI1, PI2, PI3, PI4
PPQ3 has a high correlation with PI1, PI3, PI4
PPQ1 has a high correlation with PI1, PI2, PI4
PPQ2 has a high correlation with PI1, PI4
PE2 has a high correlation with PI2, PI4
CT5 has a high correlation with PI3
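The filtering step behind a list like the one above can be sketched as follows. This is only an assumption about the approach: it uses Pearson correlation via `scipy.stats.pearsonr` (the write-up does not say which coefficient was used), the helper name `filtered_correlations` is hypothetical, and the data is randomly generated for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def filtered_correlations(df, min_abs_r=0.4, alpha=0.05):
    """Keep r only where |r| > min_abs_r and p < alpha; other cells are NaN."""
    cols = list(df.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for a in cols:
        for b in cols:
            if a == b:
                continue  # skip the trivial self-correlation
            r, p = pearsonr(df[a].to_numpy(), df[b].to_numpy())
            if abs(r) > min_abs_r and p < alpha:
                out.loc[a, b] = r
    return out


# Invented survey-like data: PI1 is built to track PV2, Gender is unrelated.
rng = np.random.default_rng(0)
pv2 = rng.integers(1, 6, size=200)
demo = pd.DataFrame({
    "PV2": pv2,
    "PI1": np.clip(pv2 + rng.integers(-1, 2, size=200), 1, 5),
    "Gender": rng.integers(1, 3, size=200),
})

filtered = filtered_correlations(demo)
print(filtered.round(2))
```

Cells that fail either threshold stay empty (NaN), so only the strong, significant pairs remain visible when the matrix is drawn as a heatmap.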

The above heatmap shows the filtered Cramér's V values, limited to values above 0.4 with a p-value below 0.05. I used this as an additional measure to identify variables that were not picked up by the correlation results.

Based on the Cramér's V values from the chi-square results, the following variables have a high association with the purchase intention (PI) variables:

PV3 has a high association with PI1, PI2, PI3, PI4
PV2 has a high association with PI1, PI4
CT2 has a high association with PI3, PI4
The PI variables also show a high association between themselves (PI to PI)
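For reference, Cramér's V for a pair of categorical variables can be derived from the chi-square statistic as V = sqrt(chi2 / (n * min(rows - 1, columns - 1))). The sketch below shows one way to compute it with `pandas.crosstab` and `scipy.stats.chi2_contingency`. The function name `cramers_v` and the sample values are made up for illustration, and no bias correction is applied, which may differ from what the original analysis used.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Return (Cramér's V, chi-square p-value) for two categorical Series."""
    table = pd.crosstab(x, y)
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # assumes each variable has at least 2 categories
    return float(np.sqrt(chi2 / (n * k))), float(p)


# Invented responses: PI1 copies PV3 exactly, while CT2 is balanced across PV3.
pv3 = pd.Series([1, 2, 1, 2] * 15, name="PV3")
pi1 = pd.Series([1, 2, 1, 2] * 15, name="PI1")
ct2 = pd.Series([1, 1, 2, 2] * 15, name="CT2")

v_same, p_same = cramers_v(pv3, pi1)    # perfectly associated pair
v_indep, p_indep = cramers_v(pv3, ct2)  # unrelated pair
print(round(v_same, 2), round(v_indep, 2))
```

Applying the same thresholds as in the text (V above 0.4 and p below 0.05) would keep the first pair and drop the second.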

The third part of the project was feature engineering: converting the numerical variable values into categorical values.

The dataset used a Likert scale of 1 to 5 for each variable, where 1 was Strongly disagree and 5 was Strongly agree. In my opinion these Likert scale descriptions were quite vague, so I gave each variable clearer descriptions, since I wanted to include them in my dashboard, which is discussed further below. The descriptions were based on a "Low likelihood of buying" for values 1 and 2, an "Average likelihood of buying" for value 3, and a "High likelihood of buying" for values 4 and 5. Two examples of the variable descriptions are:

PPQ2: Likert scale values 1 and 2 should be changed to: Quality of the produce is poor.

PPQ2: Likert scale value 3 should be changed to: Quality of the produce is average.

PPQ2: Likert scale values 4 and 5 should be changed to: Quality of the produce is good.

CT2: Likert scale values 1 and 2 should be changed to: Store does not meet my needs.

CT2: Likert scale value 3 should be changed to: Store sometimes meets my needs.

CT2: Likert scale values 4 and 5 should be changed to: Store always meets my needs.

I did the feature engineering in Google Colab and applied all the variable mappings to the data frame to change the variables into categorical values. An example of the dataset before and after the feature engineering is shown below.

Before feature engineering

After feature engineering
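The recoding described above could look roughly like this in pandas. The label texts are the PPQ2 and CT2 examples given earlier; the helper `build_mapping` and the sample rows are assumptions added for illustration, not the original notebook code.

```python
import pandas as pd


def build_mapping(low, mid, high):
    """Map Likert values 1-2 to low, 3 to mid and 4-5 to high."""
    return {1: low, 2: low, 3: mid, 4: high, 5: high}


# Label texts from the PPQ2 and CT2 examples in the write-up.
mappings = {
    "PPQ2": build_mapping(
        "Quality of the produce is poor.",
        "Quality of the produce is average.",
        "Quality of the produce is good.",
    ),
    "CT2": build_mapping(
        "Store does not meet my needs.",
        "Store sometimes meets my needs.",
        "Store always meets my needs.",
    ),
}

# Invented sample rows standing in for the survey data.
before = pd.DataFrame({"PPQ2": [1, 3, 5, 2], "CT2": [4, 2, 3, 5]})

# Apply each column's mapping; the original frame is left untouched.
after = before.copy()
for column, mapping in mappings.items():
    after[column] = before[column].map(mapping)

print(after)
```

Keeping one mapping dictionary per variable makes it easy to extend the same loop to the remaining columns.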

The final part of the project involved creating a Looker Studio Dashboard.

This dashboard uses graphs to provide insights into how key factors influence purchase intentions in grocery stores. It focuses on variables such as Perceived Value, Product Quality, Price Sensitivity, Customer Trust, and the Physical Environment. These factors were selected based on the inferential statistical analyses discussed above, such as the correlation heatmaps and chi-square tests. The link to the dashboard is: Link to the Dashboard

