How has COVID-19 impacted the reach of kjhk.org?
Jaya Chakka, Dylan Gundersen, Ying Qing Won, Samuel Clark
PSYC 500 final project
Introduction
COVID-19 has impacted our day-to-day life in various ways, including the operation of businesses. Media companies in particular are struggling to navigate COVID-19 and to understand how it affects their outreach and success. Using datasets from the KU campus radio station, KJHK, we therefore set out to look for meaningful insights into how COVID-19 has impacted the reach of KJHK.
Data Curation and Ethics
Description of Data
One of our team members, Jaya, is on the student executive staff of the campus radio station, KJHK. KJHK tracks metrics in many areas, from its stream/FM broadcasting reach to its website popularity. For this project, we decided to analyze information about visitors to kjhk.org from January to October of 2020. KJHK, like other student organizations, has had to come up with innovative ways to create content and engage the KU and Lawrence community during the era of COVID. We want to analyze trends in website visits during this time period to determine whether COVID has impacted KJHK’s outreach (and how). The data we are utilizing is collected directly from the website by the Google Analytics application.
Ethics
The main ethical concern would be whether or not the managers of KJHK are comfortable with our data usage. Jaya addressed this concern by speaking to both her bosses and confirming that we can use the data.
Another potential ethical concern is whether Google Analytics is collecting personal information from users browsing the website. While it does seem that Google Analytics assigns an ID to repeat browsers (in order to monitor how frequently they revisit the site), these IDs seem to be randomly assigned and do not reveal personal information. Additionally, none of the variables our group will be analyzing are tied to these IDs. The dataset we will be running through Google Colab has been extracted from Google Analytics so that only those variables are accessible.
Though demographic information is collected based on a site viewer’s location (city, country, language, etc.), this data is non-specific enough that it does not infringe on an individual’s privacy or reveal personal information.
Research Questions
Our overarching question was: How has COVID-19 impacted the reach of kjhk.org?
Additional Questions:
Does the number of sessions per user fit a normal pattern of distribution?
How do average session duration and bounce rate relate? Do they exhibit a linear relationship and/or a strong correlation?
Do sessions per user, session duration, or bounce rate vary significantly pre-COVID versus during COVID?
Data Preparation and Exploration
From Google Analytics, we pulled two data sets: one of continuous variables, and one of discrete variables. We then pared these sets down further to isolate the variables we wished to analyze. These variables are as follows:
kjhk_continuous: ‘Day Index,’ ‘Number of Sessions per User,’ ‘Avg. Session Duration (sec),’ ‘Bounce Rate’
country: ‘Country,’ ‘Total Users,’ ‘Pre-COVID,’ ‘COVID’
In addition to the existing variables in the datasets, we added new variables:
New categorical variable "Condition" with the values "Pre-covid" and "Post-covid." The cutoff date is March 11, 2020, because that is when KU officially announced its first response to COVID-19 (at least 1 week of virtual instruction).
New variable "Bounce Rate New" that stores bounce rates as floats (61.95) instead of strings (61.95%), so that descriptive statistics are easy to calculate later on.
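The two derived variables above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's actual code: the three rows are made-up placeholder values, the ISO date format is an assumption, and treating March 11 itself as the first "Post-covid" day is also an assumption, since the report does not say which side of the cutoff that date falls on.

```python
import pandas as pd

# Placeholder rows (not real KJHK data); column names follow the report.
df = pd.DataFrame({
    "Day Index": ["2020-03-10", "2020-03-11", "2020-03-12"],
    "Bounce Rate": ["61.95%", "58.10%", "70.00%"],
})

# Parse the dates so they can be compared against the cutoff.
df["Day Index"] = pd.to_datetime(df["Day Index"])
cutoff = pd.Timestamp("2020-03-11")

# Assumption: the cutoff date itself counts as "Post-covid".
df["Condition"] = [
    "Pre-covid" if day < cutoff else "Post-covid" for day in df["Day Index"]
]

# Strip the trailing percent sign and convert the string to a float.
df["Bounce Rate New"] = df["Bounce Rate"].str.rstrip("%").astype(float)

print(df)
```

Keeping the original "Bounce Rate" column alongside "Bounce Rate New" makes it easy to check the conversion by eye.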
kjhk_continuous contains 11 variables/columns and 305 rows after data preparation.
That gives 3,355 values in total (11 columns times 305 rows).
Columns we analyzed:
Number of Sessions per User
Avg. Session Duration (sec)
Bounce Rate
As ‘Day Index’ was simply a variable to mark the procession of time and had no practical significance, we did not “analyze” it as a variable.
country contains 4 variables/columns and 143 rows.
That gives 572 values in total (4 columns times 143 rows).
We had planned to run a hypothesis test on the Pre-COVID vs. COVID country data, but this was not feasible given the nature of the data.
Exploratory Data Analysis
1. Number of sessions per user
Data visualization (Histogram and Scatterplot below)
There doesn't seem to be any particular day or period in which users are visiting the site multiple times a day. There are a few outliers, but they are not drastically different from the rest of the data.
Descriptive statistics of Number of Sessions per User
2. Bounce Rate
Data visualization (Seaborn distplot showing a histogram with a kernel density estimate)
The observations of bounce rate do not seem to resemble a perfect normal distribution. No outliers are detected.
Descriptive statistics of Bounce Rate
3. Country
Data visualization (Barplot as below)
The United States appears to have the largest number of users.
Descriptive statistics of Country
Model Building and Validation
Is Sessions per User distributed normally?
A histogram reveals a bell curve with a significant floor effect. This is to be expected when there is a minimum value (a user cannot have fewer than one session).
Plotting a CDF of empirical vs. theoretical values also suggests that sessions per user (SPU) follows a relatively normal pattern of distribution.
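The empirical vs. theoretical CDF comparison can be sketched as below. This is an illustrative sketch only: the `spu` array is randomly generated placeholder data rather than the KJHK values, and the helper names `ecdf` and `normal_cdf` are ours. The theoretical curve is a normal CDF that uses the sample mean and standard deviation.

```python
import math
import numpy as np

def ecdf(data):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

def normal_cdf(x, mean, std):
    """Normal CDF evaluated at each value of x, via the error function."""
    return np.array(
        [0.5 * (1 + math.erf((v - mean) / (std * math.sqrt(2)))) for v in x]
    )

# Placeholder data standing in for the 305 daily sessions-per-user values.
rng = np.random.default_rng(0)
spu = rng.normal(loc=1.2, scale=0.05, size=305)

x, y_emp = ecdf(spu)
y_theo = normal_cdf(x, spu.mean(), spu.std(ddof=1))

# Largest vertical gap between the two curves; smaller means a closer fit.
max_gap = np.max(np.abs(y_emp - y_theo))
print(f"largest gap between empirical and theoretical CDF: {max_gap:.3f}")
```

Plotting `y_emp` and `y_theo` against `x` on the same axes gives the visual comparison. The printed gap is only a rough numeric companion to the plot, not a formal normality test.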
How do average session duration and bounce rate relate?
slope = -0.013393741736750047; intercept = 71.03710836214083
Weak linear correlation across all session durations
As session duration increases by 1 sec, bounce rate tends to decrease by ~0.013 percentage points. For a session of 0 sec average duration, the predicted average bounce rate is about 71.04%
Stronger linear correlation when average session duration between 0-300 sec
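A least-squares line and correlation coefficient of the kind reported above can be computed with NumPy as sketched here. The data are synthetic placeholders built around the reported slope and intercept, the helper name `fit_line` is ours, and the 0-300 sec subset mirrors the restriction mentioned above. Whether the original analysis used `np.polyfit` is an assumption.

```python
import numpy as np

def fit_line(x, y):
    """Return slope, intercept and Pearson r for a straight-line fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# Synthetic placeholder data: durations in seconds, bounce rates in percent.
rng = np.random.default_rng(1)
duration = rng.uniform(0, 900, size=305)
bounce = 71.0 - 0.013 * duration + rng.normal(0, 3, size=305)

# Fit across all session durations.
slope_all, intercept_all, r_all = fit_line(duration, bounce)

# Fit restricted to days with an average session duration of 0-300 sec.
mask = (duration >= 0) & (duration <= 300)
slope_sub, intercept_sub, r_sub = fit_line(duration[mask], bounce[mask])

print(f"all days:  slope={slope_all:.4f}, intercept={intercept_all:.2f}, r={r_all:.2f}")
print(f"0-300 sec: slope={slope_sub:.4f}, intercept={intercept_sub:.2f}, r={r_sub:.2f}")
```

Reporting `r` next to the slope makes the "weak" vs. "stronger" comparison concrete, since the slope alone says nothing about how tightly the points follow the line.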
Does bounce rate vary significantly pre-COVID versus during COVID?
1. EDA
Swarmplot
Mean of bounce rate pre-covid
Mean of bounce rate during covid
2. Compute test statistic (difference of means)
Empirical mean difference = -5.30
3. State null hypothesis
The distributions of bounce rate are identical pre-covid and during covid
Visualizing (Plotting CDFs)
permutation samples
empirical data
4,5,6. Simulate data assuming null hypothesis is true
Generate 10,000 permutation replicates
Plotting them
7. Decision
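p-value = 0.0 (none of the 10,000 permutation replicates was at least as extreme as the observed difference, so p < 0.0001)
If the bounce rates pre-covid and during covid were identically distributed, there would be essentially no chance of getting the mean difference we observed
In other words, the observed difference is very unlikely under the assumption of equal distributions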
Reject null hypothesis
Suggest that there is a significant difference in the distributions of bounce rate pre-covid and during covid
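The permutation procedure in steps 2 through 7 can be sketched as follows. This is a minimal sketch under stated assumptions: the two samples are randomly generated placeholders (sized 70 and 235 days on the assumption that the March 11 cutoff splits the 305 days that way), the helper names are ours, the difference is taken as pre minus during, and the p-value is one-sided. The report does not state which convention was used.

```python
import numpy as np

def diff_of_means(a, b):
    """Test statistic: mean of a minus mean of b."""
    return np.mean(a) - np.mean(b)

def draw_perm_reps(a, b, func, size, rng):
    """Shuffle the pooled data `size` times and recompute the statistic."""
    pooled = np.concatenate((a, b))
    reps = np.empty(size)
    for i in range(size):
        shuffled = rng.permutation(pooled)
        reps[i] = func(shuffled[:len(a)], shuffled[len(a):])
    return reps

# Placeholder bounce rates (percent); not the real KJHK data.
rng = np.random.default_rng(42)
pre = rng.normal(60, 5, size=70)
during = rng.normal(65, 5, size=235)

observed = diff_of_means(pre, during)
reps = draw_perm_reps(pre, during, diff_of_means, 10_000, rng)

# One-sided p-value: share of replicates at least as negative as observed.
p_value = np.mean(reps <= observed)

# With 10,000 replicates, a p-value of 0 only means p is below this floor.
p_floor = 1 / len(reps)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f} (floor {p_floor})")
```

Because the p-value is estimated from a finite number of shuffles, a result of 0.0 is better read as "smaller than 1 in 10,000" than as a literal 0% chance.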
Does Sessions per User differ significantly pre-COVID versus during COVID?
1. EDA
Swarmplot
Mean sessions per user pre-covid
Mean sessions per user during covid
2. Compute test statistic (difference of means)
Empirical mean difference = -0.00489
3. State null hypothesis
The distributions of sessions per user are identical pre-covid and during covid
Visualizing (Plotting CDFs)
Permutation samples
4,5,6. Simulate data assuming null hypothesis is true
Generate 10,000 permutation replicates
Plotting them
7. Decision
p-value = 0.2641
If the null hypothesis were true, there would be a 26% chance of getting a mean difference at least as extreme as the one observed in the experiment
The observed difference in means was also extremely small (-0.00489), so it is not surprising that the p-value is not significant
Do not reject the null hypothesis
Does the average session duration (sec) vary significantly between pre-COVID and during COVID?
1. EDA
Swarmplot
Mean of Average Session Duration Pre-COVID
Mean of Average Session Duration during COVID
2. Compute test statistic (difference of means)
Empirical difference of Means = -29.9488
3. State null hypothesis
The average session duration in seconds on the KJHK website does not vary between the pre-COVID era and the COVID era.
Visualizing (Plotting CDFs)
Permutation samples
4,5,6. Simulate data assuming null hypothesis is true
Generate 10,000 permutation replicates
Plot them
7. Decision
p-value = 0.0499
If the null hypothesis were true, there would be a 4.99% chance of observing a difference in means at least as extreme as the one between the two data sets (pre and during COVID)
This falls just below an alpha value of 0.05, and so the null hypothesis is rejected
It should be noted, though, that the difference in the means may be driven by the major outliers found during the COVID era
Discussion
Summary
The goal of our final project was to analyze visitorship metrics from kjhk.org to determine general trends in the metrics as well as whether certain variables differed significantly pre-COVID vs. during COVID. These questions were of interest because KJHK, like other student organizations, has had to shift its standard operations due to COVID and has relied more on website content to engage its followers. Data for this project was pulled from the Google Analytics site connected to KJHK and covered January 1st - October 31st of 2020. Minor preparatory work was performed, including splitting the continuous dataset by date so that there was a pre-COVID and COVID distinction. Then, we conducted 5 tests: a linear regression between average session duration and bounce rate; a normality check for number of sessions per user; and permutation hypothesis tests for pre-COVID vs. COVID average session duration, sessions per user, and bounce rate.
Limitation
In our datasets, we noticed that there were certain days which might have been considered outliers and therefore could have skewed our analyses. This was especially evident in the "COVID" dates because several notable holidays/events happened within this period, including our 45th birthday celebration and Halloween.
As with most observational data, our analyses strictly provided correlational information rather than causal; in other words, we can use our findings to make statements about notable trends, but not to establish any cause-effect relationships.
Our discrete dataset, which included information about countries of visitorship, was not suitable for the types of tests we needed to perform for this project. Thus, although we were interested in what information this data could provide, we were unfortunately unable to use it.
Because kjhk.org has only recently taken measures to filter out bots and other non-real users, there is a chance that some of the data collected (especially with regards to bounce rate and country) was not valid, which could have impacted the overall validity of our analyses. There is nothing we could have done to combat this, but this is important to note.
Future Direction
With more time to complete data analysis, it would be beneficial to extract data from a much larger period in order to equalize the number of dates in our pre-COVID vs. COVID categories and to provide a larger sample size in general. One way we could accomplish this is by pulling data from the same date in 2019 vs. 2020 and doing a direct comparison for each date. It would also be beneficial to have more large holidays in the pre-COVID category for the sake of comparison.
Outside the scope of this project, it would be interesting to apply some other types of statistical tests to the country data to examine trends more effectively.
Implications
We found that sessions per user throughout the full time period was approximately normally distributed, based on its histogram and CDF plot. This suggests that there is a fairly consistent number of sessions for the visitors to the site, which is what we would hope for and anticipate.
We found a weak linear relationship between average session duration and bounce rate. The relationship was stronger for session durations between 0-300 seconds, which suggests that some of the higher session duration data points may be outliers. We expected these two variables to be closely related given that bounce rate is defined as the percentage of times users visit a page without interacting and the session duration measures the time spent in a session.
From our permutation hypothesis tests, we found that session duration and bounce rate had p-values low enough to reject the null hypotheses; this suggests that both of these variables differed significantly pre-COVID vs. during COVID. Sessions per user did not vary pre-COVID vs. during COVID.
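Average bounce rate actually increased during the COVID period, which suggests that more people were visiting the website without directly interacting with any of the pages. The p-value for this hypothesis test was reported as 0.0, meaning that none of the 10,000 permutation replicates was as extreme as the observed difference (p < 0.0001).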
Session duration increased during COVID, but the difference was barely significant as the p-value was 0.0499.
Sessions per user had a p-value of 0.2641, which exceeded the threshold of 0.05 and therefore does not suggest that any difference in means was significant.
Most of our research questions were based as much on curiosity as on practical applications for the station. However, we can draw a few general conclusions from our analyses. We can conclude that KJHK might need to take further measures to keep bots off the website, as there are instances when these may have impacted the data. We can also reasonably conclude that kjhk.org has remained as engaging to visitors during COVID as it was before, and possibly even more engaging (based on the slight increase in average session duration). The unchanging average number of sessions per user suggests that there is still more that can be done to make the website engaging and easy to navigate. All in all, KJHK's website metrics have remained steady during this odd period and demonstrate potential for future growth.