GOOGLE DATA ANALYTICS STUDY:
HOW CAN A WELLNESS TECHNOLOGY PLAY IT SMART?
BY: DORINA JOY ALMAZAN
GOOGLE DATA ANALYTICS STUDY:
HOW CAN A WELLNESS TECHNOLOGY PLAY IT SMART?
BY: DORINA JOY ALMAZAN
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
ASK PHASE
BUSINESS TASK
To analyze smart device usage data in order to gain insight into how people are already using their smart devices for their marketing strategy.
WHO ARE THE STAKEHOLDERS?
Bellabeat’s cofounder
Chief Creative Officer
Mathematician and Bellabeat’s cofounder
Key member of the Bellabeat executive team
Responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy
PREPARE PHASE
WHERE IS YOUR DATA STORED?
The dataset can be accessed via Kaggle, a website that publishes datasets.
HOW IS THE DATA ORGANIZED?
The dataset contains both long (narrow) and wide versions. However, some .csv files are only stored in a long version, thus all data will be utilized in a long format.
HOW ARE YOU ADDRESSING LICENSING PRIVACY, SECURITY, AND ACCESSIBILITY?
The dataset is shared by Mobius, the collaborator, with a CC0 Public domain license. Thus, anyone can access the same dataset.
ARE THERE ANY PROBLEMS OR ISSUES WITH BIAS OR CREDIBILITY IN THIS DATA?
We will use ROCCC (Reliable, Original, Comprehensive, Current, & Cited) to determine if it is a good data source or not
Reliable: The data is rated LOW on this criteria since 30 respondents are not enough samples for an accurate analysis.
Original: It is rated LOW because it came from the third-party provider (Amazon Mechanical Turk)
Comprehensive: This is rated MEDIUM on this criteria as the data provides the data we might need for the analysis. However, it lacks demographic data which can be useful for analysis.
Current: The data is collected 7 years ago, thus this is rated LOW on this criteria.
Cited: The data is collected from a third-party provider, hence it is rated LOW.
Overall, it is not a recommended data source to begin with.
HOW DOES IT HELP YOU ANSWER YOUR QUESTION?
Although the data source is outdated and can be from a third-party provider, the data provides sufficient information that we need to do the business task.
PROCESS PHASE
THE TOOLS THAT I USED ARE
MS Excel to inspect and rename the files
R to clean, analyze, and visualize the datasets
STEPS IN THE PROCESS DATA
First, I downloaded the zip file from the Kaggle website to get the .csv files. For this project, I used the narrow files.
Using MS Excel, I renamed the narrow files into simpler names.
dailyActivity_merged.csv -> Daily_Activity.csv
dailyCalories_merged.csv -> Daily_Calories
dailyIntensities_merged.csv -> Daily_Intensities
dailySteps_merged.csv -> Daily_Steps
heartrate_seconds_merged.csv -> Heartrate_Seconds.csv
hourlyCalories_merged.csv -> Hourly_Calories.csv
hourlyIntensities_merged.csv -> Hourly_Intensities.csv
hourlySteps_merged.csv -> Hourly_Steps.csv
minuteCaloriesNarrow_merged.csv -> Minutes_Calories.csv
minuteIntensitiesNarrow_merged.csv -> Minutes_Intensities
minuteMETsNarrow_merged.csv -> Minutes_METs
minuteSleep_merged.csv -> Minutes_Sleep.csv
minuteStepsNarrow_merged.csv -> Minutes_Steps.csv
sleepDay_merged.csv -> Sleep_Day.csv
weightLogInfo_merged.csv -> WeightLogInfo.csv
After renaming and reformatting the data, I installed and imported the datasets needed for the analysis using R
Installing packages
install.packages(“tidyverse”)
library(tidyverse)
library(readr)
library(ggplot2)
library(dplyr)
Importing datasets
Daily_Activity <- read_csv("Daily_Activity.csv")
Daily_Calories <- read_csv("Daily_Calories.csv")
Daily_Intensities <- read_csv("Daily_Intensities.csv")
Daily_Steps <- read_csv("Daily_Steps.csv")
Heartrate_Seconds <- read_csv("Heartrate_Seconds.csv")
Hourly_Calories <- read_csv("Hourly_Calories.csv")
Hourly_Intensities <- read_csv("Hourly_Intensities.csv")
Hourly_Steps <- read_csv("Hourly_Steps.csv")
Minutes_Calories <- read_csv("Minute_Calories.csv")
Minutes_Intensities <- read_csv("Minutes_Intensities.csv")
Minutes_METs <- read_csv("Minutes_METs.csv")
Minutes_Sleep <- read_csv("Minutes_Sleep.csv")
Minutes_Steps <- read_csv("Minutes_Steps.csv")
Sleep_Day <- read_csv("Sleep_Day.csv")
Weight_Log_Info <- read_csv("WeightLogInfo.csv")
DATA MANIPULATION AND CLEANING
Inspecting the data
sapply(list(Daily_Activity, Daily_Calories, Daily_Intensities, Daily_Steps, Heartrate_Seconds, Hourly_Calories, Hourly_Intensities, Hourly_Steps, Minutes_Calories, Minutes_Intensities, Minutes_METs, Minutes_Sleep, Minutes_Steps, Sleep_Day, Weight_Log_Info), head)
sapply(list(Daily_Activity, Daily_Calories, Daily_Intensities, Daily_Steps, Heartrate_Seconds, Hourly_Calories, Hourly_Intensities, Hourly_Steps, Minutes_Calories, Minutes_Intensities, Minutes_METs, Minutes_Sleep, Minutes_Steps, Sleep_Day, Weight_Log_Info), str)
sapply(list(Daily_Activity, Daily_Calories, Daily_Intensities, Daily_Steps, Heartrate_Seconds, Hourly_Calories, Hourly_Intensities, Hourly_Steps, Minutes_Calories, Minutes_Intensities, Minutes_METs, Minutes_Sleep, Minutes_Steps, Sleep_Day, Weight_Log_Info), colnames)
sapply(list(Daily_Activity, Daily_Calories, Daily_Intensities, Daily_Steps, Heartrate_Seconds, Hourly_Calories, Hourly_Intensities, Hourly_Steps, Minutes_Calories, Minutes_Intensities, Minutes_METs, Minutes_Sleep, Minutes_Steps, Sleep_Day, Weight_Log_Info), glimpse)
Cleaning the data
sum(is.na(Daily_Activity))
[1] 0
sum(is.na(Daily_Calories))
[1] 0
sum(is.na(Daily_Intensities))
[1] 0
sum(is.na(Daily_Steps))
[1] 0
sum(is.na(Heartrate_Seconds))
[1] 0
sum(is.na(Hourly_Calories))
[1] 0
sum(is.na(Hourly_Intensities))
[1] 0
sum(is.na(Hourly_Steps))
[1] 0
sum(is.na(Minutes_Calories))
[1] 0
sum(is.na(Minutes_Intensities))
[1] 0
sum(is.na(Minutes_METs))
[1] 0
sum(is.na(Minutes_Sleep))
[1] 0
sum(is.na(Minutes_Steps))
[1] 0
sum(is.na(Sleep_Day))
[1] 0
sum(is.na(Weight_Log_Info))
[1] 65
View(Weight_Log_Info)
While checking if there are NA values, I saw that Weight_Log_Info has 65 NA values in the fat column. However, there were 2 cells that have data in the said column, thus, I removed the column
Weight_Log_Info <- subset(Weight_Log_Info, select = -Fat)
sum(is.na(Weight_Log_Info))
[1] 0
I checked if there are NULL values in each dataset
sum(is.null(Daily_Activity))
[1] 0
sum(is.null(Daily_Calories))
[1] 0
sum(is.null(Daily_Intensities))
[1] 0
sum(is.null(Daily_Steps))
[1] 0
sum(is.null(Heartrate_Seconds))
[1] 0
sum(is.null(Hourly_Calories))
[1] 0
sum(is.null(Hourly_Intensities))
[1] 0
sum(is.null(Hourly_Steps))
[1] 0
sum(is.null(Minutes_Calories))
[1] 0
sum(is.null(Minutes_Intensities))
[1] 0
sum(is.null(Minutes_METs))
[1] 0
sum(is.null(Minutes_Sleep))
[1] 0
sum(is.null(Minutes_Steps))
[1] 0
sum(is.null(Sleep_Day))
[1] 0
sum(is.null(Weight_Log_Info))
[1] 0
There were no nulls in the datasets. After looking if the datasets have NAs or nulls, I counted the unique values using distinct
Daily_Activity %>% distinct(Id)
Daily_Calories %>% distinct(Id)
Daily_Intensities %>% distinct(Id)
Daily_Steps %>% distinct(Id)
Heartrate_Seconds %>% distinct(Id)
Hourly_Calories %>% distinct(Id)
Hourly_Intensities %>% distinct(Id)
Hourly_Steps %>% distinct(Id)
Minutes_Calories %>% distinct(Id)
Minutes_Intensities %>% distinct(Id)
Minutes_METs %>% distinct(Id)
Minutes_Sleep %>% distinct(Id)
Minutes_Steps %>% distinct(Id)
Sleep_Day %>% distinct(Id)
Weight_Log_Info %>% distinct(Id)
Next, I counted the number of observations in each dataframe
nrow(Daily_Activity)
[1] 940
nrow(Daily_Calories)
[1] 940
nrow(Daily_Intensities)
[1] 940
nrow(Daily_Steps)
[1] 940
nrow(Heartrate_Seconds)
[1] 683001
nrow(Hourly_Calories)
[1] 22099
nrow(Hourly_Intensities)
[1] 22099
nrow(Hourly_Steps)
[1] 22099
nrow(Minutes_Calories)
[1] 1048575
nrow(Minutes_Intensities)
[1] 1048575
nrow(Minutes_METs)
[1] 1048575
nrow(Minutes_Sleep)
[1] 188521
nrow(Minutes_Steps)
[1] 1048575
nrow(Sleep_Day)
[1] 413
nrow(Weight_Log_Info)
[1] 67
Then, I merged the datasets with the same set of Id number and rows using inner joins
inner_join(Daily_Activity, Daily_Calories, by = "Id")
inner_join(Daily_Activity, Daily_Intensities, by = "Id")
inner_join(Daily_Activity, Daily_Steps, by = "Id")
inner_join(Minutes_Calories, Minutes_Intensities, by = "Id")
Note: Unfortunately, R crashed while running the formula for other datasets. Hence, I used MS Excel to merge the other datasets
Hourly_Calories, Hourly_Intensities, & Hourly_Steps -> Hourly_Cal_Int_Steps
Minutes_Calories, Minutes_Intensities, Minutes_METs, & Minutes_Steps -> Minutes_Cal_Int_METs_Steps
I uploaded the merged datasets in R and double-checked again if there are NAs or NULLs
Hourly_Data_Merged <- read_csv("Hourly_Cal_Int_Steps.csv")
Minutes_Data_Merged_ExceptSleep <- read_csv("Minutes_Cal_Int_METs_Steps.csv")
sum(is.na(Hourly_Data_Merged))
[1] 0
sum(is.na(Minutes_Data_Merged_ExceptSleep))
[1] 0
sum(is.null(Hourly_Data_Merged))
[1] 0
sum(is.null(Minutes_Data_Merged_ExceptSleep))
[1] 0
ANALYZE PHASE
I wanted to get the descriptive statistics by performing the summary function.
summary(Daily_Activity)
summary(Heartrate_Seconds)
summary(Hourly_Data_Merged)
summary(Minutes_Data_Merged_ExceptSleep)
summary(Minutes_Sleep)
summary(Sleep_Day)
summary(Weight_Log_Info)
DATA FINDINGS
Average steps per day: 7638 steps
Average total distance walked per day: 5.9 km
Average sedentary hours: 16 ½ hours (991.2 minutes)
Average calories they burned: 2306 calories
Average heartrate: 79 beats per minute
Average total time of sleep per day: approximately 7 hours or 419.5 minutes
Calculate the number of minutes awake in bed
Sleep_Day <- Sleep_Day %>%
mutate(c=TotalTimeInBed-TotalMinutesAsleep)
Sleep_Day <- Sleep_Day %>%
rename("MinutesAwakeInBed" = "Difference")
Average minutes awake in bed: 39 minutes
Average weight of users: 72kg or 158.8 pounds
Average BMI: 25.19 (Overweight)
Correlation of very Active Minutes & Calories
cor(Daily_Activity$VeryActiveMinutes, Daily_Activity$Calories, method = "spearman")
[1] 0.5395405
SHARE PHASE
In order to know the number of application usage in the whole week, let’s transform dates into days
Library(lubridate)
Daily_Activity$New_Date <- mdy(Daily_Activity$ActivityDate)
Daily_Activity$weekday <- weekdays(Daily_Activity$New_Date)
Additionally, I made a bar chart to visualize the other important details.
ggplot(Daily_Activity, aes(x=fct_infreq(weekday), fill=weekday)) + geom_bar(stat= 'count') + labs(title = "Days with most number of usage")
ggplot(data=Hourly_Data_Merged, aes(x=Time, y=StepTotal)) + geom_bar(stat="identity", col=terrain.colors(22099)) + labs(title="Hourly Steps")
ggplot(data=Daily_Activity, aes(x=TotalSteps, y = Calories, color=Calories)) + geom_point() + stat_smooth(method=lm) + scale_color_gradient(low="bisque2", high="hotpink") + labs(title="Total Number of Steps vs. Calories Burned") + theme(legend.position="none")
ggplot(data=Daily_Activity, aes(x=TotalSteps, y = SedentaryMinutes, color=Calories)) +
geom_point() + stat_smooth(method=lm) + scale_color_gradient(low="", "yellowgreen") + labs(title="Very Active Minutes vs. Calories Burned") + theme(legend.position="none")
ACT PHASE
WHAT ARE SOME TRENDS IN SMART DEVICE USAGE?
991.2 minutes or 16.5 hours is the average recorded sedentary minutes among participants. This is quite alarming since 7 hours of sedentary behavior can place people at great risk of developing cardiovascular diseases and cancer
Participants have an average BMI of 25.19, which is classified as “Overweight”.
According to a recent study, people who walk at least 7,000 steps a day can help lower the risk of early death. Thus, this will be a great start for those who want to be healthy.
The average sleep time is 7 hours per day, which is recommended for adults.
When it comes to Bellabeat usage, participants choose to use the application during weekdays (except for Monday). This might indicate the preference of individuals to be active after work, which is demonstrated in hourly steps.
Expectedly, there is a positive correlation between the total number of steps and calories, as well as very active minutes and calories. This indicates the importance of physical activity in order to stay in shape.
HOW COULD THESE TRENDS APPLY TO BELLABEAT CUSTOMERS?
These trends will help the customers in an indirect way. By presenting these findings to the stakeholders, they might do action to improve the smart devices which aim to promote and encourage healthy behaviors to the customers.
HOW COULD THESE TRENDS HELP BELLABEAT MARKETING STRATEGY?
There are actions that can be taken to improve Bellabeat’s marketing strategy
Send notifications in instances when hours of sedentary behavior rise. They can also send reminders to do physical activity or to inform customers of their goal.
Since participants have a high BMI which qualifies them as “overweight”, the company can develop a calorie tracker which can be connected to their phones. Customers can input the servings of food they consumed and it will automatically display the number of calories they already consumed within the day.
To promote healthy behaviors, they can do a point system where there will be points given in every workout session. However, reaching a number of sedentary hours can lead them to lose a few points each day.
TAKE NOTE THAT THERE ARE LIMITATIONS TO THE DATA
There is no demographic data that was taken. They should include the demographics of the participants, such as age, gender, ethnicity,etc. This will help the analysts to gain more insights to the company for further actions.
Having 30 participants does not truly reflect the findings provided. The number of participants should represent each category of individuals and increase the number of participants.
The data is outdated and it needs to be updated each year to gain accurate reports.