Data Quality : While the dataset is openly available, it comes with some limitations. Its sample size is relatively small, and the timeframe covers only two months in 2016. However, despite these constraints, it offers a foundational basis for analysis. To compensate for its limitations, we'll complement our analysis with inferential, descriptive, and explanatory approaches. While this dataset provides initial insights, for more accurate trend identification and conclusive findings, a larger dataset covering a full year would be optimal. Nonetheless, we'll leverage the available data to extract valuable initial insights.
Analysis Methodology are as follows:
The dataset comprised CSV files with multiple sheets, emphasizing daily and hourly usage data alongside sleep patterns. To enhance analysis efficiency, we consolidated pertinent data into three distinct datasets: daily_data.csv, hourly_data.csv, and sleepday_data.csv. Additional data points like heart rate, body MET, and body weight were omitted due to their scarcity and limited relevance to our analysis objectives.
Steps via Excel:
Reviewed all files individually in Excel, examining column data types and identifying potential missing data errors.
Discovered that the "Activity Hour" column in the hourly dataset was formatted as DATE/TIME, which could pose challenges for future analysis in SQL and Tableau visualization.
Utilized Power Query in Excel to separate the "Activity Hour" column into two distinct columns: "Activity Date" and "Activity Hour," simplifying the data processing workflow.
Due to Excel's limitation of 1 million rows and the need for more efficient processing, transitioned the data cleaning process to R, where handling larger datasets is more feasible.
Additionally, leveraging R for data cleaning allows for quicker processing and the opportunity to conduct basic exploratory data analysis (EDA) simultaneously.
In our data import and cleaning phase, we utilized R along with various packages such as readr, dplyr, lubridate, and ggplot2 to efficiently process and prepare our datasets for analysis.
Let's outline the key steps taken:
Imported daily activity data from a CSV file, removed any missing values, and ensured dataset uniqueness.
Imported multiple raw hourly datasets, merged them into a single dataset for comprehensive analysis, and handled missing values.
Imported sleep day data from a CSV file, addressed missing values, and conducted initial data exploration.
Now, let's break down these steps into bullet points :
Utilized read_csv to import daily activity and sleep day datasets, handling missing values with na.omit, and ensuring dataset uniqueness with distinct.
Imported Excel cleaned hourly datasets(hourlyCalories_merged, hourlyIntensities_merged, hourlySteps_merged) with read_csv.
Merged these hourly datasets into a comprehensive one (hourly_activity) using merge while accounting for all entries.
Conducted a final data check by examining column names for each dataset (daily_activity, sleep_day, hourly_activity).
In our descriptive and explanatory analyses, we delve into the insights gleaned from our daily, hourly, and sleep data. Let's break down our findings and visualizations:
Daily Data Analysis:
Head and Glimpse: We start by inspecting the structure and initial rows of our daily activity dataset (daily_activity) to understand its composition.
Summary Statistics: Utilizing summary() and glimpse(), we obtain summary statistics and an overview of the dataset's structure.
Pie Chart: Visualizing the distribution of activity minutes with a pie chart, highlighting the share of various activity types.
Hourly Data Analysis:
Head and Glimpse: We examine the structure and initial rows of our hourly activity dataset (hourly_data) to grasp its characteristics.
Summary Statistics: Through summary() and glimpse(), we extract summary statistics and insights into the dataset's structure.
Sleep Data Analysis:
Head and Glimpse: We explore the structure and initial rows of our sleep day dataset (sleepday_data) to understand its format.
Summary Statistics: Utilizing summary() and glimpse(), we derive summary statistics and insights into the dataset's structure.
Explanatory Analysis
Share of Activity Minutes: We calculate and visualize the distribution of activity minutes across various activity types using a pie chart.
Relations between Total Steps & Calories Burned: We explore the relationship between total steps and calories burned through a scatterplot.
Relations between Sedentary Minutes & Calories Burned: We investigate the association between sedentary minutes and calories burned using another scatterplot visualization.
**All codes mentioned below images**
#descriptive analysis
total_rows <- nrow(daily_activity)
print(total_rows)
str(daily_activity)
summary(daily_activity)
# Calculate the sums of minutes for each activity type for pie chart
daily_activity_col_sum <-
colSums(daily_activity[c("VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes")], na.rm = TRUE)
daily_activity_col_sum
# Create a data frame for plotting
daily_activity_share <-
data.frame(Activity=names(daily_activity_col_sum),Minutes=daily_activity_col_sum)
daily_activity_share
# Convert "Activity" column to factor and specify the desired order of levels
daily_activity_share$Activity <-
factor(daily_activity_share$Activity,levels =c("VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes"))
daily_activity_share$Activity
# Plot pie chart
ggplot(daily_activity_share,aes(x="",y=Minutes,fill=Activity))+
geom_bar(stat = "identity",width = 2)+
coord_polar("y",start = 0)+
labs(title = "Share of Activity Minutes",fill="Activity",y="Minutes")+
theme_void()+
theme(legend.position = "right")
#Relations between SedentaryMinutes & calories burned
ggplot(daily_activity,aes(x=SedentaryMinutes,y=TotalSteps))+
geom_jitter(stat ="identity",color="blue")+
labs(title="Relations between SedentaryMinutes & calories
burned",x="SedentaryMinutes",y="TotalSteps")+
theme_minimal()
#Relations between total stepstotal steps & calories burned
ggplot(daily_activity,aes(x=Calories,y=TotalSteps))+geom_jitter(color="blue",size=2)+
labs(title = "Relations between total steps & calories
burned",x="Calories",y="Total Steps")+
theme_minimal()
The following CSV files have been exported for further analysis in SQL, providing structured datasets conducive to detailed exploration:
daily_activity.csv: This file contains daily activity data, including metrics such as Very Active Minutes, Fairly Active Minutes, Lightly Active Minutes, and Sedentary Minutes. The dataset allows for analysis of daily activity trends, patterns, and aggregate statistics in SQL.
hourly_activity.csv: Consolidating hourly activity data from multiple sources, this file facilitates detailed analysis of activity patterns throughout the day. With metrics like calorie burn and step count, SQL analysis can uncover insights into hourly activity trends, correlations between activity metrics, and user behavior.
sleep_day.csv: Focused on sleep patterns, this file includes data on sleep duration, efficiency, and stages. In SQL, this dataset enables analysis of sleep trends over time, factors impacting sleep quality, and relationships between sleep patterns and other variables.
These CSV files were generated using the following R code:
write.csv(daily_activity, "daily_activity.csv", row.names = FALSE)
write.csv(hourly_activity, "hourly_activity.csv", row.names = FALSE)
write.csv(sleep_day, "sleep_day.csv", row.names = FALSE)
Importing and analyzing these datasets in SQL offers opportunities for comprehensive analysis, including descriptive statistics, trend identification, correlation studies, and more, providing deeper insights into users' activity and sleep behavior.
Subsequent analyses delved into various facets of user behavior, including sleep patterns, activity type comparisons, weekday analyses, and trend identification. Originally planned to utilize MS SQL Server, a pivot occurred when encountering challenges, leading to the adoption of Google BigQuery. Leveraging the robust capabilities of BigQuery, we successfully extracted and filtered relevant information essential for comprehending user activity trends. Through structured queries and data manipulation techniques, we gained insights into user behavior patterns, enabling us to make informed decisions and drive strategic initiatives.
Analysis steps are as follows:
**All codes mentioned below every image.**
In this section, I delved into user segmentation based on activity level within the Bellbeat database using SQL queries. By categorizing users into different groups based on their activity logs, activity levels, and activity types, I aimed to provide insights into user behavior and engagement with the Bellbeat platform.
Conducted segmentation of users into three categories based on their activity logs:
Active users: Users with 20 to 31 activity logs.
Moderate users: Users with 10 to 19 activity logs.
Casual users: Users with 0 to 9 activity logs.
Provided insights into the distribution of users across these categories to understand user engagement levels.
Segmented users into three activity levels:
Active: Users with an average of more than 10,000 total steps.
Moderate: Users with an average of more than 5,000 total steps.
Inactive: Users with an average of fewer than 5,000 total steps.
Presented the distribution of users across these activity levels to gauge the overall activity levels of the user base.
Analyzed user activity types based on their average activity minutes in different categories: Very Active, Fairly Active, Lightly Active, Sedentary
Highlighted the distribution of users across these activity types to understand the predominant activity patterns within the user base.
Calculated the average minutes spent in different activity levels: Very Active Minutes, Fairly Active Minutes, Lightly Active Minutes
Computed the average duration of sleep and time spent in bed.
Extracted daily activity summaries including: Calories burned, Total steps taken
Minutes spent in different activity levels: very active, fairly active, lightly active, sedentary
Summarized sleep-related metrics: Total minutes asleep, Total time spent in bed, Incorporated hourly data to calculate overall intensity throughout the day.
Analyzed the daily total steps taken by users over a period of time.
Aggregated the total steps for each activity date to identify trends and patterns in users' daily activity levels.
Presented the data chronologically to visualize changes in daily step counts over time, which can be indicative of shifts in user behavior or engagement.
Investigated users' sleep patterns by examining the average minutes asleep and average time spent in bed per activity day.
Calculated the average sleep duration and time spent in bed to understand users' sleep habits and patterns over time.
Provided insights into users' sleep behaviors and trends, which can inform strategies for promoting healthy sleep habits within the Bellbeat platform.
Explored trends in hourly steps by aggregating step totals for each activity date and hour.
Analyzed the distribution of steps throughout the day to identify peak activity periods and patterns in users' daily activity levels.
Presented the data to highlight variations in hourly step counts, providing insights into users' activity patterns and behaviors at different times of the day.
A Tableau dashboard was created to visualize and communicate key insights from the SQL analysis. Leveraging data from trend analyses on daily steps, sleep patterns, and hourly steps, stakeholders can explore users' activity levels and sleep behaviors over time. The interactive report facilitates informed decision-making for product enhancements and user engagement strategies within the Bellabeat platform.
For access to the detailed documentation and interactive dashboard related to this project, please visit - Summary doc , SQL queries & Tableau report
Low to Moderate Activity Levels: Majority of device users exhibit low to moderate activity levels, with a significant portion leading sedentary lifestyles.
High Sedentary Time: Users spend substantial time in a sedentary state, indicating a lack of physical activity.
Skewed Activity-to-Sleep Ratio: The ratio of activity to sleep time is notably skewed, suggesting a priority towards sedentary behaviors.
Minimal Weekend Activity: Despite no significant calorie burn difference between weekends and weekdays, users engage minimally in physical activities beyond routine tasks.
Midday Peak Activity: Analysis of hourly data reveals peak activity during midday, indicating potential missed opportunities for exercise and workouts.
Daily & Hourly Reminders: Implement customizable reminders for various physical activities through the Bellabeat app to encourage users to engage in activities such as walking, running, or other exercises.
Promote Healthy Lifestyle: Deliver notifications containing articles, news, and videos on healthy lifestyle practices to empower users with valuable insights into maintaining a healthy lifestyle.
Affordable Healthcare Services: Expand offerings beyond traditional healthcare services by incorporating regular health awareness programs for users, aiming to equip them with essential knowledge about maintaining a healthy lifestyle and mitigating potential health issues.
Despite the limitations imposed by a small sample size, accurately measuring user behavior to discern actual trends in the usage of smart healthcare devices becomes challenging. Consequently, there may be biases in the results obtained due to the restricted scope of data collection. Additionally, variations in user demographics and device usage patterns can further complicate the analysis, potentially leading to skewed interpretations of user behavior.
However, it remains imperative to prioritize user health in all recommendations, ensuring that recommendations are made with the utmost care and consideration for user well-being. By acknowledging and addressing these challenges, the analysis strives to provide valuable insights while recognizing the need for cautious interpretation and decision-making in the context of smart healthcare device usage.