First let’s see how many rows of data we have:
SQL code
SELECT
COUNT(*)
FROM
divvy.analysis;
Result
4,443,081
So that’s a 24% reduction of rows, so all that data cleaning was worth it.
The Case Study Roadmap suggests some potential analysis paths, so let’s review those.
But first, we need to correct an oversight from the process stage, and create another column; ride_length is good enough for calculations, but if we wish to have human-understandable information, it’s better to have it in terms of hours:minutes:seconds, instead of a large number of seconds that provide no context for the actual length.
SQL code
ALTER TABLE
divvy.analysis ADD COLUMN ride_duration TIME;
UPDATE
divvy.analysis
SET
ride_duration = TIME(TIMESTAMP_SECONDS(ride_length))
WHERE
1=1;
Result
ride_duration TIME
Now we can choose how to calculate and present our analysis, by picking the appropriate column when needed.
Let’s go back to the first stage, and remember what we were asked:
How do annual members and casual riders use Cyclistic bikes differently?
In other words, this is a hint that our query should be grouped by the column “member_casual” as much as possible, since that’s the crux of our analysis process.
Let’s try to find out some basic information about the whole riding system, broken down by membership type:
SQL code
SELECT
INITCAP(member_casual) AS M_C,
COUNT(ride_id) AS ride_count,
SUM(round_trip) AS round_trip_count,
(ROUND(SAFE_DIVIDE(SUM(round_trip), COUNT(ride_id)), 2)) * 100 AS percentage_round
FROM
divvy.analysis
GROUP BY
member_casual;
Result
M_C Casual Member
ride_count 1,761,015 2,682,066
round_trip_count 147,720 72,828
percentage_round 8% 3%
We can see that there is a 60:40 split between members and casuals (in favour of members) with respect to rides taken, so they already form a majority of rides. The round trips form only ~5% of all rides, but there the ratio flips, with ⅔ of the all round trips taken by casuals vs members.
Let’s get some statistical data on the rides.
SQL code
SELECT
INITCAP(member_casual) AS M_C,
TIME(TIMESTAMP_SECONDS(CAST(AVG(ride_length) AS INT64))) AS avg_ride_duration,
APPROX_QUANTILES(ride_duration, 2)[
OFFSET
(1)] AS median_ride_duration,
TIME(TIMESTAMP_SECONDS(CAST(STDDEV(ride_length) AS INT64))) AS stddev_ride_duration
FROM
divvy.analysis
GROUP BY
member_casual;
Result
M_C Casual Member
avg_ride_duration 00:23:25 00:12:30
median_ride_duration 00:13:45 00:08:55
stddev_ride_duration 00:42:10 00:18:53
Members take shorter rides than casual on average, but the significant difference between average and median rides means the data is skewed, so we also calculated the standard deviation.
The results imply that the data is highly affected by a few number of very long rides, and there is significant variation between ride durations, and it would seem that ide data is positively skewed.
But we can also calculate skewness and find it directly, which we can check for with the query below (we will be using ride_lengths in seconds for ease of calculation):
SQL code
WITH
summary_stats AS (
SELECT
INITCAP(member_casual) AS M_C,
ROUND(AVG(ride_length),2) AS mean_ride_length,
APPROX_QUANTILES(ride_length, 2)[
OFFSET
(1)] AS median_ride_length,
ROUND(STDDEV(ride_length),2) AS stddev_ride_length
FROM
divvy.analysis
GROUP BY
member_casual )
SELECT
M_C,
mean_ride_length,
median_ride_length,
stddev_ride_length,
ROUND((
SELECT
AVG(POW(ride_length - mean_ride_length, 3)) / POW(stddev_ride_length, 3)
FROM
divvy.analysis
WHERE
INITCAP(member_casual) = summary_stats.M_C ),2) AS skewness_ride_length
FROM
summary_stats;
Result
M_C Casual Member
mean_ride_length 1405.49 750.5
median_ride_length 809 539
stddev_ride_length 2530 1132.73
skewness_ride_length 14.32 32.08
This confirms that the data is significantly positively skewed, and more so for the members than casuals. In our further calculations, we will be focusing on median rather than the mean for average calculations.
We have three different type of bike_types, and the usage across membership types is interesting:
SQL code
SELECT
bike_type,
ROUND((COUNTIF(member_casual = 'casual') / COUNT(ride_id)) * 100, 0) AS casual_share,
ROUND((COUNTIF(member_casual = 'member') / COUNT(ride_id)) * 100, 0) AS member_share
FROM
divvy.analysis
GROUP BY
bike_type;
Result
bike_type casual_share member_share
electric_bike 43% 57%
classic_bike 34% 66%
docked_bike 100% 0%
Docked bikes are seemingly never used by members; this implies that the distinction between docked and undocked (perhaps classic?) bikes is important only for non-member riders and members are charged the same regardless. But otherwise, the majority of the rides are done by members.
Instead of a horizontal analysis, we can also do a vertical analysis, SQL is very flexible in that regard:
SQL code
SELECT
INITCAP(member_casual) as M_C,
ROUND((COUNTIF(bike_type = 'electric_bike') / COUNT(ride_id)) * 100, 0) AS electric_bike_share,
ROUND((COUNTIF(bike_type = 'classic_bike') / COUNT(ride_id)) * 100, 0) AS classic_bike_share,
ROUND((COUNTIF(bike_type = 'docked_bike') / COUNT(ride_id)) * 100, 0) AS docked_bike_share,
FROM
divvy.analysis
GROUP BY
member_casual;
Result
M_C Casual Member
electric_bike_share 41% 36%
classic_bike_share 50% 64%
docked_bike_share 9% 0%
This shows that even with casual riders, docked bikes are clearly a less-used product. Ebike share seems to be higher with casual rather than members, and that can be useful information for our marketing campaign.
SQL can be useful for a lot of basic data analysis, but not all fuctions are supported across all flavours of SQL. For example, Google Bigquery does not support median or mode fuctions, but we can still use other functions, or define our own functions if need be, to get to the answer.
Here we are calculating the median duration of a ride taken on any given day, seperated by membership status:
SQL code
SELECT
started_day,
APPROX_QUANTILES(
IF
(member_casual = 'casual', ride_duration, NULL), 2)[
OFFSET
(1)] AS median_ride_duration_casual,
APPROX_QUANTILES(
IF
(member_casual = 'member', ride_duration, NULL), 2)[
OFFSET
(1)] AS median_ride_duration_member
FROM
divvy.analysis
GROUP BY
started_day
ORDER BY
CASE started_day
WHEN 'Monday' THEN 1
WHEN 'Tuesday' THEN 2
WHEN 'Wednesday' THEN 3
WHEN 'Thursday' THEN 4
WHEN 'Friday' THEN 5
WHEN 'Saturday' THEN 6
WHEN 'Sunday' THEN 7
END
;
Result
started_day median_ride_duration_casual median_ride_duration_member
Monday 00:13:22 00:08:36
Tuesday 00:12:08 00:08:44
Wednesday 00:11:53 00:08:44
Thursday 00:12:06 00:08:52
Friday 00:13:12 00:08:50
Saturday 00:15:41 00:09:55
Sunday 00:16:04 00:09:58
As we can see from the data, median average ride taken by casual riders is longer than members by atleast 5 minutes on any given day; by the weekend the casual rider are taking rides atleast twice as long as the members.
We can figure out the most common trip duration by finding the modal rides:
SQL code
WITH
casual AS (
SELECT
started_day,
ride_duration,
COUNT(*) AS count
FROM
divvy.analysis
WHERE
member_casual = 'casual'
GROUP BY
started_day,
ride_duration),
member AS (
SELECT
started_day,
ride_duration,
COUNT(*) AS count
FROM
divvy.analysis
WHERE
member_casual = 'member'
GROUP BY
started_day,
ride_duration)
SELECT
casual.started_day,
casual.ride_duration AS mode_ride_duration_casual,
member.ride_duration AS mode_ride_duration_member
FROM
casual
JOIN (
SELECT
started_day,
MAX(count) AS max_count
FROM
casual
GROUP BY
started_day) max_casual
ON
casual.started_day = max_casual.started_day
AND casual.count = max_casual.max_count
JOIN (
SELECT
started_day,
MAX(count) AS max_count
FROM
member
GROUP BY
started_day) max_member
ON
casual.started_day = max_member.started_day
JOIN
member
ON
member.started_day = max_member.started_day
AND member.count = max_member.max_count
ORDER BY
CASE casual.started_day
WHEN 'Monday' THEN 1
WHEN 'Tuesday' THEN 2
WHEN 'Wednesday' THEN 3
WHEN 'Thursday' THEN 4
WHEN 'Friday' THEN 5
WHEN 'Saturday' THEN 6
WHEN 'Sunday' THEN 7
END
;
Result
started_day mode_ride_duration_casual mode_ride_duration_member
Monday 00:06:30 00:04:51
Tuesday 00:06:25 00:04:34
Tuesday 00:05:51 00:04:34
Wednesday 00:05:46 00:04:32
Thursday 00:06:36 00:05:35
Friday 00:07:06 00:04:52
Saturday 00:08:08 00:04:21
Sunday 00:08:19 00:04:39
The most common trip duration for members is about 5 minutes; while casual members fluctuate from 6-8 minutes for their most common ride. We can see that even the smallest common trip by casual riders is longer than the longest most common trip duration by members.
Combined with the median data, we can see that casual riders ride for much longer on any given trip.
SQL can also be used for large scale tabular analysis; since it can handle very large data with ease.
We an calculate the casual-to-member ratio of riders on any given day:
SQL code
SELECT
started_day,
ROUND((COUNTIF(member_casual = 'casual') / COUNT(ride_id)) * 100, 0) AS casual_share,
ROUND((COUNTIF(member_casual = 'member') / COUNT(ride_id)) * 100, 0) AS member_share
FROM
divvy.analysis
GROUP BY
started_day
ORDER BY
CASE started_day
WHEN 'Monday' THEN 1
WHEN 'Tuesday' THEN 2
WHEN 'Wednesday' THEN 3
WHEN 'Thursday' THEN 4
WHEN 'Friday' THEN 5
WHEN 'Saturday' THEN 6
WHEN 'Sunday' THEN 7
END
;
Result
started_day casual_share member_share
Monday 35% 65%
Tuesday 32% 68%
Wednesday 33% 67%
Thursday 35% 65%
Friday 41% 59%
Saturday 51% 49%
Sunday 49% 51%
It seems that on weekdays, members form the overwhelming majority of riders, but weekends,they even out with casual riders.
The code above was a split on daily basis, but we could very easily split it on monthly or hourly basis, as we see fit. However, while SQL could crunch and present all those numbers for us... actually interpreting that large wall of text can become cumbersome.
SQL is good for a lot of things, but it can not create graphs or charts; and visualisation is key when we get into large tables of information. for that we move on to our next stage and get into visualisation.