Cyclistic - ANALYSE

Cyclistic

ANALYSE

Now finally we get to the analysis part of data analysis, with the current stage, ANALYSE.

We will be running SQL queries on the divvy.analysis table to get some information from all this data.

First let’s see how many rows of data we have:

SQL code

SELECT

COUNT(*)

FROM

divvy.analysis;

Result

4,443,081

So that’s a 24% reduction of rows, so all that data cleaning was worth it.

The Case Study Roadmap suggests some potential analysis paths, so let’s review those.

But first, we need to correct an oversight from the process stage, and create another column; ride_length is good enough for calculations, but if we wish to have human-understandable information, it’s better to have it in terms of hours:minutes:seconds, instead of a large number of seconds that provide no context for the actual length.

SQL code

ALTER TABLE

divvy.analysis ADD COLUMN ride_duration TIME;

UPDATE

divvy.analysis

SET

ride_duration = TIME(TIMESTAMP_SECONDS(ride_length))

WHERE

1=1;

Result

ride_duration TIME

Now we can choose how to calculate and present our analysis, by picking the appropriate column when needed.

Let’s go back to the first stage, and remember what we were asked:

How do annual members and casual riders use Cyclistic bikes differently?

In other words, this is a hint that our query should be grouped by the column “member_casual” as much as possible, since that’s the crux of our analysis process.

Let’s try to find out some basic information about the whole riding system, broken down by membership type:

SQL code

SELECT

INITCAP(member_casual) AS M_C,

COUNT(ride_id) AS ride_count,

SUM(round_trip) AS round_trip_count,

(ROUND(SAFE_DIVIDE(SUM(round_trip), COUNT(ride_id)), 2)) * 100 AS percentage_round

FROM

divvy.analysis

GROUP BY

member_casual;

Result

M_C Casual Member

ride_count 1,761,015 2,682,066

round_trip_count 147,720 72,828

percentage_round 8% 3%

We can see that there is a 60:40 split between members and casuals (in favour of members) with respect to rides taken, so they already form a majority of rides. The round trips form only ~5% of all rides, but there the ratio flips, with ⅔ of the all round trips taken by casuals vs members.

Let’s get some statistical data on the rides.

SQL code

SELECT

INITCAP(member_casual) AS M_C,

TIME(TIMESTAMP_SECONDS(CAST(AVG(ride_length) AS INT64))) AS avg_ride_duration,

APPROX_QUANTILES(ride_duration, 2)[

OFFSET

(1)] AS median_ride_duration,

TIME(TIMESTAMP_SECONDS(CAST(STDDEV(ride_length) AS INT64))) AS stddev_ride_duration

FROM

divvy.analysis

GROUP BY

member_casual;

Result

M_C Casual Member

avg_ride_duration 00:23:25 00:12:30

median_ride_duration 00:13:45 00:08:55

stddev_ride_duration 00:42:10 00:18:53

Members take shorter rides than casual on average, but the significant difference between average and median rides means the data is skewed, so we also calculated the standard deviation.

The results imply that the data is highly affected by a few number of very long rides, and there is significant variation between ride durations, and it would seem that ide data is positively skewed.

But we can also calculate skewness and find it directly, which we can check for with the query below (we will be using ride_lengths in seconds for ease of calculation):

SQL code

WITH

summary_stats AS (

SELECT

INITCAP(member_casual) AS M_C,

ROUND(AVG(ride_length),2) AS mean_ride_length,

APPROX_QUANTILES(ride_length, 2)[

OFFSET

(1)] AS median_ride_length,

ROUND(STDDEV(ride_length),2) AS stddev_ride_length

FROM

divvy.analysis

GROUP BY

member_casual )

SELECT

M_C,

mean_ride_length,

median_ride_length,

stddev_ride_length,

ROUND((

SELECT

AVG(POW(ride_length - mean_ride_length, 3)) / POW(stddev_ride_length, 3)

FROM

divvy.analysis

WHERE

INITCAP(member_casual) = summary_stats.M_C ),2) AS skewness_ride_length

FROM

summary_stats;

Result

M_C Casual Member

mean_ride_length 1405.49 750.5

median_ride_length 809 539

stddev_ride_length 2530 1132.73

skewness_ride_length 14.32 32.08

This confirms that the data is significantly positively skewed, and more so for the members than casuals. In our further calculations, we will be focusing on median rather than the mean for average calculations.

We have three different type of bike_types, and the usage across membership types is interesting:

SQL code

SELECT

bike_type,

ROUND((COUNTIF(member_casual = 'casual') / COUNT(ride_id)) * 100, 0) AS casual_share,

ROUND((COUNTIF(member_casual = 'member') / COUNT(ride_id)) * 100, 0) AS member_share

FROM

divvy.analysis

GROUP BY

bike_type;

Result

bike_type casual_share member_share

electric_bike 43% 57%

classic_bike 34% 66%

docked_bike 100% 0%

Docked bikes are seemingly never used by members; this implies that the distinction between docked and undocked (perhaps classic?) bikes is important only for non-member riders and members are charged the same regardless. But otherwise, the majority of the rides are done by members.

Instead of a horizontal analysis, we can also do a vertical analysis, SQL is very flexible in that regard:

SQL code

SELECT

INITCAP(member_casual) as M_C,

ROUND((COUNTIF(bike_type = 'electric_bike') / COUNT(ride_id)) * 100, 0) AS electric_bike_share,

ROUND((COUNTIF(bike_type = 'classic_bike') / COUNT(ride_id)) * 100, 0) AS classic_bike_share,

ROUND((COUNTIF(bike_type = 'docked_bike') / COUNT(ride_id)) * 100, 0) AS docked_bike_share,

FROM

divvy.analysis

GROUP BY

member_casual;

Result

M_C Casual Member

electric_bike_share 41% 36%

classic_bike_share 50% 64%

docked_bike_share 9% 0%

This shows that even with casual riders, docked bikes are clearly a less-used product. Ebike share seems to be higher with casual rather than members, and that can be useful information for our marketing campaign.

SQL can be useful for a lot of basic data analysis, but not all fuctions are supported across all flavours of SQL. For example, Google Bigquery does not support median or mode fuctions, but we can still use other functions, or define our own functions if need be, to get to the answer.

Here we are calculating the median duration of a ride taken on any given day, seperated by membership status:

SQL code

SELECT

started_day,

APPROX_QUANTILES(

(member_casual = 'casual', ride_duration, NULL), 2)[

OFFSET

(1)] AS median_ride_duration_casual,

APPROX_QUANTILES(

(member_casual = 'member', ride_duration, NULL), 2)[

OFFSET

(1)] AS median_ride_duration_member

FROM

divvy.analysis

GROUP BY

started_day

ORDER BY

CASE started_day

WHEN 'Monday' THEN 1

WHEN 'Tuesday' THEN 2

WHEN 'Wednesday' THEN 3

WHEN 'Thursday' THEN 4

WHEN 'Friday' THEN 5

WHEN 'Saturday' THEN 6

WHEN 'Sunday' THEN 7

END

;

Result

started_day median_ride_duration_casual median_ride_duration_member

Monday 00:13:22 00:08:36

Tuesday 00:12:08 00:08:44

Wednesday 00:11:53 00:08:44

Thursday 00:12:06 00:08:52

Friday 00:13:12 00:08:50

Saturday 00:15:41 00:09:55

Sunday 00:16:04 00:09:58

As we can see from the data, median average ride taken by casual riders is longer than members by atleast 5 minutes on any given day; by the weekend the casual rider are taking rides atleast twice as long as the members.

We can figure out the most common trip duration by finding the modal rides:

SQL code

WITH

casual AS (

SELECT

started_day,

ride_duration,

COUNT(*) AS count

FROM

divvy.analysis

WHERE

member_casual = 'casual'

GROUP BY

started_day,

ride_duration),

member AS (

SELECT

started_day,

ride_duration,

COUNT(*) AS count

FROM

divvy.analysis

WHERE

member_casual = 'member'

GROUP BY

started_day,

ride_duration)

SELECT

casual.started_day,

casual.ride_duration AS mode_ride_duration_casual,

member.ride_duration AS mode_ride_duration_member

FROM

casual

JOIN (

SELECT

started_day,

MAX(count) AS max_count

FROM

casual

GROUP BY

started_day) max_casual

casual.started_day = max_casual.started_day

AND casual.count = max_casual.max_count

JOIN (

SELECT

started_day,

MAX(count) AS max_count

FROM

member

GROUP BY

started_day) max_member

casual.started_day = max_member.started_day

JOIN

member

member.started_day = max_member.started_day

AND member.count = max_member.max_count

ORDER BY

CASE casual.started_day

WHEN 'Monday' THEN 1

WHEN 'Tuesday' THEN 2

WHEN 'Wednesday' THEN 3

WHEN 'Thursday' THEN 4

WHEN 'Friday' THEN 5

WHEN 'Saturday' THEN 6

WHEN 'Sunday' THEN 7

END

;

Result

started_day mode_ride_duration_casual mode_ride_duration_member

Monday 00:06:30 00:04:51

Tuesday 00:06:25 00:04:34

Tuesday 00:05:51 00:04:34

Wednesday 00:05:46 00:04:32

Thursday 00:06:36 00:05:35

Friday 00:07:06 00:04:52

Saturday 00:08:08 00:04:21

Sunday 00:08:19 00:04:39

The most common trip duration for members is about 5 minutes; while casual members fluctuate from 6-8 minutes for their most common ride. We can see that even the smallest common trip by casual riders is longer than the longest most common trip duration by members.

Combined with the median data, we can see that casual riders ride for much longer on any given trip.

SQL can also be used for large scale tabular analysis; since it can handle very large data with ease.

We an calculate the casual-to-member ratio of riders on any given day:

SQL code

SELECT

started_day,

ROUND((COUNTIF(member_casual = 'casual') / COUNT(ride_id)) * 100, 0) AS casual_share,

ROUND((COUNTIF(member_casual = 'member') / COUNT(ride_id)) * 100, 0) AS member_share

FROM

divvy.analysis

GROUP BY

started_day

ORDER BY

CASE started_day

WHEN 'Monday' THEN 1

WHEN 'Tuesday' THEN 2

WHEN 'Wednesday' THEN 3

WHEN 'Thursday' THEN 4

WHEN 'Friday' THEN 5

WHEN 'Saturday' THEN 6

WHEN 'Sunday' THEN 7

END

;

Result

started_day casual_share member_share

Monday 35% 65%

Tuesday 32% 68%

Wednesday 33% 67%

Thursday 35% 65%

Friday 41% 59%

Saturday 51% 49%

Sunday 49% 51%

It seems that on weekdays, members form the overwhelming majority of riders, but weekends,they even out with casual riders.

The code above was a split on daily basis, but we could very easily split it on monthly or hourly basis, as we see fit. However, while SQL could crunch and present all those numbers for us... actually interpreting that large wall of text can become cumbersome.

SQL is good for a lot of things, but it can not create graphs or charts; and visualisation is key when we get into large tables of information. for that we move on to our next stage and get into visualisation.

Page updated

Google Sites

Report abuse