A few questions come to mind when looking at the data below.
• Which neighborhood has the highest pickup rate?
• Which neighborhood has the highest drop-off rate?
• What could be the possible reasons for these facts?
• Is there a correlation between the number of customers per ride and the length of the ride?
• What about the price of a ride and the distance? Is this always obvious?
• Can we predict that the number of customers will significantly increase at the same time the following year?
We want to examine and analyze the trends of New York City Green Taxi rides on an ordinary working weekday. We chose Thursday, February 9, 2017, since it is not a major holiday. It happens to be Chocolate Day and, if you had too much of it, National Toothache Day! Possibly a good reason to grab a Green cab?
Let's investigate...
Data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Coding, graphing, etc: https://github.com/alizasp/GreenTaxi
It appears that most of NYC Green Taxi's pick-ups and drop-offs are in Brooklyn, then Queens, and then Harlem. This makes sense for Brooklyn and Queens, since they have the largest populations of NYC's five boroughs. As for Harlem, further inquiry is needed to understand why it ranks third even though its population is much smaller than those of Brooklyn and Queens: 345,193 according to NYC City Planning.
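A ranking like this can be produced with a simple frequency table. The following is only a minimal sketch on toy data: the column name `pickup_zone` is a hypothetical label chosen for illustration and is not taken from the TLC file.

```r
# Toy data; `pickup_zone` is a hypothetical column name, not the real TLC schema.
trips <- data.frame(
  pickup_zone = c("Brooklyn", "Queens", "Brooklyn", "Harlem", "Queens", "Brooklyn")
)

# Count trips per zone and sort from most to least frequent.
pickup_counts <- sort(table(trips$pickup_zone), decreasing = TRUE)

# Convert the counts to shares of all trips.
pickup_share <- prop.table(pickup_counts)

print(pickup_counts)
print(round(pickup_share, 2))
```

The same two lines applied to a drop-off column would give the drop-off ranking.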
From the box plot and the histogram above, we can observe that the distribution is skewed to the right and leptokurtic, and that there are some outliers.
In fact, the numbers confirm that: in R, we calculated that skewness(GreenT$fare_amount) = 7.098686, which is greater than 0.004842972 = 2*sqrt(6/length(GreenT$fare_amount)). Hence, we confirm that this distribution is indeed skewed to the right.
This distribution is leptokurtic, since in R we obtain kurtosis(GreenT$fare_amount) = 414.7179, which is greater than 0.009685945 = 4*sqrt(6/length(GreenT$fare_amount)).
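The two cutoffs used above are rules of thumb: roughly two standard errors of the skewness, 2*sqrt(6/n), and of the kurtosis, 4*sqrt(6/n), which equals 2*sqrt(24/n). Here is a minimal sketch of the check in base R on simulated data, not on the real GreenT file. The helper names are ours. Note that, depending on the package, kurtosis() may report raw kurtosis (3 for a normal distribution) or excess kurtosis (0 for a normal distribution); the sketch uses excess kurtosis.

```r
# Moment-based sample skewness, written in base R.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based excess kurtosis (0 for a normal distribution).
sample_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# Simulated right-skewed "fares" (log-normal); NOT the real GreenT data.
set.seed(1)
fares <- rlnorm(10000, meanlog = 2, sdlog = 0.6)
n <- length(fares)

skew_cutoff <- 2 * sqrt(6 / n)   # about two standard errors of the skewness
kurt_cutoff <- 4 * sqrt(6 / n)   # same as 2 * sqrt(24 / n)

is_right_skewed <- sample_skewness(fares) > skew_cutoff
is_leptokurtic  <- sample_excess_kurtosis(fares) > kurt_cutoff

print(c(right_skewed = is_right_skewed, leptokurtic = is_leptokurtic))
```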
Looking above at the left boxplot, we can observe some outliers. The maximum outlier is close to $200 and the minimum outlier is -$50. We can hypothesize and infer that $200 for a taxi ride either means a long distance or a distance with significant traffic. The -$50 might refer to a refund, or some type of loss. Some companies offer a refund or discount if a certain goal has not been achieved (for example: greeting the passenger, arriving within a certain amount of time, etc.)
However, looking above at the right boxplot, which omits the outliers, we observe that the median is $8, the lower whisker ends at -$5, and the upper whisker ends at $22. Also, the middle 50% of NYC Green Taxi fares on February 9, 2017 lie between $5.50 and $12, which would account for most trips within New York City, with or without traffic.
The graphs below reveal the approximate outliers of -$50 and $200. They also demonstrate that most fare amounts are below $50.
For more precision, the five-number summary calculated in R provides the following measures of spread:
Min     Q1     Median   Q3      Max
-52.0   5.5    8.0      12.0    190.5
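A summary like this can be obtained with fivenum(), and the points that boxplot() draws as outliers can be listed with boxplot.stats(). The sketch below uses simulated fares rather than the real GreenT data. It also checks which share of the observations falls inside the box, between Q1 and Q3.

```r
# Simulated right-skewed fares; NOT the real GreenT data.
set.seed(2)
fares <- round(rlnorm(5000, meanlog = 2.1, sdlog = 0.5), 2)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(fares)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")

# Share of fares between the hinges, i.e. inside the "box" of a boxplot.
share_in_box <- mean(fares >= five["Q1"] & fares <= five["Q3"])

# Points beyond 1.5 * IQR from the hinges are drawn by boxplot() as outliers.
outliers <- boxplot.stats(fares)$out

print(five)
print(share_in_box)
print(length(outliers))
```

Calling boxplot(fares, outline = FALSE) draws the version of the plot without the outlier points.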
From the box plot and the histogram above, we can observe that the Trip Distances distribution is skewed to the right and leptokurtic, and that there are some outliers.
In fact, the numbers confirm that: in R, we calculated that skewness(GreenT$trip_distance) is 3.609, which is greater than 0.0048 = 2*sqrt(6/length(GreenT$trip_distance)). Hence, we confirm that this distribution is indeed skewed to the right.
This distribution is leptokurtic, since in R we obtain a kurtosis(GreenT$trip_distance) of 39.93, which is greater than 0.0097 = 4*sqrt(6/length(GreenT$trip_distance)).
Looking above at the left boxplot, we can observe some outliers. The maximum outlier is close to 50 miles. We can hypothesize that 50 miles might be a trip to the airport or from one borough to another.
However, looking at the right boxplot, which omits the outliers, we observe that the median is 1.5 miles, the lower whisker ends at 0, and the upper whisker ends at almost 6 miles. Also, the middle 50% of NYC Green Taxi trip distances on February 9, 2017 lie between 0.9 and 2.9 miles, which would account for most trips within New York City, with or without traffic.
The stripchart demonstrates that most trip distances on February 9 are shorter than 20 miles. Both graphs below show the outlier of almost 50 miles and the same overall trend, while reflecting the number of trips completed on February 9, 2017: over 15,000.
For more precision, the five-number summary calculated in R provides the following measures of spread:
Min     Q1     Median   Q3      Max
0.00    0.92   1.53     2.89    47.80
From the box plot and the histogram above, we can observe that the total amount distribution is skewed to the right and leptokurtic, and that there are some outliers.
In fact, the numbers confirm that: in R, we calculated
> skewness(GreenT$fare_amount)
[1] 3.567158
> 2*sqrt(6/length(GreenT$fare_amount))
[1] 0.03880266
3.567158 is greater than 0.03880266. Hence, we confirm that this distribution is indeed skewed to the right.
> kurtosis(GreenT$fare_amount)
[1] 31.87122
> 4*sqrt(6/length(GreenT$fare_amount))
[1] 0.07760531
This distribution is leptokurtic since in R we obtain a kurtosis of 31.87122 which is greater than 0.07760531.
Looking above at the left boxplot, we can observe that the distribution is skewed to the right and that there are some outliers.
The maximum outlier is close to $600 and the minimum outlier is -$50. We have hypothesized that $200 for a taxi ride either means a long distance or a distance with significant traffic. Assuming that the $600 fare total amount pertains to the $200 ride, the additional $400 must account for taxes, fees, tolls. There is a possibility that the $200 fare amount and the $600 total are not related. A further analysis is necessary to determine that.
However, looking above at the right boxplot, which omits the outliers, we observe that the median is about $10, the lower whisker ends at -$5, and the upper whisker ends at $26. Also, the middle 50% of NYC Green Taxi total amounts on February 9, 2017 lie between $7.50 and $15, which would account for most trips within New York City, with or without traffic.
Both strip plots below show the same trend while indicating the number of trips completed on February 9, 2017: over 15,000.
For more precision, the five-number summary calculated in R provides the following measures of spread:
Min      Q1     Median   Q3      Max
-52.80   7.56   9.96     14.80   581.39
> mpass <- mean(GreenT$passenger_count)
> mpass
[1] 1.355083
> spass <- sd(GreenT$passenger_count)
> spass
[1] 1.042309
> n <- 15954
> # z-based alternative: error <- qnorm(0.975)*spass/sqrt(n)
> # we use the t-distribution because the population sd is unknown; for n this large the two nearly coincide
> error <- qt(0.975,df=n-1)*spass/sqrt(n)
> error
[1] 0.01617494
> left <- mpass - error
> right <- mpass + error
> left
[1] 1.338908
> right
[1] 1.371258
>
We can be 95% confident that the population mean passenger count is between 1.34 and 1.37, given the sample mean of 1.36 and a margin of error of + or - 0.01617. (qt(0.975, ...) leaves 2.5% in each tail, which gives a 95% two-sided interval.) Of course, since we are dealing with people, individual counts are integers. Hence we can understand this to mean that, on average, each ride has a little more than 1 person. Another way to look at this: over 100 rides, about 134 to 137 people will travel. The same reading applies to any decimal average that relates to living beings (not a live chicken that will be dinner...)
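The interval arithmetic can be wrapped in a small helper. This is only a sketch: the function name `ci_from_summary` is ours, and the inputs are the mean, standard deviation, and n printed in the session above. It also makes explicit how the confidence level maps to the quantile passed to qt().

```r
# t-based confidence interval for a mean, computed from summary statistics.
ci_from_summary <- function(mean_x, sd_x, n, level = 0.95) {
  alpha <- 1 - level
  # A two-sided interval puts alpha / 2 in each tail, hence qt(1 - alpha / 2).
  margin <- qt(1 - alpha / 2, df = n - 1) * sd_x / sqrt(n)
  c(lower = mean_x - margin, upper = mean_x + margin, margin = margin)
}

# Values printed in the session above for passenger_count.
ci_pass <- ci_from_summary(mean_x = 1.355083, sd_x = 1.042309, n = 15954)
print(round(ci_pass, 5))

# A 97.5% interval would use qt(0.9875, ...) instead and be slightly wider.
ci_975 <- ci_from_summary(1.355083, 1.042309, 15954, level = 0.975)
print(round(ci_975, 5))
```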
> mdist <- mean(GreenT$trip_distance)
> mdist
[1] 2.383143
> sdist <- sd(GreenT$trip_distance)
> sdist
[1] 2.509089
> n <- 15954
> E_Dist <- qt(0.975,df=n-1)*sdist/sqrt(n) #(t-distribution since the population sd is unknown)
> E_Dist
[1] 0.03893699
> left <- mdist - E_Dist
> right <- mdist + E_Dist
> left
[1] 2.344206
> right
[1] 2.42208
>
>
We can be 95% confident that the population mean of distance traveled by Green Taxis in NYC is between 2.344 and 2.422 miles, given the sample mean of 2.383 miles and a margin of error of 0.03894 miles. Because these distances are fairly short, we can assume that the average ride stays within the same neighborhood.
> mFare<- mean(GreenT$fare_amount)
> mFare
[1] 10.26136
> sFare <- sd(GreenT$fare_amount)
> sFare
[1] 7.81993
> n <- 15954
> E_Fare <- qt(0.975,df=n-1)*sFare/sqrt(n) #(t-distribution since the population sd is unknown)
> E_Fare
[1] 0.1213526
> left <- mFare - E_Fare
> right <- mFare + E_Fare
> left
[1] 10.14001
> right
[1] 10.38272
>
>
This means that we can be 95% confident that the population mean fare amount per trip is between $10.14 and $10.38, given the sample mean of $10.26 and a margin of error of 0.12135.
> mean(GreenT$total_amount,na.rm=TRUE)
[1] 12.77411
> mTotal <- mean(GreenT$total_amount,na.rm=TRUE)
> mTotal
[1] 12.77411
> sTotal <- sd(GreenT$total_amount,na.rm=TRUE)
> sTotal
[1] 10.26934
> n <- 15954
> E_Total <- qt(0.975,df=n-1)*sTotal/sqrt(n) #(t-distribution since the population sd is unknown)
> E_Total
[1] 0.1593635
> left <- mTotal - E_Total
> right <- mTotal + E_Total
> left
[1] 12.61474
> right
[1] 12.93347
>
>
This means that we can be 95% confident that the population mean total amount per trip is between $12.61 and $12.93, given the sample mean of $12.77 and a margin of error of 0.15936.
> mean(GreenT$trip_distance)
[1] 2.383143
>
> #t-test with conf. level of 95%:
>
> #one side: with (Ho = 2.38) and (Ha > 2.38)
> t.test(GreenT$trip_distance, mu=2.38, alternative = "greater", conf.level = 0.95)
One Sample t-test
data: GreenT$trip_distance
t = 0.15824, df = 15953, p-value = 0.4371
alternative hypothesis: true mean is greater than 2.38
95 percent confidence interval:
2.350467 Inf
sample estimates:
mean of x
2.383143
Since α = 0.05 and the p-value is 0.4371 > 0.05, we fail to reject the null hypothesis.
Since a one-sided test might be more prone to error, we will also conduct a two-sided test:
>
> #two sided: with (Ho = 2.38) and (Ha not= 2.38)
> t.test(GreenT$trip_distance, mu=2.38, alternative = "two.sided", conf.level = 0.95)
One Sample t-test
data: GreenT$trip_distance
t = 0.15824, df = 15953, p-value = 0.8743
alternative hypothesis: true mean is not equal to 2.38
95 percent confidence interval:
2.344206 2.422080
sample estimates:
mean of x
2.383143
>
>
For this two-sided t-test with α = 0.05, the p-value is 0.8743 > 0.05, so we fail to reject the null hypothesis. The p-value of a two-sided test already accounts for both tails, so it is compared with α itself rather than with α/2. Thus, we can be 95% confident that the population mean of distance traveled by Green Taxis in NYC is between 2.344 and 2.422 miles, given the sample mean of 2.383 miles and a margin of error of 0.03894 as shown above. Because these distances are fairly short, we can assume that the average ride stays within the same neighborhood.
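To make the decision rule concrete, the p-values can be recomputed from the t statistic and degrees of freedom printed above. This is a sketch of the arithmetic only; it does not rerun t.test() on the data, and the variable names are ours.

```r
# t statistic and degrees of freedom printed above for trip_distance.
t_stat <- 0.15824
df <- 15953
alpha <- 0.05

# Upper-tail probability for Ha: mean > 2.38.
p_greater <- pt(t_stat, df, lower.tail = FALSE)

# Both tails for Ha: mean != 2.38.
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# Reject H0 only when the p-value is smaller than alpha.
reject_greater   <- p_greater < alpha
reject_two_sided <- p_two_sided < alpha

print(round(c(one_sided = p_greater, two_sided = p_two_sided), 4))
print(c(reject_greater, reject_two_sided))
```

With a t statistic this close to zero, neither test comes anywhere near rejecting the null hypothesis.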
> mean(GreenT$fare_amount)
[1] 10.26136
>
> #t-test with conf. level of 95%:
>
> #one side: with (Ho = 10.26) and (Ha > 10.26)
> t.test(GreenT$fare_amount, mu=10.26, alternative = "greater", conf.level = 0.95)
One Sample t-test
data: GreenT$fare_amount
t = 0.02204, df = 15953, p-value = 0.4912
alternative hypothesis: true mean is greater than 10.26
95 percent confidence interval:
10.15952 Inf
sample estimates:
mean of x
10.26136
Since α = 0.05 and the p-value is 0.4912 > 0.05, we fail to reject the null hypothesis.
Since a one-sided test might be more prone to error, we will also conduct a two-sided test:
>
> #two sided: with (Ho = 10.26) and (Ha not= 10.26)
> t.test(GreenT$fare_amount, mu=10.26, alternative = "two.sided", conf.level = 0.95)
One Sample t-test
data: GreenT$fare_amount
t = 0.02204, df = 15953, p-value = 0.9824
alternative hypothesis: true mean is not equal to 10.26
95 percent confidence interval:
10.14001 10.38272
sample estimates:
mean of x
10.26136
>
>
For this two-sided t-test with α = 0.05, the p-value is 0.9824 > 0.05, so we fail to reject the null hypothesis. Thus, we can be 95% confident that the population mean fare amount per trip is between $10.14 and $10.38, given the sample mean of $10.26 and a margin of error of 0.12135 as shown above.
> mean(GreenT$total_amount)
[1] 12.77411
>
> #t-test with conf. level of 95%:
>
> #one side: with (Ho = 12.77) and (Ha > 12.77)
> t.test(GreenT$total_amount, mu=12.77, alternative = "greater", conf.level = 0.95)
One Sample t-test
data: GreenT$total_amount
t = 0.050528, df = 15953, p-value = 0.4799
alternative hypothesis: true mean is greater than 12.77
95 percent confidence interval:
12.64037 Inf
sample estimates:
mean of x
12.77411
Since α = 0.05 and the p-value is 0.4799 > 0.05, we fail to reject the null hypothesis.
Since a one-sided test might be more prone to error, we will also conduct a two-sided test:
>
> #two sided: with (Ho = 12.77) and (Ha not= 12.77)
> t.test(GreenT$total_amount, mu=12.77, alternative = "two.sided", conf.level = 0.95)
One Sample t-test
data: GreenT$total_amount
t = 0.050528, df = 15953, p-value = 0.9597
alternative hypothesis: true mean is not equal to 12.77
95 percent confidence interval:
12.61474 12.93347
sample estimates:
mean of x
12.77411
For this two-sided t-test with α = 0.05, the p-value is 0.9597 > 0.05, so we fail to reject the null hypothesis. Thus, we can be 95% confident that the population mean total amount per trip is between $12.61 and $12.93, given the sample mean of $12.77 and a margin of error of 0.15936 as shown above.
> cor(GreenT$passenger_count, GreenT$trip_distance)
[1] 0.01422424
> linearMod <- lm(passenger_count ~ trip_distance, data=GreenT) # build linear regression model on full data
> print(linearMod)
Call:
lm(formula = passenger_count ~ trip_distance, data = GreenT)
Coefficients:
(Intercept) trip_distance
1.341120 0.005915
> summary(linearMod)
Call:
lm(formula = passenger_count ~ trip_distance, data = GreenT)
Residuals:
Min 1Q Median 3Q Max
-1.3414 -0.3563 -0.3488 -0.3450 4.6589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.341120 0.011390 117.748 <2e-16 ***
trip_distance 0.005915 0.003294 1.796 0.0725 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.043 on 15938 degrees of freedom
Multiple R-squared: 0.0002023, Adjusted R-squared: 0.0001396
F-statistic: 3.225 on 1 and 15938 DF, p-value: 0.07252
The correlation between passenger count and trip distance is negligible: r = 0.014, so distance explains only about 0.02% of the variation in passenger count (R-squared = 0.0002), and the slope is not significant at the 5% level (p = 0.0725). Indeed, regardless of the distance, most NYC Green Taxi rides carry only one passenger. This is consistent with the mean passenger count (sample mean = 1.36) for the sample of about 16,000 rides, as stated above.
> cor(GreenT$fare_amount, GreenT$trip_distance)
[1] 0.8960244
> #fare_amount = Intercept + (β * trip_distance)
>
> linearMod <- lm(fare_amount ~ trip_distance, data=GreenT) # build linear regression model on full data
> print(linearMod)
Call:
lm(formula = fare_amount ~ trip_distance, data = GreenT)
Coefficients:
(Intercept) trip_distance
3.606 2.793
>
> summary(linearMod)
Call:
lm(formula = fare_amount ~ trip_distance, data = GreenT)
Residuals:
Min 1Q Median 3Q Max
-55.689 -0.946 -0.408 0.317 91.394
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.60553 0.03791 95.1 <2e-16 ***
trip_distance 2.79313 0.01096 254.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.471 on 15938 degrees of freedom
Multiple R-squared: 0.8029, Adjusted R-squared: 0.8028
F-statistic: 6.491e+04 on 1 and 15938 DF, p-value: < 2.2e-16
The correlation between fare amount and trip distance is very strong: r = 0.90, close to 1, which would be a perfect positive correlation. It comes as no surprise that a longer trip costs more: the fitted line has an intercept of about $3.61 and adds about $2.79 per mile. The R-squared of 0.80 means that distance explains about 80% of the variation in fares. One can hypothesize that the remaining 20% reflects factors such as heavy traffic, where the fare is high relative to the distance.
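As a quick illustration, the fitted line can be used to estimate a fare from a distance, and the R-squared can be recovered from the correlation. The sketch below only reuses the coefficients and the correlation printed above; the helper `predict_fare` is our own name, and on the real data one would call predict(linearMod, ...) instead.

```r
# Coefficients and correlation as printed in the output above.
intercept <- 3.60553    # dollars
slope     <- 2.79313    # dollars per mile
r         <- 0.8960244

# Fitted fare for a given trip distance in miles.
predict_fare <- function(miles) intercept + slope * miles

# In simple linear regression, R-squared is the squared correlation.
r_squared   <- r^2
unexplained <- 1 - r_squared

print(round(predict_fare(c(1, 2.38, 5)), 2))
print(round(c(r_squared = r_squared, unexplained = unexplained), 4))
```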
• Which neighborhood has the highest pickup rate?
Brooklyn then Queens then Harlem.
• Which neighborhood has the highest drop-off rate?
Brooklyn then Queens then Harlem.
• What could be the possible reasons for these facts?
Brooklyn and Queens are vast, and the distances between locations within each borough are potentially very large. In addition, the population of these areas is larger than that of the other boroughs and neighborhoods.
The most cost-efficient mode of transportation for New Yorkers is the subway. However, it does not necessarily reach every neighborhood of Brooklyn (East Flatbush, Borough Park, Sunset Park, or Bensonhurst, for example). It is common for Brooklyn residents to either walk significant distances to get home or take the bus from the subway. February is one of the coldest months of the year, and this could explain an elevated rate of taxi pick-ups. Further analysis comparing other months could confirm that.
• Is there a correlation between the number of customers per ride and the length of the ride?
There is not much of a correlation between the number of customers per ride and its distance. A traveler who is pressed for time, or who wants safety and peace of mind, will call a taxi regardless of the distance. It all depends on the reason for the trip.
Going to a destination to earn less than $100 for the day might not justify the use of a taxi. However, when the trip brings in a seven-digit income, calling a cab is warranted, and the decision might make the difference between success and failure. In this case, a single New Yorker will hire that green taxi regardless of the distance, without even considering waiting to form a group to share the cost of a lengthy and thus expensive ride.
Conversely, if hiring a taxi will make a good impression on a client, one will do so even if the trip is short, regardless of the number of people, especially on a humid summer day or a wet and windy winter day.
• What about the price of a ride and the distance? Is this always obvious?
In general, there is an obvious correlation between the fare amount per ride and the distance.
However, interestingly, the most expensive ride, a 10:55 p.m. pick-up from Brooklyn, totals $581. This is not due to distance but to tolls: the actual fare amount is only $26. It is difficult to comprehend how tolls anywhere to and from Brooklyn could reach this amount. It is more probable that the extra amount is due to a traffic ticket or some other irregularity, and that it was recorded as tolls because that was the closest available category.
• Can we predict that the number of customers will significantly increase at the same time the following year?
On Thursday, February 9, 2017, 15,954 rides were recorded, and on Friday, February 9, 2018, almost double that number was recorded: 31,350 rides. This could mean that more customers are choosing to hire green taxis, or that more data was collected in 2018. Neither the mean distance nor the mean fare amount changed dramatically. Further analysis is needed to confirm any increase in customers in 2018.
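The size of the year-over-year change can be checked with a few lines of arithmetic. The sketch below takes the 2017 count from the n = 15954 used in the R sessions above and the 2018 count from the figure quoted in the text. It also confirms that the two dates fall on different days of the week, which is one more reason to compare them with caution.

```r
rides_2017 <- 15954   # n used in the R sessions above
rides_2018 <- 31350   # count quoted for February 9, 2018

growth_ratio <- rides_2018 / rides_2017
pct_change   <- 100 * (growth_ratio - 1)

# Day of week as a number (0 = Sunday, ..., 6 = Saturday), independent of locale.
wday_2017 <- as.POSIXlt(as.Date("2017-02-09"))$wday
wday_2018 <- as.POSIXlt(as.Date("2018-02-09"))$wday

print(round(c(ratio = growth_ratio, pct_change = pct_change), 1))
print(c(wday_2017 = wday_2017, wday_2018 = wday_2018))
```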