Interesting Findings

Most pickups and drop-offs occur in the evening, while the fewest occur in the morning.
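As a rough illustration of how such a time-of-day comparison could be computed, the sketch below buckets pickup timestamps into coarse periods and counts them. The hour boundaries, the function names, and the sample timestamps are assumptions made for this example; they are not taken from the dataset or from the original analysis.

```python
from collections import Counter
from datetime import datetime

def time_of_day(ts):
    """Map a timestamp to a coarse period (hour boundaries are assumed)."""
    hour = ts.hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"

def count_by_period(timestamps):
    """Count how many timestamps fall into each period."""
    return Counter(time_of_day(ts) for ts in timestamps)

# Made-up pickup times, for illustration only.
sample_pickups = [
    datetime(2023, 1, 1, 8, 15),
    datetime(2023, 1, 1, 18, 5),
    datetime(2023, 1, 1, 19, 40),
    datetime(2023, 1, 1, 23, 30),
]
period_counts = count_by_period(sample_pickups)
```

The same counting can be applied to drop-off timestamps to compare the two distributions.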

The graphs below show how the passenger count varies by time and day of the week.

time of the day

Clustering Results

Comparing the NYU Land usage data with the taxi data reveals similarities in the pick-up and drop-off points.

These similarities are very evident in the clustering results.

Fig: The left image shows the NYU Land Usage overlaid with the pick-up points on the same map.

The results of K-Means with 5 clusters. This cluster count agrees with the elbow and silhouette methods calculated above.
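To make the elbow idea concrete, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) that reports the inertia, i.e. the within-cluster sum of squared distances, for several values of k. It is not the implementation used in this research; the function name `kmeans`, the seed, and the sample coordinates are assumptions for illustration.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; return (centers, inertia)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points

    def sq_dist(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[j]  # keep the old center if a cluster is empty
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers

    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia

# Made-up coordinates standing in for pick-up locations.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
    (9.0, 1.0), (9.2, 1.1), (8.8, 0.9),
]
inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
```

Plotting `inertias` against k and looking for the point where the curve flattens is the elbow heuristic; the silhouette method would need an additional per-point score that is not shown here.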

DBSCAN was also performed, since this research dealt mostly with spatial data.
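For readers unfamiliar with density-based clustering, the following is a minimal pure-Python sketch of the DBSCAN idea: points with at least `min_pts` neighbours within radius `eps` seed clusters, and isolated points are labelled as noise. The parameter values and sample points are assumptions for illustration, not the settings used in this research.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        xi, yi = points[i]
        # A point counts as its own neighbour, as in the usual definition.
        return [j for j, (x, y) in enumerate(points)
                if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # only core points expand
                queue.extend(j_neighbors)
    return labels

# Two dense made-up groups and one isolated point.
sample_points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(sample_points, eps=0.3, min_pts=3)
```

Unlike K-Means, this approach does not need the number of clusters in advance, which is one reason it is often tried on spatial data.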

Association Rule Mining

The above visualization shows the top rules as a graph/network. One can explore it interactively to get a better understanding of the rules.

When the trip distance is short and the fare amount is low, the number of passengers is also very low, i.e. these are mostly single-passenger trips.
When the fare amount is low and payment is mostly made by credit card, the trip distance is usually short.
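Rules like the two above are usually ranked by support, confidence, and lift. The sketch below computes these three measures for a single rule over a handful of made-up trips; the item labels (such as `distance=short`) and the function name `rule_metrics` are assumptions for illustration and do not come from the original analysis.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # trips containing A
    n_c = sum(1 for t in transactions if c <= t)         # trips containing C
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # trips containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Made-up, already discretized trips; each trip is a set of items.
trips = [
    {"distance=short", "fare=low", "passengers=1", "payment=card"},
    {"distance=short", "fare=low", "passengers=1", "payment=cash"},
    {"distance=short", "fare=low", "passengers=2", "payment=card"},
    {"distance=long", "fare=high", "passengers=3", "payment=card"},
]
support, confidence, lift = rule_metrics(
    trips, {"distance=short", "fare=low"}, {"passengers=1"}
)
```

A lift above 1 suggests the antecedent and consequent occur together more often than they would if they were independent.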

Decision Trees

This ML technique was used to predict the total amount of the ride by using various features from the dataset.

Tree_Record_small_3.pdf

Feature Importance

As discussed above in terms of the root node, the decision tree relies heavily on a few features from the dataset: base fare amount, tip amount, and extra charges.

Two other minor features that impact the tree are trip time and trip distance. Even though these features do not have as high an impact on the tree, the previous analysis makes clear that the higher the trip distance, the higher the trip time and base fare amount.

The decision tree shows that the most important factors in predicting the total fare amount are the base fare amount, tip amount, and extra charges, followed by trip time, trip distance, passenger count, and payment type.
Based on these results, it can be concluded that the selected features are useful in predicting the total fare amount for new trips and that the decision tree classifier can be an effective tool for this task. However, further analysis may be required to improve the accuracy of the model and to identify any potential limitations or biases in the data.
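To illustrate why one feature can dominate the root node, the sketch below searches for the single split that most reduces the squared error of the target. This is a regression-style split criterion chosen for the example and may differ from the criterion used by the tree in this research; the feature layout, function names, and sample values are all assumptions.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Return (feature_index, threshold, sse_reduction) of the best single split."""
    base = sse(targets)
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if not left or not right:
                continue  # a split must send rows to both sides
            gain = base - sse(left) - sse(right)
            if gain > best[2]:
                best = (f, thr, gain)
    return best

# Made-up rows: (base fare amount, passenger count); target is the total amount.
rows = [(5.0, 1), (6.0, 2), (20.0, 1), (22.0, 2)]
totals = [7.0, 8.0, 25.0, 27.0]
feature, threshold, gain = best_split(rows, totals)
```

In this toy data the split on base fare amount (feature 0) removes almost all of the error, while a split on passenger count removes very little, which mirrors the kind of imbalance a feature-importance ranking reports.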

Support Vector Machine and Linear Regression

Support Vector Machines (SVMs) are supervised machine learning models. This research used an SVM to predict trip times from parameters such as the movement of traffic, pickup and drop-off points, and total trip distance. Overall, the SVM achieved 60% accuracy, whereas multiple linear regression achieved 90%.
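The text does not say how accuracy was measured for the regression models. As one possible reading, the sketch below fits a least-squares line of trip time against trip distance and scores it with the coefficient of determination (R²). It uses a single feature rather than the multiple regression described above, and the function names and sample values are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up trips: distance in miles and duration in minutes.
distances = [1.0, 2.0, 3.0, 4.0]
minutes = [6.0, 9.0, 14.0, 17.0]
slope, intercept = fit_line(distances, minutes)
score = r_squared(distances, minutes, slope, intercept)
```

Scoring both models with the same metric on the same held-out trips would make a comparison such as 60% versus 90% easier to interpret.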

Passengers picked up from Newark Airport, JFK, and Flushing Meadows Corona Park are more likely to give higher tips. On the drop-off side, passengers dropped off at Great Kills and Oakwood on Staten Island tend to give higher tips, which reflects Staten Island’s status as one of the most well-off boroughs. However, since some zones have only a few trips, these statistics may be skewed by the small sample sizes.
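One simple way to guard against the small-sample problem mentioned above is to report a zone's average tip only when it has enough trips. The sketch below does this; the threshold, the function name `mean_tip_by_zone`, and the sample values are assumptions for illustration, not figures from the dataset.

```python
from collections import defaultdict

def mean_tip_by_zone(trips, min_trips=30):
    """Average tip per zone, skipping zones with fewer than min_trips trips."""
    tips = defaultdict(list)
    for zone, tip in trips:
        tips[zone].append(tip)
    return {
        zone: sum(values) / len(values)
        for zone, values in tips.items()
        if len(values) >= min_trips
    }

# Made-up (zone, tip amount) pairs; a low threshold keeps the example small.
sample_trips = [("JFK", 10.0), ("JFK", 12.0), ("Oakwood", 15.0)]
zone_means = mean_tip_by_zone(sample_trips, min_trips=2)
```

Here the zone with a single trip is dropped, so its unusually high tip does not appear as a zone-level finding.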

Overall, the analysis of NYC taxi data using SVM for predicting trip times and tip amount can provide valuable insights into factors that affect trip durations and can inform decision-making for transportation planning, resource allocation, and other related areas.
