Results

Decision Trees on Sample Data

Because the January data contains around 1 million records, the trees in this subsection are built on a sample of the data, for visualization purposes only.
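The sampling step can be sketched as follows. This is a minimal illustration, not the procedure actually used for the report: the record layout, the sample size, and the helper name `sample_records` are all assumptions made for the example.

```python
import random


def sample_records(records, k, seed=42):
    """Draw k records uniformly at random, without replacement.

    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return rng.sample(records, k)


# Hypothetical stand-in for the January trips: (trip_distance, total_fare) pairs.
all_records = [(i % 20 + 0.5, 2.5 + 2.5 * (i % 20 + 0.5)) for i in range(100_000)]

# A small sample is enough to draw a readable tree.
small = sample_records(all_records, k=1_000)
print(len(small), "records sampled out of", len(all_records))
```

A tree drawn from such a sample is easier to read, but its splits may differ from those of the tree trained on the full data.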

1.

Tree_Record_small_3.pdf

2.

Tree_Record_small_2.pdf

Decision Trees on the Actual Data

1.

Tree_Record_whole_data.pdf

One can download the above output here for better readability.

From the above results, one can infer that

The root (top) node of the tree is the split that best separates the data, so it marks the most important feature for predicting the total fare amount. In these trees the root is either trip distance or base fare amount. The correlation analysis done in Naive Bayes Analysis showed that trip distance is directly correlated with the base fare amount, so the two features carry much the same information. It is therefore fair to say that the total fare amount depends mainly on whichever of these two features appears at the root of the Decision Tree.
The leaf nodes of the tree represent the predicted fare amount based on the combination of features in that branch.
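To make the roles of the root node and the leaf nodes concrete, here is a small sketch using scikit-learn. It rests on several assumptions: the report does not say which library was used, the data below is synthetic, the column names are hypothetical, and the fare is binned into three bands so that a classifier applies.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic trips (assumption: not the real January data).
rng = np.random.default_rng(0)
n = 2000
trip_distance = rng.uniform(0.5, 20.0, n)
trip_time = 3.0 * trip_distance + rng.normal(0.0, 2.0, n)
passenger_count = rng.integers(1, 5, n).astype(float)

# Fare band 0/1/2 (low/medium/high); in this toy data it depends only on distance.
fare_band = np.digitize(trip_distance, [5.0, 12.0])

feature_names = ["trip_distance", "trip_time", "passenger_count"]
X = np.column_stack([trip_distance, trip_time, passenger_count])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, fare_band)

# Node 0 is the root; tree_.feature holds the feature index used at each split.
root_feature = feature_names[tree.tree_.feature[0]]
n_leaves = tree.get_n_leaves()

print("root split on:", root_feature)
print("number of leaves:", n_leaves)
# Text rendering of the tree: each "class:" line is a leaf and its prediction.
print(export_text(tree, feature_names=feature_names))
```

In the printed tree, the first line shows the root split, and every path from it down to a `class:` line corresponds to one combination of feature conditions and its predicted fare band.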

Feature Importance

As the discussion of the root node suggests, the decision tree relies heavily on a few features from the dataset: base fare amount, tip amount, and extra charges.

Two other, minor features that affect the tree are trip time and trip distance. Although these features have less impact on the tree, the previous analysis makes it clear that the higher the trip distance, the higher the trip time and base fare amount, which suggests that much of their effect is already captured through the base fare amount.
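One common way to obtain such a ranking is the impurity-based importance that scikit-learn exposes as `feature_importances_`. The sketch below assumes that approach; the report does not state how its importances were computed, and the data, column names, and fare bands here are synthetic and hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic trips (assumption: not the real January data).
rng = np.random.default_rng(0)
n = 5000
trip_distance = rng.uniform(0.5, 20.0, n)
trip_time = 3.0 * trip_distance + rng.normal(0.0, 2.0, n)
base_fare = 2.5 + 2.5 * trip_distance + rng.normal(0.0, 1.0, n)
tip_amount = rng.uniform(0.0, 10.0, n)
extra = rng.choice([0.0, 0.5, 1.0], n)

# Target: total fare binned into three bands (low / medium / high).
total_fare = base_fare + tip_amount + extra
fare_band = np.digitize(total_fare, [20.0, 40.0])

feature_names = ["base_fare", "tip_amount", "extra", "trip_distance", "trip_time"]
X = np.column_stack([base_fare, tip_amount, extra, trip_distance, trip_time])

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, fare_band)

# Importances are normalized to sum to 1; higher means the feature
# contributed more impurity reduction across the tree's splits.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:15s} {score:.3f}")
```

When two features are strongly correlated, as trip distance and base fare amount are, the tree tends to split on one of them and the other receives a low score, so a low importance does not by itself mean a feature is unrelated to the fare.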

Decision Tree Performance

The performance of the Decision Tree can be measured by its accuracy and its error rate (the error rate is 1 minus the accuracy). One way to visualize the accuracy is through a confusion matrix as shown below.

The accuracy of the model is:

Train accuracy: 86.20%
Test accuracy: 81.79%
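Accuracy and error rate figures like these can be read directly off a confusion matrix. The sketch below shows the calculation on a handful of made-up labels; the three fare bands and the predictions are illustrative assumptions, not output from the model above.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Share of predictions on the diagonal (true class == predicted class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up fare bands for eight trips.
labels = ["low", "medium", "high"]
y_true = ["low", "low", "medium", "medium", "high", "high", "high", "low"]
y_pred = ["low", "medium", "medium", "medium", "high", "medium", "high", "low"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)

for label, row in zip(labels, cm):
    print(f"{label:7s} {row}")
print(f"accuracy = {acc:.2%}, error rate = {1 - acc:.2%}")
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.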

The training accuracy of 86.20% means the model predicts the fare amount correctly for roughly 86 of every 100 trips in the training dataset, which is a reasonable fit.
The test accuracy of 81.79% is about 4.4 percentage points below the training accuracy. A gap of this size is a good sign, as it indicates that the model generalizes reasonably well to new, unseen data.
However, accuracy alone does not show how the errors are distributed, so additional evaluation measures such as precision, recall, and F1 score are needed for a better understanding of the model's performance. If these are also high, the model is performing well in predicting the fare amount.
It is important to evaluate a model on both the training and test data to ensure that it generalizes to new data and provides accurate predictions in real-world scenarios.
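The additional measures mentioned above can be computed per class from the true and predicted labels. The following is a minimal sketch on made-up fare bands; the labels, the predictions, and the helper name `per_class_scores` are assumptions for illustration only.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up fare bands for eight trips.
y_true = ["low", "low", "medium", "medium", "high", "high", "high", "low"]
y_pred = ["low", "medium", "medium", "medium", "high", "medium", "high", "low"]

for label in ["low", "medium", "high"]:
    precision, recall, f1 = per_class_scores(y_true, y_pred, label)
    print(f"{label:7s} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the trips predicted in this band, how many really were", while recall answers "of the trips really in this band, how many were found"; F1 is their harmonic mean.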
