Finding Optimal Positions for Ambulances

Dataset: accidents.csv

We carried out the following preprocessing and data exploration:

First, we combined the road-location dataset with the crash-site dataset.
To find the optimal number of clusters, we used the elbow method and the silhouette method.
We clustered the data on latitude, longitude, distance from the city centre and the elevation of the sun.
Then, using k-means clustering, we found the coordinates of 5 points for the optimal positioning of the ambulances.
We also applied PAM clustering and calculated the silhouette coefficients.
Finally, we generated a synthetic dataset of non-accident positions and used decision trees and a naive Bayes classifier to predict the areas where accidents can occur, based on geographical and environmental conditions.
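The write-up does not say how the non-accident positions were generated. One possible approach, shown here only as a sketch, is rejection sampling: draw random points inside the bounding box of the crash sites and keep those that are at least a minimum gap away from every recorded crash. The function name, the `min_gap` parameter, the toy coordinates and the use of plain degree differences as distance are all assumptions for illustration.

```python
import math
import random

def sample_non_crash_points(crashes, count, min_gap, seed=0):
    """Draw `count` (lat, lon) points inside the crash bounding box that lie
    at least `min_gap` (in degrees, planar approximation) from every crash."""
    rng = random.Random(seed)
    lats = [lat for lat, lon in crashes]
    lons = [lon for lat, lon in crashes]
    negatives = []
    while len(negatives) < count:
        candidate = (rng.uniform(min(lats), max(lats)),
                     rng.uniform(min(lons), max(lons)))
        # Reject candidates that fall too close to a recorded crash.
        if all(math.dist(candidate, crash) >= min_gap for crash in crashes):
            negatives.append(candidate)
    return negatives

# Hypothetical crash coordinates, not taken from accidents.csv.
crashes = [(0.00, 0.00), (0.00, 0.10), (0.10, 0.00), (0.10, 0.10), (0.05, 0.05)]
negatives = sample_non_crash_points(crashes, count=20, min_gap=0.01)
```

The sampled points can then be labelled "no crash" and stacked with the real crash rows to train a classifier.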

Using information gain to find attribute importance

We applied information gain and found that the attributes day, distance_from_the_centre, elevation, latitude, longitude, precipitation, relative_humidity, specific_humidity, temperature, wind_u and wind_v are significant.
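As a rough illustration of what information gain measures, the sketch below computes it for a categorical attribute: the entropy of the class labels minus the weighted entropy that remains after grouping the rows by attribute value. The toy attributes `near_centre` and `weekday` are made up for the example, and continuous attributes such as temperature would need to be binned first.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy within each attribute value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (hypothetical): 1 = crash, 0 = no crash.
crash = [1, 1, 1, 1, 0, 0, 0, 0]
near_centre = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]  # separates the classes
weekday = ["mon", "tue", "mon", "tue", "mon", "tue", "mon", "tue"]  # carries no information

gain_near_centre = information_gain(near_centre, crash)
gain_weekday = information_gain(weekday, crash)
```

An attribute that splits the classes cleanly gets a gain close to the full label entropy, while an uninformative attribute gets a gain close to zero.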

Elbow plot - The elbow method looks at the percentage of variance explained as a function of the number of clusters.

For each k, calculate the total within-cluster sum of squares (WSS) and plot it against k.
The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
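The WSS-per-k computation can be sketched with a small k-means written in plain Python. This is an illustration, not the code used for the plots: the farthest-first initialisation is an assumption chosen to keep the example deterministic, and the toy points stand in for the real features.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(squared_distance(p, c)
                                                       for c in centroids)))
    return centroids

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

def total_wss(points, centroids, labels):
    """Total within-cluster sum of squared distances to the assigned centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

# Toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
wss_by_k = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    wss_by_k[k] = total_wss(points, centroids, labels)
```

Plotting `wss_by_k` against k gives the elbow curve. On this toy data the curve drops sharply up to k = 3 and flattens afterwards. The centroids returned for the chosen k are the candidate ambulance positions.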

elbow plot for optimal clusters

Average silhouette method - The average silhouette approach measures the quality of a clustering.

It determines how well each object lies within its cluster.
A high average silhouette width indicates a good clustering.
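For reference, the silhouette width of a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes it with Euclidean distance on toy points. The function name and data are illustrative, and it assumes at least two clusters.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width s = (b - a) / max(a, b) for every point."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    widths = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        widths.append((b - a) / max(a, b))
    return widths

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_widths(points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette_widths(points, [0, 1, 0, 1])   # distant points grouped together
average_good = sum(good) / len(good)
average_bad = sum(bad) / len(bad)
```

The sensible grouping yields an average width close to 1, while the poor grouping yields negative widths, which is the signature of points sitting in the wrong cluster.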

silhouette plot for optimal clusters

Inferences

From both plots above, we can see that the optimal number of clusters for the given data is three.

Model 1 clusters for k = 3, 4, 5 and 7

Of the values tried, k = 7 has the lowest within-cluster sum of squares, hence it gives the best result.

Analysis of the frequency of crashes every 3 hours for each day of the week

Analysis of the frequency of crashes each month for each day of the week
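Counts like the ones behind these plots can be produced by bucketing each crash timestamp by weekday and 3-hour slot, and by weekday and month. The sketch below assumes timestamps are strings in `YYYY-MM-DD HH:MM:SS` format; the actual column name and format in accidents.csv may differ.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def crash_frequencies(timestamps):
    """Return two Counters: (weekday, 3-hour slot start) and (weekday, month)."""
    by_slot, by_month = Counter(), Counter()
    for ts in timestamps:
        moment = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        day = WEEKDAYS[moment.weekday()]
        slot_start = (moment.hour // 3) * 3  # 0, 3, 6, ..., 21
        by_slot[(day, slot_start)] += 1
        by_month[(day, moment.month)] += 1
    return by_slot, by_month

# Hypothetical timestamps for illustration only.
timestamps = [
    "2018-01-01 07:15:00",  # Monday, slot 06:00-09:00
    "2018-01-01 08:40:00",  # Monday, slot 06:00-09:00
    "2018-01-02 23:05:00",  # Tuesday, slot 21:00-24:00
    "2018-01-06 00:30:00",  # Saturday, slot 00:00-03:00
]
by_slot, by_month = crash_frequencies(timestamps)
```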

Clustering model 2

Model 2 clusters for k = 3, 4, 5 and 7, considering geographical locations only

Optimal coordinates for ambulances

The optimal coordinates for positioning the ambulances, given as latitude and longitude

Coordinates of ambulances

PAM Clustering method - PAM stands for “Partitioning Around Medoids”. The algorithm searches for a set of k representative objects, called medoids, that are centrally located in their clusters. Unlike a k-means centroid, a medoid is always one of the actual data points.
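The idea can be sketched in a few lines of plain Python. This is a simplified version of PAM's two phases with Euclidean distance, not the implementation used for the results here: BUILD picks the initial medoids greedily, and SWAP accepts any medoid/non-medoid exchange that lowers the total cost, rather than searching for the single best swap per pass.

```python
import math

def total_cost(points, medoids):
    """Sum of each point's Euclidean distance to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Return (medoid indices, cluster label per point)."""
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(len(points)) if i not in medoids]
        medoids.append(min(candidates, key=lambda i: total_cost(points, medoids + [i])))
    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        best = total_cost(points, medoids)
        for slot in range(k):
            for i in range(len(points)):
                if i in medoids:
                    continue
                trial = medoids[:slot] + [i] + medoids[slot + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
              for p in points]
    return medoids, labels

# Toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
medoids, labels = pam(points, 3)
medoid_points = sorted(points[m] for m in medoids)
```

On this toy data each group contributes one medoid, and it is the corner point with the smallest total distance to the other two members.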

PAM cluster 1, considering all the important features (geographical and environmental), for k = 7
Distance metric used: Euclidean distance

PAM cluster 2, considering only geographical features, for k = 7
Distance metric used: Euclidean distance

Silhouette coefficient: 0.34

Silhouette coefficient: 0.37

Inferences

From k-means and PAM clustering we can see that, although some metrics are better for model 2, model 1 captures more information.
In the silhouette plot for model 2 there are many negative values, which indicate that those points are in the wrong cluster. Many points are therefore misassigned in model 2, while there are very few such points in model 1.
Also, the cluster sizes vary considerably in model 2, while they are more consistent in model 1.

Models for predicting crash probability for a particular location based on geographical and environmental factors

Decision Tree Model 1

Decision tree using all important features.
We can see that distance from the centre is the single deciding factor here.

Decision Tree Model 2

Decision tree considering only environmental factors.
We can see that temperature, relative humidity and the u component of wind are the deciding factors here.

Decision Tree Model 3

Decision tree with distance from the centre removed.
We can see that elevation, latitude and longitude are the deciding factors here.
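To show how a tree ends up with one feature as its deciding factor, the sketch below searches every feature and threshold for the single split with the lowest weighted Gini impurity, which is what the root node of a classification tree does. Gini impurity is an assumption here, since the splitting criterion used for the models above is not stated, and the toy rows are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    p = sum(labels) / total  # share of positive (crash) labels
    return 2 * p * (1 - p)

def best_split(rows, labels, feature_names):
    """Return (weighted Gini, feature name, threshold) of the best single split."""
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, name, threshold)
    return best

# Toy rows (hypothetical): (distance_from_centre, temperature); 1 = crash, 0 = no crash.
feature_names = ["distance_from_centre", "temperature"]
rows = [(1.0, 20), (2.0, 30), (3.0, 22), (8.0, 29), (9.0, 21), (10.0, 31)]
labels = [1, 1, 1, 0, 0, 0]
score, feature, threshold = best_split(rows, labels, feature_names)
```

Here the crashes are all close to the centre, so a single threshold on distance separates the classes perfectly and no temperature split can compete. Removing that column forces the search onto the remaining features.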

Decision Tree Model Metrics

Naive Bayes Classifier - Model Metrics

Naive Bayes model 1

Naive Bayes Model 2

Naive Bayes Model 3
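For readers unfamiliar with the classifier, here is a minimal Gaussian naive Bayes in plain Python. It assumes numeric features modelled with one normal distribution per class and feature, which may differ from the variant used for the three models above. The feature values are invented.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Per class: prior probability plus (mean, variance) for every feature."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant guards against zero variance.
            variance = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, variance))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for x, (mean, variance) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)
        return score
    return max(model, key=log_posterior)

# Toy rows (hypothetical): (distance_from_centre, temperature); 1 = crash, 0 = no crash.
rows = [(1.0, 20.0), (2.0, 22.0), (1.5, 21.0), (9.0, 20.5), (10.0, 21.5), (9.5, 22.5)]
labels = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_nb(rows, labels)
near_prediction = predict(model, (1.2, 21.0))
far_prediction = predict(model, (9.8, 21.0))
```

The "naive" part is the assumption that features are independent given the class, which is why the per-feature log likelihoods are simply added.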

Conclusion

The optimal ambulance locations are best found by the clustering that uses features describing the geographic conditions of a location. The environmental variables do not play a major role.
Using the decision tree and naive Bayes models, we examined the significant factors that contribute to road accidents and built a model to predict whether a crash can occur at a given location.

Future Scope

XGBoost can be used, and the features given to the k-means clustering can be better optimized.
Outliers can be removed from the dataset for better clustering results.