Feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Why Feature Selection is Important:
Removing a redundant variable can improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
Too many variables can lead to overfitting, which means the model is not able to generalize the patterns it learns to new data.
Too many variables lead to slow computation, which in turn requires more memory and hardware.
Boruta Algorithm
We used the Boruta package in R to perform the feature selection.
Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure. By default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing the original attributes' importance with the importance achievable at random, estimated using their permuted copies (shadow attributes), and progressively eliminating irrelevant features to stabilize that test.
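The comparison against permuted copies can be illustrated with a minimal sketch. This is not the Boruta package and not the code used for this analysis: it is written in Python with the standard library only, it swaps Random Forest importance for absolute Pearson correlation with the target, and it skips the progressive elimination step and the multiple-testing correction. The names `abs_correlation`, `binom_tail_ge` and `shadow_feature_selection`, the `runs` and `alpha` defaults, and the synthetic `signal` and `noise` columns are all assumptions made for illustration.

```python
import math
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation: a stand-in importance score (not Random Forest)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)


def binom_tail_ge(hits, runs):
    """P(X >= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(math.comb(runs, k) for k in range(hits, runs + 1)) / 2 ** runs


def shadow_feature_selection(features, target, runs=50, alpha=0.01, seed=0):
    """Label each feature as confirmed, tentative or rejected (simplified)."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(runs):
        # Shadow attributes: permuted copies of every column, so any
        # relationship with the target is destroyed.
        shadow_max = 0.0
        for column in features.values():
            shadow = column[:]
            rng.shuffle(shadow)
            shadow_max = max(shadow_max, abs_correlation(shadow, target))
        # A real feature scores a hit when it beats the best shadow.
        for name, column in features.items():
            if abs_correlation(column, target) > shadow_max:
                hits[name] += 1
    decisions = {}
    for name, h in hits.items():
        if binom_tail_ge(h, runs) < alpha:
            decisions[name] = "confirmed"
        elif binom_tail_ge(runs - h, runs) < alpha:  # equals P(X <= h) by symmetry
            decisions[name] = "rejected"
        else:
            decisions[name] = "tentative"
    return decisions


# Synthetic demo data: "signal" drives the target, "noise" does not.
rng = random.Random(42)
signal = [rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]
target = [2 * s + rng.gauss(0, 0.5) for s in signal]

decisions = shadow_feature_selection({"signal": signal, "noise": noise}, target)
print(decisions)
```

In this sketch the importance of a real feature is the same in every run, because correlation is deterministic. With Random Forest the importances of both real and shadow attributes vary from run to run, which is why the comparison is repeated and tested statistically.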
In the Boruta importance plot, blue boxplots correspond to the minimal, average and maximum Z scores of a shadow attribute. Red, yellow and green boxplots represent the Z scores of rejected, tentative and confirmed attributes, respectively.
After feature selection, only 17 important variables remained, which are listed below:
Start_Lat: Shows the latitude of the start point, in GPS coordinates.
Start_Lng: Shows the longitude of the start point, in GPS coordinates.
Distance(mi): The length of the road extent affected by the accident.
Temperature(F): Shows the temperature (in Fahrenheit).
Humidity(%): Shows the humidity (in percentage).
Pressure(in): Shows the air pressure (in inches).
Visibility(mi): Shows visibility (in miles).
Wind_Direction: Shows wind direction.
Wind_Speed(mph): Shows wind speed (in miles per hour).
Weather_Condition: Shows the weather condition (rain, snow, thunderstorm, fog, etc.)
Crossing: A POI annotation which indicates the presence of a crossing in a nearby location.
Junction: A POI annotation which indicates the presence of a junction in a nearby location.
Station: A POI annotation which indicates the presence of a station in a nearby location.
Traffic_Signal: A POI annotation which indicates the presence of a traffic signal in a nearby location.
Month: Shows the month when the car accident happened.
Weekday: Shows the weekday when the car accident happened.
Hour: Shows the hour when the car accident happened.
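The list above does not say how Month, Weekday and Hour were produced. One plausible way, sketched here in Python, is to derive them from an accident timestamp. The function name `time_features`, the timestamp format and the example value are assumptions made for illustration, not details taken from the dataset or from this analysis.

```python
from datetime import datetime

# Explicit names avoid locale-dependent output from strftime("%A").
WEEKDAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")


def time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into Month, Weekday and Hour (format is assumed)."""
    moment = datetime.strptime(timestamp, fmt)
    return {
        "Month": moment.month,                        # 1 to 12
        "Weekday": WEEKDAY_NAMES[moment.weekday()],   # weekday() gives Monday = 0
        "Hour": moment.hour,                          # 0 to 23
    }


# Made-up timestamp, used only to show the shape of the result.
example = time_features("2019-03-15 08:30:00")
print(example)
```

Whether Weekday is stored as a name or as a number is a modelling choice. A name is easier to read, while a number or a factor level is what most classifiers expect.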