We first downloaded the data from FARS website as .csv files, then we imported them into MySQL database as raw data. In particular, we chose the fields such as state ids, types of vehicles, weather, dates and times, geo-coordinates, light conditions, the number of fatalities, speed limit, alcohol volume, speed of the vehicles and ages and genders of the persons.
Then we mapped the code of each field into English clusters using scripts or manually. For example, for weather, we mapped for every single year the codes into Rain, Hail, Fog, etc. We did similar to alcohol, light condition, gender, speed and other fields.
[An example of scripts we used to map code into strings]
Then we deleted all rows that don't have latitude and longitude information. And we uniquely identified crashed building up a key called crash id , containing the concatenation of original crash id, year, and state id. Then we aggregated all the crashes have same crash id into one row.
[The query for aggregating rows]
Then we integrated everything into a single table to avoid join operations.
[The single table we finally got]
To increase the performance of our application, we built two full text indexes over sex and vehicle type and other traditional b-trees.
We also added some other useful information like state zoom level and state centers and state population.