The aim of this project is to investigate the efficacy of machine learning (ML) classification models on traffic crash data. The dataset, sourced from Kaggle, describes a large number of crashes across many attributes, such as road condition, lighting condition, crash type, and trafficway type [1]. Most of the features are categorical, which poses several challenges for applying ML methods: the need for encoding, a narrower set of compatible models, and a narrower set of compatible performance metrics. Nonetheless, applying multiple ML methods to this dataset builds understanding of how to work effectively with qualitative and categorical data, which is valuable because such data are often more intuitive for humans to interpret than numerical data. After initial pre-processing, tree-based models and support vector machine (SVM) models were compared using metrics such as accuracy and computation time. Feature engineering was then conducted to reduce the dimensionality of the data, and the same classifiers were run again. This project thus explores the efficacy of tree-based models and SVMs without feature engineering, and then examines whether different types of feature engineering improve the performance of these models.
The feature engineering performed for this project includes:
binning the quantitative times of day into qualitative categories to improve human interpretability and reduce noise;
performing one-hot encoding to encode nominal features;
performing label encoding to encode ordinal features;
manually combining apparently correlated features to try to reduce dimensionality without causing over- or underfitting; and
performing principal component analysis (PCA) to find the optimal number of components for reducing the dimensionality of the data.
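The binning and encoding steps above can be sketched as follows. This is a minimal illustration using pandas; the column names (`crash_hour`, `weather_condition`, `road_surface`) and category orderings are assumptions for demonstration, not the exact schema of the Kaggle dataset.

```python
import pandas as pd

# Hypothetical sample rows; column names are illustrative assumptions,
# not the actual Kaggle column names.
df = pd.DataFrame({
    "crash_hour": [2, 7, 13, 18, 23],
    "weather_condition": ["CLEAR", "RAIN", "CLEAR", "SNOW", "CLEAR"],
    "road_surface": ["DRY", "WET", "DRY", "ICE", "DRY"],
})

# Bin the quantitative hour of day into qualitative labels.
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]
df["time_of_day"] = pd.cut(df["crash_hour"], bins=bins, labels=labels,
                           right=False, include_lowest=True)

# One-hot encode nominal features (no inherent order between categories).
df = pd.get_dummies(df, columns=["weather_condition", "time_of_day"])

# Label-encode an ordinal feature (an assumed severity order: DRY < WET < ICE).
surface_order = {"DRY": 0, "WET": 1, "ICE": 2}
df["road_surface"] = df["road_surface"].map(surface_order)
```

One-hot encoding avoids implying a spurious order between nominal categories, while label encoding preserves a genuine order for ordinal features, which is why the two are applied to different feature types.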
Random Forest, SVM, and Decision Tree classifiers were used to investigate the efficacy of these models before and after feature engineering.
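The comparison loop can be sketched as follows. Synthetic data stands in for the encoded crash dataset, and the classifiers use scikit-learn defaults rather than the project's actual hyperparameters.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded crash data.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Record accuracy and training time for each classifier.
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, model.predict(X_test))
    results[name] = (acc, elapsed)
    print(f"{name}: accuracy={acc:.3f}, fit time={elapsed:.2f}s")
```

Running the same loop on the data before and after each feature-engineering step yields the accuracy and computation-time comparisons discussed below.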
The table on the right compares the performance of Random Forest, SVM, and Decision Tree classifiers before and after each method of feature engineering. The results highlight that tree-based models naturally handle categorical data well without the need for feature scaling or dimensionality reduction, whereas SVM benefits from careful preprocessing. While manual feature combination significantly reduced accuracy for both Random Forest and Decision Tree, SVM maintained its accuracy and gained a 15.5% improvement in computational efficiency after applying PCA. This finding reinforces that feature engineering should be chosen based on model type: PCA is effective for SVM but detrimental to tree-based classifiers. Additionally, the comparison between classifiers demonstrates that simpler models like Decision Tree can achieve high accuracy with minimal computation time, making them strong candidates for real-world deployment when efficiency is a priority.
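The PCA-then-SVM combination can be sketched as a scikit-learn pipeline. This is an illustrative sketch on synthetic data, not the project's exact configuration; scaling before PCA is a standard precaution so that no single feature dominates the variance.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded crash data.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           random_state=0)

# Scale, keep enough principal components to explain 95% of the variance,
# then fit the SVM on the reduced representation.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Because the SVM then trains on fewer dimensions, fit time typically drops while accuracy is largely preserved, which is consistent with the efficiency gain reported above.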
This project was completed with one other group member.
[1] O. Ördekçi, “Traffic Accidents,” Kaggle, Jan. 19, 2025. [Online]. Available: https://www.kaggle.com/datasets/oktayrdeki/traffic-accidents/data [Accessed: Mar. 9, 2025].
Left: Binning scheme for categorizing times into qualitative labels.
Right: Final results of each model.
Left: Results of principal component analysis (PCA).