This project analyzes two classification algorithms using the sort type category in the sort tests data. Through the use of confusion matrices, it evaluates each model by comparing its accuracy, precision, sensitivity, and specificity when classifying the test times data by sort type, i.e., bubble versus selection sort.
The data contains two qualitative columns, one of which was converted into a dummy variable beforehand. Since bubble and insertion sort performed similarly across differently ordered data in the earlier analysis, the two were grouped together: group zero represents bubble and insertion sort, and group one represents selection sort. The dependent variable was stored in a Pandas series and the independent variables in a data frame, as required by Scikit-learn, and the data was then split into 70% training and 30% testing sets. The source code can be found on GitHub.
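A minimal sketch of this preparation, assuming hypothetical column and file names (sort_tests.csv, sort_type, array_type) in place of the real schema:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sort_tests.csv")  # hypothetical file name

    # One-hot encode the qualitative array type column (the dummy variable).
    df = pd.get_dummies(df, columns=["array_type"], drop_first=True)

    # Group zero = bubble and insertion sort, group one = selection sort.
    df["sort_group"] = (df["sort_type"] == "selection").astype(int)

    # Scikit-learn expects the features in a data frame and the target in a series.
    X = df.drop(columns=["sort_type", "sort_group"])
    y = df["sort_group"]

    # 70% training data, 30% testing data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)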
The confusion matrices are evaluated by comparing the accuracy, precision, sensitivity, and specificity for each model.
Accuracy refers to the frequency of correct predictions: the ratio of correctly classified cases to all cases.
Precision refers to the ratio of true positives to all predicted positives, i.e., how often a positive prediction turns out to be correct.
Sensitivity refers to the ratio of true positives to all actual positive events.
Specificity refers to the ratio of true negatives to all actual negative events.
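Expressed as small helper functions over the raw confusion matrix counts (true positives, true negatives, false positives, false negatives), these standard definitions look like this:

    # Standard metric definitions from raw confusion matrix counts.
    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)

    def precision(tp, fp):
        return tp / (tp + fp)

    def sensitivity(tp, fn):
        return tp / (tp + fn)

    def specificity(tn, fp):
        return tn / (tn + fp)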
For our data, this means the following:
If the observation and prediction are both zero, the model correctly classified bubble sort as bubble sort (a true negative).
If the observation is zero and the prediction is one, the model incorrectly classified bubble sort as selection sort (a false positive).
If the observation is one and the prediction is zero, the model incorrectly classified selection sort as bubble sort (a false negative).
If the observation and prediction are both one, the model correctly classified selection sort as selection sort (a true positive).
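A minimal sketch of recovering these four counts with Scikit-learn, assuming y_test and the model's predictions hold the zero/one labels described above:

    from sklearn.metrics import confusion_matrix

    # Rows are observations, columns are predictions; with labels=[0, 1],
    # ravel() unpacks the counts in the order TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(y_test, predictions, labels=[0, 1]).ravel()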
Classification Models
Through a parameter search, the decision tree classifier was fitted to the training data for maximum depths ranging from one to nine. The accuracy, precision, sensitivity, and specificity were computed for each of these models to find the best maximum depth. The results can be observed below; a sketch of the search loop follows them.
    Max depth    Accuracy    Precision    Sensitivity    Specificity
    1            75.00%      71.43%       100.00%        33.33%
    2            87.50%      83.33%       100.00%        66.67%
    3            87.50%      83.33%       100.00%        66.67%
    4            100.00%     100.00%      100.00%        100.00%
Since the data set is small and simple, these results may not be the most reliable, but a maximum depth of four yielded the best results.
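A minimal sketch of the search loop, reusing the split and the metric helpers above (the random_state is an assumption):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    for depth in range(1, 10):  # maximum depths one through nine
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        tn, fp, fn, tp = confusion_matrix(
            y_test, model.predict(X_test), labels=[0, 1]).ravel()
        print(f"max depth {depth}: "
              f"accuracy={accuracy(tp, tn, fp, fn):.2%}, "
              f"precision={precision(tp, fp):.2%}, "
              f"sensitivity={sensitivity(tp, fn):.2%}, "
              f"specificity={specificity(tn, fp):.2%}")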
In this visualization of the decision tree, we can see an over-reliance on the average time variable: the test times are given precedence over all the other variables.
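A minimal sketch of how such a visualization can be produced with Scikit-learn, assuming the fitted model and feature columns from above:

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree

    fig, ax = plt.subplots(figsize=(12, 6))
    plot_tree(model,
              feature_names=list(X_train.columns),
              class_names=["bubble/insertion", "selection"],
              filled=True, ax=ax)
    plt.show()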
Results
The algorithms mostly depended on the time to determine the classifications. Although this makes sense to an extent, we know from the analysis of the data that the order the array was in before sorting, i.e., the array type, has a meaningful impact on the test times. However, the algorithms I tested in this project and the other one suggested that the array type has little to no impact on the predictions. This is most notably demonstrated by the linear and logistic regression models, which assigned the array type dummy variables high P-values. Granted that the P-value alone is not a reliable metric for determining the statistical significance of a variable, the results of the other models agreed with it. Furthermore, in the decision tree, I was able to visualize the model's over-reliance on the average time variable even after dropping all the other test times.

These findings contradict our understanding of the data, which leads me to two beliefs and a question: (1) this data is unsuitable for making predictions due to its inherent nature, or (2) there is something wrong with my construction of the data set. If the latter is true, how can I change or organize the data so that the algorithms can recognize the array type variable as statistically significant?

My suggestion is to perform a hypothesis test where the null hypothesis states that the array type variable is not statistically significant, while the alternative hypothesis states that it is. If the null hypothesis is not rejected, there is nothing wrong with the data and thus (1) would hold. On the other hand, if the alternative hypothesis holds, the algorithms failed to recognize the array type as statistically significant because of some other problem with the data, or possibly with the code.
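One possible instantiation of the suggested test, sketched as a one-way ANOVA of the average times grouped by array type (the column names avg_time and array_type are assumptions, and this is only one of several ways the significance could be tested):

    import pandas as pd
    from scipy import stats

    raw = pd.read_csv("sort_tests.csv")  # raw data, before the dummy encoding

    # H0: array type has no effect on the average time (not significant).
    # H1: array type does affect the average time.
    groups = [g["avg_time"].values for _, g in raw.groupby("array_type")]
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

A small p-value (e.g., below 0.05) would reject the null hypothesis and point toward (2), i.e., the models rather than the data are failing to pick up the array type signal.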