This project aims to recommend a learning algorithm for predicting the average time in the sort-tests dataset. Using evaluation techniques such as R-squared, plots of the residuals, and the mean squared error, it compares and evaluates the predictions of k-nearest neighbors and linear regression models for the average-time variable and recommends one of the two.
The data contains two qualitative columns, which were converted into dummy variables beforehand. The dependent variable was stored in a pandas Series and the independent variables in a DataFrame, as required by scikit-learn. The data was then split into two sets: 80% training and 20% testing data. The source code can be found on GitHub.
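A minimal sketch of that preprocessing, on a stand-in DataFrame (the column names `sort_type`, `array_type`, `test_1`, and `avg_time` are assumptions; the real dataset's names may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data; column names are hypothetical.
df = pd.DataFrame({
    "sort_type": ["merge", "quick", "merge", "quick"] * 25,
    "array_type": ["random", "sorted", "reversed", "random"] * 25,
    "test_1": range(100),
    "avg_time": range(100),
})

# Convert the two qualitative columns into dummy variables
df = pd.get_dummies(df, columns=["sort_type", "array_type"])

# Series for the dependent variable, DataFrame for the independent variables
y = df["avg_time"]
X = df.drop(columns=["avg_time"])

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```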
The k-nearest neighbors regression models were evaluated by comparing the mean squared error (MSE) of the training data's predictions with that of the testing data's. The MSE is not meaningful on its own, but by comparing it with other MSEs it is possible to evaluate a model's predictions. For instance, if the MSE for the testing predictions turns out to be significantly higher than that of the training predictions, the model is over-fitting the data. Judgement must be used to determine whether the difference between two MSEs is significant, by accounting for the magnitude of the error and by trying out different parameters for the model.
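The MSE itself is just the mean of the squared prediction errors; a quick sketch on made-up values, confirming that the manual computation matches scikit-learn's helper:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 3.5, 6.0])

# MSE is the mean of the squared differences between actual and predicted
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```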
On the other hand, the linear regression models were evaluated through the R-squared value and plots of the residuals. R-squared is the proportion of variance in the dependent variable that is explained by the independent variables; the closer the proportion is to 1.0, the better the data is explained by the model. However, R-squared is not a holistic metric, so the histogram and scatter plot of the residuals are necessary to identify other faults with the model. The distribution of the residuals should ideally be normal, and the scatter plot should not contain any patterns, such as groupings, lines, parallels, or outliers. If the scatter plot contains patterns, then for any predictions that fall within those patterns, the linear model is either under- or over-estimating in that range.
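As a sketch of where those diagnostics come from (on synthetic stand-in data, since the real features are not reproduced here), the residuals are simply the differences between the actual and predicted values; these are the numbers fed into the histogram and the residuals-vs-predictions scatter plot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with a true linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# For a well-specified linear model (with an intercept), the residuals
# average out to zero and should show no visible structure when plotted.
print(round(residuals.mean(), 6))
```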
Models
The test times in the data have different orders of magnitude. To prevent the columns with the highest times from skewing the results, they were normalized (or scaled) to fit between zero and one.
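This kind of scaling can be done with scikit-learn's `MinMaxScaler`, which rescales each column independently to the [0, 1] range; a small sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns with very different orders of magnitude
times = np.array([[100.0, 2_000_000.0],
                  [250.0, 5_000_000.0],
                  [400.0, 9_000_000.0]])

# Each column is rescaled independently to fit between zero and one
scaled = MinMaxScaler().fit_transform(times)
print(scaled.min(), scaled.max())
```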
To find the best number of neighbors to use for the model, a sequence of parameters was tested using a for-loop. The loop computed the mean squared error for k neighbors ranging from one to nine for both the training and testing data predictions. The results can be observed below.
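A sketch of that loop, run here on synthetic stand-in data (the real features and target are not reproduced in this document):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the scaled features and the average-time target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.05, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

train_mse, test_mse = [], []
for k in range(1, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, knn.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, knn.predict(X_test)))

# With k = 1 each training point is its own nearest neighbor,
# so the model memorizes the training set and its training MSE is zero.
print(train_mse[0])
```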
The line plots represent the MSE of the predictions made using the training and testing data for different numbers of neighbors, k. As illustrated by the plots, the MSE grows as k increases. The k-nearest neighbors algorithm is therefore not a good fit for this data, because the MSE for the testing predictions is always significantly higher than that of the training predictions for any number of neighbors. Even for high values of k, the gap between the two errors never closes, which is clear evidence of over-fitting. Hence, a different learning algorithm is advisable.
Mean Squared Error
To more closely observe the differences between the mean squared errors, a table with percentage differences is shown below with the computations for the values of k shown in the graph above.
Neighbors | MSE train       | MSE test        | Difference      | Percent Diff
1         | 0.0             | 786,054,904.7   | 786,054,904.7   | 200.00%
2         | 206,467,731.3   | 1,710,259,945.4 | 1,503,792,214.1 | 156.91%
3         | 408,929,970.1   | 2,656,276,737.3 | 2,247,346,767.2 | 146.64%
4         | 681,410,416.5   | 3,837,435,302.3 | 3,156,024,885.8 | 139.68%
5         | 887,520,250.1   | 4,867,173,938.7 | 3,979,653,688.5 | 138.31%
6         | 1,103,442,123.1 | 5,532,695,100.9 | 4,429,252,977.8 | 133.49%
7         | 1,249,955,027.3 | 6,122,528,901.5 | 4,872,573,874.1 | 132.18%
8         | 1,395,073,503.7 | 6,584,636,165.2 | 5,189,562,661.6 | 130.07%
9         | 1,509,012,245.4 | 6,958,231,189.7 | 5,449,218,944.3 | 128.71%
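For reference, the percentage differences in the table are symmetric percent differences: the absolute difference between the two MSEs divided by their mean (which is why k = 1, with a training MSE of zero, yields exactly 200%). A quick sketch that reproduces two of the table's entries:

```python
def percent_diff(a, b):
    """Symmetric percent difference: |a - b| over the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100

# Reproduce the k = 1 and k = 2 rows of the table
print(round(percent_diff(0.0, 786_054_904.7), 2))
print(round(percent_diff(206_467_731.3, 1_710_259_945.4), 2))
```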
When the number of neighbors is one, the MSE of the training data predictions is zero, which means the model memorized the data completely. This is not the desired effect: a model that memorizes the data is only useful for making predictions about the data it memorized and not any other data, as demonstrated by the mean squared error of the testing data predictions. Note that the training and testing data are random subsets of the original data.
As the number of neighbors increases, the percentage difference decreases; however, the amount by which it decreases gets incrementally smaller while both errors continue to grow. For this reason, the k-nearest neighbors algorithm over-fits the data for any number of neighbors, k.
The first linear regression model was tested using all the features, including the dummy variables for the sort type and array type qualitative data.
Unlike the k-nearest neighbors algorithm, linear regression seems to be a much better choice.
The first column lists the names of all the independent variables, while the P>|t| column shows their p-values. The p-value is the probability that a coefficient appears in our data due to random chance. Since we want all the coefficients to be representative of the data, P should be as close to zero as possible. In this manner, P is often used to determine the statistical significance of a variable in machine learning. A common threshold for P is 0.05, meaning that coefficients with a p-value of 0.05 or higher are usually dropped from the model because they are considered statistically insignificant. However, it is worth noting that this alone is not a reliable metric for evaluating our model, and this dataset is a good example of why. But first, let us observe the histogram and scatter plot of this model.
Although the histogram is not exactly normally distributed and the scatter plot shows a pattern, the residuals are on the order of 1 × 10⁻⁷; therefore the difference between the actual and predicted average times is negligible.
For the second linear regression model, the independent variables with a p-value greater than or equal to the 0.05 threshold were dropped. In other words, by dropping the independent variables with high p-values, the regression model can be simplified without affecting its performance.
As expected, the R-squared value did not change.
These results demonstrate that the test times are the only features needed to accurately predict the average time, which makes sense, since the average time is the average of all the test times. However, as demonstrated in the analysis, the sort order and array type are crucial to the test times, even though judging by R-squared and the p-values alone, one would think otherwise.
Since the k-nearest neighbors model is out of the question, I would recommend the first linear model with all the independent variables, because the sort and array type are essential to the test times despite the low correlation computed by the linear regression algorithm. Furthermore, by keeping only the test times, it is not possible to make predictions from the sort and array type, which might be desirable. In conclusion, the results of this evaluation illustrate the importance of analyzing and understanding the data in order to correctly interpret the metrics used to evaluate learning algorithms: with that understanding, it is possible to gauge whether or not the metrics match our expectations of the data.