Machine Learning Models
With modeling, our goal is to derive Avg AADT (annual average daily traffic) from other information, both to better understand what drives traffic and to better predict demand. Being able to predict or estimate demand is especially valuable if the approach works on small roads, where there are not enough resources to measure traffic every year. OLS multivariable linear regression was used first, followed by Random Forest.
Model 1 - All Data
The first model includes all the data. The county-level population demographic and income data were added to every measured traffic point, so each measured spot on the interstate carries the income and demographic data of its county. For the setup, text columns were removed and blank (NaN) values were removed. The NaN values were present because the Census Bureau, specifically the American Community Survey, does not collect population and income information for all California counties annually. Because some variables, such as population, take very large values, the numbers were scaled with the scikit-learn StandardScaler. The model was then fitted in Python with the scikit-learn package.
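The setup described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names are borrowed from the results table, the values and the `County` text column are made up, and the use of `LinearRegression` is an assumption based on the statement that the model was fitted with scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged traffic + county table (all values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "County": rng.choice(["A", "B", "C"], size=n),        # text column, dropped below
    "Avg.Peak.Hour": rng.normal(5_000, 1_000, n),
    "Total population": rng.normal(500_000, 100_000, n),
    "Mean Income": rng.normal(90_000, 15_000, n),
})
df["Avg.AADT"] = (
    10 * df["Avg.Peak.Hour"]
    + 0.01 * df["Total population"]
    + rng.normal(0, 500, n)
)
# Simulate county-years for which no survey estimate is available.
df.loc[::25, "Mean Income"] = np.nan

# 1. Keep numeric columns only, 2. drop rows that contain NaN values.
numeric = df.select_dtypes(include="number").dropna()

# 3. Separate the target and scale the features to zero mean and unit variance.
X = numeric.drop(columns="Avg.AADT")
y = numeric["Avg.AADT"]
X_scaled = StandardScaler().fit_transform(X)

# 4. Fit ordinary least squares and report the in-sample R-squared.
model = LinearRegression().fit(X_scaled, y)
r_squared = model.score(X_scaled, y)
print(f"rows kept: {len(numeric)}, R-squared: {r_squared:.3f}")
```

Scaling here only changes the units of the coefficients so that large-valued columns such as population are comparable with the others.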
Results:
The full statistical output is shown below. The R-squared of this model was 0.993, which means the model explains 99.3% of the variance in Avg AADT with the chosen variables. This is an unusually good fit, so the model needed additional investigation. Interestingly, quite a few variables were statistically significant, with p-values below 0.05; in particular, longitude was significant and latitude was not.
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const 6.053e+05 3.32e+04 18.245 0.000 5.4e+05 6.7e+05
Long -185.9884 36.879 -5.043 0.000 -258.271 -113.706
Lat -60.7316 35.874 -1.693 0.090 -131.045 9.582
ObjectID 0.0387 0.020 1.893 0.058 -0.001 0.079
Dist 49.2678 17.835 2.762 0.006 14.311 84.225
Route 0.5235 0.262 1.995 0.046 0.009 1.038
Postmile -13.5835 1.203 -11.287 0.000 -15.942 -11.225
Year -311.2258 16.234 -19.171 0.000 -343.044 -279.407
Avg.Peak.Hour 0.8537 0.010 87.123 0.000 0.835 0.873
Avg.Peak.Month 0.8826 0.001 1132.488 0.000 0.881 0.884
Mean Income 0.0698 0.006 10.943 0.000 0.057 0.082
Median Income -0.0783 0.009 -8.673 0.000 -0.096 -0.061
Total population 0.0379 0.018 2.151 0.031 0.003 0.072
Under 5 years 0.0490 0.024 2.051 0.040 0.002 0.096
5 to 9 years -0.1009 0.018 -5.729 0.000 -0.135 -0.066
10 to 14 years -0.0369 0.019 -1.949 0.051 -0.074 0.000
15 to 19 years -0.0569 0.023 -2.470 0.014 -0.102 -0.012
20 to 24 years -0.0402 0.018 -2.250 0.024 -0.075 -0.005
25 to 29 years -0.0099 0.018 -0.563 0.573 -0.044 0.025
30 to 34 years -0.0777 0.023 -3.324 0.001 -0.124 -0.032
35 to 39 years 0.0135 0.020 0.672 0.502 -0.026 0.053
40 to 44 years -0.0480 0.021 -2.244 0.025 -0.090 -0.006
45 to 49 years -0.0169 0.026 -0.653 0.514 -0.068 0.034
50 to 54 years -0.0297 0.021 -1.404 0.160 -0.071 0.012
55 to 59 years -0.0585 0.021 -2.815 0.005 -0.099 -0.018
60 to 64 years -0.0247 0.022 -1.115 0.265 -0.068 0.019
65 to 69 years -0.1383 0.020 -6.771 0.000 -0.178 -0.098
70 to 74 years 0.0366 0.020 1.872 0.061 -0.002 0.075
75 to 79 years -0.0330 0.021 -1.595 0.111 -0.074 0.008
80 to 84 years 0.0874 0.023 3.787 0.000 0.042 0.133
85 years and over -0.2458 0.023 -10.848 0.000 -0.290 -0.201
==============================================================
Omnibus: 42725.748 Durbin-Watson: 0.959
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2426086685.746
Skew: 1.336 Prob(JB): 0.00
Kurtosis: 984.267 Cond. No. 4.98e+09
Model 2 - Filtered
This model had the same setup as Model 1, but some variables were removed to leave only the economic and population information, along with latitude and longitude. These were hypothesized to be important factors.
Results:
The full statistical results are below. The R-squared value for this model was only 0.348. Multiple iterations were tested, and it is not surprising that dropping Avg Peak Hour and Avg Peak Month made the biggest difference, since both are themselves traffic measurements. Still, some variables were found to be statistically significant.
OLS Regression Results
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -6.431e+05 3.25e+04 -19.781 0.000 -7.07e+05 -5.79e+05
Long -7699.8462 343.695 -22.403 0.000 -8373.490 -7026.203
Lat -8127.3332 317.313 -25.613 0.000 -8749.267 -7505.399
Mean Income -0.9626 0.062 -15.637 0.000 -1.083 -0.842
Median Income 1.9392 0.087 22.395 0.000 1.769 2.109
Total population -0.0649 0.172 -0.377 0.706 -0.402 0.272
Under 5 years 1.6700 0.232 7.203 0.000 1.216 2.124
5 to 9 years 0.7119 0.170 4.184 0.000 0.378 1.045
10 to 14 years 0.3522 0.185 1.903 0.057 -0.011 0.715
15 to 19 years -2.2991 0.222 -10.346 0.000 -2.735 -1.864
20 to 24 years 0.6066 0.174 3.491 0.000 0.266 0.947
25 to 29 years -1.6570 0.168 -9.872 0.000 -1.986 -1.328
30 to 34 years 0.7274 0.225 3.233 0.001 0.286 1.168
35 to 39 years -0.6401 0.194 -3.292 0.001 -1.021 -0.259
40 to 44 years -0.5157 0.207 -2.497 0.013 -0.921 -0.111
45 to 49 years 1.3083 0.248 5.283 0.000 0.823 1.794
50 to 54 years -0.0564 0.205 -0.276 0.783 -0.457 0.344
55 to 59 years 0.7429 0.202 3.680 0.000 0.347 1.139
60 to 64 years 0.4551 0.215 2.117 0.034 0.034 0.877
65 to 69 years -0.3751 0.198 -1.890 0.059 -0.764 0.014
70 to 74 years 0.5359 0.189 2.834 0.005 0.165 0.906
75 to 79 years 0.3311 0.202 1.642 0.101 -0.064 0.726
80 to 84 years 0.8520 0.225 3.786 0.000 0.411 1.293
85 years and over -0.0338 0.221 -0.153 0.878 -0.467 0.400
===================================================================
Omnibus: 2639.215 Durbin-Watson: 0.068
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3068.385
Skew: 0.507 Prob(JB): 0.00
Kurtosis: 3.437 Cond. No. 4.98e+08
===================================================================
Model 3 – Filtered with Percent Change
Changes in large numbers can be difficult to capture in a model, especially with data like population, where the baseline value is itself large. Therefore, for this model the numbers were converted to percent change, grouped by ObjectID. No scaler was used, since percent change is itself a way of scaling.
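A minimal sketch of the percent-change transformation, assuming a pandas data frame with one row per ObjectID and year. The values are invented, and dropping the first year of each ObjectID (which has no previous value to compare against) is an assumption about how the gaps were handled.

```python
import pandas as pd

# Two hypothetical measurement points observed over three years.
df = pd.DataFrame({
    "ObjectID": [1, 1, 1, 2, 2, 2],
    "Year": [2019, 2020, 2021, 2019, 2020, 2021],
    "Avg.AADT": [1000.0, 1100.0, 990.0, 50000.0, 52500.0, 52500.0],
    "Total population": [200.0, 210.0, 231.0, 9000.0, 9090.0, 9090.0],
})

# Sort so that each group's rows are in chronological order.
df = df.sort_values(["ObjectID", "Year"])

# Year-over-year percent change, computed separately within each ObjectID.
value_cols = ["Avg.AADT", "Total population"]
pct = df.groupby("ObjectID")[value_cols].pct_change() * 100

# Re-attach the keys and drop each group's first year, which has no predecessor.
pct_df = pd.concat([df[["ObjectID", "Year"]], pct], axis=1).dropna()
print(pct_df)
```

After this step a point with 1,000 vehicles and a point with 50,000 vehicles are on the same footing, since both are expressed as relative changes. Note that the target is transformed as well, so the model predicts the change in traffic rather than its level.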
Results:
The full statistical results are below. The R-squared value for the model was 0.012. This did not perform as expected and potentially shows that scaling with the scikit-learn StandardScaler works better than converting to percent change.
OLS Regression Results
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -0.9094 0.165 -5.496 0.000 -1.234 -0.585
Mean Income -0.0176 0.030 -0.590 0.555 -0.076 0.041
Median Income 0.0251 0.026 0.947 0.344 -0.027 0.077
Total population -0.0105 0.223 -0.047 0.962 -0.447 0.426
Under 5 years -0.0074 0.029 -0.255 0.799 -0.065 0.050
5 to 9 years 0.0399 0.026 1.526 0.127 -0.011 0.091
10 to 14 years 0.0783 0.026 3.054 0.002 0.028 0.129
15 to 19 years 0.0398 0.025 1.578 0.115 -0.010 0.089
20 to 24 years -0.0745 0.021 -3.594 0.000 -0.115 -0.034
25 to 29 years -0.0162 0.026 -0.624 0.533 -0.067 0.035
30 to 34 years -0.0870 0.027 -3.245 0.001 -0.140 -0.034
35 to 39 years -0.0251 0.025 -0.986 0.324 -0.075 0.025
40 to 44 years -0.0017 0.022 -0.077 0.938 -0.046 0.042
45 to 49 years -0.1500 0.028 -5.447 0.000 -0.204 -0.096
50 to 54 years -0.2206 0.033 -6.621 0.000 -0.286 -0.155
55 to 59 years 0.1199 0.025 4.734 0.000 0.070 0.170
60 to 64 years 0.1780 0.025 7.143 0.000 0.129 0.227
65 to 69 years 0.2244 0.021 10.882 0.000 0.184 0.265
70 to 74 years 0.1804 0.015 12.359 0.000 0.152 0.209
75 to 79 years 0.0030 0.011 0.263 0.793 -0.019 0.025
80 to 84 years 0.0290 0.009 3.195 0.001 0.011 0.047
85 years and over 0.0140 0.009 1.481 0.139 -0.005 0.033
====================================================================
Omnibus: 112622.890 Durbin-Watson: 2.497
Prob(Omnibus): 0.000 Jarque-Bera (JB): 640801827.327
Skew: 17.380 Prob(JB): 0.00
Kurtosis: 532.670 Cond. No. 45.5
====================================================================
Model 4a and 4b – Data by County
Modeling at the small scope of individual points on the interstate was not very successful. Modeling at the macro scope of whole counties may be more successful. The data frame used for this analysis was the same one used for Model 1, with some manipulation. First, unnecessary columns were removed, and then everything was grouped by county. During grouping, AVG.AADT was aggregated as a mean (Model 4a) and as a sum (Model 4b). The StandardScaler was used for this as well.
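The two aggregations can be sketched as below. This is an illustration under assumptions: the rows are invented, only one county-level column is shown, and the grouping key is taken to be the county alone, as the text states.

```python
import pandas as pd

# Hypothetical point-level rows; county attributes repeat on every row of a county.
df = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Kern", "Kern", "Kern"],
    "Avg.AADT": [100000.0, 140000.0, 30000.0, 50000.0, 40000.0],
    "Median Income": [100000.0, 100000.0, 55000.0, 55000.0, 55000.0],
})

grouped = df.groupby("County")

# County-level predictors are constant within a county, so the first value suffices.
county_features = grouped[["Median Income"]].first()

# Model 4a target: mean traffic over the county's measurement points.
model_4a = county_features.join(grouped["Avg.AADT"].mean())

# Model 4b target: total traffic summed over the county's measurement points.
model_4b = county_features.join(grouped["Avg.AADT"].sum())

print(model_4a)
print(model_4b)
```

One design point worth keeping in mind: the summed target grows with the number of measurement points in a county, while the mean does not, so the two targets measure different things.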
For Model 4a, the R-squared value was 0.823. This is an improvement over the smaller-scope models. Interestingly, median income was more statistically significant than mean income. Most population demographics were not significant, except for a few specific age groups such as ages 25 to 29. Model 4b performed even better, with an R-squared value of 0.991. The importance of the variables is reversed between the two models: Model 4b finds population variables statistically significant and income not significant.
Model 4a. OLS Regression Results (Mean(AVG.AADT))
==============================================================================
Dep. Variable: Avg.AADT R-squared: 0.823
Model: OLS Adj. R-squared: 0.813
Method: Least Squares F-statistic: 83.52
Date: Thu, 14 Dec 2023 Prob (F-statistic): 7.78e-128
Time: 00:54:17 Log-Likelihood: -4461.9
No. Observations: 400 AIC: 8968.
Df Residuals: 378 BIC: 9056.
Df Model: 21
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -2.275e+04 3801.666 -5.983 0.000 -3.02e+04 -1.53e+04
Mean Income -0.3044 0.162 -1.878 0.061 -0.623 0.014
Median Income 1.1663 0.236 4.948 0.000 0.703 1.630
Total population 0.6632 0.816 0.813 0.417 -0.942 2.268
Under 5 years 0.9687 0.962 1.007 0.315 -0.924 2.861
5 to 9 years 0.0139 0.868 0.016 0.987 -1.693 1.720
10 to 14 years -0.8192 0.860 -0.953 0.341 -2.510 0.872
15 to 19 years -2.5767 0.982 -2.623 0.009 -4.508 -0.645
20 to 24 years -0.5315 0.823 -0.646 0.519 -2.149 1.086
25 to 29 years -2.1035 0.822 -2.560 0.011 -3.719 -0.488
30 to 34 years 0.9573 0.978 0.978 0.328 -0.966 2.881
35 to 39 years -2.1620 0.884 -2.446 0.015 -3.900 -0.424
40 to 44 years -1.8496 0.930 -1.990 0.047 -3.677 -0.022
45 to 49 years 0.9730 1.072 0.908 0.365 -1.135 3.081
50 to 54 years -0.2508 0.909 -0.276 0.783 -2.038 1.536
55 to 59 years -0.2217 0.925 -0.240 0.811 -2.041 1.597
60 to 64 years -0.5428 0.948 -0.572 0.567 -2.408 1.322
65 to 69 years -0.9597 0.888 -1.080 0.281 -2.706 0.787
70 to 74 years -0.3357 0.885 -0.379 0.705 -2.076 1.405
75 to 79 years -0.0402 0.956 -0.042 0.966 -1.919 1.839
80 to 84 years 0.2171 1.060 0.205 0.838 -1.867 2.301
85 years and over -1.7241 1.042 -1.654 0.099 -3.774 0.325
==============================================================================
Omnibus: 26.277 Durbin-Watson: 0.450
Prob(Omnibus): 0.000 Jarque-Bera (JB): 34.352
Skew: 0.529 Prob(JB): 3.47e-08
Kurtosis: 3.970 Cond. No. 8.68e+06
==============================================================================
Model 4b OLS Regression Results (Sum (AVG.AADT))
==============================================================================
Dep. Variable: Avg.AADT R-squared: 0.991
Model: OLS Adj. R-squared: 0.991
Method: Least Squares F-statistic: 2059.
Date: Thu, 14 Dec 2023 Prob (F-statistic): 0.00
Time: 01:12:34 Log-Likelihood: -6337.5
No. Observations: 400 AIC: 1.272e+04
Df Residuals: 378 BIC: 1.281e+04
Df Model: 21
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -1.539e+06 4.14e+05 -3.722 0.000 -2.35e+06 -7.26e+05
Mean Income -8.3152 17.633 -0.472 0.638 -42.986 26.356
Median Income 20.4112 25.636 0.796 0.426 -29.996 70.818
Total population -195.8158 88.772 -2.206 0.028 -370.365 -21.266
Under 5 years 531.0799 104.677 5.073 0.000 325.257 736.903
5 to 9 years 46.9593 94.404 0.497 0.619 -138.663 232.582
10 to 14 years 153.7182 93.538 1.643 0.101 -30.201 337.637
15 to 19 years -174.7530 106.829 -1.636 0.103 -384.807 35.301
20 to 24 years 450.8921 89.464 5.040 0.000 274.982 626.802
25 to 29 years -30.6133 89.386 -0.342 0.732 -206.369 145.143
30 to 34 years 160.8047 106.419 1.511 0.132 -48.443 370.053
35 to 39 years 253.3709 96.136 2.636 0.009 64.343 442.399
40 to 44 years 289.5896 101.104 2.864 0.004 90.793 488.386
45 to 49 years 359.5059 116.603 3.083 0.002 130.234 588.777
50 to 54 years 61.2153 98.857 0.619 0.536 -133.164 255.595
55 to 59 years 326.8902 100.624 3.249 0.001 129.037 524.743
60 to 64 years 421.2147 103.156 4.083 0.000 218.383 624.046
65 to 69 years 163.8490 96.628 1.696 0.091 -26.147 353.845
70 to 74 years 398.9021 96.267 4.144 0.000 209.616 588.188
75 to 79 years 19.4331 103.946 0.187 0.852 -184.951 223.817
80 to 84 years 252.9351 115.277 2.194 0.029 26.272 479.599
85 years and over -94.9807 113.376 -0.838 0.403 -317.908 127.946
==============================================================================
Omnibus: 22.503 Durbin-Watson: 0.727
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.200
Skew: 0.177 Prob(JB): 8.47e-14
Kurtosis: 4.867 Cond. No. 8.68e+06
==============================================================================
Model 5 – Data by County with Percent Change
Percent change did not perform well at the smaller scope, but it was still attempted at the larger scale. The setup was the same as for Model 4, except that the data were grouped by county and year to find the percent change over the years. No scaler was used.
Results:
The full statistics can be found below. The R-squared value was 0.439. This was an improvement over percent change at the micro level, but it was worse than Model 4. Again, the StandardScaler in scikit-learn seems to work better than manually computing percent change.
OLS Regression Results
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const 9094.0968 3093.430 2.940 0.004 3009.101 1.52e+04
Mean Income 0.0886 0.049 1.798 0.073 -0.008 0.186
Median Income 0.0278 0.045 0.621 0.535 -0.060 0.116
Total population -0.0608 0.331 -0.184 0.854 -0.713 0.591
Under 5 years -0.1600 0.049 -3.297 0.001 -0.255 -0.065
5 to 9 years -0.0492 0.043 -1.148 0.252 -0.134 0.035
10 to 14 years -0.0428 0.043 -0.997 0.319 -0.127 0.042
15 to 19 years 0.0644 0.037 1.737 0.083 -0.009 0.137
20 to 24 years -0.0239 0.028 -0.860 0.390 -0.078 0.031
25 to 29 years -0.0195 0.040 -0.488 0.626 -0.098 0.059
30 to 34 years 0.0429 0.039 1.103 0.271 -0.034 0.119
35 to 39 years 0.0089 0.039 0.230 0.818 -0.067 0.085
40 to 44 years 0.0114 0.035 0.328 0.743 -0.057 0.080
45 to 49 years -0.0722 0.042 -1.738 0.083 -0.154 0.010
50 to 54 years -0.0877 0.051 -1.706 0.089 -0.189 0.013
55 to 59 years 0.0592 0.041 1.433 0.153 -0.022 0.140
60 to 64 years 0.1397 0.041 3.438 0.001 0.060 0.220
65 to 69 years 0.2256 0.038 5.930 0.000 0.151 0.300
70 to 74 years 0.1451 0.026 5.541 0.000 0.094 0.197
75 to 79 years 0.0658 0.018 3.608 0.000 0.030 0.102
80 to 84 years 0.0668 0.015 4.590 0.000 0.038 0.095
85 years and over 0.0527 0.015 3.591 0.000 0.024 0.082
====================================================================
Omnibus: 22.127 Durbin-Watson: 2.428
Prob(Omnibus): 0.000 Jarque-Bera (JB): 52.808
Skew: -0.264 Prob(JB): 3.41e-12
Kurtosis: 4.800 Cond. No. 3.19e+12
====================================================================
Model 6 – Random Forest Models
Because linear regression was not effective at the smallest scale, a Random Forest model was tried as an alternative. The data setup was the same as for Model 2, with some variables filtered out.
Results:
The R-squared value was 0.98. Unlike the OLS regression, the model was very effective even without Avg Peak Hour and Avg Peak Month. Without those variables, latitude and longitude became the most important factors. Surprisingly, the next most important was the 85 years and over age group. However, when latitude and longitude were removed, the model did not perform well, with an R-squared value of only 0.37.
Mean Squared Error: 122751015.56604156
R-squared: 0.9804802690247113
Without Lat and Long
Mean Squared Error: 3947098097.7435
R-squared: 0.37233681818644504
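The comparison above could be reproduced along these lines. The sketch uses synthetic data in which traffic peaks around two made-up locations, and it scores on a held-out 20% split; the split, the number of trees, and the random seeds are all assumptions, since the text does not describe how the reported scores were computed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic points: traffic is a nonlinear function of location, income is irrelevant.
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({
    "Lat": rng.uniform(32.5, 42.0, n),
    "Long": rng.uniform(-124.0, -114.0, n),
    "Median Income": rng.normal(70_000, 15_000, n),
})
d_south = (X["Lat"] - 34.0) ** 2 + (X["Long"] + 118.0) ** 2
d_north = (X["Lat"] - 38.0) ** 2 + (X["Long"] + 122.0) ** 2
y = (
    200_000 * np.exp(-d_south / 4)
    + 150_000 * np.exp(-d_north / 4)
    + rng.normal(0, 2_000, n)
)


def fit_and_score(features):
    """Fit a forest on a train split and return it with test-set MSE and R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.2, random_state=0
    )
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    predictions = forest.predict(X_test)
    return forest, mean_squared_error(y_test, predictions), r2_score(y_test, predictions)


# Full feature set, with impurity-based feature importances.
forest, mse, r2 = fit_and_score(X)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
print("Mean Squared Error:", mse, "R-squared:", r2)

# Same model without latitude and longitude.
_, mse_no_loc, r2_no_loc = fit_and_score(X.drop(columns=["Lat", "Long"]))
print("Without Lat and Long")
print("Mean Squared Error:", mse_no_loc, "R-squared:", r2_no_loc)
```

A forest can carve the map into regions, which is why location can carry so much of the signal here even though a linear model could not use it the same way.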
Conclusion
The accuracy of some of the models was much better than expected. The assumption a linear regression model makes, that the predictor variables are independent of one another, is a serious limitation, especially at the micro level. The relationships between the variables are complex and need a more complicated model to reflect them. There were also limitations in the data we collected: the traffic data covered only the interstates, and the economic and population factors were available only at the county level. The idea shows promise if a more complex model is used with more data collected and more variables included.
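The independence concern can be checked numerically. The regression outputs above report large condition numbers (for example 4.98e+09 for Model 1), which is a common sign of strongly related predictors. The sketch below is an illustration on synthetic data, under the assumption that the total population column equals the sum of the age-group columns; it shows how including such a total alongside its parts drives the condition number up.

```python
import numpy as np

# Four hypothetical age-group counts per county-year, drawn independently.
rng = np.random.default_rng(0)
n = 300
age_groups = rng.uniform(1_000, 50_000, size=(n, 4))

# Total population defined as the exact sum of the age groups (assumption).
total = age_groups.sum(axis=1, keepdims=True)


def condition_number(X):
    """Condition number of the standardized feature matrix."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.cond(X_std)


cond_groups_only = condition_number(age_groups)
cond_with_total = condition_number(np.hstack([age_groups, total]))

print("age groups only:     ", cond_groups_only)
print("age groups and total:", cond_with_total)
```

When the condition number is this large, individual coefficients and their p-values become unstable, even if the overall fit looks good.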