Modeling

Machine Learning Models

The goal of modeling is to derive Avg AADT (annual average daily traffic) from other information, both to better understand what drives traffic and to better predict demand. Being able to predict or estimate demand is especially valuable for small roads, where there are not enough resources to measure traffic every year. OLS multivariable linear regression was used first, followed by Random Forest.

Model 1 – All Data

The first model includes all the data. The county-level population, demographic, and income data were joined to every measured traffic point, so each measured spot on the interstate carries the income and demographic data of its county. During setup, text columns were removed and rows with blank (NaN) values were dropped. The NaN values arise because the Census Bureau, specifically the American Community Survey, does not collect population and income information for every California county every year. Because some variables, such as population, take very large values, the numbers were scaled with scikit-learn's StandardScaler. The model was then fit in Python with the scikit-learn package.
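The setup steps described above can be sketched roughly as follows. This is a minimal sketch on a tiny made-up data frame, not the project's actual code: the helper name `prepare_and_fit`, the toy values, and the choice of `LinearRegression` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def prepare_and_fit(df, target="Avg.AADT"):
    """Drop text columns and NaN rows, scale the features, fit a linear model."""
    numeric = df.select_dtypes(include="number")  # removes text columns
    numeric = numeric.dropna()                     # removes rows with blank (NaN) values
    X = numeric.drop(columns=[target])
    y = numeric[target]
    X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
    model = LinearRegression().fit(X_scaled, y)
    return model, X.columns.tolist(), model.score(X_scaled, y)


# Hypothetical toy data: one row per measured traffic point and year.
toy = pd.DataFrame({
    "County": ["A", "A", "B", "B", "C", "C"],
    "Total population": [1.0e6, 1.1e6, 2.0e5, np.nan, 5.0e5, 5.2e5],
    "Mean Income": [80e3, 82e3, 55e3, 56e3, 70e3, 69e3],
    "Avg.AADT": [90e3, 95e3, 20e3, 21e3, 50e3, 52e3],
})

model, features, r2 = prepare_and_fit(toy)
print(features, round(r2, 3))
```

The row with the missing population value is dropped before fitting, which mirrors how county-years without survey data fall out of the model.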

Results:

The full statistical output is shown below. The R-squared of this model was 0.993, meaning the model explains 99.3% of the variance in Avg AADT with the chosen variables. This is an incredibly good fit, good enough that the model needed additional investigation. Many variables were statistically significant, with p-values below 0.05. Notably, longitude was significant (p < 0.001) while latitude was not (p = 0.090).

                            coef     std err           t       P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                  6.053e+05    3.32e+04      18.245       0.000     5.4e+05     6.7e+05
Long                   -185.9884      36.879      -5.043       0.000    -258.271    -113.706
Lat                     -60.7316      35.874      -1.693       0.090    -131.045       9.582
ObjectID                  0.0387       0.020       1.893       0.058      -0.001       0.079
Dist                     49.2678      17.835       2.762       0.006      14.311      84.225
Route                     0.5235       0.262       1.995       0.046       0.009       1.038
Postmile                -13.5835       1.203     -11.287       0.000     -15.942     -11.225
Year                   -311.2258      16.234     -19.171       0.000    -343.044    -279.407
Avg.Peak.Hour             0.8537       0.010      87.123       0.000       0.835       0.873
Avg.Peak.Month            0.8826       0.001    1132.488       0.000       0.881       0.884
Mean Income               0.0698       0.006      10.943       0.000       0.057       0.082
Median Income            -0.0783       0.009      -8.673       0.000      -0.096      -0.061
Total population          0.0379       0.018       2.151       0.031       0.003       0.072
Under 5 years             0.0490       0.024       2.051       0.040       0.002       0.096
5 to 9 years             -0.1009       0.018      -5.729       0.000      -0.135      -0.066
10 to 14 years           -0.0369       0.019      -1.949       0.051      -0.074       0.000
15 to 19 years           -0.0569       0.023      -2.470       0.014      -0.102      -0.012
20 to 24 years           -0.0402       0.018      -2.250       0.024      -0.075      -0.005
25 to 29 years           -0.0099       0.018      -0.563       0.573      -0.044       0.025
30 to 34 years           -0.0777       0.023      -3.324       0.001      -0.124      -0.032
35 to 39 years            0.0135       0.020       0.672       0.502      -0.026       0.053
40 to 44 years           -0.0480       0.021      -2.244       0.025      -0.090      -0.006
45 to 49 years           -0.0169       0.026      -0.653       0.514      -0.068       0.034
50 to 54 years           -0.0297       0.021      -1.404       0.160      -0.071       0.012
55 to 59 years           -0.0585       0.021      -2.815       0.005      -0.099      -0.018
60 to 64 years           -0.0247       0.022      -1.115       0.265      -0.068       0.019
65 to 69 years           -0.1383       0.020      -6.771       0.000      -0.178      -0.098
70 to 74 years            0.0366       0.020       1.872       0.061      -0.002       0.075
75 to 79 years           -0.0330       0.021      -1.595       0.111      -0.074       0.008
80 to 84 years            0.0874       0.023       3.787       0.000       0.042       0.133
85 years and over        -0.2458       0.023     -10.848       0.000      -0.290      -0.201
============================================================================================
Omnibus:                   42725.748   Durbin-Watson:                   0.959
Prob(Omnibus):                 0.000   Jarque-Bera (JB):       2426086685.746
Skew:                          1.336   Prob(JB):                         0.00
Kurtosis:                    984.267   Cond. No.                     4.98e+09

Model 2 – Filtered

This model had the same setup as model 1, but variables were removed to leave mainly the economic and population information, which were hypothesized to be important factors. Latitude and longitude were also kept.

Results:

The full statistical results are below. The R-squared value was only 0.348. Multiple iterations were tested, and unsurprisingly, dropping Avg Peak Hour and Avg Peak Month made the biggest difference. Still, many of the remaining variables are statistically significant.
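One way to see why the peak-traffic columns dominate is that Avg Peak Month is itself a traffic count at the same spot, so it behaves much like a rescaled copy of the target. The sketch below uses synthetic data only (every number and variable name is made up for illustration) to show how R-squared changes with and without such a near-copy among the predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: a weakly informative predictor and a traffic target.
income = rng.normal(70_000, 15_000, n)
aadt = 50_000 + 0.3 * income + rng.normal(0, 20_000, n)

# A "peak month" count that is nearly a rescaled copy of the target.
peak_month = 1.1 * aadt + rng.normal(0, 2_000, n)

X_with = np.column_stack([income, peak_month])
X_without = income.reshape(-1, 1)

# In-sample R-squared for each set of predictors.
r2_with = LinearRegression().fit(X_with, aadt).score(X_with, aadt)
r2_without = LinearRegression().fit(X_without, aadt).score(X_without, aadt)
print(f"with proxy: {r2_with:.3f}, without proxy: {r2_without:.3f}")
```

In this toy setting the fit is close to perfect with the proxy column and weak without it, which is the same pattern as the gap between models 1 and 2.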

OLS Regression Results

                            coef     std err           t       P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                 -6.431e+05    3.25e+04     -19.781       0.000   -7.07e+05   -5.79e+05
Long                  -7699.8462     343.695     -22.403       0.000   -8373.490   -7026.203
Lat                   -8127.3332     317.313     -25.613       0.000   -8749.267   -7505.399
Mean Income              -0.9626       0.062     -15.637       0.000      -1.083      -0.842
Median Income             1.9392       0.087      22.395       0.000       1.769       2.109
Total population         -0.0649       0.172      -0.377       0.706      -0.402       0.272
Under 5 years             1.6700       0.232       7.203       0.000       1.216       2.124
5 to 9 years              0.7119       0.170       4.184       0.000       0.378       1.045
10 to 14 years            0.3522       0.185       1.903       0.057      -0.011       0.715
15 to 19 years           -2.2991       0.222     -10.346       0.000      -2.735      -1.864
20 to 24 years            0.6066       0.174       3.491       0.000       0.266       0.947
25 to 29 years           -1.6570       0.168      -9.872       0.000      -1.986      -1.328
30 to 34 years            0.7274       0.225       3.233       0.001       0.286       1.168
35 to 39 years           -0.6401       0.194      -3.292       0.001      -1.021      -0.259
40 to 44 years           -0.5157       0.207      -2.497       0.013      -0.921      -0.111
45 to 49 years            1.3083       0.248       5.283       0.000       0.823       1.794
50 to 54 years           -0.0564       0.205      -0.276       0.783      -0.457       0.344
55 to 59 years            0.7429       0.202       3.680       0.000       0.347       1.139
60 to 64 years            0.4551       0.215       2.117       0.034       0.034       0.877
65 to 69 years           -0.3751       0.198      -1.890       0.059      -0.764       0.014
70 to 74 years            0.5359       0.189       2.834       0.005       0.165       0.906
75 to 79 years            0.3311       0.202       1.642       0.101      -0.064       0.726
80 to 84 years            0.8520       0.225       3.786       0.000       0.411       1.293
85 years and over        -0.0338       0.221      -0.153       0.878      -0.467       0.400
============================================================================================
Omnibus:                    2639.215   Durbin-Watson:                   0.068
Prob(Omnibus):                 0.000   Jarque-Bera (JB):             3068.385
Skew:                          0.507   Prob(JB):                         0.00
Kurtosis:                      3.437   Cond. No.                     4.98e+08
============================================================================================

Model 3 – Filtered with Percent Change

Changes in large numbers can be difficult for a model to capture, especially for data like population, where each county starts from a very different baseline. Therefore, in this model the values were converted to percent change, computed within each ObjectID. No scaler was used, since percent change is itself a form of scaling.
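A percent-change transformation grouped on ObjectID can be sketched with pandas as below. The toy values are hypothetical, and the assumption that both the target and the predictors are converted is ours; the original pipeline may differ.

```python
import pandas as pd

# Hypothetical toy data: two traffic points observed over three years.
toy = pd.DataFrame({
    "ObjectID": [1, 1, 1, 2, 2, 2],
    "Year": [2015, 2016, 2017, 2015, 2016, 2017],
    "Avg.AADT": [100.0, 110.0, 99.0, 200.0, 220.0, 220.0],
    "Total population": [1000.0, 1010.0, 1020.1, 500.0, 500.0, 550.0],
})

value_cols = ["Avg.AADT", "Total population"]

# Sort so each point's rows are in time order, then compute the
# year-over-year percent change within each ObjectID.
toy = toy.sort_values(["ObjectID", "Year"])
pct = toy.groupby("ObjectID")[value_cols].pct_change() * 100

# The first year of each point has no previous value, so it becomes NaN
# and is dropped here.
result = pd.concat([toy[["ObjectID", "Year"]], pct], axis=1).dropna()
print(result)
```

Each point loses its first year in this transformation, so the percent-change models are fit on fewer rows than the level models.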

Results:

The full statistical results are below. The R-squared value for the model is 0.012. This did not perform as expected and potentially suggests that, for this data, scaling the raw values with scikit-learn's StandardScaler works better than converting them to percent change.

OLS Regression Results

                            coef     std err           t       P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                    -0.9094       0.165      -5.496       0.000      -1.234      -0.585
Mean Income              -0.0176       0.030      -0.590       0.555      -0.076       0.041
Median Income             0.0251       0.026       0.947       0.344      -0.027       0.077
Total population         -0.0105       0.223      -0.047       0.962      -0.447       0.426
Under 5 years            -0.0074       0.029      -0.255       0.799      -0.065       0.050
5 to 9 years              0.0399       0.026       1.526       0.127      -0.011       0.091
10 to 14 years            0.0783       0.026       3.054       0.002       0.028       0.129
15 to 19 years            0.0398       0.025       1.578       0.115      -0.010       0.089
20 to 24 years           -0.0745       0.021      -3.594       0.000      -0.115      -0.034
25 to 29 years           -0.0162       0.026      -0.624       0.533      -0.067       0.035
30 to 34 years           -0.0870       0.027      -3.245       0.001      -0.140      -0.034
35 to 39 years           -0.0251       0.025      -0.986       0.324      -0.075       0.025
40 to 44 years           -0.0017       0.022      -0.077       0.938      -0.046       0.042
45 to 49 years           -0.1500       0.028      -5.447       0.000      -0.204      -0.096
50 to 54 years           -0.2206       0.033      -6.621       0.000      -0.286      -0.155
55 to 59 years            0.1199       0.025       4.734       0.000       0.070       0.170
60 to 64 years            0.1780       0.025       7.143       0.000       0.129       0.227
65 to 69 years            0.2244       0.021      10.882       0.000       0.184       0.265
70 to 74 years            0.1804       0.015      12.359       0.000       0.152       0.209
75 to 79 years            0.0030       0.011       0.263       0.793      -0.019       0.025
80 to 84 years            0.0290       0.009       3.195       0.001       0.011       0.047
85 years and over         0.0140       0.009       1.481       0.139      -0.005       0.033
============================================================================================
Omnibus:                  112622.890   Durbin-Watson:                   2.497
Prob(Omnibus):                 0.000   Jarque-Bera (JB):        640801827.327
Skew:                         17.380   Prob(JB):                         0.00
Kurtosis:                    532.670   Cond. No.                         45.5
============================================================================================

Model 4a and 4b – Data by County

Modeling at the small scope of individual interstate points was not very successful, so modeling at the macro scope of whole counties was tried next. The data frame used for this analysis was the same one used for model 1, with some manipulation. First, unnecessary columns were removed, and then everything was grouped by county. While grouping, Avg.AADT was aggregated as a mean (model 4a) and as a sum (model 4b). StandardScaler was used here as well.
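The two aggregations can be sketched with a pandas named aggregation. The toy values are hypothetical, and grouping on both county and year is an assumption on our part (the 400 observations reported for models 4a and 4b suggest one row per county and year rather than one per county).

```python
import pandas as pd

# Hypothetical toy data: several traffic points per county and year.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "B", "B"],
    "Year": [2016, 2016, 2016, 2016, 2016],
    "Avg.AADT": [100.0, 200.0, 300.0, 50.0, 70.0],
    "Median Income": [60e3, 60e3, 60e3, 45e3, 45e3],
})

keys = ["County", "Year"]  # assumed grouping keys

# County-level predictors are identical within a group, so "first" keeps them.
# The target is aggregated two ways: mean (model 4a) and sum (model 4b).
by_county = toy.groupby(keys).agg(
    aadt_mean=("Avg.AADT", "mean"),
    aadt_sum=("Avg.AADT", "sum"),
    median_income=("Median Income", "first"),
).reset_index()
print(by_county)
```

Note that the sum grows with the number of measured points in a county, while the mean does not, which is worth keeping in mind when comparing the two models.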

For model 4a, the R-squared value was 0.823, an improvement over the filtered point-level models (models 2 and 3). Interestingly, median income was statistically significant (p < 0.001) while mean income was not (p = 0.061). Most population demographics were not significant, except for a few specific age groups such as ages 25 to 29. Model 4b performed even better, with an R-squared value of 0.991. The significance of the variables switches between the two models: in model 4b many population variables are statistically significant, while neither income variable is.

Model 4a. OLS Regression Results (Mean(AVG.AADT))

============================================================================================
Dep. Variable:              Avg.AADT   R-squared:                       0.823
Model:                           OLS   Adj. R-squared:                  0.813
Method:                Least Squares   F-statistic:                     83.52
Date:               Thu, 14 Dec 2023   Prob (F-statistic):          7.78e-128
Time:                       00:54:17   Log-Likelihood:                -4461.9
No. Observations:                400   AIC:                             8968.
Df Residuals:                    378   BIC:                             9056.
Df Model:                         21
Covariance Type:           nonrobust
============================================================================================
                            coef     std err           t       P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                 -2.275e+04    3801.666      -5.983       0.000   -3.02e+04   -1.53e+04
Mean Income              -0.3044       0.162      -1.878       0.061      -0.623       0.014
Median Income             1.1663       0.236       4.948       0.000       0.703       1.630
Total population          0.6632       0.816       0.813       0.417      -0.942       2.268
Under 5 years             0.9687       0.962       1.007       0.315      -0.924       2.861
5 to 9 years              0.0139       0.868       0.016       0.987      -1.693       1.720
10 to 14 years           -0.8192       0.860      -0.953       0.341      -2.510       0.872
15 to 19 years           -2.5767       0.982      -2.623       0.009      -4.508      -0.645
20 to 24 years           -0.5315       0.823      -0.646       0.519      -2.149       1.086
25 to 29 years           -2.1035       0.822      -2.560       0.011      -3.719      -0.488
30 to 34 years            0.9573       0.978       0.978       0.328      -0.966       2.881
35 to 39 years           -2.1620       0.884      -2.446       0.015      -3.900      -0.424
40 to 44 years           -1.8496       0.930      -1.990       0.047      -3.677      -0.022
45 to 49 years            0.9730       1.072       0.908       0.365      -1.135       3.081
50 to 54 years           -0.2508       0.909      -0.276       0.783      -2.038       1.536
55 to 59 years           -0.2217       0.925      -0.240       0.811      -2.041       1.597
60 to 64 years           -0.5428       0.948      -0.572       0.567      -2.408       1.322
65 to 69 years           -0.9597       0.888      -1.080       0.281      -2.706       0.787
70 to 74 years           -0.3357       0.885      -0.379       0.705      -2.076       1.405
75 to 79 years           -0.0402       0.956      -0.042       0.966      -1.919       1.839
80 to 84 years            0.2171       1.060       0.205       0.838      -1.867       2.301
85 years and over        -1.7241       1.042      -1.654       0.099      -3.774       0.325
============================================================================================
Omnibus:                      26.277   Durbin-Watson:                   0.450
Prob(Omnibus):                 0.000   Jarque-Bera (JB):               34.352
Skew:                          0.529   Prob(JB):                     3.47e-08
Kurtosis:                      3.970   Cond. No.                     8.68e+06
============================================================================================

Model 4b. OLS Regression Results (Sum(AVG.AADT))

============================================================================================
Dep. Variable:              Avg.AADT   R-squared:                       0.991
Model:                           OLS   Adj. R-squared:                  0.991
Method:                Least Squares   F-statistic:                     2059.
Date:               Thu, 14 Dec 2023   Prob (F-statistic):               0.00
Time:                       01:12:34   Log-Likelihood:                -6337.5
No. Observations:                400   AIC:                         1.272e+04
Df Residuals:                    378   BIC:                         1.281e+04
Df Model:                         21
Covariance Type:           nonrobust
============================================================================================
                            coef     std err           t       P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                 -1.539e+06    4.14e+05      -3.722       0.000   -2.35e+06   -7.26e+05
Mean Income              -8.3152      17.633      -0.472       0.638     -42.986      26.356
Median Income            20.4112      25.636       0.796       0.426     -29.996      70.818
Total population       -195.8158      88.772      -2.206       0.028    -370.365     -21.266
Under 5 years           531.0799     104.677       5.073       0.000     325.257     736.903
5 to 9 years             46.9593      94.404       0.497       0.619    -138.663     232.582
10 to 14 years          153.7182      93.538       1.643       0.101     -30.201     337.637
15 to 19 years         -174.7530     106.829      -1.636       0.103    -384.807      35.301
20 to 24 years          450.8921      89.464       5.040       0.000     274.982     626.802
25 to 29 years          -30.6133      89.386      -0.342       0.732    -206.369     145.143
30 to 34 years          160.8047     106.419       1.511       0.132     -48.443     370.053
35 to 39 years          253.3709      96.136       2.636       0.009      64.343     442.399
40 to 44 years          289.5896     101.104       2.864       0.004      90.793     488.386
45 to 49 years          359.5059     116.603       3.083       0.002     130.234     588.777
50 to 54 years           61.2153      98.857       0.619       0.536    -133.164     255.595
55 to 59 years          326.8902     100.624       3.249       0.001     129.037     524.743
60 to 64 years          421.2147     103.156       4.083       0.000     218.383     624.046
65 to 69 years          163.8490      96.628       1.696       0.091     -26.147     353.845
70 to 74 years          398.9021      96.267       4.144       0.000     209.616     588.188
75 to 79 years           19.4331     103.946       0.187       0.852    -184.951     223.817
80 to 84 years          252.9351     115.277       2.194       0.029      26.272     479.599
85 years and over       -94.9807     113.376      -0.838       0.403    -317.908     127.946
============================================================================================
Omnibus:                      22.503   Durbin-Watson:                   0.727
Prob(Omnibus):                 0.000   Jarque-Bera (JB):               60.200
Skew:                          0.177   Prob(JB):                     8.47e-14
Kurtosis:                      4.867   Cond. No.                     8.68e+06
============================================================================================

Model 5 – Data by County with Percent Change

Percent change did not perform well at the smaller scope, but it was still attempted at the larger scale. The setup was the same as model 4, except that the data were grouped by county and year to find the percent change from year to year. No scaler was used.
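The county-level variant adds an aggregation step before the percent change. The following is a minimal sketch on hypothetical toy data, assuming the target is averaged per county and year before the change is computed; the original aggregation may have differed.

```python
import pandas as pd

# Hypothetical toy data: county A has two traffic points, county B has one.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "A", "B", "B"],
    "Year": [2016, 2016, 2017, 2017, 2016, 2017],
    "Avg.AADT": [100.0, 300.0, 110.0, 330.0, 50.0, 45.0],
    "Total population": [1000.0, 1000.0, 1050.0, 1050.0, 400.0, 400.0],
})

# Step 1: one row per county and year (mean target, county-level predictor kept).
yearly = (
    toy.groupby(["County", "Year"], as_index=False)
       .agg({"Avg.AADT": "mean", "Total population": "first"})
       .sort_values(["County", "Year"])
)

# Step 2: percent change from the previous year within each county.
cols = ["Avg.AADT", "Total population"]
yearly[cols] = yearly.groupby("County")[cols].pct_change() * 100
yearly = yearly.dropna()  # the first year of each county has no previous value
print(yearly)
```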

Results:

The full statistics can be found below. The R-squared value was 0.439. This is an improvement over percent change at the micro level (model 3), but worse than model 4. Again, scaling with scikit-learn's StandardScaler seems to work better than manually converting to percent change.

OLS Regression Results

                            coef     std err           t       P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                  9094.0968    3093.430       2.940       0.004    3009.101    1.52e+04
Mean Income               0.0886       0.049       1.798       0.073      -0.008       0.186
Median Income             0.0278       0.045       0.621       0.535      -0.060       0.116
Total population         -0.0608       0.331      -0.184       0.854      -0.713       0.591
Under 5 years            -0.1600       0.049      -3.297       0.001      -0.255      -0.065
5 to 9 years             -0.0492       0.043      -1.148       0.252      -0.134       0.035
10 to 14 years           -0.0428       0.043      -0.997       0.319      -0.127       0.042
15 to 19 years            0.0644       0.037       1.737       0.083      -0.009       0.137
20 to 24 years           -0.0239       0.028      -0.860       0.390      -0.078       0.031
25 to 29 years           -0.0195       0.040      -0.488       0.626      -0.098       0.059
30 to 34 years            0.0429       0.039       1.103       0.271      -0.034       0.119
35 to 39 years            0.0089       0.039       0.230       0.818      -0.067       0.085
40 to 44 years            0.0114       0.035       0.328       0.743      -0.057       0.080
45 to 49 years           -0.0722       0.042      -1.738       0.083      -0.154       0.010
50 to 54 years           -0.0877       0.051      -1.706       0.089      -0.189       0.013
55 to 59 years            0.0592       0.041       1.433       0.153      -0.022       0.140
60 to 64 years            0.1397       0.041       3.438       0.001       0.060       0.220
65 to 69 years            0.2256       0.038       5.930       0.000       0.151       0.300
70 to 74 years            0.1451       0.026       5.541       0.000       0.094       0.197
75 to 79 years            0.0658       0.018       3.608       0.000       0.030       0.102
80 to 84 years            0.0668       0.015       4.590       0.000       0.038       0.095
85 years and over         0.0527       0.015       3.591       0.000       0.024       0.082
============================================================================================
Omnibus:                      22.127   Durbin-Watson:                   2.428
Prob(Omnibus):                 0.000   Jarque-Bera (JB):               52.808
Skew:                         -0.264   Prob(JB):                     3.41e-12
Kurtosis:                      4.800   Cond. No.                     3.19e+12
============================================================================================

Model 6 – Random Forest Models

Because linear regression was not effective at the smallest scale, a Random Forest model was tried as an alternative. The data setup was the same as in model 2, with some variables filtered out.

Results:

The R-squared value was 0.98. Unlike the OLS regression, the model was very effective even without Avg Peak Hour and Avg Peak Month. Without those variables, latitude and longitude became the most important features. Surprisingly, the next most important was the age group of 85 years and over. However, when latitude and longitude were removed, the model performed poorly, with an R-squared value of only 0.37.

Mean Squared Error: 122751015.56604156

R-squared: 0.9804802690247113

Without Lat and Long

Mean Squared Error: 3947098097.7435

R-squared: 0.37233681818644504
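The comparison with and without latitude and longitude can be sketched as follows. This is an illustration on synthetic data only, where traffic is constructed to depend nonlinearly on location. The data-generating formula, the hyperparameters, the helper `fit_and_score`, and the use of a held-out test split are all assumptions rather than the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 3000

# Synthetic stand-ins: traffic depends nonlinearly on location and only
# weakly on a county-level demographic variable.
lat = rng.uniform(32.5, 42.0, n)
lon = rng.uniform(-124.4, -114.1, n)
pop_85_plus = rng.uniform(1_000, 50_000, n)
aadt = (
    150_000 * np.exp(-((lat - 34.0) ** 2 + (lon + 118.2) ** 2) / 4.0)
    + 0.2 * pop_85_plus
    + rng.normal(0, 3_000, n)
)

names = ["Lat", "Long", "85 years and over"]
X = np.column_stack([lat, lon, pop_85_plus])


def fit_and_score(X, y):
    """Fit a random forest and report held-out MSE and R-squared."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = forest.predict(X_te)
    return forest, mean_squared_error(y_te, pred), r2_score(y_te, pred)


forest_all, mse_all, r2_all = fit_and_score(X, aadt)
forest_demo, mse_demo, r2_demo = fit_and_score(X[:, 2:], aadt)  # drop Lat and Long

importances = dict(zip(names, forest_all.feature_importances_))
print(importances)
print(f"with Lat/Long: R2={r2_all:.3f}; without: R2={r2_demo:.3f}")
```

When location carries most of the signal, as in this toy setup, the forest can recover it from latitude and longitude alone, which a linear model cannot do with two linear terms. That may be part of why Random Forest succeeded where OLS did not.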

Conclusion

The accuracy of some of the models was much better than expected. The assumption a linear regression model makes that the variables are independent is a serious limitation, especially at the micro level. The relationships between the variables are complex, and a more complicated model is needed to reflect them. The data we collected also had limitations: the traffic data covered only the interstates, and the economic and population factors were available only at the county level. The approach shows promise, given a more complex model, more data, and more variables.