Objective:
In Lab 4, we explore ArcGIS regression tools. Regression is often used as an exploratory analysis at the beginning of a project, to identify potential relationships between dependent and independent variables. In this case, we analyze 911 call volumes in Portland, OR, to determine the factors leading to high call volumes and to inform policies for reducing those volumes in the future. This is done using both the Ordinary Least Squares Regression (OLS) and Geographically Weighted Regression (GWR) tools in ArcGIS. In addition, geographically specific predictions of future 911 calls are carried out using the GWR tool.
Regression analysis is a method of exploring spatial relationships that allows for the prediction of outcomes based on an improved understanding of the factors behind spatial patterns (ArcGIS, 2012). In other words, regression analysis can help us to understand relationships between different elements in a geographical space, such as the relationship between income per capita and access to healthy food sources. Several types of regression exist. Ordinary Least Squares regression (OLS) and Geographically Weighted Regression (GWR) are two main regression strategies, both found in the Spatial Statistics toolset. In general, regression involves several variables (ArcGIS, 2012):
The regression tool essentially develops an equation to predict your dependent variable Y, based on the other factors presented above. The equation looks something like this:
Figure 1: Regression Equation (ArcGIS, 2012)
Ordinary Least Squares (OLS) Regression is a linear regression model. In OLS, the sum of the squared differences between observed and predicted values is minimized. Because it fits a single global equation, it does not take into account any geographic variation in the data, and its performance degrades when more than an optimal number of explanatory variables are used, which can make it less suitable for some GIS applications (xlstat.com, 2019).
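The least-squares idea described above can be sketched in a few lines of NumPy. This is only an illustration of the principle, not the ArcGIS OLS tool itself; the tract values below are invented for the example, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical toy data: one value per census tract (invented for illustration).
population = np.array([1200.0, 3400.0, 2100.0, 5600.0, 4300.0, 800.0])
calls = np.array([14.0, 31.0, 22.0, 55.0, 40.0, 9.0])

# Design matrix: a column of ones for the intercept, then the explanatory variable.
X = np.column_stack([np.ones_like(population), population])

# Least-squares solution: the coefficients that minimize sum((calls - X @ beta) ** 2).
beta, *_ = np.linalg.lstsq(X, calls, rcond=None)

predicted = X @ beta
residuals = calls - predicted        # positive residual = model under-predicted
sse = float(np.sum(residuals ** 2))  # the quantity that OLS minimizes

print("intercept, slope:", beta)
print("sum of squared residuals:", sse)
```

Any other choice of intercept and slope would give a larger sum of squared residuals for these data, which is what "ordinary least squares" refers to.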
Geographically Weighted Regression (GWR) is another linear regression model, one that fits a separate regression equation to each feature in a dataset. It differs from the OLS method in that only the dependent and explanatory variables of features within the given bandwidth of each target feature are incorporated, resulting in geographically specific output. The user defines the shape and size of the bandwidth by selecting Kernel Type, Bandwidth Method, Distance, and Number of Features (esri.com, 2019).
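To make the idea of "one local equation per feature" concrete, here is a minimal sketch of a geographically weighted fit. It assumes an adaptive bandwidth (distance to the k-th nearest neighbor) and a Gaussian distance-decay weight; both are illustrative choices, and the function name and synthetic data are hypothetical rather than taken from the ArcGIS tool.

```python
import numpy as np

def gwr_local_coefficients(coords, X, y, n_neighbors):
    """Fit one weighted least-squares regression per feature.

    coords: (n, 2) feature coordinates; X: (n, p) explanatory variables;
    y: (n,) dependent variable. Returns an (n, p + 1) array of local
    coefficients, intercept first.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    local_betas = np.empty((n, design.shape[1]))
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        # Adaptive bandwidth: distance to the n_neighbors-th nearest feature
        # (index 0 of the sorted distances is the target feature itself).
        bandwidth = np.sort(dist)[n_neighbors]
        # Gaussian distance decay (an assumed kernel shape).
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)
        # Features beyond the bandwidth are left out of this local fit.
        w[dist > bandwidth] = 0.0
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        local_betas[i] = beta
    return local_betas

# Synthetic example: the slope of the relationship grows from west to east.
rng = np.random.default_rng(42)
coords = rng.uniform(0, 10, size=(60, 2))
x = rng.normal(size=(60, 1))
y = 1.0 + (0.5 + 0.3 * coords[:, 0]) * x[:, 0]
betas = gwr_local_coefficients(coords, x, y, n_neighbors=15)
print(betas[:3])  # local intercept and slope for the first three features
```

Because every feature gets its own coefficients, they can be mapped, which is what distinguishes GWR output from the single global equation produced by OLS.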
In the tutorial, we are presented with a scenario in which Portland is spending too many public resources on responding to 911 emergency calls. A growing population is pressuring the municipality to improve access to police and fire stations and medical services in strategic areas in order to reduce call volumes, thus reducing costs while improving livability for residents.
A hot spot analysis can help answer the question of where call volumes are highest, which is the first step to determining the factors contributing to high volumes of 911 calls.
Figure 2: Hot Spot Analysis of Portland, showing data from over 2000 911 emergency calls.
Next, we run the OLS tool to determine if there may be a correlation between the number of 911 calls and population. Figure 3 shows the aggregate number of calls per census tract, which is useful as we also have access to population data by census tract, which allows us to draw meaningful conclusions from our analysis results. Figure 4 demonstrates the results of the OLS Regression Analysis with calls/census tract as the Dependent Variable, and population/census tract as the Explanatory Variable. Red areas represent areas where the model under-predicted (the number of calls is higher than the model predicted), while the blue areas represent over-predictions, or areas where actual call volumes are lower than predicted (ArcGIS, 2012).
Visual analysis of the results of an OLS Regression Analysis can give an initial indication of whether our hypothesis was correct. Red and blue areas represent outliers (under- and over-predictions, respectively). If these areas are clustered together spatially, this can indicate that there is a pattern in our data that our initial OLS analysis did not pick up on, and that additional explanatory variables are needed to account for it. Figure 4 shows definite clustering of both red and blue census tracts, suggesting that we have not yet accurately identified the factors contributing to our dependent variable.
The numeric data produced when running the OLS tool can also tell us if our model is predicting the patterns in the data satisfactorily. The R-Squared value (the coefficient of determination, which in a model with a single explanatory variable equals the square of Pearson's correlation coefficient) indicates how much of the variation in the dependent variable is explained by the model, with an R-Squared of 1 being a perfect fit, and 0 meaning that none of the variation is explained. In this case, the OLS Tool produces an Adjusted R-Squared value of 0.393, which tells us that about 39.3% of the variation in call volume (the dependent variable) can be explained by population (the chosen explanatory variable). Figure 5 shows the data output from the OLS Tool.
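As a rough illustration of how R-Squared and Adjusted R-Squared are computed, the sketch below uses invented tract values (not the Portland data) and hypothetical function names. It also checks that, with a single explanatory variable, R-Squared matches the squared Pearson correlation.

```python
import numpy as np

def r_squared(observed, predicted):
    """Share of the variation in `observed` that the predictions account for."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - np.mean(observed)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_explanatory):
    """Penalize R-Squared for the number of explanatory variables used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_explanatory - 1)

# Hypothetical toy data (invented for illustration).
population = np.array([1200.0, 3400.0, 2100.0, 5600.0, 4300.0, 800.0])
calls = np.array([14.0, 35.0, 18.0, 50.0, 47.0, 9.0])

# np.polyfit returns the highest-degree coefficient first: slope, then intercept.
slope, intercept = np.polyfit(population, calls, deg=1)
predicted = intercept + slope * population

r2 = r_squared(calls, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(calls), n_explanatory=1)
pearson_r = np.corrcoef(population, calls)[0, 1]

print("R-Squared:", r2)
print("Adjusted R-Squared:", adj_r2)
print("Pearson r squared:", pearson_r ** 2)
```

The adjusted value is always a little lower than the plain R-Squared for an imperfect fit, which makes it the fairer number for comparing models with different numbers of explanatory variables.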
Figure 3: Call Volume by Census Tract, Portland, OR
Figure 4: Results of OLS Regression Analysis - Calls vs. Population
Figure 5: OLS Results Summary
Since our initial analysis revealed that the population variable does not explain our call volume sufficiently, we must look at other potential explanatory variables. The Scatter Plot Matrix Graph can give us a clue as to which variables might be influencing our dependent variable. Figure 5 shows a series of Scatter Plot Matrices comparing various combinations of explanatory and dependent variables. If a correlation exists, a pattern should be visible in the data distribution. "Candidate models" can be developed and run, based on the scatter plot matrices, and their levels of correlation assessed, until a more suitable OLS model is developed. In this case, Low Education, Distance to Urban Centres, Population, and Jobs were found to sufficiently explain the dependent variable.
Figure 5: Scatter Plot Matrix showing correlations between various factors that may influence call volume.
Figure 6 shows the new OLS map. Notice that clustering is not apparent, suggesting that the remaining over- and under-predictions are not related by some factor that we have not considered. We can support this visual analysis using the Spatial Autocorrelation Tool. Figure 7 shows the results of the Autocorrelation report, indicating that the residuals are not significantly clustered, i.e. their spatial pattern is consistent with a random distribution.
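The index behind a spatial autocorrelation check of this kind is Moran's I. The sketch below computes only the index itself, assuming inverse-distance weights between all pairs of features. The ArcGIS tool additionally reports a z-score and p-value, which are not reproduced here, and the coordinates and residuals are randomly generated stand-ins.

```python
import numpy as np

def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights (an assumed weighting)."""
    z = values - values.mean()
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = np.zeros_like(dist)
    positive = dist > 0            # skips the diagonal (and coincident points)
    w[positive] = 1.0 / dist[positive]
    n = len(values)
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# Hypothetical residuals at random locations (stand-ins for tract centroids).
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(50, 2))
residuals = rng.normal(size=50)

index = morans_i(residuals, coords)
expected_if_random = -1.0 / (len(residuals) - 1)
print("Moran's I:", index)
print("Expected value under spatial randomness:", expected_if_random)
```

An index close to the expected value suggests no spatial structure in the residuals, while a clearly larger index points to clustering of similar residuals.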
Figure 6: Results of refined OLS Regression Analysis
Figure 7: Spatial Autocorrelation Results
Figure 8 shows the OLS data. According to the ArcGIS tutorial (2012), there are six checks that will help us to determine if our model is well specified. We must check the following:
1. Coefficients: a positive coefficient indicates a positive relationship between the explanatory and dependent variables, while a negative coefficient indicates a negative relationship. In the current project, population, jobs, and low education have positive relationships with 911 calls, meaning that, as population, jobs, and the number of people with low education go up, the number of 911 calls increases. These relationships seem reasonable. Since the coefficient for distance to urban centers is negative, as the distance to urban centers increases, the number of 911 calls goes down, and vice versa.
2. Redundancy: redundancy among explanatory variables can be checked using the VIF (variance inflation factor) value. It should be smaller than 7.5 for a well developed model. A VIF larger than 7.5 suggests that two or more variables are explaining the same trend, which can contribute to over-count bias. To adjust, variables with large VIFs can be removed, one by one, until the VIF values are all below the threshold. In our case, population and low education have the highest VIF values, at 1.733935 and 1.727065 respectively. These are not over the 7.5 threshold, so no variables will be removed.
3. Statistically significant coefficients: by observing the Probability and Robust Probability columns, we can gauge the statistical significance of our variables. An asterisk (*) beside a probability value indicates significance. Smaller probabilities indicate higher significance. In the example below, all of the variables are significant, but Low Education and Distance to Urban Centres have the highest significance. Note that, when the Koenker (BP) Statistic [f] is statistically significant (as it is in this case), only the Robust Probability column should be used to judge significance. A significant Koenker test indicates that the relationship between some or all explanatory variables and the dependent variable varies geographically; that is, while a variable may be a good predictor in some locations, it is weak in others. *In cases like this, model results can be improved by performing a GWR in place of an OLS.*
4. Normally distributed residuals: the Jarque-Bera test indicates whether the residuals follow a normal distribution. If the Jarque-Bera test is NOT significant, the residuals are consistent with a normal distribution, and the model is unlikely to be biased. However, if the Jarque-Bera test IS significant, that suggests bias in the model, and calls for the addition of one or more key explanatory variables. Note that this test concerns the distribution of the residual values, not their spatial arrangement, which is checked separately in point 6. In the current project, there is no asterisk on the results for the Jarque-Bera Statistic, indicating that it is not significant, and that our residuals are normally distributed.
5. Strong R-Squared value: the adjusted R-Squared value measures the performance of a model, with values near 0 being poor, and values near 1.0 being the best. After the addition of our three additional explanatory variables, our Adjusted R-Squared is now 0.831, indicating that our equation explains 83.1% of the variation in the dependent variable - an improvement over the value of 0.393 that was obtained when population was the only explanatory variable. Akaike's Information Criterion (AIC) value can also be used to gauge model performance, with a lower value being desirable. When only population was used to explain the call volume, the AIC value was 788.762573. In the current model iteration, the value is 683.470629, an improvement on our first version.
6. Spatial auto-correlation of residuals: as mentioned previously, the residuals should not be spatially auto-correlated. The results of the Spatial Auto-correlation test, shown in Figure 7, indicate that our residuals are not clustered, and the given model can be said to pass this test.
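The redundancy check in point 2 can be illustrated with a short sketch. The VIF of a variable is 1 / (1 - R-Squared) from regressing that variable on all the other explanatory variables. The function name and the synthetic columns below are hypothetical; only the 7.5 threshold comes from the tutorial.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (rows = features, columns = explanatory variables)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Synthetic example: the third column is almost a copy of the first.
rng = np.random.default_rng(7)
a = rng.normal(size=200)
b = rng.normal(size=200)
near_copy_of_a = a + 0.05 * rng.normal(size=200)
X = np.column_stack([a, b, near_copy_of_a])

vifs = variance_inflation_factors(X)
print(vifs)
print("redundant columns:", np.where(vifs > 7.5)[0])  # 7.5 threshold from the tutorial
```

In this synthetic case the first and third columns tell the same story, so both receive very large VIFs, while the independent second column stays close to 1.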
Figure 8: Output data from refined OLS Regression Analysis
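For reference, the Jarque-Bera statistic from point 4 can be computed from the skewness and kurtosis of the residuals. The sketch below is a plain NumPy version under the usual large-sample assumption that the statistic follows a chi-square distribution with 2 degrees of freedom. The function name and sample data are hypothetical, and the exact values reported by ArcGIS may be calculated somewhat differently.

```python
import math
import numpy as np

def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its approximate p-value."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    centered = r - r.mean()
    m2 = np.mean(centered ** 2)
    skewness = np.mean(centered ** 3) / m2 ** 1.5
    kurtosis = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # For a chi-square distribution with 2 degrees of freedom,
    # the upper-tail probability is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Hypothetical residuals: one roughly bell-shaped sample, one strongly skewed.
rng = np.random.default_rng(3)
bell_shaped = rng.normal(size=500)
skewed = rng.exponential(size=500)

print("bell-shaped:", jarque_bera(bell_shaped))
print("skewed:", jarque_bera(skewed))
```

A small p-value corresponds to the asterisk in the ArcGIS output: the residuals depart from a normal distribution and the model may be missing something.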
While our model passes each of the assessments presented above and is a fair explanation of the 911 call volume variable, the statistically significant Koenker (BP) result suggests that the model could be made more robust by transitioning to a GWR model. This next section details the transition from OLS to GWR.
2.3.1 Setting up the GWR Model
We can use the same explanatory variables to create our new GWR model. Using the Geographically Weighted Regression tool, we input the shapefile containing the number of 911 calls per census tract, set Calls as our dependent variable, and set Population, Jobs, Low Education, and Distance to Urban Centers as our explanatory variables. The Kernel type is set to ADAPTIVE, and the Bandwidth method is set to AICc, which allows the tool to find the optimal number of neighbors for minimizing bias and maximizing model fit. The resulting map is shown in Figure 9, and the data results are presented in Figure 10.
2.3.2 Result Analysis
Observing the data results, we can see that the tool chose 46 neighbors as optimal. Our AIC value is now 674.478216, a decrease of about 9 from the 683.470629 obtained with the OLS model. Note that a decrease greater than 3 indicates an improvement in model performance (ArcGIS, 2012). Importantly, our Adjusted R-Squared value is now 0.87052, an improvement over the 0.83108 given by the OLS model. To check the residual distribution, we can perform the Spatial Auto-correlation analysis, as we did with the OLS model. At first glance, the residuals are relatively evenly distributed (i.e. not clustered). The Spatial Auto-correlation results confirm this, as seen in Figure 11.
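The model comparison can be checked with simple arithmetic on the values reported in this lab. The variable names below are hypothetical; the figures are the AIC and Adjusted R-Squared values quoted for the refined OLS model and the GWR model, and the threshold of 3 is the rule of thumb cited from the tutorial.

```python
# Values reported in this lab for the refined OLS model and the GWR model.
ols_aic = 683.470629
gwr_aic = 674.478216
ols_adj_r2 = 0.83108
gwr_adj_r2 = 0.87052

aic_decrease = ols_aic - gwr_aic
adj_r2_gain = gwr_adj_r2 - ols_adj_r2

AIC_THRESHOLD = 3.0  # rule of thumb cited from the tutorial (ArcGIS, 2012)

print(round(aic_decrease, 2))        # 8.99
print(round(adj_r2_gain, 5))         # 0.03944
print(aic_decrease > AIC_THRESHOLD)  # True
```

Since the AIC decrease is well above the threshold, the GWR model counts as a genuine improvement and not just a marginally different fit.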
Figure 9: GWR Output Map
Figure 10: Output Table from GWR tool application
Figure 11: Spatial Auto-correlation Results for GWR Output
The relative strength of an explanatory variable as a predictor of the dependent variable can be mapped for further geographic analysis. Figures 12-15 show the relative prediction strength of each coefficient, where red indicates strong predictive ability, and blue indicates a weak relationship between the explanatory variable and 911 call volume. This type of analysis allows for the tailoring of solutions to regions where they will have the greatest impact, saving resources. For example, the population explanatory variable has the strongest relationship in the far north of the study area. It could be that simply adding more services in this area would help to address the issue there. Likewise, a program to encourage students to stay in school may have the most impact on 911 call volume in the red district in Figure 14. This is an important feature of the GWR - it allows for targeting of policies to areas where they will do the most good.
Figure 12: Population coefficient analysis
Figure 13: Job coefficient analysis
Figure 14: Low education coefficient analysis
Figure 15: Distance to urban centers coefficient analysis
2.3.3. GWR Predictions
Predictions can be used when values are available for explanatory variables, but not for dependent variables. In the current example, GWR is used to predict future call volumes. Figure 16 shows the model developed for this purpose. The output map is shown in Figure 17. This final map provides useful data for anticipating future demand for 911 services, and can be used as a base to analyse the effectiveness of present day policies designed to address the 911 Call Volume issue.
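Conceptually, a GWR prediction fits a local equation at each prediction location from the surrounding observed features, then applies it to that location's explanatory values. The sketch below illustrates this with an assumed Gaussian kernel and nearest-neighbor bandwidth. The function name, parameters, and synthetic data are hypothetical and do not reproduce the ArcGIS tool's internals.

```python
import numpy as np

def gwr_predict(coords, X, y, new_coords, new_X, n_neighbors):
    """Predict y at new locations from locally weighted regressions.

    coords, X, y: observed locations (n, 2), explanatory values (n, p),
    and dependent values (n,). new_coords (m, 2) and new_X (m, p) describe
    the locations to predict, where only explanatory values are known.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    predictions = np.empty(len(new_coords))
    for i, (location, x_new) in enumerate(zip(new_coords, new_X)):
        dist = np.linalg.norm(coords - location, axis=1)
        # Bandwidth: distance to the n_neighbors-th nearest observed feature.
        bandwidth = np.sort(dist)[n_neighbors - 1]
        # Gaussian weights inside the bandwidth, zero outside (assumed kernel).
        w = np.where(dist <= bandwidth, np.exp(-0.5 * (dist / bandwidth) ** 2), 0.0)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        predictions[i] = beta[0] + x_new @ beta[1:]
    return predictions

# Synthetic example: observed tracts plus two locations with projected values.
rng = np.random.default_rng(11)
coords = rng.uniform(0, 10, size=(40, 2))
X = rng.normal(size=(40, 1))
y = 3.0 + 2.0 * X[:, 0]  # exact linear relationship, for easy checking
new_coords = np.array([[2.0, 2.0], [8.0, 5.0]])
new_X = np.array([[1.0], [-2.0]])

print(gwr_predict(coords, X, y, new_coords, new_X, n_neighbors=12))
```

In the lab scenario, the new explanatory values would be projected future values for each tract, so the output is a map of expected call volumes.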
Figure 16: GWR Prediction Model
Figure 17: GWR Prediction map of future 911 calls.
References:
ArcGIS. (2012). Tutorial: Regression Analysis in ArcGIS. Retrieved from: http://www.arcgis.com/home/item.html?id=71a65d35688a4502b123cbdfc99afdee. Accessed on: 30 January 2019.
Done for Advanced GIS for Natural Resource Management, in the McGill University Department of Natural Resource Sciences, Professor Jeffrey Cardille