Lab 4 - Regression Analysis

Lab Overview

Objective:

In Lab 4, we explore ArcGIS regression tools. Regression is often used as an exploratory analysis at the beginning of a project, to identify potential relationships between dependent and independent variables. In this case, we analyze 911 call volumes in Portland, OR, to determine the factors leading to high call volumes and to inform policies to reduce those volumes in the future. This is done using both the Ordinary Least Squares Regression (OLS) and Geographically Weighted Regression (GWR) tools in ArcGIS. In addition, geographically specific predictions of future 911 calls are carried out using the GWR tool.

1.0 What is Regression Analysis?

Regression analysis is a method of exploring spatial relationships that allows for the prediction of outcomes based on improved understanding of the factors behind spatial patterns (ArcGIS, 2012). In other words, regression analysis can help us to understand relationships between different elements in a geographical space, such as the relationship between income per capita and access to healthy food sources. Several types of regression exist. Ordinary Least Squares regression (OLS) and Geographically Weighted Regression (GWR) are two main regression strategies, both found in the Spatial Statistics toolset. In general, regression involves several components (ArcGIS, 2012):

Dependent variable (Y): what you are trying to predict
Explanatory variables (X): variables believed to influence the dependent variable Y
Coefficients (β): values reflecting the strength of influence each explanatory variable has on the dependent variable (computed by the regression tool)
Residuals (ε): the portion of the dependent variable not explained by the model

The regression tool essentially develops an equation to predict your dependent variable Y, based on the other factors presented above. The equation looks something like this:

Figure 1: Regression Equation (ArcGIS, 2012)

Note: the sign (+/-) associated with each coefficient determines if the relationship of the explanatory variable with the dependent variable is positive or negative (ArcGIS, 2012).
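For reference, a linear regression equation of this kind is conventionally written as shown below. This is the standard textbook form using the symbols defined above, not a reproduction of Figure 1. The intercept β0 and the subscript n (the number of explanatory variables) are notation assumed here.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
```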

1.1 Ordinary Least Squares Regression (OLS)

Ordinary Least Squares (OLS) Regression is a linear regression model. In OLS, the sum of the squared differences between observed and predicted values is minimized. It does not take into account any potential geographic variation in the data, and its performance degrades when more than an optimal number of explanatory variables is used, which can make it less suitable for some GIS applications (xlstat.com, 2019).
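The least-squares idea can be illustrated with a minimal sketch for a single explanatory variable, using the closed-form solution for the slope and intercept. The function names and the five-tract data set are hypothetical and chosen only for illustration. This is not how the ArcGIS tool is implemented.

```python
def fit_ols(x, y):
    """Fit y = b0 + b1*x by minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # slope (coefficient)
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: population (thousands) and 911 calls for five tracts.
population = [1.0, 2.0, 3.0, 4.0, 5.0]
calls = [12.0, 19.0, 33.0, 38.0, 52.0]

b0, b1 = fit_ols(population, calls)
best = sum_squared_residuals(population, calls, b0, b1)
# Any other line gives a larger sum of squared residuals.
worse = sum_squared_residuals(population, calls, b0 + 0.5, b1)
print(b0, b1, best, worse)
```

The positive slope here corresponds to a positive coefficient sign: more population, more calls.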

1.2 Geographically Weighted Regression (GWR)

Geographically Weighted Regression (GWR) is another linear regression model that fits a regression equation to each feature in a dataset. It differs from the OLS method in that only the dependent and explanatory variables of features within the given bandwidth of each target feature are incorporated, resulting in geographically specific output. The user defines the shape and size of the bandwidth by selecting Kernel Type, Bandwidth Method, Distance, and Number of Features (Esri.com, 2019).
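The local-fitting idea can be sketched as follows for one explanatory variable. This is a simplified illustration, not the ArcGIS implementation. It assumes an adaptive kernel, where the bandwidth is the distance to the k-th nearest feature, and a bisquare weighting function, which is chosen here for brevity and may differ from the kernel the tool uses. All names and data are hypothetical.

```python
import math


def local_fit(target, points, x, y, k):
    """Fit y = b0 + b1*x around `target` using its nearest features.

    Adaptive kernel: the bandwidth is the distance to the k-th nearest
    feature, and weights fall off with a bisquare function of distance.
    """
    dists = [math.dist(target, p) for p in points]
    bandwidth = sorted(dists)[k - 1]
    # Features at or beyond the bandwidth get zero weight.
    w = [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
         for d in dists]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return my - b1 * mx, b1


# Hypothetical features: a western group where y = 2x and an eastern
# group where y = 5x, so the relationship varies geographically.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (13, 0)]
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0, 5.0, 10.0, 15.0, 20.0]

west_b0, west_b1 = local_fit(points[0], points, x, y, k=4)
east_b0, east_b1 = local_fit(points[7], points, x, y, k=4)
print(west_b1, east_b1)
```

Each target feature gets its own coefficient, whereas a single global OLS fit would report one slope for the whole study area.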

2.0 Analyzing 911 Calls in Portland, Oregon

In the tutorial, we are presented with a scenario in which Portland is spending too many public resources on responding to 911 emergency calls. A growing population is pressuring the municipality to improve access to police and fire stations and medical services in strategic areas in order to reduce call volumes, thus reducing costs while improving livability for residents.

2.1 Hot Spot Analysis

A hot spot analysis can help answer the question of where call volumes are highest, which is the first step to determining the factors contributing to high volumes of 911 calls.

Figure 2: Hot Spot Analysis of Portland, showing data from over 2000 911 emergency calls.

Regions with relatively high call volumes are in red. This map can be used in conjunction with the locations of police/fire/medical service stations to assess which areas may require improved access to these services.

2.2 OLS Regression Analysis

Next, we run the OLS tool to determine whether there may be a correlation between the number of 911 calls and population. Figure 3 shows the aggregate number of calls per census tract. Because population data are also available by census tract, the two can be compared directly, which allows us to draw meaningful conclusions from the analysis. Figure 4 shows the results of the OLS Regression Analysis with calls per census tract as the Dependent Variable and population per census tract as the Explanatory Variable. Red areas are where the model under-predicted (the actual number of calls is higher than predicted), while blue areas are over-predictions, where actual call volumes are lower than predicted (ArcGIS, 2012).

2.2.1 Clustering

Visual analysis of the results of an OLS Regression Analysis can give an initial indication of whether our hypothesis was correct. Since red and blue areas represent outliers (under- and over-predictions, respectively), if these areas are clustered together spatially, this can indicate that there is a pattern in our data that our initial OLS analysis did not pick up on. Additional explanatory variables are needed to account for these patterns. Figure 4 shows definite clustering of both red and blue census tracts, suggesting that we have not yet identified all the factors contributing to our dependent variable.

2.2.2 Numeric Data from the OLS Tool

The numeric output produced by the OLS tool can also tell us whether our model predicts the patterns in the data satisfactorily. The R-Squared value (the coefficient of determination; in a model with a single explanatory variable it equals the square of Pearson's correlation coefficient) indicates how much of the variation in the dependent variable the model explains, with 1 being a perfect fit and 0 meaning the model explains none of the variation. In this case, the OLS Tool produces an Adjusted R-Squared value of 0.393, which tells us that 39.3% of the variation in call volume (the dependent variable) can be explained by population (the chosen explanatory variable). Figure 5 shows the data output from the OLS Tool.
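As a minimal sketch, R-Squared and Adjusted R-Squared can be computed from observed and predicted values using their standard definitions. The function names and the five-tract data are hypothetical, and the numbers are unrelated to the Portland results.

```python
def r_squared(observed, predicted):
    """Share of the variation in `observed` explained by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_explanatory):
    """Penalize R-Squared for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_explanatory - 1)


# Hypothetical observed call counts and model predictions for five tracts.
observed = [12.0, 19.0, 33.0, 38.0, 52.0]
predicted = [11.0, 20.9, 30.8, 40.7, 50.6]

r2 = r_squared(observed, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(observed), n_explanatory=1)
print(r2, adj_r2)
```

The adjusted value is never higher than the plain R-Squared, which is why it is the safer figure to compare between models with different numbers of explanatory variables.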

Figure 3: Call Volume by Census Tract, Portland, OR

Figure 4: Results of OLS Regression Analysis - Calls vs. Population

Figure 5: OLS Results Summary

2.2.3 Identifying other Explanatory Variables

Since our initial analysis revealed that the population variable does not explain our call volume sufficiently, we must look at other potential explanatory variables. The Scatter Plot Matrix Graph can give us a clue as to which variables might be influencing our dependent variable. Figure 5 shows a series of Scatter Plot Matrices comparing various combinations of explanatory and dependent variables. If a correlation exists, a pattern should be visible in the data distribution. "Candidate models" can be developed and run, based on the scatter plot matrices, and their levels of correlation assessed, until a more suitable OLS model is developed. In this case, Low Education, Distance to Urban Centers, Population, and Jobs were found to sufficiently explain the dependent variable.

Figure 5: Scatter Plot Matrix showing correlations between various factors that may influence call volume.

Figure 6 shows the new OLS map. Notice that clustering is not apparent, suggesting that the remaining over- and under-predictions are not related by some factor that we have not considered. We can support this visual analysis using the Spatial Autocorrelation Tool. Figure 7 shows the results of the Spatial Autocorrelation report, indicating no statistically significant clustering: the spatial pattern of the residuals is consistent with a random distribution.
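Spatial autocorrelation of this kind is commonly measured with the global Moran's I statistic, which the following minimal sketch computes. It omits the z-score and p-value that a full report includes. The six-tract chain of neighbors and the residual values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a spatial weights matrix.

    weights[i][j] > 0 when features i and j are neighbors, 0 otherwise.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(d * d for d in dev)


# Hypothetical study area: six tracts in a row, each adjacent to the next.
n = 6
weights = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
           for i in range(n)]

clustered = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]     # similar values adjacent
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # opposite values adjacent

i_clustered = morans_i(clustered, weights)
i_alternating = morans_i(alternating, weights)
print(i_clustered, i_alternating)
```

Values well above zero indicate clustering of similar residuals and values well below zero indicate dispersion. Values near -1/(n-1), the expectation under no autocorrelation, are consistent with a random pattern.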

Figure 6: Results of refined OLS Regression Analysis

Population, Low Education, Jobs, and Distance from Urban Centers are used as the explanatory variables for call volume

Figure 7: Spatial Autocorrelation Results

2.2.4 Assessing the suitability of our model

Figure 8 shows the OLS output data. According to the ArcGIS tutorial (2012), there are six checks that help us determine whether our model is well specified. We must check that:

the coefficients have the expected sign
there is no redundancy among the explanatory variables
the coefficients are statistically significant
the residuals are normally distributed
the Adjusted R-Squared value is strong
the residuals are not spatially auto-correlated

1. Coefficients: a positive coefficient indicates a positive relationship between the explanatory and dependent variables, while a negative coefficient indicates a negative relationship. In the current project, population, jobs, and low education have positive relationships with 911 calls, meaning that, as population, jobs, and the number of people with low education go up, the number of 911 calls increases. These relationships seem reasonable. Since the coefficient for distance to urban centers is negative, as the distance to urban centers increases, the number of 911 calls goes down, and vice versa.

2. Redundancy: redundancy among explanatory variables can be checked using the VIF (variance inflation factor) value. It should be smaller than 7.5 for a well-developed model. A VIF larger than 7.5 suggests that two or more variables are explaining the same trend, which can contribute to over-count bias. To adjust, variables with large VIFs can be removed, one by one, until the VIF values are all below the threshold. In our case, population and low education have the highest VIF values, at 1.733935 and 1.727065 respectively. These are well under the 7.5 threshold, so no variables will be removed.

3. Statistically significant coefficients: by observing the Probability and Robust Probability columns, we can gauge the statistical significance of our variables. An asterisk (*) beside a probability value indicates significance. Smaller probabilities indicate higher significance. In the example below, all of the variables are significant, but Low Education and Distance to Urban Centers have the highest significance. Note that, when the Koenker (BP) Statistic [f] is statistically significant (as it is in this case), only the Robust Probability column should be used to judge significance. A significant Koenker test indicates that the relationship between some or all explanatory variables and the dependent variable varies geographically: a variable may be a good predictor in some locations but a weak one in others. *In cases like this, model results can be improved by performing a GWR in place of an OLS.*

4. Normally distributed residuals: the Jarque-Bera test indicates whether the residuals are normally distributed. Note that this test concerns the distribution of the residual values, not their spatial arrangement, which is checked separately in step 6. If the Jarque-Bera test is NOT significant, the residuals do not deviate significantly from a normal distribution, and the model passes this check. However, if the Jarque-Bera test IS significant, that suggests bias in the model, and often calls for the addition of one or more key explanatory variables. In the current project, there is no asterisk on the results for the Jarque-Bera Statistic, indicating that it is not significant, and that our residuals can be treated as normally distributed.

5. Strong R-Squared value: the Adjusted R-Squared value measures the performance of a model, with values near 0 being poor, and values near 1.0 being the best. After the addition of three more explanatory variables, our Adjusted R-Squared is now 0.831, indicating that our equation explains 83.1% of the variation in the dependent variable - an improvement over the value of 0.393 that was obtained when population was the only explanatory variable. The Akaike Information Criterion (AIC) value can also be used to gauge model performance, with a lower value being desirable. When only population was used to explain the call volume, the AIC value was 788.762573. In the current model iteration, the value is 683.470629, an improvement on our first version.

6. Spatial auto-correlation of residuals: as mentioned previously, the residuals should not be spatially auto-correlated. The results of the Spatial Auto-correlation test, shown in Figure 7, indicate that our residuals are not clustered, and the given model can be said to pass this test.
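The redundancy check in step 2 can be sketched numerically. The VIF of an explanatory variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other explanatory variables. The sketch below uses a small pure-Python least-squares solver. All function names and data are hypothetical, and it is not the ArcGIS implementation.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared_on_others(target, others):
    """R-squared from regressing `target` on the columns in `others`."""
    rows = [[1.0] + [col[i] for col in others] for i in range(len(target))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
           for a in range(k)]
    xty = [sum(r[a] * t for r, t in zip(rows, target)) for a in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for t, p in zip(target, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return 1 - ss_res / ss_tot


def vif(columns):
    """VIF for each explanatory variable: 1 / (1 - R^2 on the others)."""
    result = []
    for i, col in enumerate(columns):
        others = columns[:i] + columns[i + 1:]
        result.append(1 / (1 - r_squared_on_others(col, others)))
    return result


# Hypothetical explanatory variables for five tracts: the second column
# nearly duplicates the first, so both should exceed the 7.5 threshold.
population = [1.0, 2.0, 3.0, 4.0, 5.0]
low_education = [1.0, 2.0, 3.0, 4.0, 6.0]
redundant_vifs = vif([population, low_education])

# Two uncorrelated hypothetical variables: VIF should be close to 1.
jobs = [1.0, 2.0, 3.0, 4.0]
distance = [1.0, -1.0, -1.0, 1.0]
independent_vifs = vif([jobs, distance])
print(redundant_vifs, independent_vifs)
```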

Figure 8: Output data from refined OLS Regression Analysis

Note the adjusted R-Squared value of 0.831080, or 83.1%. This is sufficient to suggest that our OLS model accounts for most of the factors affecting call volume.
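The Jarque-Bera statistic listed among the OLS diagnostics can be sketched with its standard formula, which combines the skewness and kurtosis of the residuals. The p-value below uses the asymptotic chi-square distribution with 2 degrees of freedom, which is only a rough guide for small samples. The function name and residual values are hypothetical, and the tool's reported probability may be computed differently.

```python
import math


def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees
    # of freedom, which has the closed form exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Hypothetical residuals: one symmetric set and one strongly skewed set.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [-1.0] * 9 + [9.0]

jb_sym, p_sym = jarque_bera(symmetric)
jb_skew, p_skew = jarque_bera(skewed)
print(jb_sym, p_sym, jb_skew, p_skew)
```

A p-value above the chosen significance level (commonly 0.05) corresponds to the "no asterisk" case: no evidence that the residuals depart from a normal distribution.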

2.3 Geographically Weighted Regression (GWR)

While our model passes each of the assessments presented above and is a fair explanation of the 911 call volume variable, the statistically significant Koenker (BP) result suggests that the model could be made more robust by moving to a GWR model. This next section details the transition from OLS to GWR.

2.3.1 Setting up the GWR Model

We can use the same explanatory variables to create our new GWR model. Using the Geographically Weighted Regression tool, we input the shapefile containing the number of 911 calls per census tract, and set Calls as our dependent variable and Population, Jobs, Low Education, and Distance to Urban Centers as our explanatory variables. The Kernel type is set to ADAPTIVE, and the Bandwidth method is set to AICc, which allows the tool to find the optimal number of neighbors for minimizing bias and maximizing model fit. The resulting map is shown in Figure 9, and the data results are presented in Figure 10.

2.3.2 Result Analysis

Observing the data results, we can see that the tool chose 46 neighbors as optimal. Our AIC value is now 674.478216, a decrease of about 9 from the 683.470629 given by the OLS model. Note that a decrease of more than 3 indicates a real improvement in model performance (ArcGIS, 2012). Importantly, our Adjusted R-Squared value is now 0.87052, an improvement over the 0.83108 given by the OLS model. To check the residual distribution, we can perform the Spatial Auto-correlation analysis, as we did with the OLS model. At first glance, the residuals on the map appear relatively evenly distributed (i.e. not clustered). The Spatial Auto-correlation results confirm this, as seen in Figure 11.
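The AIC comparisons can be checked with simple arithmetic, using the AIC values quoted in this report. The helper name is hypothetical, and the default threshold of 3 is the rule of thumb cited from the tutorial.

```python
# AIC values quoted in this report.
AIC_OLS_POPULATION_ONLY = 788.762573
AIC_OLS_REFINED = 683.470629
AIC_GWR = 674.478216


def aic_improvement(old_aic, new_aic, threshold=3.0):
    """Return the drop in AIC and whether it exceeds the threshold."""
    drop = old_aic - new_aic
    return drop, drop > threshold


drop_refined, better_refined = aic_improvement(
    AIC_OLS_POPULATION_ONLY, AIC_OLS_REFINED)
drop_gwr, better_gwr = aic_improvement(AIC_OLS_REFINED, AIC_GWR)
print(drop_refined, better_refined)
print(drop_gwr, better_gwr)
```

Both drops exceed the threshold, so each step (adding explanatory variables, then moving to GWR) counts as an improvement under this rule.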

Figure 9: GWR Output Map

Figure 10: Output Table from GWR tool application

Figure 11: Spatial Auto-correlation Results for GWR Output

The relative strength of an explanatory variable as a predictor of the dependent variable can be mapped for further geographic analysis. Figures 12-15 show the relative predictive strength of each explanatory variable's coefficient, where red indicates strong predictive ability, and blue indicates a weak relationship between the explanatory variable and 911 call volume. This type of analysis allows solutions to be tailored to the regions where they will have the greatest impact, saving resources. For example, the population explanatory variable is the strongest predictor in the northernmost part of the study area. It could be that simply adding more services in this area would help to address the issue there. Likewise, a program to encourage students to stay in school may have the most impact on 911 call volume in the red district in Figure 14. This is an important feature of GWR - it allows policies to be targeted to the areas where they will do the most good.

Figure 12: Population coefficient analysis

Figure 13: Job coefficient analysis

Figure 14: Low education coefficient analysis

Figure 15: Distance to urban centers coefficient analysis

2.3.3 GWR Predictions

Predictions can be used when values are available for explanatory variables, but not for dependent variables. In the current example, GWR is used to predict future call volumes. Figure 16 shows the model developed for this purpose. The output map is shown in Figure 17. This final map provides useful data for anticipating future demand for 911 services, and can be used as a base to analyse the effectiveness of present day policies designed to address the 911 Call Volume issue.

Figure 16: GWR Prediction Model

Figure 17: GWR Prediction map of future 911 calls.

References:

ArcGIS. (2012). Tutorial: Regression Analysis in ArcGIS. Retrieved from: http://www.arcgis.com/home/item.html?id=71a65d35688a4502b123cbdfc99afdee. Accessed on: 30 January 2019.
Esri.com. (2019). Geographically Weighted Regression (GWR). Retrieved from: http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/spatial_statistics_tools/how_gwr_regression_works.htm. Accessed on: 02 February 2019.
xlstat.com. (2019). Ordinary Least Squares Regression (OLS). Retrieved from: https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols. Accessed on: 30 January 2019.

Done for Advanced GIS for Natural Resource Management, in the McGill University Department of Natural Resource Sciences, Professor Jeffrey Cardille
