By Matthew Martin
Homelessness is a growing problem for many states and cities in the US. California struggles with this problem, as it has one of the highest rates of homelessness per capita. California also has a "rate of homelessness higher than the national rate of 23 people experiencing homelessness per 10,000 (48 per 10,000 in CA)" (AHAR). This project aims to understand what factors contribute to the growing rate of homelessness throughout the state of California. What are the biggest factors that lead to homelessness?
The study area is the State of California. It is the most populous state in the US, with a population of around 38.9 million as of 2023. In order to understand homelessness counts, the state is broken up into polygons that represent Continuums of Care. A Continuum of Care "is a regional or local planning body that coordinates housing and services funding for homeless families and individuals" (LAHSA). California contains 44 Continuums of Care, which together cover every geographic area of the state so that everyone in the state is represented.
Demographic data such as population and estimated housing costs come from the ESRI Demographics feature layer from ArcGIS Living Atlas. This layer uses information from the US Census Bureau. The data were then spatially joined to the Continuums of Care polygons.
Map of California Continuums of Care
Image credit: Hub for Urban Initiatives, https://homelessstrategy.com/map-of-california-continuums-of-care-by-region/
Since this project focuses on the whole state of California, the projection used was the Albers Equal Area Conic projection California Coordinate System. An equal-area projection preserves the area of the state and avoids the area distortion that other coordinate systems introduce, which makes it a good fit for analyzing homelessness across all the Continuums of Care in California.
The dependent variable for this study is the homelessness rate in California. The explanatory variables that attempt to describe the homelessness rate are the Housing Affordability Index, the median household income, total population, the percent of the population that is urban, probation counts, and percent of the population that is Black or African American.
Exploratory Regression Analysis evaluates all possible combinations of explanatory variables to find a model that best explains the dependent variable. It looks for the best set of explanatory variables by checking how well they predict the dependent variable and whether the resulting model meets the requirements of Ordinary Least Squares Regression. Model fit is measured by the Adjusted R-Squared value, where a higher value means the explanatory variables explain more of the dependent variable. An Adjusted R-Squared value of around 0.50 means that the model explains about 50% of the variation in the dependent variable.
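The search described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the ArcGIS tool itself: the variable names (`var_a` through `var_d`), the generated values, and the helper functions are all assumptions made for the example, and the sketch only ranks models by Adjusted R-Squared without running the other diagnostic checks.

```python
from itertools import combinations

import numpy as np


def adjusted_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])           # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)   # least-squares fit
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Penalize R-squared for the number of explanatory variables used.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def exploratory_regression(X, y, names, min_vars=1, max_vars=5):
    """Score every subset of explanatory variables, best model first."""
    results = []
    for size in range(min_vars, max_vars + 1):
        for cols in combinations(range(X.shape[1]), size):
            score = adjusted_r2(X[:, list(cols)], y)
            results.append((score, tuple(names[c] for c in cols)))
    return sorted(results, reverse=True)


# Synthetic stand-in data: 44 rows, like the 44 Continuums of Care.
rng = np.random.default_rng(0)
names = ["var_a", "var_b", "var_c", "var_d"]   # hypothetical variable names
X = rng.normal(size=(44, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=44)

results = exploratory_regression(X, y, names, min_vars=1, max_vars=3)
best_score, best_vars = results[0]
print(best_vars, round(best_score, 3))
```

Because the synthetic response is built from `var_a` and `var_c`, the top-ranked model should include both, which is the behavior the tool relies on when it ranks candidate models.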
Ordinary Least Squares (OLS) is a linear regression method that attempts to generate predictions about a phenomenon (the dependent variable) in terms of its relationships to a set of explanatory variables. It finds the linear relationship that minimizes the sum of the squared differences between the observed values and the predicted values. The formula used to predict the dependent variable (Homelessness) is:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$
Where Y is the dependent variable, X represents the explanatory variables, β represents the coefficients that describe the weight of each explanatory variable (with $\beta_0$ as the intercept), and ε represents the random error, or the residuals.
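To make the roles of the coefficients and the residuals concrete, here is a small sketch that fits an OLS model with NumPy. The data are synthetic and the "true" coefficients are invented for the example, so the numbers say nothing about homelessness; the point is only to show the estimated coefficients, the residuals, and the least-squares property.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 44  # one row per polygon, mirroring the 44 Continuums of Care

# Two hypothetical explanatory variables (synthetic, not the study's data).
X = rng.normal(size=(n, 2))
design = np.column_stack([np.ones(n), X])    # column of ones for the intercept

true_beta = np.array([3.0, 1.5, -2.0])       # assumed intercept and weights
eps = rng.normal(scale=0.1, size=n)          # random error term
y = design @ true_beta + eps

# Estimate the coefficients by minimizing the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ beta_hat
sse = float(residuals @ residuals)

# Nudging any coefficient away from the OLS solution increases the error.
nudged = beta_hat + np.array([0.0, 0.05, 0.0])
sse_nudged = float(((y - design @ nudged) ** 2).sum())

print("estimated coefficients:", np.round(beta_hat, 2))
print("SSE at OLS solution:", round(sse, 4), "SSE after nudge:", round(sse_nudged, 4))
```

The nudged coefficients always give a larger sum of squared errors than the fitted ones, which is what "least squares" means in practice.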
Geographically Weighted Regression, or GWR, is an alternative linear regression method that accounts for nonstationarity. In other words, it checks whether the relationships in the regression model vary across space. It runs the model locally, allowing the relationships to vary, and creates a separate equation for each feature using nearby features. It is best used when the Ordinary Least Squares model shows statistically significant nonstationarity, which the Koenker statistic helps determine. It can be used to answer questions like "Do certain illness or disease occurrences increase with proximity to water features?" (ESRI). It can also be used to make predictions about how the dependent variable may change over time.
The relationships between my dependent variable and explanatory variables are nonstationary. It makes sense that the rate of homelessness would change depending on which area of the state we are looking at. Since each Continuum of Care has different geographic and demographic characteristics, these relationships would differ across parts of my study area.
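The idea of "a separate equation for each feature using nearby features" can be sketched as a weighted least-squares fit repeated at every location. The sketch below assumes a Gaussian distance-decay kernel with a fixed bandwidth, which is one common choice and not necessarily the weighting the ArcGIS tool applied in this project; the coordinates, the variable, and the function name `gwr_coefficients` are all hypothetical.

```python
import numpy as np


def gwr_coefficients(coords, X, y, bandwidth):
    """Return one row of local coefficients (intercept first) per feature."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    local = np.empty((n, design.shape[1]))
    for i in range(n):
        # Distance from feature i to every feature, turned into weights
        # that shrink smoothly as features get farther away.
        dist = np.linalg.norm(coords - coords[i], axis=1)
        weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
        root_w = np.sqrt(weights)
        # Weighted least squares: scale rows by sqrt(weight), then fit OLS.
        beta = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)[0]
        local[i] = beta
    return local


# Synthetic features along a south-to-north line (not real locations).
rng = np.random.default_rng(7)
n = 60
northing = np.linspace(0.0, 10.0, n)
coords = np.column_stack([np.zeros(n), northing])

x = rng.normal(size=n)
slope = 1.0 + 0.2 * northing                 # relationship strengthens northward
y = slope * x + rng.normal(scale=0.05, size=n)

coefs = gwr_coefficients(coords, x.reshape(-1, 1), y, bandwidth=1.5)
print("southern slope:", round(coefs[0, 1], 2), "northern slope:", round(coefs[-1, 1], 2))
```

Mapping the second column of `coefs` is the equivalent of a coefficient map: if the relationship is nonstationary, the local slopes change across the study area, as they do in this synthetic example.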
For my Exploratory Regression analysis, I chose a total of 7 explanatory variables: the Housing Affordability Index, the median household income, total population, the percent of the population that is urban, probation counts, the percent of the population that is Black or African American, and the poverty rate. I chose an R-Squared cutoff of 0.5 for the tool in order to find a model with a decent result. I had the tool search for combinations of a minimum of 1 and up to 5 explanatory variables in order to possibly find the variables that explain the homeless rate best. For the Geographically Weighted Regression Analysis, I set the neighborhood to a distance band and used the golden search option to find the best value for the distance band parameter.
Above are the top 3 models of my Exploratory Regression Analysis. None of my models pass the minimum cutoff of an R-Squared larger than 0.5. These models all used the Housing Affordability Index (HAI_CY), Median Household Income (MEDINC_CY), Median Home Value (MEDVAL_CY), and the count of population on probation (ALL_REC_TOTAL). The 2nd and 3rd models also use the percent of the population that is Black or African American (F_Black) and the percent of the population that is urban (P002_CALC_PCT0002), respectively. The variables that are constant across the models show the following relationships: as the Housing Affordability Index increases, the homeless rate increases; as the Median Household Income decreases, the homeless rate increases; as the Median Home Value increases, the homeless rate increases; and as the count of population on probation decreases, the homeless rate increases.
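For readers unfamiliar with the golden search option, the sketch below shows a standard golden-section search, on the assumption that the tool's option works along similar lines when it looks for a distance band. The score function here is a made-up stand-in with a single minimum; in a real GWR run the score would be a model fit measure computed for each candidate distance band.

```python
import math


def golden_section_minimize(score, low, high, tol=1e-6):
    """Find the input in [low, high] that minimizes a single-minimum score."""
    inv_phi = (math.sqrt(5) - 1) / 2          # about 0.618
    a, b = low, high
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    score_c, score_d = score(c), score(d)
    while (b - a) > tol:
        if score_c < score_d:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, score_d = d, c, score_c
            c = b - inv_phi * (b - a)
            score_c = score(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, score_c = c, d, score_d
            d = a + inv_phi * (b - a)
            score_d = score(d)
    return (a + b) / 2


def hypothetical_fit_score(distance_band):
    """Stand-in for a model fit measure; lowest at a distance band of 2.0."""
    return (distance_band - 2.0) ** 2 + 1.0


best_band = golden_section_minimize(hypothetical_fit_score, 0.5, 10.0)
print(round(best_band, 4))
```

Each iteration reuses one of the two previous evaluations, so only one new model fit is needed per step, which matters when every evaluation means rerunning a regression.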
My top model, which uses the explanatory variables Housing Affordability Index (HAI_CY), Median Household Income (MEDINC_CY), Median Home Value (MEDVAL_CY), and the count of population on probation (ALL_REC_TOTAL), shows a low Adjusted R-Squared value of 0.15. This means that these variables explain only about 15% of the variation in the homelessness rate. This weak result is consistent with the rest of the statistics of the model. For the Jarque-Bera statistic, I receive a value of 0, which indicates that the residuals are not normally distributed, a sign of misspecification in the model or of a missing key variable. Since this model results in a large Koenker (K(BP)) statistic of 0.35, it suggests that the relationships are not consistent across the study area. This means that this model has nonstationarity. Further, the VIF for this model is much greater than 7.5, with a value of 31.7. This suggests a high amount of multicollinearity between my explanatory variables and that I should remove some of them or replace them with others. Finally, with a Spatial Autocorrelation statistic close to +1, the model shows strong positive spatial autocorrelation, meaning that the residuals cluster together. This means that there is a high amount of misspecification in my variables.
The results of the Geographically Weighted Regression Analysis are consistent with the Exploratory Regression in producing another weak model. Using the combination of variables from the top Exploratory Regression model caused multicollinearity issues in the GWR. For the GWR to produce a result, I used the Housing Affordability Index, the total population on probation, the percent of the population that is Black or African American, and the percent in poverty (Value_Percent_). This combination of variables still produced a very low Adjusted R-Squared of 0.0628, suggesting that this model explains only about 6% of the variation in the homeless rate.
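The VIF check can be reproduced outside the tool with a short sketch. The Variance Inflation Factor for a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other explanatory variables. The data below are synthetic stand-ins: `income` and `home_value` are generated to be nearly identical on purpose, and `probation` is independent, so none of these values come from the study.

```python
import numpy as np


def r_squared(design, target):
    """R-squared of an OLS fit of target on the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - float(resid @ resid) / ss_tot


def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R^2) against the other columns."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        vifs.append(1.0 / (1.0 - r_squared(design, X[:, j])))
    return np.array(vifs)


rng = np.random.default_rng(1)
n = 44
income = rng.normal(size=n)                        # synthetic stand-in
home_value = income + 0.1 * rng.normal(size=n)     # nearly duplicates income
probation = rng.normal(size=n)                     # unrelated to the others

X = np.column_stack([income, home_value, probation])
vifs = variance_inflation_factors(X)
print(np.round(vifs, 1))
```

The two overlapping variables come out far above the 7.5 threshold while the independent one stays near 1, which is the pattern to expect when several explanatory variables carry largely the same information.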
This map shows the coefficient of the Housing Affordability Index explanatory variable. Red represents a lower coefficient and yellow represents a higher coefficient. Since this map shows a gradient from Southern to Northern California, and the coefficient differences are very small, I would assume that either the relationship does not change over distance, or that my model could be improved.
This map shows the coefficient of the Poverty Rate explanatory variable. Similar to the Housing Affordability Index coefficient map, it shows another gradient from South to North. Since both of these maps show similar characteristics, I should go back and revisit my model.
Throughout the analysis, both methods consistently indicated that I am missing several key explanatory variables that would better explain the homelessness rate throughout California. So far, the Housing Affordability Index, Median Household Income, Median Home Value, and the count of population on probation appear to account for some part of the factors that contribute to homelessness. However, since no combination of these variables creates a passing model, they give only a very weak explanation of the homelessness rate. I need to find either better variables or more variables. I also ran into multicollinearity with some of the variables I chose. I feel that using all of the related housing cost and income measures (Median Home Value, Housing Affordability Index, Median Income) amounted to using the same data several times, so using only one of those three in conjunction with other explanatory variables could lead to better results. Both types of analysis found problems in the model. The GWR analysis found several multicollinearity errors that the exploratory analysis did not find. The very low R-Squared values from the exploratory analysis are reinforced by the GWR results, where the coefficient maps give little explanation or meaning.
Since none of the models tested in the Exploratory Analysis, nor the model from the Geographically Weighted Regression analysis, produced a strong result, I feel that I was not able to answer the question, "What are the biggest factors that lead to homelessness?" More tests need to be done to find better models that pass the analysis cutoffs.
Throughout this analysis, the limitation of having few explanatory variables to describe homelessness caused poor results and no passing models. I would conclude from these analyses that describing homelessness requires different variables and possibly many more. It also makes sense that homelessness can be due to many different factors not taken into consideration in these analyses. These can include factors like veteran status, more details about incarcerated populations being released after serving sentences, or how the education of the population affects the homelessness rate. Further study and more diverse variables may create much stronger models to explain the homelessness rate.
“The 2024 Annual Homelessness Assessment Report ( ...” The 2024 Annual Homelessness Assessment Report (AHAR) to Congress, www.huduser.gov/portal/sites/default/files/pdf/2024-AHAR-Part-1.pdf. Accessed 21 Aug. 2025.
“How Geographically Weighted Regression Works.” How Geographically Weighted Regression Works-ArcGIS Pro | Documentation, ESRI, pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/how-geographicallyweightedregression-works.htm. Accessed 21 Aug. 2025.
Kennedy, Marc. “Maps for California Continuums of Care.” Homeless and Housing Strategies for California, 13 Feb. 2020, homelessstrategy.com/maps-for-california-continuums-of-care/.
“Los Angeles Continuum of Care.” Los Angeles Continuum of Care, www.lahsa.org/coc/#:~:text=A%20Continuum%20of%20Care%20(CoC,for%20homeless%20families%20and%20individuals. Accessed 21 Aug. 2025.
“Ordinary Least Squares (OLS) (Spatial Statistics).” Ordinary Least Squares (OLS) (Spatial Statistics)-ArcGIS Pro | Documentation, ESRI, pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/ordinary-least-squares.htm. Accessed 21 Aug. 2025.
“State of California Department of Justice.” OpenJustice, openjustice.doj.ca.gov/data. Accessed 21 Aug. 2025.