Using Machine Learning to Predict Urban Development

Where are the next real estate developments likely to happen?

In this project, I used machine learning algorithms to predict development potential across the City of San Francisco, CA. The model takes into account a wide range of parcel information and geospatial features, along with the City's publicly available development pipeline data from the past 10 years.

Feature Engineering - the Independent Variables

Representative parcel features were selected, including parcel size, the year the structure was built, assessed values, zoning regulations, and building square footage. These are common factors that a real estate developer considers when investigating sites for potential development, and they form the basis of the parcel dataset on which the machine learning model is trained.

Then, a series of geospatial analyses was conducted to generate parcel-based accessibility values for different types of urban amenities (transit stops, parks, schools, retail, etc.). These values were then added to the main parcel dataset as additional features.
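As a rough illustration of what a parcel-based accessibility value can look like, the sketch below computes the walking distance from every node of a small street graph to its nearest amenity, using a multi-source Dijkstra search. The toy graph, the node names, and the `nearest_amenity_distance` helper are assumptions made for illustration only; the project itself ran its analyses with dedicated network libraries on the real San Francisco walk network.

```python
import heapq


def nearest_amenity_distance(graph, amenity_nodes):
    """Multi-source Dijkstra: shortest walk distance from each node to any amenity."""
    dist = {node: float("inf") for node in graph}
    heap = []
    for node in amenity_nodes:
        dist[node] = 0.0
        heapq.heappush(heap, (0.0, node))
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter path was already found
        for neighbor, length in graph[node]:
            candidate = d + length
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical undirected walk network: node -> [(neighbor, edge length in meters)]
walk_graph = {
    "A": [("B", 100.0), ("C", 250.0)],
    "B": [("A", 100.0), ("C", 120.0), ("D", 300.0)],
    "C": [("A", 250.0), ("B", 120.0), ("D", 80.0)],
    "D": [("B", 300.0), ("C", 80.0)],
}
park_nodes = ["D"]  # hypothetical: the only park sits at node D

park_access = nearest_amenity_distance(walk_graph, park_nodes)
```

In a parcel workflow, each parcel would typically be snapped to its closest network node and inherit that node's distance as its accessibility feature.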

To automate the spatial analysis workflow, the Python package OSMnx was used in several of the accessibility analyses. The following images show a couple of sample walk networks generated with the OSMnx and Pandana libraries.

In fall 2023, some of the walk accessibility analysis components were rerun using an updated pedestrian network for the City of San Francisco. Below are some neighborhood scale maps showing walk accessibility to the nearest parks, in Mission Bay, South of Market, and Potrero Hill, respectively.

Development Susceptibility - the Dependent/Target Variable

After all the feature variables have been joined to the master parcel file, it's time for some data labeling!

All parcels that were included in the development pipeline over the past 10 years were marked as "Susceptible", and all other parcels were marked as "Not Susceptible". A first look at the existing parcel data reveals some interesting patterns. For example, parcels located near BART stations are more likely to appear in the development pipeline than those farther away, yet parcels at a mid-range distance from parks and open spaces appear in the pipeline most frequently.

Exploratory Data Analysis

In a more recent revisit to this study, only properties that are included in the pipeline and also have an Improvement-to-Land Ratio higher than 1.0 are considered "susceptible to development". With these new criteria, a smaller set of pipeline parcels is flagged as the positive label for the new machine learning exercise.
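The revised labeling rule can be sketched in a few lines. This is a minimal illustration, assuming the Improvement-to-Land Ratio is the assessed improvement value divided by the assessed land value; the field names (`parcel_id`, `improvement_value`, `land_value`), the sample parcels, and the helper functions are hypothetical and not taken from the project's data.

```python
def improvement_to_land_ratio(improvement_value, land_value):
    """Assessed improvement value divided by assessed land value (None if undefined)."""
    if land_value <= 0:
        return None
    return improvement_value / land_value


def label_parcel(parcel, pipeline_ids, ilr_threshold=1.0):
    """Return 1 ("susceptible") only if the parcel is in the pipeline AND its ILR exceeds the threshold."""
    ratio = improvement_to_land_ratio(parcel["improvement_value"], parcel["land_value"])
    in_pipeline = parcel["parcel_id"] in pipeline_ids
    return 1 if in_pipeline and ratio is not None and ratio > ilr_threshold else 0


# Hypothetical sample parcels, for illustration only
parcels = [
    {"parcel_id": "P1", "improvement_value": 900_000, "land_value": 600_000},  # ILR 1.5, in pipeline
    {"parcel_id": "P2", "improvement_value": 200_000, "land_value": 800_000},  # ILR 0.25, in pipeline
    {"parcel_id": "P3", "improvement_value": 900_000, "land_value": 300_000},  # ILR 3.0, not in pipeline
]
pipeline_ids = {"P1", "P2"}

labels = {p["parcel_id"]: label_parcel(p, pipeline_ids) for p in parcels}
```

Only P1 satisfies both conditions, which shows why the positive class shrinks relative to the original pipeline-only rule.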

In addition to the modified flagging threshold, a more rigorous variable selection process was undertaken. A Pearson correlation analysis was conducted on the numerical variables of the joined parcel dataset to identify those that are highly correlated and could potentially be removed from further analysis due to overlapping effects. The updated correlation matrix below shows that no significant correlations remain after the removal of those overlapping variables.
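One simple way to carry out this kind of correlation-based pruning is sketched here. The 0.8 cutoff, the greedy keep-first strategy, the column names, and the sample values are all assumptions for illustration; the write-up does not state which threshold or procedure was actually used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.8):
    """Greedily keep a column only if |r| with every already-kept column is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical numeric parcel variables, for illustration only
columns = {
    "lot_area": [2500, 3000, 4000, 5200, 7500],
    "building_sqft": [2400, 3100, 4100, 5000, 7600],  # nearly duplicates lot_area
    "year_built": [1990, 1925, 1978, 1910, 1960],
}

selected = drop_correlated(columns)
```

Because the pass is greedy, the result depends on column order: the first variable of a correlated pair is the one retained.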

ML Models and Selection

To create the machine learning models, the City's parcel dataset was randomly shuffled and split into two subsets: a training dataset (80%) and a testing/validation dataset (20%). Three different machine learning algorithms were trained and compared: Logistic Regression, Naive Bayes, and Random Forest. The logistic regression was implemented with LASSO regularization and the cost parameter set to 1, and the random forest was implemented with 5 trees and 5 attributes considered at each split.
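For readers who want to reproduce a comparable setup, here is a minimal sketch using scikit-learn. It is not the project's original code: the toolkit choice, the mapping of "cost" to scikit-learn's `C` parameter, the use of `GaussianNB` as the Naive Bayes variant, and the synthetic data from `make_classification` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the parcel table (NOT the project's data)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Random shuffle followed by an 80/20 split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

models = {
    # L1 (LASSO) penalty; C=1.0 is an assumed equivalent of "cost set to 1"
    "logistic_regression": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
    # Gaussian variant chosen as an assumption; the text only says "Naive Bayes"
    "naive_bayes": GaussianNB(),
    # 5 trees, 5 candidate attributes per split
    "random_forest": RandomForestClassifier(n_estimators=5, max_features=5, random_state=0),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)  # mean accuracy on the held-out 20%
```

Accuracy alone is only a first check; the comparison in this study rests on the false positive and false negative rates of each model.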

The confusion matrix, which compares model predictions against the actual classes in the validation set, is used to study the performance of the different models. For the Logistic Regression model, parcels with a predicted probability of redevelopment greater than 0.5 are labeled as "susceptible" or '1', and the remaining parcels are labeled as "not susceptible" or '0'. Of the parcels actually designated as '0' in the validation set, the logistic regression model correctly identified 89.8% as '0' and misclassified the remaining 10.2%. Similarly, of those designated as '1', the model correctly classified 93.9% and misclassified 6.1%. In other words, the false positive rate was 10.2%, and the false negative rate was 6.1%.
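The two error rates follow directly from the confusion matrix counts: the false positive rate is FP / (FP + TN) and the false negative rate is FN / (FN + TP). A small sketch, using made-up labels and probabilities rather than the study's validation set, with a hypothetical `error_rates` helper:

```python
def error_rates(actual, predicted_prob, threshold=0.5):
    """Return (false positive rate, false negative rate) for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p > threshold else 0  # probability above threshold -> "susceptible"
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    false_positive_rate = fp / (fp + tn)  # share of actual 0s predicted as 1
    false_negative_rate = fn / (fn + tp)  # share of actual 1s predicted as 0
    return false_positive_rate, false_negative_rate


# Made-up validation labels and predicted probabilities, for illustration only
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted_prob = [0.1, 0.4, 0.7, 0.2, 0.9, 0.6, 0.3, 0.8, 0.55]

fpr, fnr = error_rates(actual, predicted_prob)
```

Lowering the threshold below 0.5 would trade a lower false negative rate for a higher false positive rate.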

Comparing all three models, the Naive Bayes model has the lowest false positive rate, and the Logistic Regression model has the lowest false negative rate. From a real estate development perspective, an opportunistic developer would hardly want to miss a good candidate site because it was falsely predicted as "unsusceptible to development", which is a false negative error. Therefore, minimizing the false negative rate was the priority, and the Logistic Regression model was used for making predictions.

In addition, the Logistic Regression model has two other important advantages: 1) it can explain the various factors driving the predicted development potential, and 2) because it is a parametric model, its parameters can be explicitly adjusted to paint "what-if" scenarios. These are some of the most helpful properties of a Logistic Regression model. The following diagram shows the data and modeling process.
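A "what-if" scenario with a logistic model can be sketched as follows, under one possible reading in which an input feature is changed and the predicted probability is recomputed. The intercept, coefficients, and feature names below are hypothetical and are not the fitted values from this project.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic model: sigmoid of the intercept plus the weighted sum of features."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters, NOT the fitted values from the project
intercept = -1.0
coefficients = {"dist_to_bart_km": -0.8, "lot_area_1000sqft": 0.3}

# Baseline parcel and a scenario in which a transit station is much closer
baseline = {"dist_to_bart_km": 2.0, "lot_area_1000sqft": 4.0}
what_if = dict(baseline, dist_to_bart_km=0.5)

p_base = predict_probability(intercept, coefficients, baseline)
p_what_if = predict_probability(intercept, coefficients, what_if)

# Each coefficient is also directly interpretable: exp(coefficient) is the
# multiplicative change in the odds per one-unit increase of that feature.
odds_ratio_per_km = math.exp(coefficients["dist_to_bart_km"])
```

With these assumed values, moving the parcel closer to transit raises its predicted probability, and the odds ratio below 1 for distance expresses the same effect in a form that is easy to explain.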

Feature Importance Ranking and Prediction Results (mapped) from the Logistic Regression Model

Mapbox Interactive Visualization of Model Predictions

Finally, the prediction results were visualized in an interactive map built with a combination of Mapbox Studio, Mapbox GL, and JavaScript code, with customized toggles and additional layers of information for reference.

Explore the Map here!

P.S. The additional 2023 analysis selected a Random Forest model as the new predictor of development potential. The screenshot below shows an overlay of 2022 Quarter 4 pipeline sites on top of the predicted development probabilities from this new model. It can be observed that the majority of the pipeline sites are located on parcels with a predicted probability over 0.5, which the machine learning model classifies as "susceptible to new development".

Lastly, part of the results of this research project was published as an AI case study in The Routledge Companion to Artificial Intelligence in Architecture.

Interested in learning more? Feel free to connect with me at wenhaowu92@gmail.com!
