Using Machine Learning to Predict Urban Development

Where are the next real estate developments likely to happen?

In this project, I used machine learning algorithms to predict future development potential across the City of San Francisco, CA. The model considers a wide range of parcel information and geospatial features, drawing on the City's publicly available development pipeline data from the past 10 years.

Representative parcel features were selected, including parcel size, year built, assessed value, zoning regulations, building square footage, etc. These are common factors that a real estate developer considers when investigating sites for potential development, and they form the basis of the parcel dataset on which the machine learning model is trained.

Then, a series of geospatial analyses was conducted to generate parcel-level accessibility values for different types of urban amenities (transit stops, parks, schools, retail, etc.). These accessibility values were then joined to the main parcel dataset as additional features.
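As a rough illustration of what a parcel-level accessibility value means, here is a minimal sketch that computes the walking distance from every node of a small network to its nearest amenity, using a multi-source Dijkstra search. The toy graph, the node names, and the function `nearest_amenity_distance` are hypothetical and exist only for this example; this is not the OSMnx or Pandana API, and the actual study worked on real street networks.

```python
import heapq


def nearest_amenity_distance(graph, amenity_nodes):
    """Walk distance from every node to its nearest amenity (multi-source Dijkstra).

    graph: dict mapping node -> list of (neighbor, edge_length_m) pairs.
           Undirected edges must be listed in both directions.
    amenity_nodes: nodes where an amenity (e.g. a park entrance) is located.
    """
    dist = {node: float("inf") for node in graph}
    heap = []
    # Every amenity node starts at distance 0.
    for node in amenity_nodes:
        dist[node] = 0.0
        heapq.heappush(heap, (0.0, node))

    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, a shorter path was already found
        for neighbor, length in graph[node]:
            candidate = d + length
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical toy walk network (edge lengths in meters).
toy_graph = {
    "A": [("B", 100.0), ("D", 450.0)],
    "B": [("A", 100.0), ("C", 200.0)],
    "C": [("B", 200.0), ("D", 150.0)],
    "D": [("C", 150.0), ("A", 450.0)],
}
park_nodes = ["D"]  # assumed location of the only park

walk_dist_to_park = nearest_amenity_distance(toy_graph, park_nodes)
for node, meters in sorted(walk_dist_to_park.items()):
    print(node, meters)
```

In a parcel workflow, each parcel would be snapped to its closest network node and would inherit that node's distance as its accessibility feature.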

To automate the spatial analysis workflow, the Python package OSMnx was used in several of the accessibility analyses. The following images show a couple of sample walk networks generated with the OSMnx and Pandana libraries.

In fall 2023, some of the walk accessibility analysis components were rerun using an updated pedestrian network for the City of San Francisco. Below are some neighborhood-scale maps showing walk accessibility to the nearest park in Mission Bay, South of Market, and Potrero Hill, respectively.

After all the feature variables have been joined to the master parcel file, it's time for some data labeling! 

All parcels that were included in the development pipeline over the past 10 years were marked as "Susceptible", and all other parcels were marked as "Not Susceptible". A first look at the labeled parcel data reveals some interesting patterns. For example, parcels located near BART stations are more likely to appear in the development pipeline than parcels farther away, yet parcels at a mid-range distance from parks and open spaces appear in the pipeline most frequently.

In a more recent revisit of this study, only properties that are included in the pipeline and also have an Improvement-to-Land Ratio higher than 1.0 are considered "susceptible to development". Under this new criterion, a smaller set of pipeline parcels is flagged with the positive label for the new machine learning exercise.
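The revised labeling rule can be sketched in a few lines. This is a minimal illustration that assumes each parcel record carries a pipeline flag plus assessed improvement and land values; the field names, the sample parcels, and the handling of parcels with no land value are assumptions made for the example, not details from the study.

```python
def label_parcel(in_pipeline, improvement_value, land_value, ratio_threshold=1.0):
    """Label a parcel using the revised rule: in the pipeline AND ratio above threshold."""
    if not in_pipeline:
        return "Not Susceptible"
    if land_value <= 0:
        # Assumption for this sketch: an undefined ratio is treated as negative.
        return "Not Susceptible"
    ratio = improvement_value / land_value
    return "Susceptible" if ratio > ratio_threshold else "Not Susceptible"


# Hypothetical parcel records (values are made up for illustration).
parcels = [
    {"id": "P1", "in_pipeline": True, "improvement_value": 900_000, "land_value": 600_000},
    {"id": "P2", "in_pipeline": True, "improvement_value": 200_000, "land_value": 800_000},
    {"id": "P3", "in_pipeline": False, "improvement_value": 900_000, "land_value": 300_000},
]

labels = {
    p["id"]: label_parcel(p["in_pipeline"], p["improvement_value"], p["land_value"])
    for p in parcels
}
print(labels)
```

Only P1 meets both conditions here, which mirrors how the stricter rule shrinks the positive class relative to labeling every pipeline parcel.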

In addition to the modified flagging threshold, a more rigorous variable selection process was undertaken. A Pearson correlation analysis was conducted on the numerical variables of the joined parcel dataset to identify variables that are highly correlated and could potentially be removed from further analysis due to overlapping effects. An updated correlation matrix is below, which shows that no significant correlations remain after the removal of those overlapping variables.
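The correlation-based screening can be sketched as follows. This is a minimal, dependency-free illustration: the 0.8 cutoff, the greedy keep-the-first-variable strategy, and the sample columns are all assumptions for the example, since the write-up does not state which threshold or removal order was used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Note: a constant column has zero spread and would raise ZeroDivisionError.
    return cov / (spread_x * spread_y)


def drop_correlated(columns, threshold=0.8):
    """Greedily keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical numeric parcel variables (values are made up for illustration).
features = {
    "lot_area": [1000, 2000, 3000, 4000, 5000],
    "building_sqft": [2100, 3900, 6100, 8000, 9900],  # nearly proportional to lot_area
    "year_built": [1950, 1990, 1920, 2005, 1960],
}

kept = drop_correlated(features)
print(kept)
```

Because the result depends on the order in which variables are visited, a real analysis would typically decide which member of a correlated pair to keep based on domain knowledge.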

To predict the parcels' future development susceptibility, the City's parcel dataset was randomly shuffled and split into three subsets: the training dataset, the testing/validation dataset, and the prediction dataset. Three different machine learning algorithms were trained and compared: Logistic Regression, Naive Bayes, and Random Forest. Among the three, the Random Forest model had the lowest classification error. A separate prediction was then made on the prediction dataset with the Logistic Regression model, since we were interested in explicitly tuning the subset of features that significantly influence development susceptibility. The modeling was implemented in Orange, an open-source data mining and analytics tool.
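The shuffle-and-split step and the comparison metric can be sketched in plain Python. The original modeling was done in Orange, so this is only a stand-in illustration; the 60/20/20 proportions, the fixed seed, and the function names are assumptions, since the write-up does not give the actual split sizes.

```python
import random


def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, testing, and prediction subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    predict = shuffled[n_train + n_test:]  # whatever remains
    return train, test, predict


def classification_error(y_true, y_pred):
    """Share of records whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Hypothetical parcel IDs standing in for full parcel records.
parcel_ids = list(range(10))
train_set, test_set, predict_set = three_way_split(parcel_ids)
print(len(train_set), len(test_set), len(predict_set))

# Hypothetical true vs. predicted labels on a testing subset.
error = classification_error([1, 0, 1, 1], [1, 1, 1, 0])
print(error)
```

Each candidate model would be fit on the training subset and scored with the same error measure on the testing subset, so that the comparison between algorithms is made on parcels none of them has seen.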

Feature Importance Ranking and Prediction Results (mapped) from the Random Forest Model

Finally, the prediction results were visualized with a combination of Mapbox Studio, Mapbox GL, and JavaScript code, in an interactive map with customized toggles and additional layers of information for reference.

Explore the Map here!

P.S. The additional 2023 analysis selected a Random Forest model as the new predictor of development potential. The screenshot below shows an overlay of 2022 Quarter 4 pipeline sites on top of the predicted development probabilities from this new model. The majority of the pipeline sites are located on parcels with a predicted probability over 0.5, which the machine learning model classifies as "susceptible to new development".
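The overlay check described above can also be expressed numerically. The sketch below computes the share of pipeline sites that fall on parcels with a predicted probability over 0.5. The parcel IDs and probabilities are made up for illustration, and the function name is hypothetical; no actual results from the study are reproduced here.

```python
def share_above_threshold(pipeline_ids, predicted_prob, threshold=0.5):
    """Share of pipeline parcels whose predicted probability exceeds the threshold."""
    # Only count pipeline parcels that the model actually scored.
    matched = [pid for pid in pipeline_ids if pid in predicted_prob]
    if not matched:
        return 0.0
    hits = sum(predicted_prob[pid] > threshold for pid in matched)
    return hits / len(matched)


# Hypothetical predicted probabilities keyed by parcel ID.
predicted_prob = {"P1": 0.82, "P2": 0.64, "P3": 0.31, "P4": 0.57, "P5": 0.12}
# Hypothetical pipeline sites for the quarter being checked.
pipeline_ids = ["P1", "P2", "P3", "P4"]

share = share_above_threshold(pipeline_ids, predicted_prob)
print(share)
```

A share well above one half would be consistent with the visual impression from the map, although pipeline sites that already informed the training labels would make such a check optimistic.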

 

Interested in learning more? Feel free to connect with me at wenhaowu92@gmail.com!