Dissertation 

A Spatial Extension of the Random Forest Algorithm

Abstract

Although commonly applied to spatial data, the random forest algorithm does not account for the dependence structure among spatial observations. By splitting the training and test data with a simple random sample, the algorithm introduces predictive bias toward more clustered regions. Further, the algorithm treats observations independently and draws no connections between neighbors. In this dissertation, we investigate three spatial modifications of the random forest methodology. First, we implement a geographically stratified sampling approach to separate the training and test data, ensuring that the algorithm learns from as much data as possible in less populated regions. Second, we develop the Augmented Random Forest (A-RF), which uses an augmented predictor set of both observation-specific (OS) and neighborhood-based (NB) predictors to acknowledge similarities among neighbors. Lastly, we explore a Distance Informed Random Forest (DI-RF), which uses a distance-informed split criterion for generating decision rules that balances minimizing the impurity of the outcome (as in the traditional random forest) with minimizing the distance among observations within nodes. We assess these methods on both simulated and real data, including Philadelphia Police Department pedestrian investigation data and ChristianaCare Newark Campus neonatal intensive care unit data.

We found that the geographically stratified sampling approach successfully improves predictive performance in less observed regions, at the cost of worsening predictive performance in more clustered regions. The A-RF improves mean squared error in data with continuous outcomes and sensitivity in data with imbalanced binary outcomes; it has little impact on data with balanced binary outcomes. The NB predictors used in the A-RF are consistently more important for prediction than the OS predictors in the simulated data, although the OS predictors tend to be more important in the application data. The DI-RF worsens mean squared error in data with continuous outcomes, inconsistently affects sensitivity in data with imbalanced binary outcomes, and has little impact on data with balanced binary outcomes. Overall, this dissertation demonstrates the importance of acknowledging the characteristics of spatial data when applying the random forest algorithm and examining predictive performance.

Random Forest

The random forest algorithm splits the data into a training set and a test set. It generates a set of independent decision trees, each grown from a unique bootstrapped sample of the training data. Each tree is built by creating sets of nodes defined by decision rules: at each node, a random subset of predictors is evaluated to identify the predictor that optimally splits the observations into child nodes. The set of decision trees can then be used to generate predictions on the test data to determine how well the algorithm performs on 'new' or 'unseen' data.
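The workflow described above can be sketched in a few lines of Python with scikit-learn. This is a minimal illustrative example on simulated (non-spatial) data, not the dissertation's datasets or methods: the simple random train/test split shown here is exactly the default that the geographically stratified approach is contrasted against.

```python
# A minimal sketch of the standard random forest workflow,
# using simulated data (not the dissertation's datasets).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Simulated observations standing in for real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Simple random split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each tree is grown on a bootstrap sample of the training data, and
# each split evaluates a random subset of predictors (max_features).
rf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
)
rf.fit(X_train, y_train)

# Predictions on the held-out test set gauge performance on 'unseen' data.
acc = accuracy_score(y_test, rf.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

Note that because each tree sees a different bootstrap sample and a different random predictor subset at each split, the trees are decorrelated, which is what makes averaging their predictions effective.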