This project was completed for The Warren Group, a Boston-based company founded in 1872. It analyzes property data from the past twenty-two years. Since its founding, The Warren Group has collected and stored real estate sales and ownership data. Over the years, the company has built a wide variety of services for the real estate industry and is now recognized throughout New England for its robust property and transaction database.
The scope of this project is to develop an analytical tool that can estimate the future value of a property in Massachusetts. Value is defined as the dollar amount at which the property is expected to sell.
The analysis draws on the explanatory attributes of properties in the four counties chosen: Middlesex, Worcester, Suffolk and Essex. The property data consists of property characteristics, amenities, ownership and, most importantly, property price. Two external datasets supplement it: interest rates for set time periods and macroeconomic financial indices.
Statistical analysis was conducted on the table to identify skewness, missing values and other relevant relationships between the variables. Several variables were transformed to normalize the data. Properties described as land were excluded because there were no physical buildings on the property to be sold. Variables with a high percentage of missing values were rejected or replaced based on their respective sub-descriptions given by the company. Furthermore, the target variable, price, was filtered to include only property values greater than $25,000.
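The filtering steps described above can be illustrated with a minimal Python sketch. The project itself was carried out in SAS, so this is not the authors' code. The field names (`description`, `price`), the sample records and the choice of a log transform are assumptions made for illustration only; the text does not say which transformations were applied.

```python
import math

# Hypothetical records; the field names are assumptions, not the company's schema.
records = [
    {"description": "Single Family", "price": 350000.0},
    {"description": "Land", "price": 90000.0},
    {"description": "Condo", "price": 12000.0},
]

MIN_PRICE = 25_000  # price threshold stated in the text


def clean(rows):
    """Drop land-only parcels and prices at or below the threshold.

    A log transform of price is added as one possible normalizing
    transformation (an assumption; the source does not name the transform).
    """
    kept = []
    for row in rows:
        if row["description"].strip().lower() == "land":
            continue  # no physical building to be sold
        if row["price"] <= MIN_PRICE:
            continue  # keep only prices greater than $25,000
        kept.append({**row, "log_price": math.log(row["price"])})
    return kept


cleaned = clean(records)
```

With the three sample records, only the single-family sale survives both filters.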
Different machine learning techniques were applied to predict property price in Middlesex, Worcester, Suffolk and Essex counties. In total, four models were built: Linear Regression, Decision Tree, Interactive Decision Tree and High-Performance (HP) Forest. The data was partitioned 60/40: 60 percent was assigned to the training set and the remaining 40 percent to the validation set. The modeling was performed in SAS EMC.
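As a rough illustration of the 60/40 partition, the Python sketch below shuffles the rows with a fixed seed and cuts them at 60 percent. It assumes simple random sampling; the sampling method actually used in SAS is not stated in the text, and the function name `partition` and its parameters are hypothetical.

```python
import random


def partition(rows, train_fraction=0.6, seed=42):
    """Randomly split rows into a training set and a validation set.

    Simple random sampling is an assumption; the source only gives the
    60/40 proportions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder row identifiers stand in for property records.
train, valid = partition(range(10))
```

Every row lands in exactly one of the two sets, so the validation set is never seen during training.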
The Average Squared Error (ASE) criterion was used to assess the models. The lower the validation ASE, the better the model predicts property price. A Model Comparison node was used to compare the four models. The HP Forest model had the lowest ASE, 0.0338, making it the best model.
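For readers unfamiliar with the criterion, ASE is the mean of the squared differences between actual and predicted values. The Python sketch below computes it and picks the model with the lowest score. The model names and numbers are toy placeholders, not the project's results.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy validation targets and predictions (illustrative values only).
actual = [1.0, 2.0, 3.0]
predictions = {
    "model_a": [1.0, 2.0, 4.0],
    "model_b": [1.5, 2.0, 3.0],
}

scores = {name: average_squared_error(actual, preds)
          for name, preds in predictions.items()}
best = min(scores, key=scores.get)  # lowest validation ASE wins
```

Here `model_b` is selected because its single half-unit miss is penalized less than the full-unit miss of `model_a`.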
Based on the results from the four models, a list of the top ten predictors of property price was created. The figure below shows these battle-tested variables. Even though the political or economic environment may change, these factors will continue to significantly affect the value of a property.
Authors: Derkach Viktoriya, Enkhbat Degi, Jin Greyson, Kansra Tunisha, Khanijo Ripun, Soloveva Anna, Virelaude Camille