Predictive
All of the following analysis depends on the assumption that the data we collected is correct.
We decided to use an Ordinary Least Squares regression model to predict property value because it works well for numerical prediction and is easy to interpret. We used combined data set A, which has 5180 records and 29 attributes (Appendix A). This data set contains a sample of Arlington’s real estate properties found on Zillow. It contains basic information for each property and its surrounding traffic conditions, convenience, and local food-related businesses. We will now explore possible independent variables for this model.
Figure 5. This shows the differences in Zestimate, square footage, price for each square foot, number of bathrooms, number of bedrooms, and the bike score for different properties by longitude and latitude.
From Figure 5 above, properties in northern Arlington seem to have a higher value than southern properties, most likely because northern properties are, on average, considerably larger than southern properties. This is supported by the third subgraph, where the median rent price per square foot seems evenly distributed across both parts. The number of bedrooms also seems to be positively correlated with the number of bathrooms.
We also used a network analysis to create two new variables for each property. We used red nodes for food businesses and blue nodes for properties. We created an edge between a red node and a blue node whenever the distance between them is less than a given number of miles. We used each property's degree as a new variable representing how many food businesses are within 0.1 miles of that property. We created a similar variable, “degree2”, by widening the distance threshold to 0.2 miles. The network graphs are shown below:
Figure 6.1. This shows the location of different food-related businesses found on Yelp and their price levels for Arlington, VA.
Figure 6.2. This shows the network of properties and food businesses within 0.1 miles of one another.
Figure 6.3. This shows the network of properties and food businesses within 0.2 miles of one another.
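The degree variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this report: it assumes each property and business is given as a (latitude, longitude) pair, that distance is measured as great-circle (haversine) distance, and the names haversine_miles and property_degrees are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def property_degrees(properties, businesses, max_miles):
    """For each property, count the businesses closer than max_miles.

    This equals the degree of the property node in the bipartite
    property-business network with the given distance threshold.
    """
    degrees = []
    for p_lat, p_lon in properties:
        count = sum(
            1
            for b_lat, b_lon in businesses
            if haversine_miles(p_lat, p_lon, b_lat, b_lon) < max_miles
        )
        degrees.append(count)
    return degrees


# Hypothetical coordinates, for illustration only.
properties = [(38.8800, -77.1000)]
businesses = [
    (38.8800, -77.1000),  # same location as the property
    (38.8810, -77.1000),  # roughly 0.07 miles north
    (38.8820, -77.1000),  # roughly 0.14 miles north
    (38.8900, -77.1000),  # roughly 0.69 miles north
]
degree = property_degrees(properties, businesses, 0.1)
degree2 = property_degrees(properties, businesses, 0.2)
```

Comparing every property with every business is quadratic in the number of nodes; for a data set of a few thousand properties this is usually acceptable, but a spatial index would scale better.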
We chose the number of bedrooms, number of bathrooms, total finished square feet, year built, walk score, bike score, the mean price and mean rating of local food-related businesses, the total number and variety of food places, and degree as key independent variables to predict Zestimate, Zillow’s estimation of the property value. The correlation between these variables is shown below.
Figure 7. This shows the correlation between variables bedrooms, bathrooms, square footage, year built, Zestimate, walk score, bike score, along with mean price and ratings of food-related businesses, number of food places with different price levels, number of food places with a rating of 3 or higher, total number and variety of food places, median price per square foot, latitude, longitude, degrees, and the types of properties.
From the correlation graph (Figure 7) above, we can see that Zestimate is highly positively correlated with the number of bedrooms, the number of bathrooms, and the square footage. Zestimate is also highly negatively correlated with type 1 and latitude, which is also shown in the map graphs from earlier. There are other unusual findings. For example, Zestimate is negatively correlated with the walk score and bike score, which is the opposite of what we would expect. However, this kind of convenience may be less important to those who can afford more expensive homes, since they are also more likely to be able to afford a car. The median rent price per square foot is not highly correlated with any of the other variables. It is likely that there is omitted-variable bias.
Based on the previous analysis, we question the accuracy of the data collected from Walkscore.com. Comparing the two maps in Figure 8 reveals some discrepancies: the locations of the red points and green points are not entirely consistent with the bike trail map from Google Maps.
Figure 8. On the left is a map of bike trails from Google Maps, and on the right is a map of bike scores.
Next, we built several regression models using different combinations of independent variables.
First, we compared how well the regression model performs with Zestimate versus median rent price per square foot as the dependent variable. In both cases, we used all the other variables as independent variables. The adjusted R-squared value for the regression using median rent price per square foot is 0.274, and the adjusted R-squared value for the regression using Zestimate is 0.867. Because the Zestimate model fits much better, we decided to focus on improving the model for predicting Zestimate.
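For reference, adjusted R-squared penalizes the ordinary R-squared for the number of regressors. The sketch below uses the standard definition for a model with an intercept; the function name and the example inputs are hypothetical, and this report does not state the exact number of regressors in each model.

```python
def adjusted_r_squared(r_squared, n_obs, n_regressors):
    """Standard adjusted R-squared for a model with an intercept.

    r_squared    : ordinary R-squared of the fitted model
    n_obs        : number of observations used in the fit
    n_regressors : number of independent variables (excluding the intercept)
    """
    if n_obs - n_regressors - 1 <= 0:
        raise ValueError("need more observations than regressors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)


# Hypothetical inputs, for illustration only: with thousands of records and a
# handful of regressors, the adjustment changes R-squared very little.
example = adjusted_r_squared(r_squared=0.87, n_obs=5180, n_regressors=10)
```

With a sample of this size, the penalty only becomes noticeable when the number of regressors is large relative to the number of records.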
We tried several combinations of regressors to find the best one. Below is a table of results for the different models.
Table 1. This is a table of independent variables used in the different linear regression models and the adjusted R-squared values.
Regression models 0 through 4 were built without the information found from the correlation graph. Regression model 5 was built using all the variables whose correlation with Zestimate is 0.50 or higher, or -0.50 or lower. Compared to the other 5 models, model 5 uses fewer regressors, only 5 variables, which can help avoid overfitting and bias from correlated regressors. Model 5 also performs similarly to the other models, which may be overfitting, and unlike in the other models, all the regressors in model 5 are significant at the 0.01 level.
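The selection rule for model 5 (keep the variables whose correlation with Zestimate is at least 0.50 in absolute value) can be sketched as follows. This is an illustration only: it assumes the Pearson correlation, the column names and values are made up, and pearson and select_regressors are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def select_regressors(columns, target, threshold=0.5):
    """Return names of columns with |correlation with target| >= threshold."""
    y = columns[target]
    return [
        name
        for name, values in columns.items()
        if name != target and abs(pearson(values, y)) >= threshold
    ]


# Made-up toy data, not values from our data set.
toy = {
    "zestimate": [100, 200, 300, 400],
    "sqft": [1, 2, 3, 4],        # perfectly positively correlated
    "latitude": [4, 3, 2, 1],    # perfectly negatively correlated
    "noise": [1, -1, -1, 1],     # uncorrelated with the target
}
selected = select_regressors(toy, "zestimate")
```

Taking the absolute value keeps strongly negative correlations, such as latitude, alongside strongly positive ones.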
We will now discuss and interpret the results of linear regression model 5. In this model, holding the other variables fixed, an increase of one square foot is associated with an increase of $228 in the Zestimate. A one-point increase in the mean rating of food-related businesses is associated with an increase of $61,880 in the Zestimate. The Zestimate is $99,640 lower for a condominium and $91,150 higher for a single family home. Although latitude is significant at the 0.001 level and is highly correlated with Zestimate, its coefficient is small, only -3.727.
Lastly, using this model, we will look at whether our teammate is being overcharged for his home. The place is about 1,200 square feet. It is a condominium. The latitude is 38.9. The mean rating for food-related businesses in the area is 3.72.
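So the predicted Zestimate for this apartment unit is 1200*228 + 3.72*61880 - 99640 + 0 - 3.727*38.9 = $404,008, where the 0 is the single family home term, which does not apply to a condominium.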
Typically, monthly rent is between 0.8% and 1.1% of the property’s value, so the rent should be between $3,232 and $4,444 per month. The actual rent is $3,174 per month, which is below this range. So based on this model, our teammate is not being overcharged for the apartment.
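The arithmetic above can be reproduced with a short script. The coefficients and inputs are the ones reported in this section; the dictionary keys and function names are hypothetical, and, like the calculation above, the sketch sums only the five reported terms and assumes no additional intercept term.

```python
# Coefficients of regression model 5 as reported in this section.
COEFFICIENTS = {
    "sqft": 228.0,
    "food_mean_rating": 61880.0,
    "is_condo": -99640.0,
    "is_single_family": 91150.0,
    "latitude": -3.727,
}


def predict_zestimate(features):
    """Sum of coefficient * feature value over the five reported terms."""
    return sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)


def rent_range(value, low_rate=0.008, high_rate=0.011):
    """Monthly rent range implied by the 0.8% to 1.1% rule of thumb."""
    return value * low_rate, value * high_rate


home = {
    "sqft": 1200,
    "food_mean_rating": 3.72,
    "is_condo": 1,           # the unit is a condominium
    "is_single_family": 0,   # so the single family term is zero
    "latitude": 38.9,
}

zestimate = predict_zestimate(home)
low_rent, high_rent = rent_range(zestimate)
actual_rent = 3174
overcharged = actual_rent > high_rent

print(f"Predicted Zestimate: ${zestimate:,.2f}")
print(f"Expected monthly rent: ${low_rent:,.2f} to ${high_rent:,.2f}")
print(f"Overcharged: {overcharged}")
```

The teammate counts as overcharged only if the actual rent exceeds the upper end of the range; here the actual rent falls below even the lower end.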
Limitations
First, the Zestimate is not the final price of a sale; it is just Zillow’s estimation of the property value.
Second, we used the same mean rating for every property within the same zip code. This could be improved if we calculated ratings for food-related businesses by distance from each property.
There are several major limitations with the housing data. First, there were missing values in bedroom, bathroom, and lotSizeSqFt. Instead of imputing these values, we simply deleted any records with missing values. Second, Zillow does not offer a method to focus on a specific area when web scraping, so there was no way for us to collect all house data in the DC area from Zillow. The sample we collected, which only contains around 3500 records, may not have been big enough to represent all of Arlington’s properties. Our sample of housing data also may not be random.