Predicting DC Residential Housing Prices

Dominique Brown

GitHub Repository: https://github.com/ddbilh/Project-606

Introduction:

The District of Columbia's housing market is notorious for being one of the most expensive and competitive in the country for buyers. An NPR article attributes this largely to the price of land in the District (Schweitzer). Understanding the factors that make a house more marketable matters personally to anyone looking to buy or sell in the area at a fair asking price. It also serves the public interest, because it shows how neighborhoods and housing quality can be invested in without pricing out existing residents.

Literature Review:

The article "Machine Learning Project: Predicting Boston Housing Prices with Regression" by Victor Roman is about predicting Boston housing prices with a Decision Tree Regressor model. The study used 3 features (neighborhood poverty, number of rooms, and pupil-to-teacher ratio of nearby schools) to predict housing prices. The author used GridSearchCV to tune the model's parameters. One of the author's main takeaways was that the small number of features limited the predictions. Building on this, I will use more features to determine housing prices in DC.
The article "Estimating Residential Property Values on the Basis of Clustering and Geostatistics" by Beata Calka looks at clustering housing prices with K-means. The paper develops a two-stage method for predicting housing prices in Poland. The first stage uses structural features to cluster the properties. This isolates "local property markets where properties have similar structural attributes." The second stage estimates the impact of the spatial factor (location) on property value by "performing an interpolation for each cluster separately using ordinary kriging (Gaussian process regression)." The result is a set of property value maps, drawn up separately for each cluster. This article gives a lot of information on different ways to use K-means clustering when predicting prices.

What are the Gaps?

Analysis of the DC Residential Properties dataset has rarely gone beyond EDA, and only a few machine learning models have been built on it. Regression has been done with limited features, similar to the limitation noted in the first article of the literature review, and that is something this project can improve upon. This project also compares models, which has seldom been done on this dataset. It also includes K-means clustering, which has not been done on this dataset and adds another dimension to the overall analysis. More broadly, housing price prediction models have been built for other markets, but each market is unique and will not necessarily share the same important features. This project also produces a visual map of different features and prices in the housing market.

Data Set

The dataset I am using is called DC Residential Properties. It is aggregated from four raw datasets. The first is the Census Tracts 2010 data, which includes selected geographic, cartographic, and demographic data. The second is Computer Assisted Mass Appraisal - Residential, which includes attribution on housing characteristics for residential properties. The third is Computer Assisted Mass Appraisal - Condominium, which includes attribution on housing characteristics for condominium properties. The second and third datasets were created as part of the DC Geographic Information System for the DC Office of the Chief Technology Officer and participating DC government agencies. The last dataset is Address Points, which contains the locations and attributes of address points as of July 2018. It comes from the Master Address Repository of the DC Office of the Chief Technology Officer and the DC Department of Consumer and Regulatory Affairs.

Dataset Characteristics:

49 columns, 158955 rows, 52.81 MB

Dataset Sources:

DC Residential Properties Dataset: https://www.kaggle.com/christophercorrea/dc-residential-properties

Raw data:

Raw Census Tracts 2010 data: https://opendata.dc.gov/datasets/census-tracts-in-2010/explore

Raw Residential data: https://opendata.dc.gov/datasets/computer-assisted-mass-appraisal-residential/explore

Raw Condominium data: https://opendata.dc.gov/datasets/computer-assisted-mass-appraisal-condominium/explore

Raw Address Points data: https://opendata.dc.gov/datasets/address-residential-units/explore

Hypothesis / Research Question(s)

Which model can predict DC residential Housing Prices the most accurately?
What features are important to predicting DC Residential Housing Prices?
Is it really all about Location? What are the most expensive locations?

Implementation (Model)

Part 1: Create predictive model for DC Residential Properties

Linear Regression
Decision Tree Model
Random Forest Regressor Model

Part 2: Location of Housing Prices

K-Means Clustering
Create map to show different levels of housing prices

Cleaning Data

1. Used info(), describe(), and isnull().sum() to get an understanding of the columns, data types, and missing data.
2. Decided to delete the columns 'Unnamed: 0', 'NUM_UNITS', 'AYB', 'YR_RMDL', 'SALEDATE', 'SALE_NUM', 'GBA', 'USECODE', 'GIS_LAST_MOD_DTTM', 'CMPLX_NUM', 'LIVING_GBA', 'NATIONALGRID', 'ASSESSMENT_SUBNBHD', 'CENSUS_TRACT', 'CENSUS_BLOCK', and 'QUALIFIED' because they were not relevant for prediction and/or had too much missing data to be useful.
3. Noticed that the condominium rows were consistently missing data, so I decided to focus on residential properties and removed the condominium rows.
4. Used dropna() on the rest of the missing data because it was mainly missing price data, which is the target of the model.
5. Changed square footage from object to float64.
6. Used duplicated() and found no duplicated data.
7. Used a box plot to show "Price" and then used code to normalize the outliers.
8. Added an unemployment rate column based on ward.
   - Built the column from the unemployment rate data for each ward.
   - Used it in place of the wards in the model.
   - Used January 2019 unemployment data, which lines up with the current data.
9. Added dummy variables for Ward, Quadrant, AC, and Neighborhood.
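The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a tiny synthetic frame, not the project's actual notebook: the `SQFT` column name and the per-ward unemployment values are placeholders introduced here, and only a subset of the dropped columns is shown.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the Kaggle CSV; all values are illustrative.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "SALEDATE": ["2016-01-01", "2017-05-02", None, "2018-03-04"],
    "PRICE": [500000.0, np.nan, 750000.0, 320000.0],
    "SQFT": ["1200", "900", "2100", "1000"],  # hypothetical column name, stored as text
    "WARD": ["Ward 2", "Ward 7", "Ward 3", "Ward 8"],
    "QUADRANT": ["NW", "SE", "NW", "SE"],
    "AC": ["Y", "N", "Y", "N"],
})

# Step 2: drop columns judged irrelevant or too sparse (subset shown).
df = df.drop(columns=["Unnamed: 0", "SALEDATE"], errors="ignore")

# Step 4: drop the remaining rows with missing values (mostly missing PRICE).
df = df.dropna()

# Step 5: convert square footage from object to float64.
df["SQFT"] = df["SQFT"].astype("float64")

# Step 6: count fully duplicated rows.
n_dupes = int(df.duplicated().sum())

# Step 8: map each ward to an unemployment rate.
# These numbers are placeholders, NOT the January 2019 figures.
ward_unemployment = {"Ward 2": 4.0, "Ward 3": 3.5, "Ward 7": 9.0, "Ward 8": 12.0}
df["UNEMPLOYMENT_RATE"] = df["WARD"].map(ward_unemployment)

# Step 9: one-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["WARD", "QUADRANT", "AC"])

print(df.head())
print("duplicated rows:", n_dupes)
```

Passing `errors="ignore"` to `drop` keeps the sketch from failing when a listed column is absent, which is convenient when the column list is longer than the sample data.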

Housing Prices boxplot with outliers

Housing Prices normalized
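The report does not say which method was used to normalize the price outliers. One common approach, shown here purely as an assumption, is to cap values at the box plot's whisker limits (1.5 times the interquartile range beyond the quartiles). The prices below are made up.

```python
import pandas as pd

# Made-up prices with one extreme value at the end.
prices = pd.Series(
    [250000, 300000, 320000, 400000, 450000, 500000, 5000000],
    dtype="float64",
    name="PRICE",
)

# Quartiles and interquartile range.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Whisker limits used by a standard box plot.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside the limits instead of dropping the rows.
capped = prices.clip(lower=lower, upper=upper)

print(capped)
```

Capping keeps every row in the dataset; dropping the rows outside the limits would be an equally reasonable alternative.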

EDA: Correlation Matrix

- Decided to drop Stories, BLD_NUM, and Kitchens because they had no significant correlation with price
- The highest positive correlations with price were the number of bathrooms and the number of fireplaces
- The highest negative correlation was with the unemployment rate
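A correlation check like the one summarized above can be sketched as follows. The data is synthetic, the `NOISE` column is a hypothetical stand-in for a weakly related feature, and the 0.1 cutoff is an arbitrary illustration rather than the threshold used in the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features: price rises with bathrooms and falls with unemployment.
bath = rng.integers(1, 5, n)
unemployment = rng.uniform(3, 13, n)
price = 150000 * bath - 20000 * unemployment + rng.normal(0, 50000, n)

df = pd.DataFrame({
    "BATHRM": bath,
    "UNEMPLOYMENT_RATE": unemployment,
    "NOISE": rng.normal(size=n),  # hypothetical unrelated feature
    "PRICE": price,
})

# Pearson correlation of every feature with PRICE, strongest positive first.
corr_with_price = df.corr()["PRICE"].drop("PRICE").sort_values(ascending=False)
print(corr_with_price)

# Candidate columns to drop: absolute correlation below an illustrative cutoff.
weak = corr_with_price[corr_with_price.abs() < 0.1].index.tolist()
print("weakly correlated:", weak)
```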

EDA: Median EYB for each Ward

EYB = Estimated Year Built, which is based on the actual year built and the year of the last remodel.
Wards 2 and 3 have the most recently updated houses.
Wards 4, 5, and 7 are tied for the oldest median EYB.

EDA: Median price of residential houses in each quadrant

NW has the highest median price. SE and SW have the lowest median prices.

EDA: Median Price of Ward

The most expensive ward is Ward 2, and the least expensive wards are Wards 7 and 8.
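Median comparisons like these come down to a group-by on the location column. A minimal sketch with made-up prices follows; the column names match the dataset, the values do not.

```python
import pandas as pd

# Made-up sales; only the column names follow the dataset.
df = pd.DataFrame({
    "WARD": ["Ward 2", "Ward 2", "Ward 7", "Ward 7", "Ward 8"],
    "QUADRANT": ["NW", "NW", "SE", "NE", "SE"],
    "PRICE": [900000, 1100000, 300000, 340000, 280000],
})

# Median price per ward and per quadrant, most expensive first.
median_by_ward = df.groupby("WARD")["PRICE"].median().sort_values(ascending=False)
median_by_quadrant = df.groupby("QUADRANT")["PRICE"].median().sort_values(ascending=False)

print(median_by_ward)
print(median_by_quadrant)
```

The same pattern applies to the median EYB per ward or the median price per neighborhood by swapping the grouping and value columns.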

EDA: Median Price of Neighborhood

The highest median price neighborhoods are Kalorama and Massachusetts Avenue Heights. The lowest median price neighborhoods are Barry Farms, Congress Heights, Deanwood, and Fort Dupont Park.

Phase 2: Model Implementation

Part 1. Predicting Housing Prices

Linear Regression Model

Model performance on the testing set:

RMSE is 231642.12164237443

R2 score is 0.6319817920057262
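For readers who want to see how test-set figures of this kind are produced, here is a minimal scikit-learn sketch on synthetic data. The 80/20 split, the random seed, and the three synthetic features are assumptions for illustration and are not taken from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and a price-like target with added noise.
X = rng.normal(size=(500, 3))
y = 400000 + X @ np.array([120000.0, 60000.0, -40000.0]) + rng.normal(0, 50000, 500)

# Hold out 20% of the rows for testing (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE is the square root of the mean squared error on the test set.
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)

print("RMSE is", rmse)
print("R2 score is", r2)
```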

Decision Tree Model

RMSE: 283735.6764555141

The R2 score is 0.45020555638415716

Random Forest Regressor Model

The Root Mean Squared Error of Random Forest Regression is 220563.81627704547

The R2 value of Random Forest Regression is 0.6663410543109358
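Feature importances for a random forest can be read from the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data in which longitude is deliberately the strongest driver; the `NOISE` column and the hyperparameters are illustrative assumptions, not the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic features; NOISE is a hypothetical column unrelated to price.
X = pd.DataFrame({
    "LONGITUDE": rng.uniform(-77.11, -76.91, n),
    "BATHRM": rng.integers(1, 5, n),
    "NOISE": rng.normal(size=n),
})

# Price falls moving east and rises with bathroom count (made-up relationship).
y = (
    -3_000_000 * (X["LONGITUDE"] + 77.0)
    + 50_000 * X["BATHRM"]
    + rng.normal(0, 20_000, n)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favor continuous features with many distinct values, so a ranking like this is best read as a guide rather than a precise measure.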

Feature importance below:

Part 2: Clustering prices with KMEANS

Used the elbow method to determine the number of clusters for the K-means model
Chose 3 clusters
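The elbow method fits K-means for a range of cluster counts and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. A minimal sketch on three synthetic blobs follows; the data and the range of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic blobs standing in for the real feature space.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])

# Scale features so no single column dominates the distance.
X = StandardScaler().fit_transform(points)

# Record the inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting the inertia against k gives the familiar elbow curve; the bend is where adding another cluster stops paying off.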

Price Cluster on DC map

Clustered the prices on a longitude/latitude map
The prices cluster into three distinct sections of the map, with a few outliers
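The report does not list exactly which columns were passed to K-means for this map. The sketch below assumes standardized longitude, latitude, and price, with three synthetic areas standing in for the real sales.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

def make_area(lon, lat, price, n=100):
    """Generate a synthetic group of sales around one location and price level."""
    return pd.DataFrame({
        "LONGITUDE": rng.normal(lon, 0.01, n),
        "LATITUDE": rng.normal(lat, 0.01, n),
        "PRICE": rng.normal(price, 50000, n),
    })

# Three made-up areas: one expensive, one mid-priced, one cheaper.
df = pd.concat(
    [
        make_area(-77.07, 38.94, 1_000_000),
        make_area(-76.98, 38.92, 550_000),
        make_area(-76.98, 38.85, 300_000),
    ],
    ignore_index=True,
)

# Standardize so coordinates and price contribute on a comparable scale.
features = StandardScaler().fit_transform(df[["LONGITUDE", "LATITUDE", "PRICE"]])

df["CLUSTER"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Median price per cluster shows how the clusters differ in price level.
summary = df.groupby("CLUSTER")["PRICE"].median()
print(summary)
```

The `CLUSTER` column can then be used as the color in a longitude/latitude scatter plot to reproduce a map of this kind.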

Non-clustered map of House Price ranges

Has a similar structure to the clustered map

Clustering of LANDAREA and PRICE features

3 clusters
Shows that LANDAREA and PRICE have little correlation

Phase 3: Execution and Interpretation

Model Interpretation: Part 1

I used R squared to determine the accuracy of the models in predicting DC housing prices. R squared (the coefficient of determination) measures how much of the variance in the observed prices is explained by the model's predictions. The closer it is to 1, the better the model is at predicting the data. I tried three models to determine which would predict DC housing prices the best. The Linear Regression Model had an R2 of .6319. The Random Forest Regressor Model had an R2 of .6663. The Decision Tree Model had an R2 of .4502. Based on R squared, the best model for predicting DC housing prices is the Random Forest Regressor Model. From this model, the most important features were Longitude, Estimated Year Built (EYB), Bathrooms, Square Footage of House, Land Area, and Latitude.
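To make the metric concrete, R squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses four made-up prices and checks the manual value against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices.
y_true = np.array([300000.0, 450000.0, 600000.0, 900000.0])
y_pred = np.array([320000.0, 430000.0, 650000.0, 800000.0])

# Residual sum of squares: error left after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean price.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

# RMSE expresses the typical error in the same units as price.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print("manual R2:", r2_manual)
print("sklearn R2:", r2_score(y_true, y_pred))
print("RMSE:", rmse)
```

Because R squared compares the model against a mean-only baseline, it can fall below 0 when a model predicts worse than that baseline.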

Model Interpretation: Part 2

For part 2, I used K-means clustering to cluster the housing prices on a longitude and latitude map. It showed how the housing prices clustered in relation to location. The prices fell into 3 clusters on the map, which clearly corresponded to NW, NE, and the southern part of DC. Historically, this makes sense because NW is the most expensive and desirable place to live in DC, while the southern part of DC has historically seen more disinvestment and is separated from the rest of the city by a river. NE is a middle ground in terms of prices and desirability, and many places in NE are considered up and coming. I then mapped the prices without clustering, color-coding the price ranges. The clustered map and the non-clustered map had a similar structure, with higher prices in NW and lower prices in the southern quadrants. I also clustered the price and land area features to see whether land area played a big role, since the previous research suggested that DC housing prices are high because DC land is expensive. The graph of this clustering shows no strong correlation between price and land area. This suggests that the quadrant a property sits in is more important than the actual square footage of its land. Not all land in DC is equal in terms of price.

Final Outcomes:

Hypothesis / Research Question(s) Outcomes

Which model can predict DC residential Housing Prices the most accurately?

The Random Forest Regressor Model predicts the DC residential Housing Prices the most accurately because it has the highest R Squared.

What features are important to predicting DC Residential Housing Prices?

From the Random Forest Regressor Model, the most important features were Longitude, Estimated Year Built (EYB), Bathrooms, Square Footage of House, Land area and Latitude. (See full chart above)

Is it really all about Location? What are the most expensive locations?

Yes, location is very important in determining price. Longitude, which captures how far west or east a property is, is the most important feature. Latitude, which captures north or south within DC, is also important. Whether a property is on the east or west side of DC seems to determine more about price than whether it is north or south, which is in line with the clustered and color-coded maps. The clustering of prices on a DC map also showed the disparity in housing prices depending on location. However, some features are more important than latitude, which shows it is not just about location but also about how big the house is, when it was built, and how many bathrooms it has. The most expensive locations are in the NW quadrant of DC. The most expensive neighborhoods are Massachusetts Avenue Heights, Kalorama, Berkley, Spring Valley, and Georgetown.

References:

https://www.npr.org/local/305/2020/03/03/811551102/luxury-amenities-aren-t-why-housing-is-so-expensive-in-the-d-c-area

https://towardsdatascience.com/machine-learning-project-predicting-boston-house-prices-with-regression-b4e47493633d

Calka, Beata. "Estimating Residential Property Values on the Basis of Clustering and Geostatistics." Geosciences.

https://www.analyticsvidhya.com/blog/2020/08/exploratory-data-analysiseda-from-scratch-in-python/

https://does.dc.gov/sites/default/files/dc/sites/does/page_content/attachments/Ward_2019_BM.pdf (January 2019 data)

Links (Optional)

LinkedIn Profile: https://www.linkedin.com/in/dominiquebrown1234/

GitHub Repo: https://github.com/ddbilh/Project-606
