Data
We collected data from various sources to get more information about rent, housing prices, and food-related businesses in the DC Metro area.
Rent Data
The rent data set is fairly clean and contains four variables: zip code, date, median rent per square foot, and data source. We queried 737 zip codes, but only 68 returned historical information about the median rent per square foot. As a result, 12.5% of the rows (669 of 5,352, one for each zip code that returned nothing) have missing values in every attribute other than zip code. The number of historical values varies by zip code, with some zip codes having 7 years’ worth of data points. The most important attribute, median rent per square foot, has many outliers: of the 4,683 records without missing values, 278 are outliers. These outliers belong to 7 zip codes in Washington and Arlington. Because we were interested in how rent differs between areas, we did not adjust them. The data source attribute records where the data came from: it was extracted using Quandl’s API but originated from Zillow’s research database. This attribute also identifies the city associated with each zip code.
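The text above does not say which rule was used to flag outliers. As a minimal sketch, assuming the common 1.5 × IQR box-plot rule (an assumption, not necessarily the rule used here), outliers could be flagged as follows. The function name and the sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Assumption: the 1.5 * IQR box-plot rule; the report does not state
    which outlier definition was actually applied.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical median rents per square foot (not the study's data).
rents = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 9.0]
outliers = iqr_outliers(rents)
print(outliers)
```

Flagging outliers rather than removing them matches the decision above to leave these values unchanged.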
While cleaning and formatting the rent data, we removed all rows with missing values, leaving 4,683 of the original 5,352 records (87.5%). The outliers remain unchanged because they can provide insight into rent in different zip codes. Most of the analysis used the most recent median rent per square foot for each of the 68 zip codes. Additionally, we used equal-width binning to group the median rent per square foot into 4 bins. We believed this was the most intuitive strategy for this data because we wanted to maintain a similar data distribution. With this new attribute, we can more easily interpret the results of a hypothesis test such as ANOVA.
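Equal-width binning splits the range between the smallest and largest value into intervals of identical width. The sketch below illustrates the idea with 4 bins. The function name and the sample rents are hypothetical, and the original analysis may have used a library routine instead.

```python
def equal_width_bins(values, n_bins=4):
    """Assign each value a 0-based bin index using equal-width intervals.

    Returns (labels, edges), where edges holds the n_bins + 1 boundaries.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical, so use a single bin.
        return [0] * len(values), [lo] * (n_bins + 1)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        labels.append(min(idx, n_bins - 1))  # the maximum belongs to the last bin
    return labels, edges

# Hypothetical median rents per square foot (not the study's data).
rents = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
labels, edges = equal_width_bins(rents)
print(labels)
print(edges)
```

Because the bin boundaries depend only on the minimum and maximum, the binned attribute keeps the shape of the original distribution. Sparse bins at the high end are expected when outliers are retained.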
Figure 1. Distribution of median rent per square foot for the 68 zip codes after binning. (https://plot.ly/~ll950/41/rent-bin/)
Housing Data
We used Zillow’s API to get data about individual houses in Arlington County. Although Zillow does not offer a method to focus on a specific area, we found a pattern by exploring the “zpid” (Zillow Property ID) in a data set Zillow has on Kaggle.com: all the houses and apartments in Arlington County had a zpid in the interval [11976993, 12586038]. We initially had approximately 5000 individual observations. We used the following attributes for our analysis: zip code, latitude/longitude, type, number of bedrooms, number of bathrooms, total square footage, building completion year, and Zillow’s value estimate. We removed around 1600 records that had missing values in any of these attributes. There were some outliers, such as a record of a house with 78 bathrooms. We manually checked Zillow’s website to verify the validity of these anomalous records. After the cleaning process, only 3573 rows remained.
We also used the Walkscore.com API to collect data about how convenient it is to bike or walk from each of these properties. The data collected here was quite clean. The new variables, walk_score and bike_score, were added to the housing data set.
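The two cleaning steps described above, restricting records to the Arlington County zpid interval and dropping records with any missing attribute, can be sketched as follows. The field names and the `clean_housing` helper are hypothetical and are not taken from Zillow’s API. Only the interval bounds come from the text.

```python
# zpid interval for Arlington County, as reported in the text.
ZPID_MIN, ZPID_MAX = 11976993, 12586038

# Hypothetical field names for the attributes used in the analysis.
REQUIRED = ("zipcode", "latitude", "longitude", "type", "bedrooms",
            "bathrooms", "sqft", "year_built", "value_estimate")

def clean_housing(records):
    """Keep records inside the zpid interval with no missing attribute."""
    kept = []
    for rec in records:
        if not (ZPID_MIN <= rec.get("zpid", -1) <= ZPID_MAX):
            continue  # outside the Arlington County interval
        if any(rec.get(field) is None for field in REQUIRED):
            continue  # at least one required attribute is missing
        kept.append(rec)
    return kept

# Hypothetical records for illustration only.
complete = {"zpid": 12000000, "zipcode": "22201", "latitude": 38.88,
            "longitude": -77.09, "type": "condo", "bedrooms": 2,
            "bathrooms": 1, "sqft": 900, "year_built": 1985,
            "value_estimate": 450000}
missing_sqft = dict(complete, zpid=12000001, sqft=None)
out_of_range = dict(complete, zpid=99999999)

cleaned = clean_housing([complete, missing_sqft, out_of_range])
print(len(cleaned))
```

Note that this filter only drops incomplete records. Implausible values such as the 78-bathroom house still need the manual verification described above.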
Yelp Data
To get data about food businesses in the DC Metro area, we used Yelp’s API. We queried for food-related business data for 68 zip codes in the area and initially got approximately 10,500 records. We used the following attributes for our analysis: price level, rating, review count, zip code, categories, and latitude/longitude. We removed any records with missing values in any of these variables. Lower rent areas had a disproportionate number of records with missing values, so slightly more records were removed from low rent areas. This may be a limitation of Yelp’s data for one of our data science questions: is there a relationship between rent or housing prices and restaurant prices in a given zip code?
For the categories variable of each business, Yelp lists descriptive labels such as bar, restaurant, and coffee shop. These labels were important for our analysis, as one of our goals was to find out how different types of food businesses are spread over the 68 zip codes. The food business types are shown below, as well as a map of their distribution.
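One way to check whether missing values are concentrated in low rent areas is to compute the share of incomplete records per rent group. The sketch below assumes each record has already been flattened into a dictionary and that each zip code maps to a rent group. The field names, the group labels, and the `missing_rate_by_group` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical flattened field names for the attributes used in the analysis.
REQUIRED = ("price", "rating", "review_count", "categories",
            "latitude", "longitude")

def missing_rate_by_group(records, group_of_zip):
    """Return {group: share of records with at least one missing attribute}."""
    totals = defaultdict(int)
    incomplete = defaultdict(int)
    for rec in records:
        group = group_of_zip.get(rec.get("zip_code"), "unknown")
        totals[group] += 1
        if any(rec.get(field) is None for field in REQUIRED):
            incomplete[group] += 1
    return {group: incomplete[group] / totals[group] for group in totals}

# Hypothetical records for illustration only.
base = {"price": "$$", "rating": 4.0, "review_count": 10,
        "categories": ["bar"], "latitude": 38.9, "longitude": -77.0}
records = [
    dict(base, zip_code="20001"),
    dict(base, zip_code="20001", price=None),  # missing price level
    dict(base, zip_code="22201"),
    dict(base, zip_code="22201"),
]
group_of_zip = {"20001": "low", "22201": "high"}

rates = missing_rate_by_group(records, group_of_zip)
print(rates)
```

A noticeably higher rate in the low rent group would indicate that dropping incomplete records shifts the sample toward higher rent areas.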
Common food-related business types in the DC Metro area.