Cleaning

Outliers

Rent

For the rent data set, only one attribute may contain outliers: median rent price per square foot. To find the outliers, we looked for points below the 25th percentile minus 1.5 times the interquartile range and points above the 75th percentile plus 1.5 times the interquartile range. Of the 4,683 records that remain after excluding missing values, 278 are outliers. The 25th percentile for this attribute is $1.28/sq. ft. and the 75th percentile is $1.92/sq. ft. On further analysis, we found that the outliers were much higher than the other prices, ranging from $2.87/sq. ft. to $3.92/sq. ft. These outliers belong to 7 of the 68 zip codes for Washington and Arlington. Because we are interested in how rent differs between areas, it does not make sense to change or remove these outliers, so we did not adjust them. The other attributes, such as zip code, date, city and state, are unlikely to contain outliers.
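The IQR rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `iqr_outliers` and the sample prices are made up, and the quantile convention (`method="inclusive"`) is a choice made for this sketch, not necessarily the one used in the project.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartiles; other quantile conventions
    # can shift the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up rent prices in $/sq. ft., for illustration only.
rents = [1.0, 1.2, 1.3, 1.5, 1.6, 1.8, 1.9, 2.0, 3.9]
outliers = iqr_outliers(rents)
print(len(outliers), min(outliers), max(outliers))
```

Counting the flagged values and reporting their range, as in the last line, is the same kind of summary given in the paragraph above.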

Yelp

The Yelp dataset had some outliers: businesses with very low review counts, near 0 or 1. These businesses may have closed or moved, so their Yelp pages had very limited activity. They also typically had NAs in many columns, so we could remove them simply by removing the rows with NAs.

The only outlier in the positive direction was the Founding Farmers restaurant by the Georgetown Waterfront. This restaurant had more than 9,000 reviews, while the next highest restaurant had around 3,000 reviews. When we made a histogram of the review count variable, this value made the plot difficult to read, so we decided to remove it.

Housing

Because the data does not have many dimensions and the results of LOF, ABOD and top-down binning are difficult to interpret, we define an apartment as an outlier if any one of its values is an outlier. So instead of using the outlier detectors mentioned above, we used the simplest function, “checkOutliers”, which uses the IQR to check for extreme outliers.
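As a rough sketch of the "outlier if any one value is an outlier" rule, the snippet below flags a record when at least one of its columns falls outside that column's IQR fences. It is not the “checkOutliers” function itself: the helper names `iqr_fences` and `flag_outlier_rows`, the 1.5 multiplier, the quantile convention and the sample records are all assumptions made for illustration.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def flag_outlier_rows(rows, columns, k=1.5):
    """Return indices of rows with an out-of-fence value in any listed column."""
    fences = {c: iqr_fences([r[c] for r in rows], k) for c in columns}
    flagged = []
    for i, row in enumerate(rows):
        if any(not (fences[c][0] <= row[c] <= fences[c][1]) for c in columns):
            flagged.append(i)
    return flagged

# Made-up records for illustration only (not taken from the project data).
homes = [
    {"bedroom": 1, "finishedSqFt": 900},
    {"bedroom": 2, "finishedSqFt": 950},
    {"bedroom": 2, "finishedSqFt": 1000},
    {"bedroom": 3, "finishedSqFt": 1050},
    {"bedroom": 3, "finishedSqFt": 5},      # implausibly small area
    {"bedroom": 78, "finishedSqFt": 1100},  # implausible bedroom count
]
flagged = flag_outlier_rows(homes, ["bedroom", "finishedSqFt"])
print(flagged)
```

Returning row indices rather than deleting rows keeps the flagged records available for manual review.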

After conducting outlier detection on each column, we found some anomalous records. There are outliers in bedroom, bathroom, zestimate and finishedSqFt.

One record shows a single-family apartment with 78 bedrooms and 22 bathrooms. Three records show a finishedSqFt of no more than 5. Another record has an estimated value of 7,168,468 with a finishedSqFt of 1,472.

These outliers are not good for machine learning algorithms, but they may carry some valuable meaning. Since we have the information about these outliers, we checked them one by one on the Zillow website to see whether they are anomalous or just correct but extreme cases. After checking online, we manually deleted the outliers that proved to be erroneous values.

Missing Values

Rent

In the cleaning phase of Project 1, the rent data was collected by querying 737 zip codes, but only 68 returned historical information about the median rent per square foot. As a result, 12.5% of the rows have missing values in all other attributes. Since we did not get any additional information for these missing zip codes, we removed these records from the data set.

Yelp

We removed rows that had missing values for either category or location. We also dropped a record if it had missing values in all of the following columns: ‘url, phone, name, street_address’. The price variable initially took on the values $, $$, $$$ and $$$$, and we converted these to a numeric scale of 1 to 4. Before using the hierarchical clustering algorithm, we also removed records with missing values for price.
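The Yelp cleaning steps above could look roughly like the following sketch. The field names follow the columns named in the text, but the record layout (a list of dictionaries), the helper names and the convention that missing values are `None` or empty strings are assumptions made for illustration.

```python
# Map Yelp's dollar-sign price tiers to the numeric 1 to 4 scale.
PRICE_SCALE = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}
CONTACT_FIELDS = ("url", "phone", "name", "street_address")

def is_missing(value):
    # Assumption: missing values appear as None or empty strings.
    return value is None or value == ""

def clean_yelp(records):
    """Drop unusable records and convert price to a number (None if missing)."""
    cleaned = []
    for rec in records:
        # Drop rows missing either category or location.
        if is_missing(rec.get("category")) or is_missing(rec.get("location")):
            continue
        # Drop rows where all of the contact fields are missing.
        if all(is_missing(rec.get(f)) for f in CONTACT_FIELDS):
            continue
        rec = dict(rec)  # copy so the input is left untouched
        rec["price"] = PRICE_SCALE.get(rec.get("price"))
        cleaned.append(rec)
    return cleaned

def drop_missing_price(records):
    """Extra filter applied before clustering."""
    return [r for r in records if r["price"] is not None]

# Made-up records for illustration only.
records = [
    {"name": "A", "category": "Cafe", "location": "DC", "url": "u",
     "phone": "", "street_address": "", "price": "$$"},
    {"name": "B", "category": None, "location": "DC", "url": "u",
     "phone": "p", "street_address": "s", "price": "$"},
    {"name": "", "category": "Bar", "location": "DC", "url": "",
     "phone": "", "street_address": "", "price": "$$$"},
    {"name": "D", "category": "Bar", "location": "Arlington", "url": "u",
     "phone": "p", "street_address": "s", "price": None},
]
cleaned = clean_yelp(records)
for_clustering = drop_missing_price(cleaned)
```

Keeping the price filter separate mirrors the text: records without a price are only dropped for the clustering step.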

Housing

After some numerical research, we found an interval of Zillow’s Zpid values that contains only apartments in Arlington. Using the GetUpdatedPropertyDetails API, we collected over 5,000 observations of apartment details such as number of bedrooms, year built, finished square feet and so on.

In the data set “Household.csv”, there are some missing values in bedroom, bathroom and lotSizeSqFt. Instead of estimating the missing values from the same variable or from other variables, we simply deleted any observations that contained missing values. We have a substantial number of observations and the percentage of rows with missing values is not large, around 20%.

In the dataset “value.csv”, each row either has no missing values or has missing values for all variables, because of a failed API request. So, for this data, we also deleted all the rows with missing values.

In the dataset “score.csv”, after deleting irrelevant and redundant variables, we kept walk score, walk description, transport score, transport description, bike score and bike description. There are no missing values for walk score, walk description, bike score and bike description. However, every row is missing transport score and transport description, which is not normal and may be due to using a free API trial. So, we deleted these two variables.
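The two deletion strategies used for the housing files, dropping incomplete rows and dropping columns that are empty in every row, can be sketched as below. The helper names, the dictionary-based record layout, the key names `walk_score` and `transport_score`, and the sample values are assumptions for illustration only.

```python
def is_missing(value):
    # Assumption: missing values appear as None or empty strings.
    return value is None or value == ""

def drop_incomplete_rows(rows):
    """Keep rows with no missing values; also return the share of rows dropped."""
    kept = [r for r in rows if not any(is_missing(v) for v in r.values())]
    dropped_share = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, dropped_share

def drop_empty_columns(rows):
    """Remove columns whose value is missing in every row."""
    if not rows:
        return []
    empty = {c for c in rows[0] if all(is_missing(r.get(c)) for r in rows)}
    return [{c: v for c, v in r.items() if c not in empty} for r in rows]

# Made-up records for illustration only.
household = [
    {"bedroom": 2, "bathroom": 1, "lotSizeSqFt": 1200},
    {"bedroom": None, "bathroom": 1, "lotSizeSqFt": 900},
    {"bedroom": 3, "bathroom": 2, "lotSizeSqFt": 2000},
    {"bedroom": 1, "bathroom": 1, "lotSizeSqFt": 800},
    {"bedroom": 2, "bathroom": 2, "lotSizeSqFt": 1500},
]
kept, dropped_share = drop_incomplete_rows(household)

scores = [
    {"walk_score": 80, "transport_score": None},
    {"walk_score": 65, "transport_score": None},
]
scores_clean = drop_empty_columns(scores)
```

Reporting the dropped share alongside the kept rows makes it easy to check that row deletion does not discard too much of the data.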
