This milestone extends the exploratory data analysis from Milestone 1 by developing machine learning models to predict Airbnb listing prices in New York City.
Two prediction models were created:
Linear Regression
Decision Tree Regression
These models use listing features such as location, room type, availability, and review activity to estimate the nightly price of an Airbnb listing.
Model performance was evaluated using Mean Squared Error (MSE) to compare prediction accuracy and determine which model produces better predictions.
The target variable selected for prediction is price, which represents the nightly cost of an Airbnb listing.
Price is a quantitative variable, making it appropriate for regression modeling.
Several independent variables were used to predict price, including:
Latitude
Longitude
Minimum Nights
Number of Reviews
Availability (365 days)
Room Type
Neighbourhood Group
Before building the models, categorical variables such as room_type and neighbourhood_group were converted into dummy variables so they could be used in regression models.
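The dummy-variable step can be sketched as follows. This is a minimal illustration using pandas, which the column names (room_type, neighbourhood_group) suggest was used; the three-row table is hypothetical, not the actual dataset.

```python
import pandas as pd

# Hypothetical mini listing table; column names mirror the report's variables.
df = pd.DataFrame({
    "price": [150, 90, 250],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan"],
})

# One-hot encode the categorical columns; drop_first avoids the
# dummy-variable trap (perfect collinearity with the intercept).
encoded = pd.get_dummies(
    df, columns=["room_type", "neighbourhood_group"], drop_first=True
)
print(sorted(encoded.columns))
```

Each category becomes a 0/1 column (for example, room_type_Private room), which a regression model can use directly.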
Next, the dataset was split into:
Training data (80%)
Testing data (20%)
This allows the model to learn patterns from the training data and then be evaluated using unseen testing data.
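The 80/20 split described above can be sketched with scikit-learn's train_test_split; the feature matrix here is synthetic stand-in data, not the Airbnb listings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and price target standing in for the listings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(loc=150, scale=50, size=100)

# 80% training / 20% testing, as described above;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```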
An initial linear regression model was created using multiple independent variables, including:
Latitude
Longitude
Number of Reviews
Reviews per Month
Availability
Host Listing Count
Room Type
Neighbourhood Group
The initial model produced:
R² = 0.098
This indicates that the model explained only about 9.8% of the variation in price, meaning prediction accuracy was very low.
This poor performance occurred because the price distribution was highly skewed, with some listings costing up to $10,000.
These extreme values distorted the regression model.
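A fit like the one above can be sketched as follows. The data here is synthetic and deliberately given heavy-tailed noise to illustrate how a skewed target depresses R²; the resulting value is illustrative only and is not the report's 0.098.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: a few predictors plus heavy-tailed "prices",
# mimicking the skew caused by rare very expensive listings.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 100 + 20 * X[:, 0] + rng.exponential(scale=200, size=500)  # skewed noise

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # coefficient of determination on the fitted data
print(round(r2, 3))
```

Because the skewed noise dwarfs the linear signal, R² comes out very low, matching the pattern the report describes.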
Initial linear regression model showing poor performance due to skewed price distribution.
After fitting the first linear regression model, the statistical summary showed that some variables were not significant predictors of price. In particular, the variable neighbourhood_group_Queens had a high p-value (p = 0.578), which indicates that it was not statistically significant in predicting Airbnb prices.
To improve the model, a second linear regression model was created by removing the neighbourhood_group_Queens variable. Removing variables with high p-values can simplify the model and reduce unnecessary complexity without reducing predictive performance.
After removing this variable, the model was re-fitted using the remaining predictors.
The results showed that the overall model performance remained nearly the same, with an R² value of approximately 0.098, indicating that the removed variable did not meaningfully contribute to prediction accuracy.
This step helped refine the model by keeping only meaningful predictors.
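The effect of dropping a non-significant predictor can be sketched as below. Instead of p-values, this sketch compares R² before and after removing a column, using synthetic data in which one column is pure noise, standing in for a dummy like neighbourhood_group_Queens.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: column 0 is informative, column 1 is pure noise,
# mimicking a non-significant dummy variable.
rng = np.random.default_rng(2)
X_full = rng.normal(size=(400, 2))
y = 50 + 10 * X_full[:, 0] + rng.normal(scale=5, size=400)

r2_full = LinearRegression().fit(X_full, y).score(X_full, y)
r2_reduced = LinearRegression().fit(X_full[:, :1], y).score(X_full[:, :1], y)

# Dropping the noise column barely changes R², as the report observed.
print(round(r2_full, 3), round(r2_reduced, 3))
```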
To improve model performance, extreme price outliers were removed.
Listings with:
price > 1000
were removed from the dataset.
These listings represented luxury properties that were very rare and not representative of typical Airbnb listings.
After removing these outliers, the linear regression model was rebuilt using the same independent variables.
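The outlier filter can be sketched in pandas as a single boolean selection; the five prices below are hypothetical.

```python
import pandas as pd

# Hypothetical prices, including one luxury outlier.
df = pd.DataFrame({"price": [80, 150, 220, 95, 10_000]})

# Keep only listings priced at or below $1,000, as described above.
filtered = df[df["price"] <= 1000]
print(len(filtered))
```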
With the outliers excluded, model performance improved substantially.
The updated model produced:
R² = 0.313
This means the model now explains about 31.3% of the variation in price, which is a major improvement compared to the original model.
Linear regression model after removing price outliers showing improved performance.
Model accuracy was evaluated using Mean Squared Error (MSE).
Results:
Training MSE: 9596
Testing MSE: 8433
Since the training and testing errors are similar, this indicates that the model is not overfitting and performs consistently on new data.
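The train-versus-test MSE check can be sketched as follows. The data is synthetic, so the printed error values are illustrative and do not match the 9596/8433 figures above; the point is the comparison between the two numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the cleaned listings.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 120 + 15 * X[:, 0] - 8 * X[:, 1] + rng.normal(scale=30, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

mse_train = mean_squared_error(y_tr, model.predict(X_tr))
mse_test = mean_squared_error(y_te, model.predict(X_te))

# Similar training and testing MSE suggests the model is not overfitting.
print(round(mse_train, 1), round(mse_test, 1))
```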
Scatter plot comparing actual and predicted prices using linear regression. The model captures general trends but shows variation in predictions.
Overall, removing extreme price outliers significantly improved model performance. However, the scatter plot shows that predictions still contain noticeable error. This suggests that linear regression captures general trends but may not fully represent complex relationships between variables and price.
A second prediction model was created using Decision Tree Regression.
Decision trees are useful because they can model nonlinear relationships, which linear regression may not capture effectively.
The same independent variables used in the linear regression model were also used for the decision tree model.
Different tree depths were tested to determine the best model complexity.
The following depths were evaluated:
max_depth = 3
max_depth = 5
max_depth = 7
max_depth = 9
Model performance was evaluated using Mean Squared Error (MSE).
Results:
Depth   Training MSE   Testing MSE
3       9875           8663
5       9120           8059
7       8246           7760
9       7345           7910
As the tree depth increased, training error decreased. However, when the depth increased to 9, testing error increased. This indicates overfitting, where the model learns the training data too closely and performs worse on new data.
Therefore:
max_depth = 7
was selected as the best decision tree model.
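The depth search above can be sketched as a loop over candidate max_depth values, selecting the depth with the lowest testing MSE. The data is synthetic and nonlinear, so the winning depth here need not be 7 as in the report.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear stand-in data.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(600, 2))
y = 50 * np.sin(X[:, 0]) + 10 * X[:, 1] ** 2 + rng.normal(scale=10, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit one tree per candidate depth and record its testing MSE.
results = {}
for depth in (3, 5, 7, 9):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = mean_squared_error(y_te, tree.predict(X_te))

best_depth = min(results, key=results.get)
print(results, best_depth)
```

Picking the depth by testing MSE (rather than training MSE) is what guards against the overfitting seen at depth 9.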
The decision tree model captured nonlinear relationships between variables and price more effectively than linear regression.
However, increasing the tree depth too much resulted in overfitting. The best performance occurred at:
max_depth = 7
Decision tree showing how features are split to predict Airbnb prices.
Both models were evaluated to determine which produced better predictions.
Results:
Model               Testing MSE
Linear Regression   8433
Decision Tree       7760
The decision tree model produced lower testing error, meaning it predicted prices more accurately than the linear regression model.
Based on the model performance results, the
Decision Tree Regression Model (max_depth = 7)
is recommended as the final model for predicting Airbnb listing prices.
This model produced the lowest prediction error while avoiding overfitting.