Accurately predicting house prices is more than a technical challenge—it's a real-world problem with big implications for lenders, buyers, investors and policy makers. For this project, I explored how five different machine learning models performed on this task, using both traditional housing data and enriched location-based features.
The goal? To build a model that not only fits historical data well but can also scale to support smarter property decisions in the real world.
The base housing dataset included features such as:
Median income
Housing age
Rooms and population per household
Latitude and longitude
To strengthen the model’s spatial awareness and realism, I engineered additional features (see the sketch after this list), including:
Crime index per city (joined externally using geolocation)
Demographic enrichment (e.g. population density, income brackets)
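Below is a minimal sketch of this enrichment step. The file and column names are assumptions (they follow the common California-style housing schema), and crime_index.csv is a hypothetical stand-in for the external crime source; a nearest-centroid lookup approximates the geolocation join described above.

```python
import pandas as pd
from scipy.spatial import cKDTree

housing = pd.read_csv("housing.csv")  # assumed file name

# Per-household ratios engineered from the raw counts
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Geolocation join: match each record to the nearest city centroid and pull
# that city's crime index. crime_index.csv (city, latitude, longitude,
# crime_index) is a hypothetical stand-in for the external source; plain
# lat/long distance is a rough approximation that works at city scale.
crime = pd.read_csv("crime_index.csv")
tree = cKDTree(crime[["latitude", "longitude"]].to_numpy())
_, nearest = tree.query(housing[["latitude", "longitude"]].to_numpy())
housing["crime_index"] = crime["crime_index"].to_numpy()[nearest]
```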
The full pipeline, from data ingestion to evaluation, was built in Python using the pandas, scikit-learn, matplotlib and seaborn libraries.
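As a rough illustration of the front end of that pipeline, the sketch below (continuing from the enriched frame above) one-hot encodes the categorical ocean proximity field, which is where the Ocean Proximity (Inland) predictor mentioned later comes from, and carves out the held-out test set. The target column name is an assumption.

```python
from sklearn.model_selection import train_test_split

# One-hot encode the categorical field; yields e.g. ocean_proximity_INLAND
housing = pd.get_dummies(housing, columns=["ocean_proximity"])

# median_house_value as the target is an assumed column name
X = housing.drop(columns=["median_house_value"])
y = housing["median_house_value"]

# 20% held-out test set, matching the evaluation setup described below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```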
I trained and tuned the following five models (a minimal training sketch follows the list):
Linear Regression – A benchmark to gauge base performance.
Random Forest Regressor – To handle non-linearities and interactions.
XGBoost Regressor – For optimised, gradient-boosted decision trees.
LightGBM Regressor – Lightweight gradient boosting, tuned for speed and scalability.
KNN Regressor – Simpler but spatially aware through proximity-based logic.
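A minimal sketch of the training step, continuing from the split above. The hyperparameter values shown are illustrative defaults, not the tuned settings reported in the results.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Illustrative baseline settings; the tuned configurations were found separately
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=42),
    "LightGBM": LGBMRegressor(n_estimators=200, learning_rate=0.1, random_state=42),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```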
All models were evaluated using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R² score on a held-out test set (20%).
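The evaluation loop can be sketched as follows, reusing the fitted models dictionary from the previous block; RMSE is derived as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

for name, model in models.items():
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = np.sqrt(mse)  # RMSE derived from MSE
    r2 = r2_score(y_test, pred)
    print(f"{name}: MAE={mae:,.0f}  MSE={mse:,.0f}  RMSE={rmse:,.0f}  R²={r2:.3f}")
```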
Model Performance Summary
The model comparison showed that:
Linear Regression explained 64.1% of the variance (R² = 0.641), offering a decent baseline but with wide prediction spread, indicating limited precision.
Random Forest improved significantly with an R² of 0.719 and a low cross-validated MSE standard deviation (0.06), suggesting strong and stable performance (see the cross-validation sketch after this list). Key predictors: Median Income (importance: 0.33) and Ocean Proximity (Inland) (0.31).
XGBoost initially scored an R² of 0.747, which increased to 0.757 after hyperparameter tuning. While accurate, its MSE variability was slightly higher than Random Forest's.
LightGBM performed best overall, achieving an R² of 0.764 post-tuning. However, like XGBoost, its model stability across folds (MSE SD: 0.08) was slightly behind Random Forest.
KNN Regressor lagged behind the ensemble models with an R² of 0.686, only marginally outperforming Linear Regression.
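For context, here is one way the cross-validated MSE spread cited above could be computed. The fold count (5) and the scale of the target are assumptions, so the printed magnitudes depend on the units the target is expressed in.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Fold-to-fold MSE spread as a stability check; 5 folds is an assumption
for name, model in models.items():
    scores = cross_val_score(
        model, X_train, y_train, scoring="neg_mean_squared_error", cv=5
    )
    mse_scores = -scores  # flip sign: scikit-learn returns negated MSE
    print(f"{name}: CV MSE mean={mse_scores.mean():.3f}  SD={mse_scores.std():.3f}")
```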
Conclusion
Ensemble and boosting methods clearly outperformed simpler models, with LightGBM and XGBoost delivering the most accurate predictions. Random Forest stood out for its consistent performance and valuable insights into feature importance. Overall, this project highlights how enriched location and demographic features—paired with the right model—can significantly boost the predictive power of real estate valuation tools.
Key Takeaways
Model robustness matters: Tree-based models (Random Forest, XGBoost, LightGBM) outperformed the others, handling complex feature interactions without heavy preprocessing.
Speed vs performance: LightGBM was slightly faster during training but required more tuning to match, and ultimately edge past, XGBoost’s accuracy.
Baseline models still matter: Linear Regression helped validate improvements from more advanced techniques.
Location features are critical: Geo-coordinates and crime indices were among the most important predictors, reinforcing how place-based features drive price.
This project helped sharpen my understanding of how different models behave when applied to real estate datasets—and how small changes in data quality and feature design can drive major gains in prediction accuracy. It’s also reinforced the importance of reproducibility, clear pipelines, and thoughtful performance evaluation when deploying machine learning in real-world settings.
Read the full report on Medium