Capstone Project
Done by: Tea Hong Liang
The project covers scraping the data, cleaning it, modelling, and deploying the model.
To identify the factors/drivers that affect HDB resale prices in Singapore.
To scrape and engineer additional features from online public datasets that might also influence resale prices.
The HDB resale price data was downloaded from Data.gov.sg, containing 197,012 resale transactions from January 2017 to December 2024.
The names of schools, supermarkets, hawker centres, shopping malls, parks and MRT stations were downloaded or scraped from Data.gov.sg and passed to a function that uses the OneMap.sg API to get their coordinates (latitude and longitude) and postal codes. These coordinates were then passed to other functions that use the geopy package to compute the distance between locations. This gives, for each flat, the distance to the nearest amenity of each type as well as the number of amenities of each type within a 2 km radius.
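The two features described above (nearest distance and count within 2 km) can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses a plain haversine formula so that it runs without extra packages, whereas the project used geopy (whose `geopy.distance.geodesic` gives slightly different, ellipsoid-based distances). The function names and the sample coordinates are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def amenity_features(flat, amenities, radius_km=2.0):
    """Return (distance to the nearest amenity, number of amenities within radius_km)."""
    distances = [haversine_km(flat, amenity) for amenity in amenities]
    return min(distances), sum(d <= radius_km for d in distances)


# Hypothetical coordinates, for illustration only.
flat = (1.3000, 103.8000)
malls = [(1.3100, 103.8000), (1.3500, 103.8000)]
nearest_km, count_within_2km = amenity_features(flat, malls)
```

Calling `amenity_features` once per amenity type (schools, parks, MRT stations and so on) would give one distance column and one count column per type.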
From the Empathy Map, information about a flat's location and its proximity to particular places is essential to a buyer's decision. The location factor was therefore incorporated into the existing datasets to see its effect on resale prices. A Singapore location file (the SGZip dataset) was obtained from Kaggle ('https://www.kaggle.com/'); it had previously been retrieved from the OneMap Singapore website ('https://docs.onemap.sg/'). Its address information contains the block and street, which can be matched against the HDB dataset. Some of this data did not tally with the HDB dataset and had to be modified. The HDB dataset's street_name uses abbreviations, whereas the SGZip dataset spells the words out in full, e.g. AVE vs Avenue, ST vs Street, RD vs Road, BK vs Bukit, TG vs Tanjong. The block and street_name are also stored separately in both datasets and need to be combined into a single address, so that the postal code can be identified and linked to coordinates.
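A minimal sketch of this address clean-up is shown below. The mapping contains only the abbreviations mentioned above and would need to be extended after inspecting the real data; the function names are hypothetical, and upper-casing both datasets before matching is an assumption.

```python
# Illustrative mapping taken from the examples in the text; not a complete list.
ABBREVIATIONS = {
    "AVE": "AVENUE",
    "ST": "STREET",
    "RD": "ROAD",
    "BK": "BUKIT",
    "TG": "TANJONG",
}


def expand_street_name(street_name):
    """Replace each abbreviated word with its full form, leaving other words as-is."""
    words = street_name.upper().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def build_address(block, street_name):
    """Combine block and expanded street name into one address string for matching."""
    return f"{block.strip()} {expand_street_name(street_name)}"
```

For example, `build_address("406", "ANG MO KIO AVE 10")` gives `"406 ANG MO KIO AVENUE 10"`. Word-by-word replacement is a simplification: an abbreviation such as ST can also stand for "Saint" in some street names, so the matches are worth spot-checking.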
The highest average resale price is for Executive flats in Queenstown, at S$1,060,092.31; the lowest is for 1 Room flats in Bukit Merah, at S$208,002.05.
1 Room Flats are only transacted in the town of Bukit Merah.
Multi-Generation flats are sold in towns of Tampines, Yishun and Bishan.
Central Area has the highest average resale price for the 5 Room Flat among the 26 towns.
Sengkang is the town with the highest number of resale flats sold, at 16,252 units.
Bukit Timah is the town with the lowest number of resale flats sold, at 488 units.
The most-sold flat type is the 4 Room flat, at 83,258 units.
The least-sold flat types are the 1 Room flat (76 units) and the Multi-Generation flat (80 units).
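The counts and averages above are group-by aggregations over the transaction records. A minimal sketch using only the standard library is given below; the column names `town`, `flat_type` and `resale_price` are assumed, and the rows in the usage note are made-up values rather than real transactions.

```python
from collections import defaultdict


def summarise(transactions, key):
    """Group rows by `key` and return units sold and mean resale price per group."""
    totals = defaultdict(lambda: [0, 0.0])  # group -> [count, sum of prices]
    for row in transactions:
        entry = totals[row[key]]
        entry[0] += 1
        entry[1] += row["resale_price"]
    return {
        group: {"units": count, "mean_price": total / count}
        for group, (count, total) in totals.items()
    }
```

With toy rows such as `{"town": "SENGKANG", "flat_type": "4 ROOM", "resale_price": 400000.0}`, calling `summarise(rows, "town")` gives the per-town figures and `summarise(rows, "flat_type")` the per-flat-type figures.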
From the bar chart of the number of resale flats sold per lease commencement year (data from past HDB resale transactions, Jan 2017 to Nov 2024):
The number of resale flats sold varies with the lease commencement year.
The two lease commencement years with the highest number of resale flats sold are 1985 and 2015. The latter is possibly because many of those flats reached their Minimum Occupation Period (MOP) during the period covered.
Bukit Timah town has the highest average resale price while Yishun town has the lowest average resale price.
Towns in the Central region (Bukit Timah, Bishan, Bukit Merah, Central Area and Queenstown) generally have higher average resale prices than other towns.
Bukit Timah has the highest median resale price while Ang Mo Kio has the lowest median resale price.
The blue dotted line represents the nation-wide average resale price.
A town above the line has flats that are, on average, more expensive than the nation-wide average resale price.
Multi-generation flat type has the highest average resale price, while 1 room flat type has the lowest average resale price.
There is little difference between the average and median resale prices for any flat type.
The chart shows that resale prices increase as the storey range increases.
For 1 Room to 5 Room flats, resale prices in 2018 and 2019 were lower than in 2017. In 2020 they rose again, and for 4 Room and 5 Room flats they overtook the 2017 level.
For Executive flats, the resale price fluctuated from 2017 to 2020: it went up in 2018, down in 2019 and up again in 2020.
For Multi-Generation flats, the resale price rose from 2017 to 2019 but fell below the 2017 level in 2020.
From 2021, all flat types show a similar increasing trend in resale prices.
The decline in resale prices for 1 Room to Executive flats in 2019 could be due to weaker demand, although it predates the Covid-19 pandemic, which reached Singapore only in early 2020.
An R² of 0.9029 indicates a very strong linear relationship between resale price and floor area, and the upward slope of the fit shows that it is positive. The p-value is reported as 0.0, which means that the correlation between resale price and floor area is statistically significant.
From the scatter plot, we can see that flats with a larger floor area are sold at higher resale prices.
An R² of 0.9882 indicates a very strong linear relationship between resale price and storey range, again with a positive slope. The p-value is reported as 0.0, which means that the correlation between resale price and storey range is statistically significant.
From the scatter plot, we can see that flats on higher floors are sold at higher resale prices.
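For reference, the quantities quoted for these scatter plots come from a simple least-squares line fit. The sketch below is an assumed, standard-library version of that calculation (the function name `linear_fit` is hypothetical). It returns the slope, the intercept, the Pearson correlation r and R², and shows that the direction of the relationship comes from the sign of the slope or of r, not from R² itself.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r, r_squared), where r is the Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r, r * r
```

If SciPy is available, `scipy.stats.linregress` returns the same slope, intercept and r value together with a p-value for the slope.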
From this chart, we can see that the relationship is negative: flats farther from Dhoby Ghaut MRT station (Central) have lower resale prices.
For most amenities (e.g., hawker centres, parks, malls, MRT stations and supermarkets), distance is negatively correlated with resale price, meaning flats closer to these amenities are sold at higher resale prices.
Distance to MRT stations shows the strongest association with resale prices, reflecting the importance of convenient public transport access.
Unlike other amenities, distance from schools shows a weak positive correlation, suggesting that proximity to schools may not significantly influence resale prices.
The numbers of amenities within a 2 km radius mostly show weak relationships with resale prices. The numbers of hawker centres, parks, MRT stations and malls appear to have a slightly positive effect on resale prices, whereas the numbers of schools and supermarkets may have a slightly negative or negligible impact.
The results suggest that amenities play a role in property resale prices, but the influence is not very significant.
An R² of 0.3304 indicates only a moderate positive linear relationship between the two variables, so the consumer price index (CPI) does not appear to affect resale prices much.
Correlation between Factors and Resale Prices
Floor Area (0.59): A larger floor area is strongly positively correlated with higher resale prices.
Lease Commencement Year (0.37): Newer properties tend to have higher resale prices.
Number of MRT Stations within 2 km (0.14): Proximity to more MRT stations has a slight positive impact on resale prices.
Distance to Hawker Centers (-0.22): Properties closer to hawker centers tend to have higher resale prices.
Distance to Dhoby Ghaut (-0.22): Being closer to Dhoby Ghaut station is associated with higher resale prices.
Schools within 2 km and Lease Commencement Year (0.26): Newer properties are often located in areas with more schools.
Parks within 2 km and Lease Commencement Year (0.44): Newer properties are also located near more parks.
Distance to Hawker Centers and Hawker Centers within 2 km (-0.66): Properties closer to hawker centers have more hawker centers within a 2 km radius.
Distance to MRT Stations and MRT Stations within 2 km (-0.52): Properties closer to MRT stations have more MRT stations within a 2 km radius.
Distance to Supermarkets and Supermarkets within 2 km (-0.53): Properties closer to supermarkets have more supermarkets within a 2 km radius.
The heatmap provides a visual representation of the strength and direction of the relationships between the various factors and resale prices. The positive correlations indicate that a larger floor area, a newer lease commencement year and a larger number of nearby amenities are associated with higher property value.
The negative correlations, which involve distances, highlight the significance of convenience: properties closer to key amenities tend to have higher resale prices.
This analysis helps real estate stakeholders understand which are the key drivers of property value and prioritize features that enhance desirability and market value.
The Geolocation map shows the MRT/LRT stations, schools, shopping malls and HDB resale flats sold.
Sengkang and Punggol are popular with buyers, possibly because they are accessible by LRT and have many schools and shopping malls nearby.
From the Density Map, we can see that Sengkang and Punggol are very popular with buyers, as the transaction density over those areas is shown in deep red.
R-squared (R²) score and Mean Absolute Error
Random Forest Regression has the best scores for both R² and Mean Absolute Error.
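As a reminder of what the two scores measure, the sketch below implements them from their usual definitions using only the standard library. It is an illustration rather than the project's evaluation code, which may have used a library such as scikit-learn. A higher R² (closer to 1) and a lower Mean Absolute Error both indicate predictions closer to the actual prices.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

Unlike the R² of a simple line fit, this R² can be negative when a model predicts worse than always guessing the mean price.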
1. Predict the price using the 3 types of Regressions – Linear, Random Forest and Decision Tree.
2. From the prediction below, the price closest to the actual price is from Random Forest regression with categorical features.
From this project, we identified the top 3 drivers of the resale price of an HDB flat: the size of the flat, its storey and its location. Size and storey were identified as main drivers because they have a strong positive linear relationship with resale price, which means that as size or storey increases, the price increases. As for location, flats nearer to the Central Business District (CBD) are sold at higher resale prices. Buyers are willing to pay a higher price for flats nearer to amenities such as parks, hawker centers, MRT stations and malls, but if the amenities are outside the 2 km radius their influence is not very significant.
For the prediction section, we found that Random Forest and Decision Tree regression with categorical features both predict prices closer to the actual values than the other regressions.
This is my first Capstone project for the Associate Data Analyst Course. My favorite visualization is the heatmap of HDB resale transactions, which shows the number of transactions by location. The redder an area on the map, the more transactions took place there. We can also zoom in and out of the map to see the transactions more clearly.
The challenge for this project was that it had only 7 days before submission, and the first 2 days went to selecting the topic, which I was stuck on.
After deciding on the topic, I started off aimlessly because I felt that I was running out of time. After wasting some time this way, I decided to approach my trainer, Bishmer, for advice. I managed to catch up and complete the project on schedule. Stress handling and time management are important.