The house price data consist of 21613 rows and 21 columns.

From the data exploration, the columns were identified as follows:

  1. id : Unique ID for each home sold

  2. date : Date of home sale

  3. price : Price for each home sold

  4. bedrooms : Number of bedrooms

  5. bathrooms : Number of bathrooms

  6. sqft_living : Square footage of the home's interior living space

  7. sqft_lot : Square footage of the land space

  8. floors : Number of floors

  9. waterfront : A dummy variable for whether the home overlooks the waterfront or not

  10. view : An index from 0 to 4 of how good the view of the property was (0 = No view, 1 = Fair, 2 = Average, 3 = Good, 4 = Excellent)

  11. condition : An index from 1 to 5 on the condition of the home (1 = Poor, worn out; 2 = Fair, badly worn; 3 = Average; 4 = Good; 5 = Very Good)

  12. grade : An index from 1 to 13, where 1-3 fall short of building construction and design standards, 7 is an average level of construction and design, and 11-13 indicate a high-quality level of construction and design

  13. sqft_above : The square footage of the interior housing space that is above ground level

  14. sqft_basement : The square footage of the interior housing space that is below ground level

  15. yr_built : The year the house was initially built

  16. yr_renovated : The year of the house’s last renovation

  17. zipcode : The zipcode area the house is in

  18. lat : Latitude

  19. long : Longitude

  20. sqft_living15 : The square footage of interior housing living space for the nearest 15 neighbors

  21. sqft_lot15 : The square footage of the land lots of the nearest 15 neighbors






1.0 Understanding the Data

As no description was provided with the dataset, we began by establishing a general understanding of the data and identifying the input and output variables. Upon exploration, we found that the data consist of 21613 rows and 21 columns.
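A first look of this kind can be sketched in base R as follows. This is only an illustration: the `houses` data frame is a small made-up stand-in with a few of the columns listed above, and the file name in the comment is hypothetical.

```r
# With the real data one would load the file first, for example:
# houses <- read.csv("house_data.csv")  # hypothetical file name

# Toy stand-in with a few of the 21 columns; all values are made up.
houses <- data.frame(
  id       = c(1, 2, 3),
  price    = c(300000, 450000, 250000),
  bedrooms = c(3, 4, 2),
  zipcode  = c(98001L, 98002L, 98003L)
)

data_dim <- dim(houses)   # number of rows, number of columns
col_names <- names(houses)  # column names, used to identify each variable

print(data_dim)
print(col_names)
str(houses)               # data type and first values of every column
```

On the full dataset, `dim()` would be expected to report the 21613 rows and 21 columns noted above.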


2.0 Data Cleaning

At this stage, we identify which columns could serve as input variables and which column is to be the output. The dplyr package is installed before running the cleaning process. The steps involved are:

  1. Selecting the columns

  2. Filtering the data

  3. Arranging the data

  4. Removing redundant data

  5. Converting integer columns into factors

  6. Mutating (adding or transforming columns)

  7. Grouping the data according to its categories.
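The seven steps above can be sketched with dplyr verbs as follows. This is a minimal illustration rather than the exact pipeline used: the toy `houses` data frame, the filter rule, the derived `price_per_sqft` column and the choice of `condition` as the grouping variable are all assumptions, and "redundant data" is taken here to mean duplicated rows.

```r
library(dplyr)

# Toy stand-in for the house price data; all values are made up.
houses <- data.frame(
  id          = c(1, 2, 2, 3, 4),
  price       = c(300000, 450000, 450000, 250000, 800000),
  bedrooms    = c(3, 4, 4, 2, 5),
  sqft_living = c(1200, 2000, 2000, 900, 3500),
  waterfront  = c(0L, 0L, 0L, 0L, 1L),
  condition   = c(3L, 4L, 4L, 2L, 5L)
)

cleaned <- houses %>%
  # 1. Select the columns of interest
  select(id, price, bedrooms, sqft_living, waterfront, condition) %>%
  # 2. Filter rows (hypothetical rule: keep positive prices and bedrooms)
  filter(price > 0, bedrooms > 0) %>%
  # 3. Arrange from most to least expensive
  arrange(desc(price)) %>%
  # 4. Remove redundant data (assumed to mean duplicated rows)
  distinct() %>%
  mutate(
    # 5. Convert integer-coded columns into factors
    waterfront = factor(waterfront),
    condition  = factor(condition),
    # 6. Mutate: add a derived column (hypothetical example)
    price_per_sqft = price / sqft_living
  )

# 7. Group by a categorical column and summarise each group
summary_by_condition <- cleaned %>%
  group_by(condition) %>%
  summarise(n = n(), mean_price = mean(price), .groups = "drop")

print(cleaned)
print(summary_by_condition)
```

Converting `waterfront` and `condition` to factors before grouping lets R treat them as categories rather than as numeric quantities.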