The house price data consist of 21613 rows and 21 columns.
Data exploration identified the columns as follows:
id : Unique ID for each home sold
date : Date of home sale
price : Price for each home sold
bedrooms : Number of bedrooms
bathrooms : Number of bathrooms
sqft_living : Square footage of the apartment's interior living space
sqft_lot : Square footage of the land space
floors : Number of floors
waterfront : A dummy variable for whether the apartment overlooks the waterfront
view : An index from 0 to 4 of how good the view from the property is: 0 = No view, 1 = Fair, 2 = Average, 3 = Good, 4 = Excellent
condition : An index from 1 to 5 of the condition of the apartment: 1 = Poor (worn out), 2 = Fair (badly worn), 3 = Average, 4 = Good, 5 = Very Good
grade : An index from 1 to 13, where 1-3 fall short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high-quality level of construction and design
sqft_above : The square footage of the interior housing space that is above ground level
sqft_basement : The square footage of the interior housing space that is below ground level
yr_built : The year the house was initially built
yr_renovated : The year of the house's last renovation
zipcode : The ZIP code area in which the house is located
lat : Latitude
long : Longitude
sqft_living15 : The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 : The square footage of the land lots of the nearest 15 neighbors
1.0 Understanding the Data
As no description accompanied the dataset, we began by building a general understanding of the data and identifying the input and output variables. Exploration showed that the data consist of 21613 rows and 21 columns.
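A first look of this kind can be sketched in base R as follows. The file name `kc_house_data.csv` is an assumption, since the source does not name the file, so the runnable part uses a small made-up data frame containing only a few of the listed columns.

```r
# Hypothetical file name; the original notebook does not state it.
# house <- read.csv("kc_house_data.csv")

# Toy stand-in with made-up values so the sketch runs on its own.
house <- data.frame(
  id = c(1, 2, 3),
  price = c(221900, 538000, 180000),
  bedrooms = c(3L, 3L, 2L),
  waterfront = c(0L, 0L, 1L)
)

dims <- dim(house)                        # c(rows, columns)
col_names <- names(house)                 # column names to map to descriptions
missing_per_col <- colSums(is.na(house))  # count of missing values per column

print(dims)
str(house)                                # column types at a glance
print(missing_per_col)
```

On the full dataset, `dim()` would be expected to report the 21613 rows and 21 columns stated above.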
2.0 Data Cleaning
At this stage, we identify which columns can serve as input variables and which column is the output. The dplyr package is installed before running the cleaning process. The steps involved are:
Selecting the columns
Filtering the data
Arranging the data
Removing redundant data
Converting integer columns to factors
Mutating columns
Grouping the data by category
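The steps above can be sketched as a single dplyr pipeline. This is a minimal illustration, not the original notebook's code: the data frame holds made-up values, and the choice of columns, the `price > 0` filter rule, and the derived `price_per_sqft` column are all assumptions introduced for the example.

```r
# install.packages("dplyr")  # run once if the package is not yet installed
library(dplyr)

# Toy stand-in for the house price data (made-up values, one duplicated sale).
house <- data.frame(
  id = c(1, 2, 2, 3, 4),
  date = c("20141013T000000", "20141209T000000", "20141209T000000",
           "20150225T000000", "20150218T000000"),
  price = c(221900, 538000, 538000, 180000, 0),
  bedrooms = c(3L, 3L, 3L, 2L, 4L),
  waterfront = c(0L, 0L, 0L, 1L, 0L),
  condition = c(3L, 3L, 3L, 4L, 5L),
  sqft_living = c(1180, 2570, 2570, 770, 1960)
)

cleaned <- house %>%
  # Selecting the columns (assumed subset)
  select(id, price, bedrooms, waterfront, condition, sqft_living) %>%
  # Filtering the data (illustrative rule: drop non-positive prices)
  filter(price > 0) %>%
  # Arranging the data, most expensive first
  arrange(desc(price)) %>%
  # Removing redundant data (exact duplicate rows)
  distinct() %>%
  # Converting integer codes to factors, then mutating a new column
  mutate(
    waterfront = factor(waterfront, levels = c(0, 1), labels = c("no", "yes")),
    condition = factor(condition, levels = 1:5),
    price_per_sqft = price / sqft_living  # hypothetical derived column
  )

# Grouping the data by category and summarising each group
by_waterfront <- cleaned %>%
  group_by(waterfront) %>%
  summarise(n = n(), mean_price = mean(price))

print(cleaned)
print(by_waterfront)
```

Coded columns such as `waterfront`, `view`, `condition`, and `grade` are stored as integers but describe categories, which is why converting them to factors before grouping is a natural step.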