In this dataset, each row describes a Boston town or suburb. There are 506 rows and 13 attributes (features) plus a target column (PRICE). The data come from the Boston house-price dataset of Harrison, D. and Rubinfeld, D.L., based on their highly cited paper 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. A PDF copy of the original paper is available via this link. The analysis here borrows heavily from this Kaggle Python notebook. The dataset includes the following variables:
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT percentage lower status of the population
PRICE Median value of owner-occupied homes in $1000's
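As a quick sketch of loading the data, assuming it sits in a local CSV file named boston.csv with the columns listed above (recent scikit-learn releases no longer ship the load_boston loader):

import pandas as pd

# Assumed: 'boston.csv' is a local copy of the dataset, with the 13
# feature columns listed above plus the PRICE target column.
df = pd.read_csv("boston.csv")

print(df.shape)      # expected: (506, 14)
print(df.head())     # first few towns/suburbs
print(df.describe()) # summary statistics per column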
In the Google Colabs below we use just two modelling techniques: (1) OLS and (2) Random Forest. The OLS technique, R-squared, and error estimation are explained here. Random Forest is explained simply here.
Random Forests are an ensemble learning method for classification and regression that constructs a multitude of decision trees by random sampling and resampling, using the random subspace method. Random forests often prove themselves to be both reliable and effective, and are now part of any modern predictive modeller's toolkit. Thanks to this 'wisdom of crowds' effect, random forests can frequently outperform OLS and logistic regression. Of course, better performance on out-of-sample data (test data) is the key metric. There are situations, however, where a linear regression can do better than a random forest.
Random forest works for both categorical and numerical input variables, which may somewhat obviate the need to spend time one-hot encoding or labelling data. It may be less sensitive to missing data, and it can handle outliers to a certain extent. Overall, it is likely to save time that would otherwise go into data cleaning and pre-modelling (a non-trivial part of any data science pipeline). You might not have to scale your data or apply monotonic transformations (log, etc.), and you might not have to fiddle around removing outliers or agonize over which outliers to leave in.
You can incorporate categorical features without much fuss, and the automatic partitioning of the data is easily tolerated. A lot of the time you will simply register higher accuracy with RF. It is not unusual to spend an afternoon re-engineering your dataset and still get an R-squared in the mid-70s with OLS, while with typically less fuss, using "from sklearn.ensemble import RandomForestRegressor", you may score a neat R-squared in the low 80s. Read the following Quora entry for more views.
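As a minimal sketch of that comparison, assuming df is the DataFrame loaded above with PRICE as the target column (the train/test split itself is covered in more detail below, and n_estimators=100 is simply the scikit-learn default, not a tuned value):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Assumed: df is the DataFrame sketched above, with PRICE as the target.
X = df.drop(columns="PRICE")
y = df["PRICE"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# Fit an OLS baseline and a random forest on the same training data
ols = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=4).fit(X_train, y_train)

# .score() returns R-squared; scoring on the test set keeps the comparison out of sample
print("OLS test R^2:", ols.score(X_test, y_test))
print("RF test R^2:", rf.score(X_test, y_test))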
In the Google Colab above we train and test on the Boston House Price Dataset. Train/test splitting is a method to measure the accuracy of your model and to establish its robustness out of sample, not just in sample. The Python code employed:
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
applies a train/test split. This divides the complete 506 rows of data into a random 70/30 split: 70% of the rows form a training set and 30% a testing set. You train the model using the training set and test it using the testing set. Training the model means fitting it, which we initially do here using OLS. In the video below, complete with Colab and Excel spreadsheet, we replicate the steps for splitting into training and testing sets. Each step is reproduced in the spreadsheet, and R-squared values are replicated and presented as measures of accuracy. The Excel link can be found here or here.
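As a minimal sketch of that accuracy check, assuming the X_train/X_test/y_train/y_test variables from the split above, R-squared can be reported both in sample and out of sample:

from sklearn.linear_model import LinearRegression

# Fit ("train") the OLS model on the 70% training portion only
ols = LinearRegression().fit(X_train, y_train)

# Compare R-squared in sample vs. out of sample;
# a large gap between the two would suggest overfitting
print("Train R^2:", ols.score(X_train, y_train))
print("Test R^2:", ols.score(X_test, y_test))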
Pivot tables in Excel provide a practical means of summarizing a larger database or spreadsheet in a more compact table. Pandas offers similar functionality through its pivot_table function. Below we focus on explaining the pandas pivot_table function and how to use it on the Boston House Price dataset.
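As a hedged sketch (the groupings below, CHAS and a rounded RM column, are illustrative choices, not taken from the original notebook):

import pandas as pd

# Assumed: df is the Boston DataFrame loaded earlier.
# Round the average room count so it can serve as a row grouping.
df["RM_rounded"] = df["RM"].round()

# Median house price by room count (rows) and Charles River dummy (columns),
# analogous to an Excel pivot table
pivot = pd.pivot_table(df, values="PRICE", index="RM_rounded",
                       columns="CHAS", aggfunc="median")
print(pivot)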