This project was also completed for a statistical learning course at Queen's University. It is taken from a Kaggle competition and written in R.
Over the past several years, bike sharing has become increasingly popular around the world. A bike sharing system lets people rent a bike from one location and return it at a different one. In this project, I use the R programming language to analyze the bike share data provided by the Kaggle competition and Capital Bikeshare. The goal is to forecast hourly bike rental demand in the Capital Bikeshare program in Washington, D.C.
I will be using the following dataset in this project: https://www.kaggle.com/c/bike-sharing-demand/data. It includes three files: a training dataset, a test dataset, and a sample submission file. A brief explanation of each:
Training dataset: used for training. It contains only the first 19 days of each month, from 2011 to 2012.
Test dataset: used for testing. It contains the 20th day through the end of each month, from 2011 to 2012.
I will start by cleaning the training dataset and performing some preprocessing steps.
library(lubridate)
library(dplyr)
train_data <- read.csv('C:/Users/Simon/Documents/R/train_bikeshare.csv', header = TRUE) ## General view of training data
test_data <- read.csv('C:/Users/Simon/Documents/R/test_bikeshare.csv', header = TRUE)
Notice that casual plus registered equals count. Therefore, I decide to drop these two columns.
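A quick sanity check of that relationship before dropping the columns (it should return TRUE):
all(train_data$casual + train_data$registered == train_data$count)
With that confirmed, I convert season, weather, and holiday/workingday to factors and add several columns derived from datetime: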
preprocessing <- function(old_data){
# drop the 'casual' and 'registered' variables by keeping only these columns
cf = c("datetime","season","holiday","workingday",
"weather","temp","atemp","humidity","windspeed")
new_data = old_data[, cf]
# add more variables derived from datetime
new_data$hour <- hour(ymd_hms(old_data$datetime)) #hour is an integer
new_data$hourf <- as.factor(new_data$hour)
new_data$year <- as.factor(year(old_data$datetime))
new_data$month <- as.factor(month(old_data$datetime))
new_data$mday <- as.factor(mday(old_data$datetime))
new_data$wday <- ifelse(wday(old_data$datetime)==1, 7, wday(old_data$datetime)-1) # shift so Monday = 1, ..., Sunday = 7
new_data$wday <- as.factor(new_data$wday)
new_data$season = as.factor(new_data$season)
# weather: merge the rare category 4 into category 3
new_data$weather[new_data$weather == 4] = 3
new_data$weather = as.factor(new_data$weather)
# convert holiday and workingday to factors
new_data$holiday = as.factor(new_data$holiday)
new_data$workingday = as.factor(new_data$workingday)
return(new_data)
}
train_new <- preprocessing(train_data)
train_new$count <- train_data$count
test_new <- preprocessing(test_data)
test_new <- select(test_new, -c(hourf))
In this section, I perform some data visualization and feature engineering on both datasets.
First, let's check whether there are any missing values in the training and test datasets:
which(is.na(train_new))
## integer(0)
which(is.na(test_new))
## integer(0)
Both commands return integer(0), so there are no missing values in either dataset.
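An equivalent per-column check:
colSums(is.na(train_new))
colSums(is.na(test_new))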
[Plots omitted: a boxplot to detect outliers; a histogram and several line charts to better understand the dataset; and a correlation plot.]
First, notice the inconsistency of variance in count across hours, so I begin with a log transformation of the response variable.
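A minimal sketch of this transformation (the exact point in the pipeline where it is applied is not shown in this excerpt; the prediction loops later undo it with exp(pred) - 1):
train_new$count <- log(train_new$count + 1) # log(count + 1) stabilizes the variance and keeps zero counts finite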
Note that there are many 0 values for windspeed in this dataset. A plausible explanation is that these were originally missing values that were later replaced with 0, which will cause problems when predicting the response. The issue occurs in both the training and test datasets, so I will use a popular imputation method: keep the rows with nonzero windspeed, train a Random Forest model on them, and then predict windspeed for the rows where it is 0.
library(randomForest)
set.seed(3)
predictWind <- function(old_data){
cf = c("datetime", "season","holiday","workingday",
"weather","temp","atemp","humidity","windspeed", "hour", "year", "month", "mday", "wday")
new_data <- old_data[, cf]
new_data_win0 <- new_data[which(new_data["windspeed"] == 0), ] # rows with windspeed 0, to be imputed
new_data_winN0 <- new_data[which(new_data["windspeed"] != 0), ] # rows with observed windspeed, used for model fitting
winModel <- randomForest(windspeed ~ season+weather+humidity+temp+month+year+atemp, data = new_data_winN0, importance = TRUE, ntree= 500)
rf.pred <- predict(winModel, newdata = new_data_win0)
new_data_win0["windspeed"] <- rf.pred
new_final <- rbind(new_data_win0, new_data_winN0)
new_final <- new_final[order(new_final$datetime), ] # restore chronological order
return(new_final)
}
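The function is then applied to both datasets. A sketch of this step, under the assumption that the train_final_bin and test_final_bin objects used below are derived from these imputed frames (any additional processing behind the _bin suffix is not shown in this excerpt):
train_final <- predictWind(train_new)
train_final$count <- train_new$count # reattach the response; predictWind keeps only the covariates
test_final <- predictWind(test_new)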
In this section, I fit three models suited to this dataset. Before that, I use the batch function provided by Prof. Song to find the rows corresponding to the start and end of each month.
batches <- function(dataset){
batch_index <- matrix(nrow = 2, ncol = 24) # two years, total of 24 months
cnt <- 1
tt_index <- 1:dim(dataset)[1]
for (y in unique(dataset$year)){ # find rows for corresponding month
for (m in unique(dataset$month)) {
tmp = tt_index[dataset$year == y & dataset$month == m]
batch_index[1,cnt] <- min(tmp)
batch_index[2,cnt] <- max(tmp)
cnt = cnt + 1
}
}
return(batch_index)
}
train_bat <- batches(train_final_bin)
test_bat <- batches(test_final_bin)
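The prediction loops below fill in submission data frames (submission_mars, new_submission, submission_gbm) whose creation is not shown. A plausible setup, assuming the Kaggle sample submission file sits alongside the other data files (the path and file name here are assumptions):
submission_mars <- read.csv('C:/Users/Simon/Documents/R/sampleSubmission.csv', header = TRUE)
new_submission <- submission_mars # will hold the Random Forest predictions
submission_gbm <- submission_mars # will hold the GBM predictions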
Reference: https://bradleyboehmke.github.io/HOML/mars.html
I will start with a linear-style model and use it as a baseline. I selected multivariate adaptive regression splines (MARS) for the following reasons:
1. Looking at the dataset, there is no evidence that the covariates have a linear relationship with the response.
2. The algorithm is a convenient way to fit non-linear data: it builds a piecewise fit by placing knots throughout the range of the data (see the small illustration after this list).
3. Unlike linear regression, MARS does not require the data to be standardized.
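MARS builds its fit from hinge functions, each of which contributes only on one side of a knot (the knot value 15 here is made up for demonstration):
hinge <- function(x, knot) pmax(0, x - knot) # the basic MARS basis function max(0, x - c)
hinge(c(10, 15, 20), 15) # returns 0 0 5: the term is active only above the knot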
library(earth)
library(caret)
for (i in 1:dim(test_bat)[2]){
train_mars2 <- train_final_bin[train_bat[1,1]:train_bat[2,i], c(2:10,15,16)] # use all historical data up to month i
test_mars2 <- test_final_bin[test_bat[1,i]:test_bat[2,i], c(2:10,15)]
hyper_grid_mars <- expand.grid(# hyperparameter tuning using train function in caret
degree = c(1:6),
nprune = seq(2,20, length.out = 10)
)
tuned_mars <- train(
x = subset(train_mars2, select = -count),
y = train_mars2$count,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10), # use 10-fold cv to train
tuneGrid = hyper_grid_mars
)
best_para <- tuned_mars$bestTune
best_nprune <- best_para[, "nprune"] # get the best parameter for nprune
best_degree <- best_para[, "degree"]# get the best parameter for degree
mars_fit <- earth(count ~ ., data = train_mars2, degree = best_degree, nprune = best_nprune, minspan = 1, endspan = 1)
pred_mars <- predict(mars_fit, test_mars2)
submission_mars[test_bat[1,i]:test_bat[2,i], "count"] <- exp(pred_mars) - 1 # undo the log(count + 1) transform
}
I tuned degree and nprune in this model. degree is the maximum degree of interaction allowed between hinge terms; nprune is the maximum number of terms retained in the pruned model.
This model gives a prediction error of 0.54 on Kaggle. I will use this score as the baseline for the algorithms that follow.
Reference: https://topepo.github.io/caret/model-training-and-tuning.html
I selected Random Forest for the following reasons:
1. The algorithm maintains accuracy on large datasets and helps guard against overfitting.
2. It can handle large datasets with high dimension; in this case, the dimension is 12.
3. The bagging procedure builds multiple trees and combines them into a stronger model.
library(ranger)
library(dplyr)
library(doParallel) # provides registerDoParallel
registerDoParallel(cores = 4)
for (i in 1:dim(test_bat)[2]){
month_train <- train_final_bin[train_bat[1,1]:train_bat[2,i], c(2:10,15,16)] # use all historical data
month_test <- test_final_bin[test_bat[1,i]:test_bat[2,i], c(2:10,15)]
hyper_grid <- expand.grid( # hyperparameter tuning by GridSearch
mtry = seq(2,8), # values for mtry
node_size = seq(3,6), # values for node size
max_depth = seq(2,7), # values for max depth
OOB_RMSE = 0 # this is used for find the minimum RMSE value
)
for (j in 1:nrow(hyper_grid)){ # loop though all hyperparameters
rf <- ranger( # train model
formula = count ~ .,
data = month_train,
num.trees = 650, # use a large number of trees, I selected 650 trees
mtry = hyper_grid$mtry[j], # record each mtry values
min.node.size = hyper_grid$node_size[j], # record minimum node size values
max.depth = hyper_grid$max_depth[j], # record max depth values
num.threads = 4,
seed = 3
)
hyper_grid$OOB_RMSE[j] <- sqrt(rf$prediction.error) # record the OOB RMSE for this parameter combination
}
ind <- which.min(hyper_grid[, "OOB_RMSE"]) # find the index for the minimum prediction error
optimal_mtry <- hyper_grid[ind, 1] # find the set of optimal parameters in that index
node_size <- hyper_grid[ind, 2]
max_dep <- hyper_grid[ind, 3]
rf <- ranger(count ~ ., data = month_train, num.trees = 650, mtry = optimal_mtry, min.node.size = node_size, sample.fraction = 0.7, max.depth = max_dep, seed = 3, num.threads = 4) # fit final model
pred <- predict(rf, month_test)
new_submission[(test_bat[1,i]:test_bat[2,i]), 'count'] <- exp(pred$predictions)-1
}
For Random Forest, the best score achieved is 0.47, which is much better than MARS. The sample fraction is the proportion of observations used to train each tree: a lower value may result in high bias, while a higher value may result in high variance. The default is 63.2%; I use 70% here.
mtry is an important parameter: it is the number of variables randomly sampled as split candidates at each node. For regression, a common rule of thumb is (number of covariates) / 3. In this case, the mtry value that minimizes the prediction error is 5.
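As a side check, ranger can report variable importance if the final model is refit with importance tracking. A sketch, reusing the tuned parameters (month_train here holds the training slice from the last loop iteration):
rf_imp <- ranger(count ~ ., data = month_train, num.trees = 650, mtry = optimal_mtry,
min.node.size = node_size, max.depth = max_dep, importance = "impurity", seed = 3)
sort(rf_imp$variable.importance, decreasing = TRUE) # impurity-based importance, largest first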
Reference: https://www.listendata.com/2015/07/gbm-boosted-models-tuning-parameters.html, STAT457 Notes from Prof. Song
I selected GBM for the following reasons:
1. It is an ensemble method that improves model accuracy.
2. In gradient boosting, models are added sequentially to the ensemble, and each new model corrects the residual errors of its predecessor (see the toy sketch after this list).
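A toy, self-contained sketch of this residual-correction loop, using rpart for the shallow trees and simulated data (none of this is part of the actual pipeline):
library(rpart)
set.seed(1)
x <- data.frame(x1 = runif(500, 0, 10)) # simulated covariate
y <- sin(x$x1) + rnorm(500, sd = 0.2) # simulated response
y_hat <- rep(mean(y), length(y)) # start from a constant prediction
eta <- 0.1 # learning rate (shrinkage)
for (m in 1:50) {
r <- y - y_hat # residuals left by the current ensemble
tree_m <- rpart(r ~ x1, data = cbind(x, r), maxdepth = 2) # fit a shallow tree to the residuals
y_hat <- y_hat + eta * predict(tree_m, x) # shrink the correction and add it
}
sqrt(mean((y - y_hat)^2)) # training RMSE decreases as trees are added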
I will tune the learning rate, interaction depth, n.minobsinnode, and bag fraction in this model.
library(gbm)
registerDoParallel(cores = 4)
for (j in 1:dim(test_bat)[2]){
train <- train_final_bin[train_bat[1,1]:train_bat[2,j], c(2:10,15,16)] # get train and test data for GBM
test <- test_final_bin[test_bat[1,j]:test_bat[2,j], c(2:10,15)]
hyper_grid_gbm <- expand.grid( # hyperparameter tuning by GridSearch
shrinkage = c(0.01, 0.03, 0.05, 0.1, 0.3), # learning rate
interaction.depth = c(1, 3, 5), # interaction depth
n.minobsinnode = c(2:6),# min number of observation allowed in terminal node
bag.fraction = c(.65, .8, 1), # subsampling fraction (stochastic gradient boosting)
optimal_trees = 0,
min_RMSE = 0
)
for(i in 1:nrow(hyper_grid_gbm)) {
set.seed(123)
gbm.tune <- gbm( # loop through all parameter to find the optimal set
formula = count ~ .,
distribution = "gaussian",
data = train,
n.trees = 2500,
interaction.depth = hyper_grid_gbm$interaction.depth[i],
shrinkage = hyper_grid_gbm$shrinkage[i],
n.minobsinnode = hyper_grid_gbm$n.minobsinnode[i],
bag.fraction = hyper_grid_gbm$bag.fraction[i],
train.fraction = .80, # use 80% training data for train, and 20% for validation(used to compute validation error)
n.cores = 4,
verbose = FALSE
)
# record the minimum validation error and the corresponding optimal number of trees
hyper_grid_gbm$optimal_trees[i] <- which.min(gbm.tune$valid.error)
hyper_grid_gbm$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))
}
ind_gbm <- which.min(hyper_grid_gbm[, "min_RMSE"]) # Get the minimum validation error
shrinkage <- hyper_grid_gbm[ind_gbm, 1] # get each optimal parameter
interaction <- hyper_grid_gbm[ind_gbm, 2]
min_node <- hyper_grid_gbm[ind_gbm, 3]
fraction <- hyper_grid_gbm[ind_gbm, 4]
trees <- hyper_grid_gbm[ind_gbm, 5]
gbm1 <- gbm( # fit final model
formula = count ~ .,
data = train,
distribution = "gaussian",
n.trees = trees,
interaction.depth = interaction,
shrinkage = shrinkage,
n.minobsinnode = min_node,
bag.fraction = fraction,
train.fraction = 1,
n.cores = 4,
verbose = FALSE
)
pred_gbm <- predict(gbm1, test, n.trees = gbm1$n.trees) # prediction
submission_gbm[(test_bat[1,j]:test_bat[2,j]), 'count'] <- exp(pred_gbm)-1 # record
}
For GBM, more trees usually lead to a better score; however, because of my computer's limitations, I tried up to 2500 trees.
The optimal interaction depth is consistently 3 to 5, and the optimal learning rate falls between 0.05 and 0.3, as this dataset is relatively large. A train fraction of 0.80 means the first 80% of the data is used for training and the remainder is used to compute an out-of-sample estimate of the loss function; this serves a purpose similar to cross-validation.
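Finally, the predictions can be written out for upload to Kaggle (a sketch; the output path follows the pattern of the input paths and is an assumption):
write.csv(submission_gbm, 'C:/Users/Simon/Documents/R/submission_gbm.csv', row.names = FALSE)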
For more details, you can refer to my GitHub page.
Citation:
@article{mz2021,
title = "Kaggle Competition: Washington D.C. Bike Share Demand",
author = "Zhou, Meng",
year = "2021",
url = "https://sites.google.com/view/mengzhou/project/kaggle-project-1"
}