xgboost

study notes

How an xgboost tree is trained

The score returned from XGBoost depends on the objective function it uses.

If using binary:logistic, the loss defaults to log loss, which means the raw score (the sum of the leaf values) is in log-odds space. So sum up the leaf values z and transform to a probability with the sigmoid 1/(1+e^-z).
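A minimal sketch of that relationship (the data here is synthetic, purely for illustration): predicting with output_margin=True returns the raw log odds, and applying the sigmoid reproduces the probabilities that predict() returns by default.

    import numpy as np
    import xgboost as xgb

    # tiny synthetic binary dataset, just to illustrate
    X = np.random.rand(100, 4)
    y = (X[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)

    margin = bst.predict(dtrain, output_margin=True)  # raw log odds: sum of leaf values (+ global bias)
    prob = 1.0 / (1.0 + np.exp(-margin))              # sigmoid: 1/(1+e^-z)
    assert np.allclose(prob, bst.predict(dtrain), atol=1e-6)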

xgboost parameters

General parameters

    booster 

        gbtree (default), gblinear or dart. gbtree and dart use tree-based models, while gblinear uses a linear function.

Tree booster parameters

    learning_rate

        the step size of shrinkage. When adding an additional tree f(x) to the model, we usually add only a small portion of it,

        i.e. eta*f(x),

        which means we don't do full optimization when adding a new tree, but reserve some room for future rounds.

        This helps reduce overfitting, because it doesn't push everything to the limit, but leaves some space so that

        trees added later can adjust the weights more properly.

        The smaller the learning rate, the more conservative the training is.

        0.01 ~ 0.2 can be good. Default is 0.3.

        A small learning rate also means a large space for future rounds of constructing trees, so remember to set n_estimators (# trees) to a bigger value.
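        A toy sketch of the shrinkage idea (plain gradient boosting for squared loss, not XGBoost's actual internals): each round fits a tree to the residuals and adds only eta times its prediction.

            import numpy as np
            from sklearn.tree import DecisionTreeRegressor

            # synthetic regression data, purely illustrative
            X = np.random.rand(200, 3)
            y = X[:, 0] ** 2 + 0.1 * np.random.randn(200)

            eta, n_rounds = 0.1, 100
            pred = np.zeros_like(y)            # start from a zero prediction
            for _ in range(n_rounds):
                residual = y - pred            # negative gradient of squared loss
                tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
                pred += eta * tree.predict(X)  # add only a fraction of the new tree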

    min_child_weight

        Defines the minimum sum of instance weights (the hessian) required in a child. Default 1.

        Used to control overfitting. Higher values prevent the model from learning relations that might be highly specific to the particular sample selected for a tree.

        

    gamma

        the minimum loss reduction required to make a further split on a leaf node. It has to be at least 0.

        The useful values vary depending on the loss function and should be tuned.
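        For reference, the split gain XGBoost evaluates (from the XGBoost paper; G and H are the sums of gradients and hessians in the left/right child):

            \text{gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

        A split is kept only when this gain is positive, so a larger gamma prunes more aggressively.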

    

    max_depth

        the maximum depth of the trees.

        Small trees are preferred, because complex tree structures with many leaves tend to overfit the training data;

        usually hundreds of small trees are better than dozens of large trees.

        3~10; better to use 3~6. Default 6.

    max_leaves

        similar to max_depth, but constrains the tree size by the number of leaves.

        Usually no need to use this; use max_depth instead.

        #leaves = 2^depth for a complete binary tree.

    subsample

        when constructing an additional tree, don't use the full training data set; instead use a randomly selected subset of the rows.

        This helps reduce overfitting: because the data is not exactly the same each round, the training tends not to over-focus on specific

        training data points.

        usually choose 50% ~ 100% of the full training data set. 0.5 looks good.

    colsample_bytree

        similar to subsample, this selects a random subset of features (columns) to construct an additional tree.

        Also helps reduce overfitting, as there is less chance of over-focusing on specific features.

        0.5 looks good.

    colsample_bylevel

        the fraction of features selected at each level of splits. No need to use this parameter if subsample & colsample_bytree are used already.

    lambda 

        The weight of the L2 regularization term (i.e. Ridge). A bigger value reduces the complexity of the model.

        default 1, may try bigger values

        the regularization term = 1/2 * lambda * sum(wj^2) over the leaf weights wj

    alpha

        the weight of the L1 regularization term, i.e. Lasso.

        default 0. May try 1 and set lambda to 0.

  Note: max_depth and min_child_weight have the highest impact on the model outcome, so usually tune these two parameters first and then the rest.
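Putting the tree booster parameters together, a sketch of a starting-point parameter dict (values are the suggestions above, not tuned ones; learning_rate/reg_lambda/reg_alpha are aliases for eta/lambda/alpha):

    params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',  # a learning task parameter, included for completeness
        'learning_rate': 0.1,            # eta
        'max_depth': 5,
        'min_child_weight': 1,
        'gamma': 0,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'reg_lambda': 1,                 # L2
        'reg_alpha': 0,                  # L1
    }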

Learning task parameters

    objective 

        "reg:linear" --linear regression

        "reg:logistic" --logistic regression   

        "binary:logistic" --logistic regression for binary classification, output probability

    base_score

        the initial prediction score of all instances, global bias

        with a sufficient number of iterations, this value doesn't have much effect.

        default 0.5

    eval_metric

        automatically set according to the objective,

        e.g. 'rmse' (root mean squared error) for regression, 'error' (classification error rate) for classification, etc.

        normally no need to set this unless you are very sure.

review:

https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

parameters tuning

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Tuning parameters:

It's all about avoiding overfitting.

The number of trees (n_estimators) specifies how many trees will be built. Gut feel: given a certain amount of training data, building too many trees can easily overfit, even with a small learning rate, e.g. 0.01.

So the first thing is to determine a rough number of trees you will need given your training data.

Better to use xgboost.cv, as it has an early-stopping option, which scikit-learn's GridSearchCV does not offer.

In init_params, set the learning rate to 0.1 (relatively big), max_depth = 5, min_child_weight = 1, subsample = 0.9 and colsample_bytree = 0.8; cap the number of trees at 500 via num_boost_round.
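A sketch of that dict (xgtrain below is assumed to be an xgboost.DMatrix built from the training data):

    init_params = {
        'objective': 'binary:logistic',
        'learning_rate': 0.1,
        'max_depth': 5,
        'min_child_weight': 1,
        'subsample': 0.9,
        'colsample_bytree': 0.8,
    }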

cv_result = xgboost.cv(            # run cross-validation with the given parameters
    init_params,
    xgtrain,                       # training data as an xgboost.DMatrix
    num_boost_round=500,           # maximum number of boosting rounds
    nfold=5,                       # 5-fold cross-validation
    early_stopping_rounds=50,      # if no improvement for 50 rounds, stop early
    metrics='auc',
)

After the cv finishes, check cv_result to see at which round the performance stops improving; a rough number of trees is the number of rounds before cv stops. With early stopping, cv_result only keeps the rows up to the best round, so:

best_boost_round = cv_result.shape[0]

Next, tune max_depth and min_child_weight.

Here use scikit-learn's GridSearchCV to search for the best max_depth and min_child_weight (a sketch follows the grid below).

Set 'max_depth':[3,4,5,6,7,8,9,10]

      'min_child_weight':[0.5,1,2,3,4,5]  
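A hedged sketch of that search, using the sklearn wrapper xgboost.XGBClassifier (X and y are the hypothetical training arrays; the other parameters stay at the starting values from above):

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    param_grid = {
        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
        'min_child_weight': [0.5, 1, 2, 3, 4, 5],
    }
    search = GridSearchCV(
        XGBClassifier(learning_rate=0.1, n_estimators=best_boost_round,
                      subsample=0.9, colsample_bytree=0.8),
        param_grid, scoring='roc_auc', cv=5)
    search.fit(X, y)
    print(search.best_params_)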

Again use GridSearchCV (the same pattern as the sketch above) in all the following searches.

Set 'gamma':[i/10.0 for i in range(0,5)]

Next,

'subsample':[0.7,0.8,0.9, 0.95],

'colsample_bytree':[0.6,0.7,0.8,0.9]

Next, tune the L1 regularization, leaving the L2 regularization lambda at 1:

'reg_alpha':[0, 0.05, 0.1, 0.2, 0.5, 1] 

Finally, try different learning rates, 0.01 to 0.1.
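When lowering the learning rate, scale up the number of rounds and let early stopping pick the exact count. A sketch with hypothetical tuned values plugged in:

    final_params = dict(init_params, learning_rate=0.01,
                        max_depth=4, min_child_weight=2)  # tuned values go here
    cv_final = xgboost.cv(final_params, xgtrain,
                          num_boost_round=5000,           # ~10x rounds for 10x smaller learning rate
                          nfold=5, metrics='auc',
                          early_stopping_rounds=50)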

Notes from github

control model complexity

  max_depth, min_child_weight, gamma

robust to noise

  subsample, colsample_bytree

only care about the ranking order

  balance the positive and negative weights via scale_pos_weight (see the sketch after this list)

  use AUC as the evaluation metric
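A common heuristic for that weight (a sketch; y is the hypothetical label array and params the dict above):

    import numpy as np
    # ratio of negative to positive examples
    scale_pos_weight = np.sum(y == 0) / np.sum(y == 1)
    params['scale_pos_weight'] = scale_pos_weight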

care about predicting the right probability

  cannot re-balance the dataset

  setting max_delta_step to a finite number (say 1) will help convergence

to select ideal parameters, use xgboost.cv

  trust the score on the test folds

  if overfitting is observed, reduce the step size eta and increase nround at the same time. (The idea: a finer step size does not itself overfit; each tree then contributes less, so the fit moves more gradually and later trees can correct earlier ones. Overfitting comes from too many rounds at a given eta, and cv/early stopping picks the right nround.)