xgboost

study notes

How an xgboost tree is trained

The score returned from XGBoost depends on the objective function it uses.

If using binary:logistic, the loss defaults to log loss, which means the raw score (the sum of the leaf values) is in log-odds space. So sum up the leaf values z and transform to a probability with the sigmoid 1/(1+e^-z).
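A minimal sketch of that relationship (the data here is synthetic, purely for illustration): predicting with output_margin=True returns the raw log odds, and applying the sigmoid reproduces the probabilities that predict() returns by default.

    import numpy as np
    import xgboost as xgb

    # tiny synthetic binary dataset, just to illustrate
    X = np.random.rand(100, 4)
    y = (X[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    bst = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)

    margin = bst.predict(dtrain, output_margin=True)  # raw log odds: sum of leaf values (+ global bias)
    prob = 1.0 / (1.0 + np.exp(-margin))              # sigmoid: 1/(1+e^-z)
    assert np.allclose(prob, bst.predict(dtrain), atol=1e-6)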

xgboost parameters

General parameters

    booster 

        gbtree (default), gblinear or dart. gbtree and dart use tree-based models, while gblinear uses a linear function.

Tree booster parameters

    learning_rate

        the step size of shrinkage. When adding an additional tree f(x) to the model, we usually add only a small portion of it,

        i.e. eta*f(x),

        which means we don't do full optimization when adding a new tree, but reserve some room for future rounds.

        This helps reduce overfitting, because it doesn't push everything to the limit, but leaves some space so that

        trees added later can adjust the weights more properly.

        The smaller the learning rate, the more conservative the training is.

        0.01 ~ 0.2 can be good. Default is 0.3.

        A small learning rate also means a large space for future rounds of constructing trees, so remember to set n_estimators (# trees) to a bigger value.
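        A toy sketch of the shrinkage idea (plain gradient boosting for squared loss, not XGBoost's actual internals): each round fits a tree to the residuals and adds only eta times its prediction.

            import numpy as np
            from sklearn.tree import DecisionTreeRegressor

            # synthetic regression data, purely illustrative
            X = np.random.rand(200, 3)
            y = X[:, 0] ** 2 + 0.1 * np.random.randn(200)

            eta, n_rounds = 0.1, 100
            pred = np.zeros_like(y)            # start from a zero prediction
            for _ in range(n_rounds):
                residual = y - pred            # negative gradient of squared loss
                tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
                pred += eta * tree.predict(X)  # add only a fraction of the new tree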

    min_child_weight

        Defines the minimum sum of instance weights (the hessian) required in a child. Default 1.

        Used to control overfitting. Higher values prevent the model from learning relations that might be highly specific to the particular sample selected for a tree.

        

    gamma

        the minimum loss reduction required to make a further split on a leaf node. It has to be at least 0.

        The useful values vary depending on the loss function and should be tuned.
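        For reference, the split gain XGBoost evaluates (from the XGBoost paper; G and H are the sums of gradients and hessians in the left/right child):

            \text{gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

        A split is kept only when this gain is positive, so a larger gamma prunes more aggressively.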

    

    max_depth

        the maximum depth of the trees.

        Small trees are preferred, because complex tree structures with many leaves tend to overfit the training data;

        usually hundreds of small trees are better than dozens of large trees.

        3~10; better to use 3~6. Default 6.

    max_leaves

        similar to max_depth, but constrains the tree size by the number of leaves.

        Usually no need to use this; use max_depth instead.

        #leaves = 2^depth for a complete binary tree.

    subsample

        when constructing an additional tree, don't use the full training data set; instead use a randomly selected subset of the rows.

        This helps reduce overfitting: because the data is not exactly the same each round, the training tends not to over-focus on specific

        training data points.

        usually choose 50% ~ 100% of the full training data set. 0.5 looks good.

    colsample_bytree

        similar to subsample, this selects a random subset of features (columns) to construct an additional tree.

        Also helps reduce overfitting, as there is less chance of over-focusing on specific features.

        0.5 looks good.

    colsample_bylevel

        the fraction of features selected at each level of splits. No need to use this parameter if subsample & colsample_bytree are used already.

    lambda 

        The weight of the L2 regularization term (i.e. Ridge). A bigger value reduces the complexity of the model.

        default 1, may try bigger values

        the regularization term = 1/2 * lambda * sum(wj^2) over the leaf weights wj

    alpha

        the weight of the L1 regularization term, i.e. Lasso.

        default 0. May try 1 and set lambda to 0.

  Note: max_depth and min_child_weight have the highest impact on the model outcome, so usually tune these two parameters first and then the rest.
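Putting the tree booster parameters together, a sketch of a starting-point parameter dict (values are the suggestions above, not tuned ones; learning_rate/reg_lambda/reg_alpha are aliases for eta/lambda/alpha):

    params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',  # a learning task parameter, included for completeness
        'learning_rate': 0.1,            # eta
        'max_depth': 5,
        'min_child_weight': 1,
        'gamma': 0,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'reg_lambda': 1,                 # L2
        'reg_alpha': 0,                  # L1
    }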

Learning task parameters

    objective 

        "reg:linear" --linear regression

        "reg:logistic" --logistic regression   

        "binary:logistic" --logistic regression for binary classification, output probability

    base_score

        the initial prediction score of all instances, global bias

        with a sufficient number of iterations, this value doesn't have much effect.

        default 0.5

    eval_metric

        automatically set according to the objective,

        e.g. 'rmse' (root mean squared error) for regression, 'error' (classification error rate) for classification, etc.

        normally no need to set this unless you are very sure.

review:

https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

parameters tuning

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Tuning parameters:

It's all about avoiding overfitting.

The number of trees (n_estimators) specifies how many trees will be built. Gut feel: given a certain amount of training data, building too many trees can easily overfit, even with a small learning rate, e.g. 0.01.

So the first thing is to determine a rough number of trees you will need given your training data.

Better to use xgboost.cv, as it has an early-stopping option, which scikit-learn's GridSearchCV does not offer.

In init_params, set the learning rate to 0.1 (relatively big), max_depth = 5, min_child_weight = 1, subsample = 0.9 and colsample_bytree = 0.8; cap the number of trees at 500 via num_boost_round.
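A sketch of that dict (xgtrain below is assumed to be an xgboost.DMatrix built from the training data):

    init_params = {
        'objective': 'binary:logistic',
        'learning_rate': 0.1,
        'max_depth': 5,
        'min_child_weight': 1,
        'subsample': 0.9,
        'colsample_bytree': 0.8,
    }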

cv_result = xgboost.cv(            # run cross-validation with the given parameters
    init_params,
    xgtrain,                       # training data as an xgboost.DMatrix
    num_boost_round=500,           # maximum number of boosting rounds
    nfold=5,                       # 5-fold cross-validation
    early_stopping_rounds=50,      # if no improvement for 50 rounds, stop early
    metrics='auc',
)

After the cv finishes, check cv_result to see at which round the performance stops improving; a rough number of trees is the number of rounds before cv stops. With early stopping, cv_result only keeps the rows up to the best round, so:

best_boost_round = cv_result.shape[0]

Next, tune max_depth and min_child_weight.

Here use scikit-learn's GridSearchCV to search for the best max_depth and min_child_weight (a sketch follows the grid below).

Set 'max_depth':[3,4,5,6,7,8,9,10]

      'min_child_weight':[0.5,1,2,3,4,5]  
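A hedged sketch of that search, using the sklearn wrapper xgboost.XGBClassifier (X and y are the hypothetical training arrays; the other parameters stay at the starting values from above):

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    param_grid = {
        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
        'min_child_weight': [0.5, 1, 2, 3, 4, 5],
    }
    search = GridSearchCV(
        XGBClassifier(learning_rate=0.1, n_estimators=best_boost_round,
                      subsample=0.9, colsample_bytree=0.8),
        param_grid, scoring='roc_auc', cv=5)
    search.fit(X, y)
    print(search.best_params_)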

Again use GridSearchCV (the same pattern as the sketch above) in all the following searches.

Set 'gamma':[i/10.0 for i in range(0,5)]

Next,

'subsample':[0.7,0.8,0.9, 0.95],

'colsample_bytree':[0.6,0.7,0.8,0.9]

Next, tune the L1 regularization, leaving the L2 regularization lambda at 1:

'reg_alpha':[0, 0.05, 0.1, 0.2, 0.5, 1] 

Finally, try different learning rates, 0.01 to 0.1.
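When lowering the learning rate, scale up the number of rounds and let early stopping pick the exact count. A sketch with hypothetical tuned values plugged in:

    final_params = dict(init_params, learning_rate=0.01,
                        max_depth=4, min_child_weight=2)  # tuned values go here
    cv_final = xgboost.cv(final_params, xgtrain,
                          num_boost_round=5000,           # ~10x rounds for 10x smaller learning rate
                          nfold=5, metrics='auc',
                          early_stopping_rounds=50)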

Notes from github

control model complexity

  max_depth, min_child_weight, gamma

robust to noise

  subsample, colsample_bytree

only care about the ranking order

  balance the positive and negative weights via scale_pos_weight (see the sketch after this list)

  use AUC as the evaluation metric
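A common heuristic for that weight (a sketch; y is the hypothetical label array and params the dict above):

    import numpy as np
    # ratio of negative to positive examples
    scale_pos_weight = np.sum(y == 0) / np.sum(y == 1)
    params['scale_pos_weight'] = scale_pos_weight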

care about predicting the right probability

  cannot re-balance the dataset

  setting max_delta_step to a finite number (say 1) will help convergence

to select ideal parameters, use xgboost.cv

  trust the score on the test folds

  if overfitting is observed, reduce the step size eta and increase nround at the same time. (The idea: a finer step size does not itself overfit; each tree then contributes less, so the fit moves more gradually and later trees can correct earlier ones. Overfitting comes from too many rounds at a given eta, and cv/early stopping picks the right nround.)