Using XGBoost and Hyperopt in a Kaggle Comp

10/19/19

Using the Machine Learning model XGBoost effectively with optimal hyperparameters from Hyperopt in my first Kaggle competition on predicting future sales.

Code available here

My kaggle profile can be found here

As of the time of writing I am in the top 15% sitting at 632/4454.

What is Kaggle?

Kaggle is a data scientist's playground.

It is a website that hosts data science competitions. It is a platform on which data scientists from across the world learn, collaborate and compete. Whether you are someone who is learning data science, a seasoned veteran or a business leader, there is always something you can learn.

Kaggle competitions are (mostly) created by private companies. For example, Lyft currently has a competition on 3D object detection for autonomous vehicles [1]. The competitions provide a win-win situation. They benefit the host companies by getting data scientists from all around the world to work on their data and business problems, and they give those companies a unique opportunity to introduce themselves to some of the best data scientists in the world for potential recruitment. For the data scientists there is cash to be won but, more importantly, there is the thrill of the competition: bragging rights, finding new contacts (h2o.ai is made up of some of the best Kaggle competitors) and learning something new. Experience with Kaggle comps also looks very good on your resume.

Kaggle competitions work by asking you/your team to provide solutions to well-defined problems, which are then compared against (hidden) answers. Once you submit your solution you get a ranking on the leaderboard, and roughly the top three scores win prize money.

How to get the most out of Kaggle?

Someone who is learning data science:

  • Set aside time every week and browse the top notebooks from the competitions (old and new).
  • Think about why you are learning data science and which problems you would like to solve. There are competitions on every facet of data science including: time-series analysis, recommendations, marketing, computer vision, natural language processing and many more. Use the search box and type your domain of interest. This will bring up a list of recommended notebooks (to explore other people's code) as well as competitions.

Seasoned veteran:

  • You know what to do.

Business leader:

  • Look at competitions created by your competitors. For example, if you work at BP, look at the competition hosted by Statoil [2]. You can also type a company name into the search bar to see what comes up.
  • Think about a business problem which you are currently trying to solve. For example, your company may be looking at ways to cut costs for its direct mail marketing campaigns and would like to increase conversion. For this you could browse the Springleaf Marketing Response competition [3]. The goal of this competition is to use data science to predict which customers will respond to a direct mail offer. I would recommend looking at the overview and discussion sections of the competition and if you are brave: the notebooks.

Predict Future Sales competition

I have entered the Predict Future Sales competition [4] hosted by 1C Company. This competition accompanies the Coursera course 'How to Win a Data Science Competition: Learn from Top Kagglers' (for reference my course notes can be found here). The goal of this competition is to predict future sales for multiple shop and product combinations. There are 204,211 shop-product combinations and you have to predict how many items will be sold in the next month, given three years of data.
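To make the target concrete, here is a minimal sketch of how a monthly target could be assembled from the daily training records. The file name and column names (sales_train.csv, date_block_num, shop_id, item_id, item_cnt_day) are my recollection of the competition's data page rather than anything shown in this post, so treat them as an assumption to verify:

import pandas as pd

# Daily sales records (~3 years); column names assumed from the competition data page
sales = pd.read_csv('sales_train.csv')

# Aggregate to one row per (month, shop, item): total items sold that month.
# The competition asks for this quantity for the month after the data ends.
monthly = (sales.groupby(['date_block_num', 'shop_id', 'item_id'])['item_cnt_day']
           .sum()
           .reset_index()
           .rename(columns={'item_cnt_day': 'item_cnt_month'}))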

Structuring machine learning projects

There are a number of steps involved in machine learning projects:

  1. Understand the business objective.
  2. Get the data.
  3. Clean the data and undertake feature engineering.
  4. Choose a model.
  5. Choose a metric to optimize for the model.
  6. Find the best hyper-parameters for the model.
  7. Make a prediction.
  8. Deploy model.

In this blog post I will focus on 4 and 6: choosing a model and finding the best hyper-parameters.

XGBoost and Hyperopt

While they sound like special moves from the video game Street Fighter, they are actually short for eXtreme Gradient Boosting and hyper-parameter optimization respectively. Both are explained below.

XGBoost

XGBoost is an optimized distributed gradient boosting library that can be used to solve many data science problems in a fast and accurate way. It is known to produce very good results compared to other machine learning models across many tasks [5], and it has been used in many winning solutions to Kaggle comps. The method it uses is known as Gradient Boosted Decision Trees (GBDT).

First, let's understand what a decision tree is. A decision tree is a technique used for classification (to find which group a sample belongs to) or regression (to predict a continuous value). An example is shown in the image below: a decision tree classifying types of flowers based on their petal length and petal width. The machine learning aspect of this model works by minimizing the Gini impurity, which measures how 'pure' a node is, i.e. whether a node contains samples belonging to the same class.

A more advanced machine learning model is a Random Forest, which is an ensemble of decision trees trained on different random subsets of the training data. A random forest uses the bagging method to combine the predictions from many decision trees: for classification the most frequent prediction (mode) is used, and for regression the ensemble average.

A slightly different technique to random forests is gradient boosting, which refers to additive training: each new decision tree corrects the errors of the trees fitted so far. For reference, tree construction uses a greedy algorithm that finds the optimum split at the current level; it does not check whether that split leads to the lowest possible impurity several layers down. This finds a reasonably good solution, but not necessarily the best one.

Decision tree of the iris dataset to determine which species a flower belongs to. Taken from Geron (2019)
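To make those differences concrete, here is a minimal sketch using scikit-learn's iris data (my own illustration; the competition model itself uses XGBoost) that fits a single decision tree, a bagged random forest and a gradient-boosted ensemble:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Petal length and petal width only, as in the figure above
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X[:, 2:], y, random_state=0)

# A single tree: greedy splits that minimize Gini impurity one level at a time
tree = DecisionTreeClassifier(max_depth=2).fit(X_tr, y_tr)

# Bagging: many trees on random subsets of the data, combined by majority vote
forest = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)

# Boosting: trees added one at a time, each correcting the previous ones
boosted = GradientBoostingClassifier(n_estimators=100).fit(X_tr, y_tr)

for name, clf in [('tree', tree), ('forest', forest), ('boosted', boosted)]:
    print(name, clf.score(X_te, y_te))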

The Python API for XGBoost is given here and there are detailed examples here. To learn more about GBDTs I recommend reading these slides by Tianqi Chen, the author of XGBoost, or his paper. For further reading I recommend Friedman (2001).

My model for the kaggle comp is as follows:

from xgboost import XGBRegressor

model = XGBRegressor(
    max_depth=17,
    n_estimators=2110,
    colsample_bytree=0.5,
    min_child_weight=330,
    subsample=0.8,    
    eta=0.2,
    objective='reg:squarederror',
    tree_method='gpu_hist')

model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds=10)

A detailed list of the model parameters can be found here. Below, I will explain the parameters used in my model:

  • max_depth - Maximum tree depth - Used to control overfitting.
  • n_estimators - Number of trees to fit.
  • colsample_bytree - Fraction of columns (features) subsampled when building each tree. This acts similarly to dropout in CNNs by regularizing the model.
  • min_child_weight - Minimum number of samples (more precisely, minimum sum of instance hessian) required in each node - a larger value makes the model more conservative and helps control overfitting.
  • subsample - Fraction of the training data sampled for each tree; values below 1.0 help prevent overfitting.
  • eta - Learning rate/shrinkage. Ideally a low value to get the best results, but this will make training take longer.
  • objective - Objective of the learning. reg:squarederror means regression with squared loss.
  • tree_method - The tree construction algorithm. gpu_hist means a GPU implementation of the fast histogram optimized approximate greedy algorithm.
  • eval_metric - Evaluation metric for validation data. rmse means root-mean-square-error.
  • early_stopping_rounds - This prevents overfitting by stopping training if the evaluation metric does not improve after N rounds. In my case N is set to 10.
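Once the model has been fitted with early stopping, making the test-set prediction is a one-liner. A minimal sketch, assuming X_test has been prepared with the same features as X_train; the clip to [0, 20] mirrors the validation scoring in the Hyperopt code below:

# Predict next-month sales for each shop-item pair and clip to the
# same [0, 20] range used when scoring the validation set below
Y_test = model.predict(X_test).clip(0, 20)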

Hyperopt

Hyperopt is a tool for distributed asynchronous hyper-parameter optimization. What do those words mean? Distributed means parallel computing: operations that can use more than one core of your CPU (central processing unit) to speed up computation. Asynchronous is the opposite of synchronous: it means running co-routines concurrently so that you do not have to wait for other tasks to finish (to understand this further you are welcome to look at my cfanalytics project, where I scraped CrossFit Open results from the CrossFit website asynchronously: blog; code). Hyper-parameters are the parameters of the model above which can be tweaked to get a good fit. Optimization means finding the hyper-parameters of the model that give the best predictions.

Traditional methods for finding the best hyper-parameters include manual, random, or grid search. Hyperopt instead uses Bayesian optimization: a probabilistic (surrogate) model of the objective function is created and updated with new evidence (the posterior), and this surrogate is easier (and faster) to optimize than the actual objective function. The algorithm hyperopt uses is called the Tree-structured Parzen Estimator (TPE), which works by drawing sample hyper-parameters from a prior distribution and evaluating them using two densities fitted to previous results (those below and those above a chosen loss threshold). TPE uses Bayes' rule and has the advantage of 'remembering' previous hyper-parameters, so the search improves after every iteration, as opposed to grid search.
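As a toy illustration of TPE before the competition code below (the quadratic and its bounds are arbitrary choices of mine), hyperopt's fmin can be pointed at any function that returns a loss:

from hyperopt import fmin, hp, tpe, Trials

# Minimize (x - 3)^2 over x in [-5, 10]; each new candidate x is proposed
# by TPE using the history of evaluations stored in the Trials object
trials = Trials()
best = fmin(
    fn=lambda x: (x - 3) ** 2,
    space=hp.uniform('x', -5, 10),
    algo=tpe.suggest,
    max_evals=100,
    trials=trials)
print(best)  # something close to {'x': 3.0}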

The source code for hyperopt is given here and examples can be found here. A blog post by Will Koehrsen [6] gives a reasonably digestible explanation of hyperopt. For further reading I recommend Bergstra et al (2011; 2013).

My code for the kaggle comp is as follows:

import numpy as np
from math import sqrt

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials, space_eval
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Choose hyperparameter domain to search over
space = {
    'max_depth': hp.choice('max_depth', np.arange(10, 25, 1, dtype=int)),
    'n_estimators': hp.choice('n_estimators', np.arange(1000, 10000, 10, dtype=int)),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1.0, 0.1),
    'min_child_weight': hp.choice('min_child_weight', np.arange(250, 350, 10, dtype=int)),
    'subsample': hp.quniform('subsample', 0.7, 0.9, 0.1),
    'eta': hp.quniform('eta', 0.1, 0.3, 0.1),

    # Fixed (not searched) settings passed straight through to XGBRegressor
    'objective': 'reg:squarederror',
    'tree_method': 'gpu_hist',
    'eval_metric': 'rmse',
}

def score(params):
    # Train with the sampled hyper-parameters and return the validation RMSE
    model = XGBRegressor(**params)

    model.fit(X_train, Y_train, eval_set=[(X_train, Y_train), (X_valid, Y_valid)],
              verbose=False, early_stopping_rounds=10)
    # Clip predictions to the competition's [0, 20] target range before scoring
    Y_pred = model.predict(X_valid).clip(0, 20)
    score = sqrt(mean_squared_error(Y_valid, Y_pred))
    print(score)
    return {'loss': score, 'status': STATUS_OK}
    
def optimize(trials, space):
    # Pass the Trials object to fmin so that every evaluation is recorded
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=1000)
    return best

trials = Trials()
best_params = optimize(trials, space)

# Return the best parameters
space_eval(space, best_params)

An explanation of the most important parameters and functions is given below:

  • hp - To choose which type of distribution to sample from, e.g. choice, uniform, quniform.
  • tpe - Tree-structured Parzen Estimator - the algorithm described briefly above.
  • fmin - The main function. Give it a space to search over and a score to optimize. max_evals describes how many times to run the algorithm.
  • Trials - This is the 'database' which stores the completed hyper-parameters and their scores (see the sketch after this list for how to inspect it).
  • space_eval - Use this function to recover the best hyper-parameters. It converts the indexed hyper-parameters into values which you can plug straight into the model, as I did for the XGBRegressor model above.
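Once the search has finished, the Trials object can be inspected directly. A short sketch using hyperopt's trials attributes; the printed values are illustrative only:

# Validation RMSE of every completed evaluation, in the order they ran
print(trials.losses())

# Full record of the best trial: its loss and its (indexed) hyper-parameter values
print(trials.best_trial['result']['loss'])
print(trials.best_trial['misc']['vals'])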

See also

https://github.com/keras-team/keras-tuner

https://scikit-optimize.github.io/

https://blog.dask.org/2019/09/30/dask-hyperparam-opt

https://github.com/dask/dask-xgboost

References

[1] https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles

[2] https://www.kaggle.com/c/statoil-iceberg-classifier-challenge

[3] https://www.kaggle.com/c/springleaf-marketing-response

[4] https://www.kaggle.com/c/competitive-data-science-predict-future-sales

[5] https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

[6] https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

Bergstra et al (2011), Algorithms for Hyper-Parameter Optimization, https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

Bergstra et al (2013). Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures, http://proceedings.mlr.press/v28/bergstra13.pdf

Friedman (2001) Greedy function approximation: A gradient boosting machine, The Annals of Statistics, 29(5), pp 1189-1232 https://projecteuclid.org/euclid.aos/1013203451

Geron (2019) Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/

https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python

https://towardsdatascience.com/an-introductory-example-of-bayesian-optimization-in-python-with-hyperopt-aae40fff4ff0