## Introduction to Boosted Trees

XGBoost is short for "Extreme Gradient Boosting", where the term "Gradient Boosting" was proposed in the paper *Greedy Function Approximation: A Gradient Boosting Machine*, by Friedman. Gradient boosted trees have been around for a while, and there are a lot of materials on the topic. This tutorial explains boosted trees in a self-contained and principled way using the elements of supervised learning. We think this explanation is cleaner, more formal, and motivates the model used in XGBoost.

## Elements of Supervised Learning

XGBoost is used for supervised learning problems, where we use the training data (with multiple features) $x_i$ to predict a target variable $y_i$. Before we dive into trees, let us start by reviewing the basic elements in supervised learning.

## Model and Parameters

The *model* in supervised learning usually refers to the mathematical structure by which the prediction $\hat{y}_i$ is made from the input $x_i$. A common example is a linear model, where the prediction is $\hat{y}_i = \sum_j \theta_j x_{ij}$, a weighted combination of the input features. The *parameters* are the undetermined part that we need to learn from data. In linear regression problems, the parameters are the coefficients $\theta$.

## Objective Function: Training Loss + Regularization

Based on different understandings of $y_i$ we can have different problems, such as regression, classification, ranking, and so on. We need a way to find the best parameters given the training data. In order to do so, we need to define a so-called *objective function*, which measures the performance of the model given a certain set of parameters.

A very important fact about objective functions is that they must always contain two parts: training loss and regularization:

$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $\Omega$ is the regularization term. The training loss measures how predictive our model is with respect to the training data. A common choice of $L$ is the mean squared error, given by

$$L(\theta) = \sum_i (y_i - \hat{y}_i)^2$$

Another commonly used loss function is the logistic loss, used for logistic regression:

$$L(\theta) = \sum_i \left[ y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i)\ln\left(1 + e^{\hat{y}_i}\right) \right]$$

The *regularization term* is what people usually forget to add. It controls the complexity of the model, which helps us avoid overfitting. (The original tutorial illustrates this with a figure asking the reader to visually fit a step function to data points; the correct answer, marked in red, is neither the most complex nor the most simplistic fit.) The general principle is that we want both a *simple* and a *predictive* model. The tradeoff between the two is also known as the bias-variance tradeoff in machine learning.

## Why introduce the general principle?

The elements introduced above form the basic elements of supervised learning, and they are naturally the building blocks of machine learning toolkits. For example, you should be able to describe the differences and commonalities between boosted trees and random forests.
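As a small illustration, the two training losses above can be evaluated directly. This is a minimal sketch with made-up labels and raw scores, not XGBoost code:

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # L(theta) = sum_i (y_i - y_hat_i)^2
    return np.sum((y - y_hat) ** 2)

def logistic_loss(y, y_hat):
    # L(theta) = sum_i [ y_i ln(1 + e^{-y_hat_i}) + (1 - y_i) ln(1 + e^{y_hat_i}) ]
    # y is a 0/1 label, y_hat is a raw (pre-sigmoid) score
    return np.sum(y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat)))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([2.0, -1.0, 0.5])
print(squared_error_loss(y, y_hat))  # 2.25
print(logistic_loss(y, y_hat))
```

Note that both losses take the raw prediction $\hat{y}_i$; for the logistic loss, the probability is recovered via the sigmoid only at prediction time.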
Understanding the process in a formalized way also helps us to understand the objective that we are learning and the reason behind heuristics such as pruning and smoothing.

## Tree Ensemble

Now that we have introduced the elements of supervised learning, let us get started with real trees. To begin with, let us first learn about the model choice of XGBoost: the tree ensemble, which consists of a set of classification and regression trees (CART). We classify the members of a family into different leaves, and assign them the score on the corresponding leaf. A CART is a bit different from a decision tree, in which the leaf only contains decision values. In CART, a real score is associated with each of the leaves, which gives us richer interpretations that go beyond classification. This also makes the unified optimization step easier, as we will see in a later part of this tutorial.

Usually, a single tree is not strong enough to be used in practice. What is actually used is the so-called tree ensemble model, which sums the predictions of multiple trees together. Consider, for example, an ensemble of two trees: the prediction scores of each individual tree are summed up to get the final score. An important fact is that the two trees try to *complement* each other. Mathematically, we can write our model in the form

$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F}$$

where $K$ is the number of trees, each $f_k$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs. Therefore, the objective to optimize can be written as

$$\text{obj}(\theta) = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)$$

Now here comes a trick question: what is the *model* used in random forests? Tree ensembles! So random forests and boosted trees are really the same model; the difference arises from how we train them.

## Tree Boosting

After introducing the model, let us turn to the actual training. How should we learn the trees? The answer, as always for supervised learning models, is: *define an objective function and optimize it!*

Assume we have the following objective function (remember it always needs to contain training loss and regularization):

$$\text{obj} = \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{i=1}^t \Omega(f_i)$$

## Additive Training

The first thing we want to ask is: what are the *parameters* of the trees? What we need to learn are the functions $f_i$, each containing the structure of the tree and the leaf scores. Learning a tree structure is much harder than a traditional optimization problem where you can simply take the gradient, and it is intractable to learn all the trees at once. Instead, we use an additive strategy: fix what we have learned, and add one new tree at a time. Writing the prediction value at step $t$ as $\hat{y}_i^{(t)}$, we have

$$\begin{aligned}
\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)\\
&\dots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}$$

It remains to ask: which tree do we want at each step?
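To make the ensemble idea concrete, here is a minimal sketch in plain Python. The trees, features, and leaf scores are hypothetical, chosen only for illustration: each tree maps an input to a real-valued leaf score, and the ensemble prediction is simply the sum of the scores.

```python
# Hypothetical two-tree ensemble: each "tree" returns the real-valued
# score of the leaf the input falls into.

def tree_1(x):
    # split on age, then on gender (illustrative values only)
    if x["age"] < 15:
        return 2.0 if x["is_male"] else 0.1
    return -1.0

def tree_2(x):
    # split on daily computer use (illustrative values only)
    return 0.9 if x["uses_computer_daily"] else -0.9

def ensemble_predict(x, trees):
    # y_hat = sum_k f_k(x): the trees complement each other
    return sum(f(x) for f in trees)

boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
print(ensemble_predict(boy, [tree_1, tree_2]))
```

Note how a random forest could reuse `ensemble_predict` unchanged (averaging is just a rescaled sum); only the training procedure differs.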
A natural choice is to add the tree that optimizes our objective:

$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{i=1}^t \Omega(f_i)\\
&= \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}
\end{aligned}$$

If we use mean squared error as our loss function, the objective becomes

$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^n \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \sum_{i=1}^t \Omega(f_i)\\
&= \sum_{i=1}^n \left[2\left(\hat{y}_i^{(t-1)} - y_i\right) f_t(x_i) + f_t(x_i)^2\right] + \Omega(f_t) + \mathrm{constant}
\end{aligned}$$

The form of MSE is friendly, with a first-order term (usually called the residual) and a quadratic term. For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form. So in the general case, we take the Taylor expansion of the loss function up to the second order:

$$\text{obj}^{(t)} = \sum_{i=1}^n \left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \mathrm{constant}$$

where the $g_i$ and $h_i$ are defined as

$$\begin{aligned}
g_i &= \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)\\
h_i &= \partial^2_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)
\end{aligned}$$

After we remove all the constants, the specific objective at step $t$ becomes

$$\sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$

This becomes our optimization goal for the new tree. One important advantage of this definition is that it depends on the loss only through $g_i$ and $h_i$. This is how XGBoost can support custom loss functions: we can optimize every loss function, including logistic regression and weighted logistic regression, using exactly the same solver that takes $g_i$ and $h_i$ as input!

## Model Complexity

We have introduced the training step, but wait, there is one important thing left: the regularization term! We need to define the complexity of the tree, $\Omega(f)$. To do so, let us first refine the definition of the tree $f(x)$ as

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q : \mathbb{R}^d \to \{1, 2, \cdots, T\}$$

Here $w$ is the vector of scores on the leaves, $q$ is a function assigning each data point to the corresponding leaf, and $T$ is the number of leaves. In XGBoost, we define the complexity as

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$

Of course, there is more than one way to define the complexity, but this specific one works well in practice. The regularization is one part most tree packages treat less carefully, or simply ignore. This is because the traditional treatment of tree learning only emphasized improving impurity, while complexity control was left to heuristics.
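The statistics $g_i$ and $h_i$ are cheap to compute for common losses. The sketch below (with made-up labels and current predictions) derives them for the two losses discussed above; the same shape of function, returning per-instance gradient and hessian arrays, is what XGBoost's custom-objective interface expects:

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    # l = (y - y_hat)^2  =>  g = 2*(y_hat - y),  h = 2
    return 2.0 * (y_hat - y), np.full_like(y_hat, 2.0)

def grad_hess_logistic(y, y_hat):
    # l = y*ln(1+e^{-y_hat}) + (1-y)*ln(1+e^{y_hat})
    # => g = sigmoid(y_hat) - y,  h = sigmoid(y_hat) * (1 - sigmoid(y_hat))
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0])
y_hat = np.array([0.0, 0.0])     # current prediction at step t-1
g, h = grad_hess_logistic(y, y_hat)
print(g)  # [-0.5  0.5]
print(h)  # [0.25 0.25]
```

The tree-learning solver never needs to know which loss produced these arrays; it only consumes $g_i$ and $h_i$.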
By defining it formally, we get a better idea of what we are learning, and it works well in practice.

## The Structure Score

Here is the magical part of the derivation. After reformulating the tree model, we can write the objective value with the $t$-th tree as:

$$\begin{aligned}
\text{obj}^{(t)} &\approx \sum_{i=1}^n \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum_{j=1}^T \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T
\end{aligned}$$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of data points assigned to the $j$-th leaf. Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. We can further compress the expression by defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$:

$$\text{obj}^{(t)} = \sum_{j=1}^T \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2\right] + \gamma T$$

In this equation, the $w_j$ are independent with respect to each other, and the form $G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2$ is quadratic. The best $w_j$ for a given structure $q(x)$, and the best objective reduction we can get, are:

$$\begin{aligned}
w_j^\ast &= -\frac{G_j}{H_j + \lambda}\\
\text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T
\end{aligned}$$

The last equation measures *how good* a tree structure $q(x)$ is. If all this sounds a bit complicated, consider how the score is computed in practice: for a given tree structure, we push the statistics $g_i$ and $h_i$ to the leaves they belong to, sum the statistics together, and use the formula above to calculate how good the tree is. This score is like the impurity measure in a decision tree, except that it also takes the model complexity into account.

## Learn the tree structure

Now that we have a way to measure how good a tree is, ideally we would enumerate all possible trees and pick the best one. In practice this is intractable, so we instead optimize one level of the tree at a time. Specifically, we try to split a leaf into two leaves, and the score it gains is

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

This formula can be decomposed as 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on the original leaf, and 4) the regularization on the additional leaf.
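The structure score and the optimal leaf weights are simple enough to compute by hand. Here is a minimal sketch with hypothetical per-leaf statistics $(G_j, H_j)$:

```python
# Given summed gradient statistics (G_j, H_j) per leaf:
#   w_j* = -G_j / (H_j + lambda)
#   obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T

def optimal_leaf_weight(G, H, lam):
    return -G / (H + lam)

def structure_score(leaf_stats, lam, gamma):
    T = len(leaf_stats)
    return -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats) + gamma * T

# two leaves with made-up summed statistics, lambda = 1, gamma = 0.1
leaves = [(-4.0, 2.0), (6.0, 3.0)]
print([optimal_leaf_weight(G, H, 1.0) for G, H in leaves])
print(structure_score(leaves, 1.0, 0.1))
```

A smaller (more negative) structure score means a better tree: the quadratic per-leaf terms reward large $|G_j|$ relative to $H_j + \lambda$, while $\gamma T$ penalizes extra leaves.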
We can see an important fact here: if the gain is smaller than $\gamma$, we would do better not to add that branch. This is exactly the *pruning* technique in tree-based models! By using the principles of supervised learning, we can naturally see why these techniques work.

For real-valued data, we usually want to search for an optimal split. To do so efficiently, we place all the instances in sorted order by feature value. A left-to-right scan is then sufficient to calculate the structure score of all possible split solutions, and we can find the best split efficiently.

## Final words on XGBoost

Now that you understand what boosted trees are, you may ask: where is the introduction to XGBoost? XGBoost is exactly a tool motivated by the formal principle introduced in this tutorial! More importantly, it is developed with deep consideration in terms of both systems optimization and principles in machine learning.
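As a closing worked example, the left-to-right scan for split finding described above can be sketched as follows (the feature values and gradient statistics are made up for illustration):

```python
# Sort instances by feature value, maintain running sums G_L/H_L, and
# evaluate at each candidate split position:
#   Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
#                 - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma

def best_split(xs, gs, hs, lam=1.0, gamma=0.0):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G, H = sum(gs), sum(hs)
    GL = HL = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i = order[pos]
        GL += gs[i]
        HL += hs[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL * GL / (HL + lam)
                      + GR * GR / (HR + lam)
                      - G * G / (H + lam)) - gamma
        if gain > best_gain:
            # split halfway between this instance and the next in sorted order
            best_gain = gain
            best_threshold = 0.5 * (xs[i] + xs[order[pos + 1]])
    return best_gain, best_threshold

xs = [1.0, 2.0, 3.0, 4.0]
gs = [-2.0, -1.5, 1.0, 2.5]   # gradients: negative on the left, positive on the right
hs = [1.0, 1.0, 1.0, 1.0]
print(best_split(xs, gs, hs))
```

The single pass over the sorted instances evaluates every candidate threshold in $O(n)$ after the $O(n \log n)$ sort, which is exactly why sorting first pays off.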
