
What is an intuitive explanation of Gradient Boosting?

https://youtu.be/sRktKszFmSk

https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting

-------------------------------------------------------------------------------------------------------------------------------------------

If you know what Gradient Descent is, it is easy to think of Gradient Boosting as an approximation of it. So let's start with Gradient Descent. Note that I am presenting a simplified version of things. For rigor you can refer to the original paper or any of the books that cover it.

Math alert! :)

Gradient Descent (GD) - a short primer

You have a function $f(x)$ that you want to minimize. Assume $x$ to be a scalar. One way to iteratively minimize it, and find the corresponding $x$ at the minimum, is to follow this update rule at the $i$-th iteration:

$$x^{(i)} = x^{(i-1)} - \eta \, \frac{df(x^{(i-1)})}{dx}$$

Here $\eta$ is a positive constant.

In effect, the value of $x$ found in the current iteration is its value in the previous iteration minus some fraction of the slope/gradient at that previous value. We stop when $x^{(i)} = x^{(i-1)}$. We may start with an arbitrary value for $x^{(0)}$.
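As a concrete (made-up) example, take $f(x) = x^2$, so $df/dx = 2x$, and let $\eta = 0.25$. Starting from $x^{(0)} = 8$, the updates are $x^{(1)} = 8 - 0.25 \cdot 16 = 4$, $x^{(2)} = 4 - 0.25 \cdot 8 = 2$, $x^{(3)} = 1$, and so on: each iteration halves the distance to the minimum at $x = 0$, with the steps shrinking as the gradient does.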

The following figure (not reproduced here) shows how this works out:

When $x^{(i-1)}$ corresponds to A, the (negative) gradient is quite high, and consequently $x^{(i)}$ (at A') occurs farther ahead. At B, the gradient is considerably less, and hence B' occurs close to B. C and C' are the exact same point since the gradient here is $0$. This is our stopping condition and C is where the minimum occurs.

In effect, we move our current estimate by an amount proportional to the gradient. This makes sense, since we expect the gradient to gradually approach $0$ near the minimum; therefore, the farther away we are from it, the higher the absolute value of the gradient. We don't want to spend a lot of iterations in these regions, so we move with longer "steps". Conversely, near the minimum, we want to be cautious enough not to overshoot it, so we take smaller steps.

In the case where $x$ is a vector, the principle remains the same. Now, we adjust every individual dimension of $x$ based on the slope along that direction. For the $i$-th iteration and the $j$-th dimension, this is what the update rule looks like:

$$x_j^{(i)} = x_j^{(i-1)} - \eta \, \frac{\partial f(x^{(i-1)})}{\partial x_j^{(i-1)}}$$

At each iteration all dimensions are adjusted. The idea is that we want to move the vector $x$, as a whole, in a direction where each individual component helps minimize $f(x)$. This is important - this form will show up in the subsequent discussion.
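Here is a minimal sketch of this update rule in Python, assuming a simple quadratic $f$ whose gradient we supply by hand; the function name gradient_descent and its parameters are just illustrative:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, max_iter=1000, tol=1e-8):
    """Minimize f by repeatedly applying x <- x - eta * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        if np.linalg.norm(step) < tol:  # x(i) is (almost) x(i-1): stop
            break
        x = x - step                    # every dimension adjusted at once
    return x

# Example: f(x) = (x_0 - 3)^2 + (x_1 + 1)^2, gradient [2(x_0 - 3), 2(x_1 + 1)]
x_min = gradient_descent(lambda x: np.array([2 * (x[0] - 3), 2 * (x[1] + 1)]),
                         x0=[0.0, 0.0])
print(x_min)  # approximately [3, -1], the minimum
```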

This is all we really need to know about gradient descent to understand gradient boosting.

Gradient Boosting

Gradient Boosting carries over the previous technique to supervised learning. If you are careful with the notation, it's not difficult to see how it works.

So we start with a function to minimize. We want a function whose value increases with how bad the classifier/regressor is. For a general treatment, we refer to this function as the loss function, represented by $L$. For Gradient Boosting, the loss function must be differentiable. An example is the squared error between the actual and predicted values, i.e. $L = (y_i - h(x_i))^2$.

We want to minimize $f(x) = \sum_{i=1}^{N} L(y_i, h(x_i))$, i.e. the loss over all points $(x_i, y_i)$. Here $h(x)$ is the classifier/regressor, which for brevity we'll refer to as the predictor. $N$ is the total number of points.

In the example of Gradient Descent above we minimized wrt $x$. What are we minimizing wrt here? We are minimizing wrt the predictor function $h(x)$, since we want a predictor that minimizes the total loss $f(x)$.

Don't worry about minimizing wrt a function - that won't complicate matters much.

Moving to the iterative world of gradient descent, these are the steps we now take:

  1. Initialize $h^{(0)}(x) = c$, a constant, such that $c$ minimizes $f(x)$, i.e. pick the $c$ that minimizes $\sum_{i=1}^{N} L(y_i, c)$.
  2. At the $i$-th iteration, for $j = 1, 2, ..., N$, compute $r_{ji} = -\frac{\partial L(y_j, h^{(i-1)}(x_j))}{\partial h^{(i-1)}(x_j)}$. This is doable since we have assumed $L$ to be differentiable - for the example of squared error, $\frac{\partial L}{\partial h} = -2(y - h)$ - we are only plugging the values of $y_j$ and $h^{(i-1)}(x_j)$ into the differentiated expression. This is analogous to how we dealt with the components of the vector $x$ in the previous section.
  3. The previous step gives us a value $r_{ji}$ for each point $j$. Thus we have a set of tuples $(x_j, r_{ji})$. We use this set of points as training data to construct a regression tree that can predict $r_{ji}$ given $x_j$. This tree approximates the gradient. Think of the tree as a "black box" gradient expression which produces $r_{ji}$ given $x_j$. It takes the place of the $\frac{\partial f(x^{(i-1)})}{\partial x_j^{(i-1)}}$ expression we saw in GD, with this one tree sort of embodying the gradient for all points $x_j$. We'll refer to this tree as $T_g^{(i)}$ ($g$ for gradient, $i$ for the iteration). As before we want this gradient-tree to play a role in the update equation, but we are still left with the task of finding $\eta$ ...
  4. Assume that the tree $T_g^{(i)}$ has $K$ leaves. We know that the leaves of a tree fragment the feature space into disjoint, exhaustive regions. Let's refer to these regions as $R_k$, for $k = 1, 2, ..., K$. If we send each point $x_j$ down the tree $T_g^{(i)}$, it will end up in some region $R_k$. We now want to associate a constant $\eta_k$ with each such region $R_k$ such that the loss in a region, defined as $\sum_{x_j \in R_k} L(y_j, h^{(i-1)}(x_j) + \eta_k)$, is minimized. These are solved as $K$ (simple) independent minimization problems, one for each of the $K$ regions. Note that now we have a tree providing well-defined regions $R_k$, and we have constant values $\eta_k$ associated with each of these regions. In effect, this combination may be seen as another tree: structure given by $T_g^{(i)}$, but with the $\eta_k$ as predictions at the leaves.
  5. Finally, we come to the update step: $h^{(i)}(x) = h^{(i-1)}(x) + \sum_k \eta_k I(x \in R_k)$. Here $I(x \in R_k)$ is an indicator function that has a value of $1$ when $x$ falls in the region $R_k$ and $0$ otherwise. Don't let the indicator function confuse you - it's just an elegant way of saying that for a point $x$ the second term is the $\eta_k$ corresponding to the region it falls into. This second term, as mentioned in the previous step, is effectively a tree derived from $T_g^{(i)}$. You can probably now see why $\eta_k$ was determined in the way it was: the minimization in the last step and the update have the same form; thus the updated function has the minimum possible loss. Note how similar the update equation is to that of GD. You have your gradient and you have your $\eta$ (well, you've more than one now - the $\eta_k$ - but this is a minor difference). Also note, and this is very interesting, that there is actually no addition taking place in this update step - what we are simply saying is, if you want to compute $h^{(i)}(x)$, compute $h^{(i-1)}(x)$, and add to it whatever $\eta_k$ you obtain by passing $x$ down the tree represented by the second term.
  6. Keep going back to Step 2 until you have iterated the number of times - say $M$ - that you want to.
  7. Finally, return $h^{(M)}(x)$ as your predictor. Since at every iteration your only update to the function is adding a tree in Step 5, what you finally return is a sum of trees. Or rather, you return a bunch of trees whose sum (plus $c$ from Step 1) gives you the function $h^{(M)}(x)$.
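To make Steps 1-7 concrete, here is a minimal sketch in Python for the squared-error loss, using scikit-learn's DecisionTreeRegressor as the regression tree; the function names and toy data are illustrative, not a reference implementation. A convenient property of squared error is that the per-leaf constants $\eta_k$ from Step 4 are just the mean residuals in each leaf - which is exactly what a regression tree fit on the residuals predicts - so no separate per-leaf minimization is needed; for other losses you would solve $K$ small one-dimensional problems per tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, M=100, max_depth=2):
    c = np.mean(y)                  # Step 1: constant minimizing the sum of squared errors
    h = np.full(len(y), c)          # current predictions h^(i)(x_j) on the training set
    trees = []
    for _ in range(M):
        r = y - h                   # Step 2: pseudo-residuals (the negative gradient,
                                    # up to a factor of 2, for squared error)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # Steps 3-4
        h = h + tree.predict(X)     # Step 5: leaf means play the role of the eta_k
        trees.append(tree)          # Step 6: loop back
    return c, trees                 # Step 7: the predictor is c plus a sum of trees

def gbm_predict(c, trees, X):
    return c + sum(t.predict(X) for t in trees)

# Toy usage: learn y = x^2 from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
c, trees = gbm_fit(X, y)
print(gbm_predict(c, trees, np.array([[2.0]])))  # roughly 4
```

In practice, implementations also multiply each tree's contribution by a small learning rate (shrinkage), which plays much the same role as $\eta$ in GD and usually generalizes better.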

(I personally don't like GOTO statements like Step 6, but my inability to create nested lists on Quora has stymied me here :) ).

A few things to think about:

  • Make sure you understand the role of calculating the per-point gradient in Step 2. In the end we want the function $h(x)$ to minimize the loss over all training points. As mentioned before, this is analogous to the case of a vector-valued $x$ in GD.
  • We required that the loss function be differentiable - we see this property being used in Step 2.
  • Where are the major computations happening in the algorithm? In constructing the "gradient trees" in Step 3 and determining the $\eta_k$ values in Step 4.

In essence we have learnt a function $h^{(M)}(x)$, based on the values $(x_j, y_j)$, that minimizes the prediction error $f(x)$. The minimization is done in multiple steps: at every step we add a tree (Steps 3-5) that emulates adding a gradient-based correction - very much like in GD. Using trees ensures that the gradient expression generalizes - because we need the gradient for an unseen/test point at each iteration as part of the calculation of $h^{(M)}(x)$.

Probably trivial, but it is also interesting to note how our final function differs in form relative to the final point returned in GD. In GD, we have an updated point - one point - at any iteration. In the case of Gradient Boosting, we don't have a neat form in which the updated function exists; the function exists tangibly as a bunch of trees, with each tree representing the update in some iteration. In GD, if you come to me after the $i$-th iteration, I will give you a point $x^{(i)}$; here, after the $i$-th iteration, I will give you $i$ trees (plus, of course, the constant $c$ from Step 1): this is what $h^{(i)}(x)$ is.

I find Gradient Boosting to be one of the cleverer algorithms out there due to the manner in which it adapts a well known optimization technique to the domain of learning.

--------------------------------------------------------------------------------------------------------------------------------------------

Gradient Boosting is basically about "boosting" many weak predictive models into a strong one, in the form of an ensemble of weak models. Here, a weak predictive model is any model that works just a little better than random guessing.

To build the strong model, we need to find a good way to "combine" weak models. In AdaBoost, arguably the most popular boosting algorithm, weak models are trained in an adaptive way (AdaBoost, and other boosting models, can be used for both classification and regression; a classification model is used here as an example):

  1. Train a weak model m using data samples drawn according to some weight distribution
  2. Increase the weight of samples that are misclassified by model m, and decrease the weight of samples that are classified correctly by model m
  3. Train next weak model using samples drawn according to the updated weight distribution

In this way, the algorithm always trains models using data samples that were "difficult" to learn in previous rounds, which results in an ensemble of models that are good at learning different "parts" of the training data.
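Here is a minimal sketch of that loop in Python, assuming binary labels in $\{-1, +1\}$ and scikit-learn decision stumps as the weak models; the function names are illustrative, though the weight update itself is the standard discrete AdaBoost recipe:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """y must contain labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start with uniform sample weights
    models, alphas = [], []
    for _ in range(M):
        # Step 1: train a weak model (a depth-1 tree) on the weighted samples
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, None)  # weighted error
        if err >= 0.5:                       # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)  # this model's weight in the ensemble
        # Step 2: up-weight misclassified samples, down-weight correct ones
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()                      # renormalize to a distribution
        models.append(stump)
        alphas.append(alpha)
        # Step 3: the next iteration trains against the updated weights
    return models, alphas

def adaboost_predict(models, alphas, X):
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```

Note that instead of drawing samples according to the weight distribution, as described in the list above, this sketch passes the weights directly to the weak learner; both are standard formulations of the same idea.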

--------------------------------------------------------------------------------------------------------------------------------------------

