If you know what gradient descent is, it is easy to think of Gradient Boosting as an approximation of it. So let's start with gradient descent. Note that I am presenting a simplified version of things; for rigor you can refer to the original paper or any of the books that cover it. Math alert! :)
You have a function $f(x)$ that you want to minimize. Assume $x$ to be a scalar. One way to minimize it is to iterate:

$$x^{(i)} = x^{(i-1)} - \eta \, \frac{df(x^{(i-1)})}{dx}$$

Here $\eta$ is a positive constant. In effect, the value of $x$ found in the current iteration is its value in the previous iteration, minus some fraction of the slope/gradient at that previous value. We stop when $x^{(i)} = x^{(i-1)}$. We may start with an arbitrary value for $x^{(0)}$.

The following figure shows how this works out: when $x^{(i-1)}$ corresponds to A, the (negative) gradient is quite high, and consequently $x^{(i)}$ (at A') occurs farther ahead. At B, the gradient is considerably smaller, and hence B' occurs close to B. C and C' are the exact same point since the gradient here is 0. This is our stopping condition, and C is where the minimum occurs.

In effect, we move our current estimate by an amount proportional to the gradient. This makes sense: we expect the gradient to gradually become 0 near the minimum, so the farther away we are from it, the higher the absolute value of the gradient. We don't want to spend a lot of iterations in these regions - so we move with longer "steps". Conversely, near the minimum, we want to be cautious enough not to overshoot it, so we take smaller steps.

In the case where $x$ is a vector, the principle remains the same; we now adjust every component at each iteration:

$$x_j^{(i)} = x_j^{(i-1)} - \eta \, \frac{\partial f(x^{(i-1)})}{\partial x_j^{(i-1)}}$$

This is all we really need to know about gradient descent to understand Gradient Boosting.
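The scalar update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer: the function $f(x) = (x-3)^2$ (whose minimum is at $x=3$), the step size, and the stopping tolerance are my own illustrative choices.

```python
# A minimal sketch of the scalar gradient-descent update, assuming
# f(x) = (x - 3)^2; eta, tol, and max_iter are illustrative choices.

def gradient_descent(grad, x0, eta=0.1, tol=1e-9, max_iter=100_000):
    x = x0
    for _ in range(max_iter):
        step = eta * grad(x)
        if abs(step) < tol:     # x(i) == x(i-1) for all practical purposes
            break
        x -= step               # x(i) = x(i-1) - eta * df/dx at x(i-1)
    return x

# f(x) = (x - 3)^2  =>  df/dx = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Starting far from the minimum the steps are large (the gradient is large), and they shrink as $x$ approaches 3 - exactly the A/B/C behaviour described above.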
Gradient Boosting carries over the previous technique to supervised learning. If you are careful with the notation, it's not difficult to see how it works.

So we start with a function to minimize. We want a function whose value increases with how bad the classifier/regressor is. For a general treatment, we refer to this function as the *loss function*. We want to minimize

$$f(x) = \sum_{i=1}^{N} L(y_i, h(x_i))$$

i.e. the loss over all points $(x_i, y_i)$. Here $h(x)$ is the classifier/regressor, which for brevity we'll refer to as the *predictor*, and $N$ is the total number of points. In the example of gradient descent above we minimized wrt $x$. What are we minimizing wrt here? We are minimizing wrt the predictor $h$. Don't worry about minimizing wrt a function - that won't complicate matters much.

Moving to the iterative world of gradient descent, these are the steps we now take:

- Initialize $h^{(0)}(x) = c$, a constant, such that $c$ minimizes $f(x)$, i.e. pick the $c$ that minimizes $\sum_{i=1}^{N} L(y_i, c)$.
- At the $i$th iteration, for $j = 1, 2, \ldots, N$ compute

$$r_{ji} = -\frac{\partial L(y_j, h^{(i-1)}(x_j))}{\partial h^{(i-1)}(x_j)}$$

This is doable since we have assumed $L$ to be differentiable - for the example of squared error, $\frac{\partial L}{\partial h} = -2(y - h)$ - and we are only plugging the values of $y_j$ and $h^{(i-1)}(x_j)$ into the differentiated expression. This is analogous to how we dealt with the components of the vector $x$ in the previous section.
- The previous step gives us a value $r_{ji}$ for each point $j$. Thus we have a set of tuples $(x_j, r_{ji})$. We use this set of points as *training data* to construct a regression tree that can predict $r_{ji}$ given $x_j$. This tree approximates the gradient: think of it as a "black box" gradient expression which produces $r_{ji}$ given $x_j$. It takes the place of the $\frac{\partial f(x^{(i-1)})}{\partial x_j^{(i-1)}}$ expression we saw in gradient descent, with this one tree sort of embodying the gradient for all points $x_j$. We'll refer to this tree as $T_g^{(i)}$ ($g$ for gradient, $i$ for the iteration). As before, we want this gradient-tree to play a role in the update equation, but we are still left with the task of finding $\eta$ ...

- Assume that the tree $T_g^{(i)}$ has $K$ leaves. We know that the leaves of a tree fragment the feature space into disjoint, exhaustive regions. Let's refer to these regions as $R_k$, for $k = 1, 2, \ldots, K$. If we send each point $x_j$ down the tree $T_g^{(i)}$, it will end up in some region $R_k$. We now want to associate a *constant* $\eta_k$ with *each* such region $R_k$, such that the loss in that region, defined as

$$\sum_{x_j \in R_k} L(y_j, h^{(i-1)}(x_j) + \eta_k)$$

is minimized. These are solved as $K$ (simple) independent minimization problems, one per region. Note that now we have a tree providing well-defined regions $R_k$, and we have constant values $\eta_k$ associated with each of these regions. In effect, this combination may be seen as another tree: structure given by $T_g^{(i)}$, but with $\eta_k$ as the predictions at the leaves.

- Finally, we come to the update step:

$$h^{(i)}(x) = h^{(i-1)}(x) + \sum_k \eta_k I(x \in R_k)$$

Here $I(x \in R_k)$ is an *indicator function* that has a value of 1 when $x$ falls in the region $R_k$, and 0 otherwise. Don't let the indicator function confuse you - it's just an elegant way of saying that for a point $x$, the second term is the $\eta_k$ corresponding to the region it falls into. This second term, as mentioned in the previous step, is effectively a tree derived from $T_g^{(i)}$. You can probably now see why $\eta_k$ was determined the way it was: the minimization in the last step and the update have the same form; thus the updated function has the minimum possible loss. Note how similar the update equation is to that of gradient descent. You have your gradient and you have your $\eta$ (well, you now have more than one - $\eta_k$ - but this is a minor difference). Also note, and this is *very interesting*, there is *actually no* addition taking place in this update step - what we are simply saying is: if you want to compute $h^{(i)}(x)$, compute $h^{(i-1)}(x)$, and add to it whatever $\eta_k$ you obtain by passing $x$ down the tree represented by the second term.

- Keep going back to step 2 until you have iterated the number of times - say $M$ - you want to.
- Finally, return $h^{(M)}(x)$ as your predictor. Since at every iteration your only update to the function is adding a tree in step 5, what you finally return is a sum of trees. Or rather, you return a bunch of trees whose sum (plus $c$ from step 1) gives you the function $h^{(M)}(x)$.
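The steps above can be sketched end-to-end for squared-error loss. This is a toy illustration, not a library implementation: all the names are mine, the "gradient trees" are depth-1 stumps on a single 1-D feature, and for squared error the optimal $\eta_k$ in each region happens to be simply the mean residual there.

```python
# A toy end-to-end sketch of the algorithm above for squared-error loss.
# The "gradient trees" are depth-1 stumps on one 1-D feature; there is
# no shrinkage or subsampling. All names are illustrative, mine only.

def fit_stump(xs, rs):
    """Fit a one-split regression tree to (x_j, r_ji): try each threshold,
    keep the one with lowest squared error, and predict the mean residual
    in each of the two regions. For squared-error loss, that per-region
    mean is exactly the optimal eta_k."""
    best = None
    for s in xs:
        left = [r for x, r in zip(xs, rs) if x <= s]
        right = [r for x, r in zip(xs, rs) if x > s]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x <= s else mr   # eta_k for the region x lands in

def gradient_boost(xs, ys, n_rounds=500):
    c = sum(ys) / len(ys)          # h(0): the constant minimizing squared loss
    pred = [c] * len(xs)
    trees = []
    for _ in range(n_rounds):
        # pseudo-residuals: -dL/dh = 2(y - h); the factor 2 is absorbed by
        # the refitted leaf values, so y - h suffices
        rs = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, rs)               # tree with eta_k at the leaves
        trees.append(tree)
        pred = [p + tree(x) for p, x in zip(pred, xs)]   # the update step
    return lambda x: c + sum(t(x) for t in trees)        # sum of trees, plus c

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x * x for x in xs]           # fit y = x^2 on six points
h = gradient_boost(xs, ys)
```

On these six training points the returned $h$ recovers $y = x^2$ almost exactly; on unseen $x$ it predicts a step function, which is one reason real implementations use deeper trees, shrinkage, and many features.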
(I personally don't like GOTO statements like step 6, but my inability to create nested lists on Quora has stymied me here :) ).

A few things to think about:

- Make sure you understand the role of calculating the per-point gradient in step 2. In the end we want the function $h(x)$ to minimize loss over all training points. As mentioned before, this is analogous to the case of a vector-valued $x$ in gradient descent.
- We had required that the loss function be differentiable - we see this property being used in step 2.
- Where are the major computations happening in the algorithm? In constructing the "gradient trees" in step 4 and determining the $\eta_k$ values in step 5.
In essence, we have learnt a function $h^{(M)}(x)$, based on the values $(x_j, y_j)$, that minimizes the prediction error $f(x)$. The minimization is done in multiple steps: at every step we add a tree (steps 4 and 5) that emulates adding a gradient-based correction - very much like in gradient descent. Using trees ensures that the gradient expression generalizes - because we need the gradient for an arbitrary point $x$, not just the training points.

Probably trivial, but it is also interesting to note how our final function differs in form from the final point returned in gradient descent: in GD we end with an updated point, while here we end with a sum of trees.

I find Gradient Boosting to be one of the cleverer algorithms out there due to the manner in which it adapts a well-known optimization technique to the domain of learning.
Gradient Boosting is basically about "boosting" many weak models into a strong one:

- Train a weak model *m* using data samples drawn according to some weight distribution.
- Increase the weight of samples that are misclassified by model *m*, and decrease the weight of samples that are classified correctly by model *m*.
- Train the next weak model using samples drawn according to the updated weight distribution.

In this way, the algorithm always trains models using data samples that were "difficult" to learn in previous rounds, which results in an ensemble of models that are good at learning different "parts" of the training data.
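The reweighting step described above can be sketched in a few lines. The multiplicative $e^{\pm\alpha}$ update used here is the classic AdaBoost choice - one concrete way of "increasing/decreasing" the weights; the function name and the $\alpha$ value are illustrative assumptions of mine.

```python
import math

# A toy version of the reweighting loop described above. The e^{+-alpha}
# update is the classic AdaBoost choice; names and alpha are illustrative.

def reweight(weights, correct, alpha=0.5):
    """Up-weight misclassified samples, down-weight correctly classified
    ones, then renormalize so the weights form a distribution again."""
    new = [w * math.exp(alpha if not ok else -alpha)
           for w, ok in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new]

w = [0.25] * 4                         # start uniform over 4 samples
correct = [True, True, False, True]    # say the weak model got sample 2 wrong
w = reweight(w, correct)               # sample 2's weight grows, others shrink
```

The next weak model then samples (or weights its loss) according to `w`, so the previously misclassified sample gets more attention.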