Gradient-Descent Boosted Training (GBM) Algorithm

(Redirected from Gradient Boosting)
Jump to navigation Jump to search

A Gradient-Descent Boosted Training (GBM) Algorithm is an boosted ensemble learning algorithm that is a gradient-descent algorithm which employs gradient descent optimization to build models sequentially in a way that each new model incrementally reduces the residual errors of the previous models.



  • (Wikipedia, 2024) ⇒ Retrieved:2024-3-13.
    • Gradient boosting is a machine learning technique based on boosting in a functional space, where the target is pseudo-residuals rather than the typical residuals used in traditional boosting. It gives a prediction model in the form of an ensemble of weak prediction models, i.e., models that make very few assumptions about the data, which are typically simple decision trees. [1] When a decision tree is the weak learner, the resulting algorithm is called gradient-boosted trees; it usually outperforms random forest.[2][1] A gradient-boosted trees model is built in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function.


  • (Wikipedia, 2015) ⇒ Retrieved:2015-6-24.
    • Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

      The idea of gradient boosting originated in the observation by Leo Breiman [3] that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman[4] [5] simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean

      .[6] [7]

      The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost functional over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.


  • (Wikipedia, 2017) ⇒ Retrieved:2017-1-21.
    • … Gradient boosting method assumes a real-valued y and seeks an approximation [math]\displaystyle{ \hat{F}(x) }[/math] in the form of a weighted sum of functions [math]\displaystyle{ h_i (x) }[/math] from some class ℋ, called base (or weak) learners: : [math]\displaystyle{ F(x) = \sum_{i=1}^M \gamma_i h_i(x) + \mbox{const} }[/math] .

      In accordance with the empirical risk minimization principle, the method tries to find an approximation [math]\displaystyle{ \hat{F}(x) }[/math] that minimizes the average value of the loss function on the training set. It does so by starting with a model, consisting of a constant function [math]\displaystyle{ F_0(x) }[/math] , and incrementally expanding it in a greedy fashion: : [math]\displaystyle{ F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) }[/math] , : [math]\displaystyle{ F_m(x) = F_{m-1}(x) + \underset{f \in \mathcal{H}}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + f(x_i)) }[/math] ,

      where is restricted to be a function from the class ℋ of base learner functions.

      However, the problem of choosing at each step the best for an arbitrary loss function is a hard optimization problem in general, and so we'll "cheat" by solving a much easier problem instead.

      The idea is to apply a steepest descent step to this minimization problem. If we only cared about predictions at the points of the training set, and were unrestricted, we'd update the model per the following equation, where we view [math]\displaystyle{ L(y, f) }[/math] not as a function of , but as a function of a vector of values [math]\displaystyle{ f(x_1), \ldots, f(x_n) }[/math] : : [math]\displaystyle{ F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^n \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i)), }[/math] : [math]\displaystyle{ \gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} \right). }[/math] But as must come from a restricted class of functions (that's what allows us to generalize), we'll just choose the one that most closely approximates the gradient of . Having chosen, the multiplier is then selected using line search just as shown in the second equation above.

      In pseudocode, the generic gradient boosting method is:

      Input: training set [math]\displaystyle{ \{(x_i, y_i)\}_{i=1}^n, }[/math] a differentiable loss function [math]\displaystyle{ L(y, F(x)), }[/math] number of iterations .


      1. Initialize model with a constant value:
        [math]\displaystyle{ F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma). }[/math] # For = 1 to :
        1. Compute so-called pseudo-residuals:
          [math]\displaystyle{ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} \quad \mbox{for } i=1,\ldots,n. }[/math] ## Fit a base learner [math]\displaystyle{ h_m(x) }[/math] to pseudo-residuals, i.e. train it using the training set [math]\displaystyle{ \{(x_i, r_{im})\}_{i=1}^n }[/math] .
        2. Compute multiplier [math]\displaystyle{ \gamma_m }[/math] by solving the following one-dimensional optimization problem:
          [math]\displaystyle{ \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right). }[/math] ## Update the model:
          [math]\displaystyle{ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x). }[/math] # Output [math]\displaystyle{ F_M(x). }[/math]


    • QUOTE: Gradient boosting builds an ensemble of trees one-by-one, then the predictions of the individual trees are summed: [math]\displaystyle{ D(x)= d_\text{tree 1}(x)+d_\text{tree 2}(x)+... D(\mathbf{x}) = d_\text{tree 1}(\mathbf{x}) + d_\text{tree 2}(\mathbf{x}) + … D(x)=d​_\text{tree 1}​​(x)+d​_\text{tree 2}​​(x)+... }[/math] The next decision tree tries to cover the discrepancy between the target function [math]\displaystyle{ f(x) f(\mathbf{x}) f(x) }[/math] and the current ensemble prediction by reconstructing the residual.

      For example, if an ensemble has 3 trees the prediction of that ensemble is: D(x)=dtree 1(x)+dtree 2(x)+dtree 3(x) D(\mathbf{x}) = d_\text{tree 1}(\mathbf{x}) + d_\text{tree 2}(\mathbf{x}) + d_\text{tree 3}(\mathbf{x}) D(x)=d​tree 1​​(x)+d​tree 2​​(x)+d​tree 3​​(x) The next tree (tree 4 \text{tree 4} tree 4 ) in the ensemble should complement well the existing trees and minimize the training error of the ensemble.

      In the ideal case we'd be happy to have: [math]\displaystyle{ D(x)+d_\text{tree 4}(x)=f(x). D(\mathbf{x}) + d_\text{tree 4}(\mathbf{x}) = f(\mathbf{x}). D(x)+d​tree 4​​(x)=f(x) }[/math].

      To get a bit closer to the destination, we train a tree to reconstruct the difference between the target function and the current predictions of an ensemble, which is called the residual: R(x)=f(x)−D(x). R(\mathbf{x}) = f(\mathbf{x}) - D(\mathbf{x}). R(x)=f(x)−D(x). Did you notice? If decision tree completely reconstructs R(x) R(\mathbf{x}) R(x), the whole ensemble gives predictions without errors (after adding the newly-trained tree to the ensemble)! That said, in practice this never happens, so we instead continue the iterative process of ensemble building.


    • QUOTE:
      • AdaBoost is equivalent to forward stagewise additive modeling using the exponential loss function.
      • Generalization of AdaBoost to work with arbitrary loss functions resulted in GBM. Gradient Boosting = Gradient Descent + Boosting
      • GBM uses gradient descent algorithm which can optimize any differentiable loss function.
      • In Adaboost, ‘shortcomings’ are identified by high-weight data points.
      • In Gradient Boosting, “shortcomings” are identified by negative gradients (also called pseudo residuals).
      • In GBM instead of reweighting used in adaboost, each new tree is fit to the negative gradients of the previous tree.
      • Each tree in GBM is a successive gradient descent step.





  1. 1.0 1.1 Cite error: Invalid <ref> tag; no text was provided for refs named hastie
  2. Cite error: Invalid <ref> tag; no text was provided for refs named :1
  3. Brieman, L. “Arcing The Edge" (June 1997)
  4. Friedman, J. H. “Greedy Function Approximation: A Gradient Boosting Machine." (February 1999)
  5. Friedman, J. H. “Stochastic Gradient Boosting." (March 1999)