Gradient Boosting-based Tree (GBT) Learning Algorithm

A Gradient Boosting-based Tree (GBT) Learning Algorithm is a decision tree learning algorithm that is a gradient boosting-based learning algorithm which can be implemented by a Gradient Boosted Decision Tree Learning System (to solve a gradient boosted decision tree learning task).
* <B>Context:</B>
** It can be applied by a [[Gradient Boosting-based Decision Tree Learning System]].
** It can range from being a [[Gradient Boosting-based Classification Tree Learning Algorithm]] to being a [[Gradient Boosting-based Regression Tree Learning Algorithm]].
* <B>Example(s):</B>
** [[CART-based Gradient Tree Boosted Algorithm]] (for [[CART]]).
** …
* <B>Counter-Example(s):</B>
** [[Random Forests Algorithm]].
* <B>See:</B> [[Decision Tree]], [[Indicator Notation]], [[Interaction (Statistics)]], [[Decision Stump]].
----
----
== References ==


=== 2017 ===
* http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html
** QUOTE: Transform your features into a higher dimensional, sparse space. Then train a linear model on these features. <P> First fit an ensemble of trees (totally random trees, a random forest, or gradient boosted trees) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion.
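The following is a minimal sketch of the leaf-encoding pipeline described in the quote above, assuming scikit-learn's GradientBoostingClassifier, OneHotEncoder, and LogisticRegression; the synthetic dataset and all parameter values are illustrative only, not taken from the cited example.

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Illustrative synthetic data; any binary classification set would do.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Fit a gradient boosted tree ensemble on the training set.
gbt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbt.fit(X_train, y_train)

# 2) Map each sample to the index of the leaf it reaches in every tree, then
#    one-hot encode those leaf indices into a sparse, high-dimensional space.
#    apply() returns shape (n_samples, n_estimators, n_classes); binary => last dim is 1.
encoder = OneHotEncoder(handle_unknown="ignore")
train_leaves = gbt.apply(X_train)[:, :, 0]
encoder.fit(train_leaves)

# 3) Train a linear model on the one-hot encoded leaf features.
linear = LogisticRegression(max_iter=1000)
linear.fit(encoder.transform(train_leaves), y_train)

# Evaluate on held-out data using the same transformation.
test_leaves = gbt.apply(X_test)[:, :, 0]
print("held-out accuracy:", linear.score(encoder.transform(test_leaves), y_test))
</syntaxhighlight>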
=== 2015 ===
* (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting Retrieved:2015-1-14.
** Gradient boosting is typically used with [[decision tree]]s (especially [[Classification and regression tree|CART]] trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner. <P> Generic gradient boosting at the ''m''-th step would fit a decision tree <math>\! h_m(x)</math> to pseudo-residuals. Let <math>\! J</math> be the number of its leaves. The tree partitions the input space into <math>\! J</math> disjoint regions <math>\! R_{1m}, \ldots, R_{Jm}</math> and predicts a constant value in each region. Using the [[indicator notation]], the output of <math>\! h_m(x)</math> for input ''x'' can be written as the sum: <P> : <math>h_m(x) = \sum_{j=1}^J b_{jm} I(x \in R_{jm}),</math> <P> where <math>\! b_{jm}</math> is the value predicted in the region <math>\! R_{jm}</math>. <ref> Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient <math>b_{jm}</math> for the region <math>R_{jm}</math> is simply the value of the output variable, averaged over all training instances in <math>R_{jm}</math>. </ref> Then the coefficients <math>\! b_{jm}</math> are multiplied by some value <math>\! \gamma_m</math>, chosen using line search so as to minimize the loss function, and the model is updated as follows: <P> : <math> F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \quad \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)). </math> <P> Friedman proposes to modify this algorithm so that it chooses a separate optimal value <math>\! \gamma_{jm}</math> for each of the tree's regions, instead of a single <math>\! \gamma_m</math> for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients <math>\! b_{jm}</math> from the tree-fitting procedure can then simply be discarded, and the model update rule becomes: <P> : <math> F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} I(x \in R_{jm}), \quad \gamma_{jm} = \underset{\gamma}{\operatorname{arg\,min}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)). </math> <P> === Size of trees === <P> <math>\! J</math>, the number of terminal nodes in the trees, is the method's parameter that can be adjusted for the data set at hand. It controls the maximum allowed level of [[Interaction (statistics)|interaction]] between variables in the model. With <math>\! J = 2</math> ([[decision stump]]s), no interaction between variables is allowed. With <math>\! J = 3</math> the model may include effects of the interaction between up to two variables, and so on. <P> Hastie et al. comment that typically <math>4 \leq J \leq 8</math> work well for boosting and results are fairly insensitive to the choice of <math>J</math> in this range, <math>J = 2</math> is insufficient for many applications, and <math>J > 10</math> is unlikely to be required.
<references/>
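As a concrete illustration of the per-region line search in the TreeBoost update quoted above, below is a minimal sketch for the least-absolute-deviation (LAD) loss, where the pseudo-residuals are <math>\operatorname{sign}(y - F)</math> and each region's optimal <math>\gamma_{jm}</math> is the median residual in that region. It assumes scikit-learn's DecisionTreeRegressor as the base learner; the function names (fit_treeboost_lad, predict_treeboost), the shrinkage factor, and all parameter values are illustrative, not part of any published implementation.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_treeboost_lad(X, y, n_rounds=100, max_leaf_nodes=8, learning_rate=0.1):
    """Gradient tree boosting with per-leaf line search (TreeBoost) for LAD loss."""
    y = np.asarray(y, dtype=float)
    f0 = np.median(y)                      # F_0: the median minimizes LAD for a constant model
    F = np.full(len(y), f0)
    trees, leaf_gammas = [], []
    for m in range(n_rounds):
        r = np.sign(y - F)                 # pseudo-residuals for L(y, F) = |y - F|
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)  # J terminal nodes
        tree.fit(X, r)
        leaves = tree.apply(X)             # region R_{jm} containing each training point
        # Per-region line search: gamma_{jm} = median of (y_i - F(x_i)) over R_{jm}.
        gammas = {leaf: np.median((y - F)[leaves == leaf]) for leaf in np.unique(leaves)}
        F = F + learning_rate * np.array([gammas[l] for l in leaves])
        trees.append(tree)
        leaf_gammas.append(gammas)
    return f0, trees, leaf_gammas, learning_rate

def predict_treeboost(X, model):
    f0, trees, leaf_gammas, lr = model
    F = np.full(np.asarray(X).shape[0], f0)
    for tree, gammas in zip(trees, leaf_gammas):
        leaves = tree.apply(X)
        F = F + lr * np.array([gammas[l] for l in leaves])
    return F
</syntaxhighlight>

Note that the per-leaf medians replace the single <math>\gamma_m</math> of generic gradient boosting; with least-squares loss the per-leaf step reduces to the leaf means that the fitted tree already predicts, so the modification matters mainly for other loss functions.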


----
__NOTOC__
[[Category:Concept]]
