Gradient Boosting-based Tree (GBT) Learning Algorithm

A Gradient Boosting-based Tree (GBT) Learning Algorithm is a decision tree learning algorithm that is a gradient boosting-based learning algorithm which can be implemented by a Gradient Boosted Decision Tree Learning System (to solve a gradient boosted decision tree learning task).
* <B>Context:</B>
** It can be applied by a [[Gradient Boosting-based Decision Tree Learning System]].
** It can range from being a [[Gradient Boosting-based Classification Tree Learning Algorithm]] to being a [[Gradient Boosting-based Regression Tree Learning Algorithm]].
* <B>Example(s):</B>
** [[CART-based Gradient Tree Boosted Algorithm]] (for [[CART]]).
** …
* <B>Counter-Example(s):</B>
** [[Random Forests Algorithm]].
* <B>See:</B> [[Decision Tree]], [[Indicator Notation]], [[Interaction (Statistics)]], [[Decision Stump]].
----
----
== References ==


=== 2017 ===
* http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html
** QUOTE: Transform your features into a higher dimensional, sparse space. Then train a linear model on these features. <P> First fit an ensemble of trees (totally random trees, a random forest, or gradient boosted trees) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion.
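The following is a minimal sketch of the leaf-encoding pipeline described in the quote above, assuming scikit-learn's GradientBoostingClassifier, OneHotEncoder, and LogisticRegression; the synthetic dataset and all parameter values are illustrative only, not taken from the cited example.

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Illustrative synthetic data; any binary classification set would do.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Fit a gradient boosted tree ensemble on the training set.
gbt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbt.fit(X_train, y_train)

# 2) Map each sample to the index of the leaf it reaches in every tree, then
#    one-hot encode those leaf indices into a sparse, high-dimensional space.
#    apply() returns shape (n_samples, n_estimators, n_classes); binary => last dim is 1.
encoder = OneHotEncoder(handle_unknown="ignore")
train_leaves = gbt.apply(X_train)[:, :, 0]
encoder.fit(train_leaves)

# 3) Train a linear model on the one-hot encoded leaf features.
linear = LogisticRegression(max_iter=1000)
linear.fit(encoder.transform(train_leaves), y_train)

# Evaluate on held-out data using the same transformation.
test_leaves = gbt.apply(X_test)[:, :, 0]
print("held-out accuracy:", linear.score(encoder.transform(test_leaves), y_test))
</syntaxhighlight>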
=== 2015 ===
* (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting Retrieved:2015-1-14.
** Gradient boosting is typically used with [[decision tree]]s (especially [[Classification and regression tree|CART]] trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner. <P> Generic gradient boosting at the ''m''-th step would fit a decision tree <math>\! h_m(x)</math> to pseudo-residuals. Let <math>\! J</math> be the number of its leaves. The tree partitions the input space into <math>\! J</math> disjoint regions <math>\! R_{1m}, \ldots, R_{Jm}</math> and predicts a constant value in each region. Using the [[indicator notation]], the output of <math>\! h_m(x)</math> for input ''x'' can be written as the sum: <P> : <math>h_m(x) = \sum_{j=1}^J b_{jm} I(x \in R_{jm}),</math> <P> where <math>\! b_{jm}</math> is the value predicted in the region <math>\! R_{jm}</math>. <ref> Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient <math>b_{jm}</math> for the region <math>R_{jm}</math> is simply the value of the output variable, averaged over all training instances in <math>R_{jm}</math>. </ref> Then the coefficients <math>\! b_{jm}</math> are multiplied by some value <math>\! \gamma_m</math>, chosen using line search so as to minimize the loss function, and the model is updated as follows: <P> : <math> F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \quad \gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)). </math> <P> Friedman proposes to modify this algorithm so that it chooses a separate optimal value <math>\! \gamma_{jm}</math> for each of the tree's regions, instead of a single <math>\! \gamma_m</math> for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients <math>\! b_{jm}</math> from the tree-fitting procedure can then simply be discarded, and the model update rule becomes: <P> : <math> F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} I(x \in R_{jm}), \quad \gamma_{jm} = \underset{\gamma}{\operatorname{arg\,min}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)). </math> <P> === Size of trees === <P> <math>\! J</math>, the number of terminal nodes in the trees, is the method's parameter that can be adjusted for the data set at hand. It controls the maximum allowed level of [[Interaction (statistics)|interaction]] between variables in the model. With <math>\! J = 2</math> ([[decision stump]]s), no interaction between variables is allowed. With <math>\! J = 3</math> the model may include effects of the interaction between up to two variables, and so on. <P> Hastie et al. comment that typically <math>4 \leq J \leq 8</math> work well for boosting and results are fairly insensitive to the choice of <math>J</math> in this range, <math>J = 2</math> is insufficient for many applications, and <math>J > 10</math> is unlikely to be required.
<references/>
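As a concrete illustration of the per-region line search in the TreeBoost update quoted above, below is a minimal sketch for the least-absolute-deviation (LAD) loss, where the pseudo-residuals are <math>\operatorname{sign}(y - F)</math> and each region's optimal <math>\gamma_{jm}</math> is the median residual in that region. It assumes scikit-learn's DecisionTreeRegressor as the base learner; the function names (fit_treeboost_lad, predict_treeboost), the shrinkage factor, and all parameter values are illustrative, not part of any published implementation.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_treeboost_lad(X, y, n_rounds=100, max_leaf_nodes=8, learning_rate=0.1):
    """Gradient tree boosting with per-leaf line search (TreeBoost) for LAD loss."""
    y = np.asarray(y, dtype=float)
    f0 = np.median(y)                      # F_0: the median minimizes LAD for a constant model
    F = np.full(len(y), f0)
    trees, leaf_gammas = [], []
    for m in range(n_rounds):
        r = np.sign(y - F)                 # pseudo-residuals for L(y, F) = |y - F|
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)  # J terminal nodes
        tree.fit(X, r)
        leaves = tree.apply(X)             # region R_{jm} containing each training point
        # Per-region line search: gamma_{jm} = median of (y_i - F(x_i)) over R_{jm}.
        gammas = {leaf: np.median((y - F)[leaves == leaf]) for leaf in np.unique(leaves)}
        F = F + learning_rate * np.array([gammas[l] for l in leaves])
        trees.append(tree)
        leaf_gammas.append(gammas)
    return f0, trees, leaf_gammas, learning_rate

def predict_treeboost(X, model):
    f0, trees, leaf_gammas, lr = model
    F = np.full(np.asarray(X).shape[0], f0)
    for tree, gammas in zip(trees, leaf_gammas):
        leaves = tree.apply(X)
        F = F + lr * np.array([gammas[l] for l in leaves])
    return F
</syntaxhighlight>

Note that the per-leaf medians replace the single <math>\gamma_m</math> of generic gradient boosting; with least-squares loss the per-leaf step reduces to the leaf means that the fitted tree already predicts, so the modification matters mainly for other loss functions.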


----
__NOTOC__
[[Category:Concept]]
