Ensemble-based Learning Algorithm

An Ensemble-based Learning Algorithm is a learning algorithm that uses one or more model-based learning algorithms (ensemble base algorithms) to produces an ensemble-based predictor (composed of two or more predictors).

Context:
- It can be applied by an Ensemble-based Learning System (to solve a ensemble-based learning task).
- It can range from (typically) being a Supervised Ensemble Algorithm (such as a fully-supervised ensemble algorithm) to being an Unsupervised Ensemble Algorithm.
- It can range from being a Averaging Algorithm, to being a Decision Tree Ensemble Learning Algorithm, to being an Boosting Algorithm.
- …
Example(s):
- A Bagging Algorithm.
- A Boosting Algorithms such as:
  - An AdaBoost algorithm.
  - A Gradient Boosting Algorithm.
- A Decision Tree Ensemble Learning Algorithm such as:
  - A Gradient Tree Boosting Algorithm.
  - A Random Forests Algorithm such as:
    - An Extremely Randomized Trees Algorithm
    - A Totally Random Trees Embedding Algorithm.
  - A Rotation Forests Algorithm.
  - An Isolation Forest Algorithm.
- A DECORATE Algorithm.
- A Stacked Learning Algorithms.
- A Voting Classification Algorithm.
- A Negative Correlation Learning Algorithm.
- A Random Subspace Method-Based Algorithm.
- An ID3 algorithm.
- A C4.5 algorithm.
- A CART.
- A CHAID (CHi-squared Automatic Interaction Detector).
- A Multivariate adaptive regression splines,
- A Conditional Inference Trees.
- …
Counter-Example(s):
- any Single-Model Supervised Learning Algorithm.
- Consensus Clustering.
See: Non-Parametric Supervised Learning, Weak Learner, Learning Algorithm, Bayesian Probability.

References

2017A

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Decision_tree_learning#Decision_tree_types Retrieved:2017-10-15.
- Decision trees used in data mining are of two main types:
  - Classification tree analysis is when the predicted outcome is the class to which the data belongs.
  - Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
- The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al.^[1] Trees used for regression and trees used for classification have some similarities - but also some differences, such as the procedure used to determine where to split.
  Some techniques, often called ensemble methods, construct more than one decision tree:
  - Boosted trees Incrementally building an ensemble by training each new instance to emphasize the training instances previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems. ^[2] ^[3]
  - Bootstrap aggregated (or bagged) decision trees, an early ensemble method, builds multiple decision trees by repeatedly resampling training data with replacement, and voting the trees for a consensus prediction. ^[4]
    - A random forest classifier is a specific type of bootstrap aggregating
  - Rotation forest - in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features. ^[5]
    A special case of a decision tree is a decision list, which is a one-sided decision tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity, permit non-greedy learning methods and monotonic constraints to be imposed.
    Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.
    There are many specific decision-tree algorithms. Notable ones include:
  - ID3 (Iterative Dichotomiser 3) * C4.5 (successor of ID3)
  - CART (Classification And Regression Tree)
  - CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.
  - MARS: extends decision trees to handle numerical data better.
  - Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.^[6] ^[7]
- ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet follow a similar approach for learning decision tree from training tuples.

2017B

(Scikit Learn, 2017) ⇒ "Ensemble Methods" in http://scikit-learn.org/stable/modules/ensemble.html Retrieved: 2017-10-15
- QUOTE: The goal of ensemble methods is to combine the predictions of several base estimators built with a given [[learning algorithm[[in order to improve generalizability robustness over a single estimator.
  Two families of ensemble methods are usually distinguished:
  In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.
  Examples: Bagging methods, Forests of randomized trees, …
  By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
  Examples: AdaBoost, Gradient Tree Boosting, …

2017c
(Brown, 2017) ⇒ Gavin Brown. (2017). [Decision Tree Ensemble Learning System "Ensemble Learning."] In: "Encyclopedia of Machine Learning and Data Mining"(Editors: Claude Sammut, Geoffrey I. Webb) pp 393-402
QUOTE: Algorithms for Learning a Set of Models
If we had a committee of people taking decisions, it is self-evident that we would not want them all to make the same bad judgments at the same time. With a committee of learning models, the same intuition applies: we will have no gain from combining a set of identical models. We wish the models to exhibit a certain element of “diversity” in their group behavior, though still retaining good performance individually.
We therefore make a distinction between two types of ensemble learning algorithms, those which encourage diversity implicitly, and those which encourage it explicitly. The vast majority of ensemble methods are implicit, in that they provide different random subsets of the training data to each learner. Diversity is encouraged “implicitly” by random sampling of the data space: at no point is a measurement taken to ensure diversity will emerge. The random differences between the datasets might be in the selection of examples (the Bagging algorithm), the selection of features (Random Subspace Method, Ho (1998) or Rotation Forests, Rodriguez et al. 2006), or combinations of the two (the Random Forests algorithm, Breiman 2001). Many other “randomization” schemes are of course possible.
An alternative is to explicitly encourage diversity, constructing each ensemble member with some measurement ensuring that it is substantially different from the other members. Boosting algorithms achieve this by altering the distribution of training examples for each learner such that it is encouraged to make more accurate predictions where previous predictors have made errors. The DECORATE algorithm (Melville and Mooney 2005) explicitly alters the distribution of class labels, such that successive models are forced to learn different answers to the same problem. Negative correlation learning (see Brown 2004; Brown et al. 2005), includes a penalty term when learning each ensemble member, explicitly managing the accuracy-diversity trade-off.
In general, ensemble methods constitute a large class of algorithms – some based on heuristics, and some on sound learning-theoretic principles. The three algorithms that have received the most attention in the literature are reviewed here. It should be noted that we present only the most basic form of each; numerous modifications have been proposed for a variety of learning scenarios. As further study the reader is referred to the many comprehensive surveys of the field (Brown et al. 2005; Kuncheva 2004b; Polikar 2006).
2010
(Seni & Elder, 2010) ⇒ Giovanni Seni, and John F. Elder. (2010). “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions.” Morgan & Claypool. doi:10.2200/S00240ED1V01Y200912DMK002
ABSTRACT: Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges -- from investment timing to drug discovery, and fraud detection to recommendation systems -- where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization -- today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.
This book is aimed at novice and advanced analytic researchers and practitioners -- especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the techniques.
The authors are industry experts in data mining and machine learning who are also adjunct professors and popular speakers. Although early pioneers in discovering and using ensembles, they here distill and clarify the recent groundbreaking work of leading academics (such as Jerome Friedman) to bring the benefits of ensembles to practitioners.
2006
(Bishop, 2006) ⇒ Christopher M. Bishop. (2006). “Pattern Recognition and Machine Learning. Springer, Information Science and Statistics.
2005
(Bühlmann, 2005) ⇒ Peter Bühlmann. (2005). “16.1 An Introduction to Ensemble Methods." website
QUOTE: Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. More precisely, consider for simplicity the framework of function estimation. We are interested in estimating a real-valued function [math]\displaystyle{ \displaystyle g:\mathbb{R}^d \to \mathbb{R} }[/math] based on data [math]\displaystyle{ (X_1,Y_1),\ldots, (X_n,Y_n) }[/math] where [math]\displaystyle{ X }[/math] is a [math]\displaystyle{ d }[/math]-dimensional predictor variable and [math]\displaystyle{ Y }[/math] a univariate response. Generalizations to other functions [math]\displaystyle{ g(\cdot) }[/math] and other data-types are possible. We assume to have specified a base procedure which, given some input data (as above), yields an estimated function [math]\displaystyle{ \hat{g}(\cdot) }[/math]. For example, the base procedure could be a nonparametric kernel estimator (if [math]\displaystyle{ d }[/math] is small) or a nonparametric statistical method with some structural restrictions (for [math]\displaystyle{ d \ge 2 }[/math]) such as a regression tree (or class-probability estimates from a classification tree). We can run a base procedure many times when changing the input data: the original idea of ensemble methods is to use reweighted original data to obtain different estimates [math]\displaystyle{ \hat{g}_1(\cdot),\hat{g}_2(\cdot),\hat{g}_3(\cdot),\ldots }[/math] based on different reweighted input data. We can then construct an ensemble-based function estimate [math]\displaystyle{ g_{ens}(\cdot) }[/math] by taking linear combinations of the individual function estimates [math]\displaystyle{ \hat{g}_k(\cdot) }[/math]:
[math]\displaystyle{ \displaystyle \hat{g}_{ens}(\cdot) =\sum_{k=1}^M c_k \hat{g}_k(\cdot)\;, }[/math] (16.1)
where the [math]\displaystyle{ \hat{g}_k(\cdot) }[/math] are obtained from the base procedure based on the [math]\displaystyle{ k }[/math]th reweighted data-set. For some ensemble methods, e.g. for bagging (see Sect. 16.2), the linear combination coefficients [math]\displaystyle{ c_k \equiv 1/M }[/math] are averaging weights; for other methods, e.g. for boosting (see Sect. 16.3), [math]\displaystyle{ \sum_{k=1}^M c_k }[/math] increases as [math]\displaystyle{ M }[/math] gets larger.
Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods are primarily reducing the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. We will argue in Sects. 16.3.1 and 16.3.6 that boosting may be even viewed as a non-ensemble method which has tremendous advantages over ensemble (or multiple prediction) methods in terms of interpretation.
Random forests (Breiman, 2001) is a very different ensemble method than bagging or boosting. The earliest random forest proposal is from Amit and Geman (Amit & Geman, 1997). From the perspective of prediction, random forests is about as good as boosting, and often better than bagging. For further details about random forests we refer to (Breiman, 2001).
2000
(Dietterich, 2000a) ⇒ Thomas G. Dietterich. (2000). “An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.” In: Machine Learning Journal, 40(2). doi:10.1023/A:1007607513941

(Valpola, 2000) ⇒ Harri Valpola. (2000). “Bayesian Ensemble Learning for Nonlinear Factor Analysis." PhD Dissertation, Helsinki University of Technology.
QUOTE: ensemble learning: A technique for approximating the exact application of Bayesian probability theory.
Ensemble learning is a technique for parametric approximation of the posterior probability where fitting the parametric approximation to the actual posterior probability is achieved by minimising their misfit. The misfit is measured with Kullback-Leibler information [70], also known as relative or cross entropy. It is a measure suited for comparing probability distributions and, more importantly, it can be computed efficiently in practice if the approximation is chosen to be simple enough.
1998
(Barber & Bishop, 1998) ⇒ David Barber, and Christopher M. Bishop. (1998). “Ensemble Learning in Bayesian Neural Networks.” In: Christopher M. Bishop (editor), "Generalization in Neural Networks and Machine Learning." Springer. ISBN:354064928X
QUOTE: A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993) and again involves finding a simple, analytically tractable, approximation to the true posterior distribution. Unlike Laplace's method, however, the approximating distribution is fitted globally, rather than locally, by minimizing a Kullback-Leibler divergence. Hinton and van Camp (1993) showed that, in the case of a Gaussian approximating distribution with a diagonal covariance, a deterministic learning algorithm could be derived.
1993
(Hinton & Camp, 1993) ⇒ G. E. Hinton, and D. van Camp (1993). “Keeping neural networks simple by minimizing the description length of the weights.” In: Proceedings of the Sixth Annual Conference on Computational Learning Theory.

↑ Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8.

↑ Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.

↑ Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer Verlag.

↑ Breiman, L. (1996). Bagging Predictors. “Machine Learning, 24": pp. 123-140.

↑ Rodriguez, J.J. and Kuncheva, L.I. and Alonso, C.J. (2006), Rotation forest: A new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619-1630.

↑ Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics. 15 (3): 651–674. JSTOR 27594202. doi:10.1198/106186006X133933.

↑ Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

[bfos-1] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8.

[2] Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.

[3] Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer Verlag.

[4] Breiman, L. (1996). Bagging Predictors. “Machine Learning, 24": pp. 123-140.

[5] Rodriguez, J.J. and Kuncheva, L.I. and Alonso, C.J. (2006), Rotation forest: A new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619-1630.

[Hothorn2006-6] Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics. 15 (3): 651–674. JSTOR 27594202. doi:10.1198/106186006X133933.

[Strobl2009-7] Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

[1]

[2]

[3]

[4]

[5]

[6]

[7]