Ensemble-based Learning Algorithm

(Redirected from Ensemble learning)

An Ensemble-based Learning Algorithm is a learning algorithm that uses one or more model-based learning algorithms (ensemble base algorithms) to produces an ensemble-based predictor (composed of two or more predictors).

References

2017A

• (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Decision_tree_learning#Decision_tree_types Retrieved:2017-10-15.
• Decision trees used in data mining are of two main types:
• Classification tree analysis is when the predicted outcome is the class to which the data belongs.
• Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
• The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures, first introduced by Breiman et al.[1] Trees used for regression and trees used for classification have some similarities - but also some differences, such as the procedure used to determine where to split.[1]

Some techniques, often called ensemble methods, construct more than one decision tree:

• Boosted trees Incrementally building an ensemble by training each new instance to emphasize the training instances previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems. [2] [3]
• Bootstrap aggregated (or bagged) decision trees, an early ensemble method, builds multiple decision trees by repeatedly resampling training data with replacement, and voting the trees for a consensus prediction. [4]
• Rotation forest - in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features. [5]

A special case of a decision tree is a decision list, which is a one-sided decision tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity, permit non-greedy learning methods and monotonic constraints to be imposed.

Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.

There are many specific decision-tree algorithms. Notable ones include:

• ID3 (Iterative Dichotomiser 3) * C4.5 (successor of ID3)
• CART (Classification And Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.
• MARS: extends decision trees to handle numerical data better.
• Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.[6] [7]
• ID3 and CART were invented independently at around the same time (between 1970 and 1980), yet follow a similar approach for learning decision tree from training tuples.

2017c

• (Brown, 2017) ⇒ Gavin Brown. (2017). [Decision Tree Ensemble Learning System "Ensemble Learning."] In: "Encyclopedia of Machine Learning and Data Mining"(Editors: Claude Sammut, Geoffrey I. Webb) pp 393-402
• QUOTE: Algorithms for Learning a Set of Models

If we had a committee of people taking decisions, it is self-evident that we would not want them all to make the same bad judgments at the same time. With a committee of learning models, the same intuition applies: we will have no gain from combining a set of identical models. We wish the models to exhibit a certain element of “diversity” in their group behavior, though still retaining good performance individually.

We therefore make a distinction between two types of ensemble learning algorithms, those which encourage diversity implicitly, and those which encourage it explicitly. The vast majority of ensemble methods are implicit, in that they provide different random subsets of the training data to each learner. Diversity is encouraged “implicitly” by random sampling of the data space: at no point is a measurement taken to ensure diversity will emerge. The random differences between the datasets might be in the selection of examples (the Bagging algorithm), the selection of features (Random Subspace Method, Ho (1998) or Rotation Forests, Rodriguez et al. 2006), or combinations of the two (the Random Forests algorithm, Breiman 2001). Many other “randomization” schemes are of course possible.

An alternative is to explicitly encourage diversity, constructing each ensemble member with some measurement ensuring that it is substantially different from the other members. Boosting algorithms achieve this by altering the distribution of training examples for each learner such that it is encouraged to make more accurate predictions where previous predictors have made errors. The DECORATE algorithm (Melville and Mooney 2005) explicitly alters the distribution of class labels, such that successive models are forced to learn different answers to the same problem. Negative correlation learning (see Brown 2004; Brown et al. 2005), includes a penalty term when learning each ensemble member, explicitly managing the accuracy-diversity trade-off.

In general, ensemble methods constitute a large class of algorithms – some based on heuristics, and some on sound learning-theoretic principles. The three algorithms that have received the most attention in the literature are reviewed here. It should be noted that we present only the most basic form of each; numerous modifications have been proposed for a variety of learning scenarios. As further study the reader is referred to the many comprehensive surveys of the field (Brown et al. 2005; Kuncheva 2004b; Polikar 2006).

2010

• (Seni & Elder, 2010) ⇒ Giovanni Seni, and John F. Elder. (2010). “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions.” Morgan & Claypool. doi:10.2200/S00240ED1V01Y200912DMK002
• ABSTRACT: Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges -- from investment timing to drug discovery, and fraud detection to recommendation systems -- where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization -- today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity.

This book is aimed at novice and advanced analytic researchers and practitioners -- especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the techniques.

The authors are industry experts in data mining and machine learning who are also adjunct professors and popular speakers. Although early pioneers in discovering and using ensembles, they here distill and clarify the recent groundbreaking work of leading academics (such as Jerome Friedman) to bring the benefits of ensembles to practitioners.

2005

• (Bühlmann, 2005) ⇒ Peter Bühlmann. (2005). “16.1 An Introduction to Ensemble Methods." website
• QUOTE: Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. More precisely, consider for simplicity the framework of function estimation. We are interested in estimating a real-valued function $\displaystyle g:\mathbb{R}^d \to \mathbb{R}$ based on data $(X_1,Y_1),\ldots, (X_n,Y_n)$ where $X$ is a $d$-dimensional predictor variable and $Y$ a univariate response. Generalizations to other functions $g(\cdot)$ and other data-types are possible. We assume to have specified a base procedure which, given some input data (as above), yields an estimated function $\hat{g}(\cdot)$. For example, the base procedure could be a nonparametric kernel estimator (if $d$ is small) or a nonparametric statistical method with some structural restrictions (for $d \ge 2$) such as a regression tree (or class-probability estimates from a classification tree). We can run a base procedure many times when changing the input data: the original idea of ensemble methods is to use reweighted original data to obtain different estimates $\hat{g}_1(\cdot),\hat{g}_2(\cdot),\hat{g}_3(\cdot),\ldots$ based on different reweighted input data. We can then construct an ensemble-based function estimate $g_{ens}(\cdot)$ by taking linear combinations of the individual function estimates $\hat{g}_k(\cdot)$:

$\displaystyle \hat{g}_{ens}(\cdot) =\sum_{k=1}^M c_k \hat{g}_k(\cdot)\;,$ (16.1)

where the $\hat{g}_k(\cdot)$ are obtained from the base procedure based on the $k$th reweighted data-set. For some ensemble methods, e.g. for bagging (see Sect. 16.2), the linear combination coefficients $c_k \equiv 1/M$ are averaging weights; for other methods, e.g. for boosting (see Sect. 16.3), $\sum_{k=1}^M c_k$ increases as $M$ gets larger.

Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods are primarily reducing the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. We will argue in Sects. 16.3.1 and 16.3.6 that boosting may be even viewed as a non-ensemble method which has tremendous advantages over ensemble (or multiple prediction) methods in terms of interpretation.

Random forests (Breiman, 2001) is a very different ensemble method than bagging or boosting. The earliest random forest proposal is from Amit and Geman (Amit & Geman, 1997). From the perspective of prediction, random forests is about as good as boosting, and often better than bagging. For further details about random forests we refer to (Breiman, 2001).

2000

• (Valpola, 2000) ⇒ Harri Valpola. (2000). “Bayesian Ensemble Learning for Nonlinear Factor Analysis." PhD Dissertation, Helsinki University of Technology.
• QUOTE: ensemble learning: A technique for approximating the exact application of Bayesian probability theory.

Ensemble learning is a technique for parametric approximation of the posterior probability where fitting the parametric approximation to the actual posterior probability is achieved by minimising their misfit. The misfit is measured with Kullback-Leibler information [70], also known as relative or cross entropy. It is a measure suited for comparing probability distributions and, more importantly, it can be computed efficiently in practice if the approximation is chosen to be simple enough.

1993

• (Hinton & Camp, 1993) ⇒ G. E. Hinton, and D. van Camp (1993). “Keeping neural networks simple by minimizing the description length of the weights.” In: Proceedings of the Sixth Annual Conference on Computational Learning Theory.

1. Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8.
2. Friedman, J. H. (1999). Stochastic gradient boosting. Stanford University.
3. Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer Verlag.
4. Breiman, L. (1996). Bagging Predictors. "Machine Learning, 24": pp. 123-140.
5. Rodriguez, J.J. and Kuncheva, L.I. and Alonso, C.J. (2006), Rotation forest: A new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619-1630.
6. Hothorn, T.; Hornik, K.; Zeileis, A. (2006). “Unbiased Recursive Partitioning: A Conditional Inference Framework". Journal of Computational and Graphical Statistics. 15 (3): 651–674. JSTOR 27594202. doi:10.1198/106186006X133933.
7. Strobl, C.; Malley, J.; Tutz, G. (2009). “An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.