# Ensemble Learning Algorithm

(Redirected from Ensemble Algorithm)

An Ensemble Learning Algorithm is a learning algorithm that uses one or more model-based learning algorithms (ensemble base algorithms) to produce an ensemble-based predictor (composed of two or more predictors).

## References

### 2005

• (Bühlmann, 2005) ⇒ Peter Bühlmann. (2005). "16.1 An Introduction to Ensemble Methods." website
• QUOTE: Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. More precisely, consider for simplicity the framework of function estimation. We are interested in estimating a real-valued function $$\displaystyle g:\mathbb{R}^d \to \mathbb{R}$$ based on data $$(X_1,Y_1),\ldots, (X_n,Y_n)$$ where $$X$$ is a $$d$$-dimensional predictor variable and $$Y$$ a univariate response. Generalizations to other functions $$g(\cdot)$$ and other data-types are possible. We assume to have specified a base procedure which, given some input data (as above), yields an estimated function $$\hat{g}(\cdot)$$. For example, the base procedure could be a nonparametric kernel estimator (if $$d$$ is small) or a nonparametric statistical method with some structural restrictions (for $$d \ge 2$$) such as a regression tree (or class-probability estimates from a classification tree). We can run a base procedure many times when changing the input data: the original idea of ensemble methods is to use reweighted original data to obtain different estimates $$\hat{g}_1(\cdot),\hat{g}_2(\cdot),\hat{g}_3(\cdot),\ldots$$ based on different reweighted input data. We can then construct an ensemble-based function estimate $$\hat{g}_{ens}(\cdot)$$ by taking linear combinations of the individual function estimates $$\hat{g}_k(\cdot)$$:

$$\displaystyle \hat{g}_{ens}(\cdot) =\sum_{k=1}^M c_k \hat{g}_k(\cdot)\;,$$ (16.1)

where the $$\hat{g}_k(\cdot)$$ are obtained from the base procedure based on the $$k$$th reweighted data-set. For some ensemble methods, e.g. for bagging (see Sect. 16.2), the linear combination coefficients $$c_k \equiv 1/M$$ are averaging weights; for other methods, e.g. for boosting (see Sect. 16.3), $$\sum_{k=1}^M c_k$$ increases as $$M$$ gets larger.
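The bagging case of Eq. (16.1) can be sketched in a few lines: fit a base procedure to $$M$$ bootstrap-resampled (i.e. multinomially reweighted) versions of the data and average the resulting estimates with weights $$c_k \equiv 1/M$$. The toy data and the choice of a cubic-polynomial base procedure below are illustrative assumptions, not part of the quoted text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: one-dimensional predictor X, univariate response Y.
n = 200
X = rng.uniform(-3, 3, size=n)
Y = np.sin(X) + rng.normal(scale=0.3, size=n)

def base_procedure(X, Y):
    # Hypothetical base procedure: cubic polynomial least squares,
    # standing in for a kernel estimator or regression tree.
    coef = np.polyfit(X, Y, deg=3)
    return lambda Xq: np.polyval(coef, Xq)

# Obtain g_hat_1, ..., g_hat_M from the base procedure, each fit to a
# bootstrap-resampled (reweighted) version of the original data.
M = 25
fits = []
for _ in range(M):
    idx = rng.integers(0, n, size=n)       # bootstrap reweighting
    fits.append(base_procedure(X[idx], Y[idx]))

def g_ens(Xq):
    # Eq. (16.1) with averaging weights c_k = 1/M (bagging).
    return sum(f(Xq) for f in fits) / M

Xq = np.linspace(-3, 3, 5)
print(g_ens(Xq))
```

Boosting would differ only in how the reweighted data sets and the coefficients $$c_k$$ are chosen: each fit depends on the residuals of the previous ones, and the weights no longer average to one.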

Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods are primarily reducing the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. We will argue in Sects. 16.3.1 and 16.3.6 that boosting may be even viewed as a non-ensemble method which has tremendous advantages over ensemble (or multiple prediction) methods in terms of interpretation.

Random forests (Breiman, 2001) is a very different ensemble method than bagging or boosting. The earliest random forest proposal is from Amit and Geman (Amit & Geman, 1997). From the perspective of prediction, random forests is about as good as boosting, and often better than bagging. For further details about random forests we refer to (Breiman, 2001).

### 2000

• (Valpola, 2000) ⇒ Harri Valpola. (2000). "Bayesian Ensemble Learning for Nonlinear Factor Analysis." PhD Dissertation, Helsinki University of Technology.
• QUOTE: ensemble learning: A technique for approximating the exact application of Bayesian probability theory.

Ensemble learning is a technique for parametric approximation of the posterior probability where fitting the parametric approximation to the actual posterior probability is achieved by minimising their misfit. The misfit is measured with Kullback-Leibler information [70], also known as relative or cross entropy. It is a measure suited for comparing probability distributions and, more importantly, it can be computed efficiently in practice if the approximation is chosen to be simple enough.
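The idea of fitting a parametric approximation by minimising a Kullback-Leibler misfit can be illustrated with the simplest possible case: approximating a univariate Gaussian "exact posterior" by another univariate Gaussian, where the KL divergence has a closed form. The target parameters and the grid search below are illustrative assumptions; real Bayesian ensemble learning minimises the misfit over far richer posteriors.

```python
import math

def kl_gaussians(mq, sq, mp, sp):
    # Closed-form Kullback-Leibler divergence KL(q || p) between two
    # univariate Gaussians q = N(mq, sq^2) and p = N(mp, sp^2).
    return math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5

# Hypothetical "exact posterior": p = N(1.0, 0.5^2).
mp, sp = 1.0, 0.5

# Fit the parametric approximation q = N(mq, sq^2) by minimising the
# misfit KL(q || p) over a coarse grid of candidate parameters.
candidates = [(m / 10, s / 10)
              for m in range(-20, 31)
              for s in range(1, 21)]
mq_best, sq_best = min(candidates, key=lambda t: kl_gaussians(*t, mp, sp))
print(mq_best, sq_best)  # -> 1.0 0.5: the misfit is zero exactly at p
```

Because the approximating family here contains the target, the minimised misfit is zero; in practice the approximation is chosen simple enough that the KL term can be computed efficiently, and the residual misfit is strictly positive.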

### 1993

• (Hinton & Camp, 1993) ⇒ G. E. Hinton, and D. van Camp (1993). "Keeping neural networks simple by minimizing the description length of the weights." In: Proceedings of the Sixth Annual Conference on Computational Learning Theory.