Bayesian Model Averaging Algorithm


A Bayesian Model Averaging Algorithm is a Statistical Modeling Algorithm that seeks to approximate the Bayes Optimal Classifier by averaging the individual predictions of all candidate classifiers in the hypothesis space, each weighted by how well it explains the training data (its likelihood) and by its prior probability.
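
In symbols, the averaged prediction for a new input can be written as follows (the symbols [math]y[/math] for the predicted label, [math]x[/math] for the input, [math]T[/math] for the training data, [math]h[/math] for a hypothesis, and [math]H[/math] for the hypothesis space are introduced here only for illustration):

[math]P(y \mid x, T) = \sum_{h \in H} P(y \mid x, h) \, P(h \mid T) \propto \sum_{h \in H} P(y \mid x, h) \, P(T \mid h) \, P(h)[/math]

so each hypothesis contributes in proportion to its likelihood [math]P(T \mid h)[/math] times its prior [math]P(h)[/math].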



References

2013

  • http://en.wikipedia.org/wiki/Ensemble_learning#Bayesian_model_averaging
    • Bayesian model averaging is an ensemble technique that seeks to approximate the Bayes Optimal Classifier by sampling hypotheses from the hypothesis space, and combining them using Bayes' law.[1] Unlike the Bayes optimal classifier, Bayesian model averaging can be practically implemented. Hypotheses are typically sampled using a Monte Carlo sampling technique such as MCMC. For example, Gibbs sampling may be used to draw hypotheses that are representative of the distribution [math]\displaystyle{ P(T|H) }[/math]. It has been shown that under certain circumstances, when hypotheses are drawn in this manner and averaged according to Bayes' law, this technique has an expected error that is bounded to be at most twice the expected error of the Bayes optimal classifier.[2] Despite the theoretical correctness of this technique, however, it has a tendency to promote over-fitting, and does not perform as well empirically as simpler ensemble techniques such as bagging.[3]
  1. Template:Cite jstor
  2. David Haussler, Michael Kearns, and Robert E. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14:83–113, 1994
  3. Template:Cite conference
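
The excerpt above approximates the average by drawing hypotheses from the posterior rather than enumerating the whole hypothesis space. Below is a minimal Python sketch of just the prediction step under that scheme; the sampler itself is not shown, and the sampled_models list and predict_proba interface are illustrative assumptions. The point is that hypotheses drawn in proportion to their posterior probability are combined with equal weight, because the sampling frequency already carries the posterior weighting.

  def bma_predict_from_samples(sampled_models, x):
      # sampled_models: hypotheses already drawn (e.g., by Gibbs sampling) in
      # proportion to their posterior probability.
      # predict_proba: assumed per-model interface returning a list of class
      # probabilities for the input x.
      votes = [m.predict_proba(x) for m in sampled_models]
      n_classes = len(votes[0])
      # Equal-weight average: the draw frequencies already encode the posterior
      # weighting, so no extra per-model weights are needed here.
      return [sum(v[c] for v in votes) / len(votes) for c in range(n_classes)]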


  function train_bayesian_model_averaging(T)
      z = -infinity   (z tracks the largest log-likelihood seen so far, for numerical stability)
      For each model, m, in the ensemble:
          Train m, typically using a random subset of the [[training data]], T.
          Let prior[m] be the prior probability that m is the generating hypothesis.
              Typically, [[uniform prior]]s are used, so prior[m] = 1
              (any constant will do, because the weights are normalized at the end).
          Let x be the [[predictive accuracy]] (from 0 to 1) of m for predicting the labels in T.
          Use x to estimate log_likelihood[m]. Often, this is computed as
              log_likelihood[m] = |T| * (x * log(x) + (1 - x) * log(1 - x)),
              where |T| is the number of training patterns in T.
          z = max(z, log_likelihood[m])
      For each model, m, in the ensemble:
          weight[m] = prior[m] * exp(log_likelihood[m] - z)
              (subtracting z before exponentiating prevents floating-point underflow)
      Normalize all the model weights to sum to 1.
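
For concreteness, the following is a short, self-contained Python sketch of the weighting computation in the pseudocode above. The function name, the clipping of accuracies away from 0 and 1, and the example values are illustrative choices rather than part of any reference implementation.

  import math

  def bma_weights(accuracies, n_train, priors=None):
      """Turn per-model training accuracies into normalized model weights."""
      # accuracies: predictive accuracy of each model on the training data (0..1).
      # n_train:    number of training patterns |T|.
      # priors:     prior probability of each model; a constant (uniform) prior
      #             is used if omitted, and the constant cancels on normalization.
      if priors is None:
          priors = [1.0] * len(accuracies)

      def log_likelihood(x):
          # log L(m) = |T| * (x log x + (1 - x) log(1 - x)), with x clipped away
          # from 0 and 1 so that log() is defined.
          x = min(max(x, 1e-12), 1.0 - 1e-12)
          return n_train * (x * math.log(x) + (1.0 - x) * math.log(1.0 - x))

      log_liks = [log_likelihood(x) for x in accuracies]
      z = max(log_liks)  # subtract the maximum before exponentiating (log-sum-exp trick)
      unnormalized = [p * math.exp(ll - z) for p, ll in zip(priors, log_liks)]
      total = sum(unnormalized)
      return [w / total for w in unnormalized]

  # Hypothetical usage: three models with different training accuracies on a
  # training set of 100 patterns; the most accurate model dominates the weights.
  print(bma_weights([0.72, 0.80, 0.65], n_train=100))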
