# 2000 BayesianEnsembleLearning

(Redirected from Valpola, 2000)

## Quotes

Point estimates - http://www.cis.hut.fi/harri/thesis/valpola_thesis/node22.html

The most efficient and least accurate approximation is, in general, a point estimate of the posterior probability. It means that only the model with highest probability or probability density is used for making the predictions and decisions. Whether the accuracy is good depends on how large a part of the probability mass is occupied by models which are similar to the most probable model.

The two point estimates in wide use are the maximum likelihood (ML) and the maximum a posteriori (MAP) estimator. The ML estimator neglects the prior probability of the models and maximises only the probability which the model gives for the observation. The MAP estimator chooses the model which has the highest posterior probability mass or density.
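The difference between the two estimators can be illustrated with a coin whose heads probability is unknown. The sketch below is not from the thesis: the Beta prior, the function names and the data are assumptions chosen for illustration, and the MAP formula used is the mode of the Beta posterior (valid for `alpha > 1` and `beta > 1`).

```python
def ml_estimate(heads, flips):
    """Maximum likelihood estimate: ignores the prior entirely."""
    return heads / flips


def map_estimate(heads, flips, alpha=2.0, beta=2.0):
    """MAP estimate under an assumed Beta(alpha, beta) prior.

    The posterior is Beta(heads + alpha, flips - heads + beta);
    its mode is returned (requires alpha > 1 and beta > 1).
    """
    return (heads + alpha - 1.0) / (flips + alpha + beta - 2.0)


heads, flips = 3, 3                  # three flips, all heads
print(ml_estimate(heads, flips))     # 1.0
print(map_estimate(heads, flips))    # 0.8
```

With only three observations the ML estimate commits fully to the data, while the prior pulls the MAP estimate back toward 0.5.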

EM Algorithm - http://www.cis.hut.fi/harri/thesis/valpola_thesis/node22.html

The expectation-maximisation (EM) algorithm [21] is often used for learning latent variable models, including the factor analysis model [110]. It is a mixture of point estimation and analytic integration over posterior density. The EM algorithm is useful for latent variable models if the posterior probability of the latent variables can be computed when other parameters of the model are assumed to be known.

The EM algorithm was developed for maximum likelihood parameter estimation from incomplete data. Let us denote the measured data by $x$, the missing data by $y$ and the parameters by $\theta$. The algorithm starts with an estimate $\hat{\theta}_0$ and alternates between two steps, called E-step for expectation and M-step for maximisation. In the former, the conditional probability distribution $p(y \vert \hat{\theta}_i, x)$ of the missing data is computed given the current estimate $\hat{\theta}_i$ of the parameters and in the latter, a new estimate $\hat{\theta}_{i+1}$ of the parameters is computed by maximising the expectation of $\ln p(x, y \vert \theta)$ over the distribution computed in the E-step.

It can be proven that this iteration either increases the probability $p(x \vert \theta)$ or leaves it unchanged. The usefulness of the method is due to the fact that it is often easier to integrate the logarithmic probability $\ln p(x, y \vert \theta)$ than probability $p(x, y \vert \theta)$ which would be required if $p(x \vert \theta)$ were maximised directly.

The EM algorithm applies to latent variable models when the latent variables are assumed to be the missing data. When compared to simple point estimation, the benefit of the method is that fewer unknown variables are assigned a point estimate, thus alleviating the problems related to overfitting.
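The E-step and M-step described above can be sketched on a toy problem. The example is an illustration only, not the thesis's model: it assumes an equal-weight mixture of two unit-variance Gaussians, where the missing data $y$ are the component labels and the parameters $\theta$ are the two means. All names and the synthetic data are hypothetical.

```python
import math
import random


def log_likelihood(xs, mu):
    """ln p(x | theta) for an equal-weight mixture of two unit-variance Gaussians."""
    norm = math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in xs:
        p = sum(0.5 * math.exp(-0.5 * (x - m) ** 2) / norm for m in mu)
        total += math.log(p)
    return total


def e_step(xs, mu):
    """E-step: p(y | theta_i, x), the probability that each x came from component 0."""
    resp = []
    for x in xs:
        w0 = math.exp(-0.5 * (x - mu[0]) ** 2)
        w1 = math.exp(-0.5 * (x - mu[1]) ** 2)
        resp.append(w0 / (w0 + w1))
    return resp


def m_step(xs, resp):
    """M-step: maximise the expected ln p(x, y | theta), giving weighted means."""
    n0 = sum(resp)
    n1 = len(xs) - n0
    mu0 = sum(r * x for r, x in zip(resp, xs)) / n0
    mu1 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n1
    return (mu0, mu1)


def run_em(xs, mu, iterations=20):
    """Alternate E- and M-steps, recording ln p(x | theta) after each iteration."""
    history = [log_likelihood(xs, mu)]
    for _ in range(iterations):
        mu = m_step(xs, e_step(xs, mu))
        history.append(log_likelihood(xs, mu))
    return mu, history


rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(100)]
        + [rng.gauss(3.0, 1.0) for _ in range(100)])
mu_hat, history = run_em(data, mu=(-1.0, 1.0))
```

The recorded `history` should never decrease, which is the monotonicity property stated in the quote; only the means receive point estimates, while the labels are integrated over.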

Stochastic sampling - http://www.cis.hut.fi/harri/thesis/valpola_thesis/node23.html

In stochastic sampling one generates a set of samples of models, whose distribution approximates the posterior probability of the models [33]. There are several techniques having slightly different properties, but in general the methods yield good approximations of the posterior probability of the models but are computationally demanding. To some extent the trade-off between efficiency and accuracy can be controlled by adjusting the number of generated samples.

For simple problems, the stochastic sampling approach is attractive because it poses the minimal amount of restrictions on the structure of the model and does not require careful design of the learning algorithm. For an accessible presentation of stochastic sampling methods from the point of view of neural networks, see [92].
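As a rough illustration of the idea, the sketch below draws samples from a posterior with a random-walk Metropolis sampler. The thesis excerpt does not name a specific sampler, so the choice of Metropolis, the coin-bias model with a uniform prior, and all names and settings here are assumptions.

```python
import math
import random


def log_posterior(theta, heads, tails):
    """Unnormalised log posterior of a coin bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(heads, tails, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis: the samples approximate the posterior of theta."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, heads, tails)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, tails)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(heads=7, tails=3, n_samples=20000)
posterior_mean = sum(samples) / len(samples)
```

For this toy model the exact posterior is Beta(8, 4) with mean 2/3, so `posterior_mean` can be checked against it. Raising `n_samples` trades computation for accuracy, which is the trade-off the quote mentions.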

Ensemble learning - http://www.cis.hut.fi/harri/thesis/valpola_thesis/node28.html

Ensemble learning is a technique for parametric approximation of the posterior probability where fitting the parametric approximation to the actual posterior probability is achieved by minimising their misfit. The misfit is measured with Kullback-Leibler information [70], also known as relative or cross entropy. It is a measure suited for comparing probability distributions and, more importantly, it can be computed efficiently in practice if the approximation is chosen to be simple enough.
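A minimal sketch of fitting a parametric approximation by minimising the Kullback-Leibler misfit follows. It is not the thesis's algorithm: the "true" posterior is a hypothetical one-dimensional mixture, the approximation $q$ is a single Gaussian, the divergence is taken in the direction $\mathrm{KL}(q \,\|\, p)$ (an assumption, since the excerpt does not state the direction), and the integral and the minimisation are done by brute force on a grid.

```python
import math


def gaussian_pdf(t, mean, std):
    """Density of a univariate Gaussian."""
    z = (t - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def target_pdf(t):
    """Hypothetical 'true' posterior: a skewed two-component mixture."""
    return 0.7 * gaussian_pdf(t, 0.0, 1.0) + 0.3 * gaussian_pdf(t, 2.0, 0.5)


def kl_divergence(q_pdf, p_pdf, lo=-10.0, hi=10.0, n=400):
    """Midpoint-rule estimate of KL(q || p), the integral of q ln(q / p)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        q = q_pdf(t)
        if q > 0.0:  # points where q underflows to zero contribute nothing
            total += q * math.log(q / p_pdf(t)) * h
    return total


def fit_gaussian(p_pdf):
    """Coarse grid search for the Gaussian q that minimises KL(q || p)."""
    best = None
    for mean in [i * 0.2 for i in range(-5, 16)]:
        for std in [0.2 + j * 0.1 for j in range(15)]:
            kl = kl_divergence(lambda t: gaussian_pdf(t, mean, std), p_pdf)
            if best is None or kl < best[0]:
                best = (kl, mean, std)
    return best


best_kl, best_mean, best_std = fit_gaussian(target_pdf)
```

In a practical setting the approximation would be chosen so that the divergence has a closed form rather than requiring numerical integration; the grid version is used here only to keep the example self-contained.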

| Author | Title | Title URL | Year |
| --- | --- | --- | --- |
| Harri Valpola | Bayesian Ensemble Learning for Nonlinear Factor Analysis | http://www.cis.hut.fi/harri/thesis/valpola_thesis/ | 2000 |