2006 AFastLearningAlgorithmforDeepBeliefNets

From GM-RKB

Subject Headings: Deep Belief Networks; Deep Neural Network Training; Directed Belief Networks.

Notes

Cited By

Quotes

Abstract

We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

1 Introduction

Learning is difficult in densely-connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and makes the learning time scale poorly as the number of parameters increases.

We describe a model in which the top two hidden layers form an undirected associative memory (see figure 1) and the remaining hidden layers form a directed acyclic graph that converts the representations in the associative memory into observable variables such as the pixels of an image. This hybrid model has some attractive features:

  1. There is a fast, greedy learning algorithm that can find a fairly good set of parameters quickly, even in deep networks with millions of parameters and many hidden layers.
  2. The learning algorithm is unsupervised but can be applied to labeled data by learning a model that generates both the label and the data.
  3. There is a fine-tuning algorithm that learns an excellent generative model which outperforms discriminative methods on the MNIST database of hand-written digits.
  4. The generative model makes it easy to interpret the distributed representations in the deep hidden layers.
  5. The inference required for forming a percept is both fast and accurate.
  6. The learning algorithm is local: adjustments to a synapse strength depend only on the states of the pre-synaptic and post-synaptic neuron.
  7. The communication is simple: neurons only need to communicate their stochastic binary states.
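
To make the hybrid structure above concrete, here is a minimal, purely illustrative container for such a network in Python: each pair of adjacent lower layers has separate top-down generative and bottom-up recognition weights, while the top two layers share a single symmetric weight matrix that plays the role of the undirected associative memory. All names, fields, and shapes are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class DirectedLayer:
    """Directed connections between one pair of adjacent lower layers."""
    generative_W: np.ndarray   # top-down weights, shape (n_above, n_below)
    recognition_W: np.ndarray  # bottom-up weights, shape (n_below, n_above)
    gen_bias: np.ndarray       # biases of the lower layer (used when generating)
    rec_bias: np.ndarray       # biases of the upper layer (used when inferring)

@dataclass
class DeepBeliefNet:
    """Hybrid model: directed layers below, an undirected RBM on top
    (the associative memory formed by the top two layers)."""
    directed: List[DirectedLayer] = field(default_factory=list)
    top_W: Optional[np.ndarray] = None         # symmetric top-level weights
    top_vis_bias: Optional[np.ndarray] = None  # penultimate-layer biases
    top_hid_bias: Optional[np.ndarray] = None  # topmost-layer biases
```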

Section 2 introduces the idea of a “complementary” prior which exactly cancels the “explaining away” phenomenon that makes inference difficult in directed models. An example of a directed belief network with complementary priors is presented. Section 3 shows the equivalence between restricted Boltzmann machines and infinite directed networks with tied weights.

Section 4 introduces a fast, greedy learning algorithm for constructing multi-layer directed networks one layer at a time. Using a variational bound it shows that as each new layer is added, the overall generative model improves. The greedy algorithm bears some resemblance to boosting in its repeated use of the same “weak” learner, but instead of reweighting each data-vector to ensure that the next step learns something new, it re-represents it. The “weak” learner that is used to construct deep directed nets is itself an undirected graphical model.
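
The greedy scheme can be sketched roughly as follows, under the assumption that each layer is trained as a restricted Boltzmann machine with one-step contrastive divergence (CD-1); the learning rate, number of epochs, and layer sizes below are illustrative choices, not the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.05, rng=None):
    """Train one RBM with one-step contrastive divergence (CD-1)."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_vis)           # one reconstruction step
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        n = v0.shape[0]
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

def greedy_layerwise(data, hidden_sizes, rng=None):
    """Train a stack of RBMs: learn a layer, then re-represent the data
    with that layer's hidden probabilities and train the next layer on it."""
    rng = rng or np.random.default_rng(0)
    layers, rep = [], data
    for n_hidden in hidden_sizes:
        W, b_vis, b_hid = train_rbm_cd1(rep, n_hidden, rng=rng)
        layers.append((W, b_vis, b_hid))
        rep = sigmoid(rep @ W + b_hid)   # re-representation, not re-weighting
    return layers

# Hypothetical usage on binarised images flattened to vectors:
# layers = greedy_layerwise(binary_images, hidden_sizes=[500, 500, 2000])
```

In the paper's construction, the weight matrix learned for each layer initially serves both as the bottom-up recognition weights and, transposed, as the top-down generative weights; the fine-tuning stage described next progressively unties these roles.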

Section 5 shows how the weights produced by the fast greedy algorithm can be fine-tuned using the “up-down” algorithm. This is a contrastive version of the wake-sleep algorithm (Hinton et al., 1995) that does not suffer from the “mode-averaging” problems that can cause the wake-sleep algorithm to learn poor recognition weights.
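
The sketch below gives a much-simplified picture of one up-down pass, reusing the hypothetical DeepBeliefNet container and sigmoid helper from the earlier sketches. It is not the paper's full procedure: label units at the top level, the paper's schedule of Gibbs iterations in the associative memory, and mini-batch handling are omitted, and the down-pass here starts from the up-pass top state rather than from the state reached by the top-level Gibbs chain.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def up_down_pass(dbn, v_data, lr=0.01, rng=None):
    """One simplified up-down pass over a DeepBeliefNet (see earlier sketch)."""
    rng = rng or np.random.default_rng(0)

    # Up-pass: sample each layer bottom-up with the recognition weights.
    states = [v_data]
    for layer in dbn.directed:
        p = sigmoid(states[-1] @ layer.recognition_W + layer.rec_bias)
        states.append((rng.random(p.shape) < p).astype(float))

    # "Wake"-style updates: make the generative weights better at
    # reconstructing each layer from the sampled layer above it.
    for layer, below, above in zip(dbn.directed, states[:-1], states[1:]):
        p_below = sigmoid(above @ layer.generative_W + layer.gen_bias)
        layer.generative_W += lr * above.T @ (below - p_below) / below.shape[0]
        layer.gen_bias += lr * (below - p_below).mean(axis=0)

    # One step of contrastive divergence in the top-level associative memory,
    # treating the top-most sampled state as its visible layer.
    top = states[-1]
    p_h = sigmoid(top @ dbn.top_W + dbn.top_hid_bias)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_top = sigmoid(h @ dbn.top_W.T + dbn.top_vis_bias)
    p_h2 = sigmoid(p_top @ dbn.top_W + dbn.top_hid_bias)
    dbn.top_W += lr * (top.T @ p_h - p_top.T @ p_h2) / top.shape[0]

    # Down-pass: generate top-down with the generative weights and make
    # "sleep"-style updates to the recognition weights.
    down = states[-1]
    for layer in reversed(dbn.directed):
        p_below = sigmoid(down @ layer.generative_W + layer.gen_bias)
        below = (rng.random(p_below.shape) < p_below).astype(float)
        q_above = sigmoid(below @ layer.recognition_W + layer.rec_bias)
        layer.recognition_W += lr * below.T @ (down - q_above) / below.shape[0]
        layer.rec_bias += lr * (down - q_above).mean(axis=0)
        down = below
```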

Section 6 shows the pattern recognition performance of a network with three hidden layers and about 1.7 million weights on the MNIST set of handwritten digits. When no knowledge of geometry is provided and there is no special preprocessing, the generalization performance of the network is 1.25% errors on the 10,000 digit official test set. This beats the 1.5% achieved by the best back-propagation nets when they are not hand-crafted for this particular application. It is also slightly better than the 1.4% errors reported by Decoste and Schoelkopf (2002) for support vector machines on the same task.

Finally, section 7 shows what happens in the mind of the network when it is running without being constrained by visual input. The network has a full generative model, so it is easy to look into its mind – we simply generate an image from its high-level representations.
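
That kind of top-down generation can be sketched in the same hypothetical setting as before: run alternating Gibbs sampling between the top two layers for a while, then make one ancestral pass down through the directed generative connections to obtain pixel probabilities. The number of Gibbs steps and the random initialization below are arbitrary, and clamping a label unit at the top (as the paper does when asking the network to imagine a particular digit class) is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_image(dbn, n_gibbs=200, rng=None):
    """Sample from the top-level associative memory, then generate downward
    (uses the hypothetical DeepBeliefNet container from the earlier sketch)."""
    rng = rng or np.random.default_rng(0)

    # Alternating Gibbs sampling between the top two layers (the RBM).
    v = (rng.random((1, dbn.top_W.shape[0])) < 0.5).astype(float)
    for _ in range(n_gibbs):
        p_h = sigmoid(v @ dbn.top_W + dbn.top_hid_bias)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ dbn.top_W.T + dbn.top_vis_bias)
        v = (rng.random(p_v.shape) < p_v).astype(float)

    # Single top-down pass through the directed generative connections.
    down = v
    for layer in reversed(dbn.directed):
        p_below = sigmoid(down @ layer.generative_W + layer.gen_bias)
        down = (rng.random(p_below.shape) < p_below).astype(float)
    return p_below   # probabilities for the bottom (pixel) layer
```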

Throughout the paper, we will consider nets composed of stochastic binary variables, but the ideas can be generalized to other models in which the log probability of a variable is an additive function of the states of its directly-connected neighbours (see Appendix A for details).

2 Complementary Priors

The phenomenon of explaining away (illustrated in figure 2) makes inference difficult in directed belief nets. In densely connected networks, the posterior distribution over the hidden variables is intractable except in a few special cases such as mixture models or linear models with additive Gaussian noise. Markov Chain Monte Carlo methods (Neal, 1992) can be used to sample from the posterior, but they are typically very time consuming. Variational methods (Neal and Hinton, 1998) approximate the true posterior with a more tractable distribution and they can be used to improve a lower bound on the log probability of the training data. It is comforting that learning is guaranteed to improve a variational bound even when the inference of the hidden states is done incorrectly, but it would be much better to find a way of eliminating explaining away altogether, even in models whose hidden variables have highly correlated effects on the visible variables. It is widely assumed that this is impossible.
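
To make explaining away concrete, here is a small numeric illustration; the numbers are hypothetical, not those of the paper's figure 2. Two rare, independent binary causes drive one visible effect through a logistic unit of the kind defined by equation (1) below, and conditioning on the effect makes the causes anti-correlated in the posterior.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two rare, independent binary causes h1, h2 drive one visible effect v.
# Hypothetical parameters, chosen so either cause alone can turn v on.
b_h = np.array([-3.0, -3.0])   # both causes are a priori unlikely
w = np.array([6.0, 6.0])       # weights from the causes to v
b_v = -3.0                     # bias of v

def joint(h, v):
    """Unnormalised p(h1, h2) * p(v | h1, h2) for binary h and v."""
    prior = np.prod(sigmoid(b_h) ** h * sigmoid(-b_h) ** (1 - h))
    p_v_on = sigmoid(b_v + w @ h)
    return prior * (p_v_on if v == 1 else 1.0 - p_v_on)

# Posterior over the four cause configurations given that v is on.
configs = [np.array([i, j]) for i in (0, 1) for j in (0, 1)]
post = np.array([joint(h, 1) for h in configs])
post /= post.sum()
for h, p in zip(configs, post):
    print(h, round(float(p), 3))
# p(h1=1, h2=1 | v=1) comes out far smaller than
# p(h1=1 | v=1) * p(h2=1 | v=1): accepting one cause "explains away" the other.
```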

A logistic belief net (Neal, 1992) is composed of stochastic binary units. When the net is used to generate data, the probability of turning on unit i is a logistic function of the states of its immediate ancestors, j, and of the weights, [math]\displaystyle{ w_{ij} }[/math], on the directed connections from the ancestors: [math]\displaystyle{ p(s_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_j s_j w_{ij})} \qquad (1) }[/math] where [math]\displaystyle{ b_i }[/math] is the bias of unit i.

If a logistic belief net only has one hidden layer, the prior distribution over the hidden variables is factorial because their binary states are chosen independently when the model is used to generate data. The non-independence in the posterior distribution is created by the likelihood term coming from the data. Perhaps we could eliminate explaining away in the first hidden layer by using extra hidden layers to create a “complementary” prior that has exactly the opposite correlations to those in the likelihood term. Then, when the likelihood term is multiplied by the prior, we will get a posterior that is exactly factorial. It is not at all obvious that complementary priors exist, but figure 3 shows a simple example of an infinite logistic belief net with tied weights in which the priors are complementary at every hidden layer (see Appendix A for a more general treatment of the conditions under which complementary priors exist). The use of tied weights to construct complementary priors may seem like a mere trick for making directed models equivalent to undirected ones. As we shall see, however, it leads to a novel and very efficient learning algorithm that works by progressively untying the weights in each layer from the weights in higher layers.
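
As a direct reading of equation (1), the following minimal sketch (NumPy, with illustrative shapes and names) samples the binary state of each unit from the logistic of its bias plus the weighted input from its ancestors.

```python
import numpy as np

def sample_units(ancestor_states, W, biases, rng):
    """Sample binary states according to equation (1):
    p(s_i = 1) = 1 / (1 + exp(-b_i - sum_j s_j * w_ij))."""
    activation = biases + ancestor_states @ W        # b_i + sum_j s_j w_ij
    p_on = 1.0 / (1.0 + np.exp(-activation))
    return (rng.random(p_on.shape) < p_on).astype(float)

# Hypothetical shapes: 20 ancestor units j feeding 10 units i.
rng = np.random.default_rng(0)
s_ancestors = rng.integers(0, 2, size=(1, 20)).astype(float)
W = 0.1 * rng.standard_normal((20, 10))   # entry [j, i] is w_ij (ancestor j -> unit i)
b = np.zeros(10)                          # biases b_i
print(sample_units(s_ancestors, W, b, rng))
```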

References

(Hinton et al., 2006) ⇒ Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. (2006). "A Fast Learning Algorithm for Deep Belief Nets." In: Neural Computation, 18(7). doi:10.1162/neco.2006.18.7.1527