2007 LearningMultipleLayersofRepresentation


Subject Headings: Higher-Order Predictor Feature, Representation Learning, Automated Feature Extraction.

Notes

Cited By

Quotes

Abstract

To achieve its impressive performance at tasks such as speech or object recognition, the brain extracts multiple levels of representation from the sensory input. Backpropagation was the first computationally efficient model of how neural networks could learn multiple layers of representation, but it required labeled training data and it did not work well in deep networks. The limitations of backpropagation learning can now be overcome by using multi-layer neural networks that contain top-down connections and training them to generate sensory data rather than to classify it. Learning multilayer generative models appears to be difficult, but a recent discovery makes it easy to learn non-linear, distributed representations one layer at a time. The multiple layers of representation learned in this way can subsequently be fine-tuned to produce generative or discriminative models that work much better than previous approaches.

Learning feature detectors

To enable the perceptual system to make the fine distinctions that are required to control behaviour, sensory cortex needs an efficient way of adapting the synaptic weights of multiple layers of feature-detecting neurons. The backpropagation learning procedure [1] iteratively adjusts all of the weights to optimize some measure of the network's classification performance, but this requires labeled training data. To learn multiple layers of feature detectors when labeled data is scarce or non-existent, some objective other than classification is required. In a neural network that contains both bottom-up "recognition" connections and top-down "generative" connections it is possible to recognize data using a bottom-up pass and to generate data using a top-down pass. If the neurons are stochastic, repeated top-down passes will generate a whole distribution of data vectors. This suggests a sensible objective for learning: adjust the weights on the top-down connections to maximize the probability that the network would generate the training data. The neural network's model of the training data then resides in its top-down connections. The role of the bottom-up connections is to allow the network to figure out activations for the features in each layer that constitute a plausible explanation of how the network could have generated an observed sensory data-vector. The hope is that the active features in the higher layers will be a much better guide to appropriate actions than the raw sensory data or the lower-level features. As we shall see, this is not just wishful thinking: if three layers of feature detectors are trained on unlabeled images of handwritten digits, the complicated non-linear features in the top layer allow excellent recognition of poorly written digits like those in figure 4b [2].
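To make the top-down generative pass concrete, here is a minimal Python sketch (not from the review): a small network of stochastic binary neurons with made-up layer sizes and random, untrained top-down weights. Each call produces a different data-vector, illustrating how repeated top-down passes generate a whole distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    """Stochastic binary neurons: sample 0/1 states from probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical layer sizes: top features -> mid-level features -> visible pixels.
n_top, n_mid, n_vis = 20, 50, 784

# Top-down generative weights; learning would adjust these to make
# the training data more probable under the model.
W_top = rng.normal(0.0, 0.1, (n_top, n_mid))
W_mid = rng.normal(0.0, 0.1, (n_mid, n_vis))

def top_down_generate():
    """One top-down pass through the generative connections."""
    top = sample_bernoulli(np.full(n_top, 0.5))      # sample top-level features
    mid = sample_bernoulli(sigmoid(top @ W_top))     # sample mid-level features
    return sigmoid(mid @ W_mid)                      # visible probabilities

# Because the neurons are stochastic, repeated passes yield different vectors.
samples = np.stack([top_down_generate() for _ in range(5)])
```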

There are several reasons for believing that our visual systems contain multi-layer generative models in which top-down connections can be used to generate low-level features of images from high-level representations, and bottom-up connections can be used to infer the high-level representations that would have generated an observed set of low-level features. Single cell recordings [3] and the reciprocal connectivity between cortical areas [4] both suggest a hierarchy of progressively more complex features in which each layer can influence the layer below. Vivid visual imagery, dreaming, and the disambiguating effect of context on the interpretation of local image regions [5] also suggest that the visual system can perform top-down generation.

The aim of this review is to complement the neural and psychological evidence for generative models by reviewing recent computational advances that make multi-layer generative models easier to learn and better at discrimination than their feed-forward counterparts. The advances are illustrated in the domain of hand-written digits where they learn unsurpassed generative and discriminative models.

Inference in generative models

The crucial computational step in fitting a generative model to data is figuring out how the model, with its current generative parameters, might have used its hidden variables to generate an observed data-vector. Stochastic generative models generally have many different ways of generating any particular data-vector, so the best we can hope for is to infer a probability distribution over the various possible settings of the hidden variables. Consider, for example, a mixture of Gaussians model in which each data-vector is assumed to come from exactly one of the multivariate Gaussian distributions in the mixture. Inference then consists of computing the posterior probability that a particular data-vector came from each of the Gaussians. This is easy because the posterior probability assigned to each Gaussian in the mixture is simply proportional to the probability density of the data-vector under that Gaussian times the prior probability of using that Gaussian when generating data.
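As a concrete illustration of this inference step (a sketch, not code from the review; all numbers are made up), the following Python computes the posterior responsibility of each Gaussian for a data-vector by multiplying each component's density by its prior and normalizing.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_responsibilities(x, means, covs, priors):
    """Posterior p(component k | x) is proportional to prior_k * N(x; mean_k, cov_k)."""
    densities = np.array([multivariate_normal.pdf(x, m, c)
                          for m, c in zip(means, covs)])
    unnormalized = priors * densities
    return unnormalized / unnormalized.sum()

# Illustrative two-component mixture in 2-D.
means  = [np.zeros(2), np.array([3.0, 3.0])]
covs   = [np.eye(2), np.eye(2)]
priors = np.array([0.7, 0.3])
print(posterior_responsibilities(np.array([2.5, 2.8]), means, covs, priors))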

The generative models that are most familiar in statistics and machine learning are the ones for which the posterior distribution can be inferred efficiently and exactly because the model has been strongly constrained. These generative models include:

  • Factor Analysis, in which there is a single layer of Gaussian hidden variables that have linear effects on the visible variables (see figure 1). In addition, independent Gaussian noise is added to each visible variable [6, 7, 8]. Given a visible vector, it is impossible to infer the exact state of the factors that generated it, but it is easy to infer the mean and covariance of the Gaussian posterior distribution over the factors, and this is sufficient to allow the parameters of the model to be improved (see the sketch after this list).
  • Independent Components Analysis, which generalizes factor analysis by allowing non-Gaussian hidden variables, but maintains tractable inference by eliminating the observation noise in the visible variables and using the same number of hidden and visible variables. These restrictions ensure that the posterior distribution collapses to a single point, because there is only one setting of the hidden variables that can generate each visible vector exactly [9, 10, 11].
  • Mixture models in which each data-vector is assumed to be generated by one of the component distributions in the mixture and it is easy to compute the density under each of the component distributions.
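For the factor analysis case above, the posterior over the factors can be written down in closed form. The sketch below assumes the standard parameterization v = Wz + noise, with unit-Gaussian factors z and diagonal noise covariance; these symbols are mine, not the review's.

```python
import numpy as np

def factor_posterior(v, W, psi_diag):
    """Exact Gaussian posterior over factors z for the model
    v = W @ z + noise,  z ~ N(0, I),  noise ~ N(0, diag(psi_diag)).
    Returns the posterior mean and covariance, which suffice to improve
    the model's parameters (e.g. via EM)."""
    psi_inv = np.diag(1.0 / psi_diag)
    cov = np.linalg.inv(np.eye(W.shape[1]) + W.T @ psi_inv @ W)  # posterior covariance
    mean = cov @ W.T @ psi_inv @ v                               # posterior mean
    return mean, cov

# Made-up example: 5 visible variables explained by 2 factors.
rng = np.random.default_rng(1)
W = rng.normal(0.0, 1.0, (5, 2))
psi = np.full(5, 0.5)
v = rng.normal(0.0, 1.0, 5)
mean, cov = factor_posterior(v, W, psi)
```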

If factor analysis is generalized to allow non-Gaussian hidden variables, it can model the development of low-level visual receptive fields [12]. However, if the extra constraints used in independent components analysis are not imposed, it is no longer easy to infer, or even to represent, the posterior distribution over the hidden variables. This is because of a phenomenon known as explaining away [13] (see figure 2b). …
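A small numeric sketch (with made-up probabilities, not taken from the review) shows what explaining away does to the posterior: two binary causes that are independent a priori become anti-correlated once a shared effect is observed, so the posterior no longer factorizes and is hard to represent compactly.

```python
import itertools

# Two independent binary causes with small prior probabilities.
p_a, p_b = 0.1, 0.1

def p_effect(a, b):
    """Noisy-OR likelihood of the observed effect given the two causes."""
    return 1.0 - (1.0 - 0.9 * a) * (1.0 - 0.9 * b)

# Joint posterior over (a, b) given that the effect occurred.
posterior = {}
for a, b in itertools.product([0, 1], repeat=2):
    prior = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
    posterior[(a, b)] = prior * p_effect(a, b)
total = sum(posterior.values())
posterior = {k: v / total for k, v in posterior.items()}
print(posterior)
```

Running this gives p(a=1, b=1 | effect) of about 0.06, far below the product of the two posterior marginals (about 0.28): learning that one cause is present "explains away" the other, even though the causes are independent in the prior.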

References

Geoffrey E. Hinton (2007). "Learning Multiple Layers of Representation."