Convolutional-Pooling Neural Network (CNN/ConvNet) Model

Jump to: navigation, search

A Convolutional-Pooling Neural Network (CNN/ConvNet) Model is a multi-layer feed-forward neural network that includes convolutional layers and pooling layers.





  1. Input layer
  2. Feature-extraction (learning) layers
  3. Classification layers
The input layer accepts three-dimensional input generally in the form spatially of the size (width × height) of the image and has a depth representing the color channels (generally three for RGB color channels).

The feature-extraction layers have a general repeating pattern of the sequence:

  1. Convolution layer

    We express the Rectified Linear Unit (ReLU) activation function as a layer in the diagram here to match up to other literature.

  2. Pooling layer
These layers find a number of features in the images and progressively construct higher-order features. This corresponds directly to the ongoing theme in deep learning by which features are automatically learned as opposed to traditionally hand engineered.

Finally we have the classification layers in which we have one or more fully connected layers to take the higher-order features and produce class probabilities or scores. These layers are fully connected to all of the neurons in the previous layer, as their name implies. The output of these layers produces typically a two-dimensional output of the dimensions [b × N], where b is the number of examples in the mini-batch and N is the number of classes we’re interested in scoring.

Figure 4-9. High-level general CNN architecture



  • (VistaLab, 2013) ⇒ (2013) An Introduction to Convolutional Neural Networks. In: VISTA LabTeaching WIKI
    • QUOTE: The solution to FFNNs' problems with image processing took inspiration from neurobiology, Yann LeCun and Toshua Bengio tried to capture the organization of neurons in the visual cortex of the cat, which at that time was known to consist of maps of local receptive fields that decreased in granularity as the cortex moved anteriorly. There are several different theory about how to precisely define such a model, but all of the various implementations can be loosely described as involving the following process:
      • Convolve several small filters on the input image
      • Subsample this space of filter activations
      • Repeat steps 1 and 2 until your left with sufficiently high level features.
      • Use a standard a standard FFNN to solve a particular task, using the results features as input.
The LeCun Formulation

There are several different ways one might formalize the high level process described above, but the most common is LeCun's implementation, the LeNet] (...)

The complete implementation of the LeNet.


  • (DeepLearning Tutorial, 2013) ⇒ Theano Development Team (2008–2013). Convolutional Neural Networks (LeNet) In: DeepLearning 0.1 documentation
    • QUOTE: Convolutional Neural Networks (CNN) are variants of MLPs which are inspired from biology. From Hubel and Wiesel’s early work on the cat’s visual cortex [Hubel68], we know there exists a complex arrangement of cells within the visual cortex. These cells are sensitive to small sub-regions of the input space, called a receptive field, and are tiled in such a way as to cover the entire visual field. These filters are local in input space and are thus better suited to exploit the strong spatially local correlation present in natural images.

      Additionally, two basic cell types have been identified: simple cells (S) and complex cells (C). Simple cells (S) respond maximally to specific edge-like stimulus patterns within their receptive field. Complex cells (C) have larger receptive fields and are locally invariant to the exact position of the stimulus.

      The visual cortex being the most powerful “vision” system in existence, it seems natural to emulate its behavior. Many such neurally inspired models can be found in the literature. To name a few: the NeoCognitron Fukushima, HMAX Serre07 and LeNet-5 LeCun98, which will be the focus of this tutorial.



  • (Serre et al.,2007) ⇒ Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., & Poggio, T. (2007). "Robust object recognition with cortex-like mechanisms". IEEE transactions on pattern analysis and machine intelligence, 29(3), 411-426. DOI: 10.1109/TPAMI.2007.56
    • ABSTRACT: We introduce a new general framework for the recognition of complex visual scenes, which is motivated by biology: We describe a hierarchical system that closely follows the organization of visual cortex and builds an increasingly complex and invariant feature representation by alternating between a template matching and a maximum pooling operation. We demonstrate the strength of the approach on a range of recognition tasks: From invariant single object recognition in clutter to multiclass categorization problems and complex scene understanding tasks that rely on the recognition of both shape-based as well as texture-based objects. Given the biological constraints that the system had to satisfy, the approach performs surprisingly well: It has the capability of learning from only a few training examples and competes with state-of-the-art systems. We also discuss the existence of a universal, redundant dictionary of features that could handle the recognition of most object categories. In addition to its relevance for computer vision, the success of this approach suggests a plausibility proof for a class of feedforward models of object recognition in cortex





  1. LeCun, Yann. "LeNet-5, convolutional neural networks". Retrieved 16 November 2013.
  2. Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of annual conference of the Japan Society of Applied Physics.
  3. Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
  4. Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks. 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. Retrieved 17 November 2013.
  5. Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Handwritten digit recognition with a backpropagation neural network.” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 396–404.
  6. D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex.” J. Physiol., vol. 160, pp. 106–154, 1962.
  7. Y. Le Cun and Y. Bengio, “Convolutional networks for images, speech, and time series.” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, MA: MIT Press, 1995, pp. 255–258.