Maxout Activation Function
A Maxout Activation Function is a neuron activation function that is based on the mathematical function: [math]\displaystyle{ f_i(x) = \max_j \left(W^T_{ij} x + b_{ij}\right) }[/math].
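The following is a minimal NumPy sketch of this definition (the function name maxout and the parameter shapes are illustrative, not taken from any library): each maxout unit i computes pool_size affine pre-activations and keeps the largest.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation f_i(x) = max_j (W[i, j] . x + b[i, j]).

    W: array of shape (num_units, pool_size, in_dim)
    b: array of shape (num_units, pool_size)
    Returns one value per maxout unit, shape (num_units,).
    """
    z = np.einsum('ijd,d->ij', W, x) + b  # affine pieces, shape (num_units, pool_size)
    return z.max(axis=1)                  # keep the largest piece per unit

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 5))  # 4 units, pool size 3, 5-dimensional input
b = rng.normal(size=(4, 3))
x = rng.normal(size=5)
print(maxout(x, W, b).shape)    # (4,)
```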
- Context:
- It can (typically) be used in the activation of Maxout Neurons.
- Example(s):
- chainer.functions.maxout, Chainer's implementation,
- …
- Counter-Example(s):
- a Softmax-based Activation Function,
- a Rectified-based Activation Function,
- a Heaviside Step Activation Function,
- a Ramp Function-based Activation Function,
- a Logistic Sigmoid-based Activation Function,
- a Hyperbolic Tangent-based Activation Function,
- a Gaussian-based Activation Function,
- a Softsign Activation Function,
- a Softshrink Activation Function,
- an Adaptive Piecewise Linear Activation Function,
- a Bent Identity Activation Function,
- a Long Short-Term Memory Unit-based Activation Function,
- an Inverse Square Root Unit-based Activation Function,
- a SoftExponential Activation Function,
- a Sinusoid-based Activation Function.
- See: Artificial Neural Network, Artificial Neuron, Neural Network Topology, Neural Network Layer, Neural Network Learning Rate.
References
2018a
- (Chainer, 2018) ⇒ http://docs.chainer.org/en/stable/reference/generated/chainer.functions.maxout.html Retrieved: 2018-2-18
- QUOTE:
chainer.functions.maxout(x, pool_size, axis=1)
Maxout activation function.
It accepts an input tensor x, reshapes the axis dimension (say the size being M * pool_size) into the two dimensions (M, pool_size), and takes the maximum along the axis dimension.
Parameters:
- x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. An n-dimensional (n ≥ axis) float array. In general, its first dimension is assumed to be the minibatch dimension. The other dimensions are treated as one concatenated dimension.
- pool_size (int) – The size used for downsampling of pooling layer.
- axis (int) – The axis dimension to be reshaped. The size of the axis dimension should be M * pool_size.
- Returns: Output variable. The shape of the output is the same as x except that the axis dimension is transformed from M * pool_size to M.
- Return type: Variable.
- Example:
Typically, x is the output of a linear layer or a convolution layer. The following is an example where we use maxout() in combination with a Linear link.
```python
>>> in_size, out_size, pool_size = 10, 10, 10
>>> bias = np.arange(out_size * pool_size).astype('f')
>>> l = L.Linear(in_size, out_size * pool_size, initial_bias=bias)
>>> x = np.zeros((1, in_size), 'f')  # prepare data
>>> x = l(x)
>>> y = F.maxout(x, pool_size)
>>> x.shape
(1, 100)
>>> y.shape
(1, 10)
>>> x.reshape((out_size, pool_size)).data
array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14., 15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24., 25., 26., 27., 28., 29.],
       [30., 31., 32., 33., 34., 35., 36., 37., 38., 39.],
       [40., 41., 42., 43., 44., 45., 46., 47., 48., 49.],
       [50., 51., 52., 53., 54., 55., 56., 57., 58., 59.],
       [60., 61., 62., 63., 64., 65., 66., 67., 68., 69.],
       [70., 71., 72., 73., 74., 75., 76., 77., 78., 79.],
       [80., 81., 82., 83., 84., 85., 86., 87., 88., 89.],
       [90., 91., 92., 93., 94., 95., 96., 97., 98., 99.]], dtype=float32)
>>> y.data
array([[ 9., 19., 29., 39., 49., 59., 69., 79., 89., 99.]], dtype=float32)
```
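Because the reshape-and-max semantics described above are easy to misread, here is a plain-NumPy sketch that reproduces the doctest output (the helper name maxout_numpy is hypothetical, not part of Chainer):

```python
import numpy as np

def maxout_numpy(x, pool_size, axis=1):
    """Split the `axis` dimension of size M * pool_size into (M, pool_size)
    and take the maximum over the pool dimension, as the Chainer docs
    describe. Illustrative sketch, not Chainer's actual implementation."""
    M = x.shape[axis] // pool_size
    new_shape = x.shape[:axis] + (M, pool_size) + x.shape[axis + 1:]
    return x.reshape(new_shape).max(axis=axis + 1)

x = np.arange(100, dtype=np.float32).reshape(1, 100)
y = maxout_numpy(x, pool_size=10)
print(y)  # [[ 9. 19. 29. 39. 49. 59. 69. 79. 89. 99.]]
```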
2018b
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions Retrieved: 2018-2-18.
- The following table lists activation functions that are not functions of a single fold [math]\displaystyle{ x }[/math] from the previous layer or layers:
Name | Equation | Derivatives | Range | Order of continuity |
---|---|---|---|---|
Softmax | [math]\displaystyle{ f_i(\vec{x}) = \frac{e^{x_i}}{\sum_{j=1}^J e^{x_j}} }[/math] for i = 1, …, J | [math]\displaystyle{ \frac{\partial f_i(\vec{x})}{\partial x_j} = f_i(\vec{x})(\delta_{ij} - f_j(\vec{x})) }[/math] | [math]\displaystyle{ (0,1) }[/math] | [math]\displaystyle{ C^\infty }[/math] |
Maxout[1] | [math]\displaystyle{ f(\vec{x}) = \max_i x_i }[/math] | [math]\displaystyle{ \frac{\partial f}{\partial x_j} = \begin{cases} 1 & \text{for } j = \underset{i}{\operatorname{argmax}} \, x_i\\ 0 & \text{for } j \ne\underset{i}{\operatorname{argmax}} \, x_i\end{cases} }[/math] | [math]\displaystyle{ (-\infty,\infty) }[/math] | [math]\displaystyle{ C^0 }[/math] |
Here, δ is the Kronecker delta.
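As a numeric illustration of the derivative row above (a hypothetical helper; it assumes ties are broken by the first maximizer, as numpy.argmax does):

```python
import numpy as np

def maxout_grad(x):
    """Gradient of f(x) = max_i x_i: a one-hot vector that is 1 at the
    argmax coordinate and 0 everywhere else, matching the table's cases."""
    g = np.zeros_like(x, dtype=float)
    g[np.argmax(x)] = 1.0
    return g

print(maxout_grad(np.array([0.2, 1.5, -0.3])))  # [0. 1. 0.]
```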
2018c
- (CS231n, 2018) ⇒ Commonly used activation functions. In: CS231n Convolutional Neural Networks for Visual Recognition Retrieved: 2018-01-28.
- QUOTE: Maxout. Other types of units have been proposed that do not have the functional form [math]\displaystyle{ f(w^Tx+b) }[/math] where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron (introduced recently by Goodfellow et al.) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function [math]\displaystyle{ \max(w^T_1x+b_1,\;w^T_2x+b_2) }[/math]. Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have [math]\displaystyle{ w_1,\;b_1=0 }[/math]). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
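A quick sanity check of that special-case claim (illustrative code, not from CS231n): with [math]\displaystyle{ w_1 = 0 }[/math] and [math]\displaystyle{ b_1 = 0 }[/math], the two-piece maxout neuron collapses to a ReLU.

```python
import numpy as np

def maxout2(x, w1, b1, w2, b2):
    """Two-piece maxout neuron: max(w1.x + b1, w2.x + b2)."""
    return max(w1 @ x + b1, w2 @ x + b2)

rng = np.random.default_rng(1)
w = rng.normal(size=5)
x = rng.normal(size=5)
zeros = np.zeros(5)

# With w1 = 0 and b1 = 0 (and b2 = 0) the maxout neuron equals ReLU(w . x).
assert maxout2(x, zeros, 0.0, w, 0.0) == max(0.0, w @ x)
```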
2013
- (Goodfellow et al., 2013) ⇒ Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. arXiv preprint arXiv:1302.4389.
- ABSTRACT: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
- ↑ Goodfellow, Ian J.; Warde-Farley, David; Mirza, Mehdi; Courville, Aaron; Bengio, Yoshua (2013-02-18). "Maxout Networks". JMLR WCP 28 (3): 1319–1327. arXiv:1302.4389. Bibcode 2013arXiv1302.4389G.