Maxout Activation Function
A Maxout Activation Function is a neuron activation function that is based on the mathematical function: [math]\displaystyle{ f_i(x) = \max_j \left(W^T_{ij} x + b_{ij}\right) }[/math].
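The following is a minimal NumPy sketch of this definition (the function name maxout and the parameter shapes are illustrative, not taken from any library): each maxout unit i computes pool_size affine pre-activations and keeps the largest.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation f_i(x) = max_j (W[i, j] . x + b[i, j]).

    W: array of shape (num_units, pool_size, in_dim)
    b: array of shape (num_units, pool_size)
    Returns one value per maxout unit, shape (num_units,).
    """
    z = np.einsum('ijd,d->ij', W, x) + b  # affine pieces, shape (num_units, pool_size)
    return z.max(axis=1)                  # keep the largest piece per unit

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 5))  # 4 units, pool size 3, 5-dimensional input
b = rng.normal(size=(4, 3))
x = rng.normal(size=5)
print(maxout(x, W, b).shape)    # (4,)
```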
- Context:
- It can (typically) be used in the activation of Maxout Neurons.
- Example(s):
- chainer.functions.maxout, Chainer's implementation,
- …
- Counter-Example(s):
- a Softmax-based Activation Function,
- a Rectified-based Activation Function,
- a Heaviside Step Activation Function,
- a Ramp Function-based Activation Function,
- a Logistic Sigmoid-based Activation Function,
- a Hyperbolic Tangent-based Activation Function,
- a Gaussian-based Activation Function,
- a Softsign Activation Function,
- a Softshrink Activation Function,
- an Adaptive Piecewise Linear Activation Function,
- a Bent Identity Activation Function,
- a Long Short-Term Memory Unit-based Activation Function,
- an Inverse Square Root Unit-based Activation Function,
- a SoftExponential Activation Function,
- a Sinusoid-based Activation Function.
- See: Artificial Neural Network, Artificial Neuron, Neural Network Topology, Neural Network Layer, Neural Network Learning Rate.
References
2018a
- (Chainer, 2018) ⇒ http://docs.chainer.org/en/stable/reference/generated/chainer.functions.maxout.html Retrieved: 2018-2-18
- QUOTE:
chainer.functions.maxout(x, pool_size, axis=1)
Maxout activation function.
It accepts an input tensor x, reshapes the axis dimension (say the size being M * pool_size) into the two dimensions (M, pool_size), and takes the maximum along the axis dimension.
Parameters:
- x (Variable or numpy.ndarray or cupy.ndarray) – Input variable. An n-dimensional (n ≥ axis) float array. In general, its first dimension is assumed to be the minibatch dimension. The other dimensions are treated as one concatenated dimension.
- pool_size (int) – The size used for downsampling of pooling layer.
- axis (int) – The axis dimension to be reshaped. The size of the axis dimension should be M * pool_size.
- Returns: Output variable. The shape of the output is the same as x except that the axis dimension is transformed from M * pool_size to M.
- Return type: Variable.
- Example:
Typically, x is the output of a linear layer or a convolution layer. The following is an example where we use maxout() in combination with a Linear link.
```python
>>> in_size, out_size, pool_size = 10, 10, 10
>>> bias = np.arange(out_size * pool_size).astype('f')
>>> l = L.Linear(in_size, out_size * pool_size, initial_bias=bias)
>>> x = np.zeros((1, in_size), 'f')  # prepare data
>>> x = l(x)
>>> y = F.maxout(x, pool_size)
>>> x.shape
(1, 100)
>>> y.shape
(1, 10)
>>> x.reshape((out_size, pool_size)).data
array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14., 15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24., 25., 26., 27., 28., 29.],
       [30., 31., 32., 33., 34., 35., 36., 37., 38., 39.],
       [40., 41., 42., 43., 44., 45., 46., 47., 48., 49.],
       [50., 51., 52., 53., 54., 55., 56., 57., 58., 59.],
       [60., 61., 62., 63., 64., 65., 66., 67., 68., 69.],
       [70., 71., 72., 73., 74., 75., 76., 77., 78., 79.],
       [80., 81., 82., 83., 84., 85., 86., 87., 88., 89.],
       [90., 91., 92., 93., 94., 95., 96., 97., 98., 99.]], dtype=float32)
>>> y.data
array([[ 9., 19., 29., 39., 49., 59., 69., 79., 89., 99.]], dtype=float32)
```
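Because the reshape-and-max semantics described above are easy to misread, here is a plain-NumPy sketch that reproduces the doctest output (the helper name maxout_numpy is hypothetical, not part of Chainer):

```python
import numpy as np

def maxout_numpy(x, pool_size, axis=1):
    """Split the `axis` dimension of size M * pool_size into (M, pool_size)
    and take the maximum over the pool dimension, as the Chainer docs
    describe. Illustrative sketch, not Chainer's actual implementation."""
    M = x.shape[axis] // pool_size
    new_shape = x.shape[:axis] + (M, pool_size) + x.shape[axis + 1:]
    return x.reshape(new_shape).max(axis=axis + 1)

x = np.arange(100, dtype=np.float32).reshape(1, 100)
y = maxout_numpy(x, pool_size=10)
print(y)  # [[ 9. 19. 29. 39. 49. 59. 69. 79. 89. 99.]]
```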
2018b
- (Wikipedia, 2018) ⇒ https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions Retrieved: 2018-2-18.
- The following table lists activation functions that are not functions of a single fold [math]\displaystyle{ x }[/math] from the previous layer or layers:
Name | Equation | Derivatives | Range | Order of continuity |
---|---|---|---|---|
Softmax | [math]\displaystyle{ f_i(\vec{x}) = \frac{e^{x_i}}{\sum_{j=1}^J e^{x_j}} }[/math] for i = 1, …, J | [math]\displaystyle{ \frac{\partial f_i(\vec{x})}{\partial x_j} = f_i(\vec{x})(\delta_{ij} - f_j(\vec{x})) }[/math] | [math]\displaystyle{ (0,1) }[/math] | [math]\displaystyle{ C^\infty }[/math] |
Maxout[1] | [math]\displaystyle{ f(\vec{x}) = \max_i x_i }[/math] | [math]\displaystyle{ \frac{\partial f}{\partial x_j} = \begin{cases} 1 & \text{for } j = \underset{i}{\operatorname{argmax}} \, x_i\\ 0 & \text{for } j \ne\underset{i}{\operatorname{argmax}} \, x_i\end{cases} }[/math] | [math]\displaystyle{ (-\infty,\infty) }[/math] | [math]\displaystyle{ C^0 }[/math] |
Here, δ is the Kronecker delta.
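As a numeric illustration of the derivative row above (a hypothetical helper; it assumes ties are broken by the first maximizer, as numpy.argmax does):

```python
import numpy as np

def maxout_grad(x):
    """Gradient of f(x) = max_i x_i: a one-hot vector that is 1 at the
    argmax coordinate and 0 everywhere else, matching the table's cases."""
    g = np.zeros_like(x, dtype=float)
    g[np.argmax(x)] = 1.0
    return g

print(maxout_grad(np.array([0.2, 1.5, -0.3])))  # [0. 1. 0.]
```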
2018c
- (CS231n, 2018) ⇒ Commonly used activation functions. In: CS231n Convolutional Neural Networks for Visual Recognition Retrieved: 2018-01-28.
- QUOTE: Maxout. Other types of units have been proposed that do not have the functional form [math]\displaystyle{ f(w^Tx+b) }[/math] where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron (introduced recently by Goodfellow et al.) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function [math]\displaystyle{ \max(w^T_1x+b_1,\;w^T_2x+b_2) }[/math]. Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have [math]\displaystyle{ w_1,\;b_1=0 }[/math]). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
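A quick sanity check of that special-case claim (illustrative code, not from CS231n): with [math]\displaystyle{ w_1 = 0 }[/math] and [math]\displaystyle{ b_1 = 0 }[/math], the two-piece maxout neuron collapses to a ReLU.

```python
import numpy as np

def maxout2(x, w1, b1, w2, b2):
    """Two-piece maxout neuron: max(w1.x + b1, w2.x + b2)."""
    return max(w1 @ x + b1, w2 @ x + b2)

rng = np.random.default_rng(1)
w = rng.normal(size=5)
x = rng.normal(size=5)
zeros = np.zeros(5)

# With w1 = 0 and b1 = 0 (and b2 = 0) the maxout neuron equals ReLU(w . x).
assert maxout2(x, zeros, 0.0, w, 0.0) == max(0.0, w @ x)
```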
2013
- (Goodfellow et al., 2013) ⇒ Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. arXiv preprint arXiv:1302.4389.
- ABSTRACT: We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
- ↑ Goodfellow, Ian J.; Warde-Farley, David; Mirza, Mehdi; Courville, Aaron; Bengio, Yoshua (2013-02-18). "Maxout Networks". JMLR WCP 28 (3): 1319–1327. arXiv:1302.4389. Bibcode 2013arXiv1302.4389G.