Gated Linear Unit (GLU)


A Gated Linear Unit (GLU) is a neural network activation function that combines both linear and gating mechanisms to control the flow of information through the network.

  • See: [[Neural Network Activation Function]], [[Gating Mechanism]], [[Sigmoid Function]], [[Transformer Architecture]].


References

2023

  • chat
    • A Gated Linear Unit (GLU) is a type of neural network activation function that combines both linear and gating mechanisms to control the flow of information through the network. GLUs are particularly useful in deep learning models that involve sequential data, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). They were introduced in the paper "Language Modeling with Gated Convolutional Networks" by Yann Dauphin, Angela Fan, Michael Auli, and David Grangier in 2017.

      GLUs are composed of two parts: a linear transformation and a gating mechanism. The linear transformation can be any linear operation (e.g., convolution, matrix multiplication) applied to the input. The gating mechanism is a sigmoid function applied to another linear input transformation. The output of the GLU is the element-wise product of the linear transformation and the gating mechanism.

      GLUs are known for their ability to adaptively select the information passed through the network, which helps with gradient flow and learning long-range dependencies. This is particularly beneficial in tasks involving natural language processing and sequential data.
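The computation described above can be written as GLU(x) = (xW + b) ⊗ σ(xV + c), where ⊗ is the element-wise product and σ is the sigmoid gate. The following is a minimal NumPy sketch of that formula; the weight names W, V, b, c and the toy dimensions are illustrative assumptions, not taken from the original papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, b, V, c):
    """Gated Linear Unit: element-wise product of a linear projection
    and a sigmoid gate computed from a second linear projection."""
    linear = x @ W + b           # linear transformation of the input
    gate = sigmoid(x @ V + c)    # gating mechanism, values in (0, 1)
    return linear * gate         # element-wise product

# Toy usage: a batch of 4 inputs with 8 features, projected to 16 units.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W, V = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
b, c = np.zeros(16), np.zeros(16)
print(glu(x, W, b, V, c).shape)  # (4, 16)
```

Because the gate lies in (0, 1), the network can learn to pass some components of the linear projection through unchanged and suppress others, which is the adaptive selection behavior noted above.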

2020

  • (Shazeer, 2020) ⇒ Noam Shazeer. (2020). “GLU Variants Improve Transformer.” arXiv preprint arXiv:2002.05202
    • QUOTE: Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations. ...

      ... We have extended the GLU family of layers and proposed their use in Transformer. In a transfer-learning setup, the new variants seem to produce better perplexities for the de-noising objective used in pre-training, as well as better results on many downstream language-understanding tasks. These architectures are simple to implement, and have no apparent computational drawbacks. We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.
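Concretely, the GLU-based feed-forward sublayer studied in this paper has the form FFN_GLU(x, W, V, W2) = (σ(xW) ⊗ xV) W2, and the variants (such as ReGLU, GEGLU, and SwiGLU) replace the sigmoid with ReLU, GELU, or Swish. The sketch below illustrates this family; the layer dimensions and the bias-free form here are illustrative simplifications, not a faithful reproduction of the paper's experimental setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x * sigmoid(x)

def ffn_glu_variant(x, W, V, W2, activation=sigmoid):
    """GLU-style Transformer feed-forward sublayer:
    (activation(x W) ⊙ (x V)) W2.  Swapping `activation` gives the
    GLU / ReGLU / GEGLU / SwiGLU variants."""
    return (activation(x @ W) * (x @ V)) @ W2

# Toy usage: model dimension 16, hidden dimension 32 (illustrative sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
W, V = rng.standard_normal((16, 32)), rng.standard_normal((16, 32))
W2 = rng.standard_normal((32, 16))
print(ffn_glu_variant(x, W, V, W2, activation=swish).shape)  # SwiGLU -> (4, 16)
print(ffn_glu_variant(x, W, V, W2, activation=gelu).shape)   # GEGLU  -> (4, 16)
```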

2017

  • (Dauphin et al., 2017) ⇒ Yann Dauphin, Angela Fan, Michael Auli, and David Grangier. (2017). “Language Modeling with Gated Convolutional Networks.” arXiv preprint arXiv:1612.08083