2017 LanguageModelingwithGatedConvolutionalNetworks
- (Dauphin et al., 2017) ⇒ Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. (2017). “Language Modeling with Gated Convolutional Networks.” In: International Conference on Machine Learning (ICML-2017).
Subject Headings: Gated Linear Unit (GLU), Neural Language Modeling.
Notes
Cited By
2020
- (Shazeer, 2020) ⇒ Noam Shazeer. (2020). “GLU Variants Improve Transformer.” arXiv preprint arXiv:2002.05202.
- QUOTE: Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
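The quoted description reduces to a simple formula: a feed-forward sublayer of the form (activation(xW) ⊗ xV)W₂, where the original GLU uses sigmoid as the activation and the variants swap in other functions. Below is a minimal sketch in PyTorch, assuming that framework; the class name, dimension arguments (d_model, d_ff), and bias choices are illustrative assumptions, not code from either paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward sublayer with a GLU-style gate (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, activation=torch.sigmoid):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # projection passed through the activation
        self.v = nn.Linear(d_model, d_ff, bias=False)   # plain linear projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection back to d_model
        self.activation = activation                     # sigmoid gives the original GLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Component-wise product of two linear projections, one of which is
        # first passed through the chosen nonlinearity.
        return self.w2(self.activation(self.w(x)) * self.v(x))


# Variants described by Shazeer (2020) swap the activation, e.g. GELU (GEGLU)
# or SiLU/Swish (SwiGLU), in place of sigmoid:
geglu_ffn = GLUFeedForward(d_model=512, d_ff=2048, activation=F.gelu)
```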
Quotes
Abstract
The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens. We propose a novel simplified gating mechanism that outperforms Oord et al. (2016) and investigate the impact of key architectural decisions. The proposed approach achieves state-of-the-art on the WikiText-103 benchmark, even though it features long-term dependencies, as well as competitive results on the Google Billion Words benchmark. Our model reduces the latency to score a sentence by an order of magnitude compared to a recurrent baseline. To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
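The "simplified gating mechanism" of the abstract is the gated linear unit applied to a causal convolution: h(X) = (X∗W + b) ⊗ σ(X∗V + c). A minimal sketch, assuming PyTorch, is shown below; the module name, left-padding detail, and channel/kernel arguments are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedConvLayer(nn.Module):
    """One gated convolutional block: h(X) = (X*W + b) ⊗ sigmoid(X*V + c)."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1  # pad only the past so no future token leaks in
        self.linear = nn.Conv1d(channels, channels, kernel_size)  # X*W + b
        self.gate = nn.Conv1d(channels, channels, kernel_size)    # X*V + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, sequence_length)
        x = F.pad(x, (self.left_pad, 0))
        return self.linear(x) * torch.sigmoid(self.gate(x))
```

Stacking L such layers yields a finite receptive field of L·(k−1)+1 past tokens, which is the "finite context approach" the abstract contrasts with the unbounded context of recurrent models, while all positions in a sequence can be processed in parallel.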
...
6. Conclusion
We introduce a convolutional neural network for language modeling with a novel gating mechanism. Compared to recurrent neural networks, our approach builds a hierarchical representation of the input words that makes it easier to capture long-range dependencies, similar in spirit to the tree-structured analysis of linguistic grammar formalisms. The same property eases learning since features are passed through a fixed number of layers and non-linearities, unlike for recurrent networks where the number of processing steps differs depending on the position of the word in the input. The results show that our gated convolutional network achieves a new state of the art on WikiText-103. On the Google Billion Word benchmark, we show competitive results can be achieved with significantly fewer resources.
References
 | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
---|---|---|---|---|---|---|---|---|---|---
2017 LanguageModelingwithGatedConvolutionalNetworks | Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier | | 2017 | Language Modeling with Gated Convolutional Networks | | | | | | 2017 |