Attention Module


An Attention Module is a neural network module that uses an alignment score function to amplify some parts of the input data while diminishing other parts.



References

2022

  • (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Attention_(machine_learning) Retrieved:2022-4-24.
    • In neural networks, attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts — the thought being that the network should devote more focus to that small but important part of the data. Learning which part of the data is more important than others depends on the context and is trained by gradient descent.

      Attention-like mechanisms were introduced in the 1990s under names like multiplicative modules, sigma pi units, and hypernetworks.[1] Its flexibility comes from its role as "soft weights" that can change during runtime, in contrast to standard weights that must remain fixed at runtime. Uses of attention include memory in neural turing machines, reasoning tasks in differentiable neural computers,[2] language processing in transformers, and multi-sensory data processing (sound, images, video, and text) in perceivers.

  1. Yann Lecun (2020). Deep Learning course at NYU, Spring 2020, video lecture Week 6. Event occurs at 53:00. Retrieved 2022-03-08.
  2. Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago; Agapiou, John; Badia, Adrià Puigdomènech; Hermann, Karl Moritz; Zwols, Yori; Ostrovski, Georg; Cain, Adam; King, Helen; Summerfield, Christopher; Blunsom, Phil; Kavukcuoglu, Koray; Hassabis, Demis (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. S2CID 205251479.
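
The “soft weights” idea above can be made concrete with a small numerical sketch. Below is a minimal illustration (not code from the cited sources), assuming NumPy; soft_weights is a hypothetical helper name. Each input element gets a score, the scores are normalized with softmax, and the resulting weights amplify some elements while diminishing others; in a trained network the scores would come from learned parameters tuned by gradient descent.

```python
# Minimal sketch of attention as "soft weights" (illustrative, hedged).
import numpy as np

def soft_weights(inputs, scores):
    """Re-weight each row of `inputs` (n x d) by softmax-normalized `scores` (n,)."""
    w = np.exp(scores - scores.max())   # subtract max for numerical stability
    w = w / w.sum()                     # softmax: non-negative weights summing to 1
    return w[:, None] * inputs          # amplify some rows, diminish others

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))             # four input elements, three features each
s = np.array([0.1, 2.0, -1.0, 0.3])     # scores; learned by gradient descent in practice
print(soft_weights(x, s))               # the second row dominates the output
```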

2022

  • (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Attention_(machine_learning)#Variants Retrieved:2022-4-24.
    • There are many variants of attention: dot-product, query-key-value, hard, soft, self, cross, Luong, and Bahdanau, to name a few. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients.
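
As a concrete illustration of the query-key-value variant mentioned above, here is a minimal sketch (assuming NumPy; qkv_attention and softmax are illustrative names, not from the source): the correlation-style matrix of query-key dot products is softmax-normalized row by row, and those coefficients recombine the values for each target output.

```python
# Minimal query-key-value attention sketch (illustrative, hedged).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    """Queries (n_q x d), keys (n_k x d), values (n_k x d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # correlation-style matrix of dot products
    return softmax(scores) @ V                # each row of weights recombines the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # two target positions
K = rng.normal(size=(3, 4))   # three source positions
V = rng.normal(size=(3, 5))
print(qkv_attention(Q, K, V).shape)  # (2, 5): one recombined value per target output
```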

2018


Below is a summary of several popular attention mechanisms and their corresponding alignment score functions (a code sketch of these score functions follows the list):

  • Content-base attention: $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$ (Graves2014)
  • Additive(*): $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$ (Bahdanau2015)
  • Location-Base: $\alpha_{t,i} = \text{softmax}(W_a s_t)$ (Luong2015)
    Note: this simplifies the softmax alignment to depend only on the target position.
  • General: $\text{score}(s_t, h_i) = s_t^\top W_a h_i$, where $W_a$ is a trainable weight matrix in the attention layer. (Luong2015)
  • Dot-Product: $\text{score}(s_t, h_i) = s_t^\top h_i$ (Luong2015)
  • Scaled Dot-Product(^): $\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$ (Vaswani2017)
    Note: very similar to dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state.

(*) Referred to as “concat” in Luong et al., 2015, and as “additive attention” in Vaswani et al., 2017.
(^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning difficult.
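
For concreteness, the alignment score functions in the list above can be sketched as follows, with $s_t$ the target hidden state and $h_i$ the source hidden state. This is a hedged illustration assuming NumPy; the function names and toy shapes are assumptions, not from the cited papers.

```python
# Illustrative implementations of the alignment score functions above.
import numpy as np

def content_base(s_t, h_i):
    # cosine[s_t, h_i]  (Graves2014)
    return (s_t @ h_i) / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

def additive(s_t, h_i, W_a, v_a):
    # v_a^T tanh(W_a [s_t; h_i])  (Bahdanau2015)
    return v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i]))

def general(s_t, h_i, W_a):
    # s_t^T W_a h_i, with W_a a trainable weight matrix  (Luong2015)
    return s_t @ W_a @ h_i

def dot_product(s_t, h_i):
    # s_t^T h_i  (Luong2015)
    return s_t @ h_i

def scaled_dot_product(s_t, h_i):
    # s_t^T h_i / sqrt(n), n = dimension of the source hidden state  (Vaswani2017)
    return (s_t @ h_i) / np.sqrt(h_i.shape[-1])

# Toy shapes: d-dimensional states; W_add maps the 2d-dim concatenation to k dims.
d, k = 4, 6
rng = np.random.default_rng(0)
s_t, h_i = rng.normal(size=d), rng.normal(size=d)
W_add, v_a = rng.normal(size=(k, 2 * d)), rng.normal(size=k)
W_gen = rng.normal(size=(d, d))
print(content_base(s_t, h_i), additive(s_t, h_i, W_add, v_a),
      general(s_t, h_i, W_gen), dot_product(s_t, h_i), scaled_dot_product(s_t, h_i))
```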

Here is a summary of broader categories of attention mechanisms (a minimal self-attention sketch follows the list):

  • Self-Attention(&): Relating different positions of the same input sequence. Theoretically, self-attention can adopt any of the score functions above, with the target sequence replaced by the same input sequence. (Cheng2016)
  • Global/Soft: Attending to the entire input state space. (Xu2015)
  • Local/Hard: Attending to a part of the input state space, i.e., a patch of the input image. (Xu2015; Luong2015)

(&) Also referred to as “intra-attention” in Cheng et al., 2016.
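
To make the self-attention entry concrete, here is a minimal sketch (assuming NumPy; self_attention is an illustrative name, not from the cited papers): the scaled dot-product score from the first list is applied with the target sequence replaced by the input sequence itself, and the row-wise softmax attends over the entire sequence, i.e., global/soft attention.

```python
# Minimal self-attention sketch: a sequence attends to itself (illustrative, hedged).
import numpy as np

def self_attention(X):
    """Each position of X (n x d) attends to every position of the same sequence."""
    scores = X @ X.T / np.sqrt(X.shape[-1])          # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # row-wise softmax (global/soft)
    return w @ X                                     # each output mixes all positions

X = np.arange(12, dtype=float).reshape(4, 3)         # a toy 4-step sequence
print(self_attention(X))
```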