Attention Module


An Attention Module is a neural network module that uses an alignment score function to amplify some parts of the input data while diminishing other parts.



References

2022

  • (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Attention_(machine_learning) Retrieved:2022-4-24.
    • In neural networks, attention is a technique that mimics cognitive attention. The effect enhances some parts of the input data while diminishing other parts — the thought being that the network should devote more focus to that small but important part of the data. Learning which part of the data is more important than others depends on the context and is trained by gradient descent.

      Attention-like mechanisms were introduced in the 1990s under names like multiplicative modules, sigma pi units, and hypernetworks.[1] Its flexibility comes from its role as "soft weights" that can change during runtime, in contrast to standard weights that must remain fixed at runtime. Uses of attention include memory in neural turing machines, reasoning tasks in differentiable neural computers,[2] language processing in transformers, and multi-sensory data processing (sound, images, video, and text) in perceivers.

  1. Yann Lecun (2020). Deep Learning course at NYU, Spring 2020, video lecture Week 6. Event occurs at 53:00. Retrieved 2022-03-08.
  2. Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago; Agapiou, John; Badia, Adrià Puigdomènech; Hermann, Karl Moritz; Zwols, Yori; Ostrovski, Georg; Cain, Adam; King, Helen; Summerfield, Christopher; Blunsom, Phil; Kavukcuoglu, Koray; Hassabis, Demis (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574. S2CID 205251479.
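
The “soft weights” idea above can be made concrete with a small numerical sketch. Below is a minimal illustration (not code from the cited sources), assuming NumPy; soft_weights is a hypothetical helper name. Each input element gets a score, the scores are normalized with softmax, and the resulting weights amplify some elements while diminishing others; in a trained network the scores would come from learned parameters tuned by gradient descent.

```python
# Minimal sketch of attention as "soft weights" (illustrative, hedged).
import numpy as np

def soft_weights(inputs, scores):
    """Re-weight each row of `inputs` (n x d) by softmax-normalized `scores` (n,)."""
    w = np.exp(scores - scores.max())   # subtract max for numerical stability
    w = w / w.sum()                     # softmax: non-negative weights summing to 1
    return w[:, None] * inputs          # amplify some rows, diminish others

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))             # four input elements, three features each
s = np.array([0.1, 2.0, -1.0, 0.3])     # scores; learned by gradient descent in practice
print(soft_weights(x, s))               # the second row dominates the output
```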

2022

  • (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Attention_(machine_learning)#Variants Retrieved:2022-4-24.
    • There are many variants of attention: dot-product, query-key-value, hard, soft, self, cross, Luong, and Bahdanau, to name a few. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients.
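
As a concrete illustration of the query-key-value variant mentioned above, here is a minimal sketch (assuming NumPy; qkv_attention and softmax are illustrative names, not from the source): the correlation-style matrix of query-key dot products is softmax-normalized row by row, and those coefficients recombine the values for each target output.

```python
# Minimal query-key-value attention sketch (illustrative, hedged).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    """Queries (n_q x d), keys (n_k x d), values (n_k x d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # correlation-style matrix of dot products
    return softmax(scores) @ V                # each row of weights recombines the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # two target positions
K = rng.normal(size=(3, 4))   # three source positions
V = rng.normal(size=(3, 5))
print(qkv_attention(Q, K, V).shape)  # (2, 5): one recombined value per target output
```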

2018


Below is a summary of several popular attention mechanisms and their corresponding alignment score functions (a code sketch of these score functions follows the list):

  • Content-base attention: $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$ (Graves2014)
  • Additive(*): $\text{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$ (Bahdanau2015)
  • Location-Base: $\alpha_{t,i} = \text{softmax}(W_a s_t)$ (Luong2015)
    Note: this simplifies the softmax alignment to depend only on the target position.
  • General: $\text{score}(s_t, h_i) = s_t^\top W_a h_i$, where $W_a$ is a trainable weight matrix in the attention layer. (Luong2015)
  • Dot-Product: $\text{score}(s_t, h_i) = s_t^\top h_i$ (Luong2015)
  • Scaled Dot-Product(^): $\text{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$ (Vaswani2017)
    Note: very similar to dot-product attention except for a scaling factor, where $n$ is the dimension of the source hidden state.

(*) Referred to as “concat” in Luong et al., 2015, and as “additive attention” in Vaswani et al., 2017.
(^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning difficult.
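
For concreteness, the alignment score functions in the list above can be sketched as follows, with $s_t$ the target hidden state and $h_i$ the source hidden state. This is a hedged illustration assuming NumPy; the function names and toy shapes are assumptions, not from the cited papers.

```python
# Illustrative implementations of the alignment score functions above.
import numpy as np

def content_base(s_t, h_i):
    # cosine[s_t, h_i]  (Graves2014)
    return (s_t @ h_i) / (np.linalg.norm(s_t) * np.linalg.norm(h_i))

def additive(s_t, h_i, W_a, v_a):
    # v_a^T tanh(W_a [s_t; h_i])  (Bahdanau2015)
    return v_a @ np.tanh(W_a @ np.concatenate([s_t, h_i]))

def general(s_t, h_i, W_a):
    # s_t^T W_a h_i, with W_a a trainable weight matrix  (Luong2015)
    return s_t @ W_a @ h_i

def dot_product(s_t, h_i):
    # s_t^T h_i  (Luong2015)
    return s_t @ h_i

def scaled_dot_product(s_t, h_i):
    # s_t^T h_i / sqrt(n), n = dimension of the source hidden state  (Vaswani2017)
    return (s_t @ h_i) / np.sqrt(h_i.shape[-1])

# Toy shapes: d-dimensional states; W_add maps the 2d-dim concatenation to k dims.
d, k = 4, 6
rng = np.random.default_rng(0)
s_t, h_i = rng.normal(size=d), rng.normal(size=d)
W_add, v_a = rng.normal(size=(k, 2 * d)), rng.normal(size=k)
W_gen = rng.normal(size=(d, d))
print(content_base(s_t, h_i), additive(s_t, h_i, W_add, v_a),
      general(s_t, h_i, W_gen), dot_product(s_t, h_i), scaled_dot_product(s_t, h_i))
```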

Here is a summary of broader categories of attention mechanisms (a minimal self-attention sketch follows the list):

  • Self-Attention(&): Relating different positions of the same input sequence. Theoretically, self-attention can adopt any of the score functions above, with the target sequence replaced by the same input sequence. (Cheng2016)
  • Global/Soft: Attending to the entire input state space. (Xu2015)
  • Local/Hard: Attending to a part of the input state space, i.e., a patch of the input image. (Xu2015; Luong2015)

(&) Also referred to as “intra-attention” in Cheng et al., 2016.
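
To make the self-attention entry concrete, here is a minimal sketch (assuming NumPy; self_attention is an illustrative name, not from the cited papers): the scaled dot-product score from the first list is applied with the target sequence replaced by the input sequence itself, and the row-wise softmax attends over the entire sequence, i.e., global/soft attention.

```python
# Minimal self-attention sketch: a sequence attends to itself (illustrative, hedged).
import numpy as np

def self_attention(X):
    """Each position of X (n x d) attends to every position of the same sequence."""
    scores = X @ X.T / np.sqrt(X.shape[-1])          # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # row-wise softmax (global/soft)
    return w @ X                                     # each output mixes all positions

X = np.arange(12, dtype=float).reshape(4, 3)         # a toy 4-step sequence
print(self_attention(X))
```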