Attention Mechanism
An Attention Mechanism is a neural network component, common in memory-augmented and sequence-to-sequence architectures, that allows a model to dynamically focus on the parts of its input, or of its own internal state (memory), that are most relevant for performing a given task.
- AKA: Neural Attention Model, Neural Network Attention System.
- Context:
- It can (typically) be part of a Neural Network with Attention Mechanism.
- It can (typically) utilize an Attention Pattern Matrix that encodes the pairwise relevance between tokens, allowing the model to selectively focus on different parts of the input when updating each token's representation.
- It can (typically) compute Attention Scores between query vectors (representing the current state) and key vectors (representing the input elements), which are then normalized using a softmax function to obtain attention weights.
- It can (typically) use the computed Attention Weights to take a weighted sum of value vectors, which correspond to the input elements, to obtain a context vector that captures the most relevant information for the current state (see the sketch after this list).
- It can (typically) update its Query Vectors, Key Vectors, and Value Vectors through learnable linear transformations, allowing the model to adapt and learn the most suitable representations for the given task during training.
- It can be described by a Neural Network Attention Function, which mathematically defines how attention scores are computed.
- It can range from being a Local Neural Attention Model to being a Global Neural Attention Model.
- It can range from being a Self-Attention Mechanism (where attention is computed relative to the input itself) to being a Multi-Head Attention Mechanism (where attention is computed through multiple representation subspaces).
- It can range from being an Additive Attention Mechanism (which uses a feed-forward network) to being a Dot Product Attention Mechanism (which computes the dot product between the query and key vectors), based on different scoring methods.
- It can range from being a Deterministic Attention Mechanism (which produces fixed attention weights for a given input) to being a Stochastic Attention Mechanism (which samples attention weights from a probability distribution).
- It can range from being a Soft Attention Mechanism (which assigns real-valued weights to the input data) to being a Hard Attention Mechanism (which makes discrete selections of input elements, typically via sampling).
- It can be an input to an Attention Mechanism Computational Complexity Analysis.
- …
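The score → softmax → weighted-sum pipeline described in the context items above can be sketched in a few lines of NumPy (all function and variable names here are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """Score -> softmax -> weighted sum, for a single query.

    query:  (d,)      representation of the current state
    keys:   (n, d)    one key per input element
    values: (n, d_v)  one value per input element
    """
    scores = keys @ query        # pairwise relevance scores
    weights = softmax(scores)    # attention weights (sum to 1)
    context = weights @ values   # context vector, shape (d_v,)
    return context, weights
```

In trained models the query, key, and value vectors are themselves produced by learnable linear transformations of the inputs, as noted above.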
- Example(s):
- an Additive Attention Mechanism, which computes attention weights by using a feed-forward network with a single hidden layer to combine query and key vectors.
- a Dot Product Attention Mechanism, where attention weights are computed as the dot product between the query and key vectors, often used due to its computational efficiency.
- an Encoder-Decoder Attention Mechanism, which is widely used in sequence-to-sequence models for tasks such as machine translation, allowing the decoder to attend over all positions in the input sequence.
- a Context-based Attention Mechanism, where the model adjusts its focus based on the surrounding context of a specific input element.
- a Hard Stochastic Attention Mechanism, where attention decisions are sampled from a probability distribution, leading to discrete attention focusing.
- a Segment-Level Attention Mechanism, often used in natural language processing to attend to whole segments or phrases in the input sequence for better semantic understanding.
- a Scaled Dot-Product Attention Mechanism, which divides the dot products by the square root of the vector dimensionality, improving stability in models with large dimension sizes.
- a Soft Deterministic Attention Mechanism, which uses a deterministic approach to compute attention weights but allows for a distribution over inputs, balancing between focusing and distributing attention.
- a Block Sparse Attention Mechanism, which introduces sparsity into the attention mechanism by computing attention within blocks or between specific blocks, reducing computational complexity.
- a Tiered Attention Mechanism, which employs multiple levels of attention, such as focusing first on broader categories and then on more specific details within those categories.
- a Multi-Head Attention Mechanism, which runs several attention mechanisms in parallel, allowing the model to capture different types of relationships in the input data.
- …
- Counter-Example(s):
- Self-Attention Mechanism without Positional Encoding: A self-attention mechanism that does not incorporate positional information, which may limit its ability to capture sequential or spatial relationships in the input data.
- Uniform Attention Distribution: A mechanism that assigns equal attention weights to all input elements, effectively not focusing on any specific part of the input, which may not be suitable for tasks that require selective attention.
- Static Attention Mechanism: An attention mechanism where the attention weights are fixed and not learned or updated during training, which may not be able to adapt to different input sequences or tasks.
- Single-Head Attention Mechanism: An attention mechanism that uses only one attention head, which may not be able to capture multiple types of relationships or attend to different aspects of the input simultaneously, compared to a Multi-Head Attention Mechanism.
- Attention Mechanism without Query-Key-Value Separation: An attention mechanism that does not separate the input elements into query, key, and value vectors, which may limit its expressiveness and ability to compute complex attention patterns.
- a Coverage Mechanism, which prevents the model from attending to the same information repeatedly.
- a Gating Mechanism, such as that of a GRU, which is used to control the flow of information.
- a Sequential Memory Cell, such as that of an LSTM Unit, which is designed to remember patterns over time.
- a Stacked Memory Cell, where multiple memory cells are stacked to form a deep network.
- See: Transformer Model, Seq2Seq Model with Attention, Attention Alignment, Attention Layer, Attention Map, Attention Mask, Attention Module, Attentional Neural Network, Attentive Neural Network.
References
2018a
- (Brown et al., 2018) ⇒ Andy Brown, Aaron Tuor, Brian Hutchinson, and Nicole Nichols. (2018). “Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection.” In: Proceedings of the First Workshop on Machine Learning for Computing Systems (MLCS'18). ISBN:978-1-4503-5865-1 doi:10.1145/3217871.3217872
- QUOTE: In this work we use dot product attention (Figure 3), wherein an “attention vector” $\mathbf{a}$ is generated from three values: 1) a key matrix $\mathbf{K}$, 2) a value matrix $\mathbf{V}$, and 3) a query vector $\mathbf{q}$. In this formulation, keys are a function of the value matrix:
[math]\displaystyle{ \mathbf{K} = \tanh\left(\mathbf{VW}^a\right) }[/math] (5)
- parameterized by $\mathbf{W}^a$. The importance of each timestep is determined by the magnitude of the dot product of each key vector with the query vector $\mathbf{q} \in \mathbb{R}^{L_a}$ for some attention dimension hyperparameter, $L_a$. These magnitudes determine the weights, $\mathbf{d}$, on the weighted sum of value vectors, $\mathbf{a}$:
[math]\displaystyle{ \mathbf{d} = \mathrm{softmax}\left(\mathbf{qK}^T\right) }[/math] (6)
[math]\displaystyle{ \mathbf{a} = \mathbf{dV} }[/math] (7)
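Eqs. (5)–(7) translate directly into a short NumPy sketch (array shapes and names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_product_attention(V, W_a, q):
    """Dot-product attention per Eqs. (5)-(7).

    V:   (T, d)    value matrix, one row per timestep
    W_a: (d, L_a)  learned projection that produces the keys
    q:   (L_a,)    query vector
    """
    K = np.tanh(V @ W_a)   # Eq. (5): keys are a function of the values
    d = softmax(q @ K.T)   # Eq. (6): weights over the T timesteps
    a = d @ V              # Eq. (7): attention vector, shape (d,)
    return a, d
```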
2018b
- (Yogatama et al., 2018) ⇒ Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. (2018). “Memory Architectures in Recurrent Neural Network Language Models.” In: Proceedings of 6th International Conference on Learning Representations.
- QUOTE: Random access memory. One common approach to retrieve information from the distant past more reliably is to augment the model with a random access memory block via an attention-based method. In this model, we consider the previous $K$ states as the memory block, and construct a memory vector $\mathbf{m}_t$ by a weighted combination of these states: [math]\displaystyle{ \mathbf{m}_t = \displaystyle \sum_{i=t-K}^{t-1} a_i\mathbf{h}_i \quad }[/math], where [math]\displaystyle{ \quad a_i \propto \exp\left(\mathbf{w}_{m,i}\mathbf{h}_i + \mathbf{w}_{m,h} \mathbf{h}_t\right) }[/math]
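A hedged NumPy reading of this memory block (treating $\mathbf{w}_{m,i}$ and $\mathbf{w}_{m,h}$ as learned scoring vectors; shapes are assumptions, not the paper's code):

```python
import numpy as np

def memory_vector(H, w_mi, w_mh, t, K):
    """m_t as a weighted combination of the previous K states.

    H: (T, d) hidden states; w_mi, w_mh: (d,) learned scoring vectors.
    One plausible reading of the equation above, not an exact reproduction.
    """
    block = H[t - K:t]                    # the K-state memory block
    scores = block @ w_mi + H[t] @ w_mh   # a_i proportional to exp(score_i)
    a = np.exp(scores - scores.max())
    a /= a.sum()                          # normalize the weights
    return a @ block                      # m_t, shape (d,)
```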
2017a
- (Gupta et al., 2017) ⇒ Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. (2017). “DeepFix: Fixing Common C Language Errors by Deep Learning.” In: Proceeding of AAAI.
- QUOTE: We present an end-to-end solution, called DeepFix, that does not use any external tool to localize or fix errors. We use a compiler only to validate the fixes suggested by DeepFix. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention (Bahdanau, Cho, and Bengio 2014), comprising of an encoder recurrent neural network (RNN) to process the input and a decoder RNN with attention that generates the output. The network is trained to predict an erroneous program location along with the correct statement. DeepFix invokes it iteratively to fix multiple errors in the program one-by-one. (...)
DeepFix uses a simple yet effective iterative strategy to fix multiple errors in a program as shown in Figure 2 (...)
2017b
- (See et al., 2017) ⇒ Abigail See, Peter J. Liu, and Christopher D. Manning. (2017). “Get To The Point: Summarization with Pointer-Generator Networks.” In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). DOI:10.18653/v1/P17-1099.
- QUOTE: The attention distribution $a^t$ is calculated as in Bahdanau et al. (2015):
[math]\displaystyle{ e^t_i = \nu^T \mathrm{tanh}\left(W_h h_i + W_s s_t + b_{attn}\right) }[/math] (1)
[math]\displaystyle{ a^t = \mathrm{softmax}\left(e^t\right) }[/math] (2)
- where $\nu$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to produce the next word. Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $h^*_t$:
[math]\displaystyle{ h^*_t = \displaystyle\sum_i a^t_i h_i }[/math] (3)
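A compact NumPy sketch of Eqs. (1)–(3) (parameter shapes here are illustrative assumptions):

```python
import numpy as np

def additive_attention(h, s_t, W_h, W_s, b_attn, v):
    """Bahdanau-style additive attention per Eqs. (1)-(3).

    h:   (T, d)  encoder hidden states
    s_t: (d,)    decoder state at step t
    W_h, W_s: (k, d); b_attn, v: (k,)
    """
    e = np.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v  # Eq. (1): scores e^t_i
    a = np.exp(e - e.max())
    a /= a.sum()                                       # Eq. (2): softmax
    h_star = a @ h                                     # Eq. (3): context vector
    return h_star, a
```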
2017c
- (Synced Review, 2017) ⇒ Synced (2017). "A Brief Overview of Attention Mechanism." In: Medium - Synced Review Blog Post.
- QUOTE: And to build context vector is fairly simple. For a fixed target word, first, we loop over all encoders' states to compare target and source states to generate scores for each state in encoders. Then we could use softmax to normalize all scores, which generates the probability distribution conditioned on target states. At last, the weights are introduced to make context vector easy to train. That’s it. Math is shown below:
[math]\displaystyle{ \alpha_{ts} = \dfrac{\exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_s\right)\right)}{\displaystyle\sum_{s'=1}^S \exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_{s'}\right)\right)} }[/math] [Attention weights] (1)
[math]\displaystyle{ \mathbf{c}_t = \displaystyle\sum_s \alpha_{ts}\mathbf{\overline{h}}_s }[/math] [Context vector] (2)
[math]\displaystyle{ \mathbf{a}_t = f\left(\mathbf{c}_t,\mathbf{h}_t\right) = \mathrm{tanh}\left(\mathbf{W}_c\big[\mathbf{c}_t; \mathbf{h}_t\big]\right) }[/math] [Attention vector] (3)
- To understand the seemingly complicated math, we need to keep three key points in mind:
- 1. During decoding, a context vector is computed for every output word. So we will have a 2D matrix whose size is the number of target words multiplied by the number of source words. Equation (1) demonstrates how to compute a single attention weight given one target word and a set of source words.
- 2. Once the context vector is computed, the attention vector can be computed from the context vector, the target word, and the attention function $f$.
- 3. We need the attention mechanism to be trainable. According to equation (4), both styles offer trainable weights ($W$ in Luong's, $W_1$ and $W_2$ in Bahdanau's). Thus, different styles may result in different performance.
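As a worked sketch of Equations (1)–(3) in NumPy, using Luong's "general" score $\mathrm{score}(\mathbf{h}_t, \mathbf{\overline{h}}_s) = \mathbf{h}_t^\top \mathbf{W}\,\mathbf{\overline{h}}_s$ (the choice of score function and all shapes are assumptions for illustration):

```python
import numpy as np

def luong_attention(h_t, H_s, W, W_c):
    """Attention weights, context vector, and attention vector.

    h_t: (d,)    current target (decoder) state
    H_s: (S, d)  source (encoder) states
    W:   (d, d)  score weights; W_c: (d, 2d) output weights
    """
    scores = H_s @ (W @ h_t)                         # score(h_t, h_s) for each s
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # Eq. (1): attention weights
    c_t = alpha @ H_s                                # Eq. (2): context vector
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # Eq. (3): attention vector
    return a_t, alpha
```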
2017d
- (Vaswani et al., 2017) ⇒ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). “Attention is all You Need.” In: Advances in Neural Information Processing Systems.
- QUOTE: We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension [math]\displaystyle{ d_k }[/math], and values of dimension [math]\displaystyle{ d_v }[/math]. We compute the dot products of the query with all keys, divide each by [math]\displaystyle{ \sqrt{d_k} }[/math] , and apply a softmax function to obtain the weights on the values.
- In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix [math]\displaystyle{ Q }[/math]. The keys and values are also packed together into matrices [math]\displaystyle{ K }[/math] and [math]\displaystyle{ V }[/math]. We compute the matrix of outputs as:
[math]\displaystyle{ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V }[/math] (1)
- The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of [math]\displaystyle{ 1/\sqrt{d_k} }[/math]. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
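A minimal NumPy sketch of Eq. (1) in matrix form (shapes assumed for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, per Eq. (1).

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return W @ V                                          # (n_q, d_v)
```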
2016a
- (Yang et al., 2016) ⇒ Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. (2016). “Hierarchical Attention Networks for Document Classification.” In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- QUOTE: The overall architecture of the Hierarchical Attention Network (HAN) is shown in Fig. 2. It consists of several parts: a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer. (...)
2016b
- (Tilk & Alumae, 2016) ⇒ Ottokar Tilk, and Tanel Alumae. (2016). “Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration.” In: Proceedings of Interspeech 2016. doi:10.21437/Interspeech.2016
- QUOTE: We incorporated an attention mechanism [25] into our model to further increase its capacity of finding relevant parts of the context for punctuation decisions. For example the model might focus on words that indicate a question, but may be relatively far from the current word, to nudge the model towards ending the sentence with a question mark instead of a period.
To fuse together the model state at current input word and the output from the attention mechanism we use a late fusion approach [28] adapted from LSTM to GRU. This allows the attention model output to directly interact with the recurrent layer state while not interfering with its memory.
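The quote does not spell out the fusion equations, but a hypothetical gated late fusion (every parameter name below is an illustrative assumption, not the formulation of [28]) might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def late_fusion(h_t, a_t, W_g, W_f, b_g):
    """Let attention output a_t interact with recurrent state h_t
    through a learned gate, without overwriting the state's memory.

    h_t, a_t: (d,); W_g: (d, 2d); W_f: (d, d); b_g: (d,)
    """
    g = sigmoid(W_g @ np.concatenate([h_t, a_t]) + b_g)  # fusion gate
    return h_t + g * (W_f @ a_t)                         # fused state
```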
2015a
- (Bahdanau et al., 2015) ⇒ Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” In: Proceedings of the Third International Conference on Learning Representations (ICLR-2015).
- QUOTE: The context vector $c_i$ is, then, computed as a weighted sum of these annotations $h_i$ :
[math]\displaystyle{ c_i = \displaystyle\sum^{T_x}_{j=1} \alpha_{ij}h_j }[/math] (5)
- The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
2015b
- (Luong, Pham et al., 2015) ⇒ Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. (2015). “Effective Approaches to Attention-based Neural Machine Translation". In: Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2015).
- QUOTE: Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions. We illustrate these two model types in Figure 2 and 3 respectively.
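One way to picture the global/local distinction is a hard attention window over the alignment scores (a simplification for illustration; Luong et al.'s local variant also weights the window with a Gaussian centred on a predicted position):

```python
import numpy as np

def attention_weights(scores, center=None, window=None):
    """Global attention if no window is given; local (windowed) otherwise.

    scores: (S,) alignment scores over all source positions
    """
    if center is not None and window is not None:
        lo = max(0, center - window)
        hi = min(len(scores), center + window + 1)
        masked = np.full_like(scores, -np.inf)
        masked[lo:hi] = scores[lo:hi]        # attend only inside the window
        scores = masked
    e = np.exp(scores - scores.max())        # softmax over allowed positions
    return e / e.sum()
```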
2015c
- (Rush et al., 2015) ⇒ Alexander M. Rush, Sumit Chopra, and Jason Weston. (2015). “A Neural Attention Model for Abstractive Sentence Summarization.” In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP-2015).
- QUOTE: ... The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-) search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
2015d
- (Vinyals et al., 2015) ⇒ Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. (2015). "Pointer Networks". In: Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS 2015).
2015e
- (Xu et al., 2015) ⇒ Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. (2015). “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Volume 37.