Attention Mechanism
An Attention Mechanism is a neural network component, common in memory-augmented and sequence-to-sequence architectures, that allows a model to dynamically focus on the parts of its input, or of its own internal state (memory), that are most relevant for performing a given task.
- AKA: Neural Attention Model, Neural Network Attention System.
- Context:
- It can (typically) be part of a Neural Network with Attention Mechanism.
- It can (typically) utilize an Attention Pattern Matrix that encodes the pairwise relevance between tokens, allowing the model to selectively focus on different parts of the input when updating each token's representation.
- It can (typically) compute Attention Scores between query vectors (representing the current state) and key vectors (representing the input elements), which are then normalized using a softmax function to obtain attention weights.
- It can (typically) use the computed Attention Weights to take a weighted sum of value vectors, which correspond to the input elements, to obtain a context vector that captures the most relevant information for the current state (see the sketch after this list).
- It can (typically) update its Query Vectors, Key Vectors, and Value Vectors through learnable linear transformations, allowing the model to adapt and learn the most suitable representations for the given task during training.
- It can be described by a Neural Network Attention Function, which mathematically defines how attention scores are computed.
- It can range from being a Local Neural Attention Model to being a Global Neural Attention Model.
- It can range from being a Self-Attention Mechanism (where attention is computed relative to the input itself) to being a Multi-Head Attention Mechanism (where attention is computed through multiple representation subspaces).
- It can range from being an Additive Attention Mechanism (which uses a feed-forward network) to being a Dot Product Attention Mechanism (which computes the dot product between the query and key vectors), based on different scoring methods.
- It can range from being a Deterministic Attention Mechanism (which produces fixed attention weights for a given input) to being a Stochastic Attention Mechanism (which samples attention weights from a probability distribution).
- It can range from being a Soft Attention Mechanism (which assigns real-valued weights to the input data) to being a Hard Attention Mechanism (which makes discrete selections of input elements, typically via sampling).
- It can be an input to an Attention Mechanism Computational Complexity Analysis.
- …
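The score → softmax → weighted-sum pipeline described in the context items above can be sketched in a few lines of NumPy (all function and variable names here are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """Score -> softmax -> weighted sum, for a single query.

    query:  (d,)      representation of the current state
    keys:   (n, d)    one key per input element
    values: (n, d_v)  one value per input element
    """
    scores = keys @ query        # pairwise relevance scores
    weights = softmax(scores)    # attention weights (sum to 1)
    context = weights @ values   # context vector, shape (d_v,)
    return context, weights
```

In trained models the query, key, and value vectors are themselves produced by learnable linear transformations of the inputs, as noted above.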
- Example(s):
- an Additive Attention Mechanism, which computes attention weights by using a feed-forward network with a single hidden layer to combine query and key vectors.
- a Dot Product Attention Mechanism, where attention weights are computed as the dot product between the query and key vectors, often used due to its computational efficiency.
- an Encoder-Decoder Attention Mechanism, which is widely used in sequence-to-sequence models for tasks such as machine translation, allowing the decoder to attend over all positions in the input sequence.
- a Context-based Attention Mechanism, where the model adjusts its focus based on the surrounding context of a specific input element.
- a Hard Stochastic Attention Mechanism, where attention decisions are sampled from a probability distribution, leading to discrete attention focusing.
- a Segment-Level Attention Mechanism, often used in natural language processing to attend to whole segments or phrases in the input sequence for better semantic understanding.
- a Scaled Dot-Product Attention Mechanism, which divides the dot products by the square root of the vector dimensionality, improving stability in models with large dimension sizes.
- a Soft Deterministic Attention Mechanism, which uses a deterministic approach to compute attention weights but allows for a distribution over inputs, balancing between focusing and distributing attention.
- a Block Sparse Attention Mechanism, which introduces sparsity into the attention mechanism by computing attention within blocks or between specific blocks, reducing computational complexity.
- a Tiered Attention Mechanism, which employs multiple levels of attention, such as focusing first on broader categories and then on more specific details within those categories.
- a Multi-Head Attention Mechanism, which runs several attention mechanisms in parallel, allowing the model to capture different types of relationships in the input data.
- …
- Counter-Example(s):
- Self-Attention Mechanism without Positional Encoding: A self-attention mechanism that does not incorporate positional information, which may limit its ability to capture sequential or spatial relationships in the input data.
- Uniform Attention Distribution: A mechanism that assigns equal attention weights to all input elements, effectively not focusing on any specific part of the input, which may not be suitable for tasks that require selective attention.
- Static Attention Mechanism: An attention mechanism where the attention weights are fixed and not learned or updated during training, which may not be able to adapt to different input sequences or tasks.
- Single-Head Attention Mechanism: An attention mechanism that uses only one attention head, which may not be able to capture multiple types of relationships or attend to different aspects of the input simultaneously, compared to a Multi-Head Attention Mechanism.
- Attention Mechanism without Query-Key-Value Separation: An attention mechanism that does not separate the input elements into query, key, and value vectors, which may limit its expressiveness and ability to compute complex attention patterns.
- a Coverage Mechanism, which prevents the model from attending to the same information repeatedly.
- a Gating Mechanism, such as that of a GRU, which is used to control the flow of information.
- a Sequential Memory Cell, such as that of an LSTM Unit, which is designed to remember patterns over time.
- a Stacked Memory Cell, where multiple memory cells are stacked to form a deep network.
- See: Transformer Model, Seq2Seq Model with Attention, Attention Alignment, Attention Layer, Attention Map, Attention Mask, Attention Module, Attentional Neural Network, Attentive Neural Network.
References
2018a
- (Brown et al., 2018) ⇒ Andy Brown, Aaron Tuor, Brian Hutchinson, and Nicole Nichols. (2018). “Recurrent Neural Network Attention Mechanisms for Interpretable System Log Anomaly Detection.” In: Proceedings of the First Workshop on Machine Learning for Computing Systems (MLCS'18). ISBN:978-1-4503-5865-1 doi:10.1145/3217871.3217872
- QUOTE: In this work we use dot product attention (Figure 3), wherein an “attention vector” $\mathbf{a}$ is generated from three values: 1) a key matrix $\mathbf{K}$, 2) a value matrix $\mathbf{V}$, and 3) a query vector $\mathbf{q}$. In this formulation, keys are a function of the value matrix:
[math]\displaystyle{ \mathbf{K} = \tanh\left(\mathbf{VW}^a\right) }[/math] (5)
- parameterized by $\mathbf{W}^a$. The importance of each timestep is determined by the magnitude of the dot product of each key vector with the query vector $\mathbf{q} \in \mathbb{R}^{L_a}$ for some attention dimension hyperparameter, $L_a$. These magnitudes determine the weights, $\mathbf{d}$, on the weighted sum of value vectors, $\mathbf{a}$:
[math]\displaystyle{ \mathbf{d} = \mathrm{softmax}\left(\mathbf{qK}^T\right) }[/math] (6)
[math]\displaystyle{ \mathbf{a} = \mathbf{dV} }[/math] (7)
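Eqs. (5)–(7) translate directly into a short NumPy sketch (array shapes and names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_product_attention(V, W_a, q):
    """Dot-product attention per Eqs. (5)-(7).

    V:   (T, d)    value matrix, one row per timestep
    W_a: (d, L_a)  learned projection that produces the keys
    q:   (L_a,)    query vector
    """
    K = np.tanh(V @ W_a)   # Eq. (5): keys are a function of the values
    d = softmax(q @ K.T)   # Eq. (6): weights over the T timesteps
    a = d @ V              # Eq. (7): attention vector, shape (d,)
    return a, d
```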
2018b
- (Yogatama et al., 2018) ⇒ Dani Yogatama, Yishu Miao, Gabor Melis, Wang Ling, Adhiguna Kuncoro, Chris Dyer, and Phil Blunsom. (2018). “Memory Architectures in Recurrent Neural Network Language Models.” In: Proceedings of 6th International Conference on Learning Representations.
- QUOTE: Random access memory. One common approach to retrieve information from the distant past more reliably is to augment the model with a random access memory block via an attention-based method. In this model, we consider the previous $K$ states as the memory block, and construct a memory vector $\mathbf{m}_t$ by a weighted combination of these states: [math]\displaystyle{ \mathbf{m}_t = \displaystyle \sum_{i=t-K}^{t-1} a_i\mathbf{h}_i \quad }[/math], where [math]\displaystyle{ \quad a_i \propto \exp\left(\mathbf{w}_{m,i}\mathbf{h}_i + \mathbf{w}_{m,h} \mathbf{h}_t\right) }[/math]
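A hedged NumPy reading of this memory block (treating $\mathbf{w}_{m,i}$ and $\mathbf{w}_{m,h}$ as learned scoring vectors; shapes are assumptions, not the paper's code):

```python
import numpy as np

def memory_vector(H, w_mi, w_mh, t, K):
    """m_t as a weighted combination of the previous K states.

    H: (T, d) hidden states; w_mi, w_mh: (d,) learned scoring vectors.
    One plausible reading of the equation above, not an exact reproduction.
    """
    block = H[t - K:t]                    # the K-state memory block
    scores = block @ w_mi + H[t] @ w_mh   # a_i proportional to exp(score_i)
    a = np.exp(scores - scores.max())
    a /= a.sum()                          # normalize the weights
    return a @ block                      # m_t, shape (d,)
```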
2017a
- (Gupta et al., 2017) ⇒ Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. (2017). “DeepFix: Fixing Common C Language Errors by Deep Learning.” In: Proceeding of AAAI.
- QUOTE: We present an end-to-end solution, called DeepFix, that does not use any external tool to localize or fix errors. We use a compiler only to validate the fixes suggested by DeepFix. At the heart of DeepFix is a multi-layered sequence-to-sequence neural network with attention (Bahdanau, Cho, and Bengio 2014), comprising of an encoder recurrent neural network (RNN) to process the input and a decoder RNN with attention that generates the output. The network is trained to predict an erroneous program location along with the correct statement. DeepFix invokes it iteratively to fix multiple errors in the program one-by-one. (...)
DeepFix uses a simple yet effective iterative strategy to fix multiple errors in a program as shown in Figure 2 (...)
2017b
- (See et al., 2017) ⇒ Abigail See, Peter J. Liu, and Christopher D. Manning. (2017). “Get To The Point: Summarization with Pointer-Generator Networks.” In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). DOI:10.18653/v1/P17-1099.
- QUOTE: The attention distribution $a^t$ is calculated as in Bahdanau et al. (2015):
[math]\displaystyle{ e^t_i = \nu^T \mathrm{tanh}\left(W_h h_i + W_s s_t + b_{attn}\right) }[/math] (1)
[math]\displaystyle{ a^t = \mathrm{softmax}\left(e^t\right) }[/math] (2)
- where $\nu$, $W_h$, $W_s$ and $b_{attn}$ are learnable parameters. The attention distribution can be viewed as a probability distribution over the source words, that tells the decoder where to look to produce the next word. Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $h^*_t$:
[math]\displaystyle{ h^*_t = \displaystyle\sum_i a^t_i h_i }[/math] (3)
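A compact NumPy sketch of Eqs. (1)–(3) (parameter shapes here are illustrative assumptions):

```python
import numpy as np

def additive_attention(h, s_t, W_h, W_s, b_attn, v):
    """Bahdanau-style additive attention per Eqs. (1)-(3).

    h:   (T, d)  encoder hidden states
    s_t: (d,)    decoder state at step t
    W_h, W_s: (k, d); b_attn, v: (k,)
    """
    e = np.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v  # Eq. (1): scores e^t_i
    a = np.exp(e - e.max())
    a /= a.sum()                                       # Eq. (2): softmax
    h_star = a @ h                                     # Eq. (3): context vector
    return h_star, a
```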
2017c
- (Synced Review, 2017) ⇒ Synced (2017). "A Brief Overview of Attention Mechanism." In: Medium - Synced Review Blog Post.
- QUOTE: And to build context vector is fairly simple. For a fixed target word, first, we loop over all encoders' states to compare target and source states to generate scores for each state in encoders. Then we could use softmax to normalize all scores, which generates the probability distribution conditioned on target states. At last, the weights are introduced to make context vector easy to train. That’s it. Math is shown below:
[math]\displaystyle{ \alpha_{ts} = \dfrac{\exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_s\right)\right)}{\displaystyle\sum_{s'=1}^S \exp\left(\mathrm{score}\left(\mathbf{h}_t,\mathbf{\overline{h}}_{s'}\right)\right)} }[/math] [Attention weights] (1)
[math]\displaystyle{ \mathbf{c}_t = \displaystyle\sum_s \alpha_{ts}\mathbf{\overline{h}}_s }[/math] [Context vector] (2)
[math]\displaystyle{ \mathbf{a}_t = f\left(\mathbf{c}_t,\mathbf{h}_t\right) = \mathrm{tanh}\left(\mathbf{W}_c\big[\mathbf{c}_t; \mathbf{h}_t\big]\right) }[/math] [Attention vector] (3)
- To understand the seemingly complicated math, we need to keep three key points in mind:
- 1. During decoding, a context vector is computed for every output word. So we will have a 2D matrix whose size is the number of target words multiplied by the number of source words. Equation (1) demonstrates how to compute a single attention weight given one target word and a set of source words.
- 2. Once the context vector is computed, the attention vector can be computed from the context vector, the target word, and the attention function $f$.
- 3. We need the attention mechanism to be trainable. According to equation (4), both styles offer trainable weights ($W$ in Luong's, $W_1$ and $W_2$ in Bahdanau's). Thus, different styles may result in different performance.
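As a worked sketch of Equations (1)–(3) in NumPy, using Luong's "general" score $\mathrm{score}(\mathbf{h}_t, \mathbf{\overline{h}}_s) = \mathbf{h}_t^\top \mathbf{W}\,\mathbf{\overline{h}}_s$ (the choice of score function and all shapes are assumptions for illustration):

```python
import numpy as np

def luong_attention(h_t, H_s, W, W_c):
    """Attention weights, context vector, and attention vector.

    h_t: (d,)    current target (decoder) state
    H_s: (S, d)  source (encoder) states
    W:   (d, d)  score weights; W_c: (d, 2d) output weights
    """
    scores = H_s @ (W @ h_t)                         # score(h_t, h_s) for each s
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # Eq. (1): attention weights
    c_t = alpha @ H_s                                # Eq. (2): context vector
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # Eq. (3): attention vector
    return a_t, alpha
```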
2017d
- (Vaswani et al., 2017) ⇒ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). “Attention is all You Need.” In: Advances in Neural Information Processing Systems.
- QUOTE: We call our particular attention “Scaled Dot-Product Attention” (Figure 2). The input consists of queries and keys of dimension [math]\displaystyle{ d_k }[/math], and values of dimension [math]\displaystyle{ d_v }[/math]. We compute the dot products of the query with all keys, divide each by [math]\displaystyle{ \sqrt{d_k} }[/math] , and apply a softmax function to obtain the weights on the values.
- In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix [math]\displaystyle{ Q }[/math]. The keys and values are also packed together into matrices [math]\displaystyle{ K }[/math] and [math]\displaystyle{ V }[/math]. We compute the matrix of outputs as:
[math]\displaystyle{ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V }[/math] (1)
- The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of [math]\displaystyle{ 1/\sqrt{d_k} }[/math]. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
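A minimal NumPy sketch of Eq. (1) in matrix form (shapes assumed for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, per Eq. (1).

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)                    # row-wise softmax
    return W @ V                                          # (n_q, d_v)
```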
2016a
- (Yang et al., 2016) ⇒ Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. (2016). “Hierarchical Attention Networks for Document Classification.” In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- QUOTE: The overall architecture of the Hierarchical Attention Network (HAN) is shown in Fig. 2. It consists of several parts: a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer. (...)
2016b
- (Tilk & Alumae, 2016) ⇒ Ottokar Tilk, and Tanel Alumae. (2016). “Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration.” In: Proceedings of Interspeech 2016. doi:10.21437/Interspeech.2016
- QUOTE: We incorporated an attention mechanism [25] into our model to further increase its capacity of finding relevant parts of the context for punctuation decisions. For example the model might focus on words that indicate a question, but may be relatively far from the current word, to nudge the model towards ending the sentence with a question mark instead of a period.
To fuse together the model state at current input word and the output from the attention mechanism we use a late fusion approach [28] adapted from LSTM to GRU. This allows the attention model output to directly interact with the recurrent layer state while not interfering with its memory.
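The quote does not spell out the fusion equations, but a hypothetical gated late fusion (every parameter name below is an illustrative assumption, not the formulation of [28]) might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def late_fusion(h_t, a_t, W_g, W_f, b_g):
    """Let attention output a_t interact with recurrent state h_t
    through a learned gate, without overwriting the state's memory.

    h_t, a_t: (d,); W_g: (d, 2d); W_f: (d, d); b_g: (d,)
    """
    g = sigmoid(W_g @ np.concatenate([h_t, a_t]) + b_g)  # fusion gate
    return h_t + g * (W_f @ a_t)                         # fused state
```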
2015a
- (Bahdanau et al., 2015) ⇒ Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” In: Proceedings of the Third International Conference on Learning Representations (ICLR-2015).
- QUOTE: The context vector $c_i$ is, then, computed as a weighted sum of these annotations $h_i$ :
[math]\displaystyle{ c_i = \displaystyle\sum^{T_x}_{j=1} \alpha_{ij}h_j }[/math] (5)
- The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$. Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
2015b
- (Luong, Pham et al., 2015) ⇒ Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. (2015). “Effective Approaches to Attention-based Neural Machine Translation". In: Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-2015).
- QUOTE: Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions. We illustrate these two model types in Figure 2 and 3 respectively.
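One way to picture the global/local distinction is a hard attention window over the alignment scores (a simplification for illustration; Luong et al.'s local variant also weights the window with a Gaussian centred on a predicted position):

```python
import numpy as np

def attention_weights(scores, center=None, window=None):
    """Global attention if no window is given; local (windowed) otherwise.

    scores: (S,) alignment scores over all source positions
    """
    if center is not None and window is not None:
        lo = max(0, center - window)
        hi = min(len(scores), center + window + 1)
        masked = np.full_like(scores, -np.inf)
        masked[lo:hi] = scores[lo:hi]        # attend only inside the window
        scores = masked
    e = np.exp(scores - scores.max())        # softmax over allowed positions
    return e / e.sum()
```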
2015c
- (Rush et al., 2015) ⇒ Alexander M. Rush, Sumit Chopra, and Jason Weston. (2015). “A Neural Attention Model for Abstractive Sentence Summarization.” In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP-2015).
- QUOTE: ... The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-) search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
2015d
- (Vinyals et al., 2015) ⇒ Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. (2015). "Pointer Networks". In: Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS 2015).
2015e
- (Xu et al., 2015) ⇒ Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. (2015). “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Volume 37.