2018 Self-Attention with Relative Position Representations


Subject Headings: Self-Attention Mechanism; Transformer Network; Relation-Aware Self-Attention Mechanism.

Notes

Cited By

Quotes

Abstract

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.

1. Introduction

2. Background

2.1. Transformer

2.2. Self-Attention

Self-attention sublayers employ $h$ attention heads. To form the sublayer output, results from each head are concatenated and a parameterized linear transformation is applied.

Each attention head operates on an input sequence, $x = \left(x_1, \ldots , x_n\right)$ of $n$ elements where $x_i \in \R^{d_x}$, and computes a new sequence $z = \left(z_1, \ldots , z_n\right)$ of the same length where $z_i \in \R^{d_z}$.

Each output element, $z_i$, is computed as a weighted sum of linearly transformed input elements:

$z_{i}=\sum_{j=1}^{n} \alpha_{i j}\left(x_{j} W^{V}\right)$

(1)

Each weight coefficient, $\alpha_{ij}$, is computed using a softmax function:

$\alpha_{i j}=\dfrac{\exp e_{i j}}{\sum_{k=1}^{n} \exp e_{i k}}$

And $e_{ij}$ is computed using a compatibility function that compares two input elements:

$e_{i j}=\dfrac{\left(x_{i} W^{Q}\right)\left(x_{j} W^{K}\right)^{T}}{\sqrt{d_{z}}}$

(2)

 Scaled dot product was chosen for the compatibility function, which enables efficient computation. Linear transformations of the inputs add sufficient expressive power.

$W^Q$, $W^K$, $W^V\; \in \R^{d_x\times d_z}$ are parameter matrices. These parameter matrices are unique per layer and attention head.
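As a rough illustration of eq. (1) and eq. (2) (not code from the paper), the following NumPy sketch computes a single attention head; the input $x$ and the matrices $W^Q$, $W^K$, $W^V$ are placeholder random values.

    import numpy as np

    def attention_head(x, W_Q, W_K, W_V):
        # x: (n, d_x); W_Q, W_K, W_V: (d_x, d_z)
        q, k, v = x @ W_Q, x @ W_K, x @ W_V
        d_z = q.shape[-1]
        e = (q @ k.T) / np.sqrt(d_z)                 # eq. (2): scaled dot-product compatibilities
        alpha = np.exp(e - e.max(axis=-1, keepdims=True))
        alpha /= alpha.sum(axis=-1, keepdims=True)   # softmax over positions j
        return alpha @ v                             # eq. (1): weighted sum of transformed inputs

    n, d_x, d_z = 5, 16, 8
    rng = np.random.default_rng(0)
    x = rng.normal(size=(n, d_x))
    W_Q, W_K, W_V = (rng.normal(size=(d_x, d_z)) for _ in range(3))
    z = attention_head(x, W_Q, W_K, W_V)             # z has shape (n, d_z)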

3. Proposed Architecture

3.1. Relation-aware Self-Attention

We propose an extension to self-attention to consider the pairwise relationships between input elements. In this sense, we model the input as a labeled, directed, fully-connected graph.

The edge between input elements $x_i$ and $x_j$ is represented by vectors $a^V_{ij},\;a^K_{ij} \in \R^{d_a}$. The motivation for learning two distinct edge representations is that $a^{V}_{ij}$ and $a^{K}_{ij}$ are suitable for use in eq.(3) and eq.(4), respectively, without requiring additional linear transformations. These representations can be shared across attention heads. We use $d_a = d_z$.

We modify eq.(1) to propagate edge information to the sublayer output:

$z_{i}=\sum_{j=1}^{n} \alpha_{i j}\left(x_{j} W^{V}+a_{i j}^{V}\right)$

(3)

This extension is presumably important for tasks where information about the edge types selected by a given attention head is useful to downstream encoder or decoder layers. However, as explored in 4.3, this may not be necessary for machine translation.

We also, importantly, modify eq.(2) to consider edges when determining compatibility:

$e_{i j}=\dfrac{x_{i} W^{Q}\left(x_{j} W^{K}+a_{i j}^{K}\right)^{T}}{\sqrt{d_{z}}}$

(4)

The primary motivation for using simple addition to incorporate edge representations in eq. (3) and eq. (4) is to enable an efficient implementation described in 3.3.
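A minimal NumPy sketch (an editorial illustration, not the authors' implementation) of the relation-aware head defined by eq. (3) and eq. (4); the edge tensors a_K and a_V, holding one $d_a$-dimensional vector per position pair with $d_a = d_z$, are assumed inputs here.

    import numpy as np

    def relation_aware_head(x, W_Q, W_K, W_V, a_K, a_V):
        # x: (n, d_x); W_*: (d_x, d_z); a_K, a_V: (n, n, d_a) with d_a = d_z
        q, k, v = x @ W_Q, x @ W_K, x @ W_V
        d_z = q.shape[-1]
        # eq. (4): e[i, j] = q_i . (k_j + a_K[i, j]) / sqrt(d_z)
        e = np.einsum('id,ijd->ij', q, k[None, :, :] + a_K) / np.sqrt(d_z)
        alpha = np.exp(e - e.max(axis=-1, keepdims=True))
        alpha /= alpha.sum(axis=-1, keepdims=True)
        # eq. (3): z_i = sum_j alpha[i, j] * (v_j + a_V[i, j])
        return np.einsum('ij,ijd->id', alpha, v[None, :, :] + a_V)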

3.2. Relative Position Representations

For linear sequences, edges can capture information about the relative position differences between input elements. We clip the relative position to a maximum absolute value of $k$, hypothesizing that precise relative position information is not useful beyond a certain distance. Clipping the maximum distance also enables the model to generalize to sequence lengths not seen during training. Therefore, we consider $2k + 1$ unique edge labels.

$\begin{aligned} a_{i j}^{K} &=w_{\operatorname{clip}(j-i, k)}^{K} \\ a_{i j}^{V} &=w_{\operatorname{clip}(j-i, k)}^{V} \\ \operatorname{clip}(x, k) &=\max (-k, \min (k, x)) \end{aligned}$

We then learn relative position representations $w^K = \left(w^K_{-k}, \ldots, w^K_k\right)$ and $w^V = \left(w^V_{-k}, \ldots, w^V_k\right)$ where $w^K_i,\; w^V_i \in \R^{d_a}$.
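A small sketch of how the clipped lookup above could be realized, assuming learned tables w_K and w_V of shape $(2k + 1, d_a)$ with index 0 corresponding to relative position $-k$ (placeholder random values stand in for learned parameters):

    import numpy as np

    def relative_edges(n, k, w_K, w_V):
        # w_K, w_V: (2k + 1, d_a) learned relative position representations
        rel = np.arange(n)[None, :] - np.arange(n)[:, None]   # rel[i, j] = j - i
        idx = np.clip(rel, -k, k) + k                          # clip(j - i, k), shifted into [0, 2k]
        return w_K[idx], w_V[idx]                              # each of shape (n, n, d_a)

    n, k, d_a = 6, 3, 8
    rng = np.random.default_rng(1)
    w_K = rng.normal(size=(2 * k + 1, d_a))
    w_V = rng.normal(size=(2 * k + 1, d_a))
    a_K, a_V = relative_edges(n, k, w_K, w_V)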

3.3. Efficient Implementation

There are practical space complexity concerns when considering edges between input elements, as noted by Velickovic et al. (2017), who consider unlabeled graph inputs to an attention model.

For a sequence of length $n$ and $h$ attention heads, we reduce the space complexity of storing relative position representations from $O\left(hn^2 d_a\right)$ to $O\left(n^2 d_a\right)$ by sharing them across each head. Additionally, relative position representations can be shared across sequences. Therefore, the overall self-attention space complexity increases from $O\left(bhnd_z\right)$ to $O\left(bhnd_z + n^2 d_a\right)$. Given $d_a = d_z$, the size of the relative increase depends on $\frac{n}{bh}$.

The Transformer computes self-attention efficiently for all sequences, heads, and positions in a batch using parallel matrix multiplication operations (Vaswani et al., 2017). Without relative position representations, each $e_{ij}$ can be computed using $bh$ parallel multiplications of $n\times d_z$ and $d_z\times n$ matrices. Each matrix multiplication computes $e_{ij}$ for all sequence positions, for a particular head and sequence. For any sequence and head, this requires sharing the same representation for each position across all compatibility function applications (dot products) with other positions.

When we consider relative positions, the representations differ across pairs of positions. This prevents us from computing all $e_{ij}$ for all pairs of positions in a single matrix multiplication. We also want to avoid broadcasting relative position representations. However, both issues can be resolved by splitting the computation of eq.(4) into two terms:

$e_{i j}=\dfrac{x_{i} W^{Q}\left(x_{j} W^{K}\right)^{T}+x_{i} W^{Q}\left(a_{i j}^{K}\right)^{T}}{\sqrt{d_{z}}}$

(5)

The first term is identical to eq.(2), and can be computed as described above. For the second term involving relative position representations, tensor reshaping can be used to compute $n$ parallel multiplications of $bh\times d_z$ and $d_z\times n$ matrices. Each matrix multiplication computes contributions to $e_{ij}$ for all heads and batches, corresponding to a particular sequence position. Further reshaping allows adding the two terms. The same approach can be used to efficiently compute eq.(3).
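The two-term split of eq. (5) could be sketched as follows (an illustrative NumPy reconstruction, not the authors' code), assuming a query tensor q and key tensor k of shape $(b, h, n, d_z)$ and an edge tensor a_K of shape $(n, n, d_z)$ shared across heads and sequences; the reshapes mirror the description above.

    import numpy as np

    def relative_logits(q, k, a_K):
        b, h, n, d_z = q.shape
        # First term: b*h parallel (n, d_z) x (d_z, n) multiplications, as in eq. (2).
        term1 = q @ k.transpose(0, 1, 3, 2)                      # (b, h, n, n)
        # Second term: reshape so each of the n positions performs one
        # (b*h, d_z) x (d_z, n) multiplication against its row of a_K.
        q_t = q.transpose(2, 0, 1, 3).reshape(n, b * h, d_z)     # (n, b*h, d_z)
        term2 = q_t @ a_K.transpose(0, 2, 1)                     # (n, b*h, n)
        term2 = term2.reshape(n, b, h, n).transpose(1, 2, 0, 3)  # back to (b, h, n, n)
        return (term1 + term2) / np.sqrt(d_z)                    # eq. (5)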

For our machine translation experiments, the result was a modest 7% decrease in steps per second, but we were able to maintain the same model and batch sizes on P100 GPUs as Vaswani et al. (2017).

4. Experiments

5. Conclusions

References

BibTeX

@inproceedings{2018_SelfAttentionwithRelativePositi,
  author    = {Peter Shaw and
               Jakob Uszkoreit and
               Ashish Vaswani},
  editor    = {Marilyn A. Walker and
               Heng Ji and
               Amanda Stent},
  title     = {Self-Attention with Relative Position Representations},
  booktitle = {Proceedings of the 2018 Conference of the North American Chapter of
               the Association for Computational Linguistics: Human Language Technologies
               (NAACL-HLT 2018) Volume 2 (Short
               Papers)},
  pages     = {464--468},
  publisher = {Association for Computational Linguistics},
  year      = {2018},
  url       = {https://doi.org/10.18653/v1/n18-2074},
  doi       = {10.18653/v1/n18-2074},
}

