Attention Mechanism Computational Complexity Analysis Task


An Attention Mechanism Computational Complexity Analysis Task is a computational complexity analysis task that analyzes the number of operations an attention mechanism requires to process its input data, typically as a function of the sequence length and the hidden dimension.
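As an illustration of such an analysis, the following is a minimal Python sketch that counts the matrix-multiply operations of a single-head scaled dot-product attention layer. The function name, the multiply-add counting convention, and the example hidden width are assumptions made for illustration, not part of any cited method.

```python
# Minimal sketch of an attention-operation count, assuming single-head
# scaled dot-product attention and counting only matrix-multiply FLOPs.
# Names and constants are illustrative, not taken from any specific paper.

def vanilla_attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs for one self-attention layer over n tokens of width d."""
    qkv_projections = 3 * (2 * n * d * d)   # Q, K, V linear maps: O(N d^2)
    scores = 2 * n * n * d                  # Q K^T: the O(N^2 d) term
    weighted_sum = 2 * n * n * d            # softmax(scores) V: also O(N^2 d)
    output_projection = 2 * n * d * d       # final linear map: O(N d^2)
    return qkv_projections + scores + weighted_sum + output_projection

if __name__ == "__main__":
    d = 768  # assumed hidden dimension, for illustration only
    for n in (1_024, 8_192, 65_536):
        print(f"N={n:>6}: ~{vanilla_attention_flops(n, d):.3e} FLOPs")
```

Doubling the sequence length roughly quadruples the score and weighted-sum terms, which is the quadratic behavior that the scaling work cited below seeks to avoid.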



References

2023

  • (Ding, Ma et al., 2023) ⇒ Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. (2023). “LongNet: Scaling Transformers to 1,000,000,000 Tokens.” doi:10.48550/arXiv.2307.02486
    • QUOTE: ... Another strand of scaling the sequence length is to decrease the complexity of Transformers, i.e., the quadratic complexity of self-attention. Implementing sliding windows or convolution modules over the attention is a straightforward way to make the complexity nearly linear. Nevertheless, this sacrifices the ability to recall the early tokens, forgetting the prompts at the very beginning of the sequence. Sparse attention reduces the computation by sparsifying the attention matrix, preserving the possibility of recalling long-distant information. For example, [CGRS19] obtains O(N√N d) time complexity with a fixed sparse pattern. Besides the heuristic patterns [ZGD+20, BPC20], the learnable patterns prove to be useful for sparse attention [KKL20, ALdJ+23]. There are also some other efficient Transformer-based variants, including low-rank attention [WLK+20, WCL+20], kernel-based methods [KVPF20, CLD+21, QHS+22], downsampling approaches [LLK+19, JGB+21, MKW+21], recurrent models [DYY+19, BKB23], and retrieval-based methods [WRHS22, WDC+23]. Yet, none has been scaled to 1 billion tokens (see Figure 1). ...
Method                          Computation Complexity
Recurrent                       O(N d²)
Vanilla Attention               O(N² d)
Sparse Attention                O(N√N d)
Dilated Attention (This Work)   O(N d)

Table 1: Comparison of computation complexity among different methods. N is the sequence length and d is the hidden dimension.
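
To make the asymptotic comparison in Table 1 concrete, the hedged Python sketch below evaluates the leading complexity terms from the table for a few sequence lengths. The hidden dimension of 768 and the omission of constant factors are illustrative assumptions; the sketch compares growth rates only, not real runtimes.

```python
# Sketch that evaluates the leading terms from Table 1 for growing sequence
# lengths, showing how the quadratic O(N^2 d) term of vanilla attention
# dominates while sparse and dilated attention stay near-linear.
# Constant factors and lower-order terms are ignored (illustrative only).

import math

D = 768  # assumed hidden dimension; any fixed value shows the same trend

complexities = {
    "Recurrent":         lambda n: n * D**2,               # O(N d^2)
    "Vanilla Attention": lambda n: n**2 * D,               # O(N^2 d)
    "Sparse Attention":  lambda n: n * math.sqrt(n) * D,   # O(N sqrt(N) d)
    "Dilated Attention": lambda n: n * D,                  # O(N d)
}

for n in (10**4, 10**6, 10**9):
    print(f"N = {n:,}")
    for name, cost in complexities.items():
        print(f"  {name:<18} ~ {cost(n):.2e} operations")
```

At N = 10⁹ the quadratic term is roughly nine orders of magnitude larger than the linear term, which is why only the near-linear methods are candidates for billion-token sequence lengths.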