2023 LongNetScalingTransformersto100


Subject Headings: LongNet, Dilated Attention, Vanilla Transformer, Attention Mechanism Computational Complexity.

Notes

Cited By

Quotes

Abstract

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computation complexity and a logarithmic dependency between tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
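The following is a minimal illustrative sketch (not the authors' code) of the index pattern behind "the attentive field expands exponentially as the distance grows": keys are sampled with a stride that doubles for each farther-away window, so each query touches only a logarithmic number of keys while its receptive field covers an exponentially large span. The base window size `w0` and the doubling rule are illustrative assumptions.

```python
def dilated_key_positions(query_pos: int, w0: int = 4, num_levels: int = 5) -> list[int]:
    """Key positions a query may attend to under a dilated pattern.

    Level k covers a window of size w0 * 2**k behind the query and samples
    it with stride 2**k, so each level contributes only about w0 positions,
    while the total reach grows to w0 * (2**num_levels - 1).
    """
    positions = []
    offset = 0
    for k in range(num_levels):
        window, stride = w0 * 2 ** k, 2 ** k
        for p in range(query_pos - offset - window, query_pos - offset, stride):
            if p >= 0:  # drop positions before the start of the sequence
                positions.append(p)
        offset += window
    return sorted(set(positions))

# Example: a query at position 100 attends to fewer than 20 keys,
# yet its attentive field reaches back 124 positions.
print(dilated_key_positions(100))
```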

1 Introduction

Recent years have witnessed a trend toward scaling neural networks [BMR+20, KMH+20, ZKHB22, CND+22, DDM+23]. The depth is primarily scaled up for exponential expressivity, producing many powerful deep networks [HZRS16, HCB+19, WMD+22]. Then, sparse MoE models [LLX+21, FZS21, ZBK+22] and model parallelism approaches [SPP+19, KCL+22] efficiently enlarge the hidden dimension. Sequence length, as the last atomic dimension of the neural network, is desirable to be unlimited. Breaking the limitation of sequence length introduces significant advantages. First, it provides a large memory and receptive field for models, which is practical for them to interact with humans and the world. Second, a longer context contains more complex causality and reasoning paths that models can exploit in training data. In contrast, short dependencies have more spurious correlations, which is harmful to generalization. Third, it enables exploring the limits of in-context learning, which has the potential to be a paradigm shift for many-shot learning, as an extremely long context may help the models alleviate catastrophic forgetting.

The major challenge of scaling up sequence length is striking the right balance between computational complexity and model expressivity. RNN-style models are primarily implemented to increase the length. However, their sequential nature limits parallelization during training, which is essential in long-sequence modeling. More recently, state space models [GGR22, SWL23, FDS+23, PMN+23] have become appealing for sequence modeling. They can operate as a CNN during training and transform into an efficient RNN at test time. While they perform well on long-range benchmarks [TDA+21], their performance on regular lengths is not as good as that of Transformers, limited mainly by the model expressivity [FPB+23].

Another strand of scaling the sequence length is to decrease the complexity of Transformers, i.e., the quadratic complexity of self-attention. Implementing sliding windows or convolution modules over the attention is a straightforward way to make the complexity nearly linear. Nevertheless, this sacrifices the ability to recall early tokens, forgetting the prompts at the very beginning of the sequence. Sparse attention reduces the computation by sparsifying the attention matrix, preserving the possibility of recalling long-distant information. For example, [CGRS19] obtains O(N√N d) time complexity with a fixed sparse pattern. Besides the heuristic patterns [ZGD+20, BPC20], learnable patterns prove to be useful for sparse attention [KKL20, ALdJ+23]. There are also some other efficient Transformer-based variants, including low-rank attention [WLK+20, WCL+20], kernel-based methods [KVPF20, CLD+21, QHS+22], downsampling approaches [LLK+19, JGB+21, MKW+21], recurrent models [DYY+19, BKB23], and retrieval-based methods [WRHS22, WDC+23]. Yet, none has been scaled to 1 billion tokens (see Figure 1).

Method	Computation Complexity
Recurrent	O(N d²)
Vanilla Attention	O(N² d)
Sparse Attention	O(N √N d)
Dilated Attention (This Work)	O(N d)
Table 1: Comparison of computation complexity among different methods. N is the sequence length and d is the hidden dimension.
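A quick back-of-the-envelope calculation makes the gap in Table 1 concrete; the values N = 10⁹ and d = 1024 below are illustrative assumptions, not figures from the paper.

```python
# Rough operation counts for the complexity terms in Table 1,
# with illustrative values N = 1e9 tokens and d = 1024.
N, d = 1_000_000_000, 1024

costs = {
    "Recurrent         O(N d^2)":       N * d ** 2,
    "Vanilla attention O(N^2 d)":       N ** 2 * d,
    "Sparse attention  O(N sqrt(N) d)": N * N ** 0.5 * d,
    "Dilated attention O(N d)":         N * d,
}

for name, ops in costs.items():
    print(f"{name:35s} ~{ops:.2e} operations")

# Vanilla attention exceeds the recurrent term by roughly N / d ≈ 10^6
# and the linear dilated-attention term by roughly N ≈ 10^9, which is why
# only a linear method is practical at billion-token sequence lengths.
```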

In this work, we successfully scale the sequence length to 1 billion tokens. Our solution is LONGNET, which replaces the attention of vanilla Transformers with a novel component named dilated attention. The general design principle is that attention allocation decreases exponentially as the distance between tokens grows. We prove that it obtains a linear computation complexity and a logarithmic dependency between tokens. This deals with the contradiction between limited attention resources and the accessibility of every token. In the implementation, LONGNET can be transformed into a dense Transformer, which seamlessly supports the off-the-shelf optimizations for Transformers (e.g., kernel fusion, quantization, and distributed training). Taking advantage of the linear complexity, LONGNET can parallelize the training across nodes, breaking the constraint of both computation and memory with a distributed algorithm. This allows us to efficiently scale the sequence length up to 1B tokens with nearly constant runtime (see Figure 5), while the vanilla Transformer suffers from quadratic complexity.
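As a rough illustration of why dilated attention is a drop-in replacement, the PyTorch-style sketch below (an assumption-laden simplification, not the released LONGNET code) shows one dilated-attention branch: the sequence is split into segments of length w, every r-th position inside each segment is kept, ordinary dense softmax attention runs on the shortened segments, and the outputs are scattered back. The full method mixes several branches with exponentially growing (w, r) pairs and, in the paper, also uses per-head offsets and weighted combination, which are omitted here.

```python
import torch
import torch.nn.functional as F

def dilated_attention_branch(q, k, v, w: int, r: int):
    """One simplified dilated-attention branch.

    q, k, v: (batch, seq_len, dim); assumes seq_len % w == 0 and w % r == 0.
    """
    b, n, dim = q.shape

    def sparsify(x):
        # (batch, num_segments, w, dim) -> keep every r-th token per segment
        x = x.view(b, n // w, w, dim)
        return x[:, :, ::r, :]                      # (b, n//w, w//r, dim)

    qs, ks, vs = map(sparsify, (q, k, v))

    # Ordinary dense softmax attention inside each shortened segment.
    scores = torch.einsum("bsid,bsjd->bsij", qs, ks) / dim ** 0.5
    out_s = torch.einsum("bsij,bsjd->bsid", F.softmax(scores, dim=-1), vs)

    # Scatter results back to their original positions; untouched rows stay zero.
    out = torch.zeros_like(q)
    idx = torch.arange(0, w, r, device=q.device)
    out.view(b, n // w, w, dim)[:, :, idx, :] = out_s
    return out

# Usage sketch: combine branches with growing segment length and dilation, e.g.
# y = sum(dilated_attention_branch(q, k, v, w, r) for w, r in [(64, 1), (256, 4), (1024, 16)])
```

Because each branch is just standard attention applied to shortened segments, existing Transformer optimizations (fused attention kernels, quantization, distributed training) apply unchanged, which is the point the paragraph above makes.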

References

Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei (2023). "LongNet: Scaling Transformers to 1,000,000,000 Tokens." doi:10.48550/arXiv.2307.02486