Xiong Yunyang, Zeng Zhanpeng, Chakraborty Rudrasis, Tan Mingxing, Fung Glenn, Li Yin, Singh Vikas
University of Wisconsin-Madison.
UC Berkeley.
Proc AAAI Conf Artif Intell. 2021;35(16):14138-14148. Epub 2021 May 18.
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention with respect to input sequence length has limited its application to longer sequences - a topic being actively studied in the community. To address this limitation, we propose Nyströmformer - a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably to, or in a few cases even slightly better than, standard self-attention. On longer sequence tasks in the Long Range Arena (LRA) benchmark, Nyströmformer performs favorably relative to other efficient self-attention methods. Our code is available at https://github.com/mlpen/Nystromformer.
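For illustration, the sketch below shows how a Nyström-style approximation of softmax self-attention can reach linear complexity in the sequence length by routing the attention computation through a small set of landmark queries and keys. This is a minimal NumPy sketch, not the paper's implementation: the landmark construction (plain segment means), the exact Moore-Penrose pseudoinverse, and all function names here are assumptions made for brevity; the released code at the repository above is the authoritative version and differs in details (e.g., it approximates the pseudoinverse iteratively and adds a skip connection).

# Minimal NumPy sketch of Nystrom-approximated softmax self-attention.
# Assumptions: segment-mean landmarks, exact pseudoinverse, n divisible by m.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, num_landmarks=64):
    """Approximate softmax(Q K^T / sqrt(d)) V using m landmarks, O(n*m) cost."""
    n, d = Q.shape
    m = num_landmarks
    scale = 1.0 / np.sqrt(d)

    # Landmarks: segment means of the query/key rows (a simplifying assumption).
    Q_tilde = Q.reshape(m, n // m, d).mean(axis=1)   # (m, d)
    K_tilde = K.reshape(m, n // m, d).mean(axis=1)   # (m, d)

    # The three small softmax kernels of the Nystrom factorization.
    F = softmax(Q @ K_tilde.T * scale)               # (n, m)
    A = softmax(Q_tilde @ K_tilde.T * scale)         # (m, m)
    B = softmax(Q_tilde @ K.T * scale)               # (m, n)

    # softmax(Q K^T / sqrt(d)) V  ~=  F @ pinv(A) @ (B @ V); never forms the n x n matrix.
    return F @ (np.linalg.pinv(A) @ (B @ V))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 512, 64
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    exact = softmax(Q @ K.T / np.sqrt(d)) @ V
    approx = nystrom_attention(Q, K, V, num_landmarks=64)
    print("mean abs error:", np.abs(exact - approx).mean())

Because only the n x m and m x n kernels touch the full sequence, the cost grows linearly in n for a fixed number of landmarks, which is what allows scaling to sequences with thousands of tokens.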